game-playtest

AlterLab GameForge -- Structured Playtest Analysis

Playtesting is not asking players if they had fun. It is the disciplined observation of player behavior to identify where the design succeeds and where it fails. The player's mouth lies -- their hands do not. Nintendo has known this for decades: Miyamoto famously watches players silently, trusting their confusion over their compliments. Larian ran thousands of community playtests during BG3's Early Access, and every major system change traced back to behavioral data, not forum polls. This workflow provides a rigorous behavioral observation framework that transforms raw playtest sessions into actionable design insights.

Purpose & Triggers

Invoke this workflow when:

A build is ready for external eyes and you need structured feedback, not just reactions
Specific design questions need answering: "Do players understand the crafting system?" not "Is the game good?"
Onboarding flow needs validation -- can new players learn the core mechanic without a tutorial?
Difficulty curve assessment -- are players in the flow channel or oscillating between boredom and frustration?
A new feature has been integrated and its impact on the overall experience is unknown
Pre-release polish pass needs data on which rough edges matter most to players
Competitive analysis requires side-by-side comparison with a reference game

Do NOT use this workflow when:

You need to test a raw mechanic in isolation (use `game-prototype` instead)
The build is so broken that testers will spend most of their time hitting bugs (fix critical bugs first, then playtest)
You want marketing quotes or positive testimonials (that is PR, not playtesting)

Critical Rules

Define questions before inviting testers. Every playtest answers specific questions. "Is it fun?" is not a question -- it is a prayer. "Can players complete the first dungeon without dying more than twice?" is a question. Celeste's playtests asked "can players learn the dash mechanic within the first three screens?" -- specific, observable, actionable.
The facilitator does not play. You observe. You take notes. You do not help, explain, suggest, or react. Your poker face is a scientific instrument.
Minimum 5 testers per session. Fewer than 5 and you are collecting anecdotes, not data. Individual player quirks dominate small samples. At 5+ testers, patterns emerge.
Never test with the development team. They know too much. Their muscle memory, mental models, and context make them incapable of experiencing the game as a new player. Nintendo's internal playtesting teams are deliberately kept away from development discussions so they approach each session cold. Your developers are blind to every onboarding problem they have already internalized.
Behavioral data outranks verbal data. If a player says "the controls feel fine" but you observed them pressing the wrong button 11 times in a 10-minute session, the behavioral data wins. Always. Larian reportedly tracked BG3 playtester behavior at the input level -- down to which dialogue options players hovered over before choosing -- and that hesitation data is said to have informed their rewrite of Act 1.
Separate observation from interpretation. During the session, record what happened. After the session, interpret what it means. Mixing the two in real-time creates confirmation bias.
Reference docs/game-design-theory.md for Flow Theory and MDA Framework when analyzing player engagement and emotional responses.

Workflow

Step 1: Pre-Playtest Preparation

Define the test objectives. Write 3-5 specific questions this playtest will answer. Each question should be:

Observable (you can determine the answer by watching, not just asking)
Actionable (the answer directly informs a design decision)
Scoped (answerable within a single play session)

Good test questions:

"Do players discover the dodge-roll mechanic organically within the first two encounters?"
"At what point in the progression curve do players stop voluntarily exploring and start rushing to objectives?"
"Does the resource scarcity in Act 2 create tension or frustration?"

Prepare the observation sheet. For each test question, define:

What specific player behaviors indicate success (positive signals)
What specific player behaviors indicate failure (negative signals)
Where in the game session to watch most closely (critical observation windows)

Create the per-player tracking form:

Player ID: ___
Session Date: ___
Session Duration: ___
Test Build Version: ___

Timestamped Observations:
[MM:SS] [Observation] [Category: Action/Hesitation/Confusion/Emotion/Verbal]

Post-Session Survey Responses:
Q1: ___
Q2: ___
Q3: ___
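
Keeping the timestamped observation lines in a fixed shape makes it easier to merge notes from several observers later. The sketch below assumes one observation per line, written as `[MM:SS] note [Category]`; the `Observation` class and `parse_observation` function are illustrative names, not part of GameForge.

```python
import re
from dataclasses import dataclass

# Categories taken from the tracking form above.
CATEGORIES = {"Action", "Hesitation", "Confusion", "Emotion", "Verbal"}

# Assumed concrete line shape: "[MM:SS] free-text note [Category]"
LINE_RE = re.compile(r"^\[(\d{1,3}):([0-5]\d)\]\s+(.*?)\s+\[(\w+)\]$")


@dataclass(frozen=True)
class Observation:
    seconds: int   # offset from session start
    note: str      # what happened, without interpretation
    category: str  # one of CATEGORIES


def parse_observation(line: str) -> Observation:
    """Parse one tracking-form line into a structured record."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not in '[MM:SS] note [Category]' form: {line!r}")
    minutes, secs, note, category = match.groups()
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    return Observation(int(minutes) * 60 + int(secs), note, category)
```

Storing the offset in seconds rather than as text makes it straightforward to line notes up against a recording that starts at the same moment.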

Set up recording infrastructure:

Screen capture with audio (mandatory -- you will miss things in real-time that the recording catches)
Face camera if available (facial micro-expressions reveal engagement, confusion, and frustration that players will never verbalize)
Input logging if your engine supports it (heatmaps of where players click, where they die, where they spend time)
Ensure recordings are timestamped and synchronized so you can cross-reference player expression with game events

Prepare the test environment:

Use a consistent hardware setup across all testers (different frame rates and input devices contaminate results)
Remove development overlays, debug menus, and console access
Disable any developer shortcuts or god-mode toggles
Have a clean save state ready so every tester starts from the same point
Test the recording setup with a dry run before the first tester arrives

Brief your facilitators (if you have helpers):

Their only job is to observe and record. Not to help. Not to explain. Not to react.
If a tester asks "What do I do?" the correct response is: "What do you think you should do?"
If a tester is completely stuck for more than 90 seconds on a non-critical path, the facilitator may offer a single neutral hint ("Have you tried interacting with the glowing object?"). Log every such hint as a critical finding.
Facilitators should not sit directly next to the player. Peripheral awareness of being watched changes behavior. Sit behind and to the side.
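
The 90-second rule can also be checked after the fact when reviewing recordings. This is a minimal sketch, assuming you have noted the session offsets (in seconds) at which the player made visible progress; what counts as "progress" and the name `stuck_windows` are assumptions for illustration.

```python
def stuck_windows(progress_times, session_end, threshold=90):
    """Return (start, end) gaps between progress events longer than threshold seconds.

    progress_times: offsets in seconds at which the player visibly progressed.
    session_end: offset in seconds at which the session stopped.
    """
    windows = []
    previous = 0  # the session start counts as the first reference point
    for current in sorted(progress_times) + [session_end]:
        if current - previous > threshold:
            windows.append((previous, current))
        previous = current
    return windows
```

Each returned window is a stretch of recording worth rewatching, whether or not a hint was given at the time.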

Step 2: During the Playtest -- Silent Observation Protocol

This is where discipline matters most. You are a scientist. Your personal feelings about the game are irrelevant during this phase.

Real-time observation categories:

Actions -- What is the player doing?

Record moment-to-moment decisions. Not just "player fought the boss" but "player circled the boss for 15 seconds before attacking." Save the interpretation (looking for a weak point? building courage?) for the analysis phase.
Track navigation patterns. Do players go where you intended? Where do they go instead? Unintended exploration paths reveal what the environment is actually communicating versus what you think it communicates.
Note input patterns. Button mashing (panic or boredom), deliberate presses (strategic engagement), repeated failed inputs (control confusion).

Hesitations -- Where does the player pause?

A pause before a door means the player is anticipating what is behind it (good -- you created tension).
A pause at a menu means the player does not understand the options (bad -- your UI is unclear).
A pause in combat means the player is either strategizing (good) or overwhelmed (bad). Their facial expression and subsequent action disambiguate.

Confusions -- Where does the player misunderstand?

Track "expectation mismatches" -- moments where the player clearly expected one outcome and got another. These are the highest-value findings in any playtest.
Note instances where the player uses a mechanic incorrectly but thinks they are using it correctly. This reveals that your feedback systems are not communicating state clearly.
Watch for players reading the same tooltip or sign multiple times -- it means the information was unclear or they do not trust their own understanding.

Emotions -- What is the player feeling?

Delight indicators: leaning forward, widening eyes, spontaneous laughter, "cool" or "whoa" vocalizations, showing the screen to someone nearby
Frustration indicators: sighing, leaning back, crossing arms, clicking more aggressively, muttering, eye-rolling
Engagement indicators: losing track of time, ignoring phone notifications, asking "can I keep playing?" at the end
Disengagement indicators: checking phone, looking around the room, playing with reduced attention, asking "how much longer?"
Flow state indicators: quiet focus, rhythmic input patterns, surprise when told time is up, difficulty recalling specific moments (they were "in it"). Hades playtests reportedly showed players losing 30+ minutes without checking the clock -- the gold standard for flow state confirmation

Map emotional responses to specific game moments. This creates an emotional heatmap of the play session -- where are the peaks and valleys? Compare this to your intended emotional arc from the design document.
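
One rough way to build that emotional heatmap is to bucket logged emotion events by minute and sum them. The sketch assumes each event has already been reduced to a session offset in seconds and a valence of +1 (delight) or -1 (frustration); that encoding and the function name `emotional_timeline` are assumptions, not a prescribed format.

```python
from collections import Counter


def emotional_timeline(events, bucket_seconds=60):
    """Sum valence per time bucket.

    events: iterable of (seconds, valence) pairs, valence being +1 or -1.
    Returns {bucket_index: net_valence}, ordered by bucket.
    """
    buckets = Counter()
    for seconds, valence in events:
        buckets[seconds // bucket_seconds] += valence
    return dict(sorted(buckets.items()))
```

Positive buckets are candidate peaks and negative buckets candidate valleys; buckets that never appear had no logged emotion at all, which can itself be worth a look.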

Verbal observations (think-aloud protocol, if used):

Record the player's real-time narration without filtering or correcting.
Flag moments where what the player says contradicts what they are doing -- these are gold. "This is easy" followed by dying three times reveals a gap between perceived and actual skill.

Step 3: Post-Session Debrief

Keep it short. 5-7 minutes maximum. The player's attention is most valuable while the experience is fresh, but fatigue sets in quickly after a play session.

Core debrief questions (ask in this order):

"What was the game about?" -- tests whether the core fantasy and theme communicated clearly
"What were you trying to do most of the time?" -- reveals whether the player understood the primary objective and core loop
"Was there a moment that stood out as particularly good?" -- identifies delight peaks from the player's perspective (cross-reference with your observations)
"Was there a moment that felt confusing or frustrating?" -- identifies friction from the player's perspective
"If you could change one thing, what would it be?" -- reveals the player's top-of-mind pain point

Optional deep-dive questions (only if time permits and the answer informs a test question):

"Did you feel like you understood what your options were at any given time?" -- tests decision clarity
"Did the difficulty feel about right, too easy, or too hard?" -- subjective difficulty assessment (triangulate with behavioral data)
"Was there anything you wanted to do that the game didn't let you?" -- reveals affordance gaps

Do NOT ask:

"Did you like it?" -- useless. Social pressure ensures a positive answer.
"Would you buy it?" -- irrelevant at this stage and puts the player in an evaluative mindset that suppresses honest feedback.
Leading questions: "Did you notice how the lighting changed in the cave?" -- you are feeding them the observation you want.

Step 4: Post-Playtest Analysis

Wait at least 2 hours after the last session before analyzing. Immediate analysis is contaminated by recency bias -- the last tester's experience dominates your thinking.

Cross-player pattern identification:

Compile observations into a matrix: rows are game moments/features, columns are players
Highlight moments where 3+ players exhibited the same behavior -- these are systemic findings, not individual quirks
Identify divergence points: moments where players split into distinct behavior groups (this reveals a design fork that may need to be resolved or embraced)
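
The matrix and the 3+ player rule can be sketched in a few lines. The example assumes observations have been normalized to `(player_id, moment, behavior)` tuples with consistent wording for the same behavior; the function names are illustrative.

```python
from collections import defaultdict


def build_matrix(observations):
    """Rows are game moments, columns are players.

    observations: iterable of (player_id, moment, behavior) tuples.
    """
    matrix = defaultdict(dict)
    for player, moment, behavior in observations:
        matrix[moment][player] = behavior
    return dict(matrix)


def systemic_findings(matrix, min_players=3):
    """Return {moment: (behavior, players)} where min_players or more agree."""
    findings = {}
    for moment, by_player in matrix.items():
        groups = defaultdict(list)
        for player, behavior in by_player.items():
            groups[behavior].append(player)
        for behavior, players in groups.items():
            if len(players) >= min_players:
                findings[moment] = (behavior, sorted(players))
    return findings
```

Moments whose players split into two or more sizable groups without any group reaching the threshold are the divergence points worth a separate look.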

Finding classification: Categorize every finding by severity:

Severity	Definition	Action Required
Critical	Breaks the core experience. Player cannot progress, or the intended emotion is inverted (frustration instead of triumph).	Must fix before next playtest.
Major	Degrades the experience significantly. Player can proceed but the quality of the experience is noticeably diminished.	Should fix in current milestone.
Minor	Could be better. Player notices but is not significantly impacted.	Fix when convenient, or batch into a polish pass.
Observation	Interesting behavioral note that does not indicate a problem but may inform future design decisions.	Log for reference. No action required.

Recommendation generation: For each Critical and Major finding, generate a specific, actionable recommendation:

What to change (be concrete -- "reduce enemy count in room 3 from 5 to 3" not "make it easier")
Why it will help (connect the recommendation to the observed behavior)
Expected impact (what should change in the next playtest if this fix works)
Potential side effects (will this fix create new problems elsewhere?)

Longitudinal comparison: If prior playtest data exists, compare results across iterations:

Which findings from the previous playtest were addressed, and did the fixes work?
Which problems persisted despite attempted fixes (these may be structural, not surface-level)?
Is the overall trajectory improving? Are you fixing more than you are breaking?
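
A simple way to track the trajectory is to compare how many testers each finding affected in consecutive playtests. The sketch below assumes both playtests used the same number of testers and that findings keep stable IDs between reports; the status labels match the comparison table in the report format, while the function names are hypothetical.

```python
def finding_status(previous_affected: int, current_affected: int) -> str:
    """Label a prior finding by the change in affected-tester count."""
    if current_affected == 0:
        return "Fixed"
    if current_affected < previous_affected:
        return "Improved"
    if current_affected == previous_affected:
        return "Unchanged"
    return "Regressed"


def compare_playtests(previous: dict, current: dict) -> dict:
    """previous/current map finding ID -> number of testers affected."""
    return {
        finding_id: finding_status(count, current.get(finding_id, 0))
        for finding_id, count in previous.items()
    }
```

A finding that stays "Unchanged" after an attempted fix is a candidate for the structural category described above.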

Step 5: Report Generation and Distribution

Compile the analysis into the standardized Playtest Report format (see Output Format below). Distribute to the full team with a 2-sentence executive summary at the top -- the lead designer and producer need the headline without reading 10 pages.

Output Format

Playtest Report: [Build Name / Version]

Date: [YYYY-MM-DD]

Facilitator: [Name]

Testers: [Count] ([demographic notes if relevant])

Session Duration: [Average across testers]

Executive Summary

[2 sentences: What was the most important finding? What is the recommended priority action?]

Test Objectives

[Question 1] -- [Answered / Partially Answered / Unanswered]
[Question 2] -- [Answered / Partially Answered / Unanswered]
[Question 3] -- [Answered / Partially Answered / Unanswered]

Findings Matrix

ID	Finding	Severity	Players Affected	Game Moment	Recommendation
F1	[desc]	Critical	4/5	[moment]	[action]
F2	[desc]	Major	3/5	[moment]	[action]
F3	[desc]	Minor	2/5	[moment]	[action]

Emotional Response Map

[Timeline showing emotional peaks and valleys across the session, with game moments annotated]

Opening -> [emotion] -> [event] -> [emotion] -> [event] -> [emotion] -> Close

Onboarding Assessment

Core mechanic understood without explanation: [X/5 players]
Time to first successful use of primary mechanic: [average time]
Tutorial/hint engagement: [how many players read vs. skipped]
First death/failure cause: [most common reason]

Flow Analysis (per docs/game-design-theory.md)

Estimated flow channel adherence: [percentage of session time in flow]
Anxiety spikes (challenge > skill): [moments]
Boredom dips (skill > challenge): [moments]
Flow entry points: [moments where players appeared to enter flow state]

Player Behavior Patterns

Navigation: [Where did players go? Where did they NOT go? Where did they get lost?]
Combat/Core Loop: [How did players engage with the primary mechanic?]
Exploration: [What did players investigate voluntarily?]
Resource Management: [How did players handle scarcity/abundance?]

Comparison to Previous Playtest

Previous Finding	Status	Notes
[Finding from last time]	Fixed / Improved / Unchanged / Regressed	[details]

Prioritized Action Items

[Highest priority action] -- addresses findings [F1, F3]
[Second priority action] -- addresses finding [F2]
[Third priority action] -- addresses finding [F4]

Raw Observation Notes

[Attached or linked per-player observation sheets]

Recording Index

Player	Recording File	Key Timestamps
P1	[filename]	[MM:SS notable moments]
P2	[filename]	[MM:SS notable moments]

Quality Criteria

Question specificity: Every test objective is observable, actionable, and scoped. No vague "is it fun?" questions survived the planning phase.
Observation rigor: At least 80% of findings are grounded in behavioral data (what players DID), not verbal data (what players SAID). Verbal data is supporting evidence, not primary evidence.
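Pattern validity: Findings classified as systemic (Major or Critical) are supported by observations from at least 3 out of 5 testers. Single-player observations are classified as Minor or Observation, with one exception: a session-ending failure (crash, softlock, cannot proceed) is Critical even if only one tester hit it.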
Recommendation concreteness: Every Critical and Major finding has a specific, implementable recommendation -- not "make it better" but "reduce the number of enemies in room 3 from 5 to 3 and add a health pickup before the encounter."
Longitudinal tracking: The report includes comparison to at least one prior playtest (if one exists), tracking whether previous findings were addressed and whether fixes were effective.
Emotional mapping: The report includes an emotional response map showing where delight and frustration occurred in the session timeline, cross-referenced with specific game events.
Facilitator neutrality: The report documents any instances where the facilitator intervened (explained, helped, hinted) and flags those moments as potentially contaminated data.
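
The observation-rigor criterion is easy to check mechanically before a report goes out. This sketch assumes each finding is a dict carrying an `evidence` key set to either `"behavioral"` or `"verbal"`; that record shape and the function names are assumptions for illustration.

```python
def behavioral_share(findings):
    """Fraction of findings whose primary evidence is behavioral.

    findings: list of dicts with an 'evidence' key of 'behavioral' or 'verbal'.
    """
    if not findings:
        return 0.0
    behavioral = sum(1 for f in findings if f["evidence"] == "behavioral")
    return behavioral / len(findings)


def meets_observation_rigor(findings, threshold=0.8):
    """True when at least `threshold` of the findings are behaviorally grounded."""
    return behavioral_share(findings) >= threshold
```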

AI Playtesting Agents

Human playtesting remains the gold standard for evaluating player experience, but AI playtesting agents can supplement human sessions by providing coverage, regression testing, and overnight stress testing that would be impractical to do with human testers.

nunu.ai Pattern: Define test goals in plain English (e.g., "complete the tutorial without dying," "find and defeat the boss in level 3," "attempt to sequence-break past the locked door"). AI bots execute overnight, producing session recordings and behavioral logs. Use this for:

Regression testing after balance changes (did this patch break the tutorial completion rate?)
Coverage testing (can any path through the level design lead to a softlock?)
Stress testing (what happens when 100 agents play simultaneously in a multiplayer environment?)

modl.ai Pattern: Autonomous test bots that explore your game without specific goals, mapping reachable states and identifying areas where the bot gets stuck. Use this for:

Pathfinding validation (are there navigation mesh holes?)
State machine integrity (can the bot reach an unrecoverable state?)
Content coverage (what percentage of the level geometry is actually reachable?)
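
To make the content-coverage idea concrete, here is a generic flood-fill sketch over a tile grid. It is not modl.ai's API or method; it only illustrates what "percentage of reachable geometry" means. The grid encoding (`.` walkable, `#` blocked), four-way movement, and the function name are all assumptions.

```python
from collections import deque


def reachable_fraction(grid, start):
    """Fraction of walkable tiles reachable from start via 4-way movement.

    grid: list of equal-length strings, '.' walkable and '#' blocked.
    start: (row, col) of a walkable tile.
    """
    rows, cols = len(grid), len(grid[0])
    walkable = sum(row.count(".") for row in grid)
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == "." and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return len(seen) / walkable
```

A result below 1.0 means some walkable tiles are cut off from the start point, which is either an intentional secret or a level-design bug to investigate.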

What AI Testing Cannot Replace:

Emotional response measurement (delight, frustration, engagement)
Aesthetic evaluation (does this FEEL good?)
Social dynamics in multiplayer (AI cannot replicate human social behavior)
First-impression testing (AI has no expectations to violate)

Use AI testing for coverage and regression. Use human testing for experience and emotion. Never substitute one for the other.

Structured Session Planning Template

Before any playtest session, complete this planning template to ensure focus and reproducibility:

SESSION PLAN
Date: [YYYY-MM-DD]
Build Version: [version string]
Session Type: [First Impression / Targeted Feature / Regression / Full Playthrough]
Duration: [planned session length per tester]
Tester Count: [number of testers scheduled]

Test Questions (max 5):
1. [Specific, observable, actionable question]
2. [Specific, observable, actionable question]
3. [Specific, observable, actionable question]

Focus Areas:
- [Game section or feature under scrutiny]
- [Specific interaction or flow to observe]

Recording Setup:
- Screen capture: [tool and settings]
- Face camera: [available / not available]
- Input logging: [enabled / not available]

Facilitator Notes:
- [Any special instructions for this session]
- [Known issues to ignore during testing]
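
A filled-in plan can be sanity-checked against the rules stated earlier in this workflow (at least 5 testers, 3-5 specific questions). The sketch below is one possible encoding; the `SessionPlan` class, its field names, and `validate_plan` are hypothetical.

```python
from dataclasses import dataclass, field

# Session types listed in the template above.
SESSION_TYPES = {"First Impression", "Targeted Feature", "Regression", "Full Playthrough"}


@dataclass
class SessionPlan:
    build_version: str
    session_type: str
    tester_count: int
    questions: list[str] = field(default_factory=list)


def validate_plan(plan: SessionPlan) -> list[str]:
    """Return a list of problems; an empty list means the plan passes."""
    problems = []
    if plan.session_type not in SESSION_TYPES:
        problems.append(f"unknown session type: {plan.session_type}")
    if plan.tester_count < 5:
        problems.append("fewer than 5 testers scheduled")
    if not 3 <= len(plan.questions) <= 5:
        problems.append("expected 3-5 test questions")
    if any(q.strip().lower().rstrip("?") == "is it fun" for q in plan.questions):
        problems.append("'Is it fun?' is not a test question")
    return problems
```

The check only catches structural problems; whether each question is observable, actionable, and scoped still needs a human read.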

Four-Question Playtest Focus

When time is limited or you need a rapid signal from testers, reduce the debrief to these four questions. They are ordered to surface the highest-value insights with minimal tester fatigue:

"What confused you?" -- Identifies onboarding failures, unclear mechanics, and communication gaps. Confusion is the most actionable finding because it points directly to specific moments that need redesign.
"When were you bored?" -- Identifies pacing dead zones, reward gaps, and content that fails to engage. Boredom is harder to detect through observation alone because bored players often continue playing out of politeness.
"When did you want to stop?" -- Identifies frustration peaks, fatigue walls, and the natural session length for your game. The difference between "I wanted to stop at the boss" and "I wanted to stop during the inventory management" reveals which systems are friction sources.
"What would you show a friend?" -- Identifies delight peaks and the game's natural marketing hook. Whatever the tester would show a friend is the moment your trailer should lead with and your store page should emphasize.

These four questions replace the longer debrief when session time is constrained. They are not a substitute for full behavioral observation during the session itself.

Remote Playtesting Setup

Most indie developers cannot run in-person sessions consistently. Remote playtesting is the realistic default and has specific requirements:

Recording infrastructure:

Ask testers to install OBS or use the built-in recording in their OS (Xbox Game Bar on Windows, QuickTime on macOS). Provide written setup instructions in advance -- do not spend the session troubleshooting recording software.
Request front-facing webcam footage if the tester consents. Most do. Even a low-quality webcam captures the emotional responses that drive the highest-value findings.
Use Discord, Zoom, or Google Meet for audio. Keep your own mic muted during observation. Testers who can hear you breathing or reacting change their behavior.
Use a private itch.io link, Steam early access key, or direct download for build distribution. Never email executables -- they get flagged by security software and contaminate first impressions.

Session control:

Block 90 minutes: 10 minutes setup, 60 minutes play, 20 minutes debrief. When play runs long, the debrief gets squeezed and its quality drops, so hold to the schedule.
Use a shared document or form to capture the debrief so you are not transcribing in real-time. Google Forms works well for standardized questions; a shared Google Doc works well for open-ended responses.
For asynchronous remote testing, provide a structured observation prompt: "After you finish, write down: one moment that surprised you, one moment you felt confused, and one moment where you felt especially engaged. Do not overthink it -- raw reactions are more valuable than considered ones."

What remote testing cannot capture:

Facial micro-expressions when the tester declines webcam recording (some do)
Physical environment context (distractions, hardware quality, sound environment)
True cold-start behavior if testers have seen screenshots or trailers

Remote testing works well for mid-development playtests where the goal is identifying specific friction points. For first-impression testing of core onboarding, in-person sessions with a facilitator present are preferable when feasible.

Finding Evidence Thresholds

Avoid the trap of calling a finding "Critical" because one tester had a strong reaction. Apply these evidence thresholds before classifying:

Severity	Minimum Evidence Standard
Critical	4 or more of 5 testers exhibited the behavior, OR 1 tester experienced a complete session-ending failure (crash, softlock, cannot proceed)
Major	3 or more of 5 testers showed the behavior, with behavioral confirmation (not just verbal report)
Minor	2 of 5 testers noted it verbally, OR 1 tester showed behavioral evidence
Observation	Any smaller signal worth logging for future reference

When evidence is ambiguous (e.g., 2 testers showed a behavior and 3 did not), note the split explicitly in the report. A 2/5 signal is not a finding to act on immediately -- it is a finding to watch in the next playtest. If it appears again in the next session, it escalates to Major.
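
The thresholds in the table can be written as a small decision function. This sketch assumes a session of exactly five testers, as the table does, and that each tester is counted once as either showing the behavior or only reporting it verbally; the function and parameter names are illustrative.

```python
def classify_finding(behavioral: int, verbal_only: int = 0, session_ending: bool = False) -> str:
    """Apply the evidence thresholds for a five-tester session.

    behavioral: testers who exhibited the behavior.
    verbal_only: testers who only mentioned it, with no behavioral confirmation.
    session_ending: True if any tester hit a crash, softlock, or hard block.
    """
    if session_ending or behavioral >= 4:
        return "Critical"
    if behavioral >= 3:
        return "Major"
    if behavioral >= 1 or verbal_only >= 2:
        return "Minor"
    return "Observation"
```

Note that three verbal-only reports still land on Minor here, because the Major row requires behavioral confirmation.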

Example Use Cases

"We just finished our first playable build. Help me plan a structured playtest session -- what questions should I be asking and how do I set up observation?"
"I ran a playtest last week and took notes on 6 players. Here are my raw observations -- help me analyze the data and generate a prioritized findings report."
"Players keep dying in the same spot in level 3. I think it's a difficulty spike but I'm not sure. Help me design a targeted playtest to diagnose the problem."
"We changed our control scheme based on last month's playtest feedback. Help me design a follow-up test to see if the changes actually fixed the issues we identified."
"Our game has a 20-minute onboarding sequence and I suspect we're losing players before they reach the core loop. Help me set up a first-time user experience playtest with specific metrics to track."
"I'm a solo developer and can't run in-person sessions. Help me set up a remote playtest for my action RPG using Discord and OBS."
"I have 6 playtest reports from three separate sessions over the past 2 months. Help me identify which findings have been persistent across sessions vs. which ones were one-time observations."
