dotnet-trace-collect

.NET Trace Collect

This skill helps developers diagnose production performance issues by recommending the right diagnostic tools for their environment, guiding data collection, and suggesting analysis approaches. It does not analyze code for anti-patterns or perform the analysis itself.

When to Use

  • A developer needs to investigate a production performance issue (high CPU, memory leak, slow requests, excessive GC, networking errors, etc.)
  • Choosing the right diagnostic tool for a specific runtime, OS, or deployment topology
  • Setting up and running diagnostic tool commands for data collection
  • Understanding trade-offs between available tools (e.g. PerfView vs dotnet-trace)
  • Collecting diagnostics from containerized or Kubernetes workloads

When Not to Use

  • Reviewing source code for performance anti-patterns (use a code review skill instead)
  • Benchmarking during development (e.g. BenchmarkDotNet setup)
  • Analyzing collected trace or dump files (this skill recommends tools for analysis, but does not perform it)

Inputs

| Input | Required | Description |
| --- | --- | --- |
| Symptom | Yes | What the developer is observing (high CPU, memory growth, slow requests, hangs, excessive GC, HTTP 5xx errors, networking timeouts, connection failures, assembly loading failures, etc.) |
| Runtime | Yes | .NET Framework or modern .NET (and version, especially whether .NET 10+) |
| OS | Yes | Windows or Linux |
| Deployment | Yes | Non-container, container, or Kubernetes |
| Admin privileges | Recommended | Whether the developer has admin/root access on the target machine |
| Repro characteristics | Recommended | Whether the issue is easy to reproduce or requires a long time to manifest |

Workflow

Step 1: Understand the environment

Determine or ask the developer to clarify:
  1. Symptom: What they are observing (high CPU, memory leak, slow requests, hangs, excessive GC, HTTP 5xx errors, networking timeouts, connection failures, assembly loading failures, etc.)
  2. Runtime: .NET Framework or modern .NET? If modern .NET, which version? (Especially whether .NET 10 or later.)
  3. OS: Windows or Linux?
  4. Deployment: Running directly on the host, in a container, or in Kubernetes?
  5. Admin privileges: Do they have admin/root access on the target machine or container?
  6. Repro characteristics: Does the issue reproduce quickly, or does it take a long time to manifest?
  7. Workload context: Determine or ask the user if you are running in the context of the workload (i.e., on the same machine or connected to the same environment where the issue is occurring). If so, you can run diagnostic commands directly on their behalf. If not, provide the commands as guidance for the user to run themselves.
Use this information to select the right tool in Step 2.

Step 2: Recommend diagnostic tools

Select tools based on the environment using the priority rules below. Once a tool is selected, load the corresponding reference file for detailed command-line usage.

Tool reference lookup

| Environment | Reference file(s) |
| --- | --- |
| Windows + modern .NET + admin | `references/perfview.md` |
| Windows + modern .NET, no admin | `references/dotnet-trace-collect.md` |
| Windows + .NET Framework | `references/perfview.md` |
| Linux + .NET 10+ + root | `references/dotnet-trace-collect-linux.md` |
| Linux + pre-.NET 10 | `references/dotnet-trace-collect.md` |
| Linux + native stacks needed | `references/perfcollect.md` |
| Container/K8s (console access) | `references/dotnet-trace-collect.md` (or `dotnet-trace-collect-linux.md`) |
| Container/K8s (no console) | `references/dotnet-monitor.md` |

Quick decision matrix (first-pass triage)

| Environment | Preferred tool | Fallback / Notes |
| --- | --- | --- |
| Windows + modern .NET + admin | PerfView | If admin is unavailable, use `dotnet-trace` |
| Windows + .NET Framework + admin | PerfView | Without admin, there is no trace fallback; for hangs/memory leaks, provide dump commands directly (`procdump -ma` or Task Manager) since `dump-collect` does not support .NET Framework |
| Linux + .NET 10+ + root | `dotnet-trace collect-linux` | Use `dotnet-trace` if root or kernel prerequisites are not met |
| Linux + pre-.NET 10 | `dotnet-trace` | Add `perfcollect` when native stacks are needed (requires root) |
| Linux container/Kubernetes | Console tools if in workload context; `dotnet-monitor` if no console access | See Linux Container / Kubernetes section for details |
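
The first-pass triage in the matrix can be sketched as a small lookup function. This is only an illustration: the `Environment` fields and the `recommend_tool` name are assumptions made for this example, not part of any tool, and the function covers only the rows listed in the matrix (Windows containers are not modeled).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """Hypothetical container for the Step 1 answers."""
    os: str                      # "windows" or "linux"
    runtime: str                 # "framework" or "modern"
    dotnet_major: int = 0        # major version for modern .NET; unused for .NET Framework
    elevated: bool = False       # admin on Windows, root on Linux
    container: bool = False      # Linux container or Kubernetes pod
    console_access: bool = True  # whether commands can be run inside the workload
    kernel_ok: bool = True       # Linux kernel meets the collect-linux prerequisites


def recommend_tool(env: Environment) -> str:
    """First-pass triage only; follows the decision matrix row by row."""
    if env.os == "windows":
        if env.elevated:
            return "PerfView"
        if env.runtime == "modern":
            return "dotnet-trace"
        # .NET Framework without admin: there is no trace fallback.
        return "dumps only (procdump -ma or Task Manager)"
    # Linux from here on.
    if env.container and not env.console_access:
        return "dotnet-monitor"
    if env.dotnet_major >= 10 and env.elevated and env.kernel_ok:
        return "dotnet-trace collect-linux"
    return "dotnet-trace"


print(recommend_tool(Environment("linux", "modern", 10, elevated=True)))
```

The result is a starting point; the per-environment sections still decide details such as whether `perfcollect` is added for native stacks.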

Windows (non-container, modern .NET)

  1. PerfView (preferred): produces richer ETW-based data; requires admin privileges. For slow requests, add `/ThreadTime` to capture thread-level wait and block detail.
  2. `dotnet-trace`: fallback when admin privileges are not available.
  3. For long-running repros: use PerfView with a `/StopOn` trigger that fires on the symptom you want to capture (e.g., `/StopOnPerfCounter`, `/StopOnGCEvent`, `/StopOnException`) and a circular buffer (`/CircularMB` + `/BufferSizeMB`). Critical: the stop trigger must fire on the interesting event, not the recovery. The circular buffer continuously overwrites old data, so if you trigger on recovery, the buffer may have already overwritten the interesting behavior by the time collection stops. Only add `/StartOn` if the start event is known to precede the stop event. For slow requests, do not include a stop trigger by default; let the user design one based on their specific scenario.
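
Why the stop trigger must fire on the symptom rather than the recovery can be shown with a toy model of a circular buffer. The sketch below is not PerfView code: the `collect` function, the event names, and the four-slot buffer are all made up for illustration, with a fixed-size `deque` standing in for the circular buffer.

```python
from collections import deque


def collect(events, buffer_slots, stop_on):
    """Keep only the newest `buffer_slots` events; stop at the first `stop_on` event."""
    ring = deque(maxlen=buffer_slots)  # the oldest entries are overwritten first
    for event in events:
        ring.append(event)
        if event == stop_on:
            break
    return list(ring)


# Hypothetical timeline: a spike, a long degraded period, then recovery.
timeline = ["healthy", "spike"] + ["degraded"] * 6 + ["recovered"]

on_symptom = collect(timeline, buffer_slots=4, stop_on="spike")
on_recovery = collect(timeline, buffer_slots=4, stop_on="recovered")

print("spike" in on_symptom)   # True: the buffer still holds the spike
print("spike" in on_recovery)  # False: the spike was overwritten before the stop
```

A larger buffer delays the overwrite but does not remove the problem, which is why the trigger choice matters more than the buffer size.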

Windows containers

  1. PerfView: most Windows containers (including Kubernetes on Windows) use process isolation by default. Collect from the host with `/EnableEventsInContainers`. After collection, you have two options:
    • Analyze locally while the container is still running: PerfView can reach into the live container to resolve symbols, so you can open the trace immediately on the host machine.
    • Analyze off-machine: before the container shuts down, copy the `.etl.zip` into the container and run `PerfViewCollect merge /ImageIDsOnly` inside it to embed symbol information. Then copy the merged trace out. Without this merge step, symbols for binaries inside the container will be unresolvable on other machines.
    For the less common Hyper-V containers, collect inside the container directly. See references/perfview.md for detailed commands.
  2. `dotnet-monitor`, `dotnet-trace`: inside the container if the tools are installed in the image. For dumps, invoke the `dump-collect` skill.

Windows (.NET Framework)

  1. PerfView: the primary diagnostic tool for .NET Framework on Windows. Requires admin.
  2. Same trigger guidance for long repros: use `/StopOn` triggers that fire on the symptom (e.g., `/StopOnPerfCounter`, `/StopOnGCEvent`, `/StopOnException`) with `/CircularMB` + `/BufferSizeMB`.
  3. Without admin: PerfView requires admin, and there are no alternative trace tools for .NET Framework. Process dumps can still be captured without admin; provide dump commands directly (e.g., `procdump -ma <PID>` or Task Manager) since the `dump-collect` skill does not support .NET Framework. Dumps can help diagnose hangs and memory leaks. However, for high CPU, slow requests, and excessive GC, there is no way to investigate on .NET Framework without admin access. Advise the user to obtain admin privileges.

Linux (non-container, .NET 10+)

  1. `dotnet-trace collect-linux` (preferred): uses `perf_events` for richer traces including native call stacks and kernel events. Captures machine-wide by default (no PID required). Requires root and kernel >= 6.4.
  2. `dotnet-trace`: fallback when root privileges are not available or kernel requirements are not met. Managed stacks only.

Linux (non-container, pre-.NET 10)

  1. `dotnet-trace` (preferred): managed trace collection; no admin required.
  2. `perfcollect`: when native call stacks are needed (requires admin/root).

Linux Container / Kubernetes

If running in the context of the workload (i.e., you have console access to the container), prefer console-based tools. These are easier to set up than `dotnet-monitor`, which requires authentication configuration and sidecar deployment:
  1. `dotnet-trace collect-linux` (.NET 10+ with root): produces the richest traces including native call stacks and kernel events.
  2. `dotnet-trace`: inside the container if the tool is installed in the image. For dumps, invoke the `dump-collect` skill.
  3. `perfcollect`: inside the container when native stacks are needed on pre-.NET 10 (requires `SYS_ADMIN` / `--privileged`).
If not running in the workload context (no console access), or if `dotnet-monitor` is already deployed:
  1. `dotnet-monitor`: designed for containers; runs as a sidecar. No tools needed in the app container. Easiest option when console access is not available.

Memory dumps

When dumps are needed (memory leaks, hangs), do not provide dump collection commands directly for modern .NET; invoke the `dump-collect` skill instead. The `dump-collect` skill only supports modern .NET (.NET Core 3.0+). For .NET Framework, provide dump collection guidance directly (e.g., `procdump -ma <PID>` or Task Manager). This skill focuses on trace collection only.

Memory leaks

  • Capture two dumps as memory is increasing (e.g., one early, one after significant growth). Invoke the `dump-collect` skill for dump collection; do not provide dump commands directly. Diff the dumps in PerfView to see which objects have increased. This is the most effective way to identify what is leaking.
  • Without admin privileges: Two process dumps can give a sense of what's growing on the heap, but may not be enough to identify the root cause. If dumps aren't sufficient, reproduce the issue in an environment where admin privileges are available to collect richer data (traces).
  • Modern .NET on Linux (pre-.NET 10): Recommend two dump captures (invoke `dump-collect` skill) for heap diff, plus `dotnet-trace` while memory is growing (for allocation tracking). No trigger needed; capture during the growth period. Both together give the best picture.
  • Modern .NET 10+ on Linux with admin: Recommend two dump captures (invoke `dump-collect` skill) for heap diff, plus `dotnet-trace collect-linux` while memory is growing (richer data including native stacks). No trigger needed.
  • .NET Framework: Recommend two dumps plus a PerfView trace while memory is growing to see what is being allocated. The `dump-collect` skill does not support .NET Framework, so provide dump commands directly (e.g., `procdump -ma <PID>` or right-click → Create Dump File in Task Manager). No trigger is needed; just capture the trace during the growth period. Do not wait for an `OutOfMemoryException`.

Excessive GC

Excessive GC requires a trace to analyze GC events, pause times, and allocation patterns; a dump is not sufficient.
  • Windows (PerfView): Use `PerfView collect /GCCollectOnly` to capture GC events.
  • Linux (dotnet-trace): Use `dotnet-trace collect -p <PID> --profile gc-verbose`.
  • Linux .NET 10+ with root: Use `dotnet-trace collect-linux --profile gc-verbose` for richer data with native stacks.
  • Containers: `dotnet-monitor` can capture GC traces via its REST API (`/trace?profile=gc-verbose`).

Slow Requests

Slow requests require a thread time trace to see where threads are spending time: waiting on locks, I/O, external calls, etc. Use larger buffers since thread time traces generate more data. For ASP.NET Core applications, also enable the `Microsoft.AspNetCore.Hosting` and `Microsoft-AspNetCore-Server-Kestrel` providers to get server-side request lifecycle timing (when requests arrive, how long they take to process).
  • Windows (PerfView): Use `PerfView /ThreadTime collect /BufferSizeMB:1024 /CircularMB:2048`. The `/ThreadTime` argument adds thread-level wait and block detail. For ASP.NET Core, add the Hosting and Kestrel providers: `PerfView /ThreadTime collect /BufferSizeMB:1024 /CircularMB:2048 /Providers:*Microsoft.AspNetCore.Hosting,*Microsoft-AspNetCore-Server-Kestrel`. Do not include a stop trigger by default; let the user design one based on their specific scenario.
  • Linux (dotnet-trace): `dotnet-trace` captures thread time data by default, so no special arguments are needed. Use `dotnet-trace collect -p <PID>`. For ASP.NET Core, add the Hosting and Kestrel providers. Specifying `--providers` overrides the default profiles, so restate them with `--profile` to keep the thread time data: `dotnet-trace collect -p <PID> --profile dotnet-common,dotnet-sampled-thread-time --providers Microsoft.AspNetCore.Hosting,Microsoft-AspNetCore-Server-Kestrel`.
  • Linux .NET 10+ with root: Use `dotnet-trace collect-linux --profile thread-time` for richer data with native stacks. For ASP.NET Core, add: `--providers Microsoft.AspNetCore.Hosting,Microsoft-AspNetCore-Server-Kestrel`.
  • Containers: `dotnet-monitor` can capture traces via its REST API (`/trace?pid=<PID>&durationSeconds=30`).
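
For the container case, the request URL can be assembled as in the sketch below. It only builds the `/trace` URL with the `pid` and `durationSeconds` parameters quoted above; the `trace_url` helper name and the base address are placeholders chosen for this example, and the authentication that `dotnet-monitor` requires is not shown.

```python
from urllib.parse import urlencode


def trace_url(base_url: str, pid: int, duration_seconds: int = 30) -> str:
    """Build a dotnet-monitor /trace URL for one process (hypothetical helper)."""
    query = urlencode({"pid": pid, "durationSeconds": duration_seconds})
    return f"{base_url.rstrip('/')}/trace?{query}"


# The base address is a placeholder; use the endpoint your sidecar actually exposes.
print(trace_url("http://localhost:52323", pid=1))
```

Verify the PID against the monitor's process list first, as Step 3 requires, rather than assuming the app is PID 1.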

Hangs

  1. Start with a trace to understand what threads are doing. Use the appropriate trace tool for the environment (PerfView with `/ThreadTime` on Windows, `dotnet-trace` on Linux, `dotnet-trace collect-linux --profile thread-time` on .NET 10+ Linux with root). The trace can reveal:
    • Livelocks (threads spinning without forward progress): threads appear busy but the application makes no progress.
    • Thread starvation: the ThreadPool is exhausted and queued work items are not being processed. This can look like a deadlock but has a different root cause.
    • Whether there is any forward progress at all: if some threads are making progress, the issue may be a bottleneck rather than a true hang.
  2. If the trace does not explain the hang, the issue may be a true deadlock (threads waiting on each other in a cycle). In this case, invoke the `dump-collect` skill to collect a process dump; do not provide dump commands directly.
  3. Analyze the dump with a debugger to inspect thread stacks and identify the lock cycle:
    • Windows: Visual Studio or WinDbg with the SOS debugger extension.
    • Linux: `lldb` with the SOS debugger extension.

Networking Issues

Networking issues (HTTP 5xx errors from downstream services, request timeouts, connection failures, DNS resolution failures, TLS handshake failures, connection pool exhaustion) require both a thread-time trace and networking event providers. The thread-time trace shows where threads are blocked (slow downstream calls, thread starvation), while the networking events show the request lifecycle: which requests failed, what status codes came back, how long DNS resolution and TLS handshakes took, and how long requests waited for a connection from the pool.
For .NET Framework, `PerfView /ThreadTime` already collects the relevant networking events (from the `System.Net` ETW provider); no additional providers are needed.
For modern .NET, you must explicitly enable the `System.Net.*` EventSource providers:
| Provider | What it covers |
| --- | --- |
| `System.Net.Http` | HttpClient/SocketsHttpHandler: request lifecycle, HTTP status codes, connection pool |
| `System.Net.NameResolution` | DNS lookups (start/stop, duration) |
| `System.Net.Security` | TLS/SSL handshakes (SslStream) |
| `System.Net.Sockets` | Low-level socket connect/disconnect |
Key events from `System.Net.Http`: `RequestStart` (scheme, host, port, path), `RequestStop` (statusCode; `-1` if no response was received), `RequestFailed` (exception message for timeouts, connection refused, etc.), `RequestLeftQueue` (time waiting for a connection from the pool, which indicates connection pool exhaustion), `ConnectionEstablished`, `ConnectionClosed`.
Collect a thread-time trace with networking providers enabled (modern .NET only; .NET Framework needs only `PerfView /ThreadTime`):
  • Windows (PerfView): Use `PerfView /ThreadTime collect /BufferSizeMB:1024 /CircularMB:2048 /Providers:*System.Net.Http,*System.Net.NameResolution,*System.Net.Security,*System.Net.Sockets`. For .NET Framework, omit the `/Providers` flag; `/ThreadTime` already includes the networking events. The thread-time trace shows where threads are blocked while the networking events show what requests are failing and why.
  • Linux (dotnet-trace): `dotnet-trace` captures thread time data by default, but specifying `--providers` overrides the defaults so you must also include `--profile`: `dotnet-trace collect -p <PID> --profile dotnet-common,dotnet-sampled-thread-time --providers System.Net.Http,System.Net.NameResolution,System.Net.Security,System.Net.Sockets`.
  • Linux .NET 10+ with root: Use `dotnet-trace collect-linux --profile dotnet-common,cpu-sampling,thread-time --providers System.Net.Http,System.Net.NameResolution,System.Net.Security,System.Net.Sockets`.
  • Containers: `dotnet-monitor` can capture traces with custom providers via its REST API.
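
Because `--providers` replaces the default profiles, it is easy to drop the thread-time data by accident. The sketch below assembles the Linux `dotnet-trace` command line and restates the defaults whenever providers are added. The function name, constant names, and the PID `1234` are hypothetical; the flags and profile names are the ones used in the commands above.

```python
import shlex

DEFAULT_PROFILES = ["dotnet-common", "dotnet-sampled-thread-time"]
NETWORK_PROVIDERS = [
    "System.Net.Http",
    "System.Net.NameResolution",
    "System.Net.Security",
    "System.Net.Sockets",
]


def dotnet_trace_collect(pid, providers=(), profiles=()):
    """Return the argv for `dotnet-trace collect`, keeping thread-time data."""
    args = ["dotnet-trace", "collect", "-p", str(pid)]
    profiles = list(profiles)
    if providers and not profiles:
        # --providers replaces the defaults, so restate them explicitly.
        profiles = list(DEFAULT_PROFILES)
    if profiles:
        args += ["--profile", ",".join(profiles)]
    if providers:
        args += ["--providers", ",".join(providers)]
    return args


# 1234 is a placeholder PID; discover the real one with `dotnet-trace ps`.
print(shlex.join(dotnet_trace_collect(1234, NETWORK_PROVIDERS)))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting.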

Assembly Loading Issues

For modern .NET, assembly loading issues (`FileNotFoundException`, `FileLoadException`, `ReflectionTypeLoadException`, version conflicts, duplicate assembly loads across AssemblyLoadContexts) require collecting assembly loader binder events from the `Microsoft-Windows-DotNETRuntime` provider with the AssemblyLoader keyword (`0x4`). These events trace every step of the runtime's assembly resolution algorithm: which paths were probed, which AssemblyLoadContext handled the load, whether the load succeeded or failed, and why. For .NET Framework, the same provider and keyword work for ETW-based collection; additionally, the Fusion Log Viewer (`fuslogvw.exe`) can diagnose assembly binding failures without requiring a trace.
The provider specification is `Microsoft-Windows-DotNETRuntime:0x4:4` (provider name, AssemblyLoader keyword, Informational verbosity).
  • Windows (PerfView): A default PerfView trace already includes binder events; simply run `PerfView collect` with no extra providers. For a smaller trace file, use `PerfView collect /ClrEvents:Default-Profile`, which removes the most verbose default events while keeping the events necessary for diagnosing assembly loading issues.
  • Linux / cross-platform (dotnet-trace): Use `dotnet-trace collect --clrevents assemblyloader -- <path-to-built-exe>` to launch and trace the process, or `dotnet-trace collect --clrevents assemblyloader -p <PID>` to attach to a running process.
  • Linux .NET 10+ with root: Use `dotnet-trace collect-linux --clrevents assemblyloader`.
  • Containers: `dotnet-monitor` can capture traces with the loader provider via its REST API.
For short-lived processes that fail on startup (common with assembly loading issues), prefer the `dotnet-trace` launch form (`-- <path-to-built-exe>`) over attaching by PID, since the process may exit before you can attach.
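
The provider specification and the launch-versus-attach choice can be sketched as two small helpers. Both function names and the example paths are assumptions for illustration; the flags are the ones listed above.

```python
def provider_spec(name: str, keywords: int, level: int) -> str:
    """Format a provider as name:keywords:level, with keywords in hex."""
    return f"{name}:{keywords:#x}:{level}"


def assembly_loader_trace(pid=None, exe_path=None):
    """Return argv for tracing assembly loader events with dotnet-trace."""
    base = ["dotnet-trace", "collect", "--clrevents", "assemblyloader"]
    if exe_path is not None:
        # Launch form: preferred when the process may exit before an attach.
        return base + ["--", exe_path]
    if pid is not None:
        return base + ["-p", str(pid)]
    raise ValueError("either exe_path or pid is required")


print(provider_spec("Microsoft-Windows-DotNETRuntime", 0x4, 4))
# "./bin/MyApp" is a placeholder path to the built executable.
print(assembly_loader_trace(exe_path="./bin/MyApp"))
```

When both a path and a PID are available, this sketch picks the launch form, matching the guidance for processes that fail on startup.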
Explain the trade-offs when recommending a tool. For example:
  • PerfView gives richer data but needs admin; runs on Windows including Windows containers.
  • `dotnet-trace` works cross-platform without admin but captures less system-level detail.
  • `perfcollect` captures native call stacks but needs admin/root.
  • `dotnet-monitor` is the best option for containers/K8s when console access is not available, but requires sidecar deployment and authentication configuration.

Step 3: Guide data collection

Provide the specific commands for the recommended tool. Load the appropriate reference file from the tool reference lookup table for detailed command-line examples.
Key guidance to include:
  1. Installation: How to install the tool if it is not already available (e.g. `dotnet tool install -g dotnet-trace`). When recommending multiple tools, provide installation and usage instructions for each one; do not mention a tool without showing how to install and use it.
  2. PID discovery (required before any `-p <PID>` command): Verify the target process first (for example: `dotnet-trace ps`, `curl <monitor-endpoint>/processes`, or `ps` inside a container). If the app is expected to be PID 1 in a container, still verify before collecting.
  3. Collection command: The exact command to run, including relevant providers, output format, and duration.
  4. Container considerations:
    • Collecting from inside the container: ensure the tool is installed in the image or use `kubectl cp` to copy it in.
    • Collecting from outside the container: use `dotnet-monitor` as a sidecar with a shared diagnostic port (Unix domain socket in `/tmp`).
    • Kubernetes: `dotnet-monitor` as a sidecar container, or `kubectl debug` for ephemeral debug containers.
  5. Long-running repros (Windows/PerfView): show how to use trigger arguments and circular buffer settings.
  6. Output location: Where the collected file will be saved and how to copy it off the target for analysis.
  7. Artifact handoff checklist: Include runtime version, OS/kernel, container image tag or build SHA, PID/process name, UTC collection start/end timestamps, exact command used, and final artifact path when handing traces to someone else for analysis.
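
One way to keep the handoff checklist from item 7 honest is to record it as a small manifest next to the trace file. The sketch below is an assumption about how that could look: the `TraceHandoff` class, its field names, and the sample values are invented for this example and are not a format any tool expects.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class TraceHandoff:
    """Hypothetical manifest covering the handoff checklist fields."""
    runtime_version: str
    os_kernel: str
    image_tag_or_build_sha: str
    pid: int
    process_name: str
    started_utc: str
    ended_utc: str
    command: str
    artifact_path: str


def utc_stamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as a UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def missing_fields(handoff: TraceHandoff) -> list:
    """List checklist fields that were left empty."""
    return [key for key, value in asdict(handoff).items() if value in ("", None)]


# All values below are placeholders for illustration.
handoff = TraceHandoff(
    runtime_version=".NET 8.0",
    os_kernel="Linux 6.5",
    image_tag_or_build_sha="",
    pid=1234,
    process_name="MyApp",
    started_utc=utc_stamp(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)),
    ended_utc=utc_stamp(datetime(2024, 1, 2, 3, 5, 5, tzinfo=timezone.utc)),
    command="dotnet-trace collect -p 1234",
    artifact_path="/tmp/myapp.nettrace",
)
print(json.dumps(asdict(handoff), indent=2))
print(missing_fields(handoff))
```

Checking for empty fields before handing off catches the most common omission, the image tag or build SHA.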

Step 4: Recommend analysis approach

After data is collected, recommend the appropriate tool for analysis. Do not perform the analysis — just point the developer to the right tool and documentation.
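| Collected Data | Analysis Tool | Notes |
| --- | --- | --- |
| `.nettrace` file | PerfView (Windows), Speedscope (web) | PerfView gives the richest view on Windows. For Speedscope, convert the file first with `dotnet-trace convert --format Speedscope`. |
| `.etl` / `.etl.zip` file | PerfView | ETW traces from PerfView or perfcollect |
| `perf.data.nl` from perfcollect | PerfView (Windows) | Copy the file to a Windows machine and open with PerfView |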
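
The mapping in the table above can be expressed as a simple lookup on the artifact name. This is a minimal sketch; the function name `analysis_tool_for` is hypothetical and the patterns only cover the three artifact types the table lists.

```shell
#!/bin/sh
# Sketch only: maps a collected artifact to the analysis tool from the table.
# The function name is illustrative, not part of any real tool.
analysis_tool_for() {
  case "$1" in
    *.nettrace)       echo "PerfView (Windows) or Speedscope (web)" ;;
    *.etl|*.etl.zip)  echo "PerfView" ;;
    *perf.data.nl)    echo "PerfView (Windows)" ;;
    *)                echo "unknown artifact type: $1" >&2; return 1 ;;
  esac
}
```

For example, `analysis_tool_for app.nettrace` prints the first row's tools, and an unrecognized extension returns a non-zero status rather than guessing.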

Validation

  • The recommended tool is compatible with the developer's runtime, OS, and deployment topology
  • The collection command runs without errors
  • The output file is generated in the expected location
  • The developer knows which analysis tool to use for the collected data
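
The second and third checks can be partly automated. The sketch below, with the hypothetical helper name `check_artifact`, only confirms that the expected output file exists and is non-empty. It does not prove the trace contains the providers or time window you intended.

```shell
#!/bin/sh
# Sketch only: confirm a collected artifact exists and is non-empty.
# The function name is illustrative.
check_artifact() {
  if [ ! -s "$1" ]; then
    echo "missing or empty: $1" >&2
    return 1
  fi
  # tr strips the padding some wc implementations add around the count
  echo "ok: $1 ($(wc -c < "$1" | tr -d ' ') bytes)"
}
```

Running `check_artifact` immediately after collection, before copying the file off the target, catches a wrong output path early.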

Common Pitfalls

| Pitfall | Solution |
| --- | --- |
| Using `dotnet-trace` on .NET Framework | `dotnet-trace` only works with modern .NET (.NET Core 3.0+). Use PerfView for .NET Framework. |
| PerfView without admin privileges | PerfView requires admin for ETW tracing. Fall back to `dotnet-trace` if admin is not available. |
| `perfcollect` in container without `SYS_ADMIN` | Containers drop `SYS_ADMIN` by default. Run with `--privileged` or add the `SYS_ADMIN` capability, or fall back to `dotnet-trace`. |
| Huge trace files from long repros | On Windows, use PerfView `/StopOn` triggers that fire on the symptom you want to capture (e.g., `/StopOnPerfCounter`, `/StopOnGCEvent`, `/StopOnException`) with `/CircularMB` and `/BufferSizeMB`. Never trigger on recovery: the circular buffer continuously overwrites old data, so the interesting behavior may be lost by the time collection stops. |
| Diagnostic port not accessible in container | Mount `/tmp` as a shared volume between the app container and the `dotnet-monitor` sidecar for the diagnostic Unix domain socket. |
| Forgetting to install tools in container image | Add `dotnet tool install` to your Dockerfile, or use `dotnet-monitor` as a sidecar to avoid modifying the app image. |
| Exposing `dotnet-monitor` with `--no-auth` in production | Keep auth enabled, bind to localhost, and use `kubectl port-forward` for access. Use `--no-auth` only for short-lived isolated debugging. |
| Collecting only CPU/thread-time trace for networking issues | CPU and thread-time traces alone do not show HTTP status codes, DNS timing, or connection pool behavior. Add the networking providers (`System.Net.Http`, `System.Net.NameResolution`, `System.Net.Security`, `System.Net.Sockets`) alongside the thread-time trace. |
| Enabling all networking providers when only one is needed | Each networking provider adds overhead. If the issue is clearly HTTP-level (5xx status codes), `System.Net.Http` alone may be sufficient. Add DNS, TLS, and socket providers when the root cause is unclear. |
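
The last two rows describe a choice between one networking provider and all four. A minimal sketch of that choice follows; the helper name `net_providers` and its `http`/`all` arguments are assumptions made for illustration. The provider names are the ones listed in the table, joined with commas as `--providers` expects.

```shell
#!/bin/sh
# Sketch only: pick the networking providers to pass to --providers.
# "http" = HTTP-level symptoms only; "all" = root cause unclear.
# The function name and its arguments are illustrative.
net_providers() {
  case "${1:-all}" in
    http) echo "System.Net.Http" ;;
    all)  echo "System.Net.Http,System.Net.NameResolution,System.Net.Security,System.Net.Sockets" ;;
    *)    echo "usage: net_providers [http|all]" >&2; return 1 ;;
  esac
}
```

The result could then be used as `dotnet-trace collect -p <PID> --providers "$(net_providers http)"`, widening to `all` only if the HTTP events alone do not explain the failure.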