# SARIF Parsing Best Practices
You are a SARIF parsing expert. Your role is to help users effectively read, analyze, and process SARIF files from static analysis tools.
## When to Use
Use this skill when:
- Reading or interpreting static analysis scan results in SARIF format
- Aggregating findings from multiple security tools
- Deduplicating or filtering security alerts
- Extracting specific vulnerabilities from SARIF files
- Integrating SARIF data into CI/CD pipelines
- Converting SARIF output to other formats
## When NOT to Use
Do NOT use this skill for:
- Running static analysis scans (use CodeQL or Semgrep skills instead)
- Writing CodeQL or Semgrep rules (use their respective skills)
- Analyzing source code directly (SARIF is for processing existing scan results)
- Triaging findings without SARIF input (use variant-analysis or audit skills)
## SARIF Structure Overview
SARIF 2.1.0 is the current OASIS standard. Every SARIF file has this hierarchical structure:
```
sarifLog
├── version: "2.1.0"
├── $schema: (optional, enables IDE validation)
└── runs[] (array of analysis runs)
    ├── tool
    │   ├── driver
    │   │   ├── name (required)
    │   │   ├── version
    │   │   └── rules[] (rule definitions)
    │   └── extensions[] (plugins)
    ├── results[] (findings)
    │   ├── ruleId
    │   ├── level (error/warning/note)
    │   ├── message.text
    │   ├── locations[]
    │   │   └── physicalLocation
    │   │       ├── artifactLocation.uri
    │   │       └── region (startLine, startColumn, etc.)
    │   ├── fingerprints{}
    │   └── partialFingerprints{}
    └── artifacts[] (scanned files metadata)
```
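To make the hierarchy concrete, here is a minimal hand-written log walked with only the standard library (the tool name, rule ID, and location are invented for illustration):

```python
import json

# A tiny SARIF log exercising the fields shown above (values are illustrative).
sarif = json.loads("""
{
  "version": "2.1.0",
  "runs": [{
    "tool": {"driver": {"name": "example-scanner"}},
    "results": [{
      "ruleId": "demo/hardcoded-secret",
      "level": "error",
      "message": {"text": "Hardcoded credential"},
      "locations": [{"physicalLocation": {
        "artifactLocation": {"uri": "src/config.py"},
        "region": {"startLine": 12}}}]
    }]
  }]
}
""")

for run in sarif["runs"]:
    print(run["tool"]["driver"]["name"])  # name is the only required driver field
    for result in run["results"]:
        phys = result["locations"][0]["physicalLocation"]
        print(result["ruleId"], phys["artifactLocation"]["uri"],
              phys["region"]["startLine"])
```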
## Why Fingerprinting Matters
Without stable fingerprints, you can't track findings across runs:
- Baseline comparison: "Is this a new finding or did we see it before?"
- Regression detection: "Did this PR introduce new vulnerabilities?"
- Suppression: "Ignore this known false positive in future runs"
Tools report different paths (`/path/to/project/` vs `/github/workspace/`), so path-based matching fails. Fingerprints hash the content (code snippet, rule ID, relative location) to create stable identifiers regardless of environment.
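As a sketch of the baseline-comparison idea, two hypothetical result lists keyed on `partialFingerprints` (rule IDs and hashes are invented; `primaryLocationLineHash` is a key commonly emitted by GitHub code scanning tools):

```python
# Hypothetical baseline and current results (only the fields needed here).
baseline = [
    {"ruleId": "py/sql-injection",
     "partialFingerprints": {"primaryLocationLineHash": "ab12cd34ef56:1"}},
]
current = [
    {"ruleId": "py/sql-injection",
     "partialFingerprints": {"primaryLocationLineHash": "ab12cd34ef56:1"}},  # known
    {"ruleId": "py/path-injection",
     "partialFingerprints": {"primaryLocationLineHash": "99887766aabb:1"}},  # new
]

def key(result: dict) -> tuple:
    """Order-independent key over the fingerprint dictionary."""
    return tuple(sorted(result["partialFingerprints"].items()))

known = {key(r) for r in baseline}
new_findings = [r for r in current if key(r) not in known]
print([r["ruleId"] for r in new_findings])  # only the unseen finding remains
```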
## Tool Selection Guide
| Use Case | Tool | Installation |
|---|---|---|
| Quick CLI queries | jq | `brew install jq` / `apt install jq` |
| Python scripting (simple) | pysarif | `pip install pysarif` |
| Python scripting (advanced) | sarif-tools | `pip install sarif-tools` |
| .NET applications | SARIF SDK | NuGet package |
| JavaScript/Node.js | sarif-js | npm package |
| Go applications | garif | Go module |
| Validation | SARIF Validator | sarifweb.azurewebsites.net |
## Strategy 1: Quick Analysis with jq
For rapid exploration and one-off queries:
```bash
# Pretty print the file
jq '.' results.sarif

# Count total findings
jq '[.runs[].results[]] | length' results.sarif

# List all rule IDs triggered
jq '[.runs[].results[].ruleId] | unique' results.sarif

# Extract errors only
jq '.runs[].results[] | select(.level == "error")' results.sarif

# Get findings with file locations
jq '.runs[].results[] | {
  rule: .ruleId,
  message: .message.text,
  file: .locations[0].physicalLocation.artifactLocation.uri,
  line: .locations[0].physicalLocation.region.startLine
}' results.sarif

# Filter by severity and get count per rule
jq '[.runs[].results[] | select(.level == "error")] | group_by(.ruleId) | map({rule: .[0].ruleId, count: length})' results.sarif

# Extract findings for a specific file
jq --arg file "src/auth.py" '.runs[].results[] | select(.locations[].physicalLocation.artifactLocation.uri | contains($file))' results.sarif
```

## Strategy 2: Python with pysarif
For programmatic access with a full object model:
```python
from pysarif import load_from_file, save_to_file

# Load SARIF file
sarif = load_from_file("results.sarif")

# Iterate through runs and results
for run in sarif.runs:
    tool_name = run.tool.driver.name
    print(f"Tool: {tool_name}")
    for result in run.results:
        print(f"  [{result.level}] {result.rule_id}: {result.message.text}")
        if result.locations:
            loc = result.locations[0].physical_location
            if loc and loc.artifact_location:
                print(f"    File: {loc.artifact_location.uri}")
            if loc and loc.region:
                print(f"    Line: {loc.region.start_line}")

# Save modified SARIF
save_to_file(sarif, "modified.sarif")
```

## Strategy 3: Python with sarif-tools
For aggregation, reporting, and CI/CD integration:
```python
from sarif import loader

# Load single file
sarif_data = loader.load_sarif_file("results.sarif")

# Or load multiple files
sarif_set = loader.load_sarif_files(["tool1.sarif", "tool2.sarif"])

# Get summary report
report = sarif_data.get_report()

# Get histogram by severity
errors = report.get_issue_type_histogram_for_severity("error")
warnings = report.get_issue_type_histogram_for_severity("warning")

# Filter results
high_severity = [r for r in sarif_data.get_results()
                 if r.get("level") == "error"]
```

**sarif-tools CLI commands:**

```bash
# Summary of findings
sarif summary results.sarif

# List all results with details
sarif ls results.sarif

# Get results by severity
sarif ls --level error results.sarif

# Diff two SARIF files (find new/fixed issues)
sarif diff baseline.sarif current.sarif

# Convert to other formats
sarif csv results.sarif > results.csv
sarif html results.sarif > report.html
```

## Strategy 4: Aggregating Multiple SARIF Files
When combining results from multiple tools:
```python
import json

def aggregate_sarif_files(sarif_paths: list[str]) -> dict:
    """Combine multiple SARIF files into one."""
    aggregated = {
        "version": "2.1.0",
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "runs": []
    }
    for path in sarif_paths:
        with open(path) as f:
            sarif = json.load(f)
        aggregated["runs"].extend(sarif.get("runs", []))
    return aggregated

def deduplicate_results(sarif: dict) -> dict:
    """Remove duplicate findings based on fingerprints."""
    seen_fingerprints = set()
    for run in sarif["runs"]:
        unique_results = []
        for result in run.get("results", []):
            # Use partialFingerprints or create a key from the location
            if result.get("partialFingerprints"):
                fp = tuple(sorted(result["partialFingerprints"].items()))
            elif result.get("fingerprints"):
                fp = tuple(sorted(result["fingerprints"].items()))
            else:
                # Fallback: create fingerprint from rule + location
                loc = (result.get("locations") or [{}])[0]
                phys = loc.get("physicalLocation", {})
                fp = (
                    result.get("ruleId"),
                    phys.get("artifactLocation", {}).get("uri"),
                    phys.get("region", {}).get("startLine")
                )
            if fp not in seen_fingerprints:
                seen_fingerprints.add(fp)
                unique_results.append(result)
        run["results"] = unique_results
    return sarif
```
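To see the fallback branch in isolation, here is a self-contained version of the same location key (re-declared so the snippet runs on its own; the inputs are illustrative):

```python
def result_key(result: dict) -> tuple:
    """Mirror of the fallback key above: rule + file + line."""
    loc = (result.get("locations") or [{}])[0]
    phys = loc.get("physicalLocation", {})
    return (result.get("ruleId"),
            phys.get("artifactLocation", {}).get("uri"),
            phys.get("region", {}).get("startLine"))

results = [
    {"ruleId": "a", "locations": [{"physicalLocation": {
        "artifactLocation": {"uri": "x.py"}, "region": {"startLine": 1}}}]},
    {"ruleId": "a", "locations": [{"physicalLocation": {
        "artifactLocation": {"uri": "x.py"}, "region": {"startLine": 1}}}]},  # duplicate
    {"ruleId": "b", "locations": [{"physicalLocation": {
        "artifactLocation": {"uri": "x.py"}, "region": {"startLine": 2}}}]},
]

seen, unique = set(), []
for r in results:
    k = result_key(r)
    if k not in seen:
        seen.add(k)
        unique.append(r)
print(len(unique))  # the exact duplicate collapses into one entry
```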
## Strategy 5: Extracting Actionable Data
```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    rule_id: str
    level: str
    message: str
    file_path: Optional[str]
    start_line: Optional[int]
    end_line: Optional[int]
    fingerprint: Optional[str]

def extract_findings(sarif_path: str) -> list[Finding]:
    """Extract structured findings from SARIF file."""
    with open(sarif_path) as f:
        sarif = json.load(f)
    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            loc = (result.get("locations") or [{}])[0]
            phys = loc.get("physicalLocation", {})
            region = phys.get("region", {})
            findings.append(Finding(
                rule_id=result.get("ruleId", "unknown"),
                level=result.get("level", "warning"),
                message=result.get("message", {}).get("text", ""),
                file_path=phys.get("artifactLocation", {}).get("uri"),
                start_line=region.get("startLine"),
                end_line=region.get("endLine"),
                fingerprint=next(iter(result.get("partialFingerprints", {}).values()), None)
            ))
    return findings

# Filter and prioritize
def prioritize_findings(findings: list[Finding]) -> list[Finding]:
    """Sort findings by severity."""
    severity_order = {"error": 0, "warning": 1, "note": 2, "none": 3}
    return sorted(findings, key=lambda f: severity_order.get(f.level, 99))
```

## Common Pitfalls and Solutions
### 1. Path Normalization Issues
Different tools report paths differently (absolute, relative, URI-encoded):
```python
from urllib.parse import unquote
from pathlib import Path

def normalize_path(uri: str, base_path: str = "") -> str:
    """Normalize SARIF artifact URI to consistent path."""
    # Remove file:// prefix if present
    if uri.startswith("file://"):
        uri = uri[7:]
    # URL decode
    uri = unquote(uri)
    # Handle relative paths
    if not Path(uri).is_absolute() and base_path:
        uri = str(Path(base_path) / uri)
    # Normalize separators
    return str(Path(uri))
```

### 2. Fingerprint Mismatch Across Runs
Fingerprints may not match if:
- File paths differ between environments
- Tool versions changed fingerprinting algorithm
- Code was reformatted (changing line numbers)
**Solution:** Use multiple fingerprint strategies:
```python
import hashlib

def compute_stable_fingerprint(result: dict, file_content: str | None = None) -> str:
    """Compute environment-independent fingerprint."""
    components = [
        result.get("ruleId", ""),
        result.get("message", {}).get("text", "")[:100],  # First 100 chars
    ]
    # Add code snippet if available
    if file_content and result.get("locations"):
        region = result["locations"][0].get("physicalLocation", {}).get("region", {})
        if region.get("startLine"):
            lines = file_content.split("\n")
            line_idx = region["startLine"] - 1
            if 0 <= line_idx < len(lines):
                # Normalize whitespace
                components.append(lines[line_idx].strip())
    return hashlib.sha256("".join(components).encode()).hexdigest()[:16]
```

### 3. Missing or Incomplete Data
SARIF allows many optional fields. Always use defensive access:
```python
def safe_get_location(result: dict) -> tuple[str, int]:
    """Safely extract file and line from result."""
    try:
        loc = result.get("locations", [{}])[0]
        phys = loc.get("physicalLocation", {})
        file_path = phys.get("artifactLocation", {}).get("uri", "unknown")
        line = phys.get("region", {}).get("startLine", 0)
        return file_path, line
    except (IndexError, KeyError, TypeError):
        return "unknown", 0
```

### 4. Large File Performance
For very large SARIF files (100MB+):
```python
import ijson  # pip install ijson

def stream_results(sarif_path: str):
    """Stream results without loading entire file."""
    with open(sarif_path, "rb") as f:
        # Stream through results arrays
        for result in ijson.items(f, "runs.item.results.item"):
            yield result
```

### 5. Schema Validation
Validate before processing to catch malformed files:
```bash
# Using ajv-cli
npm install -g ajv-cli
ajv validate -s sarif-schema-2.1.0.json -d results.sarif

# Using Python jsonschema
pip install jsonschema
```

```python
from jsonschema import validate, ValidationError
import json

def validate_sarif(sarif_path: str, schema_path: str) -> bool:
    """Validate SARIF file against schema."""
    with open(sarif_path) as f:
        sarif = json.load(f)
    with open(schema_path) as f:
        schema = json.load(f)
    try:
        validate(sarif, schema)
        return True
    except ValidationError as e:
        print(f"Validation error: {e.message}")
        return False
```

## CI/CD Integration Patterns
### GitHub Actions
```yaml
- name: Upload SARIF
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: results.sarif

- name: Check for high severity
  run: |
    HIGH_COUNT=$(jq '[.runs[].results[] | select(.level == "error")] | length' results.sarif)
    if [ "$HIGH_COUNT" -gt 0 ]; then
      echo "Found $HIGH_COUNT high severity issues"
      exit 1
    fi
```

### Fail on New Issues
```python
from sarif import loader

def check_for_regressions(baseline: str, current: str) -> int:
    """Return count of new issues not in baseline."""
    baseline_data = loader.load_sarif_file(baseline)
    current_data = loader.load_sarif_file(current)
    baseline_fps = {get_fingerprint(r) for r in baseline_data.get_results()}
    new_issues = [r for r in current_data.get_results()
                  if get_fingerprint(r) not in baseline_fps]
    return len(new_issues)
```
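`check_for_regressions` calls a `get_fingerprint` helper that is not defined here; a minimal sketch (an assumption, not a sarif-tools API) that prefers tool-supplied fingerprints and falls back to rule + location:

```python
def get_fingerprint(result: dict) -> tuple:
    """Hypothetical helper: stable identity for a result dictionary."""
    fps = result.get("partialFingerprints") or result.get("fingerprints")
    if fps:
        return tuple(sorted(fps.items()))
    phys = (result.get("locations") or [{}])[0].get("physicalLocation", {})
    return (result.get("ruleId"),
            phys.get("artifactLocation", {}).get("uri"),
            phys.get("region", {}).get("startLine"))
```

Wire it into CI by failing the job whenever the returned regression count is nonzero.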
## Key Principles
- Validate first: Check SARIF structure before processing
- Handle optionals: Many fields are optional; use defensive access
- Normalize paths: Tools report paths differently; normalize early
- Fingerprint wisely: Combine multiple strategies for stable deduplication
- Stream large files: Use ijson or similar for 100MB+ files
- Aggregate thoughtfully: Preserve tool metadata when combining files
## Skill Resources
For ready-to-use query templates, see `{baseDir}/resources/jq-queries.md`:
- 40+ jq queries for common SARIF operations
- Severity filtering, rule extraction, aggregation patterns
For Python utilities, see `{baseDir}/resources/sarif_helpers.py`:
- `normalize_path()` - Handle tool-specific path formats
- `compute_fingerprint()` - Stable fingerprinting ignoring paths
- `deduplicate_results()` - Remove duplicates across runs