# SARIF Parsing Best Practices
You are a SARIF parsing expert. Your role is to help users effectively read, analyze, and process SARIF files from static analysis tools.
## When to Use
Use this skill when:
- Reading or interpreting static analysis scan results in SARIF format
- Aggregating findings from multiple security tools
- Deduplicating or filtering security alerts
- Extracting specific vulnerabilities from SARIF files
- Integrating SARIF data into CI/CD pipelines
- Converting SARIF output to other formats
## When NOT to Use
Do NOT use this skill for:
- Running static analysis scans (use CodeQL or Semgrep skills instead)
- Writing CodeQL or Semgrep rules (use their respective skills)
- Analyzing source code directly (SARIF is for processing existing scan results)
- Triaging findings without SARIF input (use variant-analysis or audit skills)
## SARIF Structure Overview
SARIF 2.1.0 is the current OASIS standard. Every SARIF file has this hierarchical structure:
```
sarifLog
├── version: "2.1.0"
├── $schema: (optional, enables IDE validation)
└── runs[] (array of analysis runs)
    ├── tool
    │   ├── driver
    │   │   ├── name (required)
    │   │   ├── version
    │   │   └── rules[] (rule definitions)
    │   └── extensions[] (plugins)
    ├── results[] (findings)
    │   ├── ruleId
    │   ├── level (error/warning/note)
    │   ├── message.text
    │   ├── locations[]
    │   │   └── physicalLocation
    │   │       ├── artifactLocation.uri
    │   │       └── region (startLine, startColumn, etc.)
    │   ├── fingerprints{}
    │   └── partialFingerprints{}
    └── artifacts[] (scanned files metadata)
```
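To make the hierarchy concrete, here is a minimal hand-written log walked with only the standard library (the tool name, rule ID, and location are invented for illustration):

```python
import json

# A tiny SARIF log exercising the fields shown above (values are illustrative).
sarif = json.loads("""
{
  "version": "2.1.0",
  "runs": [{
    "tool": {"driver": {"name": "example-scanner"}},
    "results": [{
      "ruleId": "demo/hardcoded-secret",
      "level": "error",
      "message": {"text": "Hardcoded credential"},
      "locations": [{"physicalLocation": {
        "artifactLocation": {"uri": "src/config.py"},
        "region": {"startLine": 12}}}]
    }]
  }]
}
""")

for run in sarif["runs"]:
    print(run["tool"]["driver"]["name"])  # name is the only required driver field
    for result in run["results"]:
        phys = result["locations"][0]["physicalLocation"]
        print(result["ruleId"], phys["artifactLocation"]["uri"],
              phys["region"]["startLine"])
```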
## Why Fingerprinting Matters
Without stable fingerprints, you can't track findings across runs:
- Baseline comparison: "Is this a new finding or did we see it before?"
- Regression detection: "Did this PR introduce new vulnerabilities?"
- Suppression: "Ignore this known false positive in future runs"
Tools report different paths (`/path/to/project/` vs `/github/workspace/`), so path-based matching fails. Fingerprints hash the content (code snippet, rule ID, relative location) to create stable identifiers regardless of environment.
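As a sketch of the baseline-comparison idea, two hypothetical result lists keyed on `partialFingerprints` (rule IDs and hashes are invented; `primaryLocationLineHash` is a key commonly emitted by GitHub code scanning tools):

```python
# Hypothetical baseline and current results (only the fields needed here).
baseline = [
    {"ruleId": "py/sql-injection",
     "partialFingerprints": {"primaryLocationLineHash": "ab12cd34ef56:1"}},
]
current = [
    {"ruleId": "py/sql-injection",
     "partialFingerprints": {"primaryLocationLineHash": "ab12cd34ef56:1"}},  # known
    {"ruleId": "py/path-injection",
     "partialFingerprints": {"primaryLocationLineHash": "99887766aabb:1"}},  # new
]

def key(result: dict) -> tuple:
    """Order-independent key over the fingerprint dictionary."""
    return tuple(sorted(result["partialFingerprints"].items()))

known = {key(r) for r in baseline}
new_findings = [r for r in current if key(r) not in known]
print([r["ruleId"] for r in new_findings])  # only the unseen finding remains
```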
## Tool Selection Guide
| Use Case | Tool | Installation |
|---|---|---|
| Quick CLI queries | jq | `brew install jq` / `apt install jq` |
| Python scripting (simple) | pysarif | `pip install pysarif` |
| Python scripting (advanced) | sarif-tools | `pip install sarif-tools` |
| .NET applications | SARIF SDK | NuGet package |
| JavaScript/Node.js | sarif-js | npm package |
| Go applications | garif | Go module |
| Validation | SARIF Validator | sarifweb.azurewebsites.net |
## Strategy 1: Quick Analysis with jq
For rapid exploration and one-off queries:
```bash
# Pretty print the file
jq '.' results.sarif

# Count total findings
jq '[.runs[].results[]] | length' results.sarif

# List all rule IDs triggered
jq '[.runs[].results[].ruleId] | unique' results.sarif

# Extract errors only
jq '.runs[].results[] | select(.level == "error")' results.sarif

# Get findings with file locations
jq '.runs[].results[] | {
  rule: .ruleId,
  message: .message.text,
  file: .locations[0].physicalLocation.artifactLocation.uri,
  line: .locations[0].physicalLocation.region.startLine
}' results.sarif

# Filter by severity and get count per rule
jq '[.runs[].results[] | select(.level == "error")] | group_by(.ruleId) | map({rule: .[0].ruleId, count: length})' results.sarif

# Extract findings for a specific file
jq --arg file "src/auth.py" '.runs[].results[] | select(.locations[].physicalLocation.artifactLocation.uri | contains($file))' results.sarif
```

## Strategy 2: Python with pysarif
For programmatic access with a full object model:
```python
from pysarif import load_from_file, save_to_file

# Load SARIF file
sarif = load_from_file("results.sarif")

# Iterate through runs and results
for run in sarif.runs:
    tool_name = run.tool.driver.name
    print(f"Tool: {tool_name}")
    for result in run.results:
        print(f"  [{result.level}] {result.rule_id}: {result.message.text}")
        if result.locations:
            loc = result.locations[0].physical_location
            if loc and loc.artifact_location:
                print(f"    File: {loc.artifact_location.uri}")
            if loc and loc.region:
                print(f"    Line: {loc.region.start_line}")

# Save modified SARIF
save_to_file(sarif, "modified.sarif")
```

## Strategy 3: Python with sarif-tools
For aggregation, reporting, and CI/CD integration:
```python
from sarif import loader

# Load single file
sarif_data = loader.load_sarif_file("results.sarif")

# Or load multiple files
sarif_set = loader.load_sarif_files(["tool1.sarif", "tool2.sarif"])

# Get summary report
report = sarif_data.get_report()

# Get histogram by severity
errors = report.get_issue_type_histogram_for_severity("error")
warnings = report.get_issue_type_histogram_for_severity("warning")

# Filter results
high_severity = [r for r in sarif_data.get_results()
                 if r.get("level") == "error"]
```

**sarif-tools CLI commands:**

```bash
# Summary of findings
sarif summary results.sarif

# List all results with details
sarif ls results.sarif

# Get results by severity
sarif ls --level error results.sarif

# Diff two SARIF files (find new/fixed issues)
sarif diff baseline.sarif current.sarif

# Convert to other formats
sarif csv results.sarif > results.csv
sarif html results.sarif > report.html
```

## Strategy 4: Aggregating Multiple SARIF Files
When combining results from multiple tools:
```python
import json

def aggregate_sarif_files(sarif_paths: list[str]) -> dict:
    """Combine multiple SARIF files into one."""
    aggregated = {
        "version": "2.1.0",
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "runs": []
    }
    for path in sarif_paths:
        with open(path) as f:
            sarif = json.load(f)
        aggregated["runs"].extend(sarif.get("runs", []))
    return aggregated

def deduplicate_results(sarif: dict) -> dict:
    """Remove duplicate findings based on fingerprints."""
    seen_fingerprints = set()
    for run in sarif["runs"]:
        unique_results = []
        for result in run.get("results", []):
            # Use partialFingerprints or create a key from the location
            if result.get("partialFingerprints"):
                fp = tuple(sorted(result["partialFingerprints"].items()))
            elif result.get("fingerprints"):
                fp = tuple(sorted(result["fingerprints"].items()))
            else:
                # Fallback: create fingerprint from rule + location
                loc = (result.get("locations") or [{}])[0]
                phys = loc.get("physicalLocation", {})
                fp = (
                    result.get("ruleId"),
                    phys.get("artifactLocation", {}).get("uri"),
                    phys.get("region", {}).get("startLine")
                )
            if fp not in seen_fingerprints:
                seen_fingerprints.add(fp)
                unique_results.append(result)
        run["results"] = unique_results
    return sarif
```
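To see the fallback branch in isolation, here is a self-contained version of the same location key (re-declared so the snippet runs on its own; the inputs are illustrative):

```python
def result_key(result: dict) -> tuple:
    """Mirror of the fallback key above: rule + file + line."""
    loc = (result.get("locations") or [{}])[0]
    phys = loc.get("physicalLocation", {})
    return (result.get("ruleId"),
            phys.get("artifactLocation", {}).get("uri"),
            phys.get("region", {}).get("startLine"))

results = [
    {"ruleId": "a", "locations": [{"physicalLocation": {
        "artifactLocation": {"uri": "x.py"}, "region": {"startLine": 1}}}]},
    {"ruleId": "a", "locations": [{"physicalLocation": {
        "artifactLocation": {"uri": "x.py"}, "region": {"startLine": 1}}}]},  # duplicate
    {"ruleId": "b", "locations": [{"physicalLocation": {
        "artifactLocation": {"uri": "x.py"}, "region": {"startLine": 2}}}]},
]

seen, unique = set(), []
for r in results:
    k = result_key(r)
    if k not in seen:
        seen.add(k)
        unique.append(r)
print(len(unique))  # the exact duplicate collapses into one entry
```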
## Strategy 5: Extracting Actionable Data
```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    rule_id: str
    level: str
    message: str
    file_path: Optional[str]
    start_line: Optional[int]
    end_line: Optional[int]
    fingerprint: Optional[str]

def extract_findings(sarif_path: str) -> list[Finding]:
    """Extract structured findings from SARIF file."""
    with open(sarif_path) as f:
        sarif = json.load(f)
    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            loc = (result.get("locations") or [{}])[0]
            phys = loc.get("physicalLocation", {})
            region = phys.get("region", {})
            findings.append(Finding(
                rule_id=result.get("ruleId", "unknown"),
                level=result.get("level", "warning"),
                message=result.get("message", {}).get("text", ""),
                file_path=phys.get("artifactLocation", {}).get("uri"),
                start_line=region.get("startLine"),
                end_line=region.get("endLine"),
                fingerprint=next(iter(result.get("partialFingerprints", {}).values()), None)
            ))
    return findings

# Filter and prioritize
def prioritize_findings(findings: list[Finding]) -> list[Finding]:
    """Sort findings by severity."""
    severity_order = {"error": 0, "warning": 1, "note": 2, "none": 3}
    return sorted(findings, key=lambda f: severity_order.get(f.level, 99))
```

## Common Pitfalls and Solutions
### 1. Path Normalization Issues
Different tools report paths differently (absolute, relative, URI-encoded):
```python
from urllib.parse import unquote
from pathlib import Path

def normalize_path(uri: str, base_path: str = "") -> str:
    """Normalize SARIF artifact URI to consistent path."""
    # Remove file:// prefix if present
    if uri.startswith("file://"):
        uri = uri[7:]
    # URL decode
    uri = unquote(uri)
    # Handle relative paths
    if not Path(uri).is_absolute() and base_path:
        uri = str(Path(base_path) / uri)
    # Normalize separators
    return str(Path(uri))
```

### 2. Fingerprint Mismatch Across Runs
Fingerprints may not match if:
- File paths differ between environments
- Tool versions changed fingerprinting algorithm
- Code was reformatted (changing line numbers)
**Solution:** Use multiple fingerprint strategies:
```python
import hashlib

def compute_stable_fingerprint(result: dict, file_content: str | None = None) -> str:
    """Compute environment-independent fingerprint."""
    components = [
        result.get("ruleId", ""),
        result.get("message", {}).get("text", "")[:100],  # First 100 chars
    ]
    # Add code snippet if available
    if file_content and result.get("locations"):
        region = result["locations"][0].get("physicalLocation", {}).get("region", {})
        if region.get("startLine"):
            lines = file_content.split("\n")
            line_idx = region["startLine"] - 1
            if 0 <= line_idx < len(lines):
                # Normalize whitespace
                components.append(lines[line_idx].strip())
    return hashlib.sha256("".join(components).encode()).hexdigest()[:16]
```

### 3. Missing or Incomplete Data
SARIF allows many optional fields. Always use defensive access:
```python
def safe_get_location(result: dict) -> tuple[str, int]:
    """Safely extract file and line from result."""
    try:
        loc = result.get("locations", [{}])[0]
        phys = loc.get("physicalLocation", {})
        file_path = phys.get("artifactLocation", {}).get("uri", "unknown")
        line = phys.get("region", {}).get("startLine", 0)
        return file_path, line
    except (IndexError, KeyError, TypeError):
        return "unknown", 0
```

### 4. Large File Performance
For very large SARIF files (100MB+):
```python
import ijson  # pip install ijson

def stream_results(sarif_path: str):
    """Stream results without loading entire file."""
    with open(sarif_path, "rb") as f:
        # Stream through results arrays
        for result in ijson.items(f, "runs.item.results.item"):
            yield result
```

### 5. Schema Validation
Validate before processing to catch malformed files:
```bash
# Using ajv-cli
npm install -g ajv-cli
ajv validate -s sarif-schema-2.1.0.json -d results.sarif

# Using Python jsonschema
pip install jsonschema
```

```python
from jsonschema import validate, ValidationError
import json

def validate_sarif(sarif_path: str, schema_path: str) -> bool:
    """Validate SARIF file against schema."""
    with open(sarif_path) as f:
        sarif = json.load(f)
    with open(schema_path) as f:
        schema = json.load(f)
    try:
        validate(sarif, schema)
        return True
    except ValidationError as e:
        print(f"Validation error: {e.message}")
        return False
```

## CI/CD Integration Patterns
### GitHub Actions
```yaml
- name: Upload SARIF
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: results.sarif

- name: Check for high severity
  run: |
    HIGH_COUNT=$(jq '[.runs[].results[] | select(.level == "error")] | length' results.sarif)
    if [ "$HIGH_COUNT" -gt 0 ]; then
      echo "Found $HIGH_COUNT high severity issues"
      exit 1
    fi
```

### Fail on New Issues
```python
from sarif import loader

def check_for_regressions(baseline: str, current: str) -> int:
    """Return count of new issues not in baseline."""
    baseline_data = loader.load_sarif_file(baseline)
    current_data = loader.load_sarif_file(current)
    baseline_fps = {get_fingerprint(r) for r in baseline_data.get_results()}
    new_issues = [r for r in current_data.get_results()
                  if get_fingerprint(r) not in baseline_fps]
    return len(new_issues)
```
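`check_for_regressions` calls a `get_fingerprint` helper that is not defined here; a minimal sketch (an assumption, not a sarif-tools API) that prefers tool-supplied fingerprints and falls back to rule + location:

```python
def get_fingerprint(result: dict) -> tuple:
    """Hypothetical helper: stable identity for a result dictionary."""
    fps = result.get("partialFingerprints") or result.get("fingerprints")
    if fps:
        return tuple(sorted(fps.items()))
    phys = (result.get("locations") or [{}])[0].get("physicalLocation", {})
    return (result.get("ruleId"),
            phys.get("artifactLocation", {}).get("uri"),
            phys.get("region", {}).get("startLine"))
```

Wire it into CI by failing the job whenever the returned regression count is nonzero.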
## Key Principles
- Validate first: Check SARIF structure before processing
- Handle optionals: Many fields are optional; use defensive access
- Normalize paths: Tools report paths differently; normalize early
- Fingerprint wisely: Combine multiple strategies for stable deduplication
- Stream large files: Use ijson or similar for 100MB+ files
- Aggregate thoughtfully: Preserve tool metadata when combining files
## Skill Resources
For ready-to-use query templates, see `{baseDir}/resources/jq-queries.md`:
- 40+ jq queries for common SARIF operations
- Severity filtering, rule extraction, aggregation patterns
For Python utilities, see `{baseDir}/resources/sarif_helpers.py`:
- `normalize_path()` - Handle tool-specific path formats
- `compute_fingerprint()` - Stable fingerprinting ignoring paths
- `deduplicate_results()` - Remove duplicates across runs