semgrep

Semgrep Static Analysis

When to Use Semgrep

Ideal scenarios:
  • Quick security scans (minutes, not hours)
  • Pattern-based bug detection
  • Enforcing coding standards and best practices
  • Finding known vulnerability patterns
  • Single-file analysis without complex data flow
  • First-pass analysis before deeper tools
Consider CodeQL instead when:
  • Need interprocedural taint tracking across files
  • Complex data flow analysis required
  • Analyzing custom proprietary frameworks

When NOT to Use

Do NOT use this skill for:
  • Complex interprocedural data flow analysis (use CodeQL instead)
  • Binary analysis or compiled code without source
  • Custom deep semantic analysis requiring AST/CFG traversal
  • When you need to track taint across many function boundaries

Installation

bash
# pip
python3 -m pip install semgrep

# Homebrew
brew install semgrep

# Docker
docker run --rm -v "${PWD}:/src" returntocorp/semgrep semgrep --config auto /src

# Update
pip install --upgrade semgrep

Core Workflow

1. Quick Scan

bash
semgrep --config auto .                    # Auto-detect rules
semgrep --config p/default --metrics=off . # Disable telemetry for proprietary code (auto config needs metrics on, so name a ruleset)

2. Use Rulesets

bash
semgrep --config p/<RULESET> .             # Single ruleset
semgrep --config p/security-audit --config p/trailofbits .  # Multiple

| Ruleset | Description |
|---|---|
| `p/default` | General security and code quality |
| `p/security-audit` | Comprehensive security rules |
| `p/owasp-top-ten` | OWASP Top 10 vulnerabilities |
| `p/cwe-top-25` | CWE Top 25 vulnerabilities |
| `p/r2c-security-audit` | r2c security audit rules |
| `p/trailofbits` | Trail of Bits security rules |
| `p/python` | Python-specific |
| `p/javascript` | JavaScript-specific |
| `p/golang` | Go-specific |
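
Because each ruleset gets its own `--config` flag, a wrapper script can build the command line from a list of ruleset names. The sketch below assumes a hypothetical helper, `build_semgrep_command`, which is not part of Semgrep. It uses only flags that appear in this document (`--config`, `--metrics=off`, `--json`, `-o`) and does not run the scan itself.

```python
from typing import List, Optional


def build_semgrep_command(
    rulesets: List[str],
    target: str = ".",
    json_output: Optional[str] = None,
    metrics_off: bool = True,
) -> List[str]:
    """Build an argv list for a Semgrep scan (hypothetical helper)."""
    if not rulesets:
        raise ValueError("at least one ruleset is required")
    cmd = ["semgrep"]
    for ruleset in rulesets:
        # One --config flag per ruleset, as in the multi-ruleset example.
        cmd += ["--config", ruleset]
    if metrics_off:
        cmd.append("--metrics=off")
    if json_output:
        cmd += ["--json", "-o", json_output]
    cmd.append(target)
    return cmd


print(build_semgrep_command(["p/security-audit", "p/trailofbits"]))
```

The returned list can be passed to `subprocess.run(...)` as-is, which avoids shell quoting problems with unusual paths.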

3. Output Formats

bash
semgrep --config p/security-audit --sarif -o results.sarif .   # SARIF
semgrep --config p/security-audit --json -o results.json .     # JSON
semgrep --config p/security-audit --dataflow-traces .          # Show data flow
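
The JSON output is convenient for triage scripts. The sketch below counts findings per severity. It assumes the report has a top-level `results` list whose entries carry `check_id`, `path`, `start.line`, and `extra.severity`, so check that shape against a real `results.json` before relying on it. The sample data is made up.

```python
import json
from collections import Counter
from typing import Dict


def summarize_findings(report: dict) -> Dict[str, int]:
    """Count findings per severity in a parsed Semgrep JSON report."""
    counts: Counter = Counter()
    for result in report.get("results", []):
        # Fall back to UNKNOWN if the assumed field is missing.
        severity = result.get("extra", {}).get("severity", "UNKNOWN")
        counts[severity] += 1
    return dict(counts)


def load_report(path: str) -> dict:
    """Read a report written with `--json -o <path>`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Trimmed, made-up sample in the assumed report shape.
sample = {
    "results": [
        {"check_id": "sql-injection", "path": "app.py",
         "start": {"line": 12}, "extra": {"severity": "ERROR"}},
        {"check_id": "weak-hash", "path": "util.py",
         "start": {"line": 40}, "extra": {"severity": "WARNING"}},
        {"check_id": "command-injection", "path": "run.py",
         "start": {"line": 7}, "extra": {"severity": "ERROR"}},
    ]
}
print(summarize_findings(sample))
```

In a real pipeline, `summarize_findings(load_report("results.json"))` would replace the inline sample.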

4. Scan Specific Paths

bash
semgrep --config p/python app.py           # Single file
semgrep --config p/javascript src/         # Directory
semgrep --config auto --include='**/test/**' .  # Include tests (excluded by default)

Writing Custom Rules

Basic Structure

yaml
rules:
  - id: hardcoded-password
    languages: [python]
    message: "Hardcoded password detected: $PASSWORD"
    severity: ERROR
    pattern: password = "$PASSWORD"

Pattern Syntax

| Syntax | Description | Example |
|---|---|---|
| `...` | Match anything | `func(...)` |
| `$VAR` | Capture metavariable | `$FUNC($INPUT)` |
| `<... ...>` | Deep expression match | `<... user_input ...>` |

Pattern Operators

| Operator | Description |
|---|---|
| `pattern` | Match exact pattern |
| `patterns` | All must match (AND) |
| `pattern-either` | Any matches (OR) |
| `pattern-not` | Exclude matches |
| `pattern-inside` | Match only inside context |
| `pattern-not-inside` | Match only outside context |
| `pattern-regex` | Regex matching |
| `metavariable-regex` | Regex on captured value |
| `metavariable-comparison` | Compare values |

Combining Patterns

yaml
rules:
  - id: sql-injection
    languages: [python]
    message: "Potential SQL injection"
    severity: ERROR
    patterns:
      - pattern-either:
          - pattern: cursor.execute($QUERY)
          - pattern: db.execute($QUERY)
      - pattern-not: cursor.execute("...", (...))
      - metavariable-regex:
          metavariable: $QUERY
          regex: .*\+.*|.*\.format\(.*|.*%.*
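
It helps to try the `metavariable-regex` expression on a few candidate `$QUERY` texts before trusting the rule. The sketch below uses Python's `re.match` as a stand-in. This assumes Semgrep applies the regex to the source text bound to the metavariable and anchors it on the left, so treat the results as an approximation of what the rule reports. The helper name `looks_dynamic` is hypothetical.

```python
import re

# Same expression as the metavariable-regex clause in the rule above.
QUERY_REGEX = re.compile(r".*\+.*|.*\.format\(.*|.*%.*")


def looks_dynamic(query_text: str) -> bool:
    """True if the text bound to $QUERY would satisfy the regex."""
    # re.match anchors at the start only (assumed to mirror Semgrep).
    return QUERY_REGEX.match(query_text) is not None


candidates = [
    '"SELECT * FROM users WHERE id = " + user_id',          # concatenation
    '"SELECT * FROM users WHERE id = {}".format(user_id)',  # str.format
    '"SELECT * FROM users WHERE id = %s" % user_id',        # % formatting
    'f"SELECT * FROM users WHERE id = {user_id}"',          # f-string
    "query",                                                # built elsewhere
    "\"SELECT * FROM users WHERE name LIKE 'a%'\"",         # static literal
]
for text in candidates:
    print(looks_dynamic(text), text)
```

Under these assumptions the regex has two blind spots. The f-string and the pre-built `query` variable do not match, so they are false negatives. The static `LIKE 'a%'` literal does match, so it is a false positive.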

Taint Mode (Data Flow)

Simple pattern matching finds obvious cases:
python
# Pattern os.system($CMD) catches this:
os.system(user_input)  # Found

But misses indirect flows:
python
# Same pattern misses this:
cmd = user_input
processed = cmd.strip()
os.system(processed)  # Missed - no direct match

Taint mode tracks data through assignments and transformations:
  • Source: Where untrusted data enters (user_input)
  • Propagators: How it flows (cmd = ..., processed = ...)
  • Sanitizers: What makes it safe (shlex.quote())
  • Sink: Where it becomes dangerous (os.system())
yaml
rules:
  - id: command-injection
    languages: [python]
    message: "User input flows to command execution"
    severity: ERROR
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json
    pattern-sinks:
      - pattern: os.system($SINK)
      - pattern: subprocess.call($SINK, shell=True)
      - pattern: subprocess.run($SINK, shell=True, ...)
    pattern-sanitizers:
      - pattern: shlex.quote(...)
      - pattern: int(...)
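
The source, propagator, sanitizer, and sink roles can be illustrated with a toy tracker built on Python's `ast` module. This is a teaching sketch and not how Semgrep is implemented. It handles only top-level straight-line statements, treats sources as variable names rather than patterns, and needs Python 3.9+ for `ast.unparse`. The function name `find_tainted_sinks` is hypothetical.

```python
import ast
from typing import List, Set


def find_tainted_sinks(
    code: str, sources: Set[str], sinks: Set[str], sanitizers: Set[str]
) -> List[int]:
    """Return line numbers of sink calls that receive tainted arguments."""
    tainted = set(sources)
    hits: List[int] = []

    def is_tainted(node: ast.AST) -> bool:
        # Sanitizer: a call to a sanitizer cleans whatever flows through it.
        if isinstance(node, ast.Call) and ast.unparse(node.func) in sanitizers:
            return False
        if isinstance(node, ast.Name):
            return node.id in tainted
        # Propagation: an expression is tainted if any sub-expression is.
        return any(is_tainted(child) for child in ast.iter_child_nodes(node))

    for stmt in ast.parse(code).body:
        # Sink check: does a tainted value reach a dangerous call?
        for node in ast.walk(stmt):
            if isinstance(node, ast.Call) and ast.unparse(node.func) in sinks:
                if any(is_tainted(arg) for arg in node.args):
                    hits.append(node.lineno)
        # Assignment: the target takes on the taint state of the value.
        if isinstance(stmt, ast.Assign):
            value_tainted = is_tainted(stmt.value)
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    if value_tainted:
                        tainted.add(target.id)
                    else:
                        tainted.discard(target.id)
    return hits


example = """\
cmd = user_input
processed = cmd.strip()
os.system(processed)
quoted = shlex.quote(cmd)
os.system(quoted)
"""
# Line 3 is flagged; line 5 is not, because shlex.quote sanitizes the value.
print(find_tainted_sinks(example, {"user_input"}, {"os.system"}, {"shlex.quote"}))  # prints [3]
```

The flagged call on line 3 is the indirect flow that the plain `os.system($CMD)` pattern missed in the example above.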

Full Rule with Metadata

yaml
rules:
  - id: flask-sql-injection
    languages: [python]
    message: "SQL injection: user input flows to query without parameterization"
    severity: ERROR
    metadata:
      cwe: "CWE-89: SQL Injection"
      owasp: "A03:2021 - Injection"
      confidence: HIGH
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json
    pattern-sinks:
      - pattern: cursor.execute($QUERY)
      - pattern: db.execute($QUERY)
    pattern-sanitizers:
      - pattern: int(...)
    fix: cursor.execute($QUERY, (params,))
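
The reason the rule pushes toward parameterized queries can be shown with the standard-library `sqlite3` module. The sketch below stands in for a Flask handler. The `users` table, the two helper functions, and the payload are all made up for illustration.

```python
import sqlite3


def find_ids_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input is concatenated into the SQL text.
    query = "SELECT id FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_ids_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized: the driver passes `name` as data, never as SQL.
    return conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

payload = "x' OR '1'='1"
print(find_ids_unsafe(conn, payload))  # the injected OR clause matches every row
print(find_ids_safe(conn, payload))    # no user has that literal name
```

The parameterized call leaves nothing in the SQL text for user input to rewrite.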

Testing Rules

Test File Format

python
# test_rule.py
def test_vulnerable():
    user_input = request.args.get("id")
    # ruleid: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = " + user_input)

def test_safe():
    user_input = request.args.get("id")
    # ok: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_input,))

bash
semgrep --test rules/
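
The `# ruleid:` and `# ok:` comments state what a rule should and should not report. The sketch below extracts those expectations from a test file, assuming each annotation applies to the line directly below it. It is a rough illustration of the bookkeeping and does not reproduce `semgrep --test`. The helper name `expected_lines` is hypothetical.

```python
import re
from typing import Dict, List

# Matches "# ruleid: <id>" and "# ok: <id>" comments.
ANNOTATION = re.compile(r"#\s*(ruleid|ok):\s*([\w.-]+)")


def expected_lines(source: str) -> Dict[str, Dict[str, List[int]]]:
    """Map rule ID -> {"ruleid": [...], "ok": [...]} target line numbers."""
    expected: Dict[str, Dict[str, List[int]]] = {}
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = ANNOTATION.search(line)
        if match:
            kind, rule_id = match.groups()
            entry = expected.setdefault(rule_id, {"ruleid": [], "ok": []})
            # Assumption: the annotation applies to the next line.
            entry[kind].append(lineno + 1)
    return expected


test_file = '''\
def test_vulnerable():
    user_input = request.args.get("id")
    # ruleid: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = " + user_input)

def test_safe():
    user_input = request.args.get("id")
    # ok: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_input,))
'''
print(expected_lines(test_file))
```

For a rule to pass, it has to report every `ruleid` line and none of the `ok` lines.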

CI/CD Integration (GitHub Actions)

yaml
name: Semgrep

on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: '0 0 1 * *'  # Monthly

jobs:
  semgrep:
    runs-on: ubuntu-latest
    container:
      image: returntocorp/semgrep

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Required for diff-aware scanning

      - name: Run Semgrep
        run: |
          if [ "${{ github.event_name }}" = "pull_request" ]; then
            semgrep ci --baseline-commit ${{ github.event.pull_request.base.sha }}
          else
            semgrep ci
          fi
        env:
          SEMGREP_RULES: >-
            p/security-audit
            p/owasp-top-ten
            p/trailofbits

Configuration

.semgrepignore

tests/fixtures/
**/testdata/
generated/
vendor/
node_modules/

Suppress False Positives

python
password = get_from_vault()  # nosemgrep: hardcoded-password
dangerous_but_safe()  # nosemgrep
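
Suppressions accumulate quietly, so it is worth listing them from time to time. The sketch below scans Python source for `# nosemgrep` comments and reports whether each one names a rule. The helper name `list_suppressions` is hypothetical, and only `#`-style comments are handled.

```python
import re
from typing import List, Optional, Tuple

# "# nosemgrep" optionally followed by ": rule-id[, rule-id...]".
NOSEMGREP = re.compile(r"#\s*nosemgrep(?::\s*([\w.,\s-]+))?")


def list_suppressions(source: str) -> List[Tuple[int, Optional[str]]]:
    """Return (line number, rule IDs or None) for each nosemgrep comment."""
    found: List[Tuple[int, Optional[str]]] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = NOSEMGREP.search(line)
        if match:
            rule_ids = match.group(1)
            # None marks a blanket suppression with no rule ID.
            found.append((lineno, rule_ids.strip() if rule_ids else None))
    return found


snippet = (
    "password = get_from_vault()  # nosemgrep: hardcoded-password\n"
    "dangerous_but_safe()  # nosemgrep\n"
)
for lineno, rule_ids in list_suppressions(snippet):
    print(lineno, rule_ids or "ALL RULES (blanket suppression)")
```

Blanket suppressions deserve the closest review because they silence every rule on that line, including rules added later.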

Performance

bash
semgrep --config rules/ --time .    # Check rule performance
ulimit -n 4096                       # Increase file descriptors for large codebases

Path Filtering in Rules

yaml
rules:
  - id: my-rule
    paths:
      include: [src/]
      exclude: [src/generated/]

Third-Party Rules

bash
pip install semgrep-rules-manager
semgrep-rules-manager --dir ~/semgrep-rules download
semgrep -f ~/semgrep-rules .

Rationalizations to Reject

| Shortcut | Why It's Wrong |
|---|---|
| "Semgrep found nothing, code is clean" | Semgrep is pattern-based; it can't track complex data flow across functions |
| "I wrote a rule, so we're covered" | Rules need testing with `semgrep --test`; false negatives are silent |
| "Taint mode catches injection" | Only if you defined all sources, sinks, AND sanitizers correctly |
| "Pro rules are comprehensive" | Pro rules are good but not exhaustive; supplement with custom rules for your codebase |
| "Too many findings = noisy tool" | High finding count often means real problems; tune rules, don't disable them |

Resources
