add-lang
Add tree-sitter language support to codegraph end-to-end — wire the grammar + extractor, write tests, then benchmark extraction quality and retrieval value on 3 popular real-world repos. Use when the user runs /add-lang <language> or asks to add/support a new language (e.g. Lua, Elixir, Zig, OCaml) in codegraph.
Source: colbymchenry/codegraph
Add a language to CodeGraph
Wire a new tree-sitter language into codegraph's extraction pipeline, prove it
extracts real symbols on popular repos, and prove it beats no-codegraph for an
agent. Runs fully autonomously — pick repos, benchmark, update docs, then
report. Never commit, push, publish, or tag (house rule); leave all changes
for the user to review.
The argument is the language token used throughout the `Language` union, e.g.
`lua`, `elixir`, `zig`. If none was given, ask which language. Use the lowercase
single-token form everywhere (`csharp`, not `c#`).
Prerequisites
- Run from the codegraph repo root. `node`, `git`, `gh`, and a logged-in `claude` CLI (the benchmark spawns real `claude -p` runs).
- The benchmark uses the local dev build; Step 8 builds + links it on PATH.
Workflow
Copy this checklist and work through it in order:
- [ ] 1. Resolve language; bail early if already supported (just benchmark)
- [ ] 2. Find a grammar + health-check it (ABI / heap corruption)
- [ ] 3. Discover the grammar's AST node types (dump-ast.mjs)
- [ ] 4. Wire the language (4 files; sometimes a 5th core touch)
- [ ] 5. Build + verify-extraction loop until PASS
- [ ] 6. Add extraction tests; make them green
- [ ] 7. Auto-pick 3 popular repos by size tier; add to corpus.json
- [ ] 8. Benchmark all 3: extraction + with/without A/B
- [ ] 9. Update README + CHANGELOG
- [ ] 10. Report; do NOT commit
Step 1 — Resolve + short-circuit
Check whether the language is already wired: look for the token in the
`LANGUAGES` const (`src/types.ts`) and the `EXTRACTORS` map
(`src/extraction/languages/index.ts`). If it is already supported (e.g.
`typescript`, `rust`), skip Steps 2–6 and go straight to benchmarking
(Steps 7–8) to validate/measure it; note in the report that no code changed.
Step 2 — Find a grammar, then health-check it
bash
ls node_modules/tree-sitter-wasms/out/ | grep -i <lang> # csharp -> c_sharp
- Present → likely off-the-shelf; `grammars.ts` resolves it from `tree-sitter-wasms` automatically. (Many languages: elixir, zig, ocaml, solidity, toml, yaml, …)
- Absent → vendor a `.wasm` into `src/extraction/wasm/` (like `pascal`/`scala`/`lua`) and add the token to the vendored branch in Step 4.
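The Present/Absent decision can be sketched as a small lookup. This is a minimal illustration, not codegraph's own code: `resolveGrammarSource`, `WASM_NAME_ALIASES`, and the `GrammarSource` type are hypothetical names, and the sketch assumes the files in `node_modules/tree-sitter-wasms/out/` follow the `tree-sitter-<name>.wasm` pattern.

```typescript
// Hypothetical result type: use the packaged grammar, or vendor a .wasm.
type GrammarSource =
  | { kind: "off-the-shelf"; file: string }
  | { kind: "vendor" };

// Tokens whose wasm file name differs from the token (csharp -> c_sharp).
const WASM_NAME_ALIASES: Record<string, string> = { csharp: "c_sharp" };

// outDirListing: file names from node_modules/tree-sitter-wasms/out/
// (assumed to follow the tree-sitter-<name>.wasm pattern).
function resolveGrammarSource(lang: string, outDirListing: string[]): GrammarSource {
  const wasmName = WASM_NAME_ALIASES[lang] ?? lang;
  const file = `tree-sitter-${wasmName}.wasm`;
  return outDirListing.includes(file)
    ? { kind: "off-the-shelf", file }
    : { kind: "vendor" };
}

// Example listing (made up for illustration).
const listing = ["tree-sitter-c_sharp.wasm", "tree-sitter-zig.wasm"];
const csharpSource = resolveGrammarSource("csharp", listing); // off-the-shelf
const pascalSource = resolveGrammarSource("pascal", listing); // vendor
```

Even when the lookup says "off-the-shelf", the grammar still has to pass the health check before it is trusted.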
Always health-check before writing an extractor — a present grammar can
still be unusable:
bash
node scripts/add-lang/check-grammar.mjs <lang> path/to/valid-sample.<ext>
It prints the grammar's ABI version and parses a valid sample many times in a
multi-grammar runtime. If it FAILs (ERROR trees on valid code — an old ABI
corrupting the shared WASM heap, which silently drops nested calls/imports on
every file after the first; e.g. the tree-sitter-wasms Lua grammar is ABI 13
and fails), do NOT use that wasm. Vendor a newer (ABI 14/15) build instead:
bash
npm pack @tree-sitter-grammars/tree-sitter-<lang> # often ships a prebuilt *.wasm
# or build one: npx tree-sitter build --wasm (needs Docker/emscripten)
cp <the>.wasm src/extraction/wasm/tree-sitter-<lang>.wasm
then add the token to the vendored branch in Step 4 and re-run check-grammar on
the vendored path until it PASSes. If you cannot obtain a healthy wasm, STOP
and tell the user.
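The idea behind the health check (parse known-valid code repeatedly and fail on the first ERROR tree) can be sketched as follows. This is not the implementation of `check-grammar.mjs`: `ParseFn`, `ParseOutcome`, and `healthCheck` are hypothetical names, and the real script also loads several grammars into one runtime, which this sketch does not model.

```typescript
// Hypothetical minimal view of a parse: did the tree contain ERROR nodes?
interface ParseOutcome {
  hasError: boolean;
}
type ParseFn = (source: string) => ParseOutcome;

interface HealthReport {
  pass: boolean;
  firstFailure: number | null; // zero-based index of the first failing run
  runs: number;
}

// Parse a known-valid sample repeatedly; any ERROR tree means FAIL.
function healthCheck(parse: ParseFn, validSample: string, runs = 50): HealthReport {
  for (let i = 0; i < runs; i++) {
    if (parse(validSample).hasError) {
      return { pass: false, firstFailure: i, runs };
    }
  }
  return { pass: true, firstFailure: null, runs };
}

// Simulated grammar that parses cleanly once and then breaks on every
// later parse, mimicking the "fine on the first file only" symptom.
let calls = 0;
const flakyParse: ParseFn = () => ({ hasError: calls++ > 0 });
const flakyReport = healthCheck(flakyParse, "local x = 1", 10);

const healthyReport = healthCheck(() => ({ hasError: false }), "local x = 1", 10);
```

Repeating the parse matters because a single clean parse says nothing about the files that come after the first one.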
Step 3 — Discover AST node types
Get a representative source file (write a small sample covering functions,
classes/structs, imports, enums; or `curl` a raw file from a known repo), then:
bash
node scripts/add-lang/dump-ast.mjs <lang> path/to/sample.<ext>
# vendored grammar: pass the wasm path instead of the token
node scripts/add-lang/dump-ast.mjs src/extraction/wasm/tree-sitter-<lang>.wasm sample.<ext>
The frequency table + field names (`name:`, `parameters:`, `body:`,
`return_type:`) tell you what to map. Open the existing extractor closest to the
language's paradigm as a model: `rust.ts`/`scala.ts` (functional, traits),
`java.ts`/`csharp.ts` (OO), `python.ts`/`ruby.ts` (scripting),
`go.ts` (top-level methods + receivers).
Step 4 — Wire the language (4 files)
This wiring is exact and fragile; match the existing style precisely:
- `src/types.ts` (TWO edits):
  - add `'<lang>',` to the `LANGUAGES` const (before `'unknown'`);
  - add `'**/*.<ext>',` to `DEFAULT_CONFIG.include`. Don't skip this: it's the file-scan allowlist; without the glob, `codegraph init` finds 0 files even though detection/extraction are wired.
- `src/extraction/grammars.ts` (three maps):
  - `WASM_GRAMMAR_FILES`: `<lang>: 'tree-sitter-<lang>.wasm',`
  - `EXTENSION_MAP`: each file extension → `'<lang>'` (e.g. `'.lua': 'lua',`)
  - `getLanguageDisplayName`: `<lang>: '<Display Name>',`
  - vendored only: add `<lang>` to the `(lang === 'pascal' || lang === 'scala' || …)` wasm-path branch.
- `src/extraction/languages/<lang>.ts`: new file exporting `export const <lang>Extractor: LanguageExtractor = { … }`. Map the node types from Step 3. Required fields: `functionTypes`, `classTypes`, `methodTypes`, `interfaceTypes`, `structTypes`, `enumTypes`, `typeAliasTypes`, `importTypes`, `callTypes`, `variableTypes`, `nameField`, `bodyField`, `paramsField`. Add hooks as the grammar needs them (`getSignature`, `getVisibility`, `isExported`, `extractImport`, `visitNode`, `getReceiverType`, `interfaceKind`, `enumMemberTypes`, etc.; see `src/extraction/tree-sitter-types.ts`).
- `src/extraction/languages/index.ts`: `import { <lang>Extractor } from './<lang>';` and add `<lang>: <lang>Extractor,` to `EXTRACTORS`.
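As a rough picture of what the new extractor file contains, here is a minimal sketch for Lua. It rests on assumptions: the `LanguageExtractor` interface below is a simplified stand-in (the real one lives in `src/extraction/tree-sitter-types.ts` and may type these fields differently), the node type names are illustrative and should come from your own `dump-ast.mjs` output, and `unmappedKinds` is a hypothetical helper.

```typescript
// Assumed, simplified shape of the extractor config (stand-in, not the real type).
interface LanguageExtractor {
  functionTypes: string[];
  classTypes: string[];
  methodTypes: string[];
  interfaceTypes: string[];
  structTypes: string[];
  enumTypes: string[];
  typeAliasTypes: string[];
  importTypes: string[];
  callTypes: string[];
  variableTypes: string[];
  nameField: string;
  bodyField: string;
  paramsField: string;
}

// Illustrative values only; the real file would export this constant.
const luaExtractor: LanguageExtractor = {
  functionTypes: ["function_declaration"],
  classTypes: [],
  methodTypes: [],
  interfaceTypes: [],
  structTypes: [],
  enumTypes: [],
  typeAliasTypes: [],
  importTypes: [], // a Lua `require` is a call, so there is no import node
  callTypes: ["function_call"],
  variableTypes: ["variable_declaration"],
  nameField: "name",
  bodyField: "body",
  paramsField: "parameters",
};

const KIND_FIELDS = [
  "functionTypes", "classTypes", "methodTypes", "interfaceTypes", "structTypes",
  "enumTypes", "typeAliasTypes", "importTypes", "callTypes", "variableTypes",
] as const;

// Hypothetical helper: list the node-kind arrays that are still empty,
// as a checklist of what has not been mapped yet.
function unmappedKinds(extractor: LanguageExtractor): string[] {
  return KIND_FIELDS.filter((field) => extractor[field].length === 0);
}

const stillEmpty = unmappedKinds(luaExtractor);
```

An empty array is fine when the language has no such construct; the helper is only a reminder to make that choice deliberately.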
Sometimes a 5th, core touch in `src/extraction/tree-sitter.ts`: variable
extraction has per-language branches in `extractVariable` (the generic fallback
only finds direct `identifier`/`variable_declarator` children). If the grammar
nests declared names (e.g. Lua's `variable_declaration → variable_list`), add a
`} else if (this.language === '<lang>')` branch there, mirroring the existing
ts/python/go ones. Import forms that aren't a distinct node (Lua/Ruby
`require` is a call) are handled in the extractor's `visitNode` hook instead.
Step 5 — Build + verify loop
bash
npm run build # tsc + copy-assets (copies any vendored *.wasm into dist/)
Index a small sample repo and check extraction:
bash
( cd <sample-repo> && codegraph init -i )
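node scripts/add-lang/verify-extraction.mjs <sample-repo> <lang>
Repeat until `verify-extraction.mjs` reports PASS. If it fails (for example, nothing beyond `file` and `import` nodes was extracted), go back to the `dump-ast.mjs` output, fix `<lang>.ts`, re-run `npm run build`, and re-index.
Step 6 — Tests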
Add to `__tests__/extraction.test.ts`, modeled on the `Rust Extraction` block:
- a `detectLanguage` assertion in `describe('Language Detection')`
- a `describe('<Lang> Extraction')` block asserting functions/classes/imports are extracted from an inline source string.
bash
npx vitest run __tests__/extraction.test.ts
Green before continuing.
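The detection assertion boils down to "this extension maps to this token". The sketch below shows that shape without a test framework. Everything in it is a stand-in: the real test imports codegraph's `detectLanguage` (whose signature may differ) and uses vitest's `describe`/`expect`; the two-entry `EXTENSION_MAP` and the `expectEqual` helper here are hypothetical.

```typescript
// Stand-in extension map; the real one lives in src/extraction/grammars.ts.
const EXTENSION_MAP: Record<string, string> = { ".lua": "lua", ".rs": "rust" };

// Stand-in detector: look up the file extension, fall back to 'unknown'.
function detectLanguage(filePath: string): string {
  const dot = filePath.lastIndexOf(".");
  const ext = dot === -1 ? "" : filePath.slice(dot).toLowerCase();
  return EXTENSION_MAP[ext] ?? "unknown";
}

// Tiny assertion helper so the sketch runs without a test framework.
function expectEqual<T>(actual: T, expected: T, label: string): void {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${String(expected)}, got ${String(actual)}`);
  }
}

expectEqual(detectLanguage("src/init.lua"), "lua", "detects .lua");
expectEqual(detectLanguage("src/main.rs"), "rust", "detects .rs");
expectEqual(detectLanguage("README"), "unknown", "falls back to unknown");
```

If the detection assertion fails in the real suite, the `EXTENSION_MAP` entry from Step 4 is the first thing to check.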
Step 7 — Auto-pick 3 repos + corpus
Pick without asking. Find candidates, then curate 3 that are genuinely
`<lang>`-dominant, one per size tier:
bash
gh search repos --language=<lang> --sort=stars --limit 40 \
--json fullName,stargazersCount,description
Tiers (match `corpus.json`): Small <~150 files · Medium ~150–1500 ·
Large >~1500. Skip repos that are tagged `<lang>` but mostly another
language. Write one cross-file architecture question per repo (the kind that
needs tracing across files). Add a `"<Language>"` block to
`.claude/skills/agent-eval/corpus.json` (fields: `name`, `repo`, `size`,
`files`, `question`) so `/agent-eval` can reuse them.
Step 8 — Benchmark all 3 (extraction + A/B)
Make the dev build the codegraph on PATH once, then loop:
bash
npm run build && ./scripts/local-install.sh
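scripts/add-lang/bench.sh <lang> <name> <url> "<question>" headless # ×3
`bench.sh` works on the repo's clone under `/tmp/codegraph-corpus`, checks extraction with `verify-extraction.mjs`, then runs the A/B through `scripts/agent-eval/run-all.sh` and summarizes each run with `parse-run.mjs`. `run-all.sh` compares tool calls, file `Read`s, and cost for the `with` and `without` arms. It relies on the dev build being the `codegraph` on PATH, so run `./scripts/local-install.sh` first.
Step 9 — Docs + CHANGELOG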
- README.md: add `<Lang>` to the "19+ Languages" feature bullet, and add a row to the Supported Languages table: ``| <Lang> | `.ext` | Full support (classes, methods, …) |``.
- CHANGELOG.md: add an `## [Unreleased]` section at the top (above the latest version) with `### Added` → a user-perspective bullet, e.g. "CodeGraph now indexes <Lang> (`.ext`): functions, classes, imports, and call edges." If `## [Unreleased]` already exists, append under it. (It's folded into the next versioned block at release time.)
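The CHANGELOG rule (create `## [Unreleased]` above the latest version, or append under it if it exists) can be sketched as a pure string transform. `addUnreleasedEntry` is a hypothetical helper, not part of codegraph, and it assumes versioned blocks start with `## [` and bullets start with `- `; in practice you would make this edit by hand.

```typescript
// Hypothetical helper: add a bullet under "## [Unreleased]" > "### Added".
function addUnreleasedEntry(changelog: string, bullet: string): string {
  const lines = changelog.split("\n");
  const entry = `- ${bullet}`;
  const unreleased = lines.findIndex((l) => l.trim() === "## [Unreleased]");

  if (unreleased === -1) {
    // No Unreleased section yet: create one above the latest versioned block.
    const firstVersion = lines.findIndex((l) => l.startsWith("## ["));
    const at = firstVersion === -1 ? lines.length : firstVersion;
    lines.splice(at, 0, "## [Unreleased]", "", "### Added", "", entry, "");
    return lines.join("\n");
  }

  // The section runs from its heading to the next "## " heading (or EOF).
  let end = lines.findIndex((l, i) => i > unreleased && l.startsWith("## "));
  if (end === -1) end = lines.length;

  const added = lines.findIndex(
    (l, i) => i > unreleased && i < end && l.trim() === "### Added",
  );
  if (added === -1) {
    lines.splice(unreleased + 1, 0, "", "### Added", "", entry);
    return lines.join("\n");
  }

  // Append after the bullets already listed under "### Added".
  let at = added + 1;
  while (at < end && lines[at].trim() === "") at++;
  while (at < end && lines[at].startsWith("- ")) at++;
  lines.splice(at, 0, entry);
  return lines.join("\n");
}

// Made-up changelog text for illustration.
const fresh = addUnreleasedEntry(
  "# Changelog\n\n## [1.2.0]\n\n### Added\n- Old\n",
  "CodeGraph now indexes Lua (`.lua`): functions, classes, imports, and call edges.",
);
const appended = addUnreleasedEntry(
  "## [Unreleased]\n\n### Added\n\n- A\n\n## [1.0.0]\n",
  "B",
);
```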
Step 10 — Report (do NOT commit)
Summarize for review:
- Files changed: the 4 wiring edits + new extractor + tests + README + CHANGELOG + corpus.json (+ any vendored `.wasm`).
- Extraction per repo: files / nodes / edges / `verify-extraction` result.
- A/B per repo: `with` vs `without` (tool calls, file Reads, cost) and a one-line verdict: did codegraph reduce effort, and did both arms reach a correct answer?
- Gaps / follow-ups (node types not yet mapped, resolution edges missing, framework routes, etc.).
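One way to keep the per-repo A/B verdict consistent is to derive it from the recorded metrics. The sketch below is an assumption-laden illustration: the `ArmResult` and `RepoAB` shapes are hypothetical (not the output format of `parse-run.mjs`), "reduced effort" is defined here as fewer tool calls and no more file Reads, and the numbers are placeholders, not measurements.

```typescript
// Hypothetical record of one benchmark arm.
interface ArmResult {
  toolCalls: number;
  fileReads: number;
  costUsd: number;
  correct: boolean;
}

interface RepoAB {
  repo: string;
  withArm: ArmResult; // agent with codegraph
  withoutArm: ArmResult; // agent without codegraph
}

// One-line verdict: did codegraph reduce effort, and were both arms correct?
// "Reduced" is this sketch's own definition, not an official metric.
function verdict(ab: RepoAB): string {
  const reduced =
    ab.withArm.toolCalls < ab.withoutArm.toolCalls &&
    ab.withArm.fileReads <= ab.withoutArm.fileReads;
  const bothCorrect = ab.withArm.correct && ab.withoutArm.correct;
  const effort = reduced ? "effort reduced" : "effort not reduced";
  const answers = bothCorrect ? "both arms correct" : "at least one arm incorrect";
  return `${ab.repo}: ${effort}; ${answers}`;
}

// Placeholder numbers for illustration only.
const example: RepoAB = {
  repo: "example/small-repo",
  withArm: { toolCalls: 6, fileReads: 2, costUsd: 0.4, correct: true },
  withoutArm: { toolCalls: 15, fileReads: 9, costUsd: 0.9, correct: true },
};
const line = verdict(example);
```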
Hand the changes to the user. Do not run `git commit`/`push` or publish;
releases go through the GitHub Actions Release workflow.
Notes
- The A/B spawns real paid `claude -p` runs (opus, `--max-budget-usd`), 2 arms × 3 repos. The corpus dir `/tmp/codegraph-corpus` is shared with `/agent-eval`, so clones are reused across runs.
- Any new `*.wasm` must live in `src/extraction/wasm/`; `copy-assets` (run by `npm run build`) ships it; otherwise it won't be in `dist/`.
- An index must be served by the same binary that built it. Step 8 builds + links the dev build first, so this holds.
- If a grammar can't be obtained, or extraction can't reach PASS, STOP and report — don't ship a half-wired language.
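Because the A/B runs are paid, it helps to know the spending ceiling before starting. A minimal sketch, assuming the `--max-budget-usd` cap applies to each run separately (the cap value itself is not stated here, so the $5 below is purely hypothetical, as is the `worstCaseSpendUsd` name):

```typescript
// Upper bound on A/B spend, assuming each run is capped independently.
function worstCaseSpendUsd(maxBudgetUsd: number, arms = 2, repos = 3): number {
  return maxBudgetUsd * arms * repos;
}

const paidRuns = 2 * 3; // 2 arms × 3 repos
const ceiling = worstCaseSpendUsd(5); // hypothetical $5 cap per run
```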