stata
Comprehensive Stata reference for writing correct .do files, data management, econometrics, causal inference, graphics, Mata programming, and 20 community packages (reghdfe, estout, did, rdrobust, etc.). Covers syntax, options, gotchas, and idiomatic patterns. Use this skill whenever the user asks you to write, debug, or explain Stata code.
Source: dylantmoore/stata-skill
NPX Install
npx skill4agent add dylantmoore/stata-skill stata
Stata Skill
You have access to comprehensive Stata reference files. Do not load all files.
Read only the 1-3 files relevant to the user's current task using the routing table below.
Critical Gotchas
These are Stata-specific pitfalls that lead to silent bugs. Internalize these before writing any code.
Missing Values Sort to +Infinity
Stata's missing value `.` (and the extended missing values `.a` through `.z`) are greater than all numbers.
```stata
* WRONG: missing income counts as > 50000, so those rows get high_income = 1
gen high_income = (income > 50000)
* RIGHT
gen high_income = (income > 50000) if !missing(income)
* WRONG: missing ages appear in this list
list if age > 60
* RIGHT
list if age > 60 & !missing(age)
```
`=` vs `==`
`=` assigns a value; `==` tests equality.
```stata
* WRONG: syntax error
gen employed = 1 if status = 1
* RIGHT
gen employed = 1 if status == 1
```
Local Macro Syntax
Locals use `` `name' `` (backtick + single-quote). Globals use `$name` or `${name}`.
Forgetting the closing quote is the #1 macro bug.
```stata
local controls "age education income"
regress wage `controls' // correct
regress wage `controls // WRONG: missing closing quote
regress wage 'controls' // WRONG: wrong quote characters
```
`by` Requires Prior Sort (Use `bysort`)
```stata
* WRONG: error if data not sorted by id
by id: gen first = (_n == 1)
* RIGHT: bysort sorts automatically
bysort id: gen first = (_n == 1)
* Also RIGHT: explicit sort
sort id
by id: gen first = (_n == 1)
```
Factor Variable Notation (`i.` and `c.`)
Use `i.` for categorical, `c.` for continuous. Omitting `i.` treats categories as continuous.
```stata
* WRONG: treats race as continuous (e.g., race=3 has 3x effect of race=1)
regress wage race education
* RIGHT: creates dummies automatically
regress wage i.race education
* Interactions
regress wage i.race##c.education // full interaction (main effects plus interaction)
regress wage i.race#c.education // interaction only (no main effects)
```
`generate` vs `replace`
`generate` creates a new variable; `replace` modifies an existing one. `generate` fails if the variable already exists.
```stata
gen x = 1
gen x = 2 // ERROR: x already defined
replace x = 2 // correct
```
String Comparison Is Case-Sensitive
```stata
* May miss "Male", "MALE", etc.
keep if gender == "male"
* Safer
keep if lower(gender) == "male"
```
`merge`: Always Check `_merge`
Never skip `tab _merge`: it costs nothing and is the only diagnostic you get when `assert` fails.
```stata
merge 1:1 id using other.dta
tab _merge // ALWAYS tab before assert
assert _merge == 3 // on failure, stops with an error but does not show how rows failed to match
drop _merge
```
`preserve` / `restore` + `tempfile` for Collapse-Merge-Back
The standard pattern for computing group stats and merging them onto the original data:
```stata
tempfile stats
preserve
collapse (mean) avg_x=x, by(group)
save `stats'
restore
merge m:1 group using `stats'
tab _merge
assert _merge == 3
drop _merge
```
For simple group means, `bysort group: egen avg_x = mean(x)` avoids the round-trip entirely.
Weights Are Not Interchangeable
- `fweight`: frequency weights (replication)
- `aweight`: analytic/regression weights (inverse variance)
- `pweight`: probability/sampling weights (survey data, implies robust SE)
- `iweight`: importance weights (rarely used)
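To make the difference between the weight types concrete, here is a minimal sketch using the built-in `auto` dataset. That dataset ships without weights, so `n_copies` and `sampw` are hypothetical variables constructed purely for illustration.

```stata
sysuse auto, clear

* Hypothetical weights: auto.dta has none, so build two for illustration
gen byte n_copies = 1 + foreign        // 1 for domestic, 2 for foreign cars
gen double sampw  = 1 + 2*foreign      // 1 for domestic, 3 for foreign cars

* fweight: each row counts as n_copies identical observations
summarize price                        // 74 observations
summarize price [fweight = n_copies]   // 52*1 + 22*2 = 96 observations

* pweight vs aweight: same coefficients, different standard errors
regress price mpg [pweight = sampw]    // robust (sandwich) standard errors
regress price mpg [aweight = sampw]    // conventional OLS standard errors
```

The point estimates from the last two regressions match; only the standard errors differ, which is why swapping one weight type for another can change inference without any visible warning.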
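`capture` Swallows Errors
`capture` suppresses the error message and stores the return code in `_rc`; if you never check `_rc`, the failure goes unnoticed.
```stata
capture some_command
if _rc != 0 {
    di as error "Failed with code: " _rc
    exit _rc
}
```
Line Continuation Uses `///`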
```stata
regress y x1 x2 x3 ///
    x4 x5 x6, ///
    vce(robust)
```
Stored Results: `r()` vs `e()` vs `s()`
- `r()`: r-class commands (summarize, tabulate, etc.)
- `e()`: e-class commands (estimation: regress, logit, etc.)
- `s()`: s-class commands (parsing)
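As a rough illustration of the result classes listed above, the sketch below reads `r()` after `summarize` and `e()` after `regress` on the built-in `auto` dataset. The names `mean_price` and `price_dm` are made up for this example.

```stata
sysuse auto, clear

* r-class: summarize leaves its results in r()
summarize price
return list                            // show everything currently in r()
local mean_price = r(mean)             // copy now; the next r-class command overwrites r()

* e-class: regress leaves its results in e()
regress price mpg weight
ereturn list                           // show everything currently in e()
display "N = " e(N) ", R-squared = " %5.3f e(r2)

* The saved local is still usable even if r() has since changed
gen double price_dm = price - `mean_price'
```

Copying a stored result into a local or scalar right after the command that produced it is the safest habit, since later commands can replace it.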
A new estimation command overwrites previous `e()` results. Store them first:
```stata
regress y x1 x2
estimates store model1
```
Running Stata from the Command Line
Claude can execute Stata code by running `.do` files in batch mode from the terminal. This is how to run Stata non-interactively.
Finding the Stata Binary
Stata on macOS is a `.app` bundle. The actual binary is inside it. Common locations:
```
# Stata 18 / StataNow (most common)
/Applications/Stata/StataMP.app/Contents/MacOS/stata-mp
/Applications/StataNow/StataMP.app/Contents/MacOS/stata-mp
# Other editions (SE, BE)
/Applications/Stata/StataSE.app/Contents/MacOS/stata-se
/Applications/Stata/StataBE.app/Contents/MacOS/stata-be
```
If Stata isn't on `$PATH`, find it with:
```
mdfind -name "stata-mp" | grep MacOS
```
Batch Mode (`-b`)
```bash
# Run a .do file in batch mode; output goes to <filename>.log
/Applications/Stata/StataMP.app/Contents/MacOS/stata-mp -b do analysis.do
# If stata-mp is on PATH (e.g., via symlink or alias):
stata-mp -b do analysis.do
```
- `-b` = batch mode (non-interactive, no GUI)
- Output (everything Stata would display) is written to `analysis.log` in the working directory
- Exit code is 0 on success; Stata may also return 0 after a do-file error, so confirm failures from the log rather than the exit code alone
- The log file contains all output, including error messages; check it after execution
Running Inline Stata Code
To run a quick Stata snippet without creating a permanent `.do` file:
```bash
# Write a temp .do file and run it
cat > /tmp/stata_run.do << 'EOF'
sysuse auto, clear
summarize price mpg
EOF
stata-mp -b do /tmp/stata_run.do
# The log is written to the current working directory, not next to the .do file
cat stata_run.log
```
Checking Results
```bash
# Check if it succeeded
stata-mp -b do tests/run_tests.do && echo "SUCCESS" || echo "FAILED"
# Search the log for pass/fail
grep -E "PASS|FAIL|error|r\([0-9]+\)" run_tests.log
```
Tips
- `clear all` at the top of batch scripts: each batch run starts a fresh Stata session, but `clear all` also guards against stale state when the same script is run with `do` inside an existing session.
- `set more off`: prevents Stata from pausing for `--more--` prompts (fatal in batch mode).
- Log files overwrite silently: `analysis.do` always writes to `analysis.log` in the current directory. If you run multiple `.do` files, check the right log.
- Working directory: Stata's working directory is wherever you run the command from, not where the `.do` file lives. Use `cd` in the `.do` file or absolute paths if needed.
Routing Table
Read only the files relevant to the user's task. Paths are relative to this SKILL.md file.
Data Operations
| File | Topics & Key Commands |
|---|---|
| |
| |
| |
| Variable types, |
| |
| |
| |
Statistics & Econometrics
| File | Topics & Key Commands |
|---|---|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
Causal Inference
| File | Topics & Key Commands |
|---|---|
| |
| DiD, parallel trends, event studies, staggered adoption |
| Sharp/fuzzy RD, bandwidth selection, |
| PSM, nearest neighbor, kernel matching, |
| |
Advanced Methods
| File | Topics & Key Commands |
|---|---|
| |
| |
| |
| |
| |
Graphics
| File | Topics & Key Commands |
|---|---|
| |
Programming
| File | Topics & Key Commands |
|---|---|
| |
| |
| Mata basics, when to use Mata vs ado, data types |
| Mata functions, flow control, structures, pointers |
| Matrix creation, decompositions, solvers, |
| |
Output & Workflow
| File | Topics & Key Commands |
|---|---|
| |
| Project structure, master do-files, version control, debugging, common mistakes |
| Python via |
| User wants to report a Stata skill documentation gap or error to the repository |
Community Packages
| File | What It Does |
|---|---|
| High-dimensional fixed effects OLS (absorbs multiple FE sets efficiently) |
| |
| Alternative regression table exporter (Word, Excel, TeX) |
| One-command Word document creation for any Stata output |
| Cross-tabulations and summary tables to file |
| Coefficient plots from stored estimates |
| |
| Modern DiD: |
| |
| Robust RD estimation with optimal bandwidth ( |
| Propensity score matching (nearest neighbor, kernel, radius) |
| Synthetic control method ( |
| Enhanced IV/2SLS: |
| Dynamic panel GMM (Arellano-Bond/Blundell-Bond) |
| Binned scatter plots with CI ( |
| Nonparametric kernel estimation and inference |
| |
| Winsorizing and trimming: |
| |
| |
Common Patterns
Regression Table Workflow
```stata
* Estimate models
eststo clear
eststo: regress y x1 x2, vce(robust)
eststo: regress y x1 x2 x3, vce(robust)
eststo: regress y x1 x2 x3 x4, vce(cluster id)
* Export table
esttab using "results.tex", replace ///
    se star(* 0.10 ** 0.05 *** 0.01) ///
    label booktabs ///
    title("Main Results") ///
    mtitles("(1)" "(2)" "(3)")
```
Panel Data Setup
```stata
xtset panelid timevar // declare panel structure
xtdescribe // check balance
xtsum outcome // within/between variation
* Fixed effects
xtreg y x1 x2, fe vce(cluster panelid)
* Or with reghdfe (preferred for multiple FE)
reghdfe y x1 x2, absorb(panelid timevar) vce(cluster panelid)
```
Difference-in-Differences
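```stata
* Classic 2x2 DiD
gen post = (year >= treatment_year)
gen treat_post = treated * post
regress y treated post treat_post, vce(cluster id)
* Event study (uniform timing: must interact with treatment group)
* Factor variables cannot take negative values, so shift event time before using ib#.
summarize rel_time, meanonly
local shift = -r(min)                  // assumes rel_time has negative leads
gen rel_shift = rel_time + `shift'     // earliest lead becomes 0
local base = `shift' - 1               // shifted value of period -1 (the reference period)
reghdfe y ib`base'.rel_shift#1.treated, absorb(id year) vce(cluster id)
testparm *.rel_shift#1.treated // joint test of all event-time coefficients; restrict to leads for a pre-trend test
* Modern staggered DiD (Callaway & Sant'Anna)
csdid y x1 x2, ivar(id) time(year) gvar(first_treat) agg(event)
csdid_plot
```
Graph Export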
```stata
* Publication-quality scatter with fit line
twoway (scatter y x, mcolor(navy%50) msize(small)) ///
    (lfit y x, lcolor(cranberry) lwidth(medthick)), ///
    title("Title Here") ///
    xtitle("X Label") ytitle("Y Label") ///
    legend(off) scheme(s2color)
graph export "figure1.pdf", replace as(pdf)
graph export "figure1.png", replace as(png) width(2400)
```
Data Cleaning Pipeline
```stata
* Load and inspect
import delimited "raw_data.csv", clear varnames(1)
describe
codebook, compact
* Clean
rename *, lower // lowercase all varnames
destring income, replace force // convert string to numeric; force turns non-numeric values into missing
replace income = . if income < 0
* Label
label variable income "Annual household income (USD)"
label define yesno 0 "No" 1 "Yes"
label values employed yesno
* Save
compress
save "clean_data.dta", replace
```
Multiple Imputation
```stata
mi set mlong
mi register imputed income education
mi impute chained (regress) income (ologit) education = age i.gender, add(20) rseed(12345)
mi estimate: regress wage income education age i.gender
```
Help Improve This Skill
If you produce Stata code with a significant error — wrong syntax, incorrect command usage, or a gotcha you failed to catch — and the issue seems to stem from a gap in these reference files rather than a one-off mistake, consider suggesting to the user that they file an issue on the skill repository. This helps future users.
When to raise this: Only after you've already corrected the error and the user has working code. Frame it as optional: "I made an error with [X] that I think comes from a gap in the Stata skill documentation. If you'd like, I can help you file an issue or a PR so it gets fixed for everyone."
When NOT to raise this: If the user is on Claude Haiku, the error is more likely a model capability issue than a documentation gap. In that case, suggest they try Sonnet or Opus for complex Stata work instead of filing an issue.
If the user agrees, read `references/filing-issues.md` for instructions on writing a good issue report.