stata-c-plugins
Stata C/C++ Plugin Development
Build high-performance C/C++ plugins for Stata. This skill covers the full lifecycle from SDK setup through cross-platform distribution, based on real experience building production Stata plugins for statistical imputation, random forests, string matching, and causal inference.
This skill assumes macOS (Apple Silicon or Intel) as the development platform. Build commands, cross-compilation workflows, and Docker instructions are all Mac-oriented. The plugins themselves target all four platforms (macOS ARM64, macOS x86_64, Linux x86_64, Windows x86_64), but the development environment is macOS. If you need to develop on Linux or Windows natively, adapt the compilation and Docker sections accordingly.
How to Approach Every Task
Before writing any code, enter plan mode. A good plan covers:
- Complete inventory: every feature, option, and component to build (for translation: exhaustive catalog of the source package's API)
- Architecture decisions: wrap C++ backend vs. write C from scratch vs. pure Stata
- Relevant reference files: identify up front which of this skill's reference files contain info you'll need, and cite them explicitly in the plan steps so they get loaded at the right time:
  - `references/translation_workflow.md`: full translation workflow, test repurposing, fidelity audit
  - `references/testing_strategy.md`: test layers, reference data generation, Layer 0 (repurpose original tests)
  - `references/performance_patterns.md`: pthreads, XorShift RNG, quickselect, pre-sorted indices
  - `references/packaging_and_help.md`: .toc/.pkg/.sthlp templates, build scripts
  - `references/cpp_plugins.md`: C++ wrapping, extern "C", exception safety, compilation
- Phase-by-phase steps with dependencies between them
- For each step: what gets built, what tests get written, and that the review loop runs before proceeding
- For translation projects: a final fidelity audit as the last step (see `translation_workflow.md`)
Implement sequentially across components, in parallel within each component. Once an interface is defined, dispatch independent sub-tasks as parallel subagents (e.g., C plugin implementation, .ado wrapper, and test suite can run simultaneously). Merge their work, run the full test suite, then proceed to the review loop before moving to the next component.
Run the review loop after every component:
- Default: dispatch 2-3 review agents in parallel, ideally from different models (e.g., Claude + GPT + Gemini) for diversity of perspective. Use whatever multi-model tools are available in your environment.
- If only one model is available: dispatch 2-3 agents with different review focuses (correctness, completeness, architecture). Different prompts approximate the diversity of different models.
- Each agent reviews the diff, test results, and requirements — instruction: "List any gaps, bugs, or issues. Say LGTM if everything looks correct."
- Fix all issues raised, re-dispatch, loop until all agents say LGTM. Then proceed.
Wrap First, Write From Scratch Second
When translating a package, always check for an existing C/C++ backend before writing any algorithm code. Many R packages have C++ in `src/`. Many Python packages have Cython or vendored C/C++ libraries. Standalone C++ libraries exist for string matching, linear algebra, tree algorithms, and more.
If a C++ implementation exists, wrap it. Do not reimplement the algorithm in C. Wrapping gives you identical output (same code path), production-grade performance, and a fraction of the code. The plugin is just a thin `extern "C"` glue layer between Stata's SDK and the library's API. Binary size is irrelevant: statically link everything (`-static-libstdc++ -static-libgcc`) and ship whatever size the binary turns out to be, even 10-15 MB on Windows. Users don't care about plugin file size; they care about correct results.
See `references/cpp_plugins.md` for the full pattern and `references/translation_workflow.md` for the workflow. Working examples of this approach (wrapping C++ backends, multi-plugin dispatching, save/load for scoring on new data) can be found in the repos listed in the project CLAUDE.md under "Example Applications."
For translation projects, also: repurpose the original package's test suite and data (see `references/testing_strategy.md`, Layer 0), write additional Stata-specific tests, and end the plan with a multi-agent fidelity audit. See `references/translation_workflow.md` for the complete workflow.
The Plugin SDK
Two files from Stata's plugin SDK, `stplugin.h` and `stplugin.c`, define the interface between your C code and Stata:
| Function/Macro | Purpose |
|---|---|
| `SF_vdata(var, obs, &val)` | Read variable value (1-indexed!) |
| `SF_vstore(var, obs, val)` | Write variable value (1-indexed!) |
| `SF_nobs()` | Number of observations in current dataset |
| `SF_nvar()` | Number of variables in the entire dataset (not just plugin call) |
| `SF_is_missing(val)` | Check for Stata missing value |
| `SV_missval` | The missing value constant |
| `SF_display(msg)` | Print informational text in Stata |
| `SF_error(msg)` | Print red error text in Stata |
Indexing is 1-based. Both variable indices and observation indices start at 1, not 0. Off-by-one errors here are silent and catastrophic — you read the wrong variable's data with no warning.
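One way to keep the 1-based/0-based conversion in a single place is a pair of small helpers. This is a minimal sketch, not part of the SDK: `cell_offset` and `feature_var_index` are hypothetical names, and the layout (dependent variable in position 1, features after it, row-major C buffer) is an assumption taken from the example plugin in this document.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: map a 1-based Stata observation number and a
 * 0-based feature column to an offset in a row-major C buffer with
 * p columns per row. */
static size_t cell_offset(size_t obs, size_t j, size_t p)
{
    assert(obs >= 1);   /* Stata observations start at 1 */
    assert(j < p);      /* C-side feature columns start at 0 */
    return (obs - 1) * p + j;
}

/* Hypothetical helper: Stata variable index of feature j, assuming the
 * dependent variable occupies position 1 of the plugin call varlist. */
static int feature_var_index(int j)
{
    return j + 2;
}
```

With helpers like these, a read loop would call `SF_vdata(feature_var_index(j), obs, &val)` and store into `X[cell_offset(obs, j, p)]`, so the `+2` and `-1` adjustments appear only once in the source.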
Memory Safety
A crash in your plugin kills the entire Stata session. No save prompt, no recovery. The user loses all unsaved work. This is the single most important thing to internalize.
- Check every `malloc()`/`calloc()` return for `NULL`
- Validate `argc` before accessing `argv[]`
- Build with `-fsanitize=address` during development
- Test on small data first, scale up gradually
- Pre-allocate all memory upfront in `stata_call()`, free at the end
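The allocation check in the list above can be wrapped in a helper that also guards the size arithmetic. The sketch below is illustrative only: `alloc_matrix` is a hypothetical name, and treating a zero dimension as an error is an assumption about what a plugin would want.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a zeroed rows x cols matrix of doubles.
 * Returns NULL if either dimension is zero, if the byte count would
 * overflow size_t, or if the allocation itself fails. The caller must
 * check the result and free() it when done. */
static double *alloc_matrix(size_t rows, size_t cols)
{
    if (rows == 0 || cols == 0)
        return NULL;
    if (rows > SIZE_MAX / cols / sizeof(double))
        return NULL;                      /* rows * cols * 8 would overflow */
    return calloc(rows * cols, sizeof(double));
}
```

In `stata_call()` the caller would test the result, report the failure with `SF_error()`, free anything already allocated, and return a non-zero code rather than continuing with a NULL pointer.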
The stata_call() Entry Point
Every plugin implements one function. Plugins can also be written in C++: the entry point just needs `extern "C"` linkage so Stata can find it; everything else can be full C++. The obvious case for C++ is when existing C++ code is available to wrap (e.g., an R package's `src/` directory). C++ also helps when you need complex data structures or threading via `std::thread`. For practical C++ guidance (the `extern "C"` pattern, exception safety, compilation commands, wrapping libraries), see `references/cpp_plugins.md`. The rest of this file focuses on C because it's the simpler default.
```c
#include <stdlib.h>
#include "stplugin.h"
// For C++ plugins, wrap the entry point with extern "C":
// extern "C" {
//     STDLL stata_call(int argc, char *argv[]) { ... }
// }
STDLL stata_call(int argc, char *argv[]) {
    // 0. Validate arguments BEFORE accessing argv[]
    if (argc < 4) {
        SF_error("myplugin requires 4 arguments: n_train n_test seed p\n");
        return 198;  // Stata's "syntax error" code
    }
    // 1. Parse arguments (all strings: use atoi/atof)
    int n_train = atoi(argv[0]);
    int n_test  = atoi(argv[1]);
    int seed    = atoi(argv[2]);
    // 2. Get dimensions
    ST_int nobs = SF_nobs();
    // CAUTION: SF_nvar() returns ALL variables in the dataset, not just
    // the ones passed to `plugin call`. If the .ado creates tempvars
    // (touse, merge_id, etc.) the count will be higher than expected.
    // Pass the variable count via argv instead of relying on SF_nvar().
    int p = atoi(argv[3]);  // safer: pass feature count explicitly
    if (nobs < 1 || p < 1) {
        SF_error("myplugin: need at least one observation and one feature\n");
        return 198;
    }
    int out_var = p + 2;    // varlist layout: depvar, p features, output
    // 3. Allocate memory (size_t arithmetic avoids int overflow in nobs * p)
    double *X    = calloc((size_t)nobs * (size_t)p, sizeof(double));
    double *y    = calloc((size_t)nobs, sizeof(double));
    double *pred = calloc((size_t)nobs, sizeof(double));
    if (!X || !y || !pred) {
        SF_error("myplugin: out of memory\n");
        free(X); free(y); free(pred);  // free(NULL) is a no-op
        return 909;
    }
    // 4. Read data from Stata (1-indexed!)
    ST_double val;
    for (ST_int obs = 1; obs <= nobs; obs++) {
        SF_vdata(1, obs, &val);  // var 1 = depvar
        y[obs-1] = val;
        for (int j = 0; j < p; j++) {
            SF_vdata(j + 2, obs, &val);  // vars 2..p+1 = features
            X[(size_t)(obs-1) * p + j] = val;
        }
    }
    // 5. Run your algorithm (my_algorithm is a placeholder for your own routine)
    int rc = my_algorithm(X, y, pred, n_train, n_test, p, seed);
    if (rc != 0) {
        SF_error("myplugin: algorithm failed\n");
        free(X); free(y); free(pred);
        return 909;
    }
    // 6. Write results back to Stata
    for (ST_int obs = 1; obs <= nobs; obs++) {
        SF_vstore(out_var, obs, pred[obs-1]);  // last var in the call = output
    }
    free(X); free(y); free(pred);
    return 0;  // 0 = success
}
```
Return Codes
- `0`: success
- `198`: syntax error (bad arguments)
- `909`: insufficient memory
- `601`: file not found
- Any non-zero triggers a Stata error
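As an illustration of returning `198` for bad arguments, the sketch below validates `argc` and parses each argument with `strtol` instead of `atoi`, so that non-numeric input is rejected rather than silently read as 0. The `plugin_args` struct, the helper names, and the four-argument layout (`n_train n_test seed p`) are assumptions for this example, not part of the SDK.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

enum { RC_OK = 0, RC_SYNTAX = 198 };  /* codes taken from the list above */

/* Hypothetical container for the parsed plugin arguments. */
typedef struct { int n_train, n_test, seed, p; } plugin_args;

/* Parse one base-10 integer; reject empty strings, trailing junk,
 * and values that do not fit in an int. */
static int parse_int(const char *s, int *out)
{
    char *end;
    long v;
    errno = 0;
    v = strtol(s, &end, 10);
    if (end == s || *end != '\0' || errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return RC_SYNTAX;
    *out = (int)v;
    return RC_OK;
}

/* Check argc before touching argv[], then parse and range-check. */
static int parse_args(int argc, char *argv[], plugin_args *a)
{
    if (argc < 4)
        return RC_SYNTAX;
    if (parse_int(argv[0], &a->n_train) || parse_int(argv[1], &a->n_test) ||
        parse_int(argv[2], &a->seed)    || parse_int(argv[3], &a->p))
        return RC_SYNTAX;
    if (a->n_train < 1 || a->n_test < 0 || a->p < 1)
        return RC_SYNTAX;
    return RC_OK;
}
```

A plugin entry point could call `parse_args()` first and return its result directly when it is non-zero, after printing a message with `SF_error()`.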
The .ado Wrapper Pattern
Users never call `plugin call` directly. An `.ado` file provides the Stata-native interface.
The Preserve/Merge Pattern
This is the core pattern for plugins that operate on a subset of data:
```stata
program define mycommand, rclass
    syntax varlist(min=2) [if] [in], GENerate(name) [SEED(integer 12345) REPlace]
    gettoken depvar indepvars : varlist
    // Feature count, passed to the plugin as its 4th argument (read as argv[3])
    local p : word count `indepvars'
    if "`replace'" != "" {
        capture drop `generate'
    }
    confirm new variable `generate'
    // Mark sample: novarlist ALLOWS missing depvar (critical for imputation)
    marksample touse, novarlist
    markout `touse' `indepvars'   // but DO exclude missing predictors
    // Stable merge key: create BEFORE any sorting or subsetting
    tempvar merge_id
    quietly gen long `merge_id' = _n
    // Count subsets
    quietly count if `touse' & !missing(`depvar')
    local n_train = r(N)
    quietly count if `touse' & missing(`depvar')
    local n_test = r(N)
    // Create output variable (all missing initially)
    quietly gen double `generate' = .
    // Preserve, subset, call plugin
    preserve
    quietly keep if `touse'
    // Sort if plugin requires it (donors first, test second)
    tempvar sort_order
    quietly gen `sort_order' = missing(`depvar')
    quietly sort `sort_order'
    // Call plugin
    plugin call myplugin `depvar' `indepvars' `generate', ///
        `n_train' `n_test' `seed' `p'
    // Save results and restore
    tempfile results
    quietly keep `merge_id' `generate'
    quietly save `results'
    restore
    // Merge predictions back (update replaces missing with non-missing)
    quietly merge 1:1 `merge_id' using `results', nogenerate update
end
```
Why `update` works: The `generate` variable is all-missing before preserve. After restore, it's still all-missing. The `update` option replaces missing values with non-missing ones from the merge file. The `replace` option is handled earlier via `capture drop`, so by merge time the variable is always freshly created.
Plugin Sorting Contract
CRITICAL: Some plugins expect data sorted a specific way (training rows first, test rows second). Others handle missing data internally. Sorting mismatches are among the most dangerous bugs: the plugin silently reads the wrong data, producing garbage output with no error message. A mismatched sort order can drop prediction quality dramatically (e.g., correlation going from 0.99 to 0.38) because the plugin treats test observations as training data and vice versa.
- If the plugin checks `SF_is_missing()` internally: do NOT sort in the .ado wrapper
- If the plugin expects contiguous `n_train` rows then `n_test` rows: sort by `missing(depvar)` before calling
Document which pattern your plugin uses.
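A plugin that relies on the contiguous layout could verify it before doing any work, turning a silent mismatch into an explicit error. The sketch below is one possible check under stated assumptions: it runs outside Stata, so missing values are encoded as NaN and `is_missing()` stands in for `SF_is_missing()`; the function name is hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Stand-in for SF_is_missing(): this sketch encodes missing values as NaN
 * so it can run outside Stata. Inside a plugin, call SF_is_missing(). */
static int is_missing(double v)
{
    return isnan(v);
}

/* Returns 1 if y holds n_train non-missing values followed by n_test
 * missing values, and 0 otherwise. y must have n_train + n_test entries. */
static int layout_is_train_then_test(const double *y, int n_train, int n_test)
{
    int i;
    if (n_train < 0 || n_test < 0)
        return 0;
    for (i = 0; i < n_train; i++)
        if (is_missing(y[i]))
            return 0;
    for (i = n_train; i < n_train + n_test; i++)
        if (!is_missing(y[i]))
            return 0;
    return 1;
}
```

If the check fails, the plugin can print a message with `SF_error()` and return a non-zero code instead of producing garbage predictions.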
Plugin Loading (Cross-Platform)
Use the gtools-style OS detection pattern. This detects the OS via `c(os)` and constructs a bare filename. The bare filename is resolved via Stata's adopath, which is reliable across all platforms.
```stata
/* ---- Load plugin (gtools-style: detect OS, bare filename) ---- */
if ( inlist("`c(os)'", "MacOSX") | strpos("`c(machine_type)'", "Mac") ) local c_os_ macosx
else local c_os_: di lower("`c(os)'")
cap program drop myplugin
program myplugin, plugin using("myplugin_`c_os_'.plugin")
```
This resolves to `myplugin_macosx.plugin`, `myplugin_windows.plugin`, or `myplugin_unix.plugin` depending on platform.
WARNING: DO NOT use `findfile` + absolute paths. The following pattern is BROKEN on Windows and must never be used:
```stata
* BROKEN: DO NOT USE
capture findfile myplugin.plugin
capture program myplugin, plugin using("`r(fn)'")
```
`findfile` returns an absolute path (e.g., `C:\ado\plus\m\myplugin.plugin`) that Stata's `LoadLibrary` can't resolve on Windows when it is passed to `using()`.
Similarly, do not use a nested if/else cascade trying each `platform-arch` suffix. This was the old pattern in several packages and fails for the same reason if `findfile` is involved, plus it's fragile and verbose.
Plugin file naming: `pluginname_os.plugin` where `os` is one of `macosx`, `unix`, `windows`. Examples: `qrf_plugin_macosx.plugin`, `grf_plugin_windows.plugin`.
Note: `clear all` wipes loaded plugin definitions. If a test script starts with `clear all`, all `program ... plugin` definitions are gone. Reload them.
Cross-Platform Compilation
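Build for three platforms (ARM Macs run x86_64 via Rosetta, so one macOS binary suffices). Install the Windows cross-compiler first: `brew install mingw-w64`.
| Target OS | Output name suffix | Compiler | `-DSYSTEM=` | Link flag | pthreads |
|---|---|---|---|---|---|
| macOS (ARM64) | `_macosx` | `gcc` | `APPLEMAC` | `-bundle` | built in (no flag) |
| Linux (x86_64) | `_unix` | `gcc` (in Docker) | `OPUNIX` | `-shared` | `-lpthread` |
| Windows (x86_64) | `_windows` | `x86_64-w64-mingw32-gcc` | `STWIN32` | `-shared` | `-lwinpthread` |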
All platforms: `-O3 -fPIC` for release, add `-g -fsanitize=address` for development.
For C++ plugins: use `g++` instead of `gcc`. Add `-std=c++` at the version the library requires (check its docs; C++11, C++14, and C++17 are all common). Header-only C++ libraries can be vendored into `c_source/` and included with `-I.`. Always use `-static-libstdc++ -static-libgcc` on Windows and Linux.
Naming convention: `pluginname_os.plugin` (e.g., `qrf_plugin_macosx.plugin`, `grf_plugin_windows.plugin`). The `os` suffix must match what the gtools-style loader produces: `macosx`, `unix`, or `windows`.
macOS note: use `-bundle`, NOT `-shared`. This is a common mistake.
Linux from macOS (Docker Required)
There is no native Linux cross-compiler on macOS. Use Docker via Colima (`brew install colima docker`, then `colima start`). Build with a one-liner:
```bash
docker run --rm --platform linux/amd64 -v "$(pwd):/build" -w /build ubuntu:18.04 \
  bash -c "apt-get update -qq && apt-get install -y -qq g++ gcc make > /dev/null 2>&1 && make linux"
```
glibc compatibility: Build on Ubuntu 18.04 for maximum compatibility (requires only GLIBC 2.14, works on any Linux from ~2012+). Building on Ubuntu 22.04+ requires GLIBC 2.34, which excludes RHEL 8, Ubuntu 20.04, and many HPC environments.
Performance Optimization
See `references/performance_patterns.md` for detailed code examples of:
- Pre-sorted feature indices: Sort feature values once, scan linearly at each tree node. O(n) per split instead of O(n log n).
- Precomputed distance norms: Exploit ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a'b for KNN.
- Quickselect: O(n) partial sort for finding k-th nearest neighbor.
- Parallel ensemble training (pthreads): Train multiple models concurrently. Each thread gets its own data copy and RNG state. Never call Stata SDK functions (`SF_vdata`, `SF_vstore`, `SF_display`) from worker threads. Read all data on the main thread first, dispatch computation to workers, write results back on the main thread after joining.
- XorShift RNG: C plugins cannot access Stata's internal RNG (`runiform()`). XorShift128+ is fast, statistically sound, and thread-safe (each thread gets its own state). Seed from `argv[]` for reproducibility.
- Dense arrays for trees: Flat node arrays instead of linked lists for cache locality.
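To make the XorShift item concrete, here is a minimal sketch of an xorshift128+ generator with per-instance state, using the 23/18/5 shift constants from Vigna's public reference implementation. The type and function names are hypothetical, and seeding through SplitMix64 is a choice made for this example, not something the reference file is known to prescribe.

```c
#include <assert.h>
#include <stdint.h>

/* One generator state per thread; no globals, so threads never share state. */
typedef struct { uint64_t s[2]; } xs128p;

/* SplitMix64 step, used here only to expand a seed into a full state. */
static uint64_t splitmix64(uint64_t *x)
{
    uint64_t z = (*x += UINT64_C(0x9E3779B97F4A7C15));
    z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
    return z ^ (z >> 31);
}

static void xs128p_seed(xs128p *g, uint64_t seed)
{
    g->s[0] = splitmix64(&seed);
    g->s[1] = splitmix64(&seed);
    if (g->s[0] == 0 && g->s[1] == 0)
        g->s[1] = 1;                 /* the all-zero state is invalid */
}

static uint64_t xs128p_next(xs128p *g)
{
    uint64_t s1 = g->s[0];
    const uint64_t s0 = g->s[1];
    const uint64_t result = s0 + s1;
    g->s[0] = s0;
    s1 ^= s1 << 23;
    g->s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
    return result;
}

/* Uniform double in [0, 1) built from the top 53 bits. */
static double xs128p_uniform(xs128p *g)
{
    return (double)(xs128p_next(g) >> 11) * (1.0 / 9007199254740992.0);
}
```

For threaded training, one option is to seed worker `t` with the user seed from `argv[]` plus `t`, so each thread draws from its own stream and a rerun with the same seed reproduces the same results.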
Debugging
Debugging is hard because you can't attach a debugger to Stata's plugin host.
Strategies
- Printf via `SF_display()`:
  ```c
  char buf[256];
  snprintf(buf, sizeof(buf), "Debug: n=%d, p=%d\n", n, p);
  SF_display(buf);
  ```
- Write diagnostic files:
  ```c
  FILE *f = fopen("plugin_debug.log", "w");
  if (f) {  /* fopen can fail; never write through a NULL handle */
      fprintf(f, "value at [%d][%d] = %f\n", i, j, val);
      fclose(f);
  }
  ```
- Test standalone first. Write a `main()` that reads CSV and calls your algorithm. Debug with normal tools (gdb, valgrind, sanitizers). Then adapt for the plugin interface.
- Build with sanitizers during development: `-g -fsanitize=address`
- Check `SF_vdata()` return values. It returns `RC` (0=success). Non-zero means invalid obs/var index.
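The printf-style strategy can be wrapped in a small variadic helper so call sites stay short. This is a sketch under assumptions: `debug_printf`, `text_sink`, and `capture_sink` are hypothetical names, and the sink is a function pointer so the same code can run standalone (capturing into a buffer) or inside Stata (with a thin wrapper that forwards to `SF_display()`).

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Destination for formatted text, e.g. a wrapper around SF_display(). */
typedef void (*text_sink)(const char *msg);

/* Format into a fixed buffer and hand the result to the sink. Returns the
 * length the full message needed (as vsnprintf does), so a return value
 * of 256 or more tells the caller the message was truncated. */
static int debug_printf(text_sink sink, const char *fmt, ...)
{
    char buf[256];
    va_list ap;
    int n;
    va_start(ap, fmt);
    n = vsnprintf(buf, sizeof buf, fmt, ap);
    va_end(ap);
    if (n >= 0)
        sink(buf);
    return n;
}

/* Sink for running outside Stata: keeps a copy of the last message. */
static char last_message[256];
static void capture_sink(const char *msg)
{
    strncpy(last_message, msg, sizeof last_message - 1);
    last_message[sizeof last_message - 1] = '\0';
}
```

A call such as `debug_printf(capture_sink, "n=%d, p=%d\n", n, p)` then replaces the three-line buffer pattern, and swapping the sink is the only change needed when moving from the standalone harness to the plugin.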
Common Failure Modes
| Symptom | Likely Cause |
|---|---|
| Stata crashes silently | Segfault: buffer overflow, bad argv access, NULL deref |
| Plugin returns all missing | Wrong variable count, wrong obs indexing, plugin not loaded |
| Results are garbage | Sorting mismatch, 0-vs-1 indexing error, unnormalized inputs |
| "plugin not found" | Wrong filename, |
| Works on Mac, fails on Linux | Integer size difference, use |
Packaging and Distribution
Use platform-specific `.pkg` files so users only download the binary for their OS. Stata's `net install` has no conditional logic, so the way to avoid shipping every binary to every user is to offer separate packages per platform. All packages install the same `.ado` and `.sthlp` files; only the `.plugin` binary differs.
```
mypackage/
├── stata.toc                  # lists all package variants
├── mypackage.pkg              # all platforms (for users who don't care)
├── mypackage_mac.pkg          # macOS only
├── mypackage_linux.pkg        # Linux only
├── mypackage_win.pkg          # Windows only
├── mycommand.sthlp            # overview help file (short name!)
├── mycommand.ado              # user-facing command
├── myplugin_macosx.plugin
├── myplugin_unix.plugin
├── myplugin_windows.plugin
└── c_source/                  # NOT distributed, for building
    ├── build.py
    ├── stplugin.c
    ├── stplugin.h
    └── algorithm.c
```
Users install their platform's package:
```stata
* macOS
net install mypackage_mac, from("https://raw.githubusercontent.com/user/repo/main") replace
* Linux
net install mypackage_linux, from("https://raw.githubusercontent.com/user/repo/main") replace
* Windows
net install mypackage_win, from("https://raw.githubusercontent.com/user/repo/main") replace
```
All platform binaries ship via the all-platform .pkg, or users can install platform-specific packages. Stata loads only the matching plugin at runtime via gtools-style OS detection. Windows C++ binaries can be 10-15MB due to static linking, which is normal.
See `references/packaging_and_help.md` for `.toc`, `.pkg`, `.sthlp` templates and SMCL formatting.
Common Pitfalls
- Sorting destroys merge keys. If you sort inside `preserve`/`restore`, row order no longer matches the original data, so results cannot be matched back by position. Always create merge_id BEFORE preserve.
- 1-indexed everything. `SF_vdata(var, obs, &val)`: both var and obs start at 1. Off-by-one errors are silent.
- `marksample` excludes missing by default. For imputation (where missing depvar IS the point), use `marksample touse, novarlist`.
- macOS `c(os)` returns "MacOSX". Use the gtools pattern: `` inlist("`c(os)'", "MacOSX") | strpos("`c(machine_type)'", "Mac") `` to detect Mac. For other platforms, `lower(c(os))` gives `"windows"` or `"unix"`.
- `argv[]` has no bounds checking. Accessing `argv[3]` when `argc == 2` is a segfault. Always check `argc` first.
- `clear all` wipes plugins. Reload plugin definitions after `clear all` in test scripts.
- Only the first `program define` in a .ado file is auto-discovered. Subprograms need their own .ado files or explicit `run` to load.
- Normalize inputs when the algorithm requires it (neural networks, gradient-based methods, distance-based methods like KNN). Scale to mean=0, sd=1 in the .ado wrapper, denormalize predictions after. The plugin should receive clean, normalized data; let the .ado handle the scaling.
- pthreads on Windows needs `-lwinpthread`. Use conditional linker flags.
- Memory errors crash Stata with no recovery. Pre-allocate everything, check every allocation, build with sanitizers during development.
- glibc version mismatch. Building Linux plugins on a modern distro produces binaries that won't load on older systems. Use Ubuntu 18.04 in Docker for maximum compatibility.
- `SF_nvar()` returns total dataset variables. It counts ALL variables in the dataset, not just the ones in the `plugin call` varlist. If the .ado creates tempvars (`touse`, `merge_id`, sort keys), the count will be higher than expected. Never use `SF_nvar()` to validate argument counts; pass the expected count via `argv` instead.
- `findfile` + absolute paths breaks on Windows. `findfile` returns an absolute path that Stata's `LoadLibrary` can't resolve on Windows. Use the gtools-style OS detection pattern instead (see Plugin Loading section above); it constructs a bare filename that Stata resolves via the adopath.
Naming Conventions
- Use `method()` not `model()` for method selection options
- Use `generate()` (abbreviation `gen()`) for output variable naming
- Use `replace` as a flag option, not `replace()`
- Plugin files: `algorithm_plugin_os.plugin` where os is `macosx`, `unix`, or `windows`
- .ado files: lowercase, underscores for multi-word
- Stata option convention: options lowercase, abbreviations capitalized (`GENerate`, `MAXDepth`)
- Target Stata 14.0+ (`version 14.0`) for plugin support
- Help files use the short command name, not the repo name. If the repo is called `mypackage_stata`, the overview help file should still be `mypackage.sthlp` (so `help mypackage` works). Don't append "stata" to help file or command names; the user is already in Stata.