stata-c-plugins

Stata C/C++ Plugin Development

Build high-performance C/C++ plugins for Stata. This skill covers the full lifecycle from SDK setup through cross-platform distribution, based on real experience building production Stata plugins for statistical imputation, random forests, string matching, and causal inference.
This skill assumes macOS (Apple Silicon or Intel) as the development platform. Build commands, cross-compilation workflows, and Docker instructions are all Mac-oriented. The plugins themselves target all four platforms (macOS ARM64, macOS x86_64, Linux x86_64, Windows x86_64), but the development environment is macOS. If you need to develop on Linux or Windows natively, adapt the compilation and Docker sections accordingly.

How to Approach Every Task

Before writing any code, enter plan mode. A good plan covers:
  1. Complete inventory: every feature, option, and component to build (for translation: exhaustive catalog of the source package's API)
  2. Architecture decisions: wrap C++ backend vs. write C from scratch vs. pure Stata
  3. Relevant reference files: identify up front which of this skill's reference files contain info you'll need, and cite them explicitly in the plan steps so they get loaded at the right time:
    • references/translation_workflow.md: full translation workflow, test repurposing, fidelity audit
    • references/testing_strategy.md: test layers, reference data generation, Layer 0 (repurpose original tests)
    • references/performance_patterns.md: pthreads, XorShift RNG, quickselect, pre-sorted indices
    • references/packaging_and_help.md: .toc/.pkg/.sthlp templates, build scripts
    • references/cpp_plugins.md: C++ wrapping, extern "C", exception safety, compilation
  4. Phase-by-phase steps with dependencies between them
  5. For each step: what gets built, what tests get written, and confirmation that the review loop runs before proceeding
  6. For translation projects: a final fidelity audit as the last step (see references/translation_workflow.md)
Implement sequentially across components, in parallel within each component. Once an interface is defined, dispatch independent sub-tasks as parallel subagents (e.g., C plugin implementation, .ado wrapper, and test suite can run simultaneously). Merge their work, run the full test suite, then proceed to the review loop before moving to the next component.
Run the review loop after every component:
  • Default: dispatch 2-3 review agents in parallel, ideally from different models (e.g., Claude + GPT + Gemini) for diversity of perspective. Use whatever multi-model tools are available in your environment.
  • If only one model is available: dispatch 2-3 agents with different review focuses (correctness, completeness, architecture). Different prompts approximate the diversity of different models.
  • Each agent reviews the diff, test results, and requirements — instruction: "List any gaps, bugs, or issues. Say LGTM if everything looks correct."
  • Fix all issues raised, re-dispatch, loop until all agents say LGTM. Then proceed.

Wrap First, Write From Scratch Second

When translating a package, always check for an existing C/C++ backend before writing any algorithm code. Many R packages have C++ in `src/`. Many Python packages have Cython or vendored C/C++ libraries. Standalone C++ libraries exist for string matching, linear algebra, tree algorithms, and more.
If a C++ implementation exists, wrap it. Do not reimplement the algorithm in C. Wrapping gives you identical output (same code path), production-grade performance, and a fraction of the code. The plugin is just a thin `extern "C"` glue layer between Stata's SDK and the library's API. Binary size is irrelevant: statically link everything (`-static-libstdc++ -static-libgcc`) and ship whatever size the binary turns out to be, even 10-15 MB on Windows. Users don't care about plugin file size; they care about correct results.
See references/cpp_plugins.md for the full pattern and references/translation_workflow.md for the workflow. Working examples of this approach (wrapping C++ backends, multi-plugin dispatching, save/load for scoring on new data) can be found in the repos listed in the project CLAUDE.md under "Example Applications."
For translation projects, also: repurpose the original package's test suite and data (see references/testing_strategy.md, Layer 0), write additional Stata-specific tests, and end the plan with a multi-agent fidelity audit. See references/translation_workflow.md for the complete workflow.

The Plugin SDK

Download `stplugin.h` and `stplugin.c` from: https://www.stata.com/plugins/
These two files define the interface between your C code and Stata:
| Function/Macro | Purpose |
|---|---|
| `SF_vdata(var, obs, &val)` | Read variable value (1-indexed!) |
| `SF_vstore(var, obs, val)` | Write variable value (1-indexed!) |
| `SF_nobs()` | Number of observations in current dataset |
| `SF_nvar()` | Number of variables in the entire dataset (not just plugin call) |
| `SF_is_missing(val)` | Check for Stata missing value (`.`) |
| `SV_missval` | The missing value constant |
| `SF_display(msg)` | Print informational text in Stata |
| `SF_error(msg)` | Print red error text in Stata |
Indexing is 1-based. Both variable indices and observation indices start at 1, not 0. Off-by-one errors here are silent and catastrophic — you read the wrong variable's data with no warning.
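One way to keep the 1-based SDK indices and 0-based C buffers from drifting apart is to route every conversion through a single checked helper. The sketch below is illustrative only: `rowmajor_offset` is a hypothetical name, not part of the Stata SDK, and it assumes the plugin copies data into a row-major buffer of `n_obs` rows by `p` columns.

```c
#include <stddef.h>

/* Offset into a row-major buffer holding n_obs rows of p columns,
 * given Stata-style 1-based indices. Returns (size_t)-1 when either
 * index is out of range, so callers can fail loudly instead of
 * silently reading a neighbouring variable's data. */
static size_t rowmajor_offset(int obs, int col, int n_obs, int p) {
    if (obs < 1 || obs > n_obs || col < 1 || col > p) {
        return (size_t)-1;
    }
    return (size_t)(obs - 1) * (size_t)p + (size_t)(col - 1);
}
```

Inside a read loop, the value fetched for 1-based `(col, obs)` would be stored at `buf[rowmajor_offset(obs, col, n_obs, p)]` after checking the sentinel.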

Memory Safety

A crash in your plugin kills the entire Stata session. No save prompt, no recovery. The user loses all unsaved work. This is the single most important thing to internalize.
  • Check every `malloc()` / `calloc()` return for `NULL`
  • Validate `argc` before accessing `argv[]`
  • Build with `-fsanitize=address` during development
  • Test on small data first, scale up gradually
  • Pre-allocate all memory upfront in `stata_call()`, free at the end
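The checklist above can be sketched as one allocation helper plus a single cleanup function shared by every exit path. The names `alloc_matrix`, `workspace`, `workspace_init`, and `workspace_free` are hypothetical, chosen for illustration; only the 909 return code comes from this document.

```c
#include <stdint.h>
#include <stdlib.h>

/* calloc wrapper that rejects non-positive dimensions and sizes that
 * would overflow size_t. Returns NULL on any failure. */
static double *alloc_matrix(long n_rows, long n_cols) {
    if (n_rows <= 0 || n_cols <= 0) return NULL;
    if ((size_t)n_rows > SIZE_MAX / sizeof(double) / (size_t)n_cols) return NULL;
    return calloc((size_t)n_rows * (size_t)n_cols, sizeof(double));
}

/* Hypothetical workspace: everything the plugin needs, allocated upfront. */
typedef struct {
    double *X;
    double *y;
    double *pred;
} workspace;

/* Single cleanup function used by every exit path. */
static void workspace_free(workspace *w) {
    free(w->X);      /* free(NULL) is a no-op */
    free(w->y);
    free(w->pred);
    w->X = w->y = w->pred = NULL;
}

/* Returns 0 on success, 909 (insufficient memory) otherwise. */
static int workspace_init(workspace *w, long n_obs, long p) {
    w->X    = alloc_matrix(n_obs, p);
    w->y    = alloc_matrix(n_obs, 1);
    w->pred = alloc_matrix(n_obs, 1);
    if (!w->X || !w->y || !w->pred) {
        workspace_free(w);
        return 909;
    }
    return 0;
}
```

Because all buffers live in one struct, an early return after any later failure only needs one `workspace_free()` call, which makes it harder to leak or double-free on an error path.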

The stata_call() Entry Point

Every plugin implements one entry-point function, `stata_call()`. Plugins can also be written in C++: the entry point just needs `extern "C"` linkage so Stata can find it; everything else can be full C++. The obvious case for C++ is when existing C++ code is available to wrap (e.g., an R package's `src/` directory). C++ also helps when you need complex data structures or threading via `std::thread`. For practical C++ guidance (the `extern "C"` pattern, exception safety, compilation commands, wrapping libraries) see references/cpp_plugins.md. The rest of this file focuses on C because it's the simpler default.
c
#include <stdlib.h>   // atoi, calloc, free
#include "stplugin.h"

// Placeholder for your algorithm (defined elsewhere); returns 0 on success.
int my_algorithm(const double *X, const double *y, double *pred,
                 int n_train, int n_test, int p, int seed);

// For C++ plugins, wrap the entry point with extern "C":
//   extern "C" {
//     STDLL stata_call(int argc, char *argv[]) { ... }
//   }

STDLL stata_call(int argc, char *argv[]) {
    // 0. Validate arguments BEFORE accessing argv[]
    if (argc < 4) {
        SF_error("myplugin requires 4 arguments: n_train n_test seed p\n");
        return 198;  // Stata's "syntax error" code
    }

    // 1. Parse arguments (all strings; use atoi/atof)
    int n_train = atoi(argv[0]);
    int n_test  = atoi(argv[1]);
    int seed    = atoi(argv[2]);

    // 2. Get dimensions
    ST_int nobs = SF_nobs();
    // CAUTION: SF_nvar() returns ALL variables in the dataset, not just
    // the ones passed to `plugin call`. If the .ado creates tempvars
    // (touse, merge_id, etc.) the count will be higher than expected.
    // Pass the variable count via argv instead of relying on SF_nvar().
    int p = atoi(argv[3]);  // feature count, passed explicitly
    if (p < 1)    { SF_error("myplugin: p must be >= 1\n"); return 198; }
    if (nobs < 1) { SF_error("myplugin: no observations\n"); return 2000; }

    // 3. Allocate memory (size_t arithmetic avoids int overflow in nobs * p)
    double *X    = calloc((size_t)nobs * (size_t)p, sizeof(double));
    double *y    = calloc((size_t)nobs, sizeof(double));
    double *pred = calloc((size_t)nobs, sizeof(double));
    if (!X || !y || !pred) {
        SF_error("myplugin: out of memory\n");
        free(X); free(y); free(pred);  // free(NULL) is a no-op
        return 909;
    }

    // 4. Read data from Stata (1-indexed!)
    ST_double val;
    for (ST_int obs = 1; obs <= nobs; obs++) {
        SF_vdata(1, obs, &val);      // var 1 = depvar
        y[obs-1] = val;
        for (int j = 0; j < p; j++) {
            SF_vdata(j + 2, obs, &val);  // vars 2..p+1 = features
            X[(size_t)(obs-1) * p + j] = val;
        }
    }

    // 5. Run your algorithm
    int rc = my_algorithm(X, y, pred, n_train, n_test, p, seed);
    if (rc != 0) {
        SF_error("myplugin: algorithm failed\n");
        free(X); free(y); free(pred);
        return 909;
    }

    // 6. Write results back to Stata
    const ST_int out_var = p + 2;  // last var in the plugin call varlist = output
    for (ST_int obs = 1; obs <= nobs; obs++) {
        SF_vstore(out_var, obs, pred[obs-1]);
    }

    free(X); free(y); free(pred);
    return 0;  // 0 = success
}
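`atoi()` returns 0 for non-numeric input and has undefined behavior on overflow, so a malformed argument can slip through unnoticed. A stricter alternative, sketched here with the hypothetical helper name `parse_int_arg`, uses `strtol()` and rejects anything that is not a clean base-10 integer.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a base-10 int from a plugin argument string.
 * Returns 0 on success and stores the value in *out; returns 198
 * (Stata's syntax-error code) for NULL, empty, non-numeric,
 * trailing-junk, or out-of-range input. *out is untouched on failure. */
static int parse_int_arg(const char *s, int *out) {
    char *end = NULL;
    long v;

    if (s == NULL || *s == '\0') return 198;
    errno = 0;
    v = strtol(s, &end, 10);
    if (errno == ERANGE || *end != '\0' || v < INT_MIN || v > INT_MAX) {
        return 198;
    }
    *out = (int)v;
    return 0;
}
```

In `stata_call()`, each `atoi(argv[k])` could then become `if (parse_int_arg(argv[k], &value)) return 198;`, after the `argc` check has confirmed that `argv[k]` exists.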

Return Codes

  • 0: success
  • 198: syntax error (bad arguments)
  • 909: insufficient memory
  • 601: file not found
  • Any non-zero return value triggers a Stata error

The .ado Wrapper Pattern

Users never call `plugin call` directly. An `.ado` file provides the Stata-native interface.

The Preserve/Merge Pattern

This is the core pattern for plugins that operate on a subset of data:
stata
program define mycommand, rclass
    syntax varlist(min=2) [if] [in], GENerate(name) [SEED(integer 12345) REPlace]

    gettoken depvar indepvars : varlist
    local p : word count `indepvars'   // feature count, passed to the plugin

    if "`replace'" != "" {
        capture drop `generate'
    }
    confirm new variable `generate'

    // Mark sample: novarlist ALLOWS missing depvar (critical for imputation)
    marksample touse, novarlist
    markout `touse' `indepvars'   // but DO exclude missing predictors

    // Stable merge key: create BEFORE any sorting or subsetting
    tempvar merge_id
    quietly gen long `merge_id' = _n

    // Count subsets
    quietly count if `touse' & !missing(`depvar')
    local n_train = r(N)
    quietly count if `touse' & missing(`depvar')
    local n_test = r(N)

    // Create output variable (all missing initially)
    quietly gen double `generate' = .

    // Preserve, subset, call plugin
    preserve
    quietly keep if `touse'

    // Sort if plugin requires it (donors first, test second);
    // merge_id as tie-breaker keeps the order deterministic
    tempvar sort_order
    quietly gen `sort_order' = missing(`depvar')
    quietly sort `sort_order' `merge_id'

    // Call plugin: varlist is depvar, features, output; args end with p
    plugin call myplugin `depvar' `indepvars' `generate', ///
        `n_train' `n_test' `seed' `p'

    // Save results and restore
    tempfile results
    quietly keep `merge_id' `generate'
    quietly save `results'
    restore

    // Merge predictions back (update replaces missing with non-missing)
    quietly merge 1:1 `merge_id' using `results', nogenerate update
end
Why update works: the generate() variable is all-missing before preserve and still all-missing after restore, so the update option fills each missing value with the non-missing prediction from the merge file. The replace option is handled earlier via capture drop, so by merge time the variable is always freshly created.

Plugin Sorting Contract

CRITICAL: Some plugins expect data sorted a specific way (training rows first, test rows second). Others handle missing data internally. Sorting mismatches are among the most dangerous bugs — the plugin silently reads the wrong data, producing garbage output with no error message. A mismatched sort order can drop prediction quality dramatically (e.g., correlation going from 0.99 to 0.38) because the plugin treats test observations as training data and vice versa.
  • If the plugin checks `SF_is_missing()` internally: do NOT sort in the .ado wrapper
  • If the plugin expects `n_train` contiguous rows followed by `n_test` rows: sort by `missing(depvar)` before calling
Document which pattern your plugin uses.
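For plugins that rely on the "training rows first" layout, a cheap defensive check at the top of the plugin can turn a silent mismatch into an explicit error. The following is a minimal sketch under stated assumptions: `check_train_test_layout` and `nan_is_missing` are hypothetical names, and missingness is modelled as NaN only so the example compiles without the SDK. Inside a real plugin the predicate would wrap `SF_is_missing()`.

```c
#include <math.h>

/* Stand-in for SF_is_missing() so the sketch builds without the SDK.
 * ASSUMPTION: missing is modelled as NaN here; in a plugin, pass a
 * thin wrapper around SF_is_missing() instead. */
static int nan_is_missing(double v) {
    return isnan(v) ? 1 : 0;
}

/* Verify the "n_train observed rows, then n_test missing rows" layout.
 * Returns 0 if the layout holds, otherwise the 1-based index of the
 * first offending row. */
static int check_train_test_layout(const double *y, int n_train, int n_test,
                                   int (*is_missing)(double)) {
    int i;
    for (i = 0; i < n_train + n_test; i++) {
        int expect_missing = (i >= n_train);
        if ((is_missing(y[i]) != 0) != expect_missing) {
            return i + 1;
        }
    }
    return 0;
}
```

A non-zero result could be reported through `SF_error()` and returned as an error code before any model fitting starts.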

Plugin Loading (Cross-Platform)

Use the gtools-style OS detection pattern. It detects the OS via c(os) and constructs a bare filename, which Stata resolves via the adopath; this is reliable across all platforms.
stata
/* ---- Load plugin (gtools-style: detect OS, bare filename) ---- */
if ( inlist("`c(os)'", "MacOSX") | strpos("`c(machine_type)'", "Mac") ) local c_os_ macosx
else local c_os_: di lower("`c(os)'")

cap program drop myplugin
program myplugin, plugin using("myplugin_`c_os_'.plugin")
This resolves to myplugin_macosx.plugin, myplugin_windows.plugin, or myplugin_unix.plugin depending on platform.
WARNING: DO NOT use findfile + absolute paths. The following pattern is BROKEN on Windows and must never be used:
stata
* BROKEN — DO NOT USE
capture findfile myplugin.plugin
capture program myplugin, plugin using("`r(fn)'")
findfile returns an absolute path (e.g., C:\ado\plus\m\myplugin.plugin). On Windows, Stata's LoadLibrary call fails when given certain absolute paths via using(). The gtools-style pattern avoids this by passing a bare filename (no path), which Stata resolves via the adopath, exactly how gtools, ftools, and other major packages work.
Similarly, do not use a nested if/else cascade trying each platform-arch suffix. This was the old pattern in several packages and fails for the same reason if findfile is involved, plus it's fragile and verbose.
Plugin file naming: pluginname_os.plugin, where os is one of macosx, unix, windows. Examples: qrf_plugin_macosx.plugin, grf_plugin_windows.plugin.
Note: clear all wipes loaded plugin definitions. If a test script starts with clear all, all program ... plugin definitions are gone and must be reloaded.

Cross-Platform Compilation

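Build one binary per OS suffix: macOS, Linux, and Windows. Note that Rosetta runs x86_64 code on ARM Macs, not the reverse, so the arm64-only macOS build shown in the table will not load in Stata on an Intel Mac; to cover both architectures with a single file, build a universal binary (`-arch arm64 -arch x86_64`). Install the Windows cross-compiler first: `brew install mingw-w64`.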
| Target OS | Output name suffix | Compiler | -D flag | Link flag | pthreads |
|---|---|---|---|---|---|
| macOS (ARM64) | `_macosx` | `gcc -arch arm64` | `-DSYSTEM=APPLEMAC` | `-bundle` | `-pthread` |
| Linux (x86_64) | `_unix` | `gcc` | `-DSYSTEM=OPUNIX` | `-shared` | `-pthread` |
| Windows (x86_64) | `_windows` | `x86_64-w64-mingw32-gcc` | `-DSYSTEM=STWIN32` | `-shared` | `-lwinpthread` |
All platforms: `-O3 -fPIC` for release, add `-g -fsanitize=address` for development.
For C++ plugins: use `g++` instead of `gcc`. Add `-std=c++` followed by the version the library requires (e.g., `-std=c++17`; check its docs, since C++11, C++14, and C++17 are all common). Header-only C++ libraries can be vendored into `c_source/` and included with `-I.`. Always use `-static-libstdc++ -static-libgcc` on Windows and Linux.
Naming convention: `pluginname_os.plugin` (e.g., `qrf_plugin_macosx.plugin`, `grf_plugin_windows.plugin`). The `os` suffix must match what the gtools-style loader produces: `macosx`, `unix`, or `windows`.
macOS note: use `-bundle`, NOT `-shared`. This is a common mistake.

Linux from macOS (Docker Required)

There is no native Linux cross-compiler on macOS. Use Docker via Colima (`brew install colima docker`, then `colima start`). Build with a one-liner:
bash
docker run --rm --platform linux/amd64 -v "$(pwd):/build" -w /build ubuntu:18.04 \
    bash -c "apt-get update -qq && apt-get install -y -qq g++ gcc make > /dev/null 2>&1 && make linux"
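glibc compatibility: Build on Ubuntu 18.04 for maximum compatibility. The resulting binary typically requires only GLIBC 2.14 and works on most Linux systems from ~2012 onward; the exact minimum depends on which libc symbols the plugin references. Building on Ubuntu 22.04+ produces binaries that require GLIBC 2.34, which excludes RHEL 8, Ubuntu 20.04, and many HPC environments.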

Performance Optimization

See
references/performance_patterns.md
for detailed code examples of:
  1. Pre-sorted feature indices: Sort feature values once, scan linearly at each tree node. O(n) per split instead of O(n log n).
  2. Precomputed distance norms: Exploit ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a'b for KNN.
  3. Quickselect: O(n) partial sort for finding k-th nearest neighbor.
  4. Parallel ensemble training (pthreads): Train multiple models concurrently. Each thread gets its own data copy and RNG state. Never call Stata SDK functions (SF_vdata, SF_vstore, SF_display) from worker threads: read all data on the main thread first, dispatch computation to workers, and write results back on the main thread after joining.
  5. XorShift RNG: C plugins cannot access Stata's internal RNG (runiform()). XorShift128+ is fast, statistically sound, and thread-safe as long as each thread owns its own state. Seed from argv[] for reproducibility.
  6. Dense arrays for trees: Flat node arrays instead of linked lists for cache locality.
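As an illustration of item 5, here is a minimal per-thread XorShift128+ sketch. It is not taken from references/performance_patterns.md: the type and function names (`xs128p_state`, `xs128p_seed`, `xs128p_next`, `xs128p_uniform`) are hypothetical, and expanding the seed with SplitMix64 plus an arbitrary odd multiplier for the thread id is an assumption made here, not something the original prescribes.

```c
#include <stdint.h>

/* One independent generator state per worker thread. */
typedef struct {
    uint64_t s[2];
} xs128p_state;

/* SplitMix64 step, used only to expand a seed into state words. */
static uint64_t splitmix64(uint64_t *x) {
    uint64_t z = (*x += UINT64_C(0x9E3779B97F4A7C15));
    z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
    return z ^ (z >> 31);
}

/* Derive a distinct state for each (seed, thread_id) pair.
 * The multiplier is an arbitrary odd constant (assumption). */
static void xs128p_seed(xs128p_state *st, uint64_t seed, uint64_t thread_id) {
    uint64_t x = seed ^ (thread_id * UINT64_C(0xD1B54A32D192ED03));
    st->s[0] = splitmix64(&x);
    st->s[1] = splitmix64(&x);
    if (st->s[0] == 0 && st->s[1] == 0) {
        st->s[1] = 1;  /* the all-zero state is invalid for xorshift */
    }
}

/* XorShift128+ step with shift constants 23, 18, 5. */
static uint64_t xs128p_next(xs128p_state *st) {
    uint64_t s1 = st->s[0];
    const uint64_t s0 = st->s[1];
    const uint64_t result = s0 + s1;
    st->s[0] = s0;
    s1 ^= s1 << 23;
    st->s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
    return result;
}

/* Uniform double in [0, 1) built from the top 53 bits. */
static double xs128p_uniform(xs128p_state *st) {
    return (double)(xs128p_next(st) >> 11) * (1.0 / 9007199254740992.0);
}
```

Each worker would call `xs128p_seed()` once with the seed parsed from `argv[]` and its own thread index, then draw from its private state with no locking.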

Debugging

Debugging is hard because you can't attach a debugger to Stata's plugin host.

Strategies

  1. Printf via SF_display():
    c
    char buf[256];
    snprintf(buf, sizeof(buf), "Debug: n=%d, p=%d\n", n, p);
    SF_display(buf);
  2. Write diagnostic files:
    c
    FILE *f = fopen("plugin_debug.log", "w");
    if (f != NULL) {   // never dereference a failed fopen
        fprintf(f, "value at [%d][%d] = %f\n", i, j, val);
        fclose(f);
    }
  3. Test standalone first. Write a main() that reads CSV and calls your algorithm. Debug with normal tools (gdb, valgrind, sanitizers). Then adapt for the plugin interface.
  4. Build with sanitizers during development: -g -fsanitize=address
  5. Check SF_vdata() return values. It returns a return code (ST_retcode; 0 = success). Non-zero means an invalid obs/var index.
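For the standalone harness in strategy 3, the CSV-reading side can stay very small. The sketch below shows one possible row parser; `parse_csv_row` is a hypothetical helper, and it assumes purely numeric, unquoted fields, which is usually enough for test fixtures exported from Stata.

```c
#include <stdlib.h>

/* Parse one comma-separated line of numbers into out[0..max_cols-1].
 * Returns the number of fields parsed, or -1 on a malformed field or
 * when the line holds more than max_cols fields. */
static int parse_csv_row(const char *line, double *out, int max_cols) {
    int n = 0;
    const char *p = line;

    for (;;) {
        char *end;
        double v;

        if (n >= max_cols) return -1;
        v = strtod(p, &end);
        if (end == p) return -1;              /* nothing numeric here */
        out[n++] = v;
        while (*end == ' ' || *end == '\r' || *end == '\n') end++;
        if (*end == '\0') return n;           /* end of line */
        if (*end != ',') return -1;           /* junk after the number */
        p = end + 1;
    }
}
```

A harness `main()` would read lines with `fgets()`, feed each to this parser, fill the same row-major `X` and `y` buffers the plugin uses, and call the algorithm directly so that gdb, valgrind, and sanitizers all work normally.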

Common Failure Modes

| Symptom | Likely Cause |
|---|---|
| Stata crashes silently | Segfault: buffer overflow, bad argv access, NULL dereference |
| Plugin returns all missing | Wrong variable count, wrong obs indexing, plugin not loaded |
| Results are garbage | Sorting mismatch, 0-vs-1 indexing error, unnormalized inputs |
| "plugin not found" | Wrong filename, `clear all` wiped definition, wrong platform |
| Works on Mac, fails on Linux | Integer size difference; use `int32_t` / `int64_t` from `<stdint.h>` |

Packaging and Distribution

Use platform-specific .pkg files so users only download the binary for their OS. Stata's net install has no conditional logic, so the way to avoid shipping every binary to every user is to offer separate packages per platform. All packages install the same .ado and .sthlp files; only the .plugin binary differs.
mypackage/
├── stata.toc                          # lists all package variants
├── mypackage.pkg                      # all platforms (for users who don't care)
├── mypackage_mac.pkg                  # macOS only
├── mypackage_linux.pkg                # Linux only
├── mypackage_win.pkg                  # Windows only
├── mycommand.sthlp                    # overview help file (short name!)
├── mycommand.ado                      # user-facing command
├── myplugin_macosx.plugin
├── myplugin_unix.plugin
├── myplugin_windows.plugin
└── c_source/                          # NOT distributed, for building
    ├── build.py
    ├── stplugin.c
    ├── stplugin.h
    └── algorithm.c
Users install their platform's package:
stata
* macOS
net install mypackage_mac, from("https://raw.githubusercontent.com/user/repo/main") replace
* Linux
net install mypackage_linux, from("https://raw.githubusercontent.com/user/repo/main") replace
* Windows
net install mypackage_win, from("https://raw.githubusercontent.com/user/repo/main") replace
All platform binaries ship via the all-platform .pkg, or users can install platform-specific packages. Stata loads only the matching plugin at runtime via gtools-style OS detection. Windows C++ binaries can be 10-15MB due to static linking, which is normal.
See references/packaging_and_help.md for .toc, .pkg, and .sthlp templates and SMCL formatting.

Common Pitfalls

  1. Sorting destroys merge keys. A merge key generated from _n after sorting or subsetting inside preserve/restore no longer matches the original rows, so the merge-back misaligns. Always create merge_id BEFORE preserve.
  2. 1-indexed everything. In SF_vdata(var, obs, &val), both var and obs start at 1. Off-by-one errors are silent.
  3. marksample excludes missing by default. For imputation (where missing depvar IS the point), use marksample touse, novarlist.
  4. macOS c(os) returns "MacOSX". Use the gtools pattern, inlist("`c(os)'", "MacOSX") | strpos("`c(machine_type)'", "Mac"), to detect Mac. For other platforms, lower(c(os)) gives "windows" or "unix".
  5. argv[] has no bounds checking. Accessing argv[3] when argc == 2 is a segfault. Always check argc first.
  6. clear all wipes plugins. Reload plugin definitions after clear all in test scripts.
  7. Only the first program define in a .ado file is auto-discovered. Subprograms need their own .ado files or an explicit run to load.
  8. Normalize inputs when the algorithm requires it (neural networks, gradient-based methods, distance-based methods like KNN). Scale to mean=0, sd=1 in the .ado wrapper and denormalize predictions afterwards. The plugin should receive clean, normalized data; let the .ado handle the scaling.
  9. pthreads on Windows needs -lwinpthread. Use conditional linker flags.
  10. Memory errors crash Stata with no recovery. Pre-allocate everything, check every allocation, build with sanitizers during development.
  11. glibc version mismatch. Building Linux plugins on a modern distro produces binaries that won't load on older systems. Use Ubuntu 18.04 in Docker for maximum compatibility.
  12. SF_nvar() returns total dataset variables. It counts ALL variables in the dataset, not just the ones in the plugin call varlist. If the .ado creates tempvars (touse, merge_id, sort keys), the count will be higher than expected. Never use SF_nvar() to validate argument counts; pass the expected count via argv instead.
  13. findfile + absolute paths breaks on Windows. findfile returns an absolute path that Stata's LoadLibrary can't resolve on Windows. Use the gtools-style OS detection pattern instead (see the Plugin Loading section above); it constructs a bare filename that Stata resolves via the adopath.

Naming Conventions

  • Use method() not model() for method selection options
  • Use generate() (abbreviation gen()) for output variable naming
  • Use replace as a flag option, not replace()
  • Plugin files: algorithm_plugin_os.plugin where os is macosx, unix, or windows
  • .ado files: lowercase, underscores for multi-word
  • Stata option convention: options lowercase, abbreviations capitalized (GENerate, MAXDepth)
  • Target Stata 14.0+ (version 14.0) for plugin support
  • Help files use the short command name, not the repo name. If the repo is called mypackage_stata, the overview help file should still be mypackage.sthlp (so help mypackage works). Don't append "stata" to help file or command names; the user is already in Stata.