survey-analysis
survy - Survey Data Analysis Skill

Install: Always install the latest version: `pip install --upgrade survy`

Powered by: Polars (all DataFrames returned are Polars, not pandas)

1. Core Objects
Survey
Top-level container. Created via `read_*` functions; never instantiate directly.
Access variables with `survey["Q1"]`. Print the object for a compact summary.

Variable
Wraps a single Polars Series plus survey metadata. Key attributes:

| Attribute | Type | Description |
|---|---|---|
| `id` | | Column name (read/write via property) |
| `label` | | Human-readable label (read/write); has a default when unset |
| `vtype` | | Variable type: SELECT, MULTISELECT, or NUMBER |
| `value_indices` | | Answer code → numeric index mapping; always empty for NUMBER variables |
| `base` | | Count of non-null/non-empty responses |
| | | Total row count including nulls |
| | | Underlying Polars dtype |
| `frequencies` | | Frequency table (value, count, proportion) |
| `sps` | | SPSS syntax for this variable |
2. Reading Data
All readers return a `Survey` object. The key challenge survy solves at read time is recognizing multiselect questions: questions where one respondent can choose multiple answers. Raw data encodes these in two very different layouts, and survy needs to know which layout it's looking at so it can merge the data into a single logical variable.
Multiselect: Compact Format vs Wide Format
Compact format stores all selected answers in a single cell, joined by a separator (typically `;`). One column = one question.

```
id, gender, hobby
1, Male, Sport;Book
2, Female, Sport;Movie
3, Male, Movie
```

Here `hobby` is one column. The cell `"Sport;Book"` means the respondent chose both Sport and Book. survy splits this cell on the separator to recover the individual choices.

Wide format spreads each possible answer across its own column, using a shared prefix plus a numeric suffix (`_1`, `_2`, ...). Multiple columns = one question.

```
id, gender, hobby_1, hobby_2, hobby_3
1, Male, Book, , Sport
2, Female, , Movie, Sport
3, Male, , Movie,
```

Here `hobby_1`, `hobby_2`, `hobby_3` are three columns that together represent the single `hobby` question. survy groups them by matching the prefix pattern and merges them into one multiselect variable named `hobby`.

After reading, both formats produce the exact same Survey variable internally: a `MULTISELECT` variable whose data is a sorted list of chosen values per respondent:

```
hobby: [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
```
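To make the equivalence of the two layouts concrete, here is a minimal plain-Python sketch. It does not call survy and is not survy's implementation; the helper names `from_compact` and `from_wide` are hypothetical and only mimic the merge described above.

```python
# Illustration only: plain-Python stand-ins, not survy internals.

def from_compact(cells, separator=";"):
    """Split each compact cell on the separator and sort the choices."""
    return [sorted(part for part in cell.split(separator) if part) for cell in cells]


def from_wide(rows):
    """Collect the non-empty entries of each wide row and sort them."""
    return [sorted(value for value in row if value) for row in rows]


# The "hobby" column from the compact sample
compact_hobby = ["Sport;Book", "Sport;Movie", "Movie"]

# hobby_1, hobby_2, hobby_3 from the wide sample, one inner list per respondent
wide_hobby = [
    ["Book", "", "Sport"],
    ["", "Movie", "Sport"],
    ["", "Movie", ""],
]

merged_compact = from_compact(compact_hobby)
merged_wide = from_wide(wide_hobby)
print(merged_compact == merged_wide)
```

Both helpers end in the same list-of-sorted-lists shape, which is why downstream code does not need to know which layout the raw file used.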
How survy detects each format
Wide format is detected via `name_pattern`, a format template (NOT a raw regex) with two named tokens and a set of reserved separators:
- Tokens: `id` (base variable name), `multi` (suffix for wide columns)
- Reserved separators: `_`, `.`, `:`. These are always treated as delimiters between tokens when parsing column names.

With the default pattern `"id(_multi)?"`:
- `hobby_1` → `id="hobby"`, `multi="1"` → grouped as wide multiselect
- `hobby_2` → same `id="hobby"` → merged with `hobby_1`
- `gender` → no suffix → normal column

Other patterns:
- `"id.multi"` → matches `Q1.1`, `Q1.2`, ...
- `"id:multi"` → matches `Q1:a`, `Q1:b`, ...

Separator conflict warning: If a column name contains more than one reserved separator (e.g. `my.var_1`), `parse_id` will fail because it can't unambiguously split the name into tokens. Before loading, rename such columns so only one separator is used (e.g. rename `my.var_1` to `myvar_1` or `my@var_1`).

Compact format is NOT detected by default because a semicolon in a cell could be regular text. You must tell survy which columns are compact in one of two ways:
- `compact_ids`: explicitly list the column IDs that are compact multiselect.
- `auto_detect=True`: survy scans every column for the `compact_separator` character; any column containing it in at least one cell is treated as compact.

Rule: Do NOT combine `auto_detect=True` with `compact_ids` in the same call.
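The wide-column grouping and the separator conflict can be sketched in plain Python. This is a rough approximation under stated assumptions, not survy's `parse_id`: the functions `split_column_name` and `group_wide_columns` are hypothetical, and treating any name with more than one reserved separator character as ambiguous is a guess at the rule described above.

```python
# Illustration only: a hypothetical parser, not survy's parse_id.
RESERVED_SEPARATORS = "_.:"


def split_column_name(name, separator="_"):
    """Return (id, multi); multi is None when the name has no suffix."""
    found = [ch for ch in name if ch in RESERVED_SEPARATORS]
    if len(found) > 1:
        # Assumption: more than one reserved separator cannot be split safely
        raise ValueError(f"ambiguous column name: {name!r}")
    if found and found[0] == separator:
        base, suffix = name.split(separator)
        return base, suffix
    return name, None


def group_wide_columns(columns, separator="_"):
    """Map each base id to the wide columns that share it."""
    groups = {}
    for column in columns:
        base, suffix = split_column_name(column, separator)
        if suffix is not None:
            groups.setdefault(base, []).append(column)
    return groups


columns = ["id", "gender", "hobby_1", "hobby_2", "hobby_3"]
groups = group_wide_columns(columns)
print(groups)
```

Calling `split_column_name("my.var_1")` raises `ValueError` in this sketch, which mirrors why such columns need renaming before load.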
read_spss
Reads an SPSS `.sav` file. SPSS files are always wide format; compact multiselect does not apply, and `compact_ids` / `auto_detect` are not parameters. Wide multiselect columns (e.g. `hobby_1`, `hobby_2`) are still auto-detected and merged via `name_pattern`. Value labels stored in the `.sav` file are applied automatically, so variables come back as text (`"Male"`, `"Female"`) rather than numeric codes. Requires `pyreadstat`.

```python
import survy

# Wide multiselect detected automatically
survey = survy.read_spss("data.sav")

# Custom suffix convention (Q1.1, Q1.2, ...)
survey = survy.read_spss("data.sav", name_pattern="id.multi")
```

**Rule**: Do NOT pass `compact_ids` or `auto_detect` to `read_spss`; those parameters don't exist.
Shared Reader Parameters (read_csv, read_excel, read_polars only)
These parameters control multiselect detection and apply to `read_csv`, `read_excel`, and `read_polars`. They do NOT apply to `read_json` (which reads survy's own format where variable types are already resolved).

| Parameter | Type | Default | Description |
|---|---|---|---|
| `compact_ids` | `list[str] \| None` | `None` | Column IDs to treat as compact multiselect |
| `compact_separator` | `str` | `";"` | Separator used to split compact cells |
| `auto_detect` | `bool` | `False` | Auto-detect compact columns by scanning for separator |
| `name_pattern` | `str` | `"id(_multi)?"` | Format template for wide column names. Tokens: `id`, `multi` |
read_csv / read_excel
```python
import survy

# --- Compact format data ---
# Option A: you know which columns are compact
survey = survy.read_csv("data_compact.csv", compact_ids=["hobby"], compact_separator=";")

# Option B: let survy scan for the separator automatically
survey = survy.read_csv("data_compact.csv", auto_detect=True, compact_separator=";")

# --- Wide format data ---
# Wide detection is automatic via name_pattern (default works for Q1_1, Q1_2, ...)
survey = survy.read_csv("data_wide.csv")

# Custom name_pattern if your columns use a different suffix convention
# (the value shown here is the default)
survey = survy.read_csv("data_wide.csv", name_pattern="id(_multi)?")

# --- Mixed: some columns are wide, some are compact ---
survey = survy.read_csv("data_mixed.csv", name_pattern="id(_multi)?", auto_detect=True)

# Excel: identical API to read_csv
survey = survy.read_excel("data.xlsx", auto_detect=True, compact_separator=";")
```
read_json
Reads a survy-format JSON file. The file must have this exact structure:

```json
{
    "variables": [
        {
            "id": "gender",
            "data": ["Male", "Female", "Male"],
            "label": "Gender of respondent",
            "value_indices": {"Female": 1, "Male": 2}
        },
        {
            "id": "yob",
            "data": [2000, 1999, 1998],
            "label": "",
            "value_indices": {}
        },
        {
            "id": "hobby",
            "data": [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]],
            "label": "Hobbies",
            "value_indices": {"Book": 1, "Movie": 2, "Sport": 3}
        }
    ]
}
```

Key rules for the JSON structure:
- Top-level key must be `"variables"` (a list of variable objects).
- Each variable must have `"id"`, `"data"`, `"label"`, and `"value_indices"`.
- SELECT variables: `"data"` is a flat list of strings (or nulls).
- NUMBER variables: `"data"` is a flat list of numbers; `"value_indices"` must be `{}`.
- MULTISELECT variables: `"data"` is a list of lists of strings.
- `"value_indices"` maps each answer text to a numeric index; only applied when non-empty.
- Read vs Write difference: `to_json()` writes an extra `"vtype"` field per variable (e.g. `"select"`, `"multi_select"`, `"number"`). `read_json()` ignores this field and re-infers the type from the data. So if you're building JSON manually, you can omit `"vtype"`.

```python
survey = survy.read_json("data.json")
```
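When building such a file by hand, a small pre-flight check can catch structural slips before the file reaches `read_json`. The sketch below uses only the standard library; `check_survy_json` is a hypothetical helper that encodes the rules listed above as read from this document, not a validator shipped with survy.

```python
import json

REQUIRED_KEYS = {"id", "data", "label", "value_indices"}


def check_survy_json(text):
    """Return a list of problems; an empty list means the basic rules hold."""
    doc = json.loads(text)
    variables = doc.get("variables") if isinstance(doc, dict) else None
    if not isinstance(variables, list):
        return ['top-level "variables" list is missing']
    problems = []
    for var in variables:
        missing = REQUIRED_KEYS - var.keys()
        if missing:
            problems.append(f"{var.get('id', '?')}: missing {sorted(missing)}")
            continue
        values = [v for v in var["data"] if v is not None]
        # Treat a flat list of ints/floats (not bools) as a NUMBER variable
        is_number = bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if is_number and var["value_indices"] != {}:
            problems.append(f"{var['id']}: NUMBER variables need empty value_indices")
    return problems


good = json.dumps({"variables": [
    {"id": "gender", "data": ["Male", "Female"], "label": "Gender",
     "value_indices": {"Female": 1, "Male": 2}},
    {"id": "yob", "data": [2000, 1999], "label": "", "value_indices": {}},
]})
bad = json.dumps({"variables": [
    {"id": "yob", "data": [2000, 1999], "label": "", "value_indices": {"2000": 1}},
    {"id": "hobby", "data": [["Book"], ["Movie"]]},
]})

good_problems = check_survy_json(good)
bad_problems = check_survy_json(bad)
print(bad_problems)
```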
read_polars
Construct a Survey from an existing Polars DataFrame. The extra parameter `exclude_null` (default `True`) drops columns with no responses or all-empty lists. `read_polars` uses the same wide/compact format concepts as `read_csv`.

```python
import polars
import survy

df = polars.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "yob": [2000, 1999, 1998],
    "hobby": ["Sport;Book", "Sport;Movie", "Movie"],  # compact multiselect
    "animal_1": ["Cat", "", "Cat"],                   # wide multiselect, column 1
    "animal_2": ["Dog", "Dog", ""],                   # wide multiselect, column 2
})
survey = survy.read_polars(df, auto_detect=True)
```
3. Modifying the Survey
survey.update(): batch label/value_indices

```python
survey.update([
    {"id": "Q1", "label": "Satisfaction", "value_indices": {"good": 1, "bad": 2}},
    {"id": "Q2", "label": "Channels used"},
])
```

Silently skips `value_indices` for NUMBER variables. Warns and skips unknown IDs.
survey.add(): add a variable
```python
survey.add(some_variable)                     # Variable object
survey.add(polars.Series("new", [1, 2, 3]))   # auto-wrapped into Variable
```

If the ID already exists, a numeric suffix is appended (e.g. `"Q1#1"`).
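The ID collision rule can be pictured with a short sketch. Only the `"Q1#1"` form comes from this document; the helper `unique_id` is hypothetical, and how survy numbers a second or third collision is an assumption here.

```python
# Illustration only: mimics the "#<n>" suffix described above.

def unique_id(new_id, existing_ids):
    """Return new_id unchanged, or with the first free '#<n>' suffix."""
    if new_id not in existing_ids:
        return new_id
    counter = 1
    # Assumption: the counter keeps increasing until the name is free
    while f"{new_id}#{counter}" in existing_ids:
        counter += 1
    return f"{new_id}#{counter}"


first = unique_id("Q1", {"Q1", "Q2"})
second = unique_id("Q1", {"Q1", "Q1#1"})
untouched = unique_id("Q3", {"Q1", "Q2"})
print(first, second, untouched)
```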
survey.drop(): remove a variable
```python
survey.drop("Q3")  # silently ignored if not found
```
survey.sort(): reorder variables
```python
survey.sort()                                     # alphabetical by id (default)
survey.sort(key=lambda v: v.base, reverse=True)   # by response count, descending
```
variable.replace(): recode values
```python
survey["gender"].replace({"Male": "M", "Female": "F"})
```

Works for both SELECT and MULTISELECT. Automatically rebuilds `value_indices`.
Direct property assignment
```python
v = survey["Q1"]
v.id = "satisfaction"
v.label = "Overall satisfaction"
v.value_indices = {"very_satisfied": 1, "satisfied": 2, "neutral": 3}
```

Caution on the `value_indices` setter: it raises `DataStructureError` if any existing value in the data is missing from the new mapping. You must cover ALL values present in the data.
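One way to avoid that error is to compare the data against the proposed mapping before assigning it. The sketch below is plain Python on raw lists; `missing_from_mapping` is a hypothetical helper, not part of survy, and it assumes SELECT cells are strings (or `None`) and MULTISELECT cells are lists.

```python
# Illustration only: a pre-check to run before assigning value_indices.

def missing_from_mapping(data, new_mapping):
    """Return the data values that the proposed mapping does not cover."""
    present = set()
    for cell in data:
        if cell is None:
            continue
        # MULTISELECT cells are lists; SELECT cells are single values
        present.update(cell if isinstance(cell, list) else [cell])
    return sorted(present - set(new_mapping))


answers = ["very_satisfied", "neutral", None, "dissatisfied"]
proposed = {"very_satisfied": 1, "satisfied": 2, "neutral": 3}

gaps = missing_from_mapping(answers, proposed)
print(gaps)
```

An empty result suggests the assignment should be accepted; anything listed needs an index first.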
4. Filtering
Returns a new Survey (original is not mutated).

```python
filtered = survey.filter("hobby", ["Sport", "Book"])
filtered = survey.filter("gender", "Male")  # single value also works
```

For MULTISELECT, a row is kept if any of its selected values appears in the filter list.
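The "any selected value matches" rule can be shown on raw lists. This is a plain-Python sketch of the semantics only; `keep_row` is a hypothetical name and the code does not touch survy.

```python
# Illustration only: the any-match rule applied to raw Python data.

def keep_row(cell, wanted):
    """True when at least one selected value is in the filter list."""
    if isinstance(wanted, str):
        wanted = [wanted]  # a single value behaves like a one-item list
    selected = cell if isinstance(cell, list) else [cell]
    return any(value in wanted for value in selected)


hobby = [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
gender = ["Male", "Female", "Male"]

kept_hobby_rows = [i for i, cell in enumerate(hobby) if keep_row(cell, ["Sport", "Book"])]
kept_gender_rows = [i for i, cell in enumerate(gender) if keep_row(cell, "Male")]
print(kept_hobby_rows, kept_gender_rows)
```

The third respondent chose only Movie, so that row drops out of the hobby filter even though the other two rows each match on a single value.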
5. Getting a DataFrame
```python
df = survey.get_df(
    select_dtype="text",           # "text" | "number"
    multiselect_dtype="compact",   # "compact" | "text" | "number"
)
```

`select_dtype`: `"text"` or `"number"` (presumably the codes from `value_indices`).

`multiselect_dtype`:
- `"compact"` → one `List[str]` column per multiselect (default)
- `"text"` → expands to wide columns `Q_1`, `Q_2`, ... with string or `null`
- `"number"` → expands to wide columns with `1`/`0` binary flags

Returns a Polars DataFrame; use Polars methods, not pandas.
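The difference between the `"text"` and `"number"` expansions is easier to see on a tiny example. The sketch below works on plain lists and dicts; `expand` is a hypothetical helper, and numbering the columns by each answer's index in `value_indices` is an assumption rather than documented survy behavior.

```python
# Illustration only: wide expansion of one multiselect variable.

def expand(rows, value_indices, name, as_number):
    """Build one wide column per answer option, ordered by its index."""
    columns = {}
    for text, index in sorted(value_indices.items(), key=lambda item: item[1]):
        if as_number:
            columns[f"{name}_{index}"] = [1 if text in row else 0 for row in rows]
        else:
            columns[f"{name}_{index}"] = [text if text in row else None for row in rows]
    return columns


hobby = [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
indices = {"Book": 1, "Movie": 2, "Sport": 3}

as_text = expand(hobby, indices, "hobby", as_number=False)
as_flags = expand(hobby, indices, "hobby", as_number=True)
print(as_flags)
```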
Valid dtype literals: `"text"`, `"number"`, `"compact"`. Never `"numeric"` or `"string"`.

6. Analysis
Frequency table
```python
survey["Q1"].frequencies
# → Polars DataFrame: columns [variable_id, "count", "proportion"]
```
Crosstab
```python
result = survy.crosstab(
    column=survey["gender"],   # grouping variable (columns)
    row=survey["hobby"],       # analyzed variable (rows)
    filter=None,               # optional: segment by another variable
    aggfunc="count",           # "count" | "percent" | "mean" | "median" | "sum"
    alpha=0.05,                # significance level for stat tests
)
# Returns dict[str, polars.DataFrame]
# Key is "Total" when no filter, or each filter-value when filter is provided
```

**aggfunc options**:
- `"count"`: cell counts with significance letter labels (z-test)
- `"percent"`: column-wise proportions with significance labels
- Numeric (`"mean"`, `"median"`, `"sum"`): aggregates row variable; Welch's t-test for significance

**filter**: Pass a Variable to segment the crosstab into one table per filter value.
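For readers unfamiliar with the z-test behind the significance letters, here is a textbook two-proportion z-test using only the standard library. It is a generic sketch: whether survy pools the variance, applies corrections, or compares columns pairwise in exactly this way is not stated in this document, and `two_proportion_z_test` is a hypothetical name.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Pooled two-sided z-test for the difference between two proportions."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # identical all-or-nothing columns: no evidence of a difference
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value


# Hypothetical counts: 60 of 100 men and 40 of 100 women chose "Sport"
z, p_value = two_proportion_z_test(60, 100, 40, 100)
alpha = 0.05
significant = p_value < alpha
print(round(z, 3), significant)
```

With `alpha=0.05`, a cell would be flagged when `p_value` falls below that threshold.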
7. Exporting
All exports take a directory path (not a file path) plus an optional `name` (base filename).
to_csv / to_excel
Writes three files per export:
- `{name}_data.csv`: the actual survey responses. Format depends on the `compact` param.
- `{name}_variables_info.csv`: variable metadata with columns `id`, `vtype` (SINGLE/MULTISELECT/NUMBER), `label`.
- `{name}_values_info.csv`: value-to-index mappings with columns `id`, `text`, `index`.

The `compact` parameter (default `False`) controls how multiselect variables appear in the data file: `True` joins values into one cell (e.g. `"Book;Sport"`), `False` expands into wide columns (e.g. `hobby_1`, `hobby_2`, `hobby_3`).

```python
# Default (compact=False): multiselect expanded to wide columns
survey.to_csv("output/", name="results")

# Compact mode: multiselect joined into single cells
survey.to_csv("output/", name="results", compact=True, compact_separator=";")

# Excel: identical API and output structure (.xlsx files instead of .csv)
survey.to_excel("output/", name="results")
survey.to_excel("output/", name="results", compact=True)
```
to_spss
Writes `{name}.sav` (data) + `{name}.sps` (syntax). Requires `pyreadstat`.

```python
survey.to_spss("output/", name="results")
```
to_json
Writes `{name}.json` in the same structure `read_json` expects (see Section 2), plus an extra `"vtype"` field per variable that `read_json` ignores on re-read. Pretty-printed with 4-space indent, non-ASCII preserved (`ensure_ascii=False`).

```python
survey.to_json("output/", name="results")
```

Common mistake: Do NOT pass `"output/results.csv"`. Pass directory + `name=`.
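The formatting options mentioned above map onto standard `json.dumps` arguments. The sketch below shows their effect on a hand-built payload; it does not call survy, and the `city` variable is made up for illustration.

```python
import json

# Hypothetical payload in the structure described in Section 2
payload = {
    "variables": [
        {
            "id": "city",
            "data": ["Hà Nội", "Zürich"],
            "label": "City",
            "value_indices": {"Hà Nội": 1, "Zürich": 2},
            "vtype": "select",
        }
    ]
}

# 4-space indent, non-ASCII characters written as-is
readable = json.dumps(payload, indent=4, ensure_ascii=False)

# Default behavior for comparison: non-ASCII characters become \uXXXX escapes
escaped = json.dumps(payload, indent=4)

print(readable)
```

Both strings parse back to the same data; only the on-disk readability differs.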
SPSS Syntax
```python
print(survey.sps)  # full syntax: VARIABLE LABELS, VALUE LABELS, MRSETS, CTABLES
```
8. Gotchas & Rules
- `auto_detect` vs `compact_ids`: Never combine both.
- `value_indices` setter must cover all existing data values; raises `DataStructureError` otherwise.
- `value_indices` is silently skipped for NUMBER variables (in `update()` and direct set).
- Export path is a directory, not a file.
- `get_df()` returns Polars, not pandas.
- `filter()` returns a new Survey; it does not mutate.
- Empty strings become `None` during CSV/Excel read.
- Multiselect values are sorted alphabetically within each row.
- All variables in a crosstab must have the same row count.
- `read_csv` raises `FileTypeError` if the file extension is not `.csv`. Same for `read_excel` with non-`.xlsx` and `read_spss` with non-`.sav`.
- `to_csv` / `to_excel` default is `compact=False`: multiselect variables are expanded to wide columns unless you explicitly pass `compact=True`.
- Column names must not contain multiple reserved separators (`_`, `.`, `:`). If a column like `my.var_1` uses more than one, `parse_id` will fail. Rename before loading so only one separator appears (e.g. `myvar_1`).
9. Quick Reference
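| Task | Code |
|---|---|
| Load CSV auto-detect | `survy.read_csv("data_compact.csv", auto_detect=True, compact_separator=";")` |
| Load CSV explicit compact | `survy.read_csv("data_compact.csv", compact_ids=["hobby"], compact_separator=";")` |
| Load CSV wide format | `survy.read_csv("data_wide.csv")` |
| Load SPSS | `survy.read_spss("data.sav")` |
| Load JSON | `survy.read_json("data.json")` |
| Load from Polars DF | `survy.read_polars(df, auto_detect=True)` |
| Inspect variable | |
| Frequencies | `survey["Q1"].frequencies` |
| Crosstab count | `survy.crosstab(column=survey["gender"], row=survey["hobby"], aggfunc="count")` |
| Crosstab percent | `survy.crosstab(column=survey["gender"], row=survey["hobby"], aggfunc="percent")` |
| Crosstab with filter | `survy.crosstab(column=survey["gender"], row=survey["hobby"], filter=survey["Q1"])` |
| Crosstab mean | `survy.crosstab(column=survey["gender"], row=survey["yob"], aggfunc="mean")` |
| Filter respondents | `survey.filter("gender", "Male")` |
| Replace values | `survey["gender"].replace({"Male": "M", "Female": "F"})` |
| Add variable | `survey.add(polars.Series("new", [1, 2, 3]))` |
| Drop variable | `survey.drop("Q3")` |
| Sort variables | `survey.sort()` |
| Batch update labels | `survey.update([{"id": "Q2", "label": "Channels used"}])` |
| Get compact DF | `survey.get_df(select_dtype="text", multiselect_dtype="compact")` |
| Get wide binary DF | `survey.get_df(select_dtype="text", multiselect_dtype="number")` |
| Export CSV | `survey.to_csv("output/", name="results")` |
| Export SPSS | `survey.to_spss("output/", name="results")` |
| Export JSON | `survey.to_json("output/", name="results")` |
| SPSS syntax string | `survey.sps` |
| Serialize variable | |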
10. Reference Files
- `references/api_reference.md`: Complete method signatures with all parameters and return types
- `scripts/validate_survey.py`: Loads a survey file, checks missing labels/value_indices, prints report
- `scripts/batch_export.py`: Reads a survey and exports to CSV, Excel, SPSS, and JSON
- `assets/sample_data.csv`: Wide-format sample dataset
- `assets/sample_data_compact.csv`: Compact-format sample dataset