survey-analysis

survy — Survey Data Analysis Skill


`survy` is a lightweight Python library for processing, transforming, and analyzing survey data. Its central design principle is to treat survey constructs, especially multiselect questions, as first-class concepts rather than awkward DataFrame workarounds.
Install: always install the latest version with `pip install --upgrade survy`.
Powered by: Polars (all DataFrames returned are Polars, not pandas).


1. Core Objects


Survey

Top-level container. Created via the `read_*` functions; never instantiate it directly. Access variables with `survey["Q1"]`. Print the object for a compact summary.

Variable

Wraps a single Polars Series plus survey metadata. Key attributes:

| Attribute | Type | Description |
|---|---|---|
| `id` | `str` | Column name (read/write via property) |
| `label` | `str` | Human-readable label (read/write); defaults to `id` if unset |
| `vtype` | `VarType` | One of `VarType.SELECT`, `VarType.MULTISELECT`, `VarType.NUMBER` |
| `value_indices` | `dict[str, int]` | Answer code → numeric index mapping; always empty `{}` for NUMBER |
| `base` | `int` | Count of non-null/non-empty responses |
| `len` | `int` | Total row count including nulls |
| `dtype` | `polars.DataType` | Underlying Polars dtype |
| `frequencies` | `polars.DataFrame` | Frequency table (value, count, proportion) |
| `sps` | `str` | SPSS syntax for this variable |
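To make the difference between `base` and `len` concrete, here is a minimal plain-Python sketch. It does not call survy; the helper name `base_and_len` is hypothetical, and treating `None`, empty strings, and empty lists as non-responses is an assumption drawn from the "non-null/non-empty" wording above.

```python
def base_and_len(values):
    """Return (base, length) for one column of responses."""
    def answered(v):
        # Assumption: None, "" and [] all count as "no response".
        return v is not None and v != "" and v != []

    base = sum(1 for v in values if answered(v))
    return base, len(values)


select_col = ["Male", None, "Female", ""]
multi_col = [["Book", "Sport"], [], ["Movie"]]

print(base_and_len(select_col))  # (2, 4)
print(base_and_len(multi_col))   # (2, 3)
```

The gap between the two numbers is the count of respondents who skipped the question.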


2. Reading Data

All readers return a `Survey` object. The key challenge survy solves at read time is recognizing multiselect questions, that is, questions where one respondent can choose multiple answers. Raw data encodes these in two very different layouts, and survy needs to know which layout it is looking at so it can merge the data into a single logical variable.

Multiselect: Compact Format vs Wide Format

Compact format stores all selected answers in a single cell, joined by a separator (typically `;`). One column = one question.

```
id,  gender,  hobby
1,   Male,    Sport;Book
2,   Female,  Sport;Movie
3,   Male,    Movie
```

Here `hobby` is one column. The cell `"Sport;Book"` means the respondent chose both Sport and Book. survy splits this cell on the separator to recover the individual choices.

Wide format spreads each possible answer across its own column, using a shared prefix plus a numeric suffix (`_1`, `_2`, ...). Multiple columns = one question.

```
id,  gender,  hobby_1,  hobby_2,  hobby_3
1,   Male,    Book,     ,         Sport
2,   Female,  ,         Movie,    Sport
3,   Male,    ,         Movie,
```

Here `hobby_1`, `hobby_2`, `hobby_3` are three columns that together represent the single `hobby` question. survy groups them by matching the prefix pattern and merges them into one multiselect variable named `hobby`.

After reading, both formats produce the exact same Survey variable internally: a `MULTISELECT` variable whose data is a sorted list of chosen values per respondent:

```
hobby: [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
```
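The equivalence of the two layouts can be sketched in plain Python. This is only an illustration of the idea, not survy's implementation; the helper names `from_compact` and `from_wide` are hypothetical.

```python
def from_compact(cells, separator=";"):
    """Split each compact cell on the separator and sort the chosen values."""
    return [
        sorted(part.strip() for part in cell.split(separator) if part.strip())
        for cell in cells
    ]


def from_wide(rows):
    """Collect the non-empty cells of each wide row and sort them."""
    return [sorted(value for value in row if value) for row in rows]


compact_cells = ["Sport;Book", "Sport;Movie", "Movie"]
wide_rows = [
    ["Book", "", "Sport"],    # hobby_1, hobby_2, hobby_3
    ["", "Movie", "Sport"],
    ["", "Movie", ""],
]

expected = [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
assert from_compact(compact_cells) == expected
assert from_wide(wide_rows) == expected
```

Both helpers end in the same per-respondent sorted list, which is why downstream analysis does not need to know which layout the file used.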

How survy detects each format

Wide format is detected via `name_pattern`, a format template (NOT a raw regex) with two named tokens and a set of reserved separators:
  • Tokens: `id` (base variable name), `multi` (suffix for wide columns)
  • Reserved separators: `_`, `.`, `:`. These are always treated as delimiters between tokens when parsing column names.
With the default pattern `"id(_multi)?"`:
  • `hobby_1` → `id="hobby"`, `multi="1"` → grouped as wide multiselect
  • `hobby_2` → same `id="hobby"` → merged with `hobby_1`
  • `gender` → no suffix → normal column
Other patterns:
  • `"id.multi"` → matches `Q1.1`, `Q1.2`, ...
  • `"id:multi"` → matches `Q1:a`, `Q1:b`, ...
Separator conflict warning: If a column name contains more than one reserved separator (e.g. `my.var_1`), `parse_id` will fail because it can't unambiguously split the name into tokens. Before loading, rename such columns so only one separator is used (e.g. rename `my.var_1` to `myvar_1` or `my@var_1`).
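The grouping rule and the separator conflict can be illustrated with a small stand-alone sketch. It is deliberately simplified: it ignores `name_pattern` and splits on whichever single reserved separator appears. The functions `split_column_name` and `group_wide` are hypothetical and are not survy's `parse_id`.

```python
from collections import defaultdict

RESERVED_SEPARATORS = "_.:"


def split_column_name(name):
    """Split a column name into (id, multi) on its single reserved separator.

    Returns (name, None) when no separator is present and raises ValueError
    when more than one reserved separator makes the split ambiguous.
    """
    found = [ch for ch in name if ch in RESERVED_SEPARATORS]
    if not found:
        return name, None
    if len(found) > 1:
        raise ValueError(f"ambiguous column name: {name!r}")
    base, suffix = name.split(found[0])
    return base, suffix


def group_wide(columns):
    """Group column names by their base id, preserving input order."""
    groups = defaultdict(list)
    for column in columns:
        base, _ = split_column_name(column)
        groups[base].append(column)
    return dict(groups)


print(group_wide(["gender", "hobby_1", "hobby_2"]))
# {'gender': ['gender'], 'hobby': ['hobby_1', 'hobby_2']}
print(split_column_name("my@var_1"))  # ('my@var', '1')
```

Calling `split_column_name("my.var_1")` raises `ValueError` in this sketch, which mirrors why such columns should be renamed before loading.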
Compact format is NOT detected by default because a semicolon in a cell could be regular text. You must tell survy which columns are compact in one of two ways:
  1. `compact_ids`: explicitly list the column IDs that are compact multiselect.
  2. `auto_detect=True`: survy scans every column for the `compact_separator` character; any column containing it in at least one cell is treated as compact.
Rule: Do NOT combine `auto_detect=True` with `compact_ids` in the same call.
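The scanning rule, and the reason it is opt-in, can be seen in a short sketch. This is an assumption-level illustration of "any column containing the separator in at least one cell", not survy's code; `detect_compact_columns` is a hypothetical name.

```python
def detect_compact_columns(table, separator=";"):
    """Return names of columns where at least one cell contains the separator."""
    return [
        name
        for name, cells in table.items()
        if any(isinstance(cell, str) and separator in cell for cell in cells)
    ]


table = {
    "gender": ["Male", "Female", "Male"],
    "hobby": ["Sport;Book", "Sport;Movie", "Movie"],
    "comment": ["Fine", "Good; but slow", None],  # free text, not a multiselect
}

print(detect_compact_columns(table))  # ['hobby', 'comment']
```

The free-text `comment` column is flagged as well, which is the kind of false positive that makes an explicit `compact_ids` list the safer choice when open-ended questions are present.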

read_spss

Reads an SPSS `.sav` file. SPSS files are always wide format: compact multiselect does not apply, and `compact_ids` / `auto_detect` are not parameters. Wide multiselect columns (e.g. `hobby_1`, `hobby_2`) are still auto-detected and merged via `name_pattern`. Value labels stored in the `.sav` file are applied automatically, so variables come back as text (`"Male"`, `"Female"`) rather than numeric codes. Requires `pyreadstat`.

```python
# Wide multiselect detected automatically
survey = survy.read_spss("data.sav")

# Custom suffix convention (Q1.1, Q1.2, ...)
survey = survy.read_spss("data.sav", name_pattern="id.multi")
```

**Rule**: Do NOT pass `compact_ids` or `auto_detect` to `read_spss`; those parameters don't exist.

---

Shared Reader Parameters (read_csv, read_excel, read_polars only)

These parameters control multiselect detection and apply to `read_csv`, `read_excel`, and `read_polars`. They do NOT apply to `read_json` (which reads survy's own format where variable types are already resolved).

| Parameter | Type | Default | Description |
|---|---|---|---|
| `compact_ids` | `list[str] \| None` | `None` | Column IDs to treat as compact multiselect |
| `compact_separator` | `str` | `";"` | Separator used to split compact cells |
| `auto_detect` | `bool` | `False` | Auto-detect compact columns by scanning for separator |
| `name_pattern` | `str` | `"id(_multi)?"` | Format template for wide column names. Tokens: `id`, `multi`. Separators: `_`, `.`, `:`. Not a raw regex. |

read_csv / read_excel

```python
import survy

# --- Compact format data ---

# Option A: you know which columns are compact
survey = survy.read_csv("data_compact.csv", compact_ids=["hobby"], compact_separator=";")

# Option B: let survy scan for the separator automatically
survey = survy.read_csv("data_compact.csv", auto_detect=True, compact_separator=";")

# --- Wide format data ---

# Wide detection is automatic via name_pattern (default works for Q1_1, Q1_2, ...)
survey = survy.read_csv("data_wide.csv")

# Pass name_pattern explicitly if your columns use a different suffix convention
# (the default is shown here; use e.g. "id.multi" for Q1.1, Q1.2, ...)
survey = survy.read_csv("data_wide.csv", name_pattern="id(_multi)?")

# --- Mixed: some columns are wide, some are compact ---
survey = survy.read_csv("data_mixed.csv", name_pattern="id(_multi)?", auto_detect=True)

# Excel: identical API to read_csv
survey = survy.read_excel("data.xlsx", auto_detect=True, compact_separator=";")
```

read_json

Reads a survy-format JSON file. The file must have this exact structure:

```json
{
  "variables": [
    {
      "id": "gender",
      "data": ["Male", "Female", "Male"],
      "label": "Gender of respondent",
      "value_indices": {"Female": 1, "Male": 2}
    },
    {
      "id": "yob",
      "data": [2000, 1999, 1998],
      "label": "",
      "value_indices": {}
    },
    {
      "id": "hobby",
      "data": [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]],
      "label": "Hobbies",
      "value_indices": {"Book": 1, "Movie": 2, "Sport": 3}
    }
  ]
}
```

Key rules for the JSON structure:
  • Top-level key must be `"variables"` (a list of variable objects).
  • Each variable must have `"id"`, `"data"`, `"label"`, and `"value_indices"`.
  • SELECT variables: `"data"` is a flat list of strings (or nulls).
  • NUMBER variables: `"data"` is a flat list of numbers; `"value_indices"` must be `{}`.
  • MULTISELECT variables: `"data"` is a list of lists of strings.
  • `"value_indices"` maps each answer text to a numeric index; only applied when non-empty.
  • Read vs Write difference: `to_json()` writes an extra `"vtype"` field per variable (e.g. `"select"`, `"multi_select"`, `"number"`). `read_json()` ignores this field and re-infers the type from the data, so if you're building JSON manually, you can omit `"vtype"`.

```python
survey = survy.read_json("data.json")
```
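If you are building such a file by hand, the standard `json` module is enough. The sketch below assembles the structure described above and applies an illustrative type guess based only on the shape of `"data"`. The `infer_kind` helper is hypothetical; it approximates the rules listed here and is not survy's actual inference logic.

```python
import json


def infer_kind(data):
    """Guess a variable kind from its data (illustrative rule, not survy's)."""
    non_null = [v for v in data if v is not None]
    if any(isinstance(v, list) for v in non_null):
        return "multi_select"
    if non_null and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    ):
        return "number"
    return "select"


payload = {
    "variables": [
        {"id": "gender", "data": ["Male", "Female", "Male"],
         "label": "Gender of respondent", "value_indices": {"Female": 1, "Male": 2}},
        {"id": "yob", "data": [2000, 1999, 1998], "label": "", "value_indices": {}},
        {"id": "hobby", "data": [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]],
         "label": "Hobbies", "value_indices": {"Book": 1, "Movie": 2, "Sport": 3}},
    ]
}

# Serialize the same way the export is described: 4-space indent, non-ASCII kept.
text = json.dumps(payload, indent=4, ensure_ascii=False)

kinds = {v["id"]: infer_kind(v["data"]) for v in json.loads(text)["variables"]}
print(kinds)  # {'gender': 'select', 'yob': 'number', 'hobby': 'multi_select'}
```

Writing `text` to a `.json` file gives you something in the shape `read_json` is documented to accept, with no `"vtype"` field needed.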

read_polars

Construct a Survey from an existing Polars DataFrame. The extra parameter `exclude_null` (default `True`) drops columns with no responses or all-empty lists. `read_polars` applies the same wide/compact format handling as `read_csv`.

```python
import polars
import survy

df = polars.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "yob": [2000, 1999, 1998],
    "hobby": ["Sport;Book", "Sport;Movie", "Movie"],   # compact multiselect
    "animal_1": ["Cat", "", "Cat"],                    # wide multiselect
    "animal_2": ["Dog", "Dog", ""],
})
survey = survy.read_polars(df, auto_detect=True)
```

3. Modifying the Survey


survey.update() — batch label/value_indices

```python
survey.update([
    {"id": "Q1", "label": "Satisfaction", "value_indices": {"good": 1, "bad": 2}},
    {"id": "Q2", "label": "Channels used"},
])
```

Silently skips `value_indices` for NUMBER variables. Warns and skips unknown IDs.

survey.add() — add a variable

```python
survey.add(some_variable)                    # Variable object
survey.add(polars.Series("new", [1, 2, 3]))  # auto-wrapped into Variable
```

If the ID already exists, a numeric suffix is appended (e.g. `"Q1#1"`).
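The suffixing behaviour can be pictured with a few lines of plain Python. Only the `"Q1#1"` form comes from the text above; counting upward on repeated collisions is an assumption, and `unique_id` is a hypothetical helper rather than part of survy.

```python
def unique_id(new_id, existing_ids):
    """Return new_id, or new_id with '#<n>' appended if it is already taken."""
    if new_id not in existing_ids:
        return new_id
    n = 1
    # Assumption: keep counting upward until a free ID is found.
    while f"{new_id}#{n}" in existing_ids:
        n += 1
    return f"{new_id}#{n}"


print(unique_id("Q2", {"Q1"}))           # Q2
print(unique_id("Q1", {"Q1"}))           # Q1#1
print(unique_id("Q1", {"Q1", "Q1#1"}))   # Q1#2
```

Because the stored ID may differ from the one you passed in, it is worth checking the variable's `id` after adding it.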

survey.drop() — remove a variable

```python
survey.drop("Q3")   # silently ignored if not found
```

survey.sort() — reorder variables

```python
survey.sort()                                      # alphabetical by id (default)
survey.sort(key=lambda v: v.base, reverse=True)    # by response count desc
```

variable.replace() — recode values

```python
survey["gender"].replace({"Male": "M", "Female": "F"})
```

Works for both SELECT and MULTISELECT. Automatically rebuilds `value_indices`.

Direct property assignment

```python
v = survey["Q1"]
v.id = "satisfaction"
v.label = "Overall satisfaction"
v.value_indices = {"very_satisfied": 1, "satisfied": 2, "neutral": 3}
```

Caution on the `value_indices` setter: it raises `DataStructureError` if any existing value in the data is missing from the new mapping. You must cover ALL values present in the data.
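One way to avoid the error is to compare the data against the proposed mapping before assigning it. The sketch below works on plain Python lists; `missing_from_mapping` is a hypothetical helper, and how you obtain the raw values from a variable is left open here.

```python
def missing_from_mapping(data, mapping):
    """Return the sorted data values that are absent from the proposed mapping."""
    present = set()
    for cell in data:
        if cell is None:
            continue
        # A multiselect cell is a list of values; a select cell is one value.
        present.update(cell if isinstance(cell, list) else [cell])
    return sorted(present - set(mapping))


data = ["satisfied", "neutral", None, "very_satisfied", "dissatisfied"]
new_mapping = {"very_satisfied": 1, "satisfied": 2, "neutral": 3}

print(missing_from_mapping(data, new_mapping))  # ['dissatisfied']
```

An empty result means the mapping covers every value in the data and the assignment should not trip the coverage check.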

4. Filtering

Returns a new Survey (original is not mutated).

```python
filtered = survey.filter("hobby", ["Sport", "Book"])
filtered = survey.filter("gender", "Male")   # single value also works
```

For MULTISELECT, a row is kept if any of its selected values appears in the filter list.
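The "any match" rule for multiselect rows can be written out as a small predicate. This is a plain-Python sketch of the semantics described above, with a hypothetical `keep_row` helper; dropping null cells is an assumption.

```python
def keep_row(cell, wanted):
    """Decide whether one respondent's cell survives the filter."""
    wanted = {wanted} if isinstance(wanted, str) else set(wanted)
    if cell is None:
        return False                        # assumption: nulls never match
    if isinstance(cell, list):
        # MULTISELECT: any overlap with the filter list keeps the row.
        return any(value in wanted for value in cell)
    return cell in wanted                   # SELECT: the single value must match


hobby = [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
gender = ["Male", "Female", "Male"]

print([keep_row(c, ["Sport", "Book"]) for c in hobby])  # [True, True, False]
print([keep_row(c, "Male") for c in gender])            # [True, False, True]
```

Note that the rule is "any", not "all": a respondent who chose only Sport still passes a `["Sport", "Book"]` filter.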

5. Getting a DataFrame

```python
df = survey.get_df(
    select_dtype="text",          # "text" | "number"
    multiselect_dtype="compact",  # "compact" | "text" | "number"
)
```

`select_dtype`: `"text"` keeps string codes (default); `"number"` converts via `value_indices`.
`multiselect_dtype`:
  • `"compact"` → one `List[str]` column per multiselect (default)
  • `"text"` → expands to wide columns `Q_1`, `Q_2`, ... with string or `null`
  • `"number"` → expands to wide columns with `1`/`0` binary flags
Returns a Polars DataFrame; use Polars methods, not pandas.
Valid dtype literals: `"text"`, `"number"`, `"compact"`. Never `"numeric"` or `"string"`.
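To picture what the two wide expansions look like, here is a plain-Python sketch that turns a list-valued column into one column per answer option. Using the `value_indices` number as the column suffix and column order is an assumption made for illustration; `expand_multiselect` is a hypothetical helper, not survy's implementation.

```python
def expand_multiselect(name, rows, value_indices, dtype="number"):
    """Expand a list-valued column into one wide column per answer option."""
    wide = {}
    for value, index in sorted(value_indices.items(), key=lambda kv: kv[1]):
        chosen = [value in row for row in rows]
        if dtype == "number":
            wide[f"{name}_{index}"] = [1 if c else 0 for c in chosen]
        else:  # "text": keep the answer text, None where it was not chosen
            wide[f"{name}_{index}"] = [value if c else None for c in chosen]
    return wide


rows = [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
indices = {"Book": 1, "Movie": 2, "Sport": 3}

print(expand_multiselect("hobby", rows, indices, "number"))
# {'hobby_1': [1, 0, 0], 'hobby_2': [0, 1, 1], 'hobby_3': [1, 1, 0]}
print(expand_multiselect("hobby", rows, indices, "text"))
# {'hobby_1': ['Book', None, None], 'hobby_2': [None, 'Movie', 'Movie'], 'hobby_3': ['Sport', 'Sport', None]}
```

The binary form is usually the convenient one for modelling, while the text form mirrors the wide layout shown in the reading section.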


6. Analysis

Frequency table

```python
survey["Q1"].frequencies
# → Polars DataFrame: columns [variable_id, "count", "proportion"]
```
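The shape of a frequency table can be sketched without survy. The version below counts each answer and divides by the number of respondents who answered at all. Using that respondent base as the denominator is an assumption, and `frequency_table` is a hypothetical helper.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, proportion) rows for a select or multiselect column."""
    answered = [v for v in values if v not in (None, "", [])]
    counts = Counter()
    for v in answered:
        counts.update(v if isinstance(v, list) else [v])
    base = len(answered)  # respondents with at least one answer
    return [(value, n, n / base) for value, n in sorted(counts.items())]


hobby = [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
for value, count, proportion in frequency_table(hobby):
    print(value, count, f"{proportion:.2f}")
# Book 1 0.33
# Movie 2 0.67
# Sport 2 0.67
```

With a respondent-based denominator, multiselect proportions can add up to more than 1, since one person may contribute to several rows.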

Crosstab

```python
result = survy.crosstab(
    column=survey["gender"],     # grouping variable (columns)
    row=survey["hobby"],         # analyzed variable (rows)
    filter=None,                 # optional: segment by another variable
    aggfunc="count",             # "count" | "percent" | "mean" | "median" | "sum"
    alpha=0.05,                  # significance level for stat tests
)
# Returns dict[str, polars.DataFrame]
# Key is "Total" when no filter, or each filter value when filter is provided
```

**aggfunc options**:

- `"count"`: cell counts with significance letter labels (z-test)
- `"percent"`: column-wise proportions with significance labels
- Numeric (`"mean"`, `"median"`, `"sum"`): aggregates the row variable; Welch's t-test for significance

**filter**: Pass a Variable to segment the crosstab into one table per filter value.
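For readers unfamiliar with the z-test mentioned for `"count"` and `"percent"`, the textbook pooled two-proportion z-test is sketched below using only the standard library. Whether survy uses the pooled or unpooled variant, and how it assigns significance letters, is not stated here, so treat this as background rather than a description of its internals.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # both groups are all-0 or all-1: no detectable difference
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# 60 of 100 respondents in column A chose an answer vs 40 of 100 in column B.
z, p = two_proportion_z(60, 100, 40, 100)
print(round(z, 2), round(p, 4))  # 2.83 0.0047
print(p < 0.05)                  # True
```

With `alpha=0.05`, a difference like this one would be flagged as significant between the two columns.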

---


7. Exporting

All exports take a directory path (not a file path) plus an optional `name` (base filename).

to_csv / to_excel

Writes three files per export:
  • `{name}_data.csv`: the actual survey responses. Format depends on the `compact` param.
  • `{name}_variables_info.csv`: variable metadata with columns `id`, `vtype` (SINGLE/MULTISELECT/NUMBER), `label`.
  • `{name}_values_info.csv`: value-to-index mappings with columns `id`, `text`, `index`.
The `compact` parameter (default `False`) controls how multiselect variables appear in the data file: `True` joins values into one cell (e.g. `"Book;Sport"`), `False` expands into wide columns (e.g. `hobby_1`, `hobby_2`, `hobby_3`).

```python
# Default (compact=False): multiselect expanded to wide columns
survey.to_csv("output/", name="results")

# Compact mode: multiselect joined into single cells
survey.to_csv("output/", name="results", compact=True, compact_separator=";")

# Excel: identical API and output structure (.xlsx files instead of .csv)
survey.to_excel("output/", name="results")
survey.to_excel("output/", name="results", compact=True)
```
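The directory-plus-name convention is easy to get wrong, so here is a small `pathlib` sketch of the file names a CSV export is described as producing, along with a guard against passing a file path by mistake. Both helpers are hypothetical conveniences and do not call survy.

```python
from pathlib import Path


def export_targets(directory, name, extension="csv"):
    """List the three files a CSV/Excel export is described as writing."""
    folder = Path(directory)
    return [
        folder / f"{name}_{suffix}.{extension}"
        for suffix in ("data", "variables_info", "values_info")
    ]


def looks_like_file(path):
    """True if the path ends in an extension, i.e. is probably not a directory."""
    return Path(path).suffix != ""


for target in export_targets("output/", "results"):
    print(target.as_posix())
# output/results_data.csv
# output/results_variables_info.csv
# output/results_values_info.csv

print(looks_like_file("output/"))             # False
print(looks_like_file("output/results.csv"))  # True
```

Running a check like `looks_like_file` before exporting catches the common mistake of passing `"output/results.csv"` where a directory is expected.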

to_spss

Writes `{name}.sav` (data) + `{name}.sps` (syntax). Requires `pyreadstat`.

```python
survey.to_spss("output/", name="results")
```

to_json

Writes `{name}.json` in the same structure `read_json` expects (see Section 2), plus an extra `"vtype"` field per variable that `read_json` ignores on re-read. Pretty-printed with 4-space indent, non-ASCII preserved (`ensure_ascii=False`).

```python
survey.to_json("output/", name="results")
```

Common mistake: Do NOT pass `"output/results.csv"`. Pass directory + `name=`.

SPSS Syntax

```python
print(survey.sps)  # full syntax: VARIABLE LABELS, VALUE LABELS, MRSETS, CTABLES
```

8. Gotchas & Rules

  1. `auto_detect` vs `compact_ids`: Never combine both.
  2. `value_indices` setter must cover all existing data values; raises `DataStructureError` otherwise.
  3. `value_indices` is silently skipped for NUMBER variables (in `update()` and direct set).
  4. Export path is a directory, not a file.
  5. `get_df()` returns Polars, not pandas.
  6. `filter()` returns a new Survey; it does not mutate the original.
  7. Empty strings become `None` during CSV/Excel read.
  8. Multiselect values are sorted alphabetically within each row.
  9. All variables in a crosstab must have the same row count.
  10. `read_csv` raises `FileTypeError` if the file extension is not `.csv`. Same for `read_excel` with non-`.xlsx` and `read_spss` with non-`.sav`.
  11. `to_csv` / `to_excel` default is `compact=False`: multiselect variables are expanded to wide columns unless you explicitly pass `compact=True`.
  12. Column names must not contain multiple reserved separators (`_`, `.`, `:`). If a column like `my.var_1` uses more than one, `parse_id` will fail. Rename before loading so only one separator appears (e.g. `myvar_1`).


9. Quick Reference

| Task | Code |
|---|---|
| Load CSV auto-detect | `survy.read_csv("f.csv", auto_detect=True, compact_separator=";")` |
| Load CSV explicit compact | `survy.read_csv("f.csv", compact_ids=["Q2"], compact_separator=";")` |
| Load CSV wide format | `survy.read_csv("f.csv", name_pattern="id(_multi)?")` (wide detected automatically) |
| Load SPSS | `survy.read_spss("f.sav")` |
| Load JSON | `survy.read_json("f.json")` |
| Load from Polars DF | `survy.read_polars(df, auto_detect=True)` |
| Inspect variable | `survey["Q1"].vtype`, `.base`, `.len`, `.label`, `.value_indices`, `.dtype` |
| Frequencies | `survey["Q1"].frequencies` |
| Crosstab count | `survy.crosstab(survey["Q1"], survey["Q2"])` |
| Crosstab percent | `survy.crosstab(survey["Q1"], survey["Q2"], aggfunc="percent")` |
| Crosstab with filter | `survy.crosstab(survey["col"], survey["row"], filter=survey["seg"])` |
| Crosstab mean | `survy.crosstab(survey["col"], survey["row"], aggfunc="mean")` |
| Filter respondents | `survey.filter("Q1", ["a", "b"])` |
| Replace values | `survey["Q1"].replace({"old": "new"})` |
| Add variable | `survey.add(polars.Series("x", [1,2,3]))` |
| Drop variable | `survey.drop("Q3")` |
| Sort variables | `survey.sort(key=lambda v: v.id)` |
| Batch update labels | `survey.update([{"id":"Q1","label":"...","value_indices":{...}}])` |
| Get compact DF | `survey.get_df()` |
| Get wide binary DF | `survey.get_df(multiselect_dtype="number")` |
| Export CSV | `survey.to_csv("output/", name="results")` |
| Export SPSS | `survey.to_spss("output/", name="results")` |
| Export JSON | `survey.to_json("output/", name="results")` |
| SPSS syntax string | `survey.sps` |
| Serialize variable | `survey["Q1"].to_dict()` |


10. Reference Files

  • `references/api_reference.md`: Complete method signatures with all parameters and return types
  • `scripts/validate_survey.py`: Loads a survey file, checks missing labels/value_indices, prints report
  • `scripts/batch_export.py`: Reads a survey and exports to CSV, Excel, SPSS, and JSON
  • `assets/sample_data.csv`: Wide-format sample dataset
  • `assets/sample_data_compact.csv`: Compact-format sample dataset