survey-analysis

survy — Survey Data Analysis Skill


`survy` is a lightweight Python library for processing, transforming, and analyzing survey data. Its central design principle is treating survey constructs, especially multiselect questions, as first-class concepts rather than awkward DataFrame workarounds.

Install: Always install the latest version: `pip install --upgrade survy`

Powered by: Polars (all DataFrames returned are Polars, not pandas)

1. Core Objects

Survey

Top-level container. Created via the `read_*` functions; never instantiate directly. Access variables with `survey["Q1"]`. Print for a compact summary.


Variable

Wraps a single Polars Series plus survey metadata. Key attributes:

| Attribute | Type | Description |
| --- | --- | --- |
| `id` | `str` | Column name (read/write via property) |
| `label` | `str` | Human-readable label (read/write); defaults to `id` if unset |
| `vtype` | `VarType` | One of `VarType.SELECT`, `VarType.MULTISELECT`, `VarType.NUMBER` |
| `value_indices` | `dict[str, int]` | Answer code → numeric index mapping; always empty `{}` for NUMBER |
| `base` | `int` | Count of non-null/non-empty responses |
| `len` | `int` | Total row count including nulls |
| `dtype` | `polars.DataType` | Underlying Polars dtype |
| `frequencies` | `polars.DataFrame` | Frequency table (value, count, proportion) |
| `sps` | `str` | SPSS syntax for this variable |

2. Reading Data

All readers return a `Survey` object. The key challenge survy solves at read time is recognizing multiselect questions: questions where one respondent can choose multiple answers. Raw data encodes these in two very different layouts, and survy needs to know which layout it's looking at so it can merge the data into a single logical variable.

Multiselect: Compact Format vs Wide Format

Compact format stores all selected answers in a single cell, joined by a separator (typically `;`). One column = one question.

id,  gender,  hobby
1,   Male,    Sport;Book
2,   Female,  Sport;Movie
3,   Male,    Movie

Here `hobby` is one column. The cell `"Sport;Book"` means the respondent chose both Sport and Book. survy splits this cell on the separator to recover the individual choices.

Wide format spreads each possible answer across its own column, using a shared prefix plus a numeric suffix (`_1`, `_2`, ...). Multiple columns = one question.

id,  gender,  hobby_1,  hobby_2,  hobby_3
1,   Male,    Book,     ,         Sport
2,   Female,  ,         Movie,    Sport
3,   Male,    ,         Movie,

Here `hobby_1`, `hobby_2`, and `hobby_3` are three columns that together represent the single `hobby` question. survy groups them by matching the prefix pattern and merges them into one multiselect variable named `hobby`.

After reading, both formats produce the exact same Survey variable internally: a `MULTISELECT` variable whose data is a sorted list of chosen values per respondent:

hobby: [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
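To make the equivalence of the two layouts concrete, here is a minimal pure-Python sketch that normalizes both into the same sorted-list form. It does not use survy; `from_compact` and `from_wide` are hypothetical helper names chosen for illustration, and the real library may handle whitespace and nulls differently.

```python
def from_compact(cells, separator=";"):
    """Split each compact cell on the separator and sort the choices."""
    rows = []
    for cell in cells:
        parts = [part.strip() for part in (cell or "").split(separator)]
        rows.append(sorted(part for part in parts if part))
    return rows


def from_wide(columns):
    """Merge parallel wide columns row by row, skipping empty cells."""
    rows = []
    for cells in zip(*columns):
        rows.append(sorted(cell.strip() for cell in cells if cell and cell.strip()))
    return rows


compact_hobby = ["Sport;Book", "Sport;Movie", "Movie"]
wide_hobby = [
    ["Book", "", ""],         # hobby_1
    ["", "Movie", "Movie"],   # hobby_2
    ["Sport", "Sport", ""],   # hobby_3
]

compact_rows = from_compact(compact_hobby)
wide_rows = from_wide(wide_hobby)
print(compact_rows == wide_rows)
```

Both calls yield one sorted list per respondent, which is the shape the text describes for a `MULTISELECT` variable.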

How survy detects each format

Wide format is detected via `name_pattern`, a format template (NOT a raw regex) with two named tokens and a set of reserved separators:

- Tokens: `id` (base variable name), `multi` (suffix for wide columns)
- Reserved separators: `_`, `.`, `:`. These are always treated as delimiters between tokens when parsing column names.

With the default pattern `"id(_multi)?"`:

- `hobby_1` → `id="hobby"`, `multi="1"` → grouped as wide multiselect
- `hobby_2` → same `id="hobby"` → merged with `hobby_1`
- `gender` → no suffix → normal column

Other patterns:

- `"id.multi"` → matches `Q1.1`, `Q1.2`, ...
- `"id:multi"` → matches `Q1:a`, `Q1:b`, ...

Separator conflict warning: If a column name contains more than one reserved separator (e.g. `my.var_1`), `parse_id` will fail because it can't unambiguously split the name into tokens. Before loading, rename such columns so only one separator is used (e.g. rename `my.var_1` to `myvar_1` or `my@var_1`).
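The grouping and the separator conflict can be illustrated without survy. The sketch below is not survy's parser: `split_wide_name` and `group_columns` are hypothetical helpers, only a single separator character is modelled rather than the full `name_pattern` template, and treating "more than one separator" as "more than one distinct reserved character" is an assumption.

```python
from collections import defaultdict

RESERVED = ("_", ".", ":")


def split_wide_name(name, sep="_"):
    """Return (id, multi) for a column name; multi is None without a suffix."""
    used = {ch for ch in name if ch in RESERVED}
    if len(used) > 1:
        # Mirrors the documented failure mode for names like "my.var_1".
        raise ValueError(f"ambiguous column name {name!r}: mixes {sorted(used)}")
    if sep not in name:
        return name, None
    base, _, suffix = name.rpartition(sep)  # assumption: split on the last separator
    return base, suffix


def group_columns(names, sep="_"):
    """Group column names by their parsed id, preserving input order."""
    groups = defaultdict(list)
    for name in names:
        base, _suffix = split_wide_name(name, sep)
        groups[base].append(name)
    return dict(groups)


grouped = group_columns(["gender", "hobby_1", "hobby_2", "hobby_3"])
print(grouped)
```

Renaming `my.var_1` to `myvar_1` before loading leaves a single reserved character in the name, so a split like this becomes unambiguous.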

Compact format is NOT detected by default because a semicolon in a cell could be regular text. You must tell survy which columns are compact in one of two ways:

- `compact_ids`: explicitly list the column IDs that are compact multiselect.
- `auto_detect=True`: survy scans every column for the `compact_separator` character; any column containing it in at least one cell is treated as compact.

Rule: Do NOT combine `auto_detect=True` with `compact_ids` in the same call.
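A rough sketch of the scanning rule shows why auto-detection is opt-in. This is plain Python over a dict of columns, not survy's code, and `detect_compact_columns` is a hypothetical name.

```python
def detect_compact_columns(columns, separator=";"):
    """Return names of columns where at least one cell contains the separator."""
    return [
        name
        for name, cells in columns.items()
        if any(isinstance(cell, str) and separator in cell for cell in cells)
    ]


data = {
    "gender": ["Male", "Female", "Male"],
    "hobby": ["Sport;Book", "Sport;Movie", "Movie"],
    "comment": ["fine", "good; would return", None],  # free text with a semicolon
}

detected = detect_compact_columns(data)
print(detected)
```

The free-text `comment` column is flagged alongside `hobby`, which is the kind of false positive that listing columns explicitly in `compact_ids` avoids.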


read_spss

Reads an SPSS `.sav` file. SPSS files are always wide format: compact multiselect does not apply, and `compact_ids` / `auto_detect` are not parameters. Wide multiselect columns (e.g. `hobby_1`, `hobby_2`) are still auto-detected and merged via `name_pattern`. Value labels stored in the `.sav` file are applied automatically, so variables come back as text (`"Male"`, `"Female"`) rather than numeric codes. Requires `pyreadstat`.


```python
import survy

# Wide multiselect detected automatically
survey = survy.read_spss("data.sav")

# Custom suffix convention (Q1.1, Q1.2, ...)
survey = survy.read_spss("data.sav", name_pattern="id.multi")
```

**Rule**: Do NOT pass `compact_ids` or `auto_detect` to `read_spss`; those parameters don't exist.

---

Shared Reader Parameters (read_csv, read_excel, read_polars only)


These parameters control multiselect detection and apply to `read_csv`, `read_excel`, and `read_polars`. They do NOT apply to `read_json` (which reads survy's own format where variable types are already resolved).

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `compact_ids` | `list[str] \| None` | `None` | Column IDs to treat as compact multiselect |
| `compact_separator` | `str` | `";"` | Separator used to split compact cells |
| `auto_detect` | `bool` | `False` | Auto-detect compact columns by scanning for separator |
| `name_pattern` | `str` | `"id(_multi)?"` | Format template for wide column names. Tokens: `id`, `multi`. Separators: `_`, `.`, `:`. Not a raw regex. |


read_csv / read_excel

```python
import survy

# --- Compact format data ---
# Option A: you know which columns are compact
survey = survy.read_csv("data_compact.csv", compact_ids=["hobby"], compact_separator=";")

# Option B: let survy scan for the separator automatically
survey = survy.read_csv("data_compact.csv", auto_detect=True, compact_separator=";")

# --- Wide format data ---
# Wide detection is automatic via name_pattern (default works for Q1_1, Q1_2, ...)
survey = survy.read_csv("data_wide.csv")

# Custom name_pattern if your columns use a different suffix convention (e.g. Q1.1, Q1.2, ...)
survey = survy.read_csv("data_wide.csv", name_pattern="id.multi")

# --- Mixed: some columns are wide, some are compact ---
survey = survy.read_csv("data_mixed.csv", name_pattern="id(_multi)?", auto_detect=True)

# Excel: identical API to read_csv
survey = survy.read_excel("data.xlsx", auto_detect=True, compact_separator=";")
```

read_json

Reads a survy-format JSON file. The file must have this exact structure:

```json
{
  "variables": [
    {
      "id": "gender",
      "data": ["Male", "Female", "Male"],
      "label": "Gender of respondent",
      "value_indices": {"Female": 1, "Male": 2}
    },
    {
      "id": "yob",
      "data": [2000, 1999, 1998],
      "label": "",
      "value_indices": {}
    },
    {
      "id": "hobby",
      "data": [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]],
      "label": "Hobbies",
      "value_indices": {"Book": 1, "Movie": 2, "Sport": 3}
    }
  ]
}
```

Key rules for the JSON structure:

- Top-level key must be `"variables"` (a list of variable objects).
- Each variable must have `"id"`, `"data"`, `"label"`, and `"value_indices"`.
- SELECT variables: `"data"` is a flat list of strings (or nulls).
- NUMBER variables: `"data"` is a flat list of numbers; `"value_indices"` must be `{}`.
- MULTISELECT variables: `"data"` is a list of lists of strings.
- `"value_indices"` maps each answer text to a numeric index; only applied when non-empty.
- Read vs Write difference: `to_json()` writes an extra `"vtype"` field per variable (e.g. `"select"`, `"multi_select"`, `"number"`). `read_json()` ignores this field and re-infers the type from the data. So if you're building JSON manually, you can omit `"vtype"`.

```python
survey = survy.read_json("data.json")
```
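As a rough illustration of how a type can be re-inferred from `"data"` alone, the sketch below builds the document by hand with the standard `json` module and classifies each variable. `infer_vtype` is a hypothetical helper and its rules are an assumption based on the structure rules above, not survy's actual inference logic.

```python
import json


def infer_vtype(data):
    """Guess the variable type from the shape of its data list."""
    values = [value for value in data if value is not None]
    if any(isinstance(value, list) for value in values):
        return "multi_select"
    is_number = lambda v: isinstance(v, (int, float)) and not isinstance(v, bool)
    if values and all(is_number(value) for value in values):
        return "number"
    return "select"


document = {
    "variables": [
        {"id": "gender", "data": ["Male", "Female", "Male"],
         "label": "Gender of respondent", "value_indices": {"Female": 1, "Male": 2}},
        {"id": "yob", "data": [2000, 1999, 1998], "label": "", "value_indices": {}},
        {"id": "hobby", "data": [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]],
         "label": "Hobbies", "value_indices": {"Book": 1, "Movie": 2, "Sport": 3}},
    ]
}

# Same formatting options the text attributes to to_json().
text = json.dumps(document, indent=4, ensure_ascii=False)
inferred = {var["id"]: infer_vtype(var["data"]) for var in json.loads(text)["variables"]}
print(inferred)
```

Because the shape of `"data"` already distinguishes the three cases, a hand-built file can leave `"vtype"` out.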


read_polars

Construct a Survey from an existing Polars DataFrame. The extra parameter `exclude_null` (default `True`) drops columns with no responses or all-empty lists. `read_polars` supports the same wide and compact format handling as `read_csv`.

```python
import polars
import survy

df = polars.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "yob": [2000, 1999, 1998],
    "hobby": ["Sport;Book", "Sport;Movie", "Movie"],
    "animal_1": ["Cat", "", "Cat"],
    "animal_2": ["Dog", "Dog", ""],
})
survey = survy.read_polars(df, auto_detect=True)
```

3. Modifying the Survey

survey.update() — batch label/value_indices

```python
survey.update([
    {"id": "Q1", "label": "Satisfaction", "value_indices": {"good": 1, "bad": 2}},
    {"id": "Q2", "label": "Channels used"},
])
```

Silently skips `value_indices` for NUMBER variables. Warns and skips unknown IDs.


survey.add() — add a variable

```python
survey.add(some_variable)                    # Variable object
survey.add(polars.Series("new", [1, 2, 3]))  # auto-wrapped into Variable
```

If the ID already exists, a numeric suffix is appended (e.g. `"Q1#1"`).


survey.drop() — remove a variable

```python
survey.drop("Q3")   # silently ignored if not found
```

survey.sort() — reorder variables

```python
survey.sort()                                      # alphabetical by id (default)
survey.sort(key=lambda v: v.base, reverse=True)    # by response count desc
```

variable.replace() — recode values

```python
survey["gender"].replace({"Male": "M", "Female": "F"})
```

Works for both SELECT and MULTISELECT. Automatically rebuilds `value_indices`.

Direct property assignment

```python
v = survey["Q1"]
v.id = "satisfaction"
v.label = "Overall satisfaction"
v.value_indices = {"very_satisfied": 1, "satisfied": 2, "neutral": 3}
```

Caution on value_indices setter: Raises `DataStructureError` if any existing value in the data is missing from the new mapping. You must cover ALL values present in the data.
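One way to avoid the error is to check coverage before assigning. The helper below is a hypothetical pre-check written in plain Python; it only mirrors the rule stated above and does not call survy.

```python
def missing_values(data, new_mapping):
    """List values present in the data but absent from the proposed mapping."""
    present = set()
    for cell in data:
        if cell is None:
            continue
        if isinstance(cell, list):   # multiselect row
            present.update(cell)
        else:                        # single-select cell
            present.add(cell)
    return sorted(present - set(new_mapping))


answers = ["very_satisfied", "neutral", None, "dissatisfied"]
proposed = {"very_satisfied": 1, "satisfied": 2, "neutral": 3}

gaps = missing_values(answers, proposed)
print(gaps)
```

An empty result means the mapping covers every value in the data; anything listed would need an index before the assignment can succeed.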

4. Filtering

Returns a new Survey (original is not mutated).

```python
filtered = survey.filter("hobby", ["Sport", "Book"])
filtered = survey.filter("gender", "Male")   # single value also works
```

For MULTISELECT, a row is kept if any of its selected values appears in the filter list.
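The any-match rule can be written out in a few lines of plain Python. `keep_row` is a hypothetical name, and dropping rows with no answer is an assumption the text does not confirm.

```python
def keep_row(cell, wanted):
    """True if a select value, or any multiselect choice, is in the filter list."""
    wanted = {wanted} if isinstance(wanted, str) else set(wanted)
    if cell is None:
        return False  # assumption: unanswered rows are dropped
    chosen = cell if isinstance(cell, list) else [cell]
    return any(value in wanted for value in chosen)


hobby = [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
gender = ["Male", "Female", "Male"]

hobby_mask = [keep_row(cell, ["Sport", "Book"]) for cell in hobby]
gender_mask = [keep_row(cell, "Male") for cell in gender]
print(hobby_mask, gender_mask)
```

The third respondent chose only Movie, so that row falls outside the `["Sport", "Book"]` filter while the first two are kept.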

5. Getting a DataFrame

```python
df = survey.get_df(
    select_dtype="text",          # "text" | "number"
    multiselect_dtype="compact",  # "compact" | "text" | "number"
)
```

`select_dtype`: `"text"` keeps string codes (default); `"number"` converts via `value_indices`.

`multiselect_dtype`:

- `"compact"` → one `List[str]` column per multiselect (default)
- `"text"` → expands to wide columns `Q_1`, `Q_2`, ... with string or `null`
- `"number"` → expands to wide columns with `1`/`0` binary flags
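To show what the `"number"` expansion amounts to, here is a minimal sketch that turns list-valued rows into one 0/1 column per option. `expand_to_flags` is a hypothetical helper; naming each column with the option's index from `value_indices` and writing `0` for unanswered rows are both assumptions.

```python
def expand_to_flags(var_id, rows, value_indices):
    """Build one binary column per option, ordered by its numeric index."""
    ordered = sorted(value_indices.items(), key=lambda item: item[1])
    return {
        f"{var_id}_{index}": [1 if value in (row or []) else 0 for row in rows]
        for value, index in ordered
    }


rows = [["Book", "Sport"], ["Movie", "Sport"], ["Movie"]]
indices = {"Book": 1, "Movie": 2, "Sport": 3}

flags = expand_to_flags("hobby", rows, indices)
for name, column in flags.items():
    print(name, column)
```

Each output column answers "did this respondent pick this option", which is the form most statistical tools expect for multiple-response sets.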

Returns Polars DataFrame — use Polars methods, not pandas.

Valid dtype literals: `"text"`, `"number"`, `"compact"`. Never `"numeric"` or `"string"`.

6. Analysis

Frequency table

```python
survey["Q1"].frequencies
# → Polars DataFrame: columns [variable_id, "count", "proportion"]
```
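For intuition, a frequency table of this shape can be computed by hand as below. This is a plain-Python sketch rather than survy's implementation, `frequency_table` is a hypothetical name, and dividing by the number of respondents who answered (the `base`) is an assumption about how `proportion` is defined.

```python
from collections import Counter


def frequency_table(rows):
    """Return (value, count, proportion) tuples, sorted by value."""
    answered = [row for row in rows if row not in (None, [], "")]
    base = len(answered)
    counts = Counter()
    for row in answered:
        counts.update(row if isinstance(row, list) else [row])
    return [(value, n, n / base) for value, n in sorted(counts.items())]


gender_table = frequency_table(["Male", "Female", "Male", None])
hobby_table = frequency_table([["Book", "Sport"], ["Movie", "Sport"], ["Movie"]])
print(gender_table)
print(hobby_table)
```

With this definition, multiselect proportions can sum to more than 1 because one respondent may contribute to several options.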

Crosstab

```python
result = survy.crosstab(
    column=survey["gender"],     # grouping variable (columns)
    row=survey["hobby"],         # analyzed variable (rows)
    filter=None,                 # optional: segment by another variable
    aggfunc="count",             # "count" | "percent" | "mean" | "median" | "sum"
    alpha=0.05,                  # significance level for stat tests
)
# Returns dict[str, polars.DataFrame]
# Key is "Total" when no filter, or each filter-value when filter is provided
```

**aggfunc options**:

- `"count"`: cell counts with significance letter labels (z-test)
- `"percent"`: column-wise proportions with significance labels
- Numeric (`"mean"`, `"median"`, `"sum"`): aggregates row variable; Welch's t-test for significance

**filter**: Pass a Variable to segment the crosstab into one table per filter value.
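The two tests named above can be sketched with the standard library. These are textbook formulas (a pooled two-proportion z-test and Welch's t statistic with Welch-Satterthwaite degrees of freedom), offered as an illustration only; whether survy pools the variance or applies any correction is not stated in this document, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Pooled z statistic and two-sided p-value for two column proportions."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for two samples."""
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1))
    return t, df


z, p = two_proportion_z(60, 100, 40, 100)
t, df = welch_t([1, 2, 3, 4], [2, 4, 6, 8])
print(round(z, 3), p < 0.05, round(t, 3), round(df, 2))
```

A p-value below `alpha` is what would earn a significance label between two columns; turning the Welch statistic into a p-value needs a t-distribution CDF, which the standard library does not provide.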

---



7. Exporting

All exports take a directory path (not file path) plus an optional `name` (base filename).


to_csv / to_excel

Writes three files per export:

- `{name}_data.csv`: the actual survey responses. Format depends on the `compact` param.
- `{name}_variables_info.csv`: variable metadata with columns `id`, `vtype` (SINGLE/MULTISELECT/NUMBER), `label`.
- `{name}_values_info.csv`: value-to-index mappings with columns `id`, `text`, `index`.

The `compact` parameter (default `False`) controls how multiselect variables appear in the data file: `True` joins values into one cell (e.g. `"Book;Sport"`), `False` expands into wide columns (e.g. `hobby_1`, `hobby_2`, `hobby_3`).


```python
# Default (compact=False): multiselect expanded to wide columns
survey.to_csv("output/", name="results")

# Compact mode: multiselect joined into single cells
survey.to_csv("output/", name="results", compact=True, compact_separator=";")

# Excel: identical API and output structure (.xlsx files instead of .csv)
survey.to_excel("output/", name="results")
survey.to_excel("output/", name="results", compact=True)
```
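The directory-plus-name convention can be sketched with `pathlib`. `export_targets` is a hypothetical helper that only computes the three file names described above; it does not write anything, and rejecting any path with a suffix is a crude heuristic for catching a file path passed by mistake.

```python
from pathlib import Path


def export_targets(directory, name, extension="csv"):
    """Return the three paths a CSV/Excel export would produce."""
    out_dir = Path(directory)
    if out_dir.suffix:
        # Heuristic: "output/results.csv" looks like a file, not a directory.
        raise ValueError(f"expected a directory, got {directory!r}")
    parts = ("data", "variables_info", "values_info")
    return [out_dir / f"{name}_{part}.{extension}" for part in parts]


targets = export_targets("output/", "results")
for path in targets:
    print(path.name)
```

Passing `extension="xlsx"` gives the corresponding names for an Excel export.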

to_spss

Writes `{name}.sav` (data) + `{name}.sps` (syntax). Requires `pyreadstat`.

```python
survey.to_spss("output/", name="results")
```


to_json

Writes `{name}.json` in the same structure `read_json` expects (see Section 2), plus an extra `"vtype"` field per variable that `read_json` ignores on re-read. Pretty-printed with 4-space indent, non-ASCII preserved (`ensure_ascii=False`).

```python
survey.to_json("output/", name="results")
```

Common mistake: Do NOT pass `"output/results.csv"`. Pass directory + `name=`.


SPSS Syntax

```python
print(survey.sps)  # full syntax: VARIABLE LABELS, VALUE LABELS, MRSETS, CTABLES
```

8. Gotchas & Rules

- `auto_detect` vs `compact_ids`: Never combine both.
- `value_indices` setter must cover all existing data values; raises `DataStructureError` otherwise.
- `value_indices` is silently skipped for NUMBER variables (in `update()` and direct set).
- Export path is a directory, not a file.
- `get_df()` returns Polars, not pandas.
- `filter()` returns a new Survey; it does not mutate the original.
- Empty strings become `None` during CSV/Excel read.
- Multiselect values are sorted alphabetically within each row.
- All variables in a crosstab must have the same row count.
- `read_csv` raises `FileTypeError` if the file extension is not `.csv`. Same for `read_excel` with non-`.xlsx` and `read_spss` with non-`.sav`.
- `to_csv` / `to_excel` default is `compact=False`: multiselect variables are expanded to wide columns unless you explicitly pass `compact=True`.
- Column names must not contain multiple reserved separators (`_`, `.`, `:`). If a column like `my.var_1` uses more than one, `parse_id` will fail. Rename before loading so only one separator appears (e.g. `myvar_1`).

9. Quick Reference

| Task | Code |
| --- | --- |
| Load CSV auto-detect | `survy.read_csv("f.csv", auto_detect=True, compact_separator=";")` |
| Load CSV explicit compact | `survy.read_csv("f.csv", compact_ids=["Q2"], compact_separator=";")` |
| Load CSV wide format | `survy.read_csv("f.csv", name_pattern="id(_multi)?")` (wide detected automatically) |
| Load SPSS | `survy.read_spss("f.sav")` |
| Load JSON | `survy.read_json("f.json")` |
| Load from Polars DF | `survy.read_polars(df, auto_detect=True)` |
| Inspect variable | `survey["Q1"].vtype`, `.base`, `.len`, `.label`, `.value_indices`, `.dtype` |
| Frequencies | `survey["Q1"].frequencies` |
| Crosstab count | `survy.crosstab(survey["Q1"], survey["Q2"])` |
| Crosstab percent | `survy.crosstab(survey["Q1"], survey["Q2"], aggfunc="percent")` |
| Crosstab with filter | `survy.crosstab(survey["col"], survey["row"], filter=survey["seg"])` |
| Crosstab mean | `survy.crosstab(survey["col"], survey["row"], aggfunc="mean")` |
| Filter respondents | `survey.filter("Q1", ["a", "b"])` |
| Replace values | `survey["Q1"].replace({"old": "new"})` |
| Add variable | `survey.add(polars.Series("x", [1,2,3]))` |
| Drop variable | `survey.drop("Q3")` |
| Sort variables | `survey.sort(key=lambda v: v.id)` |
| Batch update labels | `survey.update([{"id":"Q1","label":"...","value_indices":{...}}])` |
| Get compact DF | `survey.get_df()` |
| Get wide binary DF | `survey.get_df(multiselect_dtype="number")` |
| Export CSV | `survey.to_csv("output/", name="results")` |
| Export SPSS | `survey.to_spss("output/", name="results")` |
| Export JSON | `survey.to_json("output/", name="results")` |
| SPSS syntax string | `survey.sps` |
| Serialize variable | `survey["Q1"].to_dict()` |

10. Reference Files

- `references/api_reference.md`: Complete method signatures with all parameters and return types
- `scripts/validate_survey.py`: Loads a survey file, checks missing labels/value_indices, prints report
- `scripts/batch_export.py`: Reads a survey and exports to CSV, Excel, SPSS, and JSON
- `assets/sample_data.csv`: Wide-format sample dataset
- `assets/sample_data_compact.csv`: Compact-format sample dataset
