data-storyteller

Data Storyteller

Automatically transform raw data into compelling, insight-rich reports. Upload any CSV or Excel file and get back a complete analysis with visualizations, statistical summaries, and narrative explanations, all without writing code.

Core Workflow

1. Load and Analyze Data

    from scripts.data_storyteller import DataStoryteller

    # Initialize with your data file
    storyteller = DataStoryteller("your_data.csv")

    # Or from a pandas DataFrame
    import pandas as pd
    df = pd.read_csv("your_data.csv")
    storyteller = DataStoryteller(df)

2. Generate Full Report

    # Generate comprehensive report
    report = storyteller.generate_report()

    # Access components
    print(report['summary'])         # Executive summary
    print(report['insights'])        # Key findings
    print(report['statistics'])      # Statistical analysis
    print(report['visualizations'])  # Generated chart info

3. Export Options

    # Export to PDF
    storyteller.export_pdf("analysis_report.pdf")

    # Export to HTML (interactive charts)
    storyteller.export_html("analysis_report.html")

    # Export charts only
    storyteller.export_charts("charts/", format="png")

Quick Start Examples

Basic Analysis

    from scripts.data_storyteller import DataStoryteller

    # One-liner full analysis
    # (assumes the object returned by generate_report() also exposes export_pdf())
    DataStoryteller("sales_data.csv").generate_report().export_pdf("report.pdf")

Custom Analysis

    storyteller = DataStoryteller("data.csv")

    # Focus on specific columns
    storyteller.analyze_columns(['revenue', 'customers', 'date'])

    # Set analysis parameters
    report = storyteller.generate_report(
        include_correlations=True,
        include_outliers=True,
        include_trends=True,
        time_column='date',
        chart_style='business',
    )

Features

Auto-Detection

  • Column Types: Numeric, categorical, datetime, text, boolean
  • Data Quality: Missing values, duplicates, outliers
  • Relationships: Correlations, dependencies, groupings
  • Time Series: Trends, seasonality, anomalies
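
As a rough illustration of what column-type detection involves, the sketch below classifies a column from its raw string cells using only the standard library. It is not DataStoryteller's actual detection logic: `infer_column_type`, the boolean token set, the fixed `%Y-%m-%d` date format, and the "at most half as many distinct values as rows" rule are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical set of tokens treated as boolean values
BOOL_TOKENS = {"true", "false", "yes", "no"}


def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def _is_date(text, fmt="%Y-%m-%d"):
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


def infer_column_type(values, max_categories=20):
    """Guess a column type from raw string cells; empty cells are ignored."""
    cells = [v.strip() for v in values if v is not None and v.strip()]
    if not cells:
        return "empty"
    if all(c.lower() in BOOL_TOKENS for c in cells):
        return "boolean"
    if all(_is_number(c) for c in cells):
        return "numeric"
    if all(_is_date(c) for c in cells):
        return "datetime"
    # Few distinct values relative to the column length: treat as categorical
    if len(set(cells)) <= min(max_categories, len(cells) // 2):
        return "categorical"
    return "text"


print(infer_column_type(["1", "2.5", "", "3"]))
print(infer_column_type(["red", "blue", "red", "red", "blue", "red"]))
```

The checks run from most to least specific, so a column only falls through to "text" when nothing stricter matches.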

Generated Visualizations

| Data Type | Charts Generated |
|-----------|------------------|
| Numeric | Histogram, box plot, trend line |
| Categorical | Bar chart, pie chart, frequency table |
| Time Series | Line chart, decomposition, forecast |
| Correlations | Heatmap, scatter matrix |
| Comparisons | Grouped bar, stacked area |

Narrative Insights

The storyteller generates plain-English insights including:
  • Executive summary of key findings
  • Notable patterns and anomalies
  • Statistical significance notes
  • Actionable recommendations
  • Data quality warnings

Output Sections

1. Executive Summary

High-level overview of the dataset and key findings in 2-3 paragraphs.

2. Data Profile

  • Row/column counts
  • Memory usage
  • Missing value analysis
  • Duplicate detection
  • Data type distribution
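
To make the profile items concrete, here is a minimal standard-library sketch that computes row and column counts, per-column missing values, and duplicate rows for a small in-memory CSV. The function name `profile_rows` and the sample data are hypothetical, and memory usage and type distribution are left out; the tool's own profiling may work differently.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: one missing revenue, one missing customers, one repeated row
SAMPLE_CSV = """region,revenue,customers
north,1200,30
south,,25
north,1200,30
east,900,
"""


def profile_rows(rows, columns):
    """Return basic profile numbers for a list of dict rows."""
    missing = {
        col: sum(1 for row in rows if not (row.get(col) or "").strip())
        for col in columns
    }
    # A row counts as a duplicate if an identical row appeared earlier
    counts = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicates = sum(n - 1 for n in counts.values())
    return {
        "rows": len(rows),
        "columns": len(columns),
        "missing": missing,
        "duplicate_rows": duplicates,
    }


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
profile = profile_rows(rows, reader.fieldnames)
print(profile)
```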

3. Statistical Analysis

For each numeric column:
  • Central tendency (mean, median, mode)
  • Dispersion (std dev, IQR, range)
  • Distribution shape (skewness, kurtosis)
  • Outlier count
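
The measures listed above can be computed with Python's `statistics` module, as in the following sketch. The helper name `describe` is hypothetical, the shape measures use the simple population moment estimators, and outliers are counted with the 1.5 × IQR rule; DataStoryteller may use different estimators.

```python
import statistics


def describe(values):
    """Summarize a numeric column (a list of at least 4 non-constant numbers)."""
    n = len(values)
    mean = statistics.fmean(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    # Central moments (population form) used for skewness and kurtosis
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "iqr": iqr,
        "range": max(values) - min(values),
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3,
        "outlier_count": sum(1 for x in values if x < low or x > high),
    }


revenue = [10, 12, 12, 13, 14, 15, 16, 100]
for name, value in describe(revenue).items():
    print(f"{name}: {value}")
```

With one extreme value in the sample, the mean sits well above the median and the skewness is positive, which is the kind of pattern a narrative insight would call out.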
4. Categorical Analysis

For each categorical column:
  • Unique values count
  • Top/bottom categories
  • Frequency distribution
  • Category balance assessment
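
A categorical summary along these lines can be sketched with `collections.Counter`. The function name `summarize_categories` is hypothetical, and using normalized Shannon entropy as the balance score is an assumption; the source does not say how balance is assessed.

```python
import math
from collections import Counter


def summarize_categories(values, top_n=3):
    """Frequency summary for a categorical column (a list of labels)."""
    counts = Counter(values)
    total = sum(counts.values())
    ranked = counts.most_common()  # sorted by frequency, descending
    # Normalized Shannon entropy: 1.0 means perfectly balanced,
    # values near 0 mean one category dominates
    if len(counts) > 1:
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        balance = entropy / math.log(len(counts))
    else:
        balance = 0.0  # a single category has no spread to measure
    return {
        "unique": len(counts),
        "top": ranked[:top_n],
        "bottom": ranked[-top_n:],
        "frequencies": {label: c / total for label, c in ranked},
        "balance": balance,
    }


summary = summarize_categories(["a"] * 6 + ["b"] * 3 + ["c"])
print(summary["top"], round(summary["balance"], 3))
```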

5. Correlation Analysis

  • Correlation matrix with significance
  • Strongest relationships highlighted
  • Multicollinearity warnings
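
The sketch below shows one way to rank column pairs by Pearson correlation and flag likely multicollinearity. The helper names, the 0.9 collinearity cutoff, and the `support_tickets` column are assumptions for illustration; the 0.5 threshold mirrors the `correlation_threshold` shown under Configuration. Significance testing is omitted to keep the example dependency-free; with SciPy installed, `scipy.stats.pearsonr` returns both the coefficient and a p-value.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strong_pairs(columns, threshold=0.5, collinear=0.9):
    """Return (col_a, col_b, r, collinearity_warning) tuples, strongest first."""
    found = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            found.append((name_a, name_b, r, abs(r) >= collinear))
    return sorted(found, key=lambda item: abs(item[2]), reverse=True)


columns = {
    "marketing_spend": [10, 20, 30, 40, 50],
    "new_customers": [12, 25, 31, 38, 52],
    "support_tickets": [7, 3, 9, 4, 6],  # hypothetical, weakly related column
}
for name_a, name_b, r, warn in strong_pairs(columns):
    print(f"{name_a} vs {name_b}: r={r:.2f}, collinearity warning={warn}")
```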

6. Time-Based Analysis

If a datetime column is detected:
  • Trend direction and strength
  • Seasonality patterns
  • Year-over-year comparisons
  • Growth rate calculations
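
Two of these calculations are simple enough to show directly: year-over-year growth and a least-squares trend slope. Both function names are hypothetical, and the sketch assumes the data has already been aggregated into yearly totals or an evenly spaced series; seasonality detection is not covered.

```python
def year_over_year(totals_by_year):
    """Map each year to its growth rate versus the previous year (as a fraction)."""
    growth = {}
    years = sorted(totals_by_year)
    for prev, curr in zip(years, years[1:]):
        growth[curr] = (totals_by_year[curr] - totals_by_year[prev]) / totals_by_year[prev]
    return growth


def trend_slope(values):
    """Least-squares slope of the values against their index (units per period)."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


growth = year_over_year({2023: 1000.0, 2024: 1230.0})
print(f"2024 growth: {growth[2024]:+.0%}")
print(f"slope: {trend_slope([1, 3, 5, 7])}")
```

The sign of the slope gives the trend direction; its strength could be judged, for example, by the correlation between the values and their index.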

7. Visualizations

Auto-generated charts saved to the report:
  • Distribution plots
  • Trend charts
  • Comparison charts
  • Correlation heatmaps

8. Recommendations

Data-driven suggestions:
  • Columns needing attention
  • Potential data quality fixes
  • Analysis suggestions
  • Business implications

Chart Styles

    # Available styles
    styles = ['business', 'scientific', 'minimal', 'dark', 'colorful']
    storyteller.generate_report(chart_style='business')

Configuration

    storyteller = DataStoryteller(df)

    # Configure analysis
    storyteller.config.update({
        'max_categories': 20,          # Max categories to show
        'outlier_method': 'iqr',       # 'iqr', 'zscore', 'isolation'
        'correlation_threshold': 0.5,
        'significance_level': 0.05,
        'date_format': 'auto',         # Or specify like '%Y-%m-%d'
        'language': 'en',              # Narrative language
    })
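
The choice of `outlier_method` can change which points get flagged. The sketch below contrasts the two simplest options on the same data; `find_outliers` and the z-score cutoff of 3.0 are assumptions, and the `'isolation'` option (presumably an isolation-forest approach) is not covered here.

```python
import statistics


def find_outliers(values, method="iqr", z_threshold=3.0):
    """Return the values flagged as outliers by the chosen rule."""
    if method == "iqr":
        q1, _, q3 = statistics.quantiles(values, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        return [x for x in values if x < low or x > high]
    if method == "zscore":
        mean = statistics.fmean(values)
        std = statistics.stdev(values)
        return [x for x in values if abs(x - mean) / std > z_threshold]
    raise ValueError(f"unsupported method: {method!r}")


data = [10, 12, 12, 13, 14, 15, 16, 100]
print(find_outliers(data, method="iqr"))     # [100]
print(find_outliers(data, method="zscore"))  # []
```

On this small sample the IQR rule flags 100, while the z-score rule does not: the extreme value inflates the standard deviation enough to keep its own z-score below 3. That masking effect is one reason a quartile-based rule can be the safer default for small or heavily skewed columns.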

Supported File Formats

| Format | Extension | Notes |
|--------|-----------|-------|
| CSV | .csv | Auto-detect delimiter |
| Excel | .xlsx, .xls | Multi-sheet support |
| JSON | .json | Records or columnar |
| Parquet | .parquet | For large datasets |
| TSV | .tsv | Tab-separated |
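
Delimiter auto-detection for CSV and TSV input can be approximated with the standard library's `csv.Sniffer`. This is only a sketch of the idea: the helper name `read_delimited` and the candidate delimiter set are assumptions, and the tool may detect delimiters some other way.

```python
import csv
import io


def read_delimited(text, candidates=",;\t|"):
    """Detect the delimiter from the text, then parse it into dict rows."""
    # Restricting the candidates keeps the sniffer from picking a letter
    # that happens to appear once on every line
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    rows = list(csv.DictReader(io.StringIO(text), dialect=dialect))
    return dialect.delimiter, rows


delimiter, rows = read_delimited("region;revenue\nnorth;1200\nsouth;950\n")
print(repr(delimiter), rows)
```

`csv.Sniffer.sniff` raises `csv.Error` when it cannot settle on a delimiter, so real code would want a fallback such as defaulting to a comma.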

Example Output

Sample Executive Summary

"This dataset contains 10,847 records across 15 columns, covering sales transactions from January 2023 to December 2024. Revenue shows a strong upward trend (+23% YoY) with clear seasonal peaks in Q4. The top 3 product categories account for 67% of total revenue. Notable finding: Customer acquisition cost has increased 15% while retention rate dropped 8%, suggesting potential profitability concerns worth investigating."

Sample Insight

"Strong correlation detected between marketing_spend and new_customers (r=0.78, p<0.001). However, this relationship weakens significantly after $50K monthly spend, suggesting diminishing returns beyond this threshold."

Best Practices

  1. Clean data first: Remove obvious errors before analysis
  2. Name columns clearly: Helps auto-detection and narratives
  3. Include dates: Enables time-series analysis
  4. Provide context: Tell the storyteller what the data represents

Limitations

  • Recommended maximum: 1M rows, 100 columns
  • Complex nested data may need flattening
  • Images and binary data are not supported
  • PDF export requires the reportlab package

Dependencies

pandas>=2.0.0
numpy>=1.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
scipy>=1.10.0
reportlab>=4.0.0
openpyxl>=3.1.0