data-storyteller
Data Storyteller
Automatically transform raw data into compelling, insight-rich reports. Upload any CSV or Excel file and get back a complete analysis with visualizations, statistical summaries, and narrative explanations - all without writing code.
Core Workflow
1. Load and Analyze Data
```python
from scripts.data_storyteller import DataStoryteller

# Initialize with your data file
storyteller = DataStoryteller("your_data.csv")

# Or from a pandas DataFrame
import pandas as pd
df = pd.read_csv("your_data.csv")
storyteller = DataStoryteller(df)
```
2. Generate Full Report
```python
# Generate comprehensive report
report = storyteller.generate_report()

# Access components
print(report['summary'])         # Executive summary
print(report['insights'])        # Key findings
print(report['statistics'])      # Statistical analysis
print(report['visualizations'])  # Generated chart info
```
3. Export Options
```python
# Export to PDF
storyteller.export_pdf("analysis_report.pdf")

# Export to HTML (interactive charts)
storyteller.export_html("analysis_report.html")

# Export charts only
storyteller.export_charts("charts/", format="png")
```
Quick Start Examples
Basic Analysis
```python
from scripts.data_storyteller import DataStoryteller

# One-liner full analysis
DataStoryteller("sales_data.csv").generate_report().export_pdf("report.pdf")
```
Custom Analysis
```python
storyteller = DataStoryteller("data.csv")

# Focus on specific columns
storyteller.analyze_columns(['revenue', 'customers', 'date'])

# Set analysis parameters
report = storyteller.generate_report(
    include_correlations=True,
    include_outliers=True,
    include_trends=True,
    time_column='date',
    chart_style='business'
)
```
Features
Auto-Detection
- Column Types: Numeric, categorical, datetime, text, boolean
- Data Quality: Missing values, duplicates, outliers
- Relationships: Correlations, dependencies, groupings
- Time Series: Trends, seasonality, anomalies
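To give a feel for what column type detection involves, here is a minimal sketch built on pandas dtype checks. The helper name `detect_column_type` and the rule "few distinct values means categorical" are assumptions made for illustration; the tool's own detection logic is not documented here and may differ.

```python
import pandas as pd
from pandas.api import types as ptypes


def detect_column_type(series: pd.Series, max_categories: int = 20) -> str:
    """Hypothetical helper: classify a column by its pandas dtype."""
    # Check bool before numeric, because pandas counts bool as a numeric dtype.
    if ptypes.is_bool_dtype(series):
        return "boolean"
    if ptypes.is_numeric_dtype(series):
        return "numeric"
    if ptypes.is_datetime64_any_dtype(series):
        return "datetime"
    # Remaining columns are string-like: few distinct values -> categorical.
    if series.nunique(dropna=True) <= max_categories:
        return "categorical"
    return "text"


# Small made-up frame covering four of the five types.
df = pd.DataFrame({
    "revenue": [120.5, 98.0, 143.2],
    "active": [True, False, True],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "region": ["north", "south", "north"],
})
column_types = {name: detect_column_type(df[name], max_categories=2) for name in df.columns}
print(column_types)
```

Dates stored as plain strings would be classified as categorical or text by this sketch, which is one reason parsing dates before analysis helps.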
Generated Visualizations
| Data Type | Charts Generated |
|---|---|
| Numeric | Histogram, box plot, trend line |
| Categorical | Bar chart, pie chart, frequency table |
| Time Series | Line chart, decomposition, forecast |
| Correlations | Heatmap, scatter matrix |
| Comparisons | Grouped bar, stacked area |
Narrative Insights
The storyteller generates plain-English insights including:
- Executive summary of key findings
- Notable patterns and anomalies
- Statistical significance notes
- Actionable recommendations
- Data quality warnings
Output Sections
1. Executive Summary
High-level overview of the dataset and key findings in 2-3 paragraphs.
2. Data Profile
- Row/column counts
- Memory usage
- Missing value analysis
- Duplicate detection
- Data type distribution
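The items in this profile map closely onto standard pandas calls. The sketch below shows one plausible way to collect them; the function name `profile` and the shape of the returned dictionary are assumptions, not the tool's actual output format.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> dict:
    """Hypothetical helper: gather the basic facts listed under Data Profile."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        # deep=True also measures the contents of object (string) columns.
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
    }


# Made-up data with one missing value and one repeated row.
df = pd.DataFrame({
    "customer": ["a", "b", "b", None],
    "revenue": [10.0, 20.0, 20.0, 5.0],
})
data_profile = profile(df)
print(data_profile)
```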
3. Statistical Analysis
For each numeric column:
- Central tendency (mean, median, mode)
- Dispersion (std dev, IQR, range)
- Distribution shape (skewness, kurtosis)
- Outlier count
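As a rough illustration of these per-column statistics, the sketch below computes them with pandas. The helper `describe_numeric` is hypothetical, and the 1.5 × IQR outlier rule is an assumed default (it matches the `'iqr'` option named in the configuration section, but the exact rule the tool applies is not stated).

```python
import pandas as pd


def describe_numeric(series: pd.Series) -> dict:
    """Hypothetical helper: summarize one numeric column."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    # Assumed rule: points beyond 1.5 * IQR from the quartiles are outliers.
    outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
    return {
        "mean": values.mean(),
        "median": values.median(),
        "mode": values.mode().tolist(),  # a column can have several modes
        "std": values.std(),             # sample standard deviation (ddof=1)
        "iqr": iqr,
        "range": values.max() - values.min(),
        "skewness": values.skew(),
        "kurtosis": values.kurt(),       # excess kurtosis (normal = 0)
        "outlier_count": len(outliers),
    }


# Made-up sample with one extreme value.
summary_stats = describe_numeric(pd.Series([10, 12, 12, 13, 14, 15, 100]))
print(summary_stats)
```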
4. Categorical Analysis
For each categorical column:
- Unique values count
- Top/bottom categories
- Frequency distribution
- Category balance assessment
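The categorical summary can be approximated with `value_counts`. In the sketch below, the helper `describe_categorical` is hypothetical, and normalized Shannon entropy is only one possible way to score category balance; the source does not say which measure the tool uses.

```python
import numpy as np
import pandas as pd


def describe_categorical(series: pd.Series, top_n: int = 3) -> dict:
    """Hypothetical helper: summarize one categorical column."""
    counts = series.value_counts()  # sorted by frequency, descending
    shares = counts / counts.sum()
    # Assumed balance score: normalized Shannon entropy.
    # 1.0 means perfectly balanced; values near 0 mean one category dominates.
    if len(counts) > 1:
        balance = float(-(shares * np.log(shares)).sum() / np.log(len(counts)))
    else:
        balance = 0.0
    return {
        "unique_values": int(series.nunique()),
        "top": counts.head(top_n).to_dict(),
        "bottom": counts.tail(top_n).to_dict(),
        "frequency": shares.round(3).to_dict(),
        "balance": balance,
    }


# Made-up column with three categories of unequal size.
summary = describe_categorical(pd.Series(["a", "a", "a", "b", "b", "c"]), top_n=1)
print(summary)
```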
5. Correlation Analysis
- Correlation matrix with significance
- Strongest relationships highlighted
- Multicollinearity warnings
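A correlation analysis of this kind can be sketched with `scipy.stats.pearsonr`, which returns both the coefficient and a p-value. The helper `correlation_report` is hypothetical. The 0.5 and 0.05 defaults mirror the `correlation_threshold` and `significance_level` values shown in the configuration section, while the 0.9 multicollinearity cutoff is purely an assumption. The data is synthetic.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats


def correlation_report(df, threshold=0.5, alpha=0.05, collinear_at=0.9):
    """Hypothetical helper: pairwise Pearson correlations, strongest first."""
    rows = []
    for a, b in combinations(df.select_dtypes("number").columns, 2):
        pair = df[[a, b]].dropna()
        r, p = stats.pearsonr(pair[a], pair[b])
        rows.append({
            "pair": (a, b),
            "r": r,
            "p": p,
            "significant": p < alpha,
            "strong": abs(r) >= threshold,
            "multicollinear": abs(r) >= collinear_at,  # assumed cutoff
        })
    return sorted(rows, key=lambda row: abs(row["r"]), reverse=True)


# Synthetic data: new_customers depends on marketing_spend, noise is unrelated.
rng = np.random.default_rng(0)
spend = rng.uniform(10, 50, size=200)
df = pd.DataFrame({
    "marketing_spend": spend,
    "new_customers": 3 * spend + rng.normal(0, 5, size=200),
    "noise": rng.normal(0, 1, size=200),
})
report = correlation_report(df)
for row in report:
    print(row["pair"], round(row["r"], 2), row["significant"])
```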
6. Time-Based Analysis
If datetime column detected:
- Trend direction and strength
- Seasonality patterns
- Year-over-year comparisons
- Growth rate calculations
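Trend and year-over-year figures like these can be derived with a linear fit and a yearly aggregation. The sketch below is an illustration under stated assumptions: `time_summary` is a hypothetical helper, trend strength is taken to be the r² of a least-squares line, and year-over-year growth compares annual totals. Seasonality detection is left out. The data is synthetic.

```python
import pandas as pd
from scipy import stats


def time_summary(df: pd.DataFrame, time_column: str, value_column: str) -> dict:
    """Hypothetical helper: trend direction/strength and year-over-year growth."""
    series = df.set_index(time_column)[value_column].sort_index()
    # Trend: slope of a least-squares line over elapsed days; r^2 as strength.
    days = (series.index - series.index[0]).days.to_numpy()
    fit = stats.linregress(days, series.to_numpy())
    # Year-over-year: relative change between consecutive annual totals.
    yearly = series.groupby(series.index.year).sum()
    yoy = yearly.pct_change().dropna()
    return {
        "direction": "up" if fit.slope > 0 else "down",
        "slope_per_day": fit.slope,
        "r_squared": fit.rvalue ** 2,
        "yoy_growth": yoy.to_dict(),
    }


# Synthetic monthly revenue rising by 5 units per month over two years.
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=24, freq="MS"),
    "revenue": [100 + 5 * i for i in range(24)],
})
trend = time_summary(df, "date", "revenue")
print(trend)
```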
7. Visualizations
Auto-generated charts saved to report:
- Distribution plots
- Trend charts
- Comparison charts
- Correlation heatmaps
8. Recommendations
Data-driven suggestions:
- Columns needing attention
- Potential data quality fixes
- Analysis suggestions
- Business implications
Chart Styles
```python
# Available styles
styles = ['business', 'scientific', 'minimal', 'dark', 'colorful']
storyteller.generate_report(chart_style='business')
```
Configuration
```python
storyteller = DataStoryteller(df)

# Configure analysis
storyteller.config.update({
    'max_categories': 20,          # Max categories to show
    'outlier_method': 'iqr',       # 'iqr', 'zscore', 'isolation'
    'correlation_threshold': 0.5,
    'significance_level': 0.05,
    'date_format': 'auto',         # Or specify like '%Y-%m-%d'
    'language': 'en',              # Narrative language
})
```
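The choice of `outlier_method` can change which points get flagged. The sketch below contrasts two of the three listed methods using NumPy only. The helper `flag_outliers`, the 1.5 × IQR fence, and the |z| > 3 cutoff are assumptions for illustration; the `'isolation'` option is omitted because it would presumably need a library that is not in the dependency list.

```python
import numpy as np


def flag_outliers(values, method="iqr"):
    """Hypothetical helper: boolean mask marking outliers in a 1-D array."""
    values = np.asarray(values, dtype=float)
    if method == "iqr":
        q1, q3 = np.percentile(values, [25, 75])
        spread = q3 - q1
        # Assumed fence: 1.5 * IQR beyond the quartiles.
        return (values < q1 - 1.5 * spread) | (values > q3 + 1.5 * spread)
    if method == "zscore":
        # Assumed cutoff: more than 3 population standard deviations from the mean.
        z = (values - values.mean()) / values.std()
        return np.abs(z) > 3.0
    raise ValueError(f"unsupported method: {method!r}")


data = [10, 12, 12, 13, 14, 15, 100]
iqr_mask = flag_outliers(data, "iqr")
z_mask = flag_outliers(data, "zscore")
print(iqr_mask.tolist())
print(z_mask.tolist())
```

On this seven-point sample the IQR rule flags the value 100, while the z-score rule flags nothing: with so few points the extreme value inflates the standard deviation, so no z-score can exceed √6 ≈ 2.45. The IQR rule is usually the safer default for small or skewed datasets.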
Supported File Formats
| Format | Extension | Notes |
|---|---|---|
| CSV | .csv | Auto-detect delimiter |
| Excel | .xlsx, .xls | Multi-sheet support |
| JSON | .json | Records or columnar |
| Parquet | .parquet | For large datasets |
| TSV | .tsv | Tab-separated |
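For readers who want to pre-load a file into a DataFrame themselves, the formats in this table correspond to standard pandas readers. The dispatcher below is a sketch: `load_table` is a hypothetical helper, not the tool's own loader, and it reads only the first Excel sheet.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_table(path: str) -> pd.DataFrame:
    """Hypothetical helper: pick a pandas reader from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        # sep=None with the python engine lets pandas sniff the delimiter.
        return pd.read_csv(path, sep=None, engine="python")
    if suffix == ".tsv":
        return pd.read_csv(path, sep="\t")
    if suffix in {".xlsx", ".xls"}:
        # First sheet only; sheet_name=None would return every sheet as a dict.
        return pd.read_excel(path)
    if suffix == ".json":
        return pd.read_json(path)
    if suffix == ".parquet":
        # Needs a parquet engine such as pyarrow or fastparquet installed.
        return pd.read_parquet(path)
    raise ValueError(f"Unsupported file format: {suffix}")


# Demo: a semicolon-delimited CSV is parsed without naming the delimiter.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sales.csv"
    sample.write_text("region;revenue\nnorth;10\nsouth;20\n")
    loaded = load_table(str(sample))
print(loaded)
```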
Example Output
Sample Executive Summary
"This dataset contains 10,847 records across 15 columns, covering sales transactions from January 2023 to December 2024. Revenue shows a strong upward trend (+23% YoY) with clear seasonal peaks in Q4. The top 3 product categories account for 67% of total revenue. Notable finding: Customer acquisition cost has increased 15% while retention rate dropped 8%, suggesting potential profitability concerns worth investigating."
Sample Insight
"Strong correlation detected between marketing_spend and new_customers (r=0.78, p<0.001). However, this relationship weakens significantly after $50K monthly spend, suggesting diminishing returns beyond this threshold."
Best Practices
- Clean data first: Remove obvious errors before analysis
- Name columns clearly: Helps auto-detection and narratives
- Include dates: Enables time-series analysis
- Provide context: Tell the storyteller what the data represents
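The first three practices can be applied with a few lines of pandas before handing the data over. The sketch below is one possible pre-cleaning pass. The helper `tidy` and its rule of parsing any column whose name contains "date" are assumptions for illustration.

```python
import pandas as pd
from pandas.api import types as ptypes


def tidy(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: light cleanup before analysis."""
    cleaned = df.copy()
    # Clear, consistent column names: trimmed, lowercase, underscores.
    cleaned.columns = [str(c).strip().lower().replace(" ", "_") for c in cleaned.columns]
    # Remove exact duplicate rows.
    cleaned = cleaned.drop_duplicates()
    # Assumed convention: columns with "date" in the name hold dates.
    for column in cleaned.columns:
        if "date" in column:
            # Unparseable entries become NaT instead of raising an error.
            cleaned[column] = pd.to_datetime(cleaned[column], errors="coerce")
    return cleaned


# Made-up data with messy headers, a duplicate row, and one bad date.
raw = pd.DataFrame({
    " Order Date ": ["2024-01-05", "2024-01-05", "not a date"],
    "Revenue": [10, 10, 5],
})
cleaned = tidy(raw)
print(cleaned)
```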
Limitations
- Maximum recommended: 1M rows, 100 columns
- Complex nested data may need flattening
- Images/binary data not supported
- PDF export requires reportlab package
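These limits can be checked up front. The sketch below is a hypothetical pre-flight check, not something the tool is documented to provide: `preflight` compares a DataFrame against the recommended size limits and looks for `reportlab`, and `pandas.json_normalize` shows one way to flatten nested records.

```python
import importlib.util

import pandas as pd

MAX_ROWS = 1_000_000  # recommended maximum from the list above
MAX_COLUMNS = 100


def preflight(df: pd.DataFrame, need_pdf: bool = False) -> list:
    """Hypothetical helper: collect warnings before running an analysis."""
    warnings = []
    if len(df) > MAX_ROWS:
        warnings.append(f"{len(df)} rows exceeds the recommended {MAX_ROWS}")
    if df.shape[1] > MAX_COLUMNS:
        warnings.append(f"{df.shape[1]} columns exceeds the recommended {MAX_COLUMNS}")
    if need_pdf and importlib.util.find_spec("reportlab") is None:
        warnings.append("PDF export needs the reportlab package")
    return warnings


# An empty frame with 101 columns trips the column limit only.
wide = pd.DataFrame(columns=range(101))
size_warnings = preflight(wide)
print(size_warnings)

# Flatten nested records into plain columns (made-up example record).
records = [{"id": 1, "customer": {"name": "Ana", "city": "Lima"}}]
flat = pd.json_normalize(records, sep="_")
print(list(flat.columns))
```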
Dependencies
pandas>=2.0.0
numpy>=1.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
scipy>=1.10.0
reportlab>=4.0.0
openpyxl>=3.1.0