dummy-dataset
Dummy Dataset Generation
Generate realistic dummy datasets for testing with customizable columns, constraints, and output formats (CSV, JSON, SQL, Python script). Creates executable scripts or direct data files for immediate use.
Use when: Creating test data, generating sample datasets, building realistic mock data for development, or populating test environments.
Arguments:
- $PRODUCT: The product or system name
- $DATASET_TYPE: Type of data (e.g., customer feedback, transactions, user profiles)
- $ROWS: Number of rows to generate (default: 100)
- $COLUMNS: Specific columns or fields to include
- $FORMAT: Output format (CSV, JSON, SQL, Python script)
- $CONSTRAINTS: Additional constraints or business rules
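One way to picture how these arguments fit together is as a small request object with defaults. The sketch below is illustrative only: the `DatasetRequest` class, its field names, and the `ALLOWED_FORMATS` set are hypothetical and not part of the skill itself. Only the default of 100 rows and the four output formats come from the list above.

```python
from dataclasses import dataclass, field

# Assumption: lowercase short names for the four formats listed above.
ALLOWED_FORMATS = {"csv", "json", "sql", "python"}


@dataclass
class DatasetRequest:
    """Hypothetical container mirroring the skill's arguments."""
    product: str                      # $PRODUCT
    dataset_type: str                 # $DATASET_TYPE
    rows: int = 100                   # $ROWS (documented default: 100)
    columns: list[str] = field(default_factory=list)      # $COLUMNS
    output_format: str = "csv"        # $FORMAT (default chosen for this sketch)
    constraints: list[str] = field(default_factory=list)  # $CONSTRAINTS

    def __post_init__(self):
        # Normalize the format and reject values outside the documented set.
        self.output_format = self.output_format.lower()
        if self.output_format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {self.output_format}")
        if self.rows < 1:
            raise ValueError("rows must be a positive integer")


request = DatasetRequest(product="ShopApp", dataset_type="customer feedback")
print(request.rows, request.output_format)
```

Validating the format and row count up front keeps later generation code free of defensive checks.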
Step-by-Step Process
- Identify dataset type - Understand the data domain
- Define column specifications - Names, data types, and value ranges
- Determine row count - How many sample records needed
- Select output format - CSV, JSON, SQL INSERT, or Python script
- Apply realistic patterns - Ensure data looks authentic and valid
- Add business constraints - Respect business logic and relationships
- Generate or script data - Create executable output
- Validate output - Ensure data quality and completeness
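The steps above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the name lists, the `example.com` email domain, the fixed reference date, and the helper names (`make_columns`, `generate`, `validate`) are all invented for the example.

```python
import random
from datetime import datetime, timedelta

# Assumption: tiny illustrative name pools; a real dataset would use larger ones.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev"]
LAST_NAMES = ["Kim", "Lopez", "Nguyen", "Smith"]


def make_columns(rng, now):
    """Column specifications: map each column name to a function of the row index."""
    def name(_i):
        return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"

    def created_at(_i):
        # A timestamp somewhere in the 90 days before `now`.
        offset = timedelta(minutes=rng.randint(0, 90 * 24 * 60))
        return (now - offset).isoformat(timespec="seconds")

    return {
        "id": lambda i: f"U{i:06d}",
        "name": name,
        "created_at": created_at,
    }


def generate(rows, seed=0):
    """Generate `rows` records; a fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    now = datetime(2024, 1, 1)  # assumption: fixed reference date for repeatable output
    columns = make_columns(rng, now)
    records = []
    for i in range(1, rows + 1):
        record = {col: gen(i) for col, gen in columns.items()}
        # Derive the email from the name so the two fields agree.
        record["email"] = record["name"].lower().replace(" ", ".") + "@example.com"
        records.append(record)
    return records


def validate(records, expected_rows):
    """Return a list of problems found; an empty list means the data passed."""
    problems = []
    if len(records) != expected_rows:
        problems.append("row count mismatch")
    ids = [r["id"] for r in records]
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    for r in records:
        if "@" not in r["email"]:
            problems.append(f"bad email in {r['id']}")
    return problems


records = generate(5)
print(records[0])
print(validate(records, 5))
```

Using a seeded `random.Random` instance instead of the module-level functions keeps test data stable between runs, which helps when the dataset feeds automated tests.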
Template: Python Script Output
```python
import csv
import json
from datetime import datetime, timedelta
import random

# Configuration
# $ROWS and $DATASET_TYPE are template placeholders; fill them in before running.
ROWS = $ROWS
FILENAME = "$DATASET_TYPE.csv"

# Column definitions with realistic value generators
columns = {
    "id": "auto-increment",
    "name": "first_last_name",
    "email": "email",
    "created_at": "timestamp",
    # Add more columns...
}


def generate_dataset():
    """Generate realistic dummy dataset"""
    data = []
    for i in range(1, ROWS + 1):
        record = {
            "id": f"U{i:06d}",
            # Generate values based on column definitions
        }
        data.append(record)
    return data


def save_as_csv(data, filename):
    """Save dataset as CSV"""
    with open(filename, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)


if __name__ == "__main__":
    dataset = generate_dataset()
    save_as_csv(dataset, FILENAME)
    print(f"Generated {len(dataset)} records in {FILENAME}")
```
Example Dataset Specification
Dataset Type: Customer Feedback
Columns:
- feedback_id (auto-increment, U001, U002...)
- customer_name (realistic names)
- email (valid email format)
- feedback_date (dates within the last 90 days)
- rating (1-5 stars)
- category (Bug, Feature Request, Complaint, Praise)
- text (realistic feedback)
- product (electronics, clothing, home)
Constraints:
- Ratings skewed: 40% 5-star, 30% 4-star, 20% 3-star, 10% 1-2 stars
- Bug category only with ratings 1-3
- Feature requests only with ratings 3-5
- Email domains realistic (gmail, yahoo, company.com)
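The rating and category constraints in this specification could be enforced roughly as follows. This is a sketch under assumptions: the specification gives 10% to "1-2 star" without splitting it, so the even 5%/5% split here is a guess, and the helper names (`pick_rating`, `pick_category`, `generate_feedback`) are hypothetical. Complaint and Praise are left unrestricted because the specification does not constrain them.

```python
import random

# Rating distribution from the specification.
# Assumption: the 10% share for 1-2 stars is split evenly between the two.
RATING_WEIGHTS = {5: 0.40, 4: 0.30, 3: 0.20, 2: 0.05, 1: 0.05}

# Ratings each category may appear with. Only Bug and Feature Request
# are restricted by the specification.
CATEGORY_RATINGS = {
    "Bug": {1, 2, 3},
    "Feature Request": {3, 4, 5},
    "Complaint": {1, 2, 3, 4, 5},
    "Praise": {1, 2, 3, 4, 5},
}


def pick_rating(rng):
    """Draw one rating according to the weighted distribution."""
    ratings = list(RATING_WEIGHTS)
    weights = list(RATING_WEIGHTS.values())
    return rng.choices(ratings, weights=weights, k=1)[0]


def pick_category(rng, rating):
    """Pick a category uniformly among those allowed for this rating."""
    allowed = [c for c, ok in CATEGORY_RATINGS.items() if rating in ok]
    return rng.choice(allowed)


def generate_feedback(rows, seed=0):
    """Generate feedback records that satisfy the rating/category rules."""
    rng = random.Random(seed)
    records = []
    for i in range(1, rows + 1):
        rating = pick_rating(rng)
        records.append({
            "feedback_id": f"U{i:03d}",
            "rating": rating,
            "category": pick_category(rng, rating),
        })
    return records


feedback = generate_feedback(1000)
print(feedback[0])
```

Drawing the rating first and then choosing a category compatible with it preserves the target rating distribution. Rejecting and redrawing invalid pairs would also satisfy the rules but would shift the rating shares away from the specified percentages.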
Output Deliverables
- Ready-to-execute Python script OR direct data file
- CSV file with proper headers and formatting
- JSON file with valid structure and types
- SQL INSERT statements for database population
- Data validation and constraint compliance
- Realistic, business-appropriate values
- Documentation of data generation logic
- Quick-start instructions for using the dataset
Output Formats
CSV: Flat tabular format, easy to import into spreadsheets and databases
JSON: Nested structure, ideal for APIs and NoSQL databases
SQL: INSERT statements, directly executable on relational databases
Python Script: Executable generator for custom or large datasets
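To make the difference between the formats concrete, the sketch below renders the same two records as CSV, JSON, and SQL INSERT statements. The helper names (`to_csv`, `to_json`, `to_sql`, `sql_literal`), the sample records, and the `feedback` table name are hypothetical. The SQL output assumes a trusted table name and standard single-quote escaping, which may need adjusting for a specific database dialect.

```python
import csv
import io
import json


def to_csv(records):
    """Render records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json(records):
    """Render records as a JSON array of objects."""
    return json.dumps(records, indent=2)


def sql_literal(value):
    """Format a Python value as a SQL literal (numbers, strings, NULL only)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape single quotes by doubling them, per standard SQL.
    return "'" + str(value).replace("'", "''") + "'"


def to_sql(records, table):
    """Render one INSERT statement per record. `table` is assumed trusted."""
    cols = ", ".join(records[0].keys())
    lines = []
    for record in records:
        values = ", ".join(sql_literal(v) for v in record.values())
        lines.append(f"INSERT INTO {table} ({cols}) VALUES ({values});")
    return "\n".join(lines)


# Hypothetical sample records.
records = [
    {"id": 1, "name": "Ana O'Neil", "rating": 5},
    {"id": 2, "name": "Ben Kim", "rating": 3},
]

print(to_csv(records))
print(to_json(records))
print(to_sql(records, "feedback"))
```

Building SQL by string formatting is acceptable for generating seed files from data you created yourself; application code that handles external input should use parameterized queries instead.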