r-data-science
R Data Science
Overview
Generate high-quality R code following tidyverse conventions and modern best practices. This skill covers data manipulation, visualization, statistical analysis, and reproducible research workflows commonly used in public health, epidemiology, and data science.
Core Principles
- Tidyverse-first: Use tidyverse packages (dplyr, tidyr, ggplot2, purrr, readr) as the default approach
- Pipe-forward: Use the native pipe `|>` for chains (R 4.1+); fall back to `%>%` for older versions
- Reproducibility: Structure all work for reproducibility using Quarto, renv, and clear documentation
- Defensive coding: Validate inputs, handle missing data explicitly, and fail informatively
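As a minimal sketch of how the pipe-forward and defensive-coding principles can combine, the helper below validates its input, reports missing values explicitly, and only then summarizes. The function name `summarize_cases()` and the toy counts are hypothetical, and base R is used so the snippet runs without extra packages (the native pipe needs R 4.1+).

```r
# Hypothetical helper: validate input, handle NA explicitly, fail informatively
summarize_cases <- function(cases) {
  if (!is.numeric(cases)) {
    stop("`cases` must be numeric, not ", class(cases)[1], call. = FALSE)
  }
  n_missing <- sum(is.na(cases))
  if (n_missing > 0) {
    warning(n_missing, " missing value(s) dropped from `cases`", call. = FALSE)
  }
  cases[!is.na(cases)] |>
    sum()
}

# Illustrative data only
toy_counts <- c(12, 7, 30)
total_cases <- toy_counts |> summarize_cases()
```

Passing a character vector stops with a message that names the offending class, which is usually easier to debug than a silent `NA` further down the pipeline.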
Quick Reference: Common Patterns
Data Import
r
library(tidyverse)

# CSV (most common)
df <- read_csv("data/raw/dataset.csv")

# Excel
df <- readxl::read_excel("data/raw/dataset.xlsx", sheet = "Sheet1")

# Clean column names immediately
df <- df |> janitor::clean_names()
Data Wrangling Pipeline
r
analysis_data <- raw_data |>
  # Clean and filter
  filter(!is.na(key_variable)) |>
  # Transform variables
  mutate(
    date = as.Date(date_string, format = "%Y-%m-%d"),
    # right = FALSE makes intervals [0, 18), [18, 45), ... so the labels match
    age_group = cut(age, breaks = c(0, 18, 45, 65, Inf),
                    labels = c("0-17", "18-44", "45-64", "65+"),
                    right = FALSE)
  ) |>
  # Summarize
  group_by(region, age_group) |>
  summarize(
    n = n(),
    mean_value = mean(outcome, na.rm = TRUE),
    .groups = "drop"
  )
Basic ggplot2 Visualization
r
ggplot(df, aes(x = date, y = count, color = category)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Trend Over Time",
    subtitle = "By category",
    x = "Date",
    y = "Count",
    color = "Category",
    caption = "Source: Dataset Name"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )
Tidyverse Style Guide Essentials
Naming Conventions
- snake_case for objects and functions: `case_counts`, `calculate_rate()`
- Verbs for functions: `filter_outliers()`, `compute_summary()`
- Nouns for data: `patient_data`, `surveillance_df`
- Avoid: dots in names (reserved for S3), single letters except in lambdas
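The conventions above might look like this in practice. This is only a sketch: `case_counts` and `calculate_rate()` reuse the example names from the list, while the columns, the `per` argument, and the numbers are made up for illustration.

```r
# Noun for data: snake_case and descriptive
case_counts <- data.frame(
  county = c("A", "B"),
  cases = c(12, 30),
  population = c(10000, 25000)
)

# Verb for a function: the name says what it does
calculate_rate <- function(cases, population, per = 100000) {
  cases / population * per
}

case_counts$rate_per_100k <- calculate_rate(
  cases = case_counts$cases,
  population = case_counts$population
)

# A single-letter name is acceptable inside a short lambda (R 4.1+ syntax)
rounded_rates <- vapply(case_counts$rate_per_100k, \(x) round(x, 1), numeric(1))
```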
Code Formatting
- Indentation: 2 spaces (never tabs)
- Line length: 80 characters maximum
- Operators: Spaces around `<-`, `=`, `+`, `|>`, but not `:`, `::`, `$`
- Commas: Space after, never before
- Pipes: New line after each `|>`
r
# Good
result <- data |>
  filter(year >= 2020) |>
  group_by(county) |>
  summarize(total = sum(cases))

# Bad
result<-data|>filter(year>=2020)|>group_by(county)|>summarize(total=sum(cases))
Assignment
- Use `<-` for assignment, never `=` or `->`
- Use `=` only for function arguments
Comments
r
# Load and clean surveillance data ------------------------------------------

# Calculate age-adjusted rates
# Using direct standardization method per CDC guidelines
adjusted_rate <- calculate_adjusted_rate(df, standard_pop)
Package Ecosystem
Core Tidyverse (Always Load)
r
library(tidyverse)  # Loads: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
Data Import/Export
| Task | Package | Key Functions |
|---|---|---|
| CSV/TSV | readr | |
| Excel | readxl, writexl | |
| SAS/SPSS/Stata | haven | |
| JSON | jsonlite | |
| Databases | DBI, dbplyr | |
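As a hedged illustration of two of the packages listed above, the snippet below does a CSV round trip with readr and a JSON round trip with jsonlite. The choice of functions is an assumption made for this example rather than a list taken from the table, and the data and temporary file paths are made up.

```r
library(readr)
library(jsonlite)

# Toy data for illustration only
case_counts <- data.frame(county = c("A", "B"), cases = c(12L, 30L))

# CSV round trip with readr
csv_path <- tempfile(fileext = ".csv")
write_csv(case_counts, csv_path)
from_csv <- read_csv(csv_path, show_col_types = FALSE)

# JSON round trip with jsonlite
json_path <- tempfile(fileext = ".json")
write_json(case_counts, json_path)
from_json <- read_json(json_path, simplifyVector = TRUE)
```

Writing to `tempfile()` keeps the example from touching `data/raw/`, which the project layout treats as read-only.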
Data Manipulation
| Task | Package | Key Functions |
|---|---|---|
| Column cleaning | janitor | |
| Date handling | lubridate | |
| String operations | stringr | |
| Missing data | naniar | |
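A short sketch of date and string cleaning with lubridate and stringr follows. The functions shown are an illustrative selection rather than the table's own list, and the raw values are hypothetical.

```r
library(lubridate)
library(stringr)

# Toy records for illustration only
raw_dates <- c("2024-01-03", "2024-01-10")
raw_names <- c("  north   county ", "SOUTH county")

# lubridate: parse ISO dates, then round down to the Monday of each week
onset_date <- ymd(raw_dates)
onset_week <- floor_date(onset_date, unit = "week", week_start = 1)

# stringr: trim and collapse whitespace, then standardize case
county <- raw_names |>
  str_squish() |>
  str_to_title()
```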
Visualization
| Task | Package | Key Functions |
|---|---|---|
| Core plotting | ggplot2 | |
| Extensions | ggrepel, patchwork | |
| Interactive | plotly | |
| Tables | gt, kableExtra | |
Statistical Analysis
| Task | Package | Key Functions |
|---|---|---|
| Model summaries | broom | |
| Regression | stats, lme4 | |
| Survival | survival | |
| Survey data | survey | |
Epidemiology & Public Health
| Task | Package | Key Functions |
|---|---|---|
| Epi calculations | epiR | |
| Outbreak tools | incidence2, epicontacts | |
| Disease mapping | SpatialEpi | |
| Surveillance | surveillance | |
| Rate calculations | epitools | |
Reproducibility Standards
Project Structure
project/
├── project.Rproj
├── renv.lock
├── CLAUDE.md              # Claude Code configuration
├── README.md
├── data/
│   ├── raw/               # Never modify
│   └── processed/         # Analysis-ready
├── R/                     # Custom functions
├── scripts/               # Pipeline scripts
├── analysis/              # Quarto documents
└── output/
    ├── figures/
    └── tables/
Quarto Document Header
yaml
---
title: "Analysis Title"
author: "Your Name"
date: today
format:
  html:
    toc: true
    code-fold: true
    embed-resources: true
execute:
  warning: false
  message: false
---
Package Management with renv
r
# Initialize (once per project)
renv::init()

# Snapshot dependencies after installing packages
renv::snapshot()

# Restore environment (for collaborators)
renv::restore()
Workflow Documentation
Always include at the top of scripts:
r
# ============================================================================
# Title: Analysis of [Subject]
# Author: [Name]
# Date: [Date]
# Purpose: [One-sentence description]
# Input: data/processed/clean_data.csv
# Output: output/figures/trend_plot.png
# ============================================================================
Common Analysis Patterns
Descriptive Statistics Table
r
df |>
  group_by(category) |>
  summarize(
    n = n(),
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    q25 = quantile(value, 0.25, na.rm = TRUE),
    q75 = quantile(value, 0.75, na.rm = TRUE)
  ) |>
  gt::gt() |>
  # is.double leaves the integer count n unformatted (no "12.00")
  gt::fmt_number(columns = where(is.double), decimals = 2)
Regression with Tidy Output
r
model <- glm(outcome ~ exposure + age + sex, data = df, family = binomial)

# Tidy coefficients
tidy_results <- broom::tidy(model, conf.int = TRUE, exponentiate = TRUE) |>
  select(term, estimate, conf.low, conf.high, p.value)

# Model diagnostics
glance_results <- broom::glance(model)
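With `exponentiate = TRUE`, the estimates are reported on the odds-ratio scale instead of the log-odds scale. As a rough base-R sketch of that transformation, the snippet below fits a logistic model to the built-in `mtcars` data (chosen only for illustration) and exponentiates the coefficients by hand. One assumption to note: `confint.default()` gives Wald intervals, which may differ somewhat from the intervals `broom::tidy()` returns.

```r
# Illustrative model on built-in data: transmission type (am) vs. weight (wt)
model <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients on the log-odds scale, with Wald 95% intervals
log_odds <- coef(model)
wald_ci <- confint.default(model, level = 0.95)

# Exponentiate to move from log-odds to odds ratios
or_table <- data.frame(
  term = names(log_odds),
  odds_ratio = exp(log_odds),
  conf_low = exp(wald_ci[, 1]),
  conf_high = exp(wald_ci[, 2]),
  row.names = NULL
)
```

An odds ratio below 1 for `wt` means heavier cars in this dataset have lower odds of a manual transmission. The intercept row is the baseline odds, not an odds ratio, and is usually dropped from reports.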
Epi Curve (Epidemic Curve)
r
library(incidence2)

# Create incidence object
inc <- incidence(
  df,
  date_index = "onset_date",
  interval = "week",
  groups = "outcome_category"
)

# Plot
plot(inc) +
  labs(
    title = "Epidemic Curve",
    x = "Week of Onset",
    y = "Number of Cases"
  ) +
  theme_minimal()
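An epidemic curve is a bar chart of case counts per time interval. To make the underlying aggregation concrete, here is a base-R sketch that bins a hypothetical line list into weeks. The dates are made up, and `cut()` is used only for illustration; incidence2 may define week boundaries differently depending on its settings.

```r
# Hypothetical line list: one onset date per case
onset_date <- as.Date(c(
  "2024-01-01", "2024-01-03", "2024-01-09", "2024-01-10", "2024-01-11"
))

# Bin onset dates into weeks (cut.Date starts weeks on Monday by default)
onset_week <- cut(onset_date, breaks = "week")

# Count cases per week: these are the bar heights of the curve
weekly_counts <- as.data.frame(table(onset_week), responseName = "cases")
```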
Rate Calculation
r
# Age-adjusted rates using direct standardization
library(epitools)

# Stratum-specific counts and populations
result <- ageadjust.direct(
  count = df$cases,
  pop = df$population,
  stdpop = standard_population$pop  # e.g., US 2000 standard
)
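Direct standardization weights each stratum-specific rate by the standard population's share of that stratum. A minimal base-R sketch of the point estimate follows. The three age strata and all numbers are hypothetical, and confidence intervals, which `ageadjust.direct()` also reports, are omitted.

```r
# Hypothetical stratum-level data (three age groups)
cases <- c(10, 40, 150)
population <- c(50000, 40000, 10000)
standard_pop <- c(30000, 40000, 30000)

# Stratum-specific rates
stratum_rate <- cases / population

# Weight each rate by the standard population's share of that stratum
std_weight <- standard_pop / sum(standard_pop)
adjusted_rate <- sum(stratum_rate * std_weight)

# Crude and adjusted rates per 100,000 for comparison
crude_per_100k <- sum(cases) / sum(population) * 100000  # 200
adjusted_per_100k <- adjusted_rate * 100000              # 496
```

In this toy example the adjusted rate is higher than the crude rate because the standard population puts more weight on the oldest stratum, where the rate is highest.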
Error Handling
Defensive Data Checks
r
# Validate data before analysis
stopifnot(
  "Data frame is empty" = nrow(df) > 0,
  "Missing required columns" = all(c("id", "date", "value") %in% names(df)),
  "Duplicate IDs found" = !any(duplicated(df$id))
)

# Informative warnings for data quality issues
if (sum(is.na(df$key_var)) > 0) {
  warning(sprintf("%d missing values in key_var (%.1f%%)",
                  sum(is.na(df$key_var)),
                  100 * mean(is.na(df$key_var))))
}
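Named conditions in `stopifnot()` (available since R 4.0) become the error message when a check fails. The sketch below triggers one on a toy data frame with a deliberately duplicated ID and captures the message instead of halting. The data and the names `required_cols` and `check_result` are illustrative.

```r
# Toy data with a deliberately duplicated id
df <- data.frame(
  id = c(1, 2, 2),
  date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-02")),
  value = c(5, 8, 8)
)
required_cols <- c("id", "date", "value")

# Capture the error rather than stopping, so the message can be inspected
check_result <- tryCatch(
  {
    stopifnot(
      "Data frame is empty" = nrow(df) > 0,
      "Missing required columns" = all(required_cols %in% names(df)),
      "Duplicate IDs found" = !any(duplicated(df$id))
    )
    "all checks passed"
  },
  error = function(e) conditionMessage(e)
)
```

Checks are evaluated in order and stop at the first failure, so it helps to put cheap structural checks (empty data, missing columns) ahead of content checks.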
Safe File Operations
r
# Check file exists before reading
if (!file.exists(filepath)) {
  stop(sprintf("File not found: %s", filepath))
}

# Create directories if needed
dir.create("output/figures", recursive = TRUE, showWarnings = FALSE)
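The two checks can be wrapped into small helpers so that every script fails the same way. This is a sketch under stated assumptions: `read_csv_checked()` and `ensure_dir()` are hypothetical names, and base `read.csv()` stands in for `readr::read_csv()` to keep the snippet dependency-free.

```r
# Hypothetical helper: stop with a clear message if the file is missing
read_csv_checked <- function(filepath) {
  if (!file.exists(filepath)) {
    stop(sprintf("File not found: %s", filepath), call. = FALSE)
  }
  read.csv(filepath, stringsAsFactors = FALSE)
}

# Hypothetical helper: create a directory (and its parents) only when absent
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(path)
}

# Usage in a temporary location, so nothing in the project is touched
out_dir <- ensure_dir(file.path(tempdir(), "output", "figures"))
csv_path <- file.path(out_dir, "toy.csv")
write.csv(data.frame(id = 1:3), csv_path, row.names = FALSE)
toy <- read_csv_checked(csv_path)
```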
Performance Tips
For Large Datasets
r
# Use data.table for >1M rows
library(data.table)
dt <- fread("large_file.csv")

# Or use arrow for very large/parquet files
library(arrow)
df <- read_parquet("data.parquet")

# Lazy evaluation with duckdb
library(duckdb)
con <- dbConnect(duckdb())
df_lazy <- tbl(con, "data.csv")
Vectorization Over Loops
r
# Good: vectorized
df$rate <- df$cases / df$population * 100000

# Avoid: row-by-row loop
for (i in 1:nrow(df)) {
  df$rate[i] <- df$cases[i] / df$population[i] * 100000
}
Additional Resources
For detailed patterns, consult:
- Tidyverse Style Guide: https://style.tidyverse.org/
- R for Data Science (2e): https://r4ds.hadley.nz/
- The Epidemiologist R Handbook: https://epirhandbook.com/
- Quarto Documentation: https://quarto.org/
Version History
- v1.0.0 (2025-12-04): Initial release for PubHealthAI community