r-data-science

R Data Science

Overview

Generate high-quality R code following tidyverse conventions and modern best practices. This skill covers data manipulation, visualization, statistical analysis, and reproducible research workflows commonly used in public health, epidemiology, and data science.

Core Principles

Tidyverse-first: Use tidyverse packages (dplyr, tidyr, ggplot2, purrr, readr) as the default approach
Pipe-forward: Use the native pipe `|>` for chains (R 4.1+); fall back to `%>%` for older versions
Reproducibility: Structure all work for reproducibility using Quarto, renv, and clear documentation
Defensive coding: Validate inputs, handle missing data explicitly, and fail informatively
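
As an illustration of the pipe-forward principle, here is a minimal base-R sketch. The `values` vector is made up for the example, and the code assumes R 4.1 or later, where both `|>` and the `\(x)` lambda shorthand are available.

```r
# Made-up input for illustration.
values <- c(4, 9, 16)

# Native pipe: the left-hand side becomes the first argument of the next call.
piped <- values |>
  sqrt() |>
  sum()

# Equivalent nested call, which has to be read inside-out.
nested <- sum(sqrt(values))

# An anonymous function must be wrapped in parentheses and called explicitly.
squared_total <- values |>
  (\(x) x^2)() |>
  sum()

piped
nested
squared_total
```

The piped and nested forms compute the same value; the pipe only changes the reading order.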

Quick Reference: Common Patterns

Data Import

library(tidyverse)

# CSV (most common)
df <- read_csv("data/raw/dataset.csv")

# Excel
df <- readxl::read_excel("data/raw/dataset.xlsx", sheet = "Sheet1")

# Clean column names immediately
df <- df |> janitor::clean_names()

Data Wrangling Pipeline

analysis_data <- raw_data |>
  # Clean and filter
  filter(!is.na(key_variable)) |>
  
  # Transform variables
  mutate(
    date = as.Date(date_string, format = "%Y-%m-%d"),
    # right = FALSE gives left-closed bins [0, 18), [18, 45), ... to match labels
    age_group = cut(age, breaks = c(0, 18, 45, 65, Inf),
                    labels = c("0-17", "18-44", "45-64", "65+"),
                    right = FALSE)
  ) |>
  
  # Summarize
  group_by(region, age_group) |>
  summarize(
    n = n(),
    mean_value = mean(outcome, na.rm = TRUE),
    .groups = "drop"
  )
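
The age-band labels in the pipeline above read as left-closed intervals, which `cut()` produces only with `right = FALSE`; its default is right-closed intervals. The following base-R sketch compares the two settings on a made-up `ages` vector chosen to sit on the bin boundaries.

```r
# Made-up ages placed on the bin boundaries.
ages <- c(0, 17, 18, 44, 45, 65)
breaks <- c(0, 18, 45, 65, Inf)
labels <- c("0-17", "18-44", "45-64", "65+")

# Default (right = TRUE): intervals are (0, 18], (18, 45], ...
default_bins <- cut(ages, breaks = breaks, labels = labels)

# right = FALSE: intervals are [0, 18), [18, 45), ...
left_closed_bins <- cut(ages, breaks = breaks, labels = labels, right = FALSE)

data.frame(age = ages, default = default_bins, left_closed = left_closed_bins)
```

With the default, age 0 falls outside every interval and becomes `NA`, and age 18 is labelled "0-17"; with `right = FALSE` each boundary age lands in the band its label suggests.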

Basic ggplot2 Visualization

ggplot(df, aes(x = date, y = count, color = category)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Trend Over Time",
    subtitle = "By category",
    x = "Date",
    y = "Count",
    color = "Category",
    caption = "Source: Dataset Name"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

Tidyverse Style Guide Essentials

Naming Conventions

snake_case for objects and functions: `case_counts`, `calculate_rate()`
Verbs for functions: `filter_outliers()`, `compute_summary()`
Nouns for data: `patient_data`, `surveillance_df`
Avoid: dots in names (reserved for S3), single letters except in lambdas

Code Formatting

Indentation: 2 spaces (never tabs)
Line length: 80 characters maximum
Operators: Spaces around `<-`, `=`, `+`, `|>`, but not `:`, `::`, `$`
Commas: Space after, never before
Pipes: New line after each `|>`

# Good
result <- data |>
  filter(year >= 2020) |>
  group_by(county) |>
  summarize(total = sum(cases))

# Bad
result<-data|>filter(year>=2020)|>group_by(county)|>summarize(total=sum(cases))

Assignment

Use `<-` for assignment, never `=` or `->`
Use `=` only for function arguments
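
One reason to keep the two operators separate is that they do different things inside a call. The sketch below uses two hypothetical helper functions (`demo_argument()` and `demo_assignment()`, not part of any package) to show that `=` only names an argument, while `<-` inside a call creates a variable as a side effect.

```r
# `=` inside a call binds the argument named x; nothing is created locally.
demo_argument <- function() {
  mean(x = c(2, 4, 6))
  exists("x", envir = environment(), inherits = FALSE)
}

# `<-` inside a call creates a local variable x, then passes its value on.
demo_assignment <- function() {
  mean(x <- c(2, 4, 6))
  exists("x", envir = environment(), inherits = FALSE)
}

argument_created_x <- demo_argument()
assignment_created_x <- demo_assignment()

argument_created_x
assignment_created_x
```

Keeping `<-` for assignment and `=` for arguments makes this kind of accidental side effect easier to spot in review.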

Comments

# Load and clean surveillance data ------------------------------------------

# Calculate age-adjusted rates
# Using direct standardization method per CDC guidelines
adjusted_rate <- calculate_adjusted_rate(df, standard_pop)

Package Ecosystem

Core Tidyverse (Always Load)

library(tidyverse)
# Loads: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
# (plus lubridate as of tidyverse 2.0.0)

Data Import/Export

Task	Package	Key Functions
CSV/TSV	readr	`read_csv()`, `write_csv()`
Excel	readxl, writexl	`read_excel()`, `write_xlsx()`
SAS/SPSS/Stata	haven	`read_sas()`, `read_spss()`, `read_stata()`
JSON	jsonlite	`read_json()`, `fromJSON()`
Databases	DBI, dbplyr	`dbConnect()`, `tbl()`

Data Manipulation

Task	Package	Key Functions
Column cleaning	janitor	`clean_names()`, `tabyl()`
Date handling	lubridate	`ymd()`, `mdy()`, `floor_date()`
String operations	stringr	`str_detect()`, `str_extract()`
Missing data	naniar	`vis_miss()`, `replace_with_na()`

Visualization

Task	Package	Key Functions
Core plotting	ggplot2	`ggplot()`, `geom_*()`
Extensions	ggrepel, patchwork	`geom_text_repel()`, `+` operator
Interactive	plotly	`ggplotly()`
Tables	gt, kableExtra	`gt()`, `kable()`

Statistical Analysis

Task	Package	Key Functions
Model summaries	broom	`tidy()`, `glance()`, `augment()`
Regression	stats, lme4	`lm()`, `glm()`, `lmer()`
Survival	survival	`Surv()`, `survfit()`, `coxph()`
Survey data	survey	`svydesign()`, `svymean()`

Epidemiology & Public Health

Task	Package	Key Functions
Epi calculations	epiR	`epi.2by2()`, `epi.conf()`
Outbreak tools	incidence2, epicontacts	`incidence()`, `make_epicontacts()`
Disease mapping	SpatialEpi	`expected()`, `EBlocal()`
Surveillance	surveillance	`sts()`, `farrington()`
Rate calculations	epitools	`riskratio()`, `oddsratio()`, `ageadjust.direct()`

Reproducibility Standards

Project Structure

project/
├── project.Rproj
├── renv.lock
├── CLAUDE.md              # Claude Code configuration
├── README.md
├── data/
│   ├── raw/               # Never modify
│   └── processed/         # Analysis-ready
├── R/                     # Custom functions
├── scripts/               # Pipeline scripts
├── analysis/              # Quarto documents
└── output/
    ├── figures/
    └── tables/

Quarto Document Header

---
title: "Analysis Title"
author: "Your Name"
date: today
format:
  html:
    toc: true
    code-fold: true
    embed-resources: true
execute:
  warning: false
  message: false
---

Package Management with renv

# Initialize (once per project)
renv::init()

# Snapshot dependencies after installing packages
renv::snapshot()

# Restore environment (for collaborators)
renv::restore()

Workflow Documentation

Always include at the top of scripts:

# ============================================================================
# Title: Analysis of [Subject]
# Author: [Name]
# Date: [Date]
# Purpose: [One-sentence description]
# Input: data/processed/clean_data.csv
# Output: output/figures/trend_plot.png
# ============================================================================

Common Analysis Patterns

Descriptive Statistics Table

df |>
  group_by(category) |>
  summarize(
    n = n(),
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    q25 = quantile(value, 0.25, na.rm = TRUE),
    q75 = quantile(value, 0.75, na.rm = TRUE)
  ) |>
  gt::gt() |>
  # Format only the summary statistics so the count n stays a whole number
  gt::fmt_number(columns = c(mean, sd, median, q25, q75), decimals = 2)

Regression with Tidy Output

model <- glm(outcome ~ exposure + age + sex, data = df, family = binomial)

# Tidy coefficients
tidy_results <- broom::tidy(model, conf.int = TRUE, exponentiate = TRUE) |>
  select(term, estimate, conf.low, conf.high, p.value)

# Model diagnostics
glance_results <- broom::glance(model)
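
To make the `exponentiate = TRUE` step concrete, here is a base-R sketch of what exponentiating logistic regression coefficients means. It uses the built-in `mtcars` data purely as a stand-in (the `am ~ wt` model is an arbitrary choice for illustration, not an epidemiological analysis), and it computes Wald intervals, which may differ from the intervals reported by other tools.

```r
# Built-in mtcars stands in for a real outcome/exposure dataset.
example_model <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios.
log_odds <- coef(example_model)
odds_ratios <- exp(log_odds)

# Wald 95% intervals on the log-odds scale, exponentiated the same way.
wald_ci <- exp(confint.default(example_model))

or_table <- cbind(odds_ratio = odds_ratios, wald_ci)
or_table
```

An odds ratio below 1 indicates lower odds of the outcome per unit increase in the predictor, and above 1 indicates higher odds.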

Epi Curve (Epidemic Curve)

library(incidence2)

# Create incidence object
inc <- incidence(
  df,
  date_index = "onset_date",
  interval = "week",
  groups = "outcome_category"
)

# Plot
plot(inc) +
  labs(
    title = "Epidemic Curve",
    x = "Week of Onset",
    y = "Number of Cases"
  ) +
  theme_minimal()
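
An epidemic curve is, at its core, a count of cases per time interval. The base-R sketch below shows that weekly aggregation on a handful of made-up onset dates; it is not how incidence2 is implemented, only an illustration of the underlying idea, and it assumes Monday-start weeks.

```r
# Made-up onset dates; 2024-01-01 is a Monday.
onset_date <- as.Date(c(
  "2024-01-01", "2024-01-03", "2024-01-08", "2024-01-10", "2024-01-11"
))

# cut() on Dates with breaks = "week" maps each date to the Monday of its week.
week_start <- as.Date(cut(onset_date, breaks = "week"))

# Count cases per week.
weekly_counts <- as.data.frame(table(week_start), responseName = "cases")
weekly_counts
```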

Rate Calculation

# Age-adjusted rates using direct standardization
library(epitools)

# Stratum-specific counts and populations
result <- ageadjust.direct(
  count = df$cases,
  pop = df$population,
  stdpop = standard_population$pop  # e.g., US 2000 standard
)
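
The arithmetic behind direct standardization can be written out by hand: each stratum-specific rate is weighted by that stratum's share of the standard population. The sketch below uses three made-up age strata (all numbers are hypothetical) and reports rates per 100,000; it does not compute confidence intervals and is not a substitute for `ageadjust.direct()`.

```r
# Hypothetical stratum-level data for three age bands; values are made up.
cases      <- c(10, 40, 150)
population <- c(50000, 40000, 10000)
std_pop    <- c(60000, 30000, 10000)

# Stratum-specific rates per 100,000.
stratum_rate <- cases / population * 100000

# Crude rate ignores differences in age structure.
crude_rate <- sum(cases) / sum(population) * 100000

# Directly standardized rate: weight each stratum rate by its share of the
# standard population.
weights <- std_pop / sum(std_pop)
adjusted_rate <- sum(stratum_rate * weights)

crude_rate
adjusted_rate
```

Here the study population has a larger share in the oldest, highest-rate stratum than the standard population does, so the adjusted rate (192 per 100,000) comes out below the crude rate (200 per 100,000).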

Error Handling

Defensive Data Checks

# Validate data before analysis
stopifnot(
  "Data frame is empty" = nrow(df) > 0,
  "Missing required columns" = all(c("id", "date", "value") %in% names(df)),
  "Duplicate IDs found" = !any(duplicated(df$id))
)

# Informative warnings for data quality issues
if (sum(is.na(df$key_var)) > 0) {
  warning(sprintf(
    "%d missing values in key_var (%.1f%%)",
    sum(is.na(df$key_var)),
    100 * mean(is.na(df$key_var))
  ))
}
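
When the same checks are needed in several scripts, one option is to wrap them in a small function. The sketch below is one possible shape; `validate_data()` and its `required` argument are hypothetical names introduced for illustration, and the example data frames are made up.

```r
# Hypothetical helper: stops with a specific message, or returns the data
# invisibly when every check passes.
validate_data <- function(data, required = c("id", "date", "value")) {
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  if (nrow(data) == 0) {
    stop("Data frame is empty")
  }
  if (anyDuplicated(data$id) > 0) {
    stop("Duplicate IDs found")
  }
  invisible(data)
}

# Made-up example data: one valid frame and one missing the date column.
good <- data.frame(id = 1:3, date = Sys.Date(), value = c(1.5, 2, NA))
bad <- good[, c("id", "value")]

validated <- validate_data(good)

# tryCatch() captures the error message instead of halting the script.
message_text <- tryCatch(
  validate_data(bad),
  error = function(e) conditionMessage(e)
)
message_text
```

Naming the offending columns in the message is what makes the failure informative rather than merely fatal.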

Safe File Operations

# Check file exists before reading
if (!file.exists(filepath)) {
  stop(sprintf("File not found: %s", filepath))
}

# Create directories if needed
dir.create("output/figures", recursive = TRUE, showWarnings = FALSE)
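
The two checks above can be combined into a reusable reader. The sketch below is a minimal version under stated assumptions: `read_if_exists()` is a hypothetical helper, it defaults to base `read.csv()` only to stay self-contained (a reader such as `readr::read_csv` could be passed instead), and it writes to a temporary directory so nothing in a real project is touched.

```r
# Hypothetical helper: check that the file exists, then hand it to a reader.
read_if_exists <- function(filepath, reader = utils::read.csv) {
  if (!file.exists(filepath)) {
    stop(sprintf("File not found: %s", filepath))
  }
  reader(filepath)
}

# Work inside a temporary directory for the demonstration.
out_dir <- file.path(tempdir(), "output", "figures")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Write a small made-up CSV, then read it back through the helper.
csv_path <- file.path(out_dir, "example.csv")
write.csv(data.frame(id = 1:2, value = c(10, 20)), csv_path, row.names = FALSE)
loaded <- read_if_exists(csv_path)

# A missing file produces the informative error rather than a reader failure.
missing_result <- tryCatch(
  read_if_exists(file.path(out_dir, "no_such_file.csv")),
  error = function(e) conditionMessage(e)
)
missing_result
```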

Performance Tips

For Large Datasets

# Use data.table for >1M rows
library(data.table)
dt <- fread("large_file.csv")

# Or use arrow for very large/parquet files
library(arrow)
df <- read_parquet("data.parquet")

# Lazy evaluation with duckdb
library(duckdb)
con <- dbConnect(duckdb())
df_lazy <- tbl(con, "data.csv")

Vectorization Over Loops

# Good: vectorized
df$rate <- df$cases / df$population * 100000

# Avoid: row-by-row loop
for (i in 1:nrow(df)) {
  df$rate[i] <- df$cases[i] / df$population[i] * 100000
}
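
To confirm that the two approaches give the same numbers, the sketch below runs both on simulated data (the `example_df` data frame and its values are made up) and compares the results. The loop version uses `seq_len(nrow(...))`, which, unlike `1:nrow(...)`, iterates zero times on an empty data frame.

```r
# Simulated data: 10,000 rows of made-up counts and populations.
set.seed(1)
example_df <- data.frame(
  cases = rpois(10000, lambda = 20),
  population = sample(5000:50000, 10000, replace = TRUE)
)

# Vectorized: one arithmetic expression over whole columns.
rate_vectorized <- example_df$cases / example_df$population * 100000

# Loop: the same arithmetic one row at a time, into a preallocated vector.
rate_loop <- numeric(nrow(example_df))
for (i in seq_len(nrow(example_df))) {
  rate_loop[i] <- example_df$cases[i] / example_df$population[i] * 100000
}

same_result <- isTRUE(all.equal(rate_vectorized, rate_loop))
same_result
```

Since the results match, the choice between the two comes down to readability and speed, and the vectorized form is both shorter and typically faster.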

Additional Resources

For detailed patterns, consult:

Tidyverse Style Guide: https://style.tidyverse.org/
R for Data Science (2e): https://r4ds.hadley.nz/
The Epidemiologist R Handbook: https://epirhandbook.com/
Quarto Documentation: https://quarto.org/

Version History

v1.0.0 (2025-12-04): Initial release for PubHealthAI community
