r-data-science

R Data Science

Overview

Generate high-quality R code following tidyverse conventions and modern best practices. This skill covers data manipulation, visualization, statistical analysis, and reproducible research workflows commonly used in public health, epidemiology, and data science.

Core Principles

  1. Tidyverse-first: Use tidyverse packages (dplyr, tidyr, ggplot2, purrr, readr) as the default approach
  2. Pipe-forward: Use the native pipe |> for chains (R 4.1+); fall back to %>% for older versions
  3. Reproducibility: Structure all work for reproducibility using Quarto, renv, and clear documentation
  4. Defensive coding: Validate inputs, handle missing data explicitly, and fail informatively
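
As a small illustration of the pipe-forward principle, the sketch below writes the same computation as a nested call and as a native-pipe chain. The vector case_counts is made-up data, not part of the original guide.

```r
# Hypothetical weekly case counts (illustrative data only)
case_counts <- c(4, 9, 16, 25)

# Nested call: has to be read inside-out
nested <- round(mean(sqrt(case_counts)), 1)

# Native pipe (R 4.1+): reads top to bottom, one step per line
piped <- case_counts |>
  sqrt() |>
  mean() |>
  round(1)

# Both forms compute the same value
identical(nested, piped)
```

The piped form adds no new behavior; it only changes the reading order, which is why the guide prefers it for multi-step chains.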

Quick Reference: Common Patterns

Data Import

r
library(tidyverse)

# CSV (most common)
df <- read_csv("data/raw/dataset.csv")

# Excel
df <- readxl::read_excel("data/raw/dataset.xlsx", sheet = "Sheet1")

# Clean column names immediately
df <- df |> janitor::clean_names()

Data Wrangling Pipeline

r
analysis_data <- raw_data |>
  # Clean and filter
  filter(!is.na(key_variable)) |>
  
  # Transform variables
  mutate(
    date = as.Date(date_string, format = "%Y-%m-%d"),
    # right = FALSE makes the bands [0, 18), [18, 45), ... so they match the labels
    age_group = cut(age, breaks = c(0, 18, 45, 65, Inf), right = FALSE,
                    labels = c("0-17", "18-44", "45-64", "65+"))
  ) |>
  
  # Summarize
  group_by(region, age_group) |>
  summarize(
    n = n(),
    mean_value = mean(outcome, na.rm = TRUE),
    .groups = "drop"
  )
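
The age bands in this pipeline depend on how cut() treats interval boundaries. By default the intervals are closed on the right, (0, 18], so age 18 falls in the first band and age 0 becomes NA; right = FALSE gives [0, 18), which is what labels such as "0-17" imply. A quick base-R check with made-up ages:

```r
ages <- c(0, 17, 18, 44, 45, 65, 90)  # made-up ages for illustration
breaks <- c(0, 18, 45, 65, Inf)
labels <- c("0-17", "18-44", "45-64", "65+")

# Default (right = TRUE): intervals are (0, 18], (18, 45], ...
default_bands <- cut(ages, breaks = breaks, labels = labels)

# right = FALSE: intervals are [0, 18), [18, 45), ...
left_closed_bands <- cut(ages, breaks = breaks, labels = labels, right = FALSE)

# Compare the two assignments side by side
data.frame(ages, default_bands, left_closed_bands)
```

Printing the comparison makes the boundary ages (0, 18, 45, 65) easy to inspect before the grouping step relies on them.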

Basic ggplot2 Visualization

r
ggplot(df, aes(x = date, y = count, color = category)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Trend Over Time",
    subtitle = "By category",
    x = "Date",
    y = "Count",
    color = "Category",
    caption = "Source: Dataset Name"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

Tidyverse Style Guide Essentials

Naming Conventions

  • snake_case for objects and functions: case_counts, calculate_rate()
  • Verbs for functions: filter_outliers(), compute_summary()
  • Nouns for data: patient_data, surveillance_df
  • Avoid: dots in names (reserved for S3), single letters except in lambdas

Code Formatting

  • Indentation: 2 spaces (never tabs)
  • Line length: 80 characters maximum
  • Operators: Spaces around <-, =, +, |>, but not :, ::, $
  • Commas: Space after, never before
  • Pipes: New line after each |>

r
# Good
result <- data |>
  filter(year >= 2020) |>
  group_by(county) |>
  summarize(total = sum(cases))

# Bad
result<-data|>filter(year>=2020)|>group_by(county)|>summarize(total=sum(cases))

Assignment

  • Use <- for assignment, never = or ->
  • Use = only for function arguments

Comments

r
# Load and clean surveillance data ------------------------------------------

# Calculate age-adjusted rates
# Using direct standardization method per CDC guidelines
adjusted_rate <- calculate_adjusted_rate(df, standard_pop)

Package Ecosystem

Core Tidyverse (Always Load)

r
# Attaches ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
# (plus lubridate as of tidyverse 2.0.0)
library(tidyverse)

Data Import/Export

| Task | Package | Key Functions |
| --- | --- | --- |
| CSV/TSV | readr | read_csv(), write_csv() |
| Excel | readxl, writexl | read_excel(), write_xlsx() |
| SAS/SPSS/Stata | haven | read_sas(), read_spss(), read_stata() |
| JSON | jsonlite | read_json(), fromJSON() |
| Databases | DBI, dbplyr | dbConnect(), tbl() |

Data Manipulation

| Task | Package | Key Functions |
| --- | --- | --- |
| Column cleaning | janitor | clean_names(), tabyl() |
| Date handling | lubridate | ymd(), mdy(), floor_date() |
| String operations | stringr | str_detect(), str_extract() |
| Missing data | naniar | vis_miss(), replace_with_na() |

Visualization

| Task | Package | Key Functions |
| --- | --- | --- |
| Core plotting | ggplot2 | ggplot(), geom_*() |
| Extensions | ggrepel, patchwork | geom_text_repel(), + operator |
| Interactive | plotly | ggplotly() |
| Tables | gt, kableExtra | gt(), kable() |

Statistical Analysis

| Task | Package | Key Functions |
| --- | --- | --- |
| Model summaries | broom | tidy(), glance(), augment() |
| Regression | stats, lme4 | lm(), glm(), lmer() |
| Survival | survival | Surv(), survfit(), coxph() |
| Survey data | survey | svydesign(), svymean() |

Epidemiology & Public Health

| Task | Package | Key Functions |
| --- | --- | --- |
| Epi calculations | epiR | epi.2by2(), epi.conf() |
| Outbreak tools | incidence2, epicontacts | incidence(), make_epicontacts() |
| Disease mapping | SpatialEpi | expected(), EBlocal() |
| Surveillance | surveillance | sts(), farrington() |
| Rate calculations | epitools | riskratio(), oddsratio(), ageadjust.direct() |

Reproducibility Standards

Project Structure

project/
├── project.Rproj
├── renv.lock
├── CLAUDE.md              # Claude Code configuration
├── README.md
├── data/
│   ├── raw/               # Never modify
│   └── processed/         # Analysis-ready
├── R/                     # Custom functions
├── scripts/               # Pipeline scripts
├── analysis/              # Quarto documents
└── output/
    ├── figures/
    └── tables/

Quarto Document Header

yaml
---
title: "Analysis Title"
author: "Your Name"
date: today
format:
  html:
    toc: true
    code-fold: true
    embed-resources: true
execute:
  warning: false
  message: false
---

Package Management with renv

r
# Initialize (once per project)
renv::init()

# Snapshot dependencies after installing packages
renv::snapshot()

# Restore environment (for collaborators)
renv::restore()

Workflow Documentation

Always include at the top of scripts:
r
# ============================================================================
# Title: Analysis of [Subject]
# Author: [Name]
# Date: [Date]
# Purpose: [One-sentence description]
# Input: data/processed/clean_data.csv
# Output: output/figures/trend_plot.png
# ============================================================================

Common Analysis Patterns

Descriptive Statistics Table

r
df |>
  group_by(category) |>
  summarize(
    n = n(),
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    q25 = quantile(value, 0.25, na.rm = TRUE),
    q75 = quantile(value, 0.75, na.rm = TRUE)
  ) |>
  gt::gt() |>
  gt::fmt_number(columns = where(is.numeric), decimals = 2)

Regression with Tidy Output

r
model <- glm(outcome ~ exposure + age + sex, data = df, family = binomial)

# Tidy coefficients
tidy_results <- broom::tidy(model, conf.int = TRUE, exponentiate = TRUE) |>
  select(term, estimate, conf.low, conf.high, p.value)

# Model diagnostics
glance_results <- broom::glance(model)
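
For a logistic model, exponentiate = TRUE reports coefficients as odds ratios rather than log-odds. The base-R sketch below applies the same transformation by hand to the built-in mtcars data; the variables am and wt are stand-ins chosen for illustration, not the outcome and exposure from this guide. It uses Wald intervals from confint.default(), which may differ slightly from the intervals broom reports.

```r
# Logistic regression on built-in data (illustrative variables only)
model <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(model))

# Wald 95% confidence intervals, moved to the odds-ratio scale
or_ci <- exp(confint.default(model))

# One row per term: odds ratio with its lower and upper bound
cbind(odds_ratio = odds_ratios, or_ci)
```

An odds ratio below 1 means the odds of the outcome fall as the predictor increases; above 1, they rise.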

Epi Curve (Epidemic Curve)

r
library(incidence2)

# Create incidence object
inc <- incidence(
  df,
  date_index = "onset_date",
  interval = "week",
  groups = "outcome_category"
)

# Plot
plot(inc) +
  labs(
    title = "Epidemic Curve",
    x = "Week of Onset",
    y = "Number of Cases"
  ) +
  theme_minimal()

Rate Calculation

r
# Age-adjusted rates using direct standardization
library(epitools)

# Stratum-specific counts and populations
result <- ageadjust.direct(
  count = df$cases,
  pop = df$population,
  stdpop = standard_population$pop  # e.g., US 2000 standard
)
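
Direct standardization weights each stratum-specific rate by that stratum's share of a standard population. The base-R sketch below works the arithmetic by hand on made-up numbers (three age strata, all counts hypothetical) so the quantity being estimated is visible.

```r
# Hypothetical stratum data: cases, population at risk, standard population
cases      <- c(10, 40, 150)
population <- c(50000, 40000, 30000)
std_pop    <- c(40000, 35000, 25000)

# Stratum-specific rates
stratum_rate <- cases / population

# Weights: each stratum's share of the standard population
weights <- std_pop / sum(std_pop)

# Crude rate ignores age structure; adjusted rate is the weighted sum
crude_rate    <- sum(cases) / sum(population)
adjusted_rate <- sum(weights * stratum_rate)

# Express both per 100,000 population
round(c(crude = crude_rate, adjusted = adjusted_rate) * 100000, 1)
```

The hand calculation only covers the point estimate; confidence intervals are left to a dedicated function.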

Error Handling

Defensive Data Checks

r
# Validate data before analysis
stopifnot(
  "Data frame is empty" = nrow(df) > 0,
  "Missing required columns" = all(c("id", "date", "value") %in% names(df)),
  "Duplicate IDs found" = !any(duplicated(df$id))
)

# Informative warnings for data quality issues
if (sum(is.na(df$key_var)) > 0) {
  warning(sprintf(
    "%d missing values in key_var (%.1f%%)",
    sum(is.na(df$key_var)),
    100 * mean(is.na(df$key_var))
  ))
}
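
Checks like these can be wrapped in a small reusable function so that every script validates its input the same way. This is a sketch: the function name validate_input() and its default required columns are assumptions for illustration, not part of any package.

```r
# Hypothetical helper: stops with a clear message if the data are unusable
validate_input <- function(df, required = c("id", "date", "value")) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  if (nrow(df) == 0) {
    stop("Data frame is empty")
  }
  if (anyDuplicated(df$id) > 0) {
    stop("Duplicate IDs found")
  }
  invisible(df)  # return the input so the check can sit inside a pipe
}

# Usage with made-up data: the first call passes, the second fails
good <- data.frame(id = 1:3, date = Sys.Date(), value = c(2.5, NA, 4))
validate_input(good)

bad <- good[, c("id", "value")]
result <- tryCatch(validate_input(bad), error = function(e) conditionMessage(e))
result
```

Naming the missing columns in the message, rather than only reporting that something is missing, is what makes the failure informative.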

Safe File Operations

r
# Check file exists before reading
if (!file.exists(filepath)) {
  stop(sprintf("File not found: %s", filepath))
}

# Create directories if needed
dir.create("output/figures", recursive = TRUE, showWarnings = FALSE)

Performance Tips

For Large Datasets

r
# Use data.table for >1M rows
library(data.table)
dt <- fread("large_file.csv")

# Or use arrow for very large/parquet files
library(arrow)
df <- read_parquet("data.parquet")

# Lazy evaluation with duckdb
library(duckdb)
con <- dbConnect(duckdb())
df_lazy <- tbl(con, "data.csv")

Vectorization Over Loops

r
# Good: vectorized
df$rate <- df$cases / df$population * 100000

# Avoid: row-by-row loop
for (i in 1:nrow(df)) {
  df$rate[i] <- df$cases[i] / df$population[i] * 100000
}
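
Where a loop is unavoidable, iterating with seq_len(nrow(df)) is safer than 1:nrow(df), because 1:0 counts down through 1 and 0 instead of being empty. The sketch below uses a made-up data frame to show that the vectorized form matches the loop, and how the two index styles behave on zero rows.

```r
# Made-up counts and populations for illustration
df <- data.frame(cases = c(12, 30, 7), population = c(40000, 120000, 9000))

# Vectorized: one expression over whole columns
vectorized_rate <- df$cases / df$population * 100000

# Equivalent loop, written with seq_len() and a preallocated result
loop_rate <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  loop_rate[i] <- df$cases[i] / df$population[i] * 100000
}
all.equal(vectorized_rate, loop_rate)

# On an empty data frame, 1:nrow() still yields two indices; seq_len() none
empty <- df[0, ]
length(1:nrow(empty))
length(seq_len(nrow(empty)))
```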

Additional Resources

For detailed patterns, consult:

Version History

  • v1.0.0 (2025-12-04): Initial release for PubHealthAI community