# R Data Science
R programming for data analysis, visualization, and statistical workflows. Use when working with R scripts (.R), Quarto documents (.qmd), RMarkdown (.Rmd), or R projects. Covers tidyverse workflows, ggplot2 visualizations, statistical analysis, epidemiological methods, and reproducible research practices.
npx skill4agent add crypticpy/rdata r-data-science

library(tidyverse)
# Data import: read raw files, then clean names before any analysis.

# CSV (most common)
df <- read_csv("data/raw/dataset.csv")

# Excel
df <- readxl::read_excel("data/raw/dataset.xlsx", sheet = "Sheet1")

# Clean column names immediately (snake_case, no spaces/punctuation)
df <- df |> janitor::clean_names()

# Typical wrangling pipeline: filter -> transform -> summarize.
analysis_data <- raw_data |>
  # Clean and filter
  filter(!is.na(key_variable)) |>
  # Transform variables
  mutate(
    date = as.Date(date_string, format = "%Y-%m-%d"),
    age_group = cut(age, breaks = c(0, 18, 45, 65, Inf),
                    labels = c("0-17", "18-44", "45-64", "65+"))
  ) |>
  # Summarize
  group_by(region, age_group) |>
  summarize(
    n = n(),
    mean_value = mean(outcome, na.rm = TRUE),
    .groups = "drop"  # drop grouping so downstream verbs see an ungrouped frame
  )

ggplot(df, aes(x = date, y = count, color = category)) +
geom_line(linewidth = 1) +
scale_color_brewer(palette = "Set2") +
labs(
title = "Trend Over Time",
subtitle = "By category",
x = "Date",
y = "Count",
color = "Category",
caption = "Source: Dataset Name"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold")
)case_countscalculate_rate()filter_outliers()compute_summary()patient_datasurveillance_df<-=+|>:::$|># Good
result <- data |>
  filter(year >= 2020) |>
  group_by(county) |>
  summarize(total = sum(cases))

# Bad
result<-data|>filter(year>=2020)|>group_by(county)|>summarize(total=sum(cases))

# Use <- for assignment (never = or ->); space around operators such as >=.

# Load and clean surveillance data ------------------------------------------
# Calculate age-adjusted rates
# Using direct standardization method per CDC guidelines
adjusted_rate <- calculate_adjusted_rate(df, standard_pop)

library(tidyverse)  # Loads: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats

| Task | Package | Key Functions |
|---|---|---|
| CSV/TSV | readr | `read_csv()`, `read_tsv()`, `write_csv()` |
| Excel | readxl, writexl | `read_excel()`, `write_xlsx()` |
| SAS/SPSS/Stata | haven | `read_sas()`, `read_spss()`, `read_dta()` |
| JSON | jsonlite | `fromJSON()`, `toJSON()` |
| Databases | DBI, dbplyr | `dbConnect()`, `tbl()`, `collect()` |
| Task | Package | Key Functions |
|---|---|---|
| Column cleaning | janitor | `clean_names()`, `remove_empty()` |
| Date handling | lubridate | `ymd()`, `floor_date()`, `year()` |
| String operations | stringr | `str_detect()`, `str_replace()`, `str_extract()` |
| Missing data | naniar | `miss_var_summary()`, `vis_miss()` |
| Task | Package | Key Functions |
|---|---|---|
| Core plotting | ggplot2 | `ggplot()`, `aes()`, `geom_line()` |
| Extensions | ggrepel, patchwork | `geom_text_repel()`, `wrap_plots()` |
| Interactive | plotly | `ggplotly()` |
| Tables | gt, kableExtra | `gt()`, `kbl()` |
| Task | Package | Key Functions |
|---|---|---|
| Model summaries | broom | `tidy()`, `glance()`, `augment()` |
| Regression | stats, lme4 | `lm()`, `glm()`, `lmer()` |
| Survival | survival | `Surv()`, `coxph()`, `survfit()` |
| Survey data | survey | `svydesign()`, `svymean()` |
| Task | Package | Key Functions |
|---|---|---|
| Epi calculations | epiR | `epi.2by2()`, `epi.conf()` |
| Outbreak tools | incidence2, epicontacts | `incidence()`, `make_epicontacts()` |
| Disease mapping | SpatialEpi | `kulldorff()` |
| Surveillance | surveillance | `sts()`, `farringtonFlexible()` |
| Rate calculations | epitools | `ageadjust.direct()`, `oddsratio()`, `riskratio()` |
project/
├── project.Rproj
├── renv.lock
├── CLAUDE.md          # Claude Code configuration
├── README.md
├── data/
│   ├── raw/           # Never modify
│   └── processed/     # Analysis-ready
├── R/                 # Custom functions
├── scripts/           # Pipeline scripts
├── analysis/          # Quarto documents
└── output/
    ├── figures/
    └── tables/

---
title: "Analysis Title"
author: "Your Name"
date: today
format:
  html:
    toc: true
    code-fold: true
    embed-resources: true
execute:
  warning: false
  message: false
---

# Initialize (once per project)
renv::init()

# Snapshot dependencies after installing packages
renv::snapshot()

# Restore environment (for collaborators)
renv::restore()

# ============================================================================
# Title: Analysis of [Subject]
# Author: [Name]
# Date: [Date]
# Purpose: [One-sentence description]
# Input: data/processed/clean_data.csv
# Output: output/figures/trend_plot.png
# ============================================================================

df |>
group_by(category) |>
summarize(
n = n(),
mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
median = median(value, na.rm = TRUE),
q25 = quantile(value, 0.25, na.rm = TRUE),
q75 = quantile(value, 0.75, na.rm = TRUE)
) |>
gt::gt() |>
gt::fmt_number(columns = where(is.numeric), decimals = 2)model <- glm(outcome ~ exposure + age + sex, data = df, family = binomial)
# Tidy coefficients (exponentiate = TRUE gives odds ratios for logistic models)
tidy_results <- broom::tidy(model, conf.int = TRUE, exponentiate = TRUE) |>
  select(term, estimate, conf.low, conf.high, p.value)

# Model diagnostics (one-row summary: AIC, deviance, etc.)
glance_results <- broom::glance(model)

library(incidence2)
# Create incidence object: weekly case counts by onset date, split by group
inc <- incidence(
  df,
  date_index = "onset_date",
  interval = "week",
  groups = "outcome_category"
)

# Plot (incidence2's plot() returns a ggplot, so + layers work)
plot(inc) +
  labs(
    title = "Epidemic Curve",
    x = "Week of Onset",
    y = "Number of Cases"
  ) +
  theme_minimal()

# Age-adjusted rates using direct standardization
library(epitools)

# Stratum-specific counts and populations; strata must align across all three vectors
result <- ageadjust.direct(
  count = df$cases,
  pop = df$population,
  stdpop = standard_population$pop  # e.g., US 2000 standard
)

# Validate data before analysis
# Hard requirements: stop immediately with a named message if any check fails.
stopifnot(
  "Data frame is empty" = nrow(df) > 0,
  "Missing required columns" = all(c("id", "date", "value") %in% names(df)),
  "Duplicate IDs found" = !any(duplicated(df$id))
)

# Informative warnings for data quality issues (non-fatal)
if (sum(is.na(df$key_var)) > 0) {
  warning(sprintf(
    "%d missing values in key_var (%.1f%%)",
    sum(is.na(df$key_var)),
    100 * mean(is.na(df$key_var))
  ), call. = FALSE)  # call. = FALSE: cleaner message without the call stack
}

# Check file exists before reading
if (!file.exists(filepath)) {
  stop(sprintf("File not found: %s", filepath), call. = FALSE)
}

# Create directories if needed (recursive covers nested paths; no warning if it exists)
dir.create("output/figures", recursive = TRUE, showWarnings = FALSE)

# Use data.table for >1M rows
library(data.table)
dt <- fread("large_file.csv")

# Or use arrow for very large/parquet files
library(arrow)
df <- read_parquet("data.parquet")

# Lazy evaluation with duckdb (queries run in the database, not in R memory)
library(duckdb)
con <- dbConnect(duckdb())
df_lazy <- tbl(con, "data.csv")

# Good: vectorized
# Rate per 100,000, computed on whole columns at once — fast and concise.
df$rate <- df$cases / df$population * 100000
# Avoid: row-by-row loop
# (Same result, but orders of magnitude slower; also note 1:nrow(df)
# misbehaves when nrow(df) == 0 — prefer seq_len(nrow(df)) when looping.)
for (i in 1:nrow(df)) {
df$rate[i] <- df$cases[i] / df$population[i] * 100000
}