cellxgene-census
CZ CELLxGENE Census
Overview
The CZ CELLxGENE Census provides programmatic access to a comprehensive, versioned collection of standardized single-cell genomics data from CZ CELLxGENE Discover. This skill enables efficient querying and analysis of millions of cells across thousands of datasets.
The Census includes:
- 61+ million cells from human and mouse
- Standardized metadata (cell types, tissues, diseases, donors)
- Raw gene expression matrices
- Pre-calculated embeddings and statistics
- Integration with PyTorch, scanpy, and other analysis tools
When to Use This Skill
This skill should be used when:
- Querying single-cell expression data by cell type, tissue, or disease
- Exploring available single-cell datasets and metadata
- Training machine learning models on single-cell data
- Performing large-scale cross-dataset analyses
- Integrating Census data with scanpy or other analysis frameworks
- Computing statistics across millions of cells
- Accessing pre-calculated embeddings or model predictions
Installation and Setup
Install the Census API:
```bash
uv pip install cellxgene-census
```
For machine learning workflows, install additional dependencies:
```bash
uv pip install cellxgene-census[experimental]
```
Core Workflow Patterns
1. Opening the Census
Always use the context manager to ensure proper resource cleanup:
```python
import cellxgene_census

# Open latest stable version
with cellxgene_census.open_soma() as census:
    ...  # Work with census data

# Open specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    ...  # Work with census data
```
**Key points:**
- Use a context manager (`with` statement) for automatic cleanup
- Specify `census_version` for reproducible analyses
- By default, `open_soma()` opens the latest "stable" release
2. Exploring Census Information
Before querying expression data, explore available datasets and metadata.
Access summary information:
```python
# Get summary statistics (the summary table stores label/value pairs)
summary = census["census_info"]["summary"].read().concat().to_pandas()
summary = summary.set_index("label")["value"]
print(f"Total cells: {summary['total_cell_count']}")
# Get all datasets
datasets = census["census_info"]["datasets"].read().concat().to_pandas()
# Filter datasets by criteria (the datasets table has no disease column, so match on titles)
covid_datasets = datasets[datasets["dataset_title"].str.contains("COVID", na=False)]
```
**Query cell metadata to understand available data:**
```python
# Get unique cell types in a tissue
cell_metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["cell_type", "tissue"]
)
unique_cell_types = cell_metadata["cell_type"].unique()
print(f"Found {len(unique_cell_types)} cell types in brain")
# Count cells by specific tissue within the brain
tissue_counts = cell_metadata.groupby("tissue", observed=True).size()
```
**Important:** Always filter for `is_primary_data == True` to avoid counting duplicate cells, unless you are specifically analyzing duplicates.
3. Querying Expression Data (Small to Medium Scale)
For queries that return fewer than 100k cells and fit in memory, use `get_anndata()`:
```python
# Basic query with cell type and tissue filters
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",  # or "Mus musculus"
    obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
    obs_column_names=["assay", "disease", "sex", "donor_id"],
)
# Query specific genes with multiple filters
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
    obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
    obs_column_names=["cell_type", "tissue_general", "donor_id"],
)
```
**Filter syntax:**
- Use `obs_value_filter` for cell filtering
- Use `var_value_filter` for gene filtering
- Combine conditions with `and`, `or`
- Use `in` for multiple values: `tissue in ['lung', 'liver']`
- Select only needed columns with `obs_column_names`
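Filter strings like the ones above are easy to get subtly wrong when assembled by hand. The following is a minimal sketch of a helper that builds them from a dictionary. `build_value_filter` is a hypothetical name and not part of the Census API; it only covers equality and `in` conditions joined with `and`, and it assumes the values contain no single quotes.

```python
def build_value_filter(conditions, primary_only=True):
    """Build a value-filter string from {column: value-or-list} pairs.

    Hypothetical helper for illustration only. A string value becomes an
    equality test, and a list or tuple becomes an `in` test.
    """
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            clauses.append(f"{column} in {list(value)!r}")
        else:
            clauses.append(f"{column} == {value!r}")
    if primary_only:
        # Appended by default so duplicate cells are excluded.
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


example_filter = build_value_filter(
    {"cell_type": "B cell", "tissue_general": ["lung", "liver"]}
)
print(example_filter)
```

The resulting string can be passed as `obs_value_filter`; pass `primary_only=False` when building a gene filter for `var_value_filter`.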
**Getting metadata separately:**
```python
# Query cell metadata
cell_metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="disease == 'COVID-19' and is_primary_data == True",
    column_names=["cell_type", "tissue_general", "donor_id"]
)
# Query gene metadata
gene_metadata = cellxgene_census.get_var(
    census, "homo_sapiens",
    value_filter="feature_name in ['CD4', 'CD8A']",
    column_names=["feature_id", "feature_name", "feature_length"]
)
```
4. Large-Scale Queries (Out-of-Core Processing)
For queries exceeding available RAM, use `axis_query()` with iterative processing:
```python
import tiledbsoma as soma

# Create axis query
query = census["census_data"]["homo_sapiens"].axis_query(
    measurement_name="RNA",
    obs_query=soma.AxisQuery(
        value_filter="tissue_general == 'brain' and is_primary_data == True"
    ),
    var_query=soma.AxisQuery(
        value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
    )
)
# Iterate through expression matrix in chunks
iterator = query.X("raw").tables()
for batch in iterator:
    # batch is a pyarrow.Table with columns:
    # - soma_data: expression value
    # - soma_dim_0: cell (obs) coordinate
    # - soma_dim_1: gene (var) coordinate
    process_batch(batch)  # placeholder for your own processing function
```
**Computing incremental statistics:**
```python
# Example: Calculate mean expression
# Note: X is stored sparsely, so this is the mean over stored values only
n_observations = 0
sum_values = 0.0
iterator = query.X("raw").tables()
for batch in iterator:
    values = batch["soma_data"].to_numpy()
    n_observations += len(values)
    sum_values += values.sum()
mean_expression = sum_values / n_observations
```
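The running mean above averages over the entries that appear in the batches. If a per-gene mean over all queried cells is wanted instead, one option is to accumulate sums per gene coordinate and divide by the total cell count. The sketch below illustrates this on toy data, assuming entries absent from the batches are zeros. `per_gene_mean` and the dictionary-based batches are hypothetical stand-ins for the `pyarrow.Table` batches, not Census API objects.

```python
from collections import defaultdict


def per_gene_mean(batches, n_cells):
    """Accumulate per-gene sums over sparse (coordinate-style) batches.

    Each batch is a dict with parallel lists "soma_dim_1" (gene coordinate)
    and "soma_data" (expression value). Entries missing from the batches
    are treated as zeros, so the sum is divided by the total cell count.
    """
    sums = defaultdict(float)
    for batch in batches:
        for gene, value in zip(batch["soma_dim_1"], batch["soma_data"]):
            sums[gene] += value
    return {gene: total / n_cells for gene, total in sums.items()}


# Toy data standing in for the chunks yielded by the iterator.
toy_batches = [
    {"soma_dim_0": [0, 0, 1], "soma_dim_1": [10, 11, 10], "soma_data": [2.0, 4.0, 6.0]},
    {"soma_dim_0": [3], "soma_dim_1": [11], "soma_data": [8.0]},
]
means = per_gene_mean(toy_batches, n_cells=4)
print(means)
```

Only one float per gene is held in memory, so the approach scales with the number of genes rather than the number of cells.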
5. Machine Learning with PyTorch
For training models, use the experimental PyTorch integration:
```python
import cellxgene_census
from cellxgene_census.experimental.ml import experiment_dataloader

with cellxgene_census.open_soma() as census:
    # Create dataloader
    dataloader = experiment_dataloader(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_value_filter="tissue_general == 'liver' and is_primary_data == True",
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )
    # Training loop (model, criterion, optimizer, and num_epochs are defined elsewhere)
    for epoch in range(num_epochs):
        for batch in dataloader:
            X = batch["X"]  # Gene expression tensor
            labels = batch["obs"]["cell_type"]  # Cell type labels
            # Forward pass
            outputs = model(X)
            loss = criterion(outputs, labels)
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```
Train/test splitting:
```python
from cellxgene_census.experimental.ml import ExperimentDataset

# Create dataset from experiment
dataset = ExperimentDataset(
    experiment_axis_query,
    layer_name="raw",
    obs_column_names=["cell_type"],
    batch_size=128,
)
# Split into train and test
train_dataset, test_dataset = dataset.random_split(
    split=[0.8, 0.2],
    seed=42
)
```
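To make the idea of a seeded 80/20 split concrete, the sketch below shuffles cell indices with a fixed seed and cuts the shuffled list by fraction. It illustrates the general technique only and is not the library's implementation; `random_split_indices` is a hypothetical name.

```python
import random


def random_split_indices(n_items, fractions, seed):
    """Shuffle indices 0..n_items-1 with a fixed seed and cut them by fraction.

    The last split receives whatever remains, so every index lands in
    exactly one split even when the fractions do not divide evenly.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    splits = []
    start = 0
    for i, fraction in enumerate(fractions):
        is_last = i == len(fractions) - 1
        end = n_items if is_last else start + round(n_items * fraction)
        splits.append(indices[start:end])
        start = end
    return splits


train_idx, test_idx = random_split_indices(10, [0.8, 0.2], seed=42)
print(len(train_idx), len(test_idx))
```

Reusing the same seed reproduces the same partition, which matters when results from different runs need to be compared.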
6. Integration with Scanpy
Seamlessly integrate Census data with scanpy workflows:
```python
import scanpy as sc

# Load data from Census
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="cell_type == 'neuron' and tissue_general == 'brain' and is_primary_data == True",
)
# Standard scanpy workflow
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
# Visualization
sc.pl.umap(adata, color=["cell_type", "tissue", "disease"])
```
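The two preprocessing calls `normalize_total` and `log1p` amount to simple per-cell arithmetic: scale each cell's counts so they sum to `target_sum`, then take the natural log of one plus each value. The sketch below reproduces that arithmetic for a single cell in plain Python. `normalize_and_log` is a hypothetical helper for illustration, not a scanpy function.

```python
import math


def normalize_and_log(counts, target_sum=1e4):
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros instead of dividing by zero.
        return [0.0 for _ in counts]
    return [math.log1p(count * target_sum / total) for count in counts]


toy_cell = [0, 5, 15]  # raw counts for three genes in one cell
normalized = normalize_and_log(toy_cell)
print(normalized)
```

Because every cell is scaled to the same total, differences in sequencing depth between cells and datasets are reduced before downstream comparison.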
7. Multi-Dataset Integration
Query and integrate multiple datasets:
```python
import anndata as ad

# Strategy 1: Query multiple tissues separately
tissues = ["lung", "liver", "kidney"]
adatas = []
for tissue in tissues:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter=f"tissue_general == '{tissue}' and is_primary_data == True",
    )
    adata.obs["tissue"] = tissue
    adatas.append(adata)
# Concatenate (ad.concat replaces the deprecated AnnData.concatenate method)
combined = ad.concat(adatas)
# Strategy 2: Query multiple datasets directly
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="tissue_general in ['lung', 'liver', 'kidney'] and is_primary_data == True",
)
```
Key Concepts and Best Practices
Always Filter for Primary Data
Unless analyzing duplicates, always include `is_primary_data == True` in queries to avoid counting cells multiple times:
```python
obs_value_filter="cell_type == 'B cell' and is_primary_data == True"
```
Specify Census Version for Reproducibility
Always specify the Census version in production analyses:
```python
census = cellxgene_census.open_soma(census_version="2023-07-25")
```
Estimate Query Size Before Loading
For large queries, first check the number of cells to avoid memory issues:
```python
# Get cell count
metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["soma_joinid"]
)
n_cells = len(metadata)
print(f"Query will return {n_cells:,} cells")
# If too large (>100k), use out-of-core processing
```
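The size check can feed directly into the choice of query style. The sketch below encodes the 100k-cell rule of thumb from this document as a small function. `choose_query_strategy` is a hypothetical name, and the threshold is an assumption that should be adjusted to the memory actually available.

```python
def choose_query_strategy(n_cells, max_in_memory_cells=100_000):
    """Suggest which API to use for a query that matches n_cells cells."""
    if n_cells < 0:
        raise ValueError("n_cells must be non-negative")
    if n_cells <= max_in_memory_cells:
        return "get_anndata"  # small enough to load as one AnnData object
    return "axis_query"  # stream in chunks instead


print(choose_query_strategy(25_000))
print(choose_query_strategy(2_500_000))
```

The number of genes requested also affects memory use, so a gene filter can keep a query in memory that would otherwise need out-of-core processing.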
Use tissue_general for Broader Groupings
The `tissue_general` field provides coarser categories than `tissue`, useful for cross-tissue analyses:
```python
# Broader grouping
obs_value_filter="tissue_general == 'immune system'"
# Specific tissue
obs_value_filter="tissue == 'peripheral blood mononuclear cell'"
```
Select Only Needed Columns
Minimize data transfer by specifying only required metadata columns:
```python
obs_column_names=["cell_type", "tissue_general", "disease"]  # Not all columns
```
Check Dataset Presence for Gene-Specific Queries
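When analyzing specific genes, verify which datasets measured them:
```python
# The presence matrix has one row per dataset and one column per gene (by soma_joinid)
presence = cellxgene_census.get_presence_matrix(census, "homo_sapiens", "RNA")
genes = cellxgene_census.get_var(
    census, "homo_sapiens",
    value_filter="feature_name in ['CD4', 'CD8A']",
    column_names=["soma_joinid", "feature_name"]
)
gene_presence = presence[:, genes["soma_joinid"].to_numpy()]
```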
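Once the presence information for the genes of interest is available, a common next step is to keep only the datasets that measured all of them. The sketch below shows that selection on a toy presence table held as nested lists. The real presence matrix is a sparse matrix, so this illustrates the logic only, and `datasets_measuring_all` is a hypothetical name.

```python
def datasets_measuring_all(presence_rows, gene_columns):
    """Return row indices of datasets in which every listed gene is present.

    presence_rows holds one row per dataset and one column per gene,
    with truthy entries marking genes that the dataset measured.
    """
    return [
        dataset_idx
        for dataset_idx, row in enumerate(presence_rows)
        if all(row[col] for col in gene_columns)
    ]


# Toy table: 3 datasets (rows) by 3 genes (columns).
toy_presence = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]
complete = datasets_measuring_all(toy_presence, gene_columns=[0, 1])
print(complete)
```

Restricting an analysis to these datasets avoids treating an unmeasured gene as if it had zero expression.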
Two-Step Workflow: Explore Then Query
First explore metadata to understand available data, then query expression:
```python
# Step 1: Explore what's available
metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="disease == 'COVID-19' and is_primary_data == True",
    column_names=["cell_type", "tissue_general"]
)
print(metadata.value_counts())
# Step 2: Query based on findings
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True",
)
```
Available Metadata Fields
Cell Metadata (obs)
Key fields for filtering:
- `cell_type`, `cell_type_ontology_term_id`
- `tissue`, `tissue_general`, `tissue_ontology_term_id`
- `disease`, `disease_ontology_term_id`
- `assay`, `assay_ontology_term_id`
- `donor_id`, `sex`, `self_reported_ethnicity`
- `development_stage`, `development_stage_ontology_term_id`
- `dataset_id`
- `is_primary_data` (Boolean: True = unique cell)
Gene Metadata (var)
- `feature_id` (Ensembl gene ID, e.g., "ENSG00000161798")
- `feature_name` (Gene symbol, e.g., "FOXP2")
- `feature_length` (Gene length in base pairs)
Reference Documentation
This skill includes detailed reference documentation:
references/census_schema.md
Comprehensive documentation of:
- Census data structure and organization
- All available metadata fields
- Value filter syntax and operators
- SOMA object types
- Data inclusion criteria
When to read: When you need detailed schema information, full list of metadata fields, or complex filter syntax.
references/common_patterns.md
Examples and patterns for:
- Exploratory queries (metadata only)
- Small-to-medium queries (AnnData)
- Large queries (out-of-core processing)
- PyTorch integration
- Scanpy integration workflows
- Multi-dataset integration
- Best practices and common pitfalls
When to read: When implementing specific query patterns, looking for code examples, or troubleshooting common issues.
Common Use Cases
Use Case 1: Explore Cell Types in a Tissue
```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    cells = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="tissue_general == 'lung' and is_primary_data == True",
        column_names=["cell_type"]
    )
    print(cells["cell_type"].value_counts())
```
Use Case 2: Query Marker Gene Expression
```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']",
        obs_value_filter="cell_type in ['T cell', 'B cell'] and is_primary_data == True",
    )
```
Use Case 3: Train Cell Type Classifier
```python
import cellxgene_census
from cellxgene_census.experimental.ml import experiment_dataloader

with cellxgene_census.open_soma() as census:
    dataloader = experiment_dataloader(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_value_filter="is_primary_data == True",
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )
    # Train model (epochs is defined elsewhere)
    for epoch in range(epochs):
        for batch in dataloader:
            # Training logic
            pass
```
Use Case 4: Cross-Tissue Analysis
```python
import cellxgene_census
import scanpy as sc

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'macrophage' and tissue_general in ['lung', 'liver', 'brain'] and is_primary_data == True",
    )
    # Analyze macrophage differences across tissues
    sc.tl.rank_genes_groups(adata, groupby="tissue_general")
```
Troubleshooting
Query Returns Too Many Cells
- Add more specific filters to reduce scope
- Use `tissue` instead of `tissue_general` for finer granularity
- Filter by specific `dataset_id` if known
- Switch to out-of-core processing for large queries
Memory Errors
- Reduce query scope with more restrictive filters
- Select fewer genes with `var_value_filter`
- Use out-of-core processing with `axis_query()`
- Process data in batches
Duplicate Cells in Results
- Always include `is_primary_data == True` in filters
- Check if intentionally querying across multiple datasets
Gene Not Found
- Verify gene name spelling (case-sensitive)
- Try Ensembl ID with `feature_id` instead of `feature_name`
- Check dataset presence matrix to see if gene was measured
- Some genes may have been filtered during Census construction
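Because gene names are matched case-sensitively, a quick check against the known names can reveal a capitalization mistake before a query silently returns nothing. The sketch below shows one way to do this. `suggest_gene_names` is a hypothetical helper, and the list of known names is toy data standing in for the `feature_name` column of the gene metadata.

```python
def suggest_gene_names(query, known_names):
    """Return the names in known_names that equal query when case is ignored."""
    wanted = query.casefold()
    return [name for name in known_names if name.casefold() == wanted]


known = ["CD4", "CD8A", "FOXP2"]  # toy stand-in for feature_name values
suggestions = suggest_gene_names("foxp2", known)
print(suggestions)
```

An empty result suggests the gene is either spelled differently or absent, in which case trying the Ensembl ID is the next step.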
Version Inconsistencies
- Always specify `census_version` explicitly
- Use the same version across all analyses
- Check release notes for version-specific changes