ai-paper-reader
In-depth analysis of AI papers, generating professional reading notes ready for direct publication
NPX Install
npx skill4agent add frostant/awesome-claude-skills ai-paper-reader
AI Paper Reading Note Generator
Core Objectives
Generate paper reading notes ready for direct publication on technical communities (such as Zhihu, Juejin, WeChat Official Accounts, etc.).
Requirements for notes:
- Comprehensive Content: No omission of core technical details, in-depth elaboration of innovative points
- Professional and Readable: Technical blog style, both in-depth and easy to understand
- Objective and Accurate: Analysis based on paper content, no subjective assumptions added
- In-depth Thinking: Help readers gain in-depth understanding through Q&A sessions
Writing Specifications
Dos
- Professional and Accurate Expression
  - Use standardized terminology in the field
  - Formulas and symbols strictly correspond to the original paper
  - Clear and unambiguous description of technical details
- Easy-to-Understand Explanation
  - Provide intuitive understanding of complex concepts before going into details
  - Use analogies to help understand abstract concepts
  - Explain the meaning of variables in formulas item by item
- Well-Structured Organization
  - Clear logical hierarchy
  - Highlight key content
  - Appropriately use charts for illustration
- Valuable In-depth Analysis
  - Analyze the reasons behind design choices
  - Compare similarities and differences with related works
  - Point out the scope of application and limitations of the method
Must-Avoids
- AI clichés and template sentences
  - ❌ "The core contribution of this paper is..."
  - ❌ "The advantage of this method is..."
  - ❌ "In summary..."
  - ❌ "It is worth noting that..."
  - ❌ "It has important significance/broad application prospects..."
- Empty summaries and evaluations
  - ❌ "This is an important work"
  - ❌ "Provides new ideas for the field"
  - ❌ General remarks without specific analysis
- Excessive format decoration
  - ❌ A large number of emojis
  - ❌ Bold every sentence
  - ❌ Excessive nested hierarchies
- Unnecessary first-person perspective
  - ❌ "I think..."
  - ❌ "My understanding is..."
  - Maintain an objective narrative perspective
Note Structure
0. Meta Information (Beginning of Notes)
Each note must include the following information at the beginning to help readers quickly decide whether to continue reading:
```markdown
> **Paper**: Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
> **Authors**: Meta AI
> **Publication**: ICML 2024
> **Reading Time**: Approximately 15 minutes
> **Difficulty**: ⭐⭐⭐⭐ (Requires basics of Transformer and recommendation systems)
> **Prerequisite Knowledge**: Concepts of Attention mechanism, DLRM, Scaling Law
```
Difficulty Level Explanation:
- ⭐ Beginner: No professional background required
- ⭐⭐ Basic: Understanding of deep learning fundamentals
- ⭐⭐⭐ Intermediate: Familiar with the relevant field
- ⭐⭐⭐⭐ Professional: Requires in-depth domain knowledge
- ⭐⭐⭐⭐⭐ Expert: Involves complex mathematics or cutting-edge research
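A note generator could assemble this header programmatically. The sketch below is one hypothetical way to do it: `render_meta` and `DIFFICULTY_LABELS` are illustrative names that do not come from the skill itself, while the field labels and the five difficulty levels are taken from the text above.

```python
# Hypothetical helper: names and signature are assumptions, not part of the skill.
DIFFICULTY_LABELS = {
    1: "Beginner",
    2: "Basic",
    3: "Intermediate",
    4: "Professional",
    5: "Expert",
}


def render_meta(paper, authors, publication, minutes, difficulty,
                prerequisites, difficulty_note=None):
    """Return the meta-information blockquote placed at the top of a note."""
    if difficulty not in DIFFICULTY_LABELS:
        raise ValueError("difficulty must be an integer from 1 to 5")
    # Fall back to the level name when no specific note is given.
    note = difficulty_note or DIFFICULTY_LABELS[difficulty]
    stars = "⭐" * difficulty
    lines = [
        f"> **Paper**: {paper}",
        f"> **Authors**: {authors}",
        f"> **Publication**: {publication}",
        f"> **Reading Time**: Approximately {minutes} minutes",
        f"> **Difficulty**: {stars} ({note})",
        f"> **Prerequisite Knowledge**: {', '.join(prerequisites)}",
    ]
    return "\n".join(lines)


meta = render_meta(
    paper="Actions Speak Louder than Words",
    authors="Meta AI",
    publication="ICML 2024",
    minutes=15,
    difficulty=4,
    prerequisites=["Attention mechanism", "DLRM", "Scaling Law"],
)
```

Rejecting values outside 1 to 5 keeps the star count aligned with the five levels listed above.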
1. TL;DR
Summarize the core innovation of the paper in 2-3 sentences, allowing readers who don't have time to read in detail to grasp the key points quickly.
```markdown
## TL;DR
Traditional recommendation models like DLRM rely heavily on manual features and cannot scale. This paper proposes transforming the recommendation problem into a sequence generation problem. The core innovation is the HSTU architecture: using Pointwise Attention instead of Softmax to preserve the absolute intensity information of user preferences, enabling recommendation systems to exhibit LLM-like Scaling Law for the first time.
```
Requirements:
- 2-3 sentences, no more than 100 words
- Must include: Problem background + Core solution + Key innovation points
- Avoid general remarks, include specific technical points
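The length rules above are mechanical enough to check automatically. The following is a rough sketch under stated assumptions: `check_tldr` is a hypothetical helper, sentences are split on terminal punctuation followed by whitespace (so abbreviations such as "e.g." may be miscounted), and words are counted by whitespace, which only suits English text. The content rules (background, solution, innovation) still need human review.

```python
import re


def check_tldr(text, max_words=100, min_sentences=2, max_sentences=3):
    """Return a list of rule violations; an empty list means the TL;DR passes."""
    # Heuristic split: '.', '!' or '?' followed by whitespace ends a sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = text.split()
    problems = []
    if not min_sentences <= len(sentences) <= max_sentences:
        problems.append(
            f"expected {min_sentences}-{max_sentences} sentences, found {len(sentences)}"
        )
    if len(words) > max_words:
        problems.append(f"expected at most {max_words} words, found {len(words)}")
    return problems


draft = (
    "DLRM-style models depend on manual features. "
    "HSTU recasts recommendation as sequence generation."
)
issues = check_tldr(draft)  # empty list: two sentences, well under 100 words
```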
2. Paper Overview
Answer three questions concisely:
- What problem does it solve: One-sentence description
- Core solution: One-sentence summary
- Main contributions: List 2-3 points
```markdown
## Paper Overview
**Problem**: Large-scale recommendation systems cannot continuously improve quality by increasing computational power like LLMs
**Solution**: Transform the recommendation problem from "feature engineering + discriminative model" to "sequence modeling + generative model"
**Contributions**:
1. Propose the Generative Recommenders (GRs) paradigm, enabling Scaling Law for recommendation systems
2. Design the HSTU architecture, using Pointwise Attention instead of Softmax to preserve intensity information
3. Propose the M-FALCON inference algorithm for efficient candidate scoring
```
3. Background and Motivation
Explain the problems of existing methods and why new methods are needed:
- How existing methods work
- What problems/bottlenecks exist
- What is the root cause of the problem
4. Core Method (Key Section)
This is the core part of the note, requiring completeness, depth, and no omissions.
Organization Method
- Overall Architecture
  - Provide architecture diagrams
  - Explain data flow
  - Label key modules
- Detailed Explanation of Core Modules (For each key module)
  - Input and output description
  - Core formula + item-by-item explanation
  - Pseudocode/code implementation
  - Analysis of reasons for design choices
- Key Technical Details
  - Training strategy
  - Hyperparameter settings
  - Implementation tricks
Example Format
````markdown
## Core Method
### Overall Architecture
[Architecture Diagram]
Data Flow: User history sequence → Embedding → HSTU Layers × L → Prediction Head
### HSTU Layer Details
#### Input and Output
- Input: X ∈ R^{N×d}, N is sequence length, d is embedding dimension
- Output: Y ∈ R^{N×d}
#### Core Formula
**Pointwise Projection**:
$$U, V, Q, K = \text{Split}(\phi_1(f_1(X)))$$
Where:
- $\phi_1$: SiLU activation function
- $f_1$: Single-layer linear transformation
- Split divides the output into four vectors
**Spatial Aggregation**:
$$A(X)V(X) = \phi_2(Q(X)K(X)^T + r_{ab}) V(X)$$
Key Point: Use SiLU instead of Softmax to preserve the absolute intensity information of attention.
#### Code Implementation
```python
class HSTULayer(nn.Module):
    def forward(self, x):
        # Pointwise Projection
        projected = F.silu(self.proj_in(x))
        u, v, q, k = projected.split([...], dim=-1)
        # Spatial Aggregation (not Softmax!)
        attn = F.silu(q @ k.T + self.rel_bias)
        out = self.norm(attn @ v) * u
        return x + self.proj_out(out)
```
#### Design Analysis
Why use SiLU instead of Softmax?
Recommendation scenarios require predicting the absolute intensity of user preferences (such as watch duration), rather than just relative ranking.
Consider two users:
- User A: 10 historical interactions
- User B: 100 historical interactions
When using Softmax, each user's attention weights are normalized to sum to 1, so the information that "User B is more active" is lost.
Pointwise Attention preserves the accumulated original magnitude, allowing the model to learn activity differences.
````
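The two-user argument in the example above can be checked with a toy calculation. This is a minimal sketch and not the HSTU implementation: it assumes every interaction has the same attention score and value of 1.0, uses scalar values, and leaves out the relative bias, normalization layer and gating. The helper names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax: weights are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def silu(x):
    """SiLU(x) = x * sigmoid(x), applied to a single score."""
    return x / (1.0 + math.exp(-x))


def aggregate(weights, values):
    """Weighted sum of scalar values."""
    return sum(w * v for w, v in zip(weights, values))


# Assumed toy setup: each interaction has score 1.0 and value 1.0.
for name, n in [("User A", 10), ("User B", 100)]:
    scores = [1.0] * n
    values = [1.0] * n
    soft_out = aggregate(softmax(scores), values)
    silu_out = aggregate([silu(s) for s in scores], values)
    print(f"{name}: softmax={soft_out:.3f}, silu={silu_out:.3f}")
# User A: softmax=1.000, silu=7.311
# User B: softmax=1.000, silu=73.106
```

Under softmax both users produce the same aggregate, while the pointwise SiLU aggregate grows with the number of interactions, which is the activity signal the text describes.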
5. Experimental Analysis
Instead of listing numbers, extract key conclusions:
- Main Experimental Results: Core findings compared with baselines
- Ablation Experiments: Contribution analysis of each component
- Scaling Analysis: Relationship between computational power and performance (if available)
- Limitations: Situations where the method does not perform well
6. In-depth Understanding Q&A
Help readers gain in-depth understanding of key points of the paper through well-designed questions.
Display Q&A directly; do not hide answers in collapsible sections.
```markdown
## In-depth Understanding Q&A
### Q1: Why is Softmax Attention not suitable for recommendation scenarios?
Recommendation scenarios require predicting the **absolute intensity** of user preferences (such as watch duration), rather than just **relative ranking**.
Consider two users:
- User A: 10 historical interactions
- User B: 100 historical interactions
When using Softmax, each user's attention weights are normalized to sum to 1, so the information that "User B is more active" is lost.
Pointwise Attention preserves the accumulated original magnitude, allowing the model to learn activity differences.
### Q2: How does HSTU replace 6 linear layers of Transformer with 2?
Standard Transformer layers require:
- Q, K, V projection: 3 linear layers
- Output Projection: 1 linear layer
- FFN: 2 linear layers (expansion + compression)
HSTU simplifications:
1. **Fuse Q, K, V, U projection**: A single linear layer generates four vectors simultaneously
2. **Replace FFN with U gating**: `output * U` achieves similar nonlinear transformation
The cost is reduced expressive power per layer, but this can be compensated by stacking more layers.
### Q3: Why can Stochastic Length training discard 70% of tokens with almost no loss in performance?
The key lies in the **statistical characteristics** of user behavior:
1. **Temporal repetition**: Users repeatedly interact with similar content, resulting in high information redundancy
2. **Low-rank interests**: 10,000 interactions may only involve 20 main interest points
3. **Recency priority**: Weight recent behaviors during sampling to retain the most relevant information
As long as the number of samples exceeds a certain multiple of the number of interest categories, all interests can be covered with high probability.
```
7. Summary and Reflection
Objectively summarize the contributions and limitations of the paper:
```markdown
## Summary
### Core Contributions
- Proved that recommendation systems can follow Scaling Law
- Proposed an Attention variant suitable for recommendation scenarios
### Limitations
- Cold start scenarios: Advantages are not obvious when historical sequences are too short
- Computational cost: Requires a large amount of GPU resources
- Real-time performance: Latency challenges in long-sequence inference
### Applicable Scenarios
- Scenarios with rich user history (>100 interactions)
- Sufficient computational resources available
- Not extremely strict real-time requirements
```
Directory Structure Specifications
Papers and reading notes should be organized in unified subdirectories for easy management and retrieval:
```
paper-notes/
├── hstu/                          # One directory per paper, using a short name
│   ├── paper.pdf                  # Original paper PDF
│   ├── README.md                  # Reading note (main file)
│   └── images/                    # Extracted charts
│       ├── fig1_architecture.png
│       ├── fig2_method.png
│       └── fig3_scaling.png
│
├── attention-is-all-you-need/
│   ├── paper.pdf
│   ├── README.md
│   └── images/
│
└── din-deep-interest-network/
    ├── paper.pdf
    ├── README.md
    └── images/
```
Naming Specifications:
- Directory name: Paper abbreviation or keywords, lowercase, connected with `-`
- Note file: Always named `README.md` for direct preview on GitHub
- Image directory: Always named `images/`
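The layout and naming rules above can be applied with a small scaffolding helper. This is a sketch under assumptions: `slugify` and `scaffold_paper_dir` are hypothetical names, the README is created empty, and copying the PDF into place is left to the caller.

```python
import re
from pathlib import Path


def slugify(title):
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def scaffold_paper_dir(root, short_name):
    """Create <root>/<slug>/ with an images/ folder and an empty README.md."""
    paper_dir = Path(root) / slugify(short_name)
    (paper_dir / "images").mkdir(parents=True, exist_ok=True)
    readme = paper_dir / "README.md"
    if not readme.exists():
        # Do not overwrite a note that already exists.
        readme.write_text("", encoding="utf-8")
    return paper_dir
```

For example, `scaffold_paper_dir("paper-notes", "Attention Is All You Need")` would create `paper-notes/attention-is-all-you-need/` with its `images/` subdirectory.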
Image Naming Specifications:
fig{number}_{type}_{description}.png
Types:
- arch: Architecture diagram
- method: Method flow
- result: Experimental result
- ablation: Ablation experiment
- compare: Comparison chart
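The pattern and type list above can be enforced with a pair of helpers. This is an illustrative sketch: `figure_filename` and `is_valid_figure_name` are hypothetical names, and restricting descriptions to lowercase letters, digits and underscores is an assumption the specification does not state.

```python
import re

# Type abbreviations from the list above.
FIGURE_TYPES = {"arch", "method", "result", "ablation", "compare"}
FIGURE_NAME = re.compile(r"fig(\d+)_([a-z]+)_([a-z0-9_]+)\.png")


def figure_filename(number, fig_type, description):
    """Build a name of the form fig{number}_{type}_{description}.png."""
    if fig_type not in FIGURE_TYPES:
        raise ValueError(f"unknown figure type: {fig_type!r}")
    slug = "_".join(re.findall(r"[a-z0-9]+", description.lower()))
    if not slug:
        raise ValueError("description must contain letters or digits")
    return f"fig{number}_{fig_type}_{slug}.png"


def is_valid_figure_name(name):
    """Check the pattern and that the type is one of the listed abbreviations."""
    match = FIGURE_NAME.fullmatch(name)
    return bool(match) and match.group(2) in FIGURE_TYPES
```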
Examples:
- fig1_arch_overall.png
- fig2_method_attention.png
- fig3_result_scaling.png
Q&A Session Design Guide
Question Types
- Principle Understanding Type
  - Why is it designed this way?
  - What are the advantages compared to alternative solutions?
- Detail Differentiation Type
  - Specific meaning of a certain symbol/operation
  - Distinction between easily confused concepts
- Boundary Condition Type
  - In what situations will the method fail?
  - What are the assumptions?
- Extended Thinking Type
  - Can it be migrated to other scenarios?
  - What possible improvement directions are there?
Answer Requirements
- Direct Display: Do not hide answers in collapsible sections, so readers can read straight through
- Well-Founded: Answers must have arguments, not simple assertions
- Appropriate Examples: Use specific examples to help understanding
- Acknowledge Uncertainty: For parts not explained in the paper, mark as "speculation"
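As an illustration of backing an answer with an argument, the coverage claim in the Q3 example earlier (enough samples cover all interest categories with high probability) can be supported by a short calculation. The sketch below makes a simplifying assumption that the example does not state: interactions are drawn independently and uniformly over the interest categories. The Q3 example mentions recency-weighted sampling, so real behavior will differ, and a note should mark such a calculation as speculation.

```python
from math import comb


def coverage_probability(num_interests, num_samples):
    """P(every interest appears at least once) under uniform, independent sampling.

    Inclusion-exclusion over the set of interests that are missed entirely.
    """
    k, n = num_interests, num_samples
    return sum((-1) ** j * comb(k, j) * (1 - j / k) ** n for j in range(k + 1))


# 20 interest categories, as in the Q3 example; the sample counts are arbitrary.
for n in (60, 100, 200):
    print(n, round(coverage_probability(20, n), 4))
```

The probability rises toward 1 as the number of retained samples grows relative to the number of categories, which is the shape of the claim made in the answer.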
Chart Processing
Must-Extract Charts
- Overall architecture diagram
- Core method flow chart
- Key experimental results (such as Scaling Law curves)
Chart Description Specifications
```markdown

**Content of the Diagram**: Overall architecture of HSTU, with DLRM comparison on the left
**Key Information**:
- Input is a unified item-behavior alternating sequence
- HSTU Layers can be stacked infinitely
- Output is a multi-task prediction head
**Correspondence with Text**: Detailed description in Section 3.2
```
Image Extraction Tools
There are two types of charts in academic papers, requiring different extraction methods:
| Type | Characteristics | Extraction Method |
|---|---|---|
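| Embedded Images | PNG/JPEG inserted by authors | Method 1: `page.get_images()` + `doc.extract_image()` |
| Vector Graphics | Architecture diagrams, flowcharts and other drawn graphics | Method 2: `page.cluster_drawings()` + `page.get_pixmap()` |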
Method 1: Extract Embedded Images
Suitable for bitmaps directly inserted in papers (such as experimental result screenshots, photos, etc.):
```python
import fitz  # PyMuPDF
import os


def extract_embedded_images(pdf_path, output_dir):
    """Extract embedded bitmaps from PDF"""
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    for page_num in range(len(doc)):
        page = doc[page_num]
        images = page.get_images(full=True)
        for img_idx, img in enumerate(images):
            xref = img[0]
            base = doc.extract_image(xref)
            image_bytes = base["image"]
            image_ext = base["ext"]
            # Skip small images (likely icons or decorations)
            if base["width"] > 100 and base["height"] > 100:
                output_path = f"{output_dir}/page{page_num+1}_img{img_idx+1}.{image_ext}"
                with open(output_path, "wb") as f:
                    f.write(image_bytes)
    doc.close()
```
Method 2: Extract Vector Graphics (Recommended)
Suitable for architecture diagrams, flowcharts, charts and other vector graphics drawn directly in the paper:
```python
import fitz  # PyMuPDF
import os


def extract_vector_figures(pdf_path, output_dir, dpi=200, min_size=100):
    """
    Identify vector graphic regions using cluster_drawings() and take screenshots
    Args:
        pdf_path: PDF file path
        output_dir: Output directory
        dpi: Output resolution (default 200, can be increased to 300 for clearer images)
        min_size: Minimum size threshold, filter decorative lines (default 100pt)
    """
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    figures = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        # Identify clustered regions of vector graphics
        # x_tolerance/y_tolerance control the merging distance of adjacent elements
        try:
            drawing_rects = page.cluster_drawings(
                x_tolerance=3,
                y_tolerance=3
            )
        except Exception:
            # Clustering failed on this page (for example, the installed
            # PyMuPDF version lacks cluster_drawings()), so skip it
            continue
        for idx, rect in enumerate(drawing_rects):
            # Skip small regions (likely lines or decorations)
            if rect.width < min_size or rect.height < min_size:
                continue
            # Expand boundaries to avoid tight cropping
            rect = rect + (-10, -10, 10, 10)
            # Clamp the region to the page boundaries
            rect = rect & page.rect
            # Render the region at the requested resolution (PDF units are 72 per inch)
            zoom = dpi / 72
            mat = fitz.Matrix(zoom, zoom)
            pix = page.get_pixmap(matrix=mat, clip=rect)
            output_path = f"{output_dir}/page{page_num+1}_fig{idx+1}.png"
            pix.save(output_path)
            figures.append({
                "page": page_num + 1,
                "path": output_path,
                "rect": rect
            })
    doc.close()
    return figures
```
Method 3: Manual Specified Region Cropping
When automatic recognition is not ideal, coordinates can be specified manually:
```python
import fitz  # PyMuPDF
import os


def crop_figure(pdf_path, page_num, rect, output_path, dpi=200):
    """
    Crop a specific region from the specified page of PDF
    Args:
        pdf_path: PDF path
        page_num: Page number (starting from 1)
        rect: (x0, y0, x1, y1) coordinates, unit is points(pt), 72pt = 1 inch
        output_path: Output image path
        dpi: Resolution
    """
    # Create the output directory if the path has one and it does not exist yet
    out_dir = os.path.dirname(output_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    page = doc[page_num - 1]
    clip = fitz.Rect(rect)
    zoom = dpi / 72
    mat = fitz.Matrix(zoom, zoom)
    pix = page.get_pixmap(matrix=mat, clip=clip)
    pix.save(output_path)
    doc.close()


# Usage Example: Crop a region on page 2
# Coordinates can be viewed through PDF reader, or fine-tuned after recognition with Method 2
crop_figure(
    "paper.pdf",
    page_num=2,
    rect=(50, 100, 550, 400),  # Top-left corner(50,100) to bottom-right corner(550,400)
    output_path="./images/fig1_architecture.png"
)
```
Smart Extraction (Comprehensive Solution)
Automatically try multiple methods to extract all charts:
```python
import fitz  # PyMuPDF
import os


def smart_extract_figures(pdf_path, output_dir, dpi=200):
    """
    Intelligently extract all charts from papers
    1. First use cluster_drawings to identify vector graphics
    2. Then extract embedded bitmaps
    3. Automatically filter and deduplicate
    """
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    results = {"vector": [], "embedded": []}
    seen_xrefs = set()  # An image reused on several pages is saved only once
    for page_num in range(len(doc)):
        page = doc[page_num]
        # 1. Extract vector graphics
        try:
            rects = page.cluster_drawings(x_tolerance=3, y_tolerance=3)
            for idx, rect in enumerate(rects):
                if rect.width > 100 and rect.height > 100:
                    rect = (rect + (-10, -10, 10, 10)) & page.rect
                    zoom = dpi / 72
                    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=rect)
                    path = f"{output_dir}/p{page_num+1}_vec{idx+1}.png"
                    pix.save(path)
                    results["vector"].append(path)
        except Exception:
            # Skip pages where clustering or rendering fails
            pass
        # 2. Extract embedded images
        for img_idx, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            if xref in seen_xrefs:
                continue
            seen_xrefs.add(xref)
            base = doc.extract_image(xref)
            if base["width"] > 100 and base["height"] > 100:
                path = f"{output_dir}/p{page_num+1}_img{img_idx+1}.{base['ext']}"
                with open(path, "wb") as f:
                    f.write(base["image"])
                results["embedded"].append(path)
    doc.close()
    print(f"Extraction completed: {len(results['vector'])} vector graphics, {len(results['embedded'])} bitmaps")
    return results


# Usage Example
results = smart_extract_figures("paper.pdf", "./images/")
```
Common Issues
Q: The extracted image contains multiple Figures together?
Reduce the `x_tolerance` and `y_tolerance` parameters (e.g., 1-2) to make clustering stricter.
Q: The same Figure is cut into multiple pieces?
Increase the tolerance parameters (e.g., 10-20) to merge adjacent elements.
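The effect of the tolerance parameters in the two answers above can be pictured with a toy merge over bounding boxes. This is only an intuition aid and not PyMuPDF's actual clustering algorithm: `merge_boxes` is a hypothetical helper that uses a single tolerance for both axes and joins any two boxes whose gap is within it.

```python
def merge_boxes(boxes, tolerance):
    """Merge (x0, y0, x1, y1) boxes whose gap is at most `tolerance` on both axes."""

    def close(a, b):
        gap_x = max(a[0], b[0]) - min(a[2], b[2])  # negative when the boxes overlap in x
        gap_y = max(a[1], b[1]) - min(a[3], b[3])  # negative when the boxes overlap in y
        return gap_x <= tolerance and gap_y <= tolerance

    clusters = [tuple(b) for b in boxes]
    merged = True
    while merged:  # repeat until no pair of clusters can be merged
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if close(clusters[i], clusters[j]):
                    a, b = clusters[i], clusters.pop(j)
                    clusters[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                   max(a[2], b[2]), max(a[3], b[3]))
                    merged = True
                    break
            if merged:
                break
    return clusters


# Two side-by-side figures separated by an 8pt gap (made-up coordinates).
figures = [(50, 100, 250, 300), (258, 100, 450, 300)]
print(len(merge_boxes(figures, tolerance=3)))   # 2: the figures stay separate
print(len(merge_boxes(figures, tolerance=10)))  # 1: the figures are merged
```

A small tolerance keeps neighbouring figures apart, while a large one joins fragments of the same figure, which is the trade-off the two answers describe.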
Q: Some Figures are not recognized?
- They may be embedded images; try the `get_images()` method
- Use the manual specified region method
Q: The image is blurry?
Increase the `dpi` parameter to 300 or higher.
Usage Methods
Basic Usage
```
Please read this paper and generate a professional reading note suitable for publication on technical communities.
```
Specify Key Points
```
Please read this paper and focus on analyzing:
1. Differences between HSTU and standard Transformer
2. Settings and conclusions of Scaling Law experiments
3. Feasibility of industrial scenario implementation
```
Comparative Analysis
```
Please compare and analyze the different solutions of these two papers on the XXX problem.
```
Technical Requirements
Content Completeness
- Core formulas must be included and explained item by item
- Key algorithms have pseudocode implementations
- Important hyperparameters and training details are not omitted
- Key conclusions of ablation experiments are extracted
Depth Requirements
- Analyze "why it is designed this way"
- Establish connections with related works
- Point out the boundaries and limitations of the method
Readability
- Intuition first, then details
- Code and formulas are matched
- Long formulas are explained step by step
Dependency Configuration
```bash
# Image extraction
pip install pymupdf
# PDF to image (optional)
pip install pdf2image
```
MCP Configuration (Optional)
```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/papers"]
    },
    "notion": {
      "command": "npx",
      "args": ["-y", "@notionhq/notion-mcp-server"],
      "env": {
        "OPENAPI_MCP_HEADERS": "{\"Authorization\": \"Bearer YOUR_TOKEN\", \"Notion-Version\": \"2022-06-28\"}"
      }
    }
  }
}
```