ai-paper-reader

In-depth analysis of AI papers, generating professional reading notes ready for direct publication

NPX Install

npx skill4agent add frostant/awesome-claude-skills ai-paper-reader

SKILL.md Content (Chinese)

AI Paper Reading Note Generator

Core Objectives

Generate paper reading notes that can be published directly on technical communities such as Zhihu, Juejin, and WeChat Official Accounts.
Requirements for notes:
  • Comprehensive Content: No omission of core technical details, in-depth elaboration of innovative points
  • Professional and Readable: Technical blog style, both in-depth and easy to understand
  • Objective and Accurate: Analysis based on paper content, no subjective assumptions added
  • In-depth Thinking: Help readers gain in-depth understanding through Q&A sessions

Writing Specifications

Dos

  1. Professional and Accurate Expression
    • Use standardized terminology in the field
    • Formulas and symbols strictly correspond to the original paper
    • Clear and unambiguous description of technical details
  2. Easy-to-Understand Explanation
    • Provide intuitive understanding of complex concepts before going into details
    • Use analogies to help understand abstract concepts
    • Explain the meaning of variables in formulas item by item
  3. Well-Structured Organization
    • Clear logical hierarchy
    • Highlight key content
    • Appropriately use charts for illustration
  4. Valuable In-depth Analysis
    • Analyze the reasons behind design choices
    • Compare similarities and differences with related works
    • Point out the scope of application and limitations of the method

Must-Avoids

  1. AI clichés and template sentences
    • ❌ "The core contribution of this paper is..."
    • ❌ "The advantage of this method is..."
    • ❌ "In summary..."
    • ❌ "It is worth noting that..."
    • ❌ "It has important significance/broad application prospects..."
  2. Empty summaries and evaluations
    • ❌ "This is an important work"
    • ❌ "Provides new ideas for the field"
    • ❌ General remarks without specific analysis
  3. Excessive format decoration
    • ❌ A large number of emojis
    • ❌ Bold every sentence
    • ❌ Excessive nested hierarchies
  4. Unnecessary first-person perspective
    • ❌ "I think..."
    • ❌ "My understanding is..."
    • Maintain an objective narrative perspective
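
The must-avoid list lends itself to a rough automated check. The sketch below scans a draft for the template phrases listed above. The function name `find_cliches` and the phrase list are illustrative assumptions, not part of the skill, and a match only flags a passage for manual review.

```python
import re

# Phrases taken from the must-avoid list above; extend as needed.
BANNED_PHRASES = [
    "the core contribution of this paper is",
    "the advantage of this method is",
    "in summary",
    "it is worth noting that",
    "this is an important work",
    "provides new ideas for the field",
    "i think",
    "my understanding is",
]


def find_cliches(text):
    """Return (line_number, phrase) pairs for each banned phrase found."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for phrase in BANNED_PHRASES:
            # Word boundaries avoid matches inside longer words.
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                hits.append((line_no, phrase))
    return hits


draft = "HSTU fuses four projections.\nIn summary, I think this is an important work."
for line_no, phrase in find_cliches(draft):
    print(f"line {line_no}: {phrase!r}")
```

The check is case-insensitive and line-based, so a phrase split across two lines would be missed.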

Note Structure

0. Meta Information (Beginning of Notes)

Each note must include the following information at the beginning to help readers quickly decide whether to continue reading:
markdown
> **Paper**: Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
> **Authors**: Meta AI
> **Publication**: ICML 2024
> **Reading Time**: Approximately 15 minutes
> **Difficulty**: ⭐⭐⭐⭐ (Requires basics of Transformer and recommendation systems)
> **Prerequisite Knowledge**: Concepts of Attention mechanism, DLRM, Scaling Law
Difficulty Level Explanation:
  • ⭐ Beginner: No professional background required
  • ⭐⭐ Basic: Understanding of deep learning fundamentals
  • ⭐⭐⭐ Intermediate: Familiar with the relevant field
  • ⭐⭐⭐⭐ Professional: Requires in-depth domain knowledge
  • ⭐⭐⭐⭐⭐ Expert: Involves complex mathematics or cutting-edge research
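
The meta block above follows a fixed layout, so it can be generated from a few fields. This is a minimal sketch; `render_meta` and its parameter names are hypothetical helpers introduced for illustration, with the star count mapped to the five levels listed above.

```python
# Level names copied from the difficulty explanation above.
DIFFICULTY_LABELS = {
    1: "Beginner",
    2: "Basic",
    3: "Intermediate",
    4: "Professional",
    5: "Expert",
}


def render_meta(paper, authors, publication, minutes, difficulty,
                difficulty_note, prerequisites):
    """Build the blockquote header placed at the top of a note."""
    if difficulty not in DIFFICULTY_LABELS:
        raise ValueError("difficulty must be an integer from 1 to 5")
    stars = "⭐" * difficulty
    lines = [
        f"> **Paper**: {paper}",
        f"> **Authors**: {authors}",
        f"> **Publication**: {publication}",
        f"> **Reading Time**: Approximately {minutes} minutes",
        f"> **Difficulty**: {stars} ({difficulty_note})",
        f"> **Prerequisite Knowledge**: {prerequisites}",
    ]
    return "\n".join(lines)


print(render_meta(
    "Actions Speak Louder than Words", "Meta AI", "ICML 2024", 15, 4,
    "Requires basics of Transformer and recommendation systems",
    "Attention mechanism, DLRM, Scaling Law",
))
```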

1. TL;DR

Summarize the core innovation of the paper in 2-3 sentences, allowing readers who don't have time to read in detail to grasp the key points quickly.
markdown
## TL;DR

Traditional recommendation models like DLRM rely heavily on manual features and cannot scale. This paper proposes transforming the recommendation problem into a sequence generation problem. The core innovation is the HSTU architecture: using Pointwise Attention instead of Softmax to preserve the absolute intensity information of user preferences, enabling recommendation systems to exhibit LLM-like Scaling Law for the first time.
Requirements:
  • 2-3 sentences, no more than 100 words
  • Must include: Problem background + Core solution + Key innovation points
  • Avoid general remarks, include specific technical points
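
The length requirements can be verified before publishing. The following is a rough sketch that assumes an English draft and a naive sentence splitter; `check_tldr` is a hypothetical name, and a Chinese draft would need character-based limits instead.

```python
import re


def check_tldr(text, min_sentences=2, max_sentences=3, max_words=100):
    """Return a list of problems; an empty list means the TL;DR passes."""
    # Naive split on ., !, ? followed by whitespace; abbreviations may miscount.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = text.split()
    problems = []
    if not min_sentences <= len(sentences) <= max_sentences:
        problems.append(
            f"expected {min_sentences}-{max_sentences} sentences, found {len(sentences)}"
        )
    if len(words) > max_words:
        problems.append(f"expected at most {max_words} words, found {len(words)}")
    return problems


print(check_tldr("DLRM cannot scale. HSTU reframes recommendation as generation."))
```

The check covers only length. Whether the summary names the problem, the solution, and the key innovation still needs a human read.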

2. Paper Overview

Answer three questions concisely:
  • What problem does it solve: One-sentence description
  • Core solution: One-sentence summary
  • Main contributions: List 2-3 points
markdown
## Paper Overview

**Problem**: Large-scale recommendation systems cannot continuously improve quality by increasing computational power like LLMs

**Solution**: Transform the recommendation problem from "feature engineering + discriminative model" to "sequence modeling + generative model"

**Contributions**:
1. Propose the Generative Recommenders (GRs) paradigm, enabling Scaling Law for recommendation systems
2. Design the HSTU architecture, using Pointwise Attention instead of Softmax to preserve intensity information
3. Propose the M-FALCON inference algorithm for efficient candidate scoring

3. Background and Motivation

Explain the problems of existing methods and why new methods are needed:
  • How existing methods work
  • What problems/bottlenecks exist
  • What is the root cause of the problem

4. Core Method (Key Section)

This is the core part of the note, requiring completeness, depth, and no omissions.

Organization Method

  1. Overall Architecture
    • Provide architecture diagrams
    • Explain data flow
    • Label key modules
  2. Detailed Explanation of Core Modules (For each key module)
    • Input and output description
    • Core formula + item-by-item explanation
    • Pseudocode/code implementation
    • Analysis of reasons for design choices
  3. Key Technical Details
    • Training strategy
    • Hyperparameter settings
    • Implementation tricks

Example Format

markdown
## Core Method

### Overall Architecture

[Architecture Diagram]

Data Flow: User history sequence → Embedding → HSTU Layers × L → Prediction Head

### HSTU Layer Details

#### Input and Output
- Input: X ∈ R^{N×d}, N is sequence length, d is embedding dimension
- Output: Y ∈ R^{N×d}

#### Core Formula

**Pointwise Projection**:
$$U, V, Q, K = \text{Split}(\phi_1(f_1(X)))$$

Where:
- $\phi_1$: SiLU activation function
- $f_1$: Single-layer linear transformation
- Split divides the output into four vectors

**Spatial Aggregation**:
$$A(X)V(X) = \phi_2(Q(X)K(X)^T + r_{ab}) V(X)$$

Key Point: Use SiLU instead of Softmax to preserve the absolute intensity information of attention.

#### Code Implementation

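```python
import torch.nn as nn
import torch.nn.functional as F

class HSTULayer(nn.Module):
    # proj_in, proj_out, norm, and rel_bias are assumed to be created in __init__ (omitted here)
    def forward(self, x):
        # Pointwise Projection: one fused linear layer, then SiLU
        projected = F.silu(self.proj_in(x))
        u, v, q, k = projected.split([...], dim=-1)  # split sizes elided in the original

        # Spatial Aggregation (not Softmax!)
        attn = F.silu(q @ k.transpose(-2, -1) + self.rel_bias)
        out = self.norm(attn @ v) * u

        # Residual connection around the output projection
        return x + self.proj_out(out)
```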

Design Analysis

Why use SiLU instead of Softmax?
Recommendation scenarios require predicting the absolute intensity of user preferences (such as watch duration), rather than just relative ranking.
Consider two users:
  • User A: 10 historical interactions
  • User B: 100 historical interactions
With Softmax, each user's attention weights are normalized to sum to 1 regardless of sequence length, so the information that "User B is more active" is lost.
Pointwise Attention preserves the accumulated original magnitude, allowing the model to learn activity differences.
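
A toy calculation makes the argument concrete. This is only an illustration under simplifying assumptions: every score and value is set to 1.0, and the relative bias and layer normalization from the formula are left out. The function names are hypothetical.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def softmax_aggregate(scores, values):
    """Softmax attention: weights sum to 1, so the output is a weighted average."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))


def pointwise_aggregate(scores, values):
    """Pointwise (SiLU) attention: weights are not normalized across the sequence."""
    return sum(silu(s) * v for s, v in zip(scores, values))


for n in (10, 100):  # User A and User B
    scores = [1.0] * n  # every past interaction is equally relevant
    values = [1.0] * n
    print(n, softmax_aggregate(scores, values), pointwise_aggregate(scores, values))
```

The softmax result is the same for both users, while the pointwise result grows in proportion to the number of interactions.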

5. Experimental Analysis

Instead of listing numbers, extract key conclusions:
  • Main Experimental Results: Core findings compared with baselines
  • Ablation Experiments: Contribution analysis of each component
  • Scaling Analysis: Relationship between computational power and performance (if available)
  • Limitations: Situations where the method does not perform well

6. In-depth Understanding Q&A

Help readers gain in-depth understanding of key points of the paper through well-designed questions.
Display Q&A directly, do not use folding.
markdown
## In-depth Understanding Q&A

### Q1: Why is Softmax Attention not suitable for recommendation scenarios?

Recommendation scenarios require predicting the **absolute intensity** of user preferences (such as watch duration), rather than just **relative ranking**.

Consider two users:
- User A: 10 historical interactions
- User B: 100 historical interactions

With Softmax, each user's attention weights are normalized to sum to 1 regardless of sequence length, so the information that "User B is more active" is lost.

Pointwise Attention preserves the accumulated original magnitude, allowing the model to learn activity differences.

### Q2: How does HSTU replace 6 linear layers of Transformer with 2?

Standard Transformer layers require:
- Q, K, V projection: 3 linear layers
- Output Projection: 1 linear layer
- FFN: 2 linear layers (expansion + compression)

HSTU simplifications:
1. **Fuse Q, K, V, U projection**: A single linear layer generates four vectors simultaneously
2. **Replace FFN with U gating**: `output * U` achieves similar nonlinear transformation

The cost is reduced expressive power per layer, but this can be compensated by stacking more layers.

### Q3: Why can Stochastic Length training discard 70% of tokens with almost no loss in performance?

The key lies in the **statistical characteristics** of user behavior:

1. **Temporal repetition**: Users repeatedly interact with similar content, resulting in high information redundancy
2. **Low-rank interests**: 10,000 interactions may only involve 20 main interest points
3. **Recency priority**: Weight recent behaviors during sampling to retain the most relevant information

As long as the number of samples exceeds a certain multiple of the number of interest categories, all interests can be covered with high probability.
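
The coverage claim in Q3 can be given a rough number. The sketch below assumes samples are drawn uniformly and independently over equally likely interest categories, which is a simplification of real user behavior, and applies a union bound. `coverage_lower_bound` is a hypothetical helper.

```python
def coverage_lower_bound(num_interests, num_samples):
    """Union-bound lower bound on P(every interest appears at least once)."""
    # Probability that one fixed interest is missed by all samples.
    miss_one = (1.0 - 1.0 / num_interests) ** num_samples
    # Union bound over all interests; clamp because the bound can go negative.
    return max(0.0, 1.0 - num_interests * miss_one)


k = 20  # number of main interest points, as in the example above
for multiple in (2, 5, 10):
    m = multiple * k
    print(f"k={k}, samples={m}: coverage >= {coverage_lower_bound(k, m):.3f}")
```

Under these assumptions, 20 interests are covered with probability of at least about 0.88 at 100 samples and above 0.99 at 200 samples.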

7. Summary and Reflection

Objectively summarize the contributions and limitations of the paper:
markdown
## Summary

### Core Contributions
- Proved that recommendation systems can follow Scaling Law
- Proposed an Attention variant suitable for recommendation scenarios

### Limitations
- Cold start scenarios: Advantages are not obvious when historical sequences are too short
- Computational cost: Requires a large amount of GPU resources
- Real-time performance: Latency challenges in long-sequence inference

### Applicable Scenarios
- Scenarios with rich user history (>100 interactions)
- Sufficient computational resources available
- Not extremely strict real-time requirements
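
A finished note can be checked against this structure mechanically. This is a minimal sketch that looks only for the level-2 headings shown in the examples above. The heading list and the name `missing_sections` are assumptions; sections without an example heading in this document are not checked.

```python
# Headings as they appear in the example blocks of this document.
REQUIRED_HEADINGS = [
    "## TL;DR",
    "## Paper Overview",
    "## Core Method",
    "## In-depth Understanding Q&A",
    "## Summary",
]


def missing_sections(note_text):
    """Return required headings that are absent or out of order in the note."""
    lines = [line.strip() for line in note_text.splitlines()]
    missing = []
    position = 0
    for heading in REQUIRED_HEADINGS:
        try:
            # Search only after the previous heading to enforce ordering.
            position = lines.index(heading, position) + 1
        except ValueError:
            missing.append(heading)
    return missing


note = "## TL;DR\n...\n## Paper Overview\n...\n## Summary\n"
print(missing_sections(note))
```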

Directory Structure Specifications

Papers and reading notes should be organized in unified subdirectories for easy management and retrieval:
paper-notes/
├── hstu/                           # One directory per paper, using a short name
│   ├── paper.pdf                   # Original paper PDF
│   ├── README.md                   # Reading note (main file)
│   └── images/                     # Extracted charts
│       ├── fig1_arch_overall.png
│       ├── fig2_method_attention.png
│       └── fig3_result_scaling.png
├── attention-is-all-you-need/
│   ├── paper.pdf
│   ├── README.md
│   └── images/
└── din-deep-interest-network/
    ├── paper.pdf
    ├── README.md
    └── images/
Naming Specifications:
  • Directory name: Paper abbreviation or keywords, lowercase, joined with hyphens (-)
  • Note file: Always named README.md so it previews directly on GitHub
  • Image directory: Always named images/
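
The layout can be scaffolded with a few lines of standard-library code. This is a sketch under the naming rules above; `slugify` and `init_paper_dir` are hypothetical helpers, and the PDF itself still has to be copied in separately.

```python
import re
from pathlib import Path


def slugify(title):
    """Lowercase the title and join alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def init_paper_dir(root, short_name):
    """Create <root>/<slug>/ with images/ and an empty README.md; return the path."""
    slug = slugify(short_name)
    if not slug:
        raise ValueError("short_name must contain letters or digits")
    paper_dir = Path(root) / slug
    (paper_dir / "images").mkdir(parents=True, exist_ok=True)
    readme = paper_dir / "README.md"
    if not readme.exists():  # never overwrite an existing note
        readme.write_text("", encoding="utf-8")
    return paper_dir


# Usage (creates paper-notes/attention-is-all-you-need/):
# init_paper_dir("paper-notes", "Attention Is All You Need")
```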
Image Naming Specifications:
fig{number}_{type}_{description}.png

Types:
- arch: Architecture diagram
- method: Method flow
- result: Experimental result
- ablation: Ablation experiment
- compare: Comparison chart

Examples:
- fig1_arch_overall.png
- fig2_method_attention.png
- fig3_result_scaling.png
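
The naming pattern can be enforced with a small helper. The sketch below builds and validates names against the `fig{number}_{type}_{description}.png` pattern and the five types listed above. The function names are hypothetical.

```python
import re

FIGURE_TYPES = {"arch", "method", "result", "ablation", "compare"}
FIGURE_NAME = re.compile(r"^fig(\d+)_([a-z]+)_([a-z0-9_]+)\.png$")


def figure_filename(number, fig_type, description):
    """Build a file name such as fig1_arch_overall.png."""
    if fig_type not in FIGURE_TYPES:
        raise ValueError(f"unknown figure type: {fig_type!r}")
    # Keep lowercase alphanumeric runs and join them with underscores.
    slug = "_".join(re.findall(r"[a-z0-9]+", description.lower()))
    if not slug:
        raise ValueError("description must contain letters or digits")
    return f"fig{number}_{fig_type}_{slug}.png"


def is_valid_figure_name(name):
    """Check a file name against the pattern and the allowed types."""
    match = FIGURE_NAME.match(name)
    return bool(match) and match.group(2) in FIGURE_TYPES


print(figure_filename(3, "result", "Scaling"))
print(is_valid_figure_name("fig1_architecture.png"))
```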

Q&A Session Design Guide

Question Types

  1. Principle Understanding Type
    • Why is it designed this way?
    • What are the advantages compared to alternative solutions?
  2. Detail Differentiation Type
    • Specific meaning of a certain symbol/operation
    • Distinction between easily confused concepts
  3. Boundary Condition Type
    • In what situations will the method fail?
    • What are the assumptions?
  4. Extended Thinking Type
    • Can it be migrated to other scenarios?
    • What possible improvement directions are there?

Answer Requirements

  • Direct Display: Do not fold answers into collapsed sections, so readers can read straight through
  • Well-Founded: Support every answer with arguments rather than bare assertions
  • Appropriate Examples: Use specific examples to aid understanding
  • Acknowledge Uncertainty: Mark anything the paper does not explain as "speculation"

Chart Processing

Must-Extract Charts

  • Overall architecture diagram
  • Core method flow chart
  • Key experimental results (such as Scaling Law curves)

Chart Description Specifications

markdown
![Architecture Diagram](images/fig1_arch_overall.png)

**Content of the Diagram**: Overall architecture of HSTU, with DLRM comparison on the left

**Key Information**:
- Input is a unified item-behavior alternating sequence
- HSTU Layers can be stacked to arbitrary depth
- Output is a multi-task prediction head

**Correspondence with Text**: Detailed description in Section 3.2

Image Extraction Tools

There are two types of charts in academic papers, requiring different extraction methods:
| Type | Characteristics | Extraction Method |
| --- | --- | --- |
| Embedded Images | PNG/JPEG inserted by authors | get_images() |
| Vector Graphics | Architecture diagrams, flowcharts, and other drawn graphics | cluster_drawings() |

Method 1: Extract Embedded Images

Suitable for bitmaps directly inserted in papers (such as experimental result screenshots, photos, etc.):
python
import fitz  # PyMuPDF
import os

def extract_embedded_images(pdf_path, output_dir):
    """Extract embedded bitmaps from PDF"""
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)

    for page_num in range(len(doc)):
        page = doc[page_num]
        images = page.get_images(full=True)

        for img_idx, img in enumerate(images):
            xref = img[0]
            base = doc.extract_image(xref)
            image_bytes = base["image"]
            image_ext = base["ext"]

            # Filter out too small images (may be icons/decorations)
            if base["width"] > 100 and base["height"] > 100:
                output_path = f"{output_dir}/page{page_num+1}_img{img_idx+1}.{image_ext}"
                with open(output_path, "wb") as f:
                    f.write(image_bytes)

    doc.close()

Method 2: Extract Vector Graphics (Recommended)

Suitable for architecture diagrams, flowcharts, plots, and other vector graphics drawn directly in the PDF:
python
import fitz
import os

def extract_vector_figures(pdf_path, output_dir, dpi=200, min_size=100):
    """
    Identify vector graphic regions using cluster_drawings() and take screenshots

    Args:
        pdf_path: PDF file path
        output_dir: Output directory
        dpi: Output resolution (default 200, can be increased to 300 for clearer images)
        min_size: Minimum size threshold, filter decorative lines (default 100pt)
    """
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)

    figures = []
    for page_num in range(len(doc)):
        page = doc[page_num]

        # Identify clustered regions of vector graphics
        # x_tolerance/y_tolerance control the merging distance of adjacent elements
        try:
            drawing_rects = page.cluster_drawings(
                x_tolerance=3,
                y_tolerance=3
            )
        except Exception:
            # cluster_drawings() is missing in older PyMuPDF releases or may fail on a page; skip it
            continue

        for idx, rect in enumerate(drawing_rects):
            # Filter out too small regions (may be lines/decorations)
            if rect.width < min_size or rect.height < min_size:
                continue

            # Expand boundaries to avoid tight cropping
            rect = rect + (-10, -10, 10, 10)
            # Ensure not to exceed page boundaries
            rect = rect & page.rect

            # High-resolution screenshot
            zoom = dpi / 72
            mat = fitz.Matrix(zoom, zoom)
            pix = page.get_pixmap(matrix=mat, clip=rect)

            output_path = f"{output_dir}/page{page_num+1}_fig{idx+1}.png"
            pix.save(output_path)
            figures.append({
                "page": page_num + 1,
                "path": output_path,
                "rect": rect
            })

    doc.close()
    return figures

Method 3: Manually Specified Region Cropping

When automatic recognition is not ideal, coordinates can be specified manually:
python
import fitz

def crop_figure(pdf_path, page_num, rect, output_path, dpi=200):
    """
    Crop a specific region from the specified page of PDF

    Args:
        pdf_path: PDF path
        page_num: Page number (starting from 1)
        rect: (x0, y0, x1, y1) coordinates, unit is points(pt), 72pt = 1 inch
        output_path: Output image path
        dpi: Resolution
    """
    doc = fitz.open(pdf_path)
    page = doc[page_num - 1]

    clip = fitz.Rect(rect)
    zoom = dpi / 72
    mat = fitz.Matrix(zoom, zoom)

    pix = page.get_pixmap(matrix=mat, clip=clip)
    pix.save(output_path)
    doc.close()

# Usage Example: Crop a region on page 2
# Coordinates can be viewed through PDF reader, or fine-tuned after recognition with Method 2
crop_figure(
    "paper.pdf",
    page_num=2,
    rect=(50, 100, 550, 400),  # Top-left corner(50,100) to bottom-right corner(550,400)
    output_path="./images/fig1_arch_overall.png"
)

Smart Extraction (Comprehensive Solution)

Automatically try multiple methods to extract all charts:
python
import fitz
import os

def smart_extract_figures(pdf_path, output_dir, dpi=200):
    """
    Intelligently extract all charts from papers
    1. First use cluster_drawings to identify vector graphics
    2. Then extract embedded bitmaps
    3. Automatically filter and deduplicate
    """
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    results = {"vector": [], "embedded": []}
    seen_xrefs = set()  # images reused across pages are written only once

    for page_num in range(len(doc)):
        page = doc[page_num]

        # 1. Extract vector graphics
        try:
            rects = page.cluster_drawings(x_tolerance=3, y_tolerance=3)
        except Exception:
            # cluster_drawings() is missing in older PyMuPDF releases or may fail on a page
            rects = []
        for idx, rect in enumerate(rects):
            if rect.width > 100 and rect.height > 100:
                rect = (rect + (-10, -10, 10, 10)) & page.rect
                zoom = dpi / 72
                pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=rect)
                path = f"{output_dir}/p{page_num+1}_vec{idx+1}.png"
                pix.save(path)
                results["vector"].append(path)

        # 2. Extract embedded images, skipping xrefs that were already saved
        for img_idx, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            if xref in seen_xrefs:
                continue
            seen_xrefs.add(xref)
            base = doc.extract_image(xref)
            if base["width"] > 100 and base["height"] > 100:
                path = f"{output_dir}/p{page_num+1}_img{img_idx+1}.{base['ext']}"
                with open(path, "wb") as f:
                    f.write(base["image"])
                results["embedded"].append(path)

    doc.close()
    print(f"Extraction completed: {len(results['vector'])} vector graphics, {len(results['embedded'])} bitmaps")
    return results

# Usage Example
results = smart_extract_figures("paper.pdf", "./images/")

Common Issues

Q: The extracted image contains multiple Figures together?
Reduce the x_tolerance and y_tolerance parameters (e.g., 1-2) to make clustering stricter.
Q: The same Figure is cut into multiple pieces?
Increase the tolerance parameters (e.g., 10-20) to merge adjacent elements.
Q: Some Figures are not recognized?
  1. They may be embedded images; try the get_images() method
  2. Use the manually specified region method
Q: The image is blurry?
Increase the dpi parameter to 300 or higher.
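
Choosing a tolerance usually takes a few attempts, so it can help to count the regions found at several settings before extracting anything. The sketch below is a hypothetical helper, `sweep_tolerances`, that accepts any callable taking `x_tolerance`/`y_tolerance` keyword arguments, such as a PyMuPDF page's `cluster_drawings` method as used in Method 2.

```python
def sweep_tolerances(cluster, tolerances=(1, 3, 10, 20), min_size=100):
    """Map each tolerance to the number of regions at least min_size wide and tall.

    `cluster` is any callable accepting x_tolerance/y_tolerance keyword
    arguments and returning rectangles with .width and .height attributes.
    """
    counts = {}
    for tol in tolerances:
        rects = cluster(x_tolerance=tol, y_tolerance=tol)
        counts[tol] = sum(
            1 for r in rects if r.width >= min_size and r.height >= min_size
        )
    return counts


# Usage with PyMuPDF (assumes a release that provides cluster_drawings):
# import fitz
# doc = fitz.open("paper.pdf")
# print(sweep_tolerances(doc[1].cluster_drawings))
```

A count that drops as the tolerance rises suggests that separate Figures are being merged, while a high count at low tolerance suggests that one Figure is being split.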

Usage Methods

Basic Usage

Please read this paper and generate a professional reading note suitable for publication on technical communities.

Specify Key Points

Please read this paper and focus on analyzing:
1. Differences between HSTU and standard Transformer
2. Settings and conclusions of Scaling Law experiments
3. Feasibility of industrial scenario implementation

Comparative Analysis

Please compare and analyze the different solutions of these two papers on the XXX problem.

Technical Requirements

Content Completeness

  • Core formulas must be included and explained item by item
  • Key algorithms have pseudocode implementations
  • Important hyperparameters and training details are not omitted
  • Key conclusions of ablation experiments are extracted

Depth Requirements

  • Analyze "why it is designed this way"
  • Establish connections with related works
  • Point out the boundaries and limitations of the method

Readability

  • Intuition first, then details
  • Code matches the formulas it implements
  • Long formulas are explained step by step

Dependency Configuration

bash
# Image extraction
pip install pymupdf

# PDF to image (optional)
pip install pdf2image

MCP Configuration (Optional)

json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/papers"]
    },
    "notion": {
      "command": "npx",
      "args": ["-y", "@notionhq/notion-mcp-server"],
      "env": {
        "OPENAPI_MCP_HEADERS": "{\"Authorization\": \"Bearer YOUR_TOKEN\", \"Notion-Version\": \"2022-06-28\"}"
      }
    }
  }
}