In-depth analysis of AI papers, generating professional reading notes ready for direct publication
```shell
npx skill4agent add frostant/awesome-claude-skills ai-paper-reader
```

> **Paper**: Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
> **Authors**: Meta AI
> **Publication**: ICML 2024
> **Reading Time**: Approximately 15 minutes
> **Difficulty**: ⭐⭐⭐⭐ (Requires basics of Transformer and recommendation systems)
> **Prerequisite Knowledge**: Concepts of the Attention mechanism, DLRM, Scaling Laws

## TL;DR
Traditional recommendation models such as DLRM rely heavily on manual feature engineering and do not scale with compute. This paper proposes reformulating recommendation as a sequence generation problem. The core innovation is the HSTU architecture: using Pointwise Attention instead of Softmax preserves the absolute intensity of user preferences, enabling recommendation systems to exhibit an LLM-like Scaling Law for the first time.

## Paper Overview
**Problem**: Large-scale recommendation systems cannot continuously improve quality by increasing computational power like LLMs
**Solution**: Transform the recommendation problem from "feature engineering + discriminative model" to "sequence modeling + generative model"
**Contributions**:
1. Propose the Generative Recommenders (GRs) paradigm, enabling Scaling Law for recommendation systems
2. Design the HSTU architecture, using Pointwise Attention instead of Softmax to preserve intensity information
3. Propose the M-FALCON inference algorithm for efficient candidate scoring

## Core Method
### Overall Architecture
[Architecture Diagram]
Data Flow: User history sequence → Embedding → HSTU Layers × L → Prediction Head
### HSTU Layer Details
#### Input and Output
- Input: X ∈ R^{N×d}, N is sequence length, d is embedding dimension
- Output: Y ∈ R^{N×d}
#### Core Formula
**Pointwise Projection**:
$$U, V, Q, K = \text{Split}(\phi_1(f_1(X)))$$
Where:
- $\phi_1$: SiLU activation function
- $f_1$: Single-layer linear transformation
- Split divides the output into four vectors
**Spatial Aggregation**:
$$A(X)V(X) = \phi_2(Q(X)K(X)^T + r_{ab}) V(X)$$
Key Point: Use SiLU instead of Softmax to preserve the absolute intensity information of attention.
#### Code Implementation
```python
import torch.nn as nn
import torch.nn.functional as F

class HSTULayer(nn.Module):
    def forward(self, x):
        # Pointwise projection: one fused linear layer yields U, V, Q, K
        projected = F.silu(self.proj_in(x))
        u, v, q, k = projected.split([...], dim=-1)  # split sizes elided
        # Spatial aggregation: SiLU in place of Softmax (not Softmax!)
        attn = F.silu(q @ k.transpose(-2, -1) + self.rel_bias)
        out = self.norm(attn @ v) * u
        return x + self.proj_out(out)
```
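To make the "absolute intensity" point concrete, here is a small stdlib-only sketch (not from the paper's code) comparing how Softmax and SiLU treat the same logits at different scales:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def silu(x):
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1 + math.exp(-x))

logits = [1.0, 2.0, 3.0]
scaled = [10 * x for x in logits]  # mimic a 10x stronger preference signal

print(sum(softmax(logits)))           # 1.0 -- total mass normalized away
print(sum(softmax(scaled)))           # 1.0 -- indistinguishable from above
print(sum(silu(x) for x in logits))   # ~5.35
print(sum(silu(x) for x in scaled))   # ~60.0 -- magnitude preserved
```

Softmax outputs always sum to 1 regardless of input scale, whereas the SiLU-weighted sum grows with the logits, which is exactly the intensity signal HSTU wants to keep.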
### 5. Experimental Analysis
Instead of listing numbers, extract key conclusions:
- **Main Experimental Results**: Core findings compared with baselines
- **Ablation Experiments**: Contribution analysis of each component
- **Scaling Analysis**: Relationship between computational power and performance (if available)
- **Limitations**: Situations where the method does not perform well
### 6. In-depth Understanding Q&A
Help readers gain in-depth understanding of key points of the paper through well-designed questions.
**Display Q&A directly, do not use folding**.
```markdown
## In-depth Understanding Q&A
### Q1: Why is Softmax Attention not suitable for recommendation scenarios?
Recommendation scenarios require predicting the **absolute intensity** of user preferences (such as watch duration), rather than just **relative ranking**.
Consider two users:
- User A: 10 historical interactions
- User B: 100 historical interactions
With Softmax, both users' attention weights are normalized to sum to 1, so the information that "User B is more active" is lost.
Pointwise Attention preserves the accumulated original magnitude, allowing the model to learn activity differences.
### Q2: How does HSTU replace 6 linear layers of Transformer with 2?
Standard Transformer layers require:
- Q, K, V projection: 3 linear layers
- Output Projection: 1 linear layer
- FFN: 2 linear layers (expansion + compression)
HSTU simplifications:
1. **Fuse Q, K, V, U projection**: A single linear layer generates four vectors simultaneously
2. **Replace the FFN with U gating**: the elementwise product `output * U` provides a similar nonlinear transformation
The cost is reduced expressive power per layer, but this can be compensated by stacking more layers.
### Q3: Why can Stochastic Length training discard 70% of tokens with almost no loss in performance?
The key lies in the **statistical characteristics** of user behavior:
1. **Temporal repetition**: Users repeatedly interact with similar content, resulting in high information redundancy
2. **Low-rank interests**: 10,000 interactions may only involve 20 main interest points
3. **Recency priority**: Weight recent behaviors during sampling to retain the most relevant information
As long as the number of sampled tokens exceeds a certain multiple of the number of interest categories, all interests are covered with high probability.
```

## Summary
### Core Contributions
- Proved that recommendation systems can follow Scaling Law
- Proposed an Attention variant suitable for recommendation scenarios
### Limitations
- Cold start scenarios: Advantages are not obvious when historical sequences are too short
- Computational cost: Requires a large amount of GPU resources
- Real-time performance: Latency challenges in long-sequence inference
### Applicable Scenarios
- Scenarios with rich user history (>100 interactions)
- Sufficient computational resources available
- No extremely strict real-time latency requirements

```
paper-notes/
├── hstu/                          # One directory per paper, using a short name
│   ├── paper.pdf                  # Original paper PDF
│   ├── README.md                  # Reading note (main file)
│   └── images/                    # Extracted charts
│       ├── fig1_architecture.png
│       ├── fig2_method.png
│       └── fig3_scaling.png
│
├── attention-is-all-you-need/
│   ├── paper.pdf
│   ├── README.md
│   └── images/
│
└── din-deep-interest-network/
    ├── paper.pdf
    ├── README.md
    └── images/
```

Image naming convention: `images/fig{number}_{type}_{description}.png`
Types:
- arch: Architecture diagram
- method: Method flow
- result: Experimental result
- ablation: Ablation experiment
- compare: Comparison chart
Examples:
- fig1_arch_overall.png
- fig2_method_attention.png
- fig3_result_scaling.png
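The convention can be enforced with a tiny helper; this is a hypothetical sketch (the name `figure_name` is not part of the skill):

```python
def figure_name(number: int, fig_type: str, description: str) -> str:
    """Build an image filename following fig{number}_{type}_{description}.png."""
    allowed = {"arch", "method", "result", "ablation", "compare"}
    if fig_type not in allowed:
        raise ValueError(f"unknown figure type: {fig_type!r}")
    return f"fig{number}_{fig_type}_{description}.png"

print(figure_name(1, "arch", "overall"))    # fig1_arch_overall.png
print(figure_name(3, "result", "scaling"))  # fig3_result_scaling.png
```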
**Content of the Diagram**: Overall architecture of HSTU, with DLRM comparison on the left
**Key Information**:
- Input is a unified item-behavior alternating sequence
- HSTU Layers can be stacked infinitely
- Output is a multi-task prediction head
**Correspondence with Text**: Detailed description in Section 3.2

| Type | Characteristics | Extraction Method |
|---|---|---|
| Embedded Images | PNG/JPEG bitmaps inserted by the authors | `page.get_images()` + `doc.extract_image()` |
| Vector Graphics | Drawn figures such as architecture diagrams and flowcharts | `page.cluster_drawings()` + high-DPI screenshot |
```python
import fitz  # PyMuPDF
import os

def extract_embedded_images(pdf_path, output_dir):
    """Extract embedded bitmaps from a PDF."""
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    for page_num in range(len(doc)):
        page = doc[page_num]
        images = page.get_images(full=True)
        for img_idx, img in enumerate(images):
            xref = img[0]
            base = doc.extract_image(xref)
            image_bytes = base["image"]
            image_ext = base["ext"]
            # Filter out images that are too small (likely icons/decorations)
            if base["width"] > 100 and base["height"] > 100:
                output_path = f"{output_dir}/page{page_num+1}_img{img_idx+1}.{image_ext}"
                with open(output_path, "wb") as f:
                    f.write(image_bytes)
    doc.close()
```
```python
import fitz
import os

def extract_vector_figures(pdf_path, output_dir, dpi=200, min_size=100):
    """
    Identify vector-graphic regions with cluster_drawings() and screenshot them.

    Args:
        pdf_path: PDF file path
        output_dir: Output directory
        dpi: Output resolution (default 200; raise to 300 for sharper images)
        min_size: Minimum size threshold in points, filters decorative lines (default 100 pt)
    """
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    figures = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        # Cluster vector drawings into regions;
        # x_tolerance/y_tolerance control how close elements must be to merge
        try:
            drawing_rects = page.cluster_drawings(x_tolerance=3, y_tolerance=3)
        except Exception:
            # Some PDFs may not support this; skip the page
            continue
        for idx, rect in enumerate(drawing_rects):
            # Filter out regions that are too small (likely lines/decorations)
            if rect.width < min_size or rect.height < min_size:
                continue
            # Expand the boundary to avoid tight cropping
            rect = rect + (-10, -10, 10, 10)
            # Clamp to the page boundary
            rect = rect & page.rect
            # High-resolution screenshot
            zoom = dpi / 72
            mat = fitz.Matrix(zoom, zoom)
            pix = page.get_pixmap(matrix=mat, clip=rect)
            output_path = f"{output_dir}/page{page_num+1}_fig{idx+1}.png"
            pix.save(output_path)
            figures.append({
                "page": page_num + 1,
                "path": output_path,
                "rect": rect,
            })
    doc.close()
    return figures
```
```python
import fitz

def crop_figure(pdf_path, page_num, rect, output_path, dpi=200):
    """
    Crop a specific region from a given page of a PDF.

    Args:
        pdf_path: PDF path
        page_num: Page number (1-based)
        rect: (x0, y0, x1, y1) coordinates in points (72 pt = 1 inch)
        output_path: Output image path
        dpi: Resolution
    """
    doc = fitz.open(pdf_path)
    page = doc[page_num - 1]
    clip = fitz.Rect(rect)
    zoom = dpi / 72
    mat = fitz.Matrix(zoom, zoom)
    pix = page.get_pixmap(matrix=mat, clip=clip)
    pix.save(output_path)
    doc.close()

# Usage example: crop a region on page 2.
# Coordinates can be read off a PDF viewer, or fine-tuned after detection with Method 2.
crop_figure(
    "paper.pdf",
    page_num=2,
    rect=(50, 100, 550, 400),  # top-left (50, 100) to bottom-right (550, 400)
    output_path="./images/fig1_architecture.png",
)
```
```python
import fitz
import os

def smart_extract_figures(pdf_path, output_dir, dpi=200):
    """
    Intelligently extract all figures from a paper:
    1. Use cluster_drawings() to capture vector graphics
    2. Extract embedded bitmaps
    3. Filter and deduplicate automatically
    """
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    results = {"vector": [], "embedded": []}
    for page_num in range(len(doc)):
        page = doc[page_num]
        # 1. Extract vector graphics
        try:
            rects = page.cluster_drawings(x_tolerance=3, y_tolerance=3)
            for idx, rect in enumerate(rects):
                if rect.width > 100 and rect.height > 100:
                    rect = (rect + (-10, -10, 10, 10)) & page.rect
                    zoom = dpi / 72
                    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=rect)
                    path = f"{output_dir}/p{page_num+1}_vec{idx+1}.png"
                    pix.save(path)
                    results["vector"].append(path)
        except Exception:
            pass
        # 2. Extract embedded bitmaps
        for img_idx, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            base = doc.extract_image(xref)
            if base["width"] > 100 and base["height"] > 100:
                path = f"{output_dir}/p{page_num+1}_img{img_idx+1}.{base['ext']}"
                with open(path, "wb") as f:
                    f.write(base["image"])
                results["embedded"].append(path)
    doc.close()
    print(f"Extraction completed: {len(results['vector'])} vector graphics, {len(results['embedded'])} bitmaps")
    return results

# Usage example
results = smart_extract_figures("paper.pdf", "./images/")
```

Key parameters: `x_tolerance`/`y_tolerance` (how close drawing elements must be to merge into one region), `get_images()` (lists embedded bitmaps per page), `dpi` (output resolution).

Please read this paper and generate a professional reading note suitable for publication on technical communities.

Please read this paper and focus on analyzing:
1. Differences between HSTU and standard Transformer
2. Settings and conclusions of Scaling Law experiments
3. Feasibility of industrial scenario implementation

Please compare and analyze the different solutions of these two papers on the XXX problem.

```shell
# Image extraction
pip install pymupdf
# PDF to image (optional)
pip install pdf2image
```

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@anthropic/mcp-server-filesystem", "/path/to/papers"]
    },
    "notion": {
      "command": "npx",
      "args": ["-y", "@notionhq/notion-mcp-server"],
      "env": {
        "OPENAPI_MCP_HEADERS": "{\"Authorization\": \"Bearer YOUR_TOKEN\", \"Notion-Version\": \"2022-06-28\"}"
      }
    }
  }
}
```
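One gotcha in this config: `OPENAPI_MCP_HEADERS` is a JSON string embedded inside JSON, so the inner quotes must be escaped. A quick sanity check that the value parses as intended (`YOUR_TOKEN` is a placeholder, not a real credential):

```python
import json

# The value of OPENAPI_MCP_HEADERS as the MCP server receives it
# (the outer JSON parser has already removed the backslash escapes).
headers_value = '{"Authorization": "Bearer YOUR_TOKEN", "Notion-Version": "2022-06-28"}'
headers = json.loads(headers_value)
print(headers["Notion-Version"])  # 2022-06-28
```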