ai-paper-reader

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

AI 论文阅读笔记生成器

AI Paper Reading Note Generator

核心目标

Core Objectives

生成可直接发布到技术社区的论文阅读笔记(知乎、掘金、公众号等)。
笔记要求:
  • 内容完整:核心技术细节不遗漏,创新点深刻阐述
  • 专业易读:技术博客风格,既有深度又便于理解
  • 客观准确:基于论文内容分析,不添加主观臆断
  • 深度思考:通过 Q&A 环节帮助读者深入理解

Generate paper reading notes ready for direct publication on technical communities (such as Zhihu, Juejin, WeChat Official Accounts, etc.).
Requirements for notes:
  • Comprehensive Content: No omission of core technical details, in-depth elaboration of innovative points
  • Professional and Readable: Technical blog style, both in-depth and easy to understand
  • Objective and Accurate: Analysis based on paper content, no subjective assumptions added
  • In-depth Thinking: Help readers gain in-depth understanding through Q&A sessions

写作规范

Writing Specifications

应该做

Dos

  1. 专业准确的表述
    • 使用领域内规范的术语
    • 公式和符号严格对应论文原文
    • 技术细节描述清晰无歧义
  2. 深入浅出的解释
    • 复杂概念先给直觉,再给细节
    • 用类比帮助理解抽象概念
    • 公式逐项解释变量含义
  3. 结构清晰的组织
    • 逻辑层次分明
    • 重点内容突出
    • 适当使用图表辅助说明
  4. 有价值的深度分析
    • 分析设计选择背后的原因
    • 对比与相关工作的异同
    • 指出方法的适用范围和局限
  1. Professional and Accurate Expression
    • Use standardized terminology in the field
    • Formulas and symbols strictly correspond to the original paper
    • Clear and unambiguous description of technical details
  2. Easy-to-Understand Explanation
    • Provide intuitive understanding of complex concepts before going into details
    • Use analogies to help understand abstract concepts
    • Explain the meaning of variables in formulas item by item
  3. Well-Structured Organization
    • Clear logical hierarchy
    • Highlight key content
    • Appropriately use charts for illustration
  4. Valuable In-depth Analysis
    • Analyze the reasons behind design choices
    • Compare similarities and differences with related works
    • Point out the scope of application and limitations of the method

必须避免

Must-Avoids

  1. AI 套话和模板句式
    • ❌ "本文的核心贡献是..."
    • ❌ "该方法的优势在于..."
    • ❌ "综上所述..."
    • ❌ "值得注意的是..."
    • ❌ "具有重要意义/广泛应用前景..."
  2. 空洞的总结和评价
    • ❌ "这是一篇重要的工作"
    • ❌ "为该领域提供了新思路"
    • ❌ 不带具体分析的泛泛而谈
  3. 过度的格式装饰
    • ❌ 大量 emoji
    • ❌ 每句话都加粗
    • ❌ 过多层级嵌套
  4. 不必要的第一人称
    • ❌ "我认为..."
    • ❌ "我的理解是..."
    • 保持客观叙述视角

  1. AI clichés and template sentences
    • ❌ "The core contribution of this paper is..."
    • ❌ "The advantage of this method is..."
    • ❌ "In summary..."
    • ❌ "It is worth noting that..."
    • ❌ "It has important significance/broad application prospects..."
  2. Empty summaries and evaluations
    • ❌ "This is an important work"
    • ❌ "Provides new ideas for the field"
    • ❌ General remarks without specific analysis
  3. Excessive format decoration
    • ❌ A large number of emojis
    • ❌ Bold every sentence
    • ❌ Excessive nested hierarchies
  4. Unnecessary first-person perspective
    • ❌ "I think..."
    • ❌ "My understanding is..."
    • Maintain an objective narrative perspective

笔记结构

Note Structure

零、元信息(笔记开头)

0. Meta Information (Beginning of Notes)

每篇笔记开头需包含以下信息,帮助读者快速判断是否继续阅读:
markdown
> **论文**:Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
> **作者**:Meta AI
> **发表**:ICML 2024
> **阅读时长**:约 15 分钟
> **难度**:⭐⭐⭐⭐ (需要 Transformer、推荐系统基础)
> **前置知识**:Attention 机制、DLRM、Scaling Law 概念
难度等级说明
  • ⭐ 入门级:无需专业背景
  • ⭐⭐ 基础级:了解深度学习基础
  • ⭐⭐⭐ 进阶级:熟悉相关领域
  • ⭐⭐⭐⭐ 专业级:需要较深的领域知识
  • ⭐⭐⭐⭐⭐ 专家级:涉及复杂数学或前沿研究
Each note must include the following information at the beginning to help readers quickly decide whether to continue reading:
markdown
> **Paper**: Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
> **Authors**: Meta AI
> **Publication**: ICML 2024
> **Reading Time**: Approximately 15 minutes
> **Difficulty**: ⭐⭐⭐⭐ (Requires basics of Transformer and recommendation systems)
> **Prerequisite Knowledge**: Concepts of Attention mechanism, DLRM, Scaling Law
Difficulty Level Explanation:
  • ⭐ Beginner: No professional background required
  • ⭐⭐ Basic: Understanding of deep learning fundamentals
  • ⭐⭐⭐ Intermediate: Familiar with the relevant field
  • ⭐⭐⭐⭐ Professional: Requires in-depth domain knowledge
  • ⭐⭐⭐⭐⭐ Expert: Involves complex mathematics or cutting-edge research

一、TL;DR

1. TL;DR

用 2-3 句话概括论文的核心创新,让没时间细读的读者快速抓住重点。
markdown
undefined
Summarize the core innovation of the paper in 2-3 sentences, allowing readers who don't have time to read in detail to grasp the key points quickly.
markdown
undefined

TL;DR

TL;DR

DLRM 等传统推荐模型依赖大量人工特征且无法 scale,本文提出将推荐问题转化为序列生成问题。核心创新是 HSTU 架构:用 Pointwise Attention 替代 Softmax 来保留用户偏好的绝对强度信息,使推荐系统首次展现出类似 LLM 的 Scaling Law。

**要求**:
- 2-3 句话,不超过 100 字
- 必须包含:问题背景 + 核心方案 + 关键创新点
- 避免泛泛而谈,要有具体技术点
Traditional recommendation models like DLRM rely heavily on manual features and cannot scale. This paper proposes transforming the recommendation problem into a sequence generation problem. The core innovation is the HSTU architecture: using Pointwise Attention instead of Softmax to preserve the absolute intensity information of user preferences, enabling recommendation systems to exhibit LLM-like Scaling Law for the first time.

**Requirements**:
- 2-3 sentences, no more than 100 words
- Must include: Problem background + Core solution + Key innovation points
- Avoid general remarks, include specific technical points

二、论文概述

2. Paper Overview

简明扼要地回答三个问题:
  • 解决什么问题:一句话描述
  • 核心方案:一句话概括
  • 主要贡献:2-3 点列举
markdown
undefined
Concise answer to three questions:
  • What problem does it solve: One-sentence description
  • Core solution: One-sentence summary
  • Main contributions: List 2-3 points
markdown
undefined

论文概述

Paper Overview

问题:大规模推荐系统无法像 LLM 一样通过增加计算量持续提升质量
方案:将推荐问题从"特征工程+判别式模型"转变为"序列建模+生成式模型"
贡献
  1. 提出 Generative Recommenders (GRs) 范式,实现推荐系统的 Scaling Law
  2. 设计 HSTU 架构,用 Pointwise Attention 替代 Softmax 保留强度信息
  3. 提出 M-FALCON 推理算法,实现高效的候选打分
undefined
Problem: Large-scale recommendation systems cannot continuously improve quality by increasing computational power like LLMs
Solution: Transform the recommendation problem from "feature engineering + discriminative model" to "sequence modeling + generative model"
Contributions:
  1. Propose the Generative Recommenders (GRs) paradigm, enabling Scaling Law for recommendation systems
  2. Design the HSTU architecture, using Pointwise Attention instead of Softmax to preserve intensity information
  3. Propose the M-FALCON inference algorithm for efficient candidate scoring
undefined

三、背景与动机

3. Background and Motivation

说明现有方法的问题,以及为什么需要新方法:
  • 现有方法怎么做的
  • 存在什么问题/瓶颈
  • 问题的根本原因是什么
Explain the problems of existing methods and why new methods are needed:
  • How existing methods work
  • What problems/bottlenecks exist
  • What is the root cause of the problem

四、核心方法(重点)

4. Core Method (Key Section)

这是笔记的核心部分,要求完整、深入、不遗漏
This is the core part of the note, requiring completeness, depth, and no omissions.

组织方式

Organization Method

  1. 整体架构
    • 给出架构图
    • 说明数据流向
    • 标注关键模块
  2. 核心模块详解(对每个关键模块)
    • 输入输出说明
    • 核心公式 + 逐项解释
    • 伪代码/代码实现
    • 设计选择的原因分析
  3. 关键技术细节
    • 训练策略
    • 超参数设置
    • 实现 tricks
  1. Overall Architecture
    • Provide architecture diagrams
    • Explain data flow
    • Label key modules
  2. Detailed Explanation of Core Modules (For each key module)
    • Input and output description
    • Core formula + item-by-item explanation
    • Pseudocode/code implementation
    • Analysis of reasons for design choices
  3. Key Technical Details
    • Training strategy
    • Hyperparameter settings
    • Implementation tricks

示例格式

Example Format

markdown
undefined
markdown
undefined

核心方法

Core Method

整体架构

Overall Architecture

[架构图]
数据流:用户历史序列 → Embedding → HSTU Layers × L → 预测头
[Architecture Diagram]
Data Flow: User history sequence → Embedding → HSTU Layers × L → Prediction Head

HSTU Layer 详解

HSTU Layer Details

输入输出

Input and Output

  • 输入:X ∈ R^{N×d},N 为序列长度,d 为嵌入维度
  • 输出:Y ∈ R^{N×d}
  • Input: X ∈ R^{N×d}, N is sequence length, d is embedding dimension
  • Output: Y ∈ R^{N×d}

核心公式

Core Formula

Pointwise Projection: $$U, V, Q, K = \text{Split}(\phi_1(f_1(X)))$$
其中:
  • $\phi_1$:SiLU 激活函数
  • $f_1$:单层线性变换
  • Split 将输出分为四个向量
Spatial Aggregation: $$A(X)V(X) = \phi_2(Q(X)K(X)^T + r_{ab}) V(X)$$
关键点:使用 SiLU 而非 Softmax,保留注意力的绝对强度信息。
Pointwise Projection: $$U, V, Q, K = \text{Split}(\phi_1(f_1(X)))$$
Where:
  • $\phi_1$: SiLU activation function
  • $f_1$: Single-layer linear transformation
  • Split divides the output into four vectors
Spatial Aggregation: $$A(X)V(X) = \phi_2(Q(X)K(X)^T + r_{ab}) V(X)$$
Key Point: Use SiLU instead of Softmax to preserve the absolute intensity information of attention.

代码实现

Code Implementation

python
class HSTULayer(nn.Module):
    def forward(self, x):
        # Pointwise Projection
        projected = F.silu(self.proj_in(x))
        u, v, q, k = projected.split([...], dim=-1)

        # Spatial Aggregation (不是 Softmax!)
        attn = F.silu(q @ k.T + self.rel_bias)
        out = self.norm(attn @ v) * u

        return x + self.proj_out(out)
python
class HSTULayer(nn.Module):
    def forward(self, x):
        # Pointwise Projection
        projected = F.silu(self.proj_in(x))
        u, v, q, k = projected.split([...], dim=-1)

        # Spatial Aggregation (not Softmax!)
        attn = F.silu(q @ k.T + self.rel_bias)
        out = self.norm(attn @ v) * u

        return x + self.proj_out(out)

设计分析

Design Analysis

为什么用 SiLU 而不是 Softmax?
Softmax 会将注意力归一化到概率分布,这在推荐场景下会丢失重要信息...
undefined
Why use SiLU instead of Softmax?
Recommendation scenarios require predicting the absolute intensity of user preferences (such as watch duration), rather than just relative ranking.
Consider two users:
  • User A: 10 historical interactions
  • User B: 100 historical interactions
When using Softmax, the attention weights of both users will be normalized to [0,1], causing the information that "User B is more active" to be lost.
Pointwise Attention preserves the accumulated original magnitude, allowing the model to learn activity differences.
undefined

五、实验分析

5. Experimental Analysis

不是罗列数字,而是提炼关键结论:
  • 主实验结果:与 baseline 对比的核心发现
  • 消融实验:各组件的贡献分析
  • Scaling 分析:计算量与性能的关系(如有)
  • 局限性:方法在什么情况下效果不好
Instead of listing numbers, extract key conclusions:
  • Main Experimental Results: Core findings compared with baselines
  • Ablation Experiments: Contribution analysis of each component
  • Scaling Analysis: Relationship between computational power and performance (if available)
  • Limitations: Situations where the method does not perform well

六、深度理解问答

6. In-depth Understanding Q&A

通过精心设计的问题,帮助读者深入理解论文的关键点。
问答直接展示,不使用折叠
markdown
undefined
Help readers gain in-depth understanding of key points of the paper through well-designed questions.
Display Q&A directly, do not use folding.
markdown
undefined

深度理解问答

In-depth Understanding Q&A

Q1: 为什么 Softmax Attention 不适合推荐场景?

Q1: Why is Softmax Attention not suitable for recommendation scenarios?

推荐场景需要预测用户偏好的绝对强度(如观看时长),而非只是相对排序
考虑两个用户:
  • 用户 A:10 次历史交互
  • 用户 B:100 次历史交互
使用 Softmax 时,两者的注意力权重都会被归一化到 [0,1],导致"用户 B 更活跃"这一信息丢失。
而 Pointwise Attention 保留了累加的原始 magnitude,模型可以学到活跃度差异。
Recommendation scenarios require predicting the absolute intensity of user preferences (such as watch duration), rather than just relative ranking.
Consider two users:
  • User A: 10 historical interactions
  • User B: 100 historical interactions
When using Softmax, the attention weights of both users will be normalized to [0,1], causing the information that "User B is more active" to be lost.
Pointwise Attention preserves the accumulated original magnitude, allowing the model to learn activity differences.

Q2: HSTU 如何用 2 个线性层替代 Transformer 的 6 个?

Q2: How does HSTU replace 6 linear layers of Transformer with 2?

标准 Transformer 每层需要:
  • Q, K, V 投影:3 个线性层
  • Output Projection:1 个线性层
  • FFN:2 个线性层(扩展+压缩)
HSTU 的简化:
  1. 融合 Q, K, V, U 投影:一个线性层同时生成四个向量
  2. 用 U 门控替代 FFN
    output * U
    实现类似的非线性变换
代价是单层表达能力下降,但可以通过堆叠更多层来补偿。
Standard Transformer layers require:
  • Q, K, V projection: 3 linear layers
  • Output Projection: 1 linear layer
  • FFN: 2 linear layers (expansion + compression)
HSTU simplifications:
  1. Fuse Q, K, V, U projection: A single linear layer generates four vectors simultaneously
  2. Replace FFN with U gating:
    output * U
    achieves similar nonlinear transformation
The cost is reduced expressive power per layer, but this can be compensated by stacking more layers.

Q3: Stochastic Length 训练为什么能丢弃 70% token 而效果几乎不变?

Q3: Why can Stochastic Length training discard 70% of tokens with almost no loss in performance?

关键在于用户行为的统计特性
  1. 时间重复性:用户会反复与相似内容交互,信息冗余度高
  2. 兴趣低秩性:10000 次交互可能只涉及 20 个主要兴趣点
  3. 最近优先:采样时对最近行为加权,保留最相关的信息
只要采样数量大于兴趣类别数的一定倍数,就能以高概率覆盖所有兴趣。
undefined
The key lies in the statistical characteristics of user behavior:
  1. Temporal repetition: Users repeatedly interact with similar content, resulting in high information redundancy
  2. Low-rank interests: 10,000 interactions may only involve 20 main interest points
  3. Recency priority: Weight recent behaviors during sampling to retain the most relevant information
As long as the number of samples exceeds a certain multiple of the number of interest categories, all interests can be covered with high probability.
undefined

七、总结与思考

7. Summary and Reflection

客观总结论文的贡献和局限:
markdown
undefined
Objectively summarize the contributions and limitations of the paper:
markdown
undefined

总结

Summary

核心贡献

Core Contributions

  • 证明了推荐系统可以遵循 Scaling Law
  • 提出了适合推荐场景的 Attention 变体
  • Proved that recommendation systems can follow Scaling Law
  • Proposed an Attention variant suitable for recommendation scenarios

局限性

Limitations

  • 冷启动场景:历史序列太短时优势不明显
  • 计算成本:需要大量 GPU 资源
  • 实时性:长序列推理的延迟挑战
  • Cold start scenarios: Advantages are not obvious when historical sequences are too short
  • Computational cost: Requires a large amount of GPU resources
  • Real-time performance: Latency challenges in long-sequence inference

适用场景

Applicable Scenarios

  • 用户历史丰富的场景(>100 次交互)
  • 有充足计算资源
  • 对实时性要求不是极端严格

---
  • Scenarios with rich user history (>100 interactions)
  • Sufficient computational resources available
  • Not extremely strict real-time requirements

---

目录结构规范

Directory Structure Specifications

论文和阅读笔记应组织在统一的子目录下,便于管理和检索:
paper-notes/
├── hstu/                           # 每篇论文一个目录,使用简短名称
│   ├── paper.pdf                   # 原始论文 PDF
│   ├── README.md                   # 阅读笔记(主文件)
│   └── images/                     # 提取的图表
│       ├── fig1_architecture.png
│       ├── fig2_method.png
│       └── fig3_scaling.png
├── attention-is-all-you-need/
│   ├── paper.pdf
│   ├── README.md
│   └── images/
└── din-deep-interest-network/
    ├── paper.pdf
    ├── README.md
    └── images/
命名规范
  • 目录名:论文简称或关键词,小写,用
    -
    连接
  • 笔记文件:统一命名为
    README.md
    ,便于 GitHub 直接预览
  • 图片目录:统一命名为
    images/
图片命名规范
fig{序号}_{类型}_{简述}.png

类型:
- arch: 架构图
- method: 方法流程
- result: 实验结果
- ablation: 消融实验
- compare: 对比图

示例:
- fig1_arch_overall.png
- fig2_method_attention.png
- fig3_result_scaling.png

Papers and reading notes should be organized in unified subdirectories for easy management and retrieval:
paper-notes/
├── hstu/                           # One directory per paper, using a short name
│   ├── paper.pdf                   # Original paper PDF
│   ├── README.md                   # Reading note (main file)
│   └── images/                     # Extracted charts
│       ├── fig1_architecture.png
│       ├── fig2_method.png
│       └── fig3_scaling.png
├── attention-is-all-you-need/
│   ├── paper.pdf
│   ├── README.md
│   └── images/
└── din-deep-interest-network/
    ├── paper.pdf
    ├── README.md
    └── images/
Naming Specifications:
  • Directory name: Paper abbreviation or keywords, lowercase, connected with
    -
  • Note file: Unifiedly named
    README.md
    for direct preview on GitHub
  • Image directory: Unifiedly named
    images/
Image Naming Specifications:
fig{number}_{type}_{description}.png

Types:
- arch: Architecture diagram
- method: Method flow
- result: Experimental result
- ablation: Ablation experiment
- compare: Comparison chart

Examples:
- fig1_arch_overall.png
- fig2_method_attention.png
- fig3_result_scaling.png

Q&A 环节设计指南

Q&A Session Design Guide

问题类型

Question Types

  1. 原理理解类
    • 为什么这样设计?
    • 与替代方案相比有何优势?
  2. 细节辨析类
    • 某个符号/操作的具体含义
    • 容易混淆的概念区分
  3. 边界条件类
    • 什么情况下方法会失效?
    • 假设条件是什么?
  4. 延伸思考类
    • 能否迁移到其他场景?
    • 有哪些可能的改进方向?
  1. Principle Understanding Type
    • Why is it designed this way?
    • What are the advantages compared to alternative solutions?
  2. Detail Differentiation Type
    • Specific meaning of a certain symbol/operation
    • Distinction between easily confused concepts
  3. Boundary Condition Type
    • In what situations will the method fail?
    • What are the assumptions?
  4. Extended Thinking Type
    • Can it be migrated to other scenarios?
    • What possible improvement directions are there?

答案要求

Answer Requirements

  • 直接展示:不使用折叠,读者可以顺畅阅读
  • 有理有据:答案要有论证,不是简单断言
  • 适当举例:用具体例子帮助理解
  • 承认不确定:对于论文未说明的部分,可以标注"推测"

  • Direct Display: Do not use folding, readers can read smoothly
  • Well-Founded: Answers must have arguments, not simple assertions
  • Appropriate Examples: Use specific examples to help understanding
  • Acknowledge Uncertainty: For parts not explained in the paper, mark as "speculation"

图表处理

Chart Processing

必须提取的图表

Must-Extract Charts

  • 整体架构图
  • 核心方法流程图
  • 关键实验结果(Scaling Law 曲线等)
  • Overall architecture diagram
  • Core method flow chart
  • Key experimental results (such as Scaling Law curves)

图表说明规范

Chart Description Specifications

markdown
![架构图](images/fig1_architecture.png)

**图示内容**:HSTU 的整体架构,左侧为 DLRM 对比

**关键信息**- 输入为统一的物品-行为交替序列
- HSTU Layer 可以无限堆叠
- 输出为多任务预测头

**与正文对应**:第 3.2 节详细描述

markdown
![Architecture Diagram](images/fig1_architecture.png)

**Content of the Diagram**: Overall architecture of HSTU, with DLRM comparison on the left

**Key Information**:
- Input is a unified item-behavior alternating sequence
- HSTU Layers can be stacked infinitely
- Output is a multi-task prediction head

**Correspondence with Text**: Detailed description in Section 3.2

图片提取工具

Image Extraction Tools

学术论文中的图表有两种类型,需要不同的提取方式:
类型特点提取方法
嵌入式图片作者插入的 PNG/JPEG
get_images()
矢量图形架构图、流程图等绘制的图形
cluster_drawings()
There are two types of charts in academic papers, requiring different extraction methods:
TypeCharacteristicsExtraction Method
Embedded ImagesPNG/JPEG inserted by authors
get_images()
Vector GraphicsArchitecture diagrams, flowcharts and other drawn graphics
cluster_drawings()

方法 1: 提取嵌入式图片

Method 1: Extract Embedded Images

适用于论文中直接插入的位图(如实验结果截图、照片等):
python
import fitz  # PyMuPDF
import os

def extract_embedded_images(pdf_path, output_dir):
    """提取 PDF 中嵌入的位图"""
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)

    for page_num in range(len(doc)):
        page = doc[page_num]
        images = page.get_images(full=True)

        for img_idx, img in enumerate(images):
            xref = img[0]
            base = doc.extract_image(xref)
            image_bytes = base["image"]
            image_ext = base["ext"]

            # 过滤过小的图片(可能是图标/装饰)
            if base["width"] > 100 and base["height"] > 100:
                output_path = f"{output_dir}/page{page_num+1}_img{img_idx+1}.{image_ext}"
                with open(output_path, "wb") as f:
                    f.write(image_bytes)

    doc.close()
Suitable for bitmaps directly inserted in papers (such as experimental result screenshots, photos, etc.):
python
import fitz  # PyMuPDF
import os

def extract_embedded_images(pdf_path, output_dir):
    """Extract embedded bitmaps from PDF"""
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)

    for page_num in range(len(doc)):
        page = doc[page_num]
        images = page.get_images(full=True)

        for img_idx, img in enumerate(images):
            xref = img[0]
            base = doc.extract_image(xref)
            image_bytes = base["image"]
            image_ext = base["ext"]

            # Filter out too small images (may be icons/decorations)
            if base["width"] > 100 and base["height"] > 100:
                output_path = f"{output_dir}/page{page_num+1}_img{img_idx+1}.{image_ext}"
                with open(output_path, "wb") as f:
                    f.write(image_bytes)

    doc.close()

方法 2: 提取矢量图形(推荐)

Method 2: Extract Vector Graphics (Recommended)

适用于论文中绑制的架构图、流程图、图表等矢量图形:
python
import fitz
import os

def extract_vector_figures(pdf_path, output_dir, dpi=200, min_size=100):
    """
    使用 cluster_drawings() 识别矢量图形区域并截图

    Args:
        pdf_path: PDF 文件路径
        output_dir: 输出目录
        dpi: 输出分辨率(默认 200,可提高到 300 获得更清晰的图片)
        min_size: 最小尺寸阈值,过滤装饰线条(默认 100pt)
    """
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)

    figures = []
    for page_num in range(len(doc)):
        page = doc[page_num]

        # 识别矢量图形的聚类区域
        # x_tolerance/y_tolerance 控制相邻元素的合并距离
        try:
            drawing_rects = page.cluster_drawings(
                x_tolerance=3,
                y_tolerance=3
            )
        except Exception:
            # 某些 PDF 可能不支持,跳过
            continue

        for idx, rect in enumerate(drawing_rects):
            # 过滤过小的区域(可能是线条/装饰)
            if rect.width < min_size or rect.height < min_size:
                continue

            # 扩展边界,避免裁切太紧
            rect = rect + (-10, -10, 10, 10)
            # 确保不超出页面边界
            rect = rect & page.rect

            # 高分辨率截图
            zoom = dpi / 72
            mat = fitz.Matrix(zoom, zoom)
            pix = page.get_pixmap(matrix=mat, clip=rect)

            output_path = f"{output_dir}/page{page_num+1}_fig{idx+1}.png"
            pix.save(output_path)
            figures.append({
                "page": page_num + 1,
                "path": output_path,
                "rect": rect
            })

    doc.close()
    return figures
Suitable for architecture diagrams, flowcharts, charts and other vector graphics bound in papers:
python
import fitz
import os

def extract_vector_figures(pdf_path, output_dir, dpi=200, min_size=100):
    """
    Identify vector graphic regions using cluster_drawings() and take screenshots

    Args:
        pdf_path: PDF file path
        output_dir: Output directory
        dpi: Output resolution (default 200, can be increased to 300 for clearer images)
        min_size: Minimum size threshold, filter decorative lines (default 100pt)
    """
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)

    figures = []
    for page_num in range(len(doc)):
        page = doc[page_num]

        # Identify clustered regions of vector graphics
        # x_tolerance/y_tolerance control the merging distance of adjacent elements
        try:
            drawing_rects = page.cluster_drawings(
                x_tolerance=3,
                y_tolerance=3
            )
        except Exception:
            # Some PDFs may not support this, skip
            continue

        for idx, rect in enumerate(drawing_rects):
            # Filter out too small regions (may be lines/decorations)
            if rect.width < min_size or rect.height < min_size:
                continue

            # Expand boundaries to avoid tight cropping
            rect = rect + (-10, -10, 10, 10)
            # Ensure not to exceed page boundaries
            rect = rect & page.rect

            # High-resolution screenshot
            zoom = dpi / 72
            mat = fitz.Matrix(zoom, zoom)
            pix = page.get_pixmap(matrix=mat, clip=rect)

            output_path = f"{output_dir}/page{page_num+1}_fig{idx+1}.png"
            pix.save(output_path)
            figures.append({
                "page": page_num + 1,
                "path": output_path,
                "rect": rect
            })

    doc.close()
    return figures

方法 3: 手动指定区域截取

Method 3: Manual Specified Region Cropping

当自动识别效果不理想时,可手动指定坐标:
python
import fitz

def crop_figure(pdf_path, page_num, rect, output_path, dpi=200):
    """
    从 PDF 指定页面裁剪特定区域

    Args:
        pdf_path: PDF 路径
        page_num: 页码(从 1 开始)
        rect: (x0, y0, x1, y1) 坐标,单位为点(pt),72pt = 1英寸
        output_path: 输出图片路径
        dpi: 分辨率
    """
    doc = fitz.open(pdf_path)
    page = doc[page_num - 1]

    clip = fitz.Rect(rect)
    zoom = dpi / 72
    mat = fitz.Matrix(zoom, zoom)

    pix = page.get_pixmap(matrix=mat, clip=clip)
    pix.save(output_path)
    doc.close()
When automatic recognition is not ideal, coordinates can be specified manually:
python
import fitz

def crop_figure(pdf_path, page_num, rect, output_path, dpi=200):
    """
    Crop a specific region from the specified page of PDF

    Args:
        pdf_path: PDF path
        page_num: Page number (starting from 1)
        rect: (x0, y0, x1, y1) coordinates, unit is points(pt), 72pt = 1 inch
        output_path: Output image path
        dpi: Resolution
    """
    doc = fitz.open(pdf_path)
    page = doc[page_num - 1]

    clip = fitz.Rect(rect)
    zoom = dpi / 72
    mat = fitz.Matrix(zoom, zoom)

    pix = page.get_pixmap(matrix=mat, clip=clip)
    pix.save(output_path)
    doc.close()

使用示例:裁剪第 2 页的某个区域

Usage Example: Crop a region on page 2

坐标可通过 PDF 阅读器查看,或先用方法 2 识别后微调

Coordinates can be viewed through PDF reader, or fine-tuned after recognition with Method 2

crop_figure( "paper.pdf", page_num=2, rect=(50, 100, 550, 400), # 左上角(50,100) 到 右下角(550,400) output_path="./images/fig1_architecture.png" )
undefined
crop_figure( "paper.pdf", page_num=2, rect=(50, 100, 550, 400), # Top-left corner(50,100) to bottom-right corner(550,400) output_path="./images/fig1_architecture.png" )
undefined

智能提取(综合方案)

Smart Extraction (Comprehensive Solution)

自动尝试多种方法,提取所有图表:
python
import fitz
import os

def smart_extract_figures(pdf_path, output_dir, dpi=200):
    """
    智能提取论文中的所有图表
    1. 先使用 cluster_drawings 识别矢量图形
    2. 再提取嵌入式位图
    3. 自动过滤和去重
    """
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    results = {"vector": [], "embedded": []}

    for page_num in range(len(doc)):
        page = doc[page_num]

        # 1. 提取矢量图形
        try:
            rects = page.cluster_drawings(x_tolerance=3, y_tolerance=3)
            for idx, rect in enumerate(rects):
                if rect.width > 100 and rect.height > 100:
                    rect = (rect + (-10, -10, 10, 10)) & page.rect
                    zoom = dpi / 72
                    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=rect)
                    path = f"{output_dir}/p{page_num+1}_vec{idx+1}.png"
                    pix.save(path)
                    results["vector"].append(path)
        except:
            pass

        # 2. 提取嵌入式图片
        for img_idx, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            base = doc.extract_image(xref)
            if base["width"] > 100 and base["height"] > 100:
                path = f"{output_dir}/p{page_num+1}_img{img_idx+1}.{base['ext']}"
                with open(path, "wb") as f:
                    f.write(base["image"])
                results["embedded"].append(path)

    doc.close()
    print(f"提取完成: {len(results['vector'])} 个矢量图, {len(results['embedded'])} 个位图")
    return results
Automatically try multiple methods to extract all charts:
python
import fitz
import os

def smart_extract_figures(pdf_path, output_dir, dpi=200):
    """
    Intelligently extract all charts from papers
    1. First use cluster_drawings to identify vector graphics
    2. Then extract embedded bitmaps
    3. Automatically filter and deduplicate
    """
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    results = {"vector": [], "embedded": []}

    for page_num in range(len(doc)):
        page = doc[page_num]

        # 1. Extract vector graphics
        try:
            rects = page.cluster_drawings(x_tolerance=3, y_tolerance=3)
            for idx, rect in enumerate(rects):
                if rect.width > 100 and rect.height > 100:
                    rect = (rect + (-10, -10, 10, 10)) & page.rect
                    zoom = dpi / 72
                    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=rect)
                    path = f"{output_dir}/p{page_num+1}_vec{idx+1}.png"
                    pix.save(path)
                    results["vector"].append(path)
        except:
            pass

        # 2. Extract embedded images
        for img_idx, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            base = doc.extract_image(xref)
            if base["width"] > 100 and base["height"] > 100:
                path = f"{output_dir}/p{page_num+1}_img{img_idx+1}.{base['ext']}"
                with open(path, "wb") as f:
                    f.write(base["image"])
                results["embedded"].append(path)

    doc.close()
    print(f"Extraction completed: {len(results['vector'])} vector graphics, {len(results['embedded'])} bitmaps")
    return results

使用示例

Usage Example

results = smart_extract_figures("paper.pdf", "./images/")
undefined
results = smart_extract_figures("paper.pdf", "./images/")
undefined

常见问题

Common Issues

Q: 提取的图片包含多个 Figure 合在一起?
调小
x_tolerance
y_tolerance
参数(如 1-2),使聚类更严格。
Q: 同一个 Figure 被切成多块?
调大容差参数(如 10-20),使相邻元素合并。
Q: 某些 Figure 没有被识别?
  1. 可能是嵌入式图片,尝试
    get_images()
    方法
  2. 使用手动指定区域的方法
Q: 图片模糊?
提高
dpi
参数到 300 或更高。

Q: The extracted image contains multiple Figures together?
Reduce the
x_tolerance
and
y_tolerance
parameters (e.g., 1-2) to make clustering stricter.
Q: The same Figure is cut into multiple pieces?
Increase the tolerance parameters (e.g., 10-20) to merge adjacent elements.
Q: Some Figures are not recognized?
  1. They may be embedded images, try the
    get_images()
    method
  2. Use the manual specified region method
Q: The image is blurry?
Increase the
dpi
parameter to 300 or higher.

使用方式

Usage Methods

基本用法

Basic Usage

请阅读这篇论文,生成一篇专业的阅读笔记,适合发布到技术社区。
Please read this paper and generate a professional reading note suitable for publication on technical communities.

指定重点

Specify Key Points

请阅读这篇论文,重点分析:
1. HSTU 与标准 Transformer 的区别
2. Scaling Law 实验的设置和结论
3. 在工业场景的落地可行性
Please read this paper and focus on analyzing:
1. Differences between HSTU and standard Transformer
2. Settings and conclusions of Scaling Law experiments
3. Feasibility of industrial scenario implementation

对比分析

Comparative Analysis

请对比分析这两篇论文在 XXX 问题上的不同解法。

Please compare and analyze the different solutions of these two papers on the XXX problem.

技术要求

Technical Requirements

内容完整性

Content Completeness

  • 核心公式必须包含,逐项解释
  • 关键算法有伪代码实现
  • 重要超参数和训练细节不省略
  • 消融实验的关键结论要提炼
  • Core formulas must be included and explained item by item
  • Key algorithms have pseudocode implementations
  • Important hyperparameters and training details are not omitted
  • Key conclusions of ablation experiments are extracted

深度要求

Depth Requirements

  • 分析"为什么这样设计"
  • 与相关工作建立联系
  • 指出方法的边界和局限
  • Analyze "why it is designed this way"
  • Establish connections with related works
  • Point out the boundaries and limitations of the method

可读性

Readability

  • 先直觉后细节
  • 代码和公式配合
  • 长公式分步解释

  • Intuition first, then details
  • Code and formulas are matched
  • Long formulas are explained step by step

依赖配置

Dependency Configuration

bash
undefined
bash
undefined

图片提取

Image extraction

pip install pymupdf
pip install pymupdf

PDF 转图片(可选)

PDF to image (optional)

pip install pdf2image
undefined
pip install pdf2image
undefined

MCP 配置(可选)

MCP Configuration (Optional)

json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@anthropic/mcp-server-filesystem", "/path/to/papers"]
    },
    "notion": {
      "command": "npx",
      "args": ["-y", "@notionhq/notion-mcp-server"],
      "env": {
        "OPENAPI_MCP_HEADERS": "{\"Authorization\": \"Bearer YOUR_TOKEN\", \"Notion-Version\": \"2022-06-28\"}"
      }
    }
  }
}
json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@anthropic/mcp-server-filesystem", "/path/to/papers"]
    },
    "notion": {
      "command": "npx",
      "args": ["-y", "@notionhq/notion-mcp-server"],
      "env": {
        "OPENAPI_MCP_HEADERS": "{\"Authorization\": \"Bearer YOUR_TOKEN\", \"Notion-Version\": \"2022-06-28\"}"
      }
    }
  }
}