In-depth analysis of AI papers, generating professional reading notes ready for direct publication
```shell
npx skill4agent add frostant/awesome-claude-skills ai-paper-reader
```

> **Paper**: Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
> **Authors**: Meta AI
> **Publication**: ICML 2024
> **Reading Time**: Approximately 15 minutes
> **Difficulty**: ⭐⭐⭐⭐ (Requires basics of Transformer and recommendation systems)
> **Prerequisite Knowledge**: Concepts of the Attention mechanism, DLRM, Scaling Laws

## TL;DR
Traditional recommendation models such as DLRM rely heavily on manual feature engineering and do not scale with compute. This paper proposes reformulating recommendation as a sequence generation problem. The core innovation is the HSTU architecture: using Pointwise Attention instead of Softmax preserves the absolute intensity of user preferences, enabling recommendation systems to exhibit an LLM-like Scaling Law for the first time.

## Paper Overview
**Problem**: Large-scale recommendation systems cannot continuously improve quality by increasing computational power like LLMs
**Solution**: Transform the recommendation problem from "feature engineering + discriminative model" to "sequence modeling + generative model"
**Contributions**:
1. Propose the Generative Recommenders (GRs) paradigm, enabling Scaling Law for recommendation systems
2. Design the HSTU architecture, using Pointwise Attention instead of Softmax to preserve intensity information
3. Propose the M-FALCON inference algorithm for efficient candidate scoring

## Core Method
### Overall Architecture
[Architecture Diagram]
Data Flow: User history sequence → Embedding → HSTU Layers × L → Prediction Head
### HSTU Layer Details
#### Input and Output
- Input: X ∈ R^{N×d}, N is sequence length, d is embedding dimension
- Output: Y ∈ R^{N×d}
#### Core Formula
**Pointwise Projection**:
$$U, V, Q, K = \text{Split}(\phi_1(f_1(X)))$$
Where:
- $\phi_1$: SiLU activation function
- $f_1$: Single-layer linear transformation
- Split divides the output into four vectors
**Spatial Aggregation**:
$$A(X)V(X) = \phi_2(Q(X)K(X)^T + r_{ab}) V(X)$$
Key Point: Use SiLU instead of Softmax to preserve the absolute intensity information of attention.
#### Code Implementation
```python
import torch.nn as nn
import torch.nn.functional as F

class HSTULayer(nn.Module):
    def forward(self, x):
        # Pointwise projection: one fused linear layer yields U, V, Q, K
        projected = F.silu(self.proj_in(x))
        u, v, q, k = projected.split([...], dim=-1)  # split sizes elided
        # Spatial aggregation: SiLU in place of Softmax (not Softmax!)
        attn = F.silu(q @ k.transpose(-2, -1) + self.rel_bias)
        out = self.norm(attn @ v) * u
        return x + self.proj_out(out)
```
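To make the "absolute intensity" point concrete, here is a small stdlib-only sketch (not from the paper's code) comparing how Softmax and SiLU treat the same logits at different scales:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def silu(x):
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1 + math.exp(-x))

logits = [1.0, 2.0, 3.0]
scaled = [10 * x for x in logits]  # mimic a 10x stronger preference signal

print(sum(softmax(logits)))           # 1.0 -- total mass normalized away
print(sum(softmax(scaled)))           # 1.0 -- indistinguishable from above
print(sum(silu(x) for x in logits))   # ~5.35
print(sum(silu(x) for x in scaled))   # ~60.0 -- magnitude preserved
```

Softmax outputs always sum to 1 regardless of input scale, whereas the SiLU-weighted sum grows with the logits, which is exactly the intensity signal HSTU wants to keep.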
### 5. Experimental Analysis
Instead of listing numbers, extract key conclusions:
- **Main Experimental Results**: Core findings compared with baselines
- **Ablation Experiments**: Contribution analysis of each component
- **Scaling Analysis**: Relationship between computational power and performance (if available)
- **Limitations**: Situations where the method does not perform well
### 6. In-depth Understanding Q&A
Help readers gain in-depth understanding of key points of the paper through well-designed questions.
**Display Q&A directly, do not use folding**.
```markdown
## In-depth Understanding Q&A
### Q1: Why is Softmax Attention not suitable for recommendation scenarios?
Recommendation scenarios require predicting the **absolute intensity** of user preferences (such as watch duration), rather than just **relative ranking**.
Consider two users:
- User A: 10 historical interactions
- User B: 100 historical interactions
With Softmax, both users' attention weights are normalized to sum to 1, so the information that "User B is more active" is lost.
Pointwise Attention preserves the accumulated original magnitude, allowing the model to learn activity differences.
### Q2: How does HSTU replace 6 linear layers of Transformer with 2?
Standard Transformer layers require:
- Q, K, V projection: 3 linear layers
- Output Projection: 1 linear layer
- FFN: 2 linear layers (expansion + compression)
HSTU simplifications:
1. **Fuse Q, K, V, U projection**: A single linear layer generates four vectors simultaneously
2. **Replace the FFN with U gating**: the elementwise product `output * U` provides a similar nonlinear transformation
The cost is reduced expressive power per layer, but this can be compensated by stacking more layers.
### Q3: Why can Stochastic Length training discard 70% of tokens with almost no loss in performance?
The key lies in the **statistical characteristics** of user behavior:
1. **Temporal repetition**: Users repeatedly interact with similar content, resulting in high information redundancy
2. **Low-rank interests**: 10,000 interactions may only involve 20 main interest points
3. **Recency priority**: Weight recent behaviors during sampling to retain the most relevant information
As long as the number of sampled tokens exceeds a certain multiple of the number of interest categories, all interests are covered with high probability.
```

## Summary
### Core Contributions
- Proved that recommendation systems can follow Scaling Law
- Proposed an Attention variant suitable for recommendation scenarios
### Limitations
- Cold start scenarios: Advantages are not obvious when historical sequences are too short
- Computational cost: Requires a large amount of GPU resources
- Real-time performance: Latency challenges in long-sequence inference
### Applicable Scenarios
- Scenarios with rich user history (>100 interactions)
- Sufficient computational resources available
- No extremely strict real-time latency requirements

```
paper-notes/
├── hstu/                          # One directory per paper, using a short name
│   ├── paper.pdf                  # Original paper PDF
│   ├── README.md                  # Reading note (main file)
│   └── images/                    # Extracted charts
│       ├── fig1_architecture.png
│       ├── fig2_method.png
│       └── fig3_scaling.png
│
├── attention-is-all-you-need/
│   ├── paper.pdf
│   ├── README.md
│   └── images/
│
└── din-deep-interest-network/
    ├── paper.pdf
    ├── README.md
    └── images/
```

Image naming convention: `images/fig{number}_{type}_{description}.png`
Types:
- arch: Architecture diagram
- method: Method flow
- result: Experimental result
- ablation: Ablation experiment
- compare: Comparison chart
Examples:
- fig1_arch_overall.png
- fig2_method_attention.png
- fig3_result_scaling.png
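The convention can be enforced with a tiny helper; this is a hypothetical sketch (the name `figure_name` is not part of the skill):

```python
def figure_name(number: int, fig_type: str, description: str) -> str:
    """Build an image filename following fig{number}_{type}_{description}.png."""
    allowed = {"arch", "method", "result", "ablation", "compare"}
    if fig_type not in allowed:
        raise ValueError(f"unknown figure type: {fig_type!r}")
    return f"fig{number}_{fig_type}_{description}.png"

print(figure_name(1, "arch", "overall"))    # fig1_arch_overall.png
print(figure_name(3, "result", "scaling"))  # fig3_result_scaling.png
```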
**Content of the Diagram**: Overall architecture of HSTU, with DLRM comparison on the left
**Key Information**:
- Input is a unified item-behavior alternating sequence
- HSTU Layers can be stacked infinitely
- Output is a multi-task prediction head
**Correspondence with Text**: Detailed description in Section 3.2

| Type | Characteristics | Extraction Method |
|---|---|---|
| Embedded Images | PNG/JPEG bitmaps inserted by the authors | `page.get_images()` + `doc.extract_image()` |
| Vector Graphics | Drawn figures such as architecture diagrams and flowcharts | `page.cluster_drawings()` + high-DPI screenshot |
```python
import fitz  # PyMuPDF
import os

def extract_embedded_images(pdf_path, output_dir):
    """Extract embedded bitmaps from a PDF."""
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    for page_num in range(len(doc)):
        page = doc[page_num]
        images = page.get_images(full=True)
        for img_idx, img in enumerate(images):
            xref = img[0]
            base = doc.extract_image(xref)
            image_bytes = base["image"]
            image_ext = base["ext"]
            # Filter out images that are too small (likely icons/decorations)
            if base["width"] > 100 and base["height"] > 100:
                output_path = f"{output_dir}/page{page_num+1}_img{img_idx+1}.{image_ext}"
                with open(output_path, "wb") as f:
                    f.write(image_bytes)
    doc.close()
```
```python
import fitz
import os

def extract_vector_figures(pdf_path, output_dir, dpi=200, min_size=100):
    """
    Identify vector-graphic regions with cluster_drawings() and screenshot them.

    Args:
        pdf_path: PDF file path
        output_dir: Output directory
        dpi: Output resolution (default 200; raise to 300 for sharper images)
        min_size: Minimum size threshold in points, filters decorative lines (default 100 pt)
    """
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    figures = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        # Cluster vector drawings into regions;
        # x_tolerance/y_tolerance control how close elements must be to merge
        try:
            drawing_rects = page.cluster_drawings(x_tolerance=3, y_tolerance=3)
        except Exception:
            # Some PDFs may not support this; skip the page
            continue
        for idx, rect in enumerate(drawing_rects):
            # Filter out regions that are too small (likely lines/decorations)
            if rect.width < min_size or rect.height < min_size:
                continue
            # Expand the boundary to avoid tight cropping
            rect = rect + (-10, -10, 10, 10)
            # Clamp to the page boundary
            rect = rect & page.rect
            # High-resolution screenshot
            zoom = dpi / 72
            mat = fitz.Matrix(zoom, zoom)
            pix = page.get_pixmap(matrix=mat, clip=rect)
            output_path = f"{output_dir}/page{page_num+1}_fig{idx+1}.png"
            pix.save(output_path)
            figures.append({
                "page": page_num + 1,
                "path": output_path,
                "rect": rect,
            })
    doc.close()
    return figures
```
```python
import fitz

def crop_figure(pdf_path, page_num, rect, output_path, dpi=200):
    """
    Crop a specific region from a given page of a PDF.

    Args:
        pdf_path: PDF path
        page_num: Page number (1-based)
        rect: (x0, y0, x1, y1) coordinates in points (72 pt = 1 inch)
        output_path: Output image path
        dpi: Resolution
    """
    doc = fitz.open(pdf_path)
    page = doc[page_num - 1]
    clip = fitz.Rect(rect)
    zoom = dpi / 72
    mat = fitz.Matrix(zoom, zoom)
    pix = page.get_pixmap(matrix=mat, clip=clip)
    pix.save(output_path)
    doc.close()

# Usage example: crop a region on page 2.
# Coordinates can be read off a PDF viewer, or fine-tuned after detection with Method 2.
crop_figure(
    "paper.pdf",
    page_num=2,
    rect=(50, 100, 550, 400),  # top-left (50, 100) to bottom-right (550, 400)
    output_path="./images/fig1_architecture.png",
)
```
```python
import fitz
import os

def smart_extract_figures(pdf_path, output_dir, dpi=200):
    """
    Intelligently extract all figures from a paper:
    1. Use cluster_drawings() to capture vector graphics
    2. Extract embedded bitmaps
    3. Filter and deduplicate automatically
    """
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    results = {"vector": [], "embedded": []}
    for page_num in range(len(doc)):
        page = doc[page_num]
        # 1. Extract vector graphics
        try:
            rects = page.cluster_drawings(x_tolerance=3, y_tolerance=3)
            for idx, rect in enumerate(rects):
                if rect.width > 100 and rect.height > 100:
                    rect = (rect + (-10, -10, 10, 10)) & page.rect
                    zoom = dpi / 72
                    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=rect)
                    path = f"{output_dir}/p{page_num+1}_vec{idx+1}.png"
                    pix.save(path)
                    results["vector"].append(path)
        except Exception:
            pass
        # 2. Extract embedded bitmaps
        for img_idx, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            base = doc.extract_image(xref)
            if base["width"] > 100 and base["height"] > 100:
                path = f"{output_dir}/p{page_num+1}_img{img_idx+1}.{base['ext']}"
                with open(path, "wb") as f:
                    f.write(base["image"])
                results["embedded"].append(path)
    doc.close()
    print(f"Extraction completed: {len(results['vector'])} vector graphics, {len(results['embedded'])} bitmaps")
    return results

# Usage example
results = smart_extract_figures("paper.pdf", "./images/")
```

Key parameters: `x_tolerance`/`y_tolerance` (how close drawing elements must be to merge into one region), `get_images()` (lists embedded bitmaps per page), `dpi` (output resolution).

Please read this paper and generate a professional reading note suitable for publication on technical communities.

Please read this paper and focus on analyzing:
1. Differences between HSTU and standard Transformer
2. Settings and conclusions of Scaling Law experiments
3. Feasibility of industrial scenario implementation

Please compare and analyze the different solutions of these two papers on the XXX problem.

```shell
# Image extraction
pip install pymupdf
# PDF to image (optional)
pip install pdf2image
```

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@anthropic/mcp-server-filesystem", "/path/to/papers"]
    },
    "notion": {
      "command": "npx",
      "args": ["-y", "@notionhq/notion-mcp-server"],
      "env": {
        "OPENAPI_MCP_HEADERS": "{\"Authorization\": \"Bearer YOUR_TOKEN\", \"Notion-Version\": \"2022-06-28\"}"
      }
    }
  }
}
```
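One gotcha in this config: `OPENAPI_MCP_HEADERS` is a JSON string embedded inside JSON, so the inner quotes must be escaped. A quick sanity check that the value parses as intended (`YOUR_TOKEN` is a placeholder, not a real credential):

```python
import json

# The value of OPENAPI_MCP_HEADERS as the MCP server receives it
# (the outer JSON parser has already removed the backslash escapes).
headers_value = '{"Authorization": "Bearer YOUR_TOKEN", "Notion-Version": "2022-06-28"}'
headers = json.loads(headers_value)
print(headers["Notion-Version"])  # 2022-06-28
```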