xhs

The user wants to extract the content of a Xiaohongshu post. Follow these steps:

Constant Definitions

- Cookies file: `~/cookies.json` (Xiaohongshu cookies exported from Chrome)
- Obsidian save directory: `~/Documents/Obsidian Vault/xhs`
- Whisper model: `mlx-community/whisper-large-v3-turbo`

Input

Xiaohongshu link provided by the user: $ARGUMENTS

Extraction Process

Step 0: Check Cookies

Check whether `~/cookies.json` exists.

If it does not exist, tell the user to export cookies from Chrome:

- Open xiaohongshu.com in Chrome and confirm you are logged in.
- Open the DevTools Console and run the following code to copy the cookies to the clipboard:

```javascript
copy(JSON.stringify(document.cookie.split('; ').map(c => {
  const [name, ...rest] = c.split('=');
  return { name, value: rest.join('='), domain: '.xiaohongshu.com', path: '/',
    expires: Date.now()/1000 + 86400*30, size: name.length + rest.join('=').length,
    httpOnly: false, secure: false, session: false, priority: 'Medium',
    sameParty: false, sourceScheme: 'Secure', sourcePort: 443 };
})))
```

- Save the clipboard contents to `~/cookies.json`.

Then stop the process and wait for the user to rerun it after completing these steps.
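
The existence check in Step 0 could be sketched as follows. This is a minimal illustration, not part of the original workflow: the helper name `load_cookies` is hypothetical, and treating an unreadable or empty file the same as a missing one is an assumption.

```python
import json
from pathlib import Path


def load_cookies(path="~/cookies.json"):
    """Return the cookie list, or None if the file is missing or unusable."""
    cookie_file = Path(path).expanduser()  # open() does not expand "~" by itself
    if not cookie_file.is_file():
        return None
    try:
        cookies = json.loads(cookie_file.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    # The export snippet produces a non-empty list of {"name": ..., "value": ...} objects.
    if not isinstance(cookies, list) or not cookies:
        return None
    return cookies
```

A `None` result is the signal to show the export instructions and stop.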

Step 1: Parse the Link

Extract the post ID (a 24-character hexadecimal string) and the xsec_token parameter from the URL.
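
One way to do this parsing is sketched below. The function name `parse_post_url` and the example URL are hypothetical, and the sketch assumes the post ID appears as a standalone path segment.

```python
import re
from urllib.parse import urlparse, parse_qs


def parse_post_url(url):
    """Return (post_id, xsec_token); either value is None when absent."""
    parsed = urlparse(url)
    post_id = None
    # Assumption: the post ID is the first path segment made of exactly 24 hex characters.
    for segment in parsed.path.split("/"):
        if re.fullmatch(r"[0-9a-fA-F]{24}", segment):
            post_id = segment
            break
    token = parse_qs(parsed.query).get("xsec_token", [None])[0]
    return post_id, token


# Hypothetical URL, for illustration only.
post_id, token = parse_post_url(
    "https://www.xiaohongshu.com/explore/64a1b2c3d4e5f60718293a4b?xsec_token=ABtoken123"
)
```

If either value comes back as `None`, the link is probably not a post link and the user should be asked for a new one.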

Step 2: Retrieve Post Content

Use a Python script to request the post page HTML with the cookies, then parse the full post data from `window.__INITIAL_STATE__`:

```python
import json, os, re, ssl, urllib.request

# Load the cookies saved in Step 0 and join them into a single Cookie header.
with open(os.path.expanduser('<Cookies File>')) as f:
    cookies = json.load(f)
cookie_str = '; '.join(f"{c['name']}={c['value']}" for c in cookies)

# Note: certificate verification is disabled for this request.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

req = urllib.request.Request('<Post URL>')
req.add_header('Cookie', cookie_str)
req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36')

resp = urllib.request.urlopen(req, timeout=15, context=ctx)
html = resp.read().decode('utf-8', errors='ignore')

# The page embeds its state as a JavaScript object literal inside a <script> tag.
m = re.search(r'window\.__INITIAL_STATE__\s*=\s*(\{.+?\})\s*</script>', html, re.DOTALL)
if m is None:
    raise SystemExit('window.__INITIAL_STATE__ not found; the cookies may have expired (see Step 0)')
# JSON has no `undefined`, so map it to null before parsing (a blunt, text-wide replacement).
raw = m.group(1).replace('undefined', 'null')
data = json.loads(raw)
```

Post data is located at `data['note']['noteDetailMap'][<key>]['note']`.

It contains: title, desc, type, time, user, imageList, video, interactInfo, ipLocation.

If the request fails (redirected to a 404 or error page), the cookies have expired. Prompt the user to re-export them as described in Step 0.
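
The lookup into `noteDetailMap` could look like the sketch below. The source does not say what `<key>` is, so trying the post ID first and then falling back to the first entry that carries a note are both assumptions; `find_note` is a hypothetical helper name.

```python
def find_note(data, post_id):
    """Return the note dict from the parsed page state, or None if it is absent."""
    detail_map = (data.get("note") or {}).get("noteDetailMap") or {}
    # Assumption: the map is usually keyed by the post ID.
    entry = detail_map.get(post_id)
    if not isinstance(entry, dict):
        # Fallback assumption: use the first entry that carries a non-empty note.
        entry = next(
            (v for v in detail_map.values() if isinstance(v, dict) and v.get("note")),
            None,
        )
    note = (entry or {}).get("note")
    return note or None
```

A `None` result can be treated like the failed-request case above: the page did not contain the post, so the cookies likely need to be re-exported.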

Step 3: Video Transcription (Video Posts Only)

If the post type is video, perform the following sub-steps:

3a. Extract Video URL

Parse the video stream from the data obtained in Step 2:

note['video']['media']['stream'] -> following the priority order h264 > h265 > av1, take the masterUrl of the first stream
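
The codec priority rule can be sketched as below. It assumes each codec key under `stream` maps to a list of variants that each carry a `masterUrl`; that shape is inferred from the rule above rather than documented. `pick_master_url` is a hypothetical helper name.

```python
CODEC_PRIORITY = ("h264", "h265", "av1")


def pick_master_url(note):
    """Return the masterUrl of the first variant of the highest-priority codec."""
    stream = ((note.get("video") or {}).get("media") or {}).get("stream") or {}
    for codec in CODEC_PRIORITY:
        # Assumption: stream[codec] is a list of variants, possibly empty.
        variants = stream.get(codec) or []
        if variants and variants[0].get("masterUrl"):
            return variants[0]["masterUrl"]
    return None  # no usable stream found
```

Empty codec lists are skipped, so a post that only ships h265 still resolves to a URL.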

3b. Download Video and Extract Audio

```bash
# Download the video, then extract 16 kHz mono PCM audio for transcription.
curl -L -o /tmp/xhs_{post_id}.mp4 -H "Referer: https://www.xiaohongshu.com/" "<Video URL>"
ffmpeg -y -i /tmp/xhs_{post_id}.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 /tmp/xhs_{post_id}.wav
```

3c. Speech Transcription

```python
import mlx_whisper

# Transcribe the extracted audio as Chinese; the transcript is in result["text"].
result = mlx_whisper.transcribe("/tmp/xhs_{post_id}.wav",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo", language="zh", verbose=False)
```

3d. Clean Transcribed Text

- Remove trailing repeated characters (noise from background music).
- Break sentences by meaning; add punctuation and paragraphs.
- If the content has a step or key-point structure, format it with Markdown.
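
The first cleanup rule, removing trailing repeated characters, can be approximated with a regular expression. This is a heuristic sketch: the helper name, the unit length of up to 6 characters, and the threshold of 4 repetitions are all assumptions. The semantic sentence splitting in the second rule is not something a regex can do and is left to the model.

```python
import re


def strip_trailing_repeats(text, min_repeats=4):
    """Drop a run at the end of text where a short unit repeats min_repeats or more times."""
    # (.{1,6}?) captures the shortest repeating unit; \1{n,} requires n further copies of it.
    pattern = re.compile(r"(.{1,6}?)\1{%d,}\s*$" % (min_repeats - 1), re.DOTALL)
    return pattern.sub("", text.rstrip()).rstrip()
```

A transcript that legitimately ends in a repeated sound would also be trimmed, so the threshold may need adjusting.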

3e. Clean Temporary Files

```bash
rm -f /tmp/xhs_{post_id}.mp4 /tmp/xhs_{post_id}.wav
```

Step 4: Organize Output and Save

Organize the content into a Markdown file and save it to `<Obsidian Save Directory>/{YYYY-MM-DD} {Short Title}.md`.

- File name format: `{Post Date} {Short Title}.md`. The short title is at most 15 characters and is a minimal summary of the core insight.
- The date prefix keeps the files sorted chronologically.
- Do not create subdirectories; all post .md files go directly in the xhs folder.
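
The file name rules above can be sketched as follows. Several details are assumptions not stated in the source: that the note's `time` field is a Unix timestamp in milliseconds, that dates should be rendered in UTC+8, and which characters are stripped from the title. `build_note_filename` is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta, timezone


def build_note_filename(publish_time_ms, short_title, max_chars=15):
    """Return '{YYYY-MM-DD} {short title}.md' for a post."""
    # Assumption: the timestamp is in milliseconds and dates use China Standard Time.
    cst = timezone(timedelta(hours=8))
    date = datetime.fromtimestamp(publish_time_ms / 1000, tz=cst).strftime("%Y-%m-%d")
    # Strip characters that are unsafe in file names or Obsidian links (assumed set).
    cleaned = re.sub(r'[\\/:*?"<>|#^\[\]]', "", short_title).strip()
    # Hard cap as a safety net; the title should already fit within the limit.
    return f"{date} {cleaned[:max_chars]}.md"
```

The truncation is only a safety net. The short title is meant to be written within the limit, not cut to fit it.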

Media files go in `<Obsidian Save Directory>/img/` or `<Obsidian Save Directory>/video/`.

Writing style: in the manner of Peter Thiel. Direct, counterintuitive, one sentence per judgment. Notes are decision tools, not a knowledge base. The user should be able to decide at a glance whether to dig deeper or skip.

File structure (no YAML frontmatter):

One-sentence core insight (a counterintuitive judgment, not a descriptive title)

Core argument in 2-3 sentences. State the judgment directly in the form "Most people think X, but actually Y". No filler, no preamble; write the way Thiel speaks in a board meeting.

Relevance to me: One sentence. Read the user's memory (the user and project type memories under ~/.claude/projects/*/memory/) to understand the user's background, research direction, and current work, then state how this content relates to the user. If memory is unavailable, approach it from a general personal development, tools, or methodology angle.

Worth digging deeper: Yes/No. One-sentence reason.

> [!tip]- Details
> Structured summary of the post's core content (collapsed by default, visible only when expanded):
> - Distill from desc and the video transcript, removing `#xxx[话题]#` topic markers
> - Divide into sections by logical structure; keep key data and conclusions
> - Embed images as `![Image N](urlDefault)`
> - For video posts, place the cleaned transcript here

> [!info]- Note Attributes
> - Source: Xiaohongshu · Author Name
> - Post ID: xxx
> - Link: original link
> - Date: YYYY-MM-DD
> - Type: image/video
> - Engagement: N likes / N saves / N comments
> - Tags: Tag 1, Tag 2, ...

Key Constraints:
- Visible content outside collapsed areas **must not exceed 6 lines**
- The title must be an insight or judgment, not "Summary of XX Post"
- Use the URL from the `urlDefault` field for images
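
Removing the `#xxx[话题]#` topic markers from `desc` and reusing them for the tags attribute could be sketched as below. The helper name `split_desc_and_tags` is hypothetical, and the pattern assumes topic names contain no `#` or square brackets.

```python
import re

# Matches markers such as "#效率[话题]#" and captures the topic name.
TOPIC_TAG = re.compile(r"#([^#\[\]]+)\[话题\]#")


def split_desc_and_tags(desc):
    """Return (desc without topic markers, list of topic names)."""
    tags = [name.strip() for name in TOPIC_TAG.findall(desc)]
    body = TOPIC_TAG.sub("", desc)
    # Tidy whitespace left behind where markers were removed.
    body = re.sub(r"[ \t]+\n", "\n", body).strip()
    return body, tags
```

The returned topic names can fill the tags attribute, and the cleaned body feeds the collapsed details section.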