xhs
The user wants to extract the content of a Xiaohongshu post. Work through the following steps:
Constant Definitions
- Cookies file: `~/cookies.json` (Xiaohongshu cookies exported from Chrome)
- Obsidian save directory: `~/Documents/Obsidian Vault/xhs`
- Whisper model: `mlx-community/whisper-large-v3-turbo`
Input
Xiaohongshu link provided by the user: $ARGUMENTS
Extraction Process
Step 0: Check Cookies
- Check whether `~/cookies.json` exists
- If it does not exist, tell the user to export cookies from Chrome:
  - Open xiaohongshu.com in Chrome and confirm you are logged in
  - Open the DevTools Console and run the following code to copy the cookies to the clipboard:
    javascript
    // Note: document.cookie does not include HttpOnly cookies
    copy(JSON.stringify(document.cookie.split('; ').map(c => {
      const [name, ...rest] = c.split('=');
      return {
        name, value: rest.join('='), domain: '.xiaohongshu.com', path: '/',
        expires: Date.now() / 1000 + 86400 * 30,
        size: name.length + rest.join('=').length,
        httpOnly: false, secure: false, session: false, priority: 'Medium',
        sameParty: false, sourceScheme: 'Secure', sourcePort: 443
      };
    })))
  - Save the clipboard contents to `~/cookies.json`
- Then stop the process and wait for the user to rerun it after completing the steps above
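The existence check in Step 0 could be sketched as follows. This is a minimal illustration, not part of the original skill: the function name `load_cookies` is hypothetical, and treating an unparsable or empty file the same as a missing one is an assumption.

```python
import json
from pathlib import Path


def load_cookies(path="~/cookies.json"):
    """Return the exported cookie list, or None if the file is missing or unusable.

    Hypothetical helper: the original only says to check that the file exists.
    """
    cookie_file = Path(path).expanduser()
    if not cookie_file.is_file():
        return None
    try:
        cookies = json.loads(cookie_file.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(cookies, list):
        return None
    # Keep only entries that carry the two fields the later steps rely on
    usable = [c for c in cookies if isinstance(c, dict) and "name" in c and "value" in c]
    return usable or None
```

A `None` result would correspond to the "tell the user to export cookies, then stop" branch above.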
Step 1: Parse the Link
Extract the post ID (a 24-character hexadecimal string) and the xsec_token parameter from the URL.
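One way to do this parsing is sketched below with the standard library. The function name `parse_xhs_url` and the example URL are made up for illustration. The sketch assumes the ID appears in the URL path, so a short share link that has not been resolved would yield `None`.

```python
import re
from urllib.parse import urlparse, parse_qs

# Exactly 24 hex characters, not embedded in a longer hex run
POST_ID_RE = re.compile(r"(?<![0-9a-fA-F])[0-9a-fA-F]{24}(?![0-9a-fA-F])")


def parse_xhs_url(url):
    """Return (post_id, xsec_token); either may be None if absent."""
    parts = urlparse(url)
    match = POST_ID_RE.search(parts.path)
    post_id = match.group(0) if match else None
    token = parse_qs(parts.query).get("xsec_token", [None])[0]
    return post_id, token


# Hypothetical URL, for illustration only
example = "https://www.xiaohongshu.com/explore/64a1b2c3d4e5f60718293a4b?xsec_token=ABC123&xsec_source=pc_feed"
post_id, token = parse_xhs_url(example)
```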
Step 2: Retrieve Post Content
Use a Python script to request the post page HTML with the cookies, and parse all post data from `window.__INITIAL_STATE__`:
python
import json, urllib.request, ssl, re

# Build the Cookie header from the exported cookies file
with open('<Cookies file>') as f:
    cookies = json.load(f)
cookie_str = '; '.join(f"{c['name']}={c['value']}" for c in cookies)

# Note: certificate verification is disabled for this request
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

req = urllib.request.Request('<post URL>')
req.add_header('Cookie', cookie_str)
req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36')
resp = urllib.request.urlopen(req, timeout=15, context=ctx)
html = resp.read().decode('utf-8', errors='ignore')

# The embedded state is a JavaScript object, so map undefined to null before parsing it as JSON
m = re.search(r'window\.__INITIAL_STATE__\s*=\s*(\{.+?\})\s*</script>', html, re.DOTALL)
if m is None:
    raise RuntimeError('window.__INITIAL_STATE__ not found; cookies may have expired (see Step 0)')
raw = m.group(1).replace('undefined', 'null')
data = json.loads(raw)
Post data is located at: data['note']['noteDetailMap'][<key>]['note']
It contains: title, desc, type, time, user, imageList, video, interactInfo, ipLocation
If the request fails (redirected to a 404 or error page), the cookies have expired. Prompt the user to re-export them as described in Step 0.
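Looking up the note inside the parsed state might be done as in the sketch below. The original does not say what `<key>` is, so trying the post ID first and falling back to the first populated entry is an assumption, and `find_note` is a hypothetical name.

```python
def find_note(data, post_id=None):
    """Return the note dict from the parsed state, or None if it is not there.

    Assumption: the noteDetailMap key is usually the post ID; if that lookup
    misses, fall back to the first entry that carries a non-empty 'note'.
    """
    detail_map = (data.get("note") or {}).get("noteDetailMap") or {}
    entry = detail_map.get(post_id) if post_id else None
    if entry is None:
        entry = next(
            (v for v in detail_map.values() if isinstance(v, dict) and v.get("note")),
            None,
        )
    note = (entry or {}).get("note")
    return note or None
```

A `None` result can be treated like a failed request, that is, as a prompt to re-export the cookies.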
Step 3: Video Transcription (Video Posts Only)
If the post `type` is `video`, perform the following sub-steps:
3a. Extract Video URL
Parse the video stream from the data obtained in Step 2:
note['video']['media']['stream'] -> take the masterUrl of the first entry, in the priority order h264 > h265 > av1
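The priority rule above could look like the following sketch. It assumes `stream` is a mapping from codec name to a list of variants, each with a `masterUrl` field. The exact shape is not spelled out in the original, and `pick_master_url` is a hypothetical name.

```python
CODEC_PRIORITY = ("h264", "h265", "av1")


def pick_master_url(note):
    """Return the masterUrl of the first variant of the best available codec."""
    # Assumed shape: {"h264": [{"masterUrl": ...}, ...], "h265": [...], "av1": [...]}
    stream = note.get("video", {}).get("media", {}).get("stream", {})
    for codec in CODEC_PRIORITY:
        variants = stream.get(codec) or []
        if variants and variants[0].get("masterUrl"):
            return variants[0]["masterUrl"]
    return None
```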
3b. Download Video and Extract Audio
bash
# Download the video, sending the Referer header with the request
curl -L -o /tmp/xhs_{post_id}.mp4 -H "Referer: https://www.xiaohongshu.com/" "<video URL>"
# Extract the audio track as 16 kHz mono 16-bit PCM WAV
ffmpeg -y -i /tmp/xhs_{post_id}.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 /tmp/xhs_{post_id}.wav
3c. Speech Transcription
python
import mlx_whisper

# Transcribe the extracted audio as Chinese with the Whisper model from the constants
result = mlx_whisper.transcribe(
    "/tmp/xhs_{post_id}.wav",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    language="zh",
    verbose=False,
)
3d. Clean Transcribed Text
- Remove trailing repeated characters (noise from background music)
- Split into sentences by meaning, adding punctuation and paragraph breaks
- If the content has a step or key-point structure, format it with Markdown
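The first cleanup rule can be partly automated. The sketch below drops a trailing run in which a unit of 1 to 4 characters repeats at least five times. Both thresholds are assumptions chosen for illustration, and the sentence splitting and Markdown formatting are left to the model.

```python
import re

# A 1-4 character unit repeated 5+ times in a row at the very end of the text.
# The unit length and repeat count are illustrative guesses, not from the original.
TRAILING_REPEAT_RE = re.compile(r"(.{1,4}?)\1{4,}\s*$")


def strip_trailing_repeats(text):
    """Remove a trailing run of repeated characters, then trailing whitespace."""
    return TRAILING_REPEAT_RE.sub("", text).rstrip()
```

Shorter runs, such as three repeated characters at the end of a sentence, are left untouched.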
3e. Clean Temporary Files
bash
rm -f /tmp/xhs_{post_id}.mp4 /tmp/xhs_{post_id}.wav
Step 4: Organize Output and Save
Organize the content into a Markdown file and save it to `<Obsidian Save Directory>/{YYYY-MM-DD} {Short Title}.md`.
- File name format: `{Post Date} {Short Title}.md`. The short title is at most 15 characters and is a minimal summary of the core insight
- The date prefix ensures the files sort chronologically
- Do not create subdirectories; all post .md files go directly in the xhs folder
- Media files go in `<Obsidian Save Directory>/img/` or `<Obsidian Save Directory>/video/`
Writing style: Peter Thiel-style: direct, counterintuitive, a judgment in one sentence. Notes are decision tools, not a knowledge base. The user should be able to decide at a glance whether to dig deeper or skip.
File structure (no YAML frontmatter):
One-sentence Core Insight (Counterintuitive judgment, not descriptive title)
Core argument, 2-3 sentences. State the judgment directly, in the form "Most people think X, but actually Y".
No filler, no preamble, like Thiel speaking in a board meeting.
Relevance to me: One sentence. Read the user's memory (user- and project-type memories under ~/.claude/projects/*/memory/) to understand the user's background, research direction, and current work, and use that to state clearly how this content relates to the user. If memory is unavailable, approach it from a general personal development / tools / methodology angle.
Worth digging deeper: Yes/No. One-sentence reason.
[!tip]- Details
Structured summary of the post's core content (collapsed; visible only when expanded):
- Distill from desc and the video transcript, removing `#xxx[话题]#` tag markers
- Divide into sections by logical structure; keep key data and conclusions
- Embed images
- For video posts, put the cleaned-up transcript here
[!info]- Note Attributes
- Source: Xiaohongshu · Author Name
- Post ID: xxx
- Link: Original Link
- Date: YYYY-MM-DD
- Type: image/video
- Engagement: N Likes / N Saves / N Comments
- Tags: Tag 1, Tag 2, ...
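Removing the tag markers from `desc` while collecting the tag names for the Tags attribute could be sketched like this. It assumes every marker has the literal form `#name[话题]#` given above, and `split_desc_and_tags` is a hypothetical name.

```python
import re

# Matches markers of the form #name[话题]# and captures the name
TOPIC_RE = re.compile(r"#([^#\[\]]+)\[话题\]#")


def split_desc_and_tags(desc):
    """Return (desc without tag markers, list of tag names)."""
    tags = [t.strip() for t in TOPIC_RE.findall(desc)]
    cleaned = TOPIC_RE.sub("", desc)
    # Collapse the gaps left behind by removed markers
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned).strip()
    return cleaned, tags
```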
Key Constraints:
- Visible content outside collapsed areas **must not exceed 6 lines**
- The title must be an insight/judgment, not "Summary of XX Post"
- Use the URL from the `urlDefault` field for images
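The file-naming rule from Step 4 can be sketched as below. The function name is hypothetical. Hard truncation to 15 characters is only a safety net, since the title is meant to be written short in the first place, and replacing path separators and colons is an added assumption.

```python
from datetime import date
from pathlib import Path

MAX_TITLE_CHARS = 15  # limit stated in Step 4


def build_note_path(save_dir, post_date, short_title):
    """Return '<save_dir>/{YYYY-MM-DD} {short title}.md' as a Path."""
    title = short_title.strip()
    # Assumption: characters that are unsafe in file names become spaces
    for bad in "/\\:":
        title = title.replace(bad, " ")
    title = title[:MAX_TITLE_CHARS].strip()
    return Path(save_dir).expanduser() / f"{post_date.isoformat()} {title}.md"
```

The ISO date prefix is what makes the files sort chronologically in the folder listing.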