Defuddle - Web Content Extraction
Extract main article content from web pages, removing ads, sidebars, navigation, and other clutter. Output clean Markdown with metadata.
Prerequisites
Before first use, check whether `defuddle` is installed, and install it together with `jsdom` if it is missing:
```bash
command -v defuddle >/dev/null 2>&1 || npm install -g defuddle jsdom
```
Default Workflow
When user provides a URL, follow this workflow:
Step 1: Extract content as Markdown + JSON metadata
Always use both the `-m` and `-j` flags to get Markdown content with full metadata:
```bash
defuddle parse "<url>" -m -j
```
Step 2: Present a summary to the user
Show the user:
- Title: from JSON `title` field
- Author: from JSON `author` field
- Source: domain
- Word count: from JSON `wordCount` field
- A brief preview (first 2-3 sentences)
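Steps 1 and 2 can be sketched as a pair of shell helpers. This is a minimal sketch, not part of defuddle: `fetch_clip` and `summarize_clip` are hypothetical names, `jq` is an assumed extra dependency, the "Unknown" fallbacks for missing fields are an assumption, and the preview is approximated as the first 200 characters of `content` rather than true sentence splitting.

```bash
# Hypothetical helpers; jq is an assumed extra dependency.

# Step 1: run defuddle once and keep the JSON so later steps can reuse it.
fetch_clip() {
  local url="$1" json_file="$2"
  defuddle parse "$url" -m -j > "$json_file"
}

# Step 2: print the summary fields from the saved JSON.
# Missing or null fields fall back to "Unknown" (an assumption, not defuddle behavior).
summarize_clip() {
  local json_file="$1"
  jq -r '
    "Title: \(.title // "Unknown")",
    "Author: \(.author // "Unknown")",
    "Source: \(.domain // "Unknown")",
    "Word count: \(.wordCount // 0)",
    "Preview: \((.content // "") | gsub("\\s+"; " ") | .[0:200])"
  ' "$json_file"
}
```

With these defined, `fetch_clip "<url>" /tmp/clip.json && summarize_clip /tmp/clip.json` prints the five summary lines. Keeping the JSON on disk means Step 4 can reuse it without fetching the page again.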
Step 3: Ask where to save
If this is the first time using defuddle in this conversation, ask the user:
"Save to which directory? (e.g., `~/Desktop`, `~/Documents`, or a custom path)"
Remember the user's chosen directory for subsequent uses in the same conversation.
Step 4: Save as Markdown file
Write the file with frontmatter + full content:
```markdown
---
title: {title}
author: {author}
source: {url}
date: {published or "Unknown"}
clipped: {today's date YYYY-MM-DD}
wordCount: {wordCount}
---

# {title}

{markdown content}
```
**File naming**: Use the article title as filename, sanitized for filesystem:
- Replace special characters with spaces
- Trim whitespace
- Example: `The Shape of the Essay Field.md`
Step 5: Confirm to user
Tell the user the file path where it was saved.
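The file-naming rule and the save-and-confirm steps could be sketched in shell as follows. This is one possible sketch under stated assumptions: `sanitize_title` and `save_clip` are hypothetical names, the set of "special characters" (`/ \ : * ? " < > |`) is assumed because the text does not list them, collapsing repeated spaces is an extra convenience, and the title is written as a level-one heading after the frontmatter.

```bash
# Hypothetical helpers for Steps 4 and 5.

# Replace filename-unsafe characters with spaces (assumed set: / \ : * ? " < > |),
# collapse runs of spaces, then trim leading and trailing whitespace.
sanitize_title() {
  printf '%s' "$1" \
    | tr '/\\:*?"<>|' ' ' \
    | tr -s ' ' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Write frontmatter + heading + content to "<dir>/<sanitized title>.md" and print the path.
save_clip() {
  local dir="$1" title="$2" author="$3" url="$4" published="$5" word_count="$6" content_file="$7"
  local path="$dir/$(sanitize_title "$title").md"
  {
    printf '%s\n' '---'
    printf 'title: %s\n' "$title"
    printf 'author: %s\n' "$author"
    printf 'source: %s\n' "$url"
    printf 'date: %s\n' "${published:-Unknown}"   # empty publication date becomes "Unknown"
    printf 'clipped: %s\n' "$(date +%Y-%m-%d)"
    printf 'wordCount: %s\n' "$word_count"
    printf '%s\n\n# %s\n\n' '---' "$title"
    cat "$content_file"
  } > "$path"
  printf '%s\n' "$path"   # Step 5: report where the file was saved
}
```

The sketch writes field values unquoted, so a title containing a colon would need YAML quoting before the frontmatter is parsed by stricter tools.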
CLI Reference
```bash
defuddle parse <source> [options]
```
Arguments:
- `<source>`: URL (`https://...`) or local HTML file path

Options:

| Flag | Description |
|---|---|
| `-m`, `--markdown` | Convert content to Markdown |
| `-j`, `--json` | Output as JSON with full metadata |
| `-o`, `--output <file>` | Write to file instead of stdout |
| `-p`, `--property <name>` | Extract single property (title, description, domain, author, published, wordCount, content) |
| `--debug` | Verbose logging |
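As an illustration of the single-property and write-to-file options, two thin wrappers might look like the sketch below. The function names `get_property` and `save_markdown` are hypothetical, and the long flag spellings `--property`, `--markdown`, and `--output` are assumed here; confirm them with `defuddle parse --help` before relying on them.

```bash
# Hypothetical wrappers; flag spellings are assumptions to confirm with `defuddle parse --help`.

# Print one metadata property (e.g. title or wordCount) from a URL or local HTML file.
get_property() {
  local src="$1" name="$2"
  defuddle parse "$src" --property "$name"
}

# Convert a page to Markdown and write it to a file instead of stdout.
save_markdown() {
  local src="$1" out_file="$2"
  defuddle parse "$src" --markdown --output "$out_file"
}
```

For example, `get_property page.html title` would print only the title, which is lighter than parsing the full JSON when a single field is needed.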
JSON Response Fields
When using `-j`, the response includes:
- `title`: Article title
- `author`: Author name
- `published`: Publication date
- `description`: Meta description
- `content`: Extracted Markdown (when `-m` is used)
- `domain`: Source domain
- `favicon`: Favicon URL
- `image`: Featured image URL
- `site`: Site name
- `wordCount`: Word count
- `parseTime`: Processing time in ms
Notes
- Requires Node.js and npm
- `jsdom` is required as a peer dependency
- Works best with article-style pages (blogs, news, documentation)
- Not designed for SPAs or JavaScript-heavy pages (e.g. WeChat articles need browser rendering)
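The runtime requirements above can be checked up front with a small helper. This is a sketch only: `missing_tools` is a hypothetical name, and it checks commands on `PATH`, so it does not verify that the `jsdom` package itself is installed.

```bash
# Hypothetical helper: print each named command that is not on PATH.
# Returns non-zero if at least one is missing.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      status=1
    fi
  done
  return "$status"
}
```

A typical call would be `missing_tools node npm defuddle || echo "Install the tools listed above first"`.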