defuddle

Defuddle - Web Content Extraction

Extract main article content from web pages, removing ads, sidebars, navigation, and other clutter. Output clean Markdown with metadata.

Prerequisites

Before first use, check if `defuddle` is installed:

bash

command -v defuddle >/dev/null 2>&1 || npm install -g defuddle jsdom

Default Workflow

When user provides a URL, follow this workflow:

Step 1: Extract content as Markdown + JSON metadata

Always use both the `-m` and `-j` flags to get Markdown content with full metadata:

bash

defuddle parse "<url>" -m -j

Step 2: Present a summary to the user

Show the user:

- Title: from the JSON `title` field
- Author: from the JSON `author` field
- Source: domain
- Word count: from the JSON `wordCount` field
- A brief preview (first 2-3 sentences)
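
The summary fields above could also be collected without parsing JSON, by asking for one property at a time with the `-p` flag listed in the CLI reference. The sketch below is illustrative only: the function name `summarize` is hypothetical, and it assumes `-p <name>` prints the bare property value on stdout. It also calls `defuddle` once per field, so the page is processed several times.

```bash
# Hypothetical helper: print a short summary for one URL.
# Assumes `defuddle parse <url> -p <name>` prints only the property value.
summarize() {
  local url=$1
  printf 'Title: %s\n' "$(defuddle parse "$url" -p title)"
  printf 'Author: %s\n' "$(defuddle parse "$url" -p author)"
  printf 'Source: %s\n' "$(defuddle parse "$url" -p domain)"
  printf 'Word count: %s\n' "$(defuddle parse "$url" -p wordCount)"
}
```

Calling `summarize "<url>"` prints one line per field. The preview sentences are not covered here and would still come from the extracted content.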

Step 3: Ask where to save

If this is the first time using defuddle in this conversation, ask the user:

"Save to which directory? (e.g. `~/Documents`, `~/Desktop`, or a custom path)"

Remember the user's chosen directory for subsequent uses in the same conversation.
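
One practical detail when handling the answer: a path such as `~/Documents` that arrives as a quoted string is not tilde-expanded by the shell. A minimal sketch of normalizing it follows. The function name `resolve_save_dir` and the choice to create the directory if it is missing are assumptions made for illustration, not part of defuddle.

```bash
# Hypothetical helper: expand a leading "~" and make sure the directory exists.
resolve_save_dir() {
  local dir=$1
  # A quoted "~" is not expanded by the shell, so replace it by hand.
  case $dir in
    "~") dir=$HOME ;;
    "~/"*) dir=$HOME/${dir#"~/"} ;;
  esac
  # Create the directory if needed, then print the resolved path.
  mkdir -p -- "$dir" && printf '%s\n' "$dir"
}
```

Storing the result once, for example `SAVE_DIR=$(resolve_save_dir '~/Documents')`, is one way to reuse the same directory for later saves.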

Step 4: Save as Markdown file

Write the file with frontmatter + full content:

markdown

---
title: {title}
author: {author}
source: {url}
date: {published or "Unknown"}
clipped: {today's date YYYY-MM-DD}
wordCount: {wordCount}
---

{title}

{markdown content}
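
As a rough illustration, the template above can be filled in with a here-document. The function name `write_clip` and its argument order are hypothetical. The sketch assumes the metadata values have already been extracted, and it does no YAML quoting, so a title containing a colon may need quoting by hand.

```bash
# Hypothetical helper: write frontmatter plus content to a Markdown file.
# Arguments: output path, title, author, url, published date, word count, body.
write_clip() {
  local out=$1 title=$2 author=$3 url=$4 published=$5 word_count=$6 body=$7
  # ${published:-Unknown} falls back to "Unknown" when no date was extracted.
  cat > "$out" <<EOF
---
title: $title
author: $author
source: $url
date: ${published:-Unknown}
clipped: $(date +%Y-%m-%d)
wordCount: $word_count
---

$title

$body
EOF
}
```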


**File naming**: Use the article title as filename, sanitized for filesystem:
- Replace special characters with spaces
- Trim whitespace
- Example: `The Shape of the Essay Field.md`
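
The naming rules above can be sketched with `sed`. This is a minimal sketch under assumptions: the set of "special characters" (`/ \ : * ? " < > |`) is a guess at what is unsafe on common filesystems, collapsing repeated spaces is an extra step not stated in the rules, and the function name `sanitize_title` is hypothetical.

```bash
# Hypothetical helper: turn an article title into a filesystem-safe filename.
sanitize_title() {
  local name
  name=$(printf '%s\n' "$1" |
    sed -e 's#[/\\:*?"<>|]# #g' \
        -e 's/  */ /g' \
        -e 's/^ *//' \
        -e 's/ *$//')
  # Steps above: replace assumed-unsafe characters with spaces,
  # collapse runs of spaces, then trim leading and trailing spaces.
  printf '%s.md\n' "$name"
}
```

For the example title, `sanitize_title 'The Shape of the Essay Field'` prints `The Shape of the Essay Field.md`.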

Step 5: Confirm to user

Tell the user the file path where it was saved.

CLI Reference

bash

defuddle parse <source> [options]

Arguments:

`<source>`: URL (`https://...`) or local HTML file path

Options:

| Flag | Description |
| --- | --- |
| `-m, --markdown` | Convert content to Markdown |
| `-j, --json` | Output as JSON with full metadata |
| `-o, --output <file>` | Write to file instead of stdout |
| `-p, --property <name>` | Extract single property (title, description, domain, author, published, wordCount, content) |
| `--debug` | Verbose logging |
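
As an illustration of combining these options, the sketch below wraps a Markdown conversion that writes straight to a file. The function name `clip_to_file` is hypothetical, and it assumes `-m` and `-o` can be used together in one invocation, which the table implies but does not state.

```bash
# Hypothetical wrapper: convert a URL or local HTML file to Markdown
# and write the result to a file instead of stdout.
clip_to_file() {
  local source=$1 out=$2
  defuddle parse "$source" -m -o "$out"
}
```

A call such as `clip_to_file page.html page.md` would then leave the extracted Markdown in `page.md`.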

JSON Response Fields

When using `-j`, the response includes:

- `title`: Article title
- `author`: Author name
- `published`: Publication date
- `description`: Meta description
- `content`: Extracted Markdown (when `-m` is used)
- `domain`: Source domain
- `favicon`: Favicon URL
- `image`: Featured image URL
- `site`: Site name
- `wordCount`: Word count
- `parseTime`: Processing time in ms

Notes

- Requires Node.js and npm
- `jsdom` is required as a peer dependency
- Works best with article-style pages (blogs, news, documentation)
- Not designed for SPAs or JavaScript-heavy pages (e.g. WeChat articles need browser rendering)
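
The tool requirements above can be checked up front with `command -v`, the same test the Prerequisites step uses. The sketch below is a generic helper: the name `missing_tools` is hypothetical, and it only checks that a command is on `PATH`, not that a Node module such as `jsdom` is installed.

```bash
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example use (commented out so the block only defines the function):
# missing=$(missing_tools node npm defuddle)
# [ -z "$missing" ] || echo "Missing: $missing"
```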
