Defuddle - Web Content Extraction

Extract main article content from web pages, removing ads, sidebars, navigation, and other clutter. Output clean Markdown with metadata.

Prerequisites

Before first use, check whether `defuddle` is installed, and install it together with `jsdom` if it is missing:

```bash
command -v defuddle >/dev/null 2>&1 || npm install -g defuddle jsdom
```
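
The same availability check can be scripted when the workflow is driven from code rather than a shell. This is a minimal Python sketch, assuming the tools are expected on `PATH`; the helper name `missing_tools` is hypothetical and not part of defuddle.

```python
import shutil


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # node and npm are needed to install and run defuddle.
    missing = missing_tools(["node", "npm", "defuddle"])
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools are available.")
```

`shutil.which` is the standard-library counterpart of `command -v`: it returns `None` when no executable with that name is found.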

Default Workflow

When the user provides a URL, follow this workflow:

Step 1: Extract content as Markdown + JSON metadata

Always pass both the `-m` and `-j` flags to get Markdown content with full metadata:

```bash
defuddle parse "<url>" -m -j
```
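
If the extraction step is called from a script, it might look like the sketch below. It assumes that the CLI prints only the JSON document to stdout when `-j` is given and that a 60-second timeout is acceptable; the function names `build_parse_command` and `extract` are hypothetical.

```python
import json
import subprocess


def build_parse_command(source):
    """Build the argv list for `defuddle parse <source> -m -j`."""
    return ["defuddle", "parse", source, "-m", "-j"]


def extract(source, timeout=60):
    """Run defuddle and return the parsed JSON object.

    Raises subprocess.CalledProcessError if the CLI exits non-zero.
    Assumes stdout contains nothing but the JSON document.
    """
    completed = subprocess.run(
        build_parse_command(source),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return json.loads(completed.stdout)
```

Passing the command as a list rather than a single string avoids shell quoting problems with URLs that contain `&` or `?`.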

Step 2: Present a summary to the user

Show the user:
  • Title: from the JSON `title` field
  • Author: from the JSON `author` field
  • Source: the domain, from the JSON `domain` field
  • Word count: from the JSON `wordCount` field
  • A brief preview (first 2-3 sentences)
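
The summary above can be assembled from the parsed JSON roughly as follows. This is a sketch, not a prescribed implementation: the `"Unknown"` fallback for missing fields and the naive sentence splitter are assumptions, and the function names are hypothetical.

```python
import re


def preview(markdown, max_sentences=3):
    """Return roughly the first few sentences of the first prose paragraph."""
    for block in markdown.split("\n\n"):
        text = " ".join(block.split())
        # Skip empty blocks, headings, and images; they make poor previews.
        if not text or text.startswith(("#", "![")):
            continue
        # Naive split: a sentence ends at ., !, or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        return " ".join(sentences[:max_sentences])
    return ""


def summarize(result):
    """Format the Step 2 summary from a parsed defuddle JSON object."""
    return "\n".join([
        f"Title: {result.get('title') or 'Unknown'}",
        f"Author: {result.get('author') or 'Unknown'}",
        f"Source: {result.get('domain') or 'Unknown'}",
        f"Word count: {result.get('wordCount', 'Unknown')}",
        f"Preview: {preview(result.get('content') or '')}",
    ])
```

The sentence splitter will mis-split abbreviations such as "e.g."; for a short preview that is usually tolerable.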

Step 3: Ask where to save

If this is the first time defuddle is used in this conversation, ask the user:
"Save to which directory? (e.g. `~/Documents`, `~/Desktop`, or a custom path)"
Remember the chosen directory and reuse it for the rest of the conversation.
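
One way to model "ask once, then remember" is a small object that caches the answer for the conversation. The class name `SaveLocation` and the `ask` callback are hypothetical; how the question actually reaches the user depends on the host environment.

```python
from pathlib import Path


class SaveLocation:
    """Remembers the save directory for the lifetime of one conversation."""

    def __init__(self, ask):
        # `ask` is any callable that takes a prompt and returns the user's reply.
        self._ask = ask
        self._directory = None

    def get(self):
        """Ask on first use only; return the cached directory afterwards."""
        if self._directory is None:
            reply = self._ask(
                "Save to which directory? "
                "(e.g. ~/Documents, ~/Desktop, or a custom path)"
            )
            # Expand "~" so the path can be used directly for file writes.
            self._directory = Path(reply.strip()).expanduser()
        return self._directory
```

Expanding `~` matters because file APIs generally do not interpret it; only shells do.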

Step 4: Save as Markdown file

Write the file with frontmatter + full content:

```markdown
---
title: {title}
author: {author}
source: {url}
date: {published or "Unknown"}
clipped: {today's date YYYY-MM-DD}
wordCount: {wordCount}
---

# {title}

{markdown content}
```

**File naming**: Use the article title as the filename, sanitized for the filesystem:
- Replace special characters with spaces
- Trim whitespace
- Example: `The Shape of the Essay Field.md`
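
The template and naming rules above could be implemented along these lines. Several details are assumptions rather than part of the source: the exact set of "special" characters, the `Untitled` fallback, the empty-string fallback for a missing author, and rendering the title line as a level-1 Markdown heading. Both function names are hypothetical.

```python
import re
from datetime import date


def sanitize_filename(title):
    """Turn an article title into a filesystem-safe Markdown filename."""
    # Assumed set of "special" characters: path separators, characters
    # reserved on Windows, and control characters.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', " ", title)
    # Collapse runs of whitespace, then trim both ends.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return f"{cleaned or 'Untitled'}.md"


def render_document(result, url, today=None):
    """Build frontmatter plus body from a parsed defuddle JSON object."""
    today = today or date.today()
    title = result.get("title") or "Untitled"
    lines = [
        "---",
        f"title: {title}",
        f"author: {result.get('author') or ''}",
        f"source: {url}",
        f"date: {result.get('published') or 'Unknown'}",
        f"clipped: {today.isoformat()}",
        f"wordCount: {result.get('wordCount', '')}",
        "---",
        "",
        f"# {title}",  # assumption: title rendered as a level-1 heading
        "",
        result.get("content") or "",
    ]
    return "\n".join(lines) + "\n"
```

The sketch writes frontmatter values verbatim, as the template does. A title that contains `: ` may need quoting before a strict YAML parser will accept it.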

Step 5: Confirm to user

Tell the user the full path of the saved file.

CLI Reference

```bash
defuddle parse <source> [options]
```

Arguments:
  • `<source>`: URL (`https://...`) or local HTML file path

Options:

| Flag | Description |
| --- | --- |
| `-m, --markdown` | Convert content to Markdown |
| `-j, --json` | Output as JSON with full metadata |
| `-o, --output <file>` | Write to file instead of stdout |
| `-p, --property <name>` | Extract a single property (`title`, `description`, `domain`, `author`, `published`, `wordCount`, `content`) |
| `--debug` | Verbose logging |
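
As an illustration of how the `-p` and `-o` flags combine, the sketch below builds an argument list for a single-property extraction. The tuple `DOCUMENTED_PROPERTIES` only mirrors the names listed above; the CLI itself may accept others, so the validation is an assumption. The function name is hypothetical.

```python
# Property names as listed in the options table; the CLI may accept more.
DOCUMENTED_PROPERTIES = (
    "title", "description", "domain", "author",
    "published", "wordCount", "content",
)


def build_property_command(source, name, output=None):
    """Build argv for `defuddle parse <source> -p <name> [-o <file>]`."""
    if name not in DOCUMENTED_PROPERTIES:
        raise ValueError(f"unsupported property: {name!r}")
    command = ["defuddle", "parse", source, "-p", name]
    if output is not None:
        # -o writes the result to a file instead of stdout.
        command += ["-o", output]
    return command
```

For example, `build_property_command("page.html", "title")` corresponds to running `defuddle parse page.html -p title`.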

JSON Response Fields

When `-j` is used, the response includes:
  • `title`: Article title
  • `author`: Author name
  • `published`: Publication date
  • `description`: Meta description
  • `content`: Extracted Markdown (when `-m` is used)
  • `domain`: Source domain
  • `favicon`: Favicon URL
  • `image`: Featured image URL
  • `site`: Site name
  • `wordCount`: Word count
  • `parseTime`: Processing time in ms
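
For code that consumes this response, a typed container makes the field names explicit. The sketch below is an assumption about types (strings for text fields, numbers for `wordCount` and `parseTime`) and treats every field as optional, since pages often lack some metadata; the class name `DefuddleResult` is hypothetical.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DefuddleResult:
    """Typed view of the documented fields; all are treated as optional."""
    title: Optional[str] = None
    author: Optional[str] = None
    published: Optional[str] = None
    description: Optional[str] = None
    content: Optional[str] = None
    domain: Optional[str] = None
    favicon: Optional[str] = None
    image: Optional[str] = None
    site: Optional[str] = None
    wordCount: Optional[int] = None
    parseTime: Optional[float] = None

    @classmethod
    def from_json(cls, text):
        """Parse CLI output, ignoring any keys not declared on this class."""
        raw = json.loads(text)
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})
```

Ignoring unknown keys keeps the parser working if a later version of the tool adds fields.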

Notes

  • Requires Node.js and npm
  • `jsdom` is required as a peer dependency
  • Works best with article-style pages (blogs, news, documentation)
  • Not designed for SPAs or JavaScript-heavy pages (e.g. WeChat articles need browser rendering)