download-webpage-as-pdf

Download a webpage as a PDF (agent-browser recipe)

The naive approaches fail on modern sites:

- `chrome --headless --print-to-pdf` captures only the initial viewport's images. Anything below the fold renders as a blank rectangle.
- `agent-browser pdf` immediately after `open` has the same problem: lazy-loaded images haven't decoded yet.
- Scrolling via JS and then waiting a fixed time is also unreliable; you don't know when each image actually finished loading.

The fix is one async script that strips lazy-load attributes, scrolls the page to trigger any IntersectionObserver-based loaders, and `await`s every `<img>` until it decodes. agent-browser's `eval` waits for the returned promise to resolve before exiting, so the subsequent `pdf` command sees a fully loaded DOM.

The recipe

If multiple test/agent runs may share the host's agent-browser, isolate each invocation with `agent-browser --session <unique-name> ...` on every command in the pipeline. Single-user one-off captures can omit the flag and use the default session.
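A minimal sketch of the isolation pattern. The session-name scheme below (PID + timestamp) is just one example; any collision-resistant string works:

```bash
# Derive a unique session name per invocation and reuse it on every
# command in the pipeline. Here we only print the commands (dry run).
SESSION="pdf-capture-$$-$(date +%s)"

for cmd in "open <URL>" "wait --load networkidle" "pdf /tmp/page.pdf" "close"; do
  echo "agent-browser --session $SESSION $cmd"
done
```

The point is that the same `--session` value must appear on every command; a command that drops the flag silently targets the default browser context instead.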
Set `AGENT_BROWSER_HEADED=false` in the environment before running so the skill launches headless even when the host's `~/.agent-browser/config.json` defaults to `"headed": true`. This avoids popping a real Chrome window on the user's desktop while an agent is working in the background. Do NOT use the CLI's `--headed false` flag: in agent-browser 0.26.0 it parses but corrupts the session context (subsequent commands see an empty document). The env var is the supported route. To watch the run for debugging, unset the variable or pass `--headed` instead.
```bash
export AGENT_BROWSER_HEADED=false

agent-browser open <URL>
agent-browser wait --load networkidle

agent-browser eval "(async () => {
  const sleep = ms => new Promise(r => setTimeout(r, ms));
  ['#onetrust-banner-sdk','#onetrust-consent-sdk','.ot-sdk-container','#ot-sdk-btn-floating','[id*=cookie]','[id*=consent]','[id*=onetrust]'].forEach(s => document.querySelectorAll(s).forEach(e => e.remove()));
  document.querySelectorAll('img').forEach(img => {
    img.removeAttribute('loading');
    img.removeAttribute('decoding');
    if (img.dataset.src) img.src = img.dataset.src;
    if (img.dataset.srcset) img.srcset = img.dataset.srcset;
  });
  for (let y = 0; y < document.documentElement.scrollHeight + 2000; y += 400) {
    window.scrollTo(0, y);
    await sleep(200);
  }
  window.scrollTo(0, document.documentElement.scrollHeight);
  await sleep(2000);
  await Promise.all(Array.from(document.images).map(i =>
    i.complete && i.naturalWidth ? null
    : new Promise(r => { i.addEventListener('load', r, {once:true}); i.addEventListener('error', r, {once:true}); setTimeout(r, 5000); })
  ));
  window.scrollTo(0, 0);
  await sleep(500);
  return Array.from(document.images).filter(i => !i.naturalWidth).length;
})()"

agent-browser pdf /tmp/page.pdf
agent-browser close
```

Verify the result

```bash
pdfinfo /tmp/page.pdf | grep -E "Pages|File size"
```

The `eval` returns the count of images that still failed to load. Expect `0`. If non-zero, the recipe didn't fully capture the page - investigate before trusting the PDF. The `pdfinfo` line is your standard end-of-recipe report (page count + bytes) so the agent has concrete numbers to relay back.
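If the agent needs those numbers programmatically rather than for display, the pdfinfo output parses cleanly with awk. The sample below mimics poppler-utils' `pdfinfo` layout; the field positions are the assumption here — on a real run, pipe `pdfinfo /tmp/page.pdf` in place of the sample:

```bash
# Sample pdfinfo output stood in for a real `pdfinfo /tmp/page.pdf` run
sample='Pages:          12
File size:      1843200 bytes'

pages=$(printf '%s\n' "$sample" | awk '/^Pages:/ {print $2}')
bytes=$(printf '%s\n' "$sample" | awk '/^File size:/ {print $3}')
echo "captured $pages pages, $bytes bytes"
```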

Why each step matters

- `wait --load networkidle` before the eval gives the page a chance to attach its IntersectionObservers and other JS hooks. Scrolling before observers attach defeats the trigger.
- Removing the `loading` attribute is the structural fix. This is the same trick percollate uses internally; it is the most reliable way to make Chromium eagerly fetch every image.
- Scrolling the full height in 400px steps triggers any observer-based loaders that watch for elements crossing the viewport. Some sites use observers even after `loading=lazy` is removed.
- `await Promise.all` on every `<img>` guarantees decoded pixels are in memory before the eval returns. agent-browser's `eval` is promise-aware: the next command (`pdf`) will not run until this resolves.
- Returning the broken-image count is your verification. If it is not 0, the recipe did not fully capture the page; do not trust the PDF.

Cleanup pipeline (optional but recommended)

agent-browser saves at letter size with the page's full footer (nav, newsletter signup, link sitemap). For a clean archive:

1. Inspect total page count and visually identify which trailing pages are footer

```bash
pdfinfo /tmp/page.pdf | grep Pages
```

Use the Read tool on the PDF to see the last few pages and decide where the article ends.

2. Trim footer pages. The "1-9" below is illustrative ONLY; replace it with the article's actual range from pdfinfo + visual inspection. Do not copy this verbatim.

```bash
qpdf /tmp/page.pdf --pages . 1-9 -- /tmp/page-trimmed.pdf
```
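One way to avoid copying the illustrative range by accident is to compute it from the counts you just gathered. The numbers below are placeholders, not measurements:

```bash
# total: page count from pdfinfo; footer: trailing pages identified visually
total=12
footer=3
last=$((total - footer))
range="1-$last"
echo "qpdf /tmp/page.pdf --pages . $range -- /tmp/page-trimmed.pdf"
```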

3. Compress (typically 60-70% reduction with images intact)

```bash
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile=/path/to/final.pdf /tmp/page-trimmed.pdf
```

`-dPDFSETTINGS=/ebook` keeps images legible at roughly half the raw size. `/screen` is smaller but blurrier; use only when size matters more than legibility.
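To confirm the claimed reduction on a real run, compare byte counts before and after. The counts below are illustrative; on Linux, `stat -c%s <file>` gives the real ones:

```bash
before=1843200   # e.g. stat -c%s /tmp/page-trimmed.pdf
after=614400     # e.g. stat -c%s /path/to/final.pdf
pct=$(( (before - after) * 100 / before ))
echo "compressed: ${pct}% smaller"
```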

When NOT to use this

- Reader-mode / article-only output (no nav, no footer, no manual trimming; just the article body and its images): use `npx percollate pdf <URL> -o out.pdf` instead. percollate runs Mozilla Readability + Puppeteer and auto-strips chrome. It also handles lazy-load via DOM preprocessing rather than scroll hacks. Slower (~10s) but zero trimming work and very robust on unknown URLs. Pair with `--no-hyphenate --css "@page { size: A4; margin: 18mm; } a[href]::after { content: none !important; }"` to suppress the inlined-URL-after-link default.
- Single-page screenshot (not paginated): use `agent-browser screenshot --full-page` instead.
- HTML archive with all assets bundled: use `monolith` or the `single-file` CLI. PDFs lose interactivity.
- Server-rendered HTML you control (no JS): WeasyPrint is faster and simpler.

Picking between agent-browser and percollate

When the user does not specify, default to percollate for "save this for archival reference" - it's idiot-proof and handles unknown URLs without manual trimming. Use agent-browser (this skill) when the user explicitly wants the page to look like the browser version, or when the page is not article-shaped (a tool, dashboard, marketing page, portfolio) where Mozilla Readability would strip out content the user actually wants.
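The rule above reduces to two questions, sketched here as a helper function (the predicate names and yes/no encoding are invented for illustration):

```bash
# wants_browser_look: user asked for the page "as the browser shows it"
# article_shaped: the page is an article Readability can extract cleanly
pick_tool() {
  wants_browser_look="$1"; article_shaped="$2"
  if [ "$wants_browser_look" = yes ] || [ "$article_shaped" = no ]; then
    echo agent-browser
  else
    echo percollate
  fi
}

pick_tool no yes    # archival article, no preference -> percollate
pick_tool yes yes   # user wants the browser look   -> agent-browser
pick_tool no no     # dashboard / tool page         -> agent-browser
```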