download-webpage-as-pdf
Download a webpage as a PDF (agent-browser recipe)
The naive approaches fail on modern sites:
- `chrome --headless --print-to-pdf` captures only the initial viewport's images. Anything below the fold renders as a blank rectangle.
- `agent-browser pdf` immediately after `open` has the same problem - lazy-loaded images haven't decoded yet.
- Scrolling via JS and then waiting a fixed time is also unreliable - you don't know when each image actually finished.

The fix is one async script that strips lazy-load attributes, scrolls the page to trigger any IntersectionObserver-based loaders, and `await`s every `<img>` to decode. agent-browser's `eval` waits for the returned promise to resolve before exiting, so the subsequent `pdf` command sees a fully-loaded DOM.
The recipe
If multiple test/agent runs may share the host's agent-browser, isolate each invocation with `agent-browser --session <unique-name> ...` on every command in the pipeline. Single-user one-off captures can omit the flag and use the default session.

Set `AGENT_BROWSER_HEADED=false` in the environment before running so the skill launches headless even when the host's `~/.agent-browser/config.json` defaults to `"headed": true`. This avoids popping a real Chrome window on the user's desktop while an agent is working in the background. Do NOT use the CLI's `--headed false` flag - in agent-browser 0.26.0 it parses but corrupts the session context (subsequent commands see an empty document). The env var is the supported route. To watch the run for debugging, unset the variable or pass `--headed` instead.

```bash
export AGENT_BROWSER_HEADED=false
agent-browser open <URL>
agent-browser wait --load networkidle
agent-browser eval "(async () => {
  const sleep = ms => new Promise(r => setTimeout(r, ms));
  ['#onetrust-banner-sdk','#onetrust-consent-sdk','.ot-sdk-container','#ot-sdk-btn-floating','[id*=cookie]','[id*=consent]','[id*=onetrust]'].forEach(s => document.querySelectorAll(s).forEach(e => e.remove()));
  document.querySelectorAll('img').forEach(img => {
    img.removeAttribute('loading');
    img.removeAttribute('decoding');
    if (img.dataset.src) img.src = img.dataset.src;
    if (img.dataset.srcset) img.srcset = img.dataset.srcset;
  });
  for (let y = 0; y < document.documentElement.scrollHeight + 2000; y += 400) {
    window.scrollTo(0, y);
    await sleep(200);
  }
  window.scrollTo(0, document.documentElement.scrollHeight);
  await sleep(2000);
  await Promise.all(Array.from(document.images).map(i =>
    i.complete && i.naturalWidth ? null
      : new Promise(r => { i.addEventListener('load', r, {once:true}); i.addEventListener('error', r, {once:true}); setTimeout(r, 5000); })
  ));
  window.scrollTo(0, 0);
  await sleep(500);
  return Array.from(document.images).filter(i => !i.naturalWidth).length;
})()"
agent-browser pdf /tmp/page.pdf
agent-browser close
```
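For the shared-host case, the session plumbing can be sketched like this. The `session_name` helper is hypothetical (not part of agent-browser); uniqueness comes from the shell's PID plus a timestamp, and the generated name must be passed via `--session` on every command in the pipeline:

```bash
# Hypothetical helper: one unique session name per invocation, so parallel
# agent runs never share browser state.
session_name() {
  printf 'pdf-capture-%s-%s\n' "$$" "$(date +%s)"
}
SESSION="$(session_name)"
echo "$SESSION"
# Then thread it through every command, e.g.:
#   agent-browser --session "$SESSION" open <URL>
#   agent-browser --session "$SESSION" pdf /tmp/page.pdf
#   agent-browser --session "$SESSION" close
```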
Verify the result
```bash
pdfinfo /tmp/page.pdf | grep -E "Pages|File size"
```

The `eval` returns the count of images that still failed to load. Expect `0`. If non-zero, the recipe didn't fully capture the page - investigate before trusting the PDF. The `pdfinfo` line is your standard end-of-recipe report (page count + bytes) so the agent has concrete numbers to relay back.
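For unattended runs, the count check can be turned into a hard gate. A minimal sketch - the `broken_count` stub stands in for the real `agent-browser eval` invocation, whose printed return value you would capture instead:

```bash
# Stub standing in for: agent-browser eval "...recipe script..."
# The real call prints the script's return value (the broken-image count).
broken_count() { echo 0; }

BROKEN="$(broken_count)"
if [ "$BROKEN" -ne 0 ]; then
  echo "WARNING: $BROKEN images never loaded - do not trust the PDF" >&2
  exit 1
fi
echo "all images loaded"
```

Exiting non-zero here lets a calling script or CI step abort before shipping a half-rendered PDF.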
Why each step matters
- `wait --load networkidle` before the eval gives the page a chance to attach its IntersectionObservers and other JS hooks. Scrolling before observers attach defeats the trigger.
- Removing the `loading` attribute is the structural fix. This is the same trick percollate uses internally - the most reliable way to make Chromium eagerly fetch every image.
- Scrolling the full height in 400px steps triggers any observer-based loaders that watch for elements crossing the viewport. Some sites use observers even after `loading=lazy` is removed.
- `await Promise.all` on every `<img>` guarantees decoded pixels are in memory before the eval returns. agent-browser's `eval` is promise-aware - the next command (`pdf`) will not run until this resolves.
- Returning the broken-image count is your verification. If it is not 0, the recipe did not fully capture the page - do not trust the PDF.
Cleanup pipeline (optional but recommended)
agent-browser saves at letter size with the page's full footer (nav, newsletter signup, link sitemap). For a clean archive:

```bash
# 1. Inspect total page count and visually identify which trailing pages are footer
pdfinfo /tmp/page.pdf | grep Pages
# Use the Read tool on the PDF to see the last few pages and decide where the article ends.

# 2. Trim footer pages. The "1-9" below is illustrative ONLY - replace with the
#    article's actual range from pdfinfo + visual inspection. Do not copy this verbatim.
qpdf /tmp/page.pdf --pages . 1-9 -- /tmp/page-trimmed.pdf

# 3. Compress (typically 60-70% reduction with images intact)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile=/path/to/final.pdf /tmp/page-trimmed.pdf
```

`-dPDFSETTINGS=/ebook` keeps images legible at roughly half the raw size. `/screen` is smaller but blurrier; use only when size matters more than legibility.
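The range arithmetic in step 2 can be factored into a tiny helper so nothing hard-codes "1-9". The `keep_range` name is hypothetical; deciding how many trailing pages are footer is still the manual inspection step:

```bash
# Hypothetical helper: turn "total pages" and "trailing footer pages to drop"
# into a qpdf page range.
keep_range() {
  local total=$1 footer=$2
  echo "1-$(( total - footer ))"
}

keep_range 12 3   # a 12-page PDF whose last 3 pages are footer
```

For a 12-page PDF with 3 footer pages this emits `1-9`, the same shape as the illustrative range in step 2, and the result can be substituted directly: `qpdf /tmp/page.pdf --pages . "$(keep_range 12 3)" -- /tmp/page-trimmed.pdf`.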
When NOT to use this
- Reader-mode / article-only output (no nav, no footer, no manual trimming, just the article body and its images): use `npx percollate pdf <URL> -o out.pdf` instead. percollate runs Mozilla Readability + Puppeteer and auto-strips chrome. It also handles lazy-load via DOM preprocessing rather than scroll hacks. Slower (~10s) but zero trimming work and very robust on unknown URLs. Pair with `--no-hyphenate --css "@page { size: A4; margin: 18mm; } a[href]::after { content: none !important; }"` to suppress the inlined-URL-after-link default.
- Single-page screenshot (not paginated): use `agent-browser screenshot --full-page` instead.
- HTML archive with all assets bundled: use `monolith` or `single-file` CLI. PDFs lose interactivity.
- Server-rendered HTML you control (no JS): WeasyPrint is faster and simpler.
Picking between agent-browser and percollate
When the user does not specify, default to percollate for "save this for archival reference" - it's idiot-proof and handles unknown URLs without manual trimming. Use agent-browser (this skill) when the user explicitly wants the page to look like the browser version, or when the page is not article-shaped (a tool, dashboard, marketing page, portfolio) where Mozilla Readability would strip out content the user actually wants.
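That default reduces to a one-case dispatch. A sketch only - the `pick_tool` function and its mode labels are hypothetical, and the real decision ("is this page article-shaped?") still needs judgment:

```bash
# Hypothetical dispatcher for the rule above: article-shaped pages default
# to percollate; everything else (tools, dashboards, marketing pages,
# portfolios, or an explicit "make it look like the browser") gets the
# agent-browser recipe.
pick_tool() {
  case "$1" in
    article)  echo "percollate" ;;
    *)        echo "agent-browser" ;;
  esac
}

pick_tool article
pick_tool dashboard
```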