Scraping
Web scraping using nu-shell and browser tools for data extraction.
Prerequisites
- nu-shell installed (`nu`)
- `query web` plugin installed (for HTML scraping): `nu -c "plugin add ~/.cargo/bin/nu_plugin_query"` (`plugin add` takes the path to the `nu_plugin_query` binary; the path shown assumes a `cargo install nu_plugin_query` build, so adjust it to your install)
- Browser extension enabled (for dynamic content): enable the `browser` extension in your agent configuration
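The prerequisites above can be checked before a scraping run. The following is a minimal sketch, not part of the original skill: the function names `have_cmd` and `check_prereqs` are hypothetical, and it assumes a nu version that provides `plugin list` with a `name` column.

```bash
# Return 0 if the named command is on PATH, 1 otherwise.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Print one status line per prerequisite; return nonzero if anything is missing.
check_prereqs() {
  local missing=0
  if have_cmd nu; then
    echo "ok: nu found"
    # Assumption: `plugin list` reports registered plugins in a `name` column.
    if nu -c 'plugin list | get name | str join " "' 2>/dev/null | grep -qw query; then
      echo "ok: query plugin registered"
    else
      echo "missing: query plugin"
      missing=1
    fi
  else
    echo "missing: nu"
    missing=1
  fi
  return "$missing"
}

check_prereqs || echo "Install the missing prerequisites before scraping."
```

The browser extension is configured in the agent rather than on the command line, so this sketch does not try to detect it.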
Common Tasks
Fetching Web Pages
Use `http get` to retrieve HTML content:
Simple GET request
nu -c 'http get https://example.com'
With headers
nu -c 'http get -H [User-Agent "My Scraper"] https://example.com'
HTML Parsing and Data Extraction
Use the `query web` plugin to parse HTML and extract data using CSS selectors. The selector is always passed with `-q`; `query web` takes no positional arguments.
Extract text from elements
nu -c 'http get https://example.com | query web -q "h1, h2" | flatten | str trim'
Extract attributes
nu -c 'http get https://example.com | query web -q "a" -a href'
Parse tables as structured data
nu -c 'http get https://example.com/table-page | query web --as-table ["Column1" "Column2"]'
Browser-Based Scraping for Dynamic Content
For websites requiring JavaScript execution or complex DOM interactions, use browser automation tools.
Start browser
start-browser
Navigate to page
navigate-browser --url https://example.com
Extract data with JavaScript evaluation
evaluate-javascript --code "Array.from(document.querySelectorAll('selector')).map(e => e.textContent)"
Screenshot for visual inspection
take-screenshot
Query HTML fragments
query-html-elements --selector ".content"
API Interactions
For JSON APIs, use `http get` and parse with `from json`. `http get` already parses responses served with a JSON content type, so pass `--raw` when piping to `from json`:
GET JSON API
nu -c 'http get --raw https://api.example.com/data | from json'
POST requests
nu -c 'http post https://api.example.com/submit -t application/json {key: value}'
Handling Authentication
Basic auth
nu -c 'http get -u username -p password https://api.example.com'
Bearer token
nu -c 'http get -H [Authorization "Bearer YOUR_TOKEN"] https://api.example.com'
Custom headers
nu -c 'http get -H [X-API-Key "YOUR_KEY" User-Agent "Scraper"] https://api.example.com'
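Writing a token directly into the command leaves it in shell history. One possible alternative, sketched below, passes the token and URL through environment variables that nu reads via `$env`. The function name `api_get` and the variable names `API_TOKEN` and `URL` are hypothetical choices for this example, not part of the original skill.

```bash
# Call a bearer-protected endpoint without embedding the token in the nu source.
# Usage: API_TOKEN=... api_get https://api.example.com/data
api_get() {
  if [ -z "${API_TOKEN:-}" ]; then
    echo "API_TOKEN is not set" >&2
    return 1
  fi
  # Both values are handed to nu through its environment and read with $env.
  API_TOKEN="$API_TOKEN" URL="$1" \
    nu -c 'http get -H [Authorization $"Bearer ($env.API_TOKEN)"] $env.URL'
}
```

Because the URL also travels through the environment, it is never spliced into the nu command string, which avoids quoting problems with unusual characters.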
Rate Limiting and Delays
Add delays between requests (the URLs are placeholders)
nu -c 'let urls = ["https://example.com/a" "https://example.com/b"]; $urls | each { |url| let page = (http get $url); sleep 1sec; $page }'
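When the pages should be saved to disk, the same delay pattern can be driven from bash. This is a rough sketch under stated assumptions: `fetch_page` and `fetch_politely` are hypothetical helper names, and the output layout (`page-1.html`, `page-2.html`, ...) is only an illustration.

```bash
# Fetch one URL with nu; the URL is passed through the environment.
fetch_page() {
  URL="$1" nu -c 'http get $env.URL'
}

# Fetch each URL in turn, pausing DELAY seconds between requests.
# Usage: fetch_politely DELAY OUTDIR URL...
fetch_politely() {
  local delay="$1" outdir="$2" i=0
  shift 2
  mkdir -p "$outdir"
  for url in "$@"; do
    i=$((i + 1))
    # Save each page to its own file; report failures but keep going.
    fetch_page "$url" > "$outdir/page-$i.html" || echo "failed: $url" >&2
    # Pause between requests, but not after the last one.
    if [ "$i" -lt "$#" ]; then
      sleep "$delay"
    fi
  done
}
```

A call such as `fetch_politely 1 ./pages https://example.com/a https://example.com/b` would write two files with a one-second pause between the requests.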
Parallel Processing
Scrape multiple pages in parallel (the URLs are placeholders)
nu -c 'let urls = ["https://example.com/a" "https://example.com/b"]; $urls | par-each { |url| http get $url | query web -q ".data" }'
One-liner Examples
Basic HTML Scraping
Extract all h1 titles
nu -c 'http get https://example.com | query web -q "h1"'
Get all links
nu -c 'http get https://example.com | query web -q "a" -a href'
Scrape product prices
nu -c 'http get https://store.example.com | query web -q ".price"'
HTML Scraping Example: Hacker News
Scrape HN front page titles and URLs
nu -c 'let html = (http get https://news.ycombinator.com/); $html | query web -q ".titleline > a" | flatten | zip ($html | query web -q ".titleline > a" -a href) | each { |pair| echo $"($pair.0) - ($pair.1)" }'
For static sites like HN, use `http get` directly. Reserve browser tools for dynamic content requiring JavaScript execution.
GitHub Stars Scraper
Get star count for a repo
nu -c 'http get https://api.github.com/repos/nushell/nushell | get stargazers_count'
API Data Extraction
Fetch JSON and extract fields (`--raw` keeps the body as text so `from json` can parse it)
nu -c 'http get --raw https://api.example.com/users | from json | get -i 0.name'
API Authentication
Bearer token
nu -c 'http get -H [Authorization "Bearer YOUR_TOKEN"] https://api.example.com/data'
API key
nu -c 'http get -H [X-API-Key "YOUR_API_KEY"] https://api.example.com/data'
Basic auth
nu -c 'http get -u username -p password https://api.example.com/protected'
Related Skills
- nu-shell: Core nu-shell scripting patterns and commands.
Related Tools
- start-browser: Start Cromite browser via Puppeteer.
- navigate-browser: Navigate to a URL in the browser.
- evaluate-javascript: Evaluate JavaScript code in the active browser tab.
- take-screenshot: Take a screenshot of the active browser tab.
- query-html-elements: Extract HTML elements by CSS selector.
- list-browser-tabs: List all open browser tabs with their titles and URLs.
- close-tab: Close a browser tab by index or title.
- switch-tab: Switch to a specific tab by index.
- refresh-tab: Refresh the current tab.
- current-url: Get the URL of the current active tab.
- page-title: Get the title of the current active tab.
- wait-for-element: Wait for a CSS selector to appear on the page.
- click-element: Click on an element by CSS selector.
- type-text: Type text into an input field.
- extract-text: Extract text content from elements by CSS selector.
- search-web: Perform web searches and extract information from search results.