Scraping

Web scraping using nu-shell and browser tools for data extraction.

Prerequisites

  • nu-shell installed (`nu`)
  • `query web` plugin installed (for HTML scraping). The command is provided by the nu_plugin_query binary, which `plugin add` registers by file path, for example: nu -c "plugin add ~/.cargo/bin/nu_plugin_query" (the path is an assumption; adjust it to wherever the binary is installed)
  • Browser extension enabled (for dynamic content): enable the `browser` extension in your agent configuration

Common Tasks

Fetching Web Pages

Use `http get` to retrieve HTML content:

Simple GET request

nu -c 'http get https://example.com'

With headers

nu -c 'http get -H [User-Agent "My Scraper"] https://example.com'

HTML Parsing and Data Extraction

Use the `query web` plugin to parse HTML and extract data using CSS selectors:

Extract text from elements

nu -c 'http get https://example.com | query web -q "h1, h2" | str trim'

Extract attributes (the selector is passed with -q, the attribute name with -a)

nu -c 'http get https://example.com | query web -q "a" -a href'

Parse tables as structured data

nu -c 'http get https://example.com/table-page | query web --as-table ["Column1" "Column2"]'
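
To make the two extraction modes above concrete (text of matched elements versus one attribute of matched elements), here is a minimal sketch in Python's standard library. It is only an illustration of the idea, not how `query web` is implemented; `TagCollector` and `SAMPLE` are hypothetical names, and the sketch matches plain tag names rather than full CSS selectors.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the text of chosen tags and one attribute of another tag."""

    def __init__(self, text_tags, attr_tag, attr_name):
        super().__init__()
        self.text_tags = set(text_tags)
        self.attr_tag = attr_tag
        self.attr_name = attr_name
        self.texts = []   # one string per matched text element
        self.attrs = []   # one value per matched attribute
        self._open = []   # stack of currently open text tags

    def handle_starttag(self, tag, attrs):
        if tag in self.text_tags:
            self._open.append(tag)
            self.texts.append("")
        if tag == self.attr_tag:
            value = dict(attrs).get(self.attr_name)
            if value is not None:
                self.attrs.append(value)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <em>) is kept as well.
        if self._open:
            self.texts[-1] += data


# Hypothetical page content, used instead of a live request.
SAMPLE = """
<html><body>
  <h1> Main title </h1>
  <p>See <a href="/docs">the docs</a> and <a href="https://example.com">home</a>.</p>
  <h2>Section <em>one</em></h2>
</body></html>
"""

collector = TagCollector(text_tags=["h1", "h2"], attr_tag="a", attr_name="href")
collector.feed(SAMPLE)
headings = [text.strip() for text in collector.texts]  # same role as `str trim`
links = collector.attrs

print(headings)
print(links)
```

Text extraction yields one entry per matched element, while attribute extraction yields one entry per element that actually carries the attribute, so the two lists can differ in length on real pages.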

Browser-Based Scraping for Dynamic Content

For websites requiring JavaScript execution or complex DOM interactions, use browser automation tools.

Start browser

start-browser

Navigate to page

navigate-browser --url https://example.com

Extract data with JavaScript evaluation

evaluate-javascript --code "Array.from(document.querySelectorAll('selector')).map(e => e.textContent)"

Screenshot for visual inspection

take-screenshot

Query HTML fragments

query-html-elements --selector ".content"

API Interactions

For JSON APIs, use `http get` and parse with `from json`. Because `http get` already converts a response served as JSON into structured data, pass `--raw` when piping into `from json`:

GET JSON API

nu -c 'http get --raw https://api.example.com/data | from json'

POST requests

nu -c 'http post https://api.example.com/submit -t application/json {key: value}'
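
The two directions involved here (raw JSON text to structured data for a GET, structured data to a JSON body for a POST) can be sketched in Python as follows. The response body and field names are invented for illustration; nothing is sent over the network.

```python
import json

# Hypothetical raw response body, as a JSON API might return it.
raw_body = '{"items": [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]}'

# GET side: parsing turns the JSON text into nested dicts and lists.
data = json.loads(raw_body)
names = [item["name"] for item in data["items"]]

# POST side: the payload goes the other way, structure -> JSON text -> bytes,
# and is sent together with a matching Content-Type header.
payload = {"key": "value"}
post_body = json.dumps(payload).encode("utf-8")
post_headers = {"Content-Type": "application/json"}

print(names)
print(post_body)
```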

Handling Authentication

Basic auth (user and password are separate flags)

nu -c 'http get -u username -p password https://api.example.com'

Bearer token

nu -c 'http get -H [Authorization "Bearer YOUR_TOKEN"] https://api.example.com'

Custom headers

nu -c 'http get -H [X-API-Key "YOUR_KEY" User-Agent "Scraper"] https://api.example.com'
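
Both basic auth and bearer tokens end up as an Authorization header on the request. A minimal Python sketch of how those header values are formed, with hypothetical helper names `basic_auth_header` and `bearer_auth_header`:

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """HTTP Basic auth: base64 of 'username:password' after the word Basic."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def bearer_auth_header(token: str) -> dict:
    """Bearer auth: the token is sent as-is after the word Bearer."""
    return {"Authorization": f"Bearer {token}"}


print(basic_auth_header("username", "password"))
print(bearer_auth_header("YOUR_TOKEN"))
```

Base64 is an encoding, not encryption, so basic auth credentials should only be sent over HTTPS.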

Rate Limiting and Delays

Add delays between requests ($urls stands for a list of URLs defined earlier in the same command or script; the closure returns the page so the results are not discarded)

nu -c '$urls | each { |url| let page = (http get $url); sleep 1sec; $page }'
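
The same fetch-then-pause loop can be sketched in Python. The `fetch` and `sleep` parameters are injected so the sketch runs without network access; `fetch_with_delay` is a hypothetical helper, and a fixed delay is only one simple politeness strategy.

```python
import time
from typing import Callable, Iterable, List


def fetch_with_delay(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch URLs one at a time, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause between requests, not before the first
        results.append(fetch(url))
    return results


# Usage with a stand-in fetcher in place of a real HTTP call.
pages = fetch_with_delay(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.01,
)
print(len(pages))
```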

Parallel Processing

Scrape multiple pages in parallel

nu -c '$urls | par-each { |url| http get $url | query web -q ".data" }'
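
A comparable fan-out can be sketched in Python with a thread pool, which suits I/O-bound work such as HTTP requests. `scrape_parallel` and the stand-in `scrape_one` function are hypothetical; results are keyed by URL so they do not depend on completion order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def scrape_parallel(
    urls: Iterable[str],
    scrape_one: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run scrape_one on every URL concurrently and key the results by URL."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whatever order they finish in.
        results = list(pool.map(scrape_one, url_list))
    return dict(zip(url_list, results))


# Usage with a stand-in scraper in place of a real fetch-and-parse step.
scraped = scrape_parallel(
    ["https://example.com/1", "https://example.com/2"],
    scrape_one=lambda url: url.rsplit("/", 1)[-1],
)
print(scraped)
```

Parallel requests multiply the load on the target site, so a small `max_workers` value is the safer default.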

One-liner Examples

Basic HTML Scraping

Extract all h1 titles

nu -c 'http get https://example.com | query web -q "h1"'

Get all links

nu -c 'http get https://example.com | query web -q "a" -a href'

Scrape product prices

nu -c 'http get https://store.example.com | query web -q ".price"'

HTML Scraping Example: Hacker News

Scrape HN front page titles and URLs (the page is fetched once; `query web` returns plain lists, so titles and URLs are paired positionally)

nu -c 'let html = (http get https://news.ycombinator.com/); let titles = ($html | query web -q ".titleline > a" | each { |t| $t | str join }); let urls = ($html | query web -q ".titleline > a" -a href); $titles | zip $urls | each { |pair| $"($pair.0) - ($pair.1)" }'

For static sites like HN, use `http get` directly. Reserve browser tools for dynamic content requiring JavaScript execution.
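
The pairing step in the Hacker News one-liner (titles zipped with their URLs) can be sketched in Python. The sample titles and hrefs below are invented, and resolving relative links against the page URL is an added assumption about what a caller usually wants, not something the one-liner does.

```python
from urllib.parse import urljoin

BASE_URL = "https://news.ycombinator.com/"

# Hypothetical extraction results: one title and one href per story link.
titles = ["Show HN: A tiny scraper", "Ask HN: Favourite shell?"]
hrefs = ["https://example.com/scraper", "item?id=1"]

# zip pairs the lists positionally; urljoin leaves absolute URLs untouched
# and resolves relative ones against the page they came from.
stories = [(title, urljoin(BASE_URL, href)) for title, href in zip(titles, hrefs)]
lines = [f"{title} - {url}" for title, url in stories]

for line in lines:
    print(line)
```

Positional pairing is only safe when both lists come from the same selector on the same fetched page, which is why the page should be fetched once and reused.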

GitHub Stars Scraper

Get star count for a repo

nu -c 'http get https://api.github.com/repos/nushell/nushell | get stargazers_count'

API Data Extraction

Fetch JSON and extract fields (--raw keeps the body as text so that `from json` does the parsing)

nu -c 'http get --raw https://api.example.com/users | from json | get -i 0.name'

API Authentication

Bearer token

nu -c 'http get -H [Authorization "Bearer YOUR_TOKEN"] https://api.example.com/data'

API key

nu -c 'http get -H [X-API-Key "YOUR_API_KEY"] https://api.example.com/data'

Basic auth

nu -c 'http get -u username -p password https://api.example.com/protected'

Related Skills

  • nu-shell: Core nu-shell scripting patterns and commands.

Related Tools

  • start-browser: Start Cromite browser via Puppeteer.
  • navigate-browser: Navigate to a URL in the browser.
  • evaluate-javascript: Evaluate JavaScript code in the active browser tab.
  • take-screenshot: Take a screenshot of the active browser tab.
  • query-html-elements: Extract HTML elements by CSS selector.
  • list-browser-tabs: List all open browser tabs with their titles and URLs.
  • close-tab: Close a browser tab by index or title.
  • switch-tab: Switch to a specific tab by index.
  • refresh-tab: Refresh the current tab.
  • current-url: Get the URL of the current active tab.
  • page-title: Get the title of the current active tab.
  • wait-for-element: Wait for a CSS selector to appear on the page.
  • click-element: Click on an element by CSS selector.
  • type-text: Type text into an input field.
  • extract-text: Extract text content from elements by CSS selector.
  • search-web: Perform web searches and extract information from search results.