Scraping

Web scraping using nu-shell and browser tools for data extraction.

Prerequisites

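- nu-shell installed (`nu`)
- `query web` plugin installed (for HTML scraping). The command ships in the `nu_plugin_query` binary, which `plugin add` registers by path, for example `nu -c "plugin add ~/.cargo/bin/nu_plugin_query"` after `cargo install nu_plugin_query`
- Browser extension enabled (for dynamic content): enable the `browser` extension in your agent configuration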

Common Tasks

Fetching Web Pages

Use `http get` to retrieve HTML content:

```bash
# Simple GET request
nu -c 'http get https://example.com'

# With headers
nu -c 'http get -H [User-Agent "My Scraper"] https://example.com'
```

HTML Parsing and Data Extraction

Use the `query web` plugin to parse HTML and extract data using CSS selectors:

```bash
# Extract text from elements
nu -c 'http get https://example.com | query web -q "h1, h2" | flatten | str trim'

# Extract attributes (the selector goes to -q, the attribute name to -a)
nu -c 'http get https://example.com | query web -q "a" -a href'

# Parse tables as structured data
nu -c 'http get https://example.com/table-page | query web --as-table ["Column1" "Column2"]'
```

Browser-Based Scraping for Dynamic Content

For websites requiring JavaScript execution or complex DOM interactions, use browser automation tools.

```bash
# Start browser
start-browser

# Navigate to page
navigate-browser --url https://example.com

# Extract data with JavaScript evaluation
evaluate-javascript --code "Array.from(document.querySelectorAll('selector')).map(e => e.textContent)"

# Screenshot for visual inspection
take-screenshot

# Query HTML fragments
query-html-elements --selector ".content"
```

API Interactions

For JSON APIs, use `http get`. Responses served with a JSON content type are parsed automatically; pass `--raw` to get the body as text and parse it yourself with `from json`:

```bash
# GET JSON API
nu -c 'http get --raw https://api.example.com/data | from json'

# POST requests
nu -c 'http post -t application/json https://api.example.com/submit {key: "value"}'
```

Handling Authentication

```bash
# Basic auth (user and password are separate flags)
nu -c 'http get -u username -p password https://api.example.com'

# Bearer token
nu -c 'http get -H [Authorization "Bearer YOUR_TOKEN"] https://api.example.com'

# Custom headers
nu -c 'http get -H [X-API-Key "YOUR_KEY" User-Agent "Scraper"] https://api.example.com'
```
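Hardcoding credentials in a one-liner is fine for a demo but awkward in scripts. The sketch below shows one way to take a bearer token from the environment instead. The names `auth_get`, `API_TOKEN`, `TARGET_URL`, and `NU_BIN` are assumptions made for this example, not part of nu-shell; the only nu features relied on are `$env` and string interpolation.

```shell
# Fetch a URL with a bearer token taken from the environment.
# auth_get, API_TOKEN, TARGET_URL and NU_BIN are hypothetical names
# chosen for this sketch.
auth_get() {
  if [ -z "${API_TOKEN:-}" ]; then
    echo "API_TOKEN is not set" >&2
    return 1
  fi
  # nu reads both values from its own environment, so the token does not
  # appear in the command line or in the script text.
  API_TOKEN="$API_TOKEN" TARGET_URL="$1" "${NU_BIN:-nu}" -c \
    'http get -H [Authorization $"Bearer ($env.API_TOKEN)"] $env.TARGET_URL'
}
```

A call would look like `API_TOKEN=... auth_get https://api.example.com/data`. `NU_BIN` exists only so the nu binary can be swapped out, for instance in tests.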

Rate Limiting and Delays

```bash
# Add delays between requests (the example.com URLs are placeholders)
nu -c 'let urls = ["https://example.com/1" "https://example.com/2"]
$urls | each { |url| let page = (http get $url); sleep 1sec; $page }'
```
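Fixed delays help with politeness, but a rate-limited server may still reject a request now and then. A common complement is to retry with a growing pause. The following is a minimal sketch of that idea in plain shell; `retry_with_backoff` is a hypothetical helper, and it assumes the wrapped command (for example `nu -c 'http get ...'`) exits non-zero when the request fails.

```shell
# Retry a command, doubling the pause after each failed attempt.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# retry_with_backoff is a hypothetical helper name for this sketch.
retry_with_backoff() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while true; do
    # Run the command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For instance, `retry_with_backoff 3 1 nu -c 'http get https://example.com'` would try up to three times, pausing 1 and then 2 seconds between attempts.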

Parallel Processing

```bash
# Scrape multiple pages in parallel (the example.com URLs are placeholders)
nu -c 'let urls = ["https://example.com/1" "https://example.com/2"]
$urls | par-each { |url| http get $url | query web -q ".data" }'
```

One-liner Examples

Basic HTML Scraping

```bash
# Extract all h1 titles
nu -c 'http get https://example.com | query web -q "h1"'

# Get all links
nu -c 'http get https://example.com | query web -q "a" -a href'

# Scrape product prices
nu -c 'http get https://store.example.com | query web -q ".price"'
```

HTML Scraping Example: Hacker News

```bash
# Scrape HN front page titles and URLs (fetch the page once, query it twice)
nu -c 'let html = (http get https://news.ycombinator.com/)
let titles = ($html | query web -q ".titleline > a" | each { |t| $t | str join })
let links = ($html | query web -q ".titleline > a" -a href)
$titles | zip $links | each { |pair| $"($pair.0) - ($pair.1)" }'
```

For static sites like HN, use `http get` directly. Reserve browser tools for dynamic content requiring JavaScript execution.
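One rough way to decide between `http get` and the browser tools is to check whether the markup you plan to select is already present in the raw HTML. The helper below is only a heuristic sketch: `is_static_enough` is a hypothetical name, and the marker string is whatever fragment you expect to find on the page.

```shell
# Heuristic check: does the raw HTML on stdin already contain the marker?
# is_static_enough is a hypothetical helper name for this sketch.
is_static_enough() {
  # -F treats the marker as a fixed string, -q suppresses output;
  # the exit status is 0 when the marker is found.
  grep -qF -- "$1"
}
```

It could be used as `nu -c 'http get https://news.ycombinator.com/' | is_static_enough 'class="titleline"'`. A zero exit status suggests plain `http get` is sufficient; a non-zero status hints that the content is rendered by JavaScript.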

GitHub Stars Scraper

```bash
# Get star count for a repo
nu -c 'http get https://api.github.com/repos/nushell/nushell | get stargazers_count'
```

API Data Extraction

```bash
# Fetch JSON and extract fields (--raw keeps the body as text for from json)
nu -c 'http get --raw https://api.example.com/users | from json | get -i 0.name'
```

API Authentication

```bash
# Bearer token
nu -c 'http get -H [Authorization "Bearer YOUR_TOKEN"] https://api.example.com/data'

# API key
nu -c 'http get -H [X-API-Key "YOUR_API_KEY"] https://api.example.com/data'

# Basic auth (user and password are separate flags)
nu -c 'http get -u username -p password https://api.example.com/protected'
```
