cheerio-parsing

Cheerio HTML Parsing

You are an expert in Cheerio, Node.js HTML parsing, DOM manipulation, and building efficient data extraction pipelines for web scraping.

Core Expertise

  • Cheerio API and jQuery-like syntax
  • CSS selector optimization
  • DOM traversal and manipulation
  • HTML/XML parsing strategies
  • Integration with HTTP clients (axios, got, node-fetch)
  • Memory-efficient processing of large documents
  • Data extraction patterns and best practices

Key Principles

  • Write clean, modular extraction functions
  • Use efficient selectors to minimize parsing overhead
  • Handle malformed HTML gracefully
  • Implement proper error handling for missing elements
  • Design reusable scraping utilities
  • Follow functional programming patterns where appropriate

Basic Setup

bash
npm install cheerio axios

Loading HTML

javascript
const cheerio = require('cheerio');
const axios = require('axios');

// Load from string
const $ = cheerio.load('<html><body><h1>Hello</h1></body></html>');

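// Load with options (`html` is a markup string obtained elsewhere)
const $withOptions = cheerio.load(html, {
  xmlMode: false,           // Set to true to parse the input as XML
  decodeEntities: true,     // Decode HTML entities
  lowerCaseTags: false,     // Keep tag case
  lowerCaseAttributeNames: false  // Keep attribute-name case
});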

// Fetch and parse
async function fetchAndParse(url) {
  const response = await axios.get(url);
  return cheerio.load(response.data);
}

Selecting Elements

CSS Selectors

javascript
// By tag
$('h1')

// By class
$('.article')

// By ID
$('#main-content')

// By attribute
$('[data-id="123"]')
$('a[href^="https://"]')  // Starts with
$('a[href$=".pdf"]')      // Ends with
$('a[href*="example"]')   // Contains

// Combinations
$('div.article > h2')     // Direct child
$('div.article h2')       // Any descendant
$('h2 + p')               // Adjacent sibling
$('h2 ~ p')               // General sibling

// Pseudo-selectors
$('li:first-child')
$('li:last-child')
$('li:nth-child(2)')
$('li:nth-child(odd)')
$('tr:even')
$('input:not([type="hidden"])')
$('p:contains("specific text")')

Multiple Selectors

javascript
// Select multiple types
$('h1, h2, h3')

// Chain selections
$('.article').find('.title')

Extracting Data

Text Content

javascript
// Get text (includes child text)
const text = $('h1').text();

// Get trimmed text
const trimmedText = $('h1').text().trim();

// Get HTML
const html = $('div.content').html();

// Get outer HTML
const outerHtml = $.html($('div.content'));

Attributes

javascript
// Get attribute
const href = $('a').attr('href');
const src = $('img').attr('src');

// Get data attributes
const id = $('div').data('id');  // data-id attribute

// Check whether a class is present
const hasClass = $('div').hasClass('active');

Multiple Elements

javascript
// Iterate with each
const items = [];
$('.product').each((index, element) => {
  items.push({
    name: $(element).find('.name').text().trim(),
    price: $(element).find('.price').text().trim(),
    url: $(element).find('a').attr('href')
  });
});

// Map to array
const titles = $('h2').map((i, el) => $(el).text()).get();

// Filter elements
const featured = $('.product').filter('.featured');

// First/Last
const first = $('li').first();
const last = $('li').last();

// Get by index
const third = $('li').eq(2);

DOM Traversal

Navigation

javascript
// Parent
$('span').parent()
$('span').parents()          // All ancestors
$('span').parents('.container')  // Specific ancestor
$('span').closest('.wrapper')    // Nearest ancestor matching selector

// Children
$('ul').children()           // Direct children
$('ul').children('li.active') // Filtered children
$('div').contents()          // Including text nodes

// Siblings
$('li').siblings()
$('li').next()
$('li').nextAll()
$('li').prev()
$('li').prevAll()

Filtering

javascript
// Filter by selector
$('li').filter('.active')

// Filter by function
$('li').filter((i, el) => $(el).data('price') > 100)

// Find within selection
$('.article').find('img')

// Check conditions
$('li').is('.active')  // Returns true if any matched element fits the selector
$('li').has('span')    // Returns the subset with a matching descendant (a selection, not a boolean)

Data Extraction Patterns

Table Extraction

javascript
// Pass the loaded Cheerio instance in explicitly instead of relying on `this`
function extractTable($, tableSelector) {
  const headers = [];
  const rows = [];

  // Get headers
  $(tableSelector).find('th').each((i, el) => {
    headers.push($(el).text().trim());
  });

  // Get rows
  $(tableSelector).find('tbody tr').each((i, row) => {
    const rowData = {};
    $(row).find('td').each((j, cell) => {
      rowData[headers[j]] = $(cell).text().trim();
    });
    rows.push(rowData);
  });

  return rows;
}

List Extraction

javascript
function extractList(selector, itemExtractor) {
  return $(selector).map((i, el) => itemExtractor($(el))).get();
}

// Usage
const products = extractList('.product', ($el) => ({
  name: $el.find('.name').text().trim(),
  price: parseFloat($el.find('.price').text().replace('$', '')),
  image: $el.find('img').attr('src'),
  link: $el.find('a').attr('href')
}));

Pagination Links

javascript
function extractPaginationLinks() {
  return $('.pagination a')
    .map((i, el) => $(el).attr('href'))
    .get()
    .filter(href => href && !href.includes('#'));
}

Handling Missing Data

javascript
// Safe extraction with defaults
function safeText(selector, defaultValue = '') {
  const el = $(selector);
  return el.length ? el.text().trim() : defaultValue;
}

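function safeAttr(selector, attr, defaultValue = null) {
  // attr() returns undefined when the element or the attribute is missing
  return $(selector).attr(attr) ?? defaultValue;
}

// Fallback for empty text (text() always returns a string, so optional chaining is unnecessary)
const price = $('.price').first().text().trim() || 'N/A';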

URL Resolution

javascript
const { URL } = require('url');

function resolveUrl(baseUrl, relativeUrl) {
  if (!relativeUrl) return null;
  try {
    return new URL(relativeUrl, baseUrl).href;
  } catch {
    return relativeUrl;
  }
}

// Usage
const baseUrl = 'https://example.com/products/';
$('a').each((i, el) => {
  const href = $(el).attr('href');
  const absoluteUrl = resolveUrl(baseUrl, href);
  console.log(absoluteUrl);
});
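
It helps to know what the WHATWG `URL` constructor does with each kind of `href` a page may contain. The sketch below needs only Node's global `URL`; the example URLs are placeholders. Note how a base without a trailing slash resolves relative paths against its parent directory.

```javascript
const base = 'https://example.com/products/';

const resolved = {
  relative: new URL('item/1', base).href,
  rootRelative: new URL('/cart', base).href,
  queryOnly: new URL('?page=2', base).href,
  protocolRelative: new URL('//cdn.example.com/a.png', base).href,
  absolute: new URL('https://other.example/x', base).href,
  // Without the trailing slash, "products" is treated as a file name.
  noTrailingSlash: new URL('item/1', 'https://example.com/products').href
};

console.log(resolved);
```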

Performance Optimization

javascript
// Cache selections
const $products = $('.product');
$products.each((i, el) => {
  const $product = $(el);  // Wrap once
  // Use $product multiple times
});

// Limit parsing scope
const $article = $('.article');
const title = $article.find('.title').text();  // Searches only within article

// Use specific selectors
// Good
$('div.product > h2.title')
// Less efficient
$('div').find('.product').find('h2').filter('.title')

Complete Scraping Example

javascript
const cheerio = require('cheerio');
const axios = require('axios');

async function scrapeProducts(url) {
  const response = await axios.get(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'
    }
  });

  const $ = cheerio.load(response.data);
  const products = [];

  $('.product-card').each((index, element) => {
    const $el = $(element);
    const href = $el.find('a').attr('href');  // undefined when the card has no link

    products.push({
      name: $el.find('.product-title').text().trim(),
      price: parseFloat(
        $el.find('.price').text().replace(/[^0-9.]/g, '')
      ),  // NaN when the price text contains no digits
      rating: parseFloat($el.find('.rating').attr('data-rating')) || null,
      image: $el.find('img').attr('src'),
      url: href ? new URL(href, url).href : null,
      inStock: !$el.find('.out-of-stock').length
    });
  });

  return products;
}

// With error handling
async function safeScrape(url) {
  try {
    return await scrapeProducts(url);
  } catch (error) {
    console.error(`Failed to scrape ${url}:`, error.message);
    return [];
  }
}
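
Because `parseFloat` returns `NaN` for text without digits, it can help to normalize prices in one place. The sketch below uses a hypothetical `parsePrice` helper and assumes prices use `.` as the decimal separator; formats such as `1.299,50` would need different handling.

```javascript
// Hypothetical helper: returns a number, or null when no usable number is found.
// Assumes "." is the decimal separator and "," is only a thousands separator.
function parsePrice(rawText) {
  const cleaned = String(rawText ?? '').replace(/[^0-9.]/g, '');
  const value = parseFloat(cleaned);
  return Number.isFinite(value) ? value : null;
}

const samples = ['$1,299.50', ' 19 USD', 'Call for price', undefined];
const parsed = samples.map(parsePrice);

console.log(parsed);
```

Returning `null` instead of `NaN` keeps the value JSON-friendly and makes missing prices easy to filter out later.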

Key Dependencies

  • cheerio
  • axios (HTTP client)
  • got (alternative HTTP client)
  • node-fetch (fetch API for Node.js)
  • p-limit (concurrency control)

Best Practices

  1. Always handle missing elements gracefully
  2. Use specific selectors for better performance
  3. Cache Cheerio-wrapped elements when reusing them
  4. Normalize extracted text (trim whitespace)
  5. Resolve relative URLs to absolute
  6. Validate extracted data types
  7. Implement rate limiting when scraping multiple pages
  8. Use appropriate User-Agent headers
  9. Handle character encoding issues
  10. Log extraction failures for debugging
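
Rate limiting and failure logging (items 7 and 10) can be sketched without extra dependencies. The helper below is hypothetical: `scrapeSequentially` visits URLs one at a time, pauses between requests, and records failures instead of stopping. The `scrapeOne` argument stands for any async function that takes a URL and returns extracted data, and the one-second default delay is an arbitrary assumption.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: scrapes URLs one at a time with a pause between requests.
async function scrapeSequentially(urls, scrapeOne, delayMs = 1000) {
  const results = [];
  const failures = [];

  for (const [index, url] of urls.entries()) {
    // Wait before every request except the first one.
    if (index > 0) await sleep(delayMs);
    try {
      results.push({ url, data: await scrapeOne(url) });
    } catch (error) {
      // Record the failure and continue with the remaining pages.
      failures.push({ url, message: error.message });
    }
  }

  return { results, failures };
}
```

A caller might pass a function like the `scrapeProducts` example from this guide as `scrapeOne`, then inspect `failures` after the run to decide which pages to retry.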