
BeautifulSoup HTML Parsing


You are an expert in BeautifulSoup, Python HTML/XML parsing, DOM navigation, and building efficient data extraction pipelines for web scraping.

Core Expertise


  • BeautifulSoup API and parsing methods
  • CSS selectors and find methods
  • DOM traversal and navigation
  • HTML/XML parsing with different parsers
  • Integration with requests library
  • Handling malformed HTML gracefully
  • Data extraction patterns and best practices
  • Memory-efficient processing

Key Principles


  • Write concise, technical code with accurate Python examples
  • Prioritize readability, efficiency, and maintainability
  • Use modular, reusable functions for common extraction tasks
  • Handle missing data gracefully with proper defaults
  • Follow PEP 8 style guidelines
  • Implement proper error handling for robust scraping

Basic Setup


```bash
pip install beautifulsoup4 requests lxml
```

Loading HTML


```python
from bs4 import BeautifulSoup
import requests
```

From string


```python
html = '<html><body><h1>Hello</h1></body></html>'
soup = BeautifulSoup(html, 'lxml')
```

From file


```python
with open('page.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')
```

From URL


```python
response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'lxml')
```

Parser Options



lxml - Fast, lenient (recommended)


```python
soup = BeautifulSoup(html, 'lxml')
```

html.parser - Built-in, no dependencies


```python
soup = BeautifulSoup(html, 'html.parser')
```

html5lib - Most lenient, slowest


```python
soup = BeautifulSoup(html, 'html5lib')
```

lxml-xml - For XML documents


```python
soup = BeautifulSoup(xml, 'lxml-xml')
```
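The practical difference shows up on malformed input: every parser repairs broken markup, though each may build a slightly different tree for messy documents. A minimal check with the built-in html.parser (chosen here only to avoid extra dependencies):

```python
from bs4 import BeautifulSoup

# An unclosed <b> tag is repaired during parsing
soup = BeautifulSoup('<html><body><b>bold text', 'html.parser')
print(soup.b.get_text())  # bold text
```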

Finding Elements


By Tag



First matching element


```python
soup.find('h1')
```

All matching elements


```python
soup.find_all('p')
```

Shorthand


```python
soup.h1  # Same as soup.find('h1')
```
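One behavior worth remembering: find() returns None when nothing matches, while find_all() always returns a (possibly empty) list, so only the former needs a None check:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><h1>Title</h1></div>', 'html.parser')

print(soup.find('h1').get_text())  # Title
print(soup.find('h2'))             # None
print(soup.find_all('h2'))         # []
```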

By Attributes



By class


```python
soup.find('div', class_='article')
soup.find_all('div', class_='article')
```

By ID


```python
soup.find(id='main-content')
```

By any attribute


```python
soup.find('a', href='https://example.com')
soup.find_all('input', attrs={'type': 'text', 'name': 'email'})
```

By data attributes


```python
soup.find('div', attrs={'data-id': '123'})
```
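Note that class_ matches when the element carries that class among possibly several. A small self-contained check (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div class="article featured" data-id="123"><a href="/x">link</a></div>'
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div', class_='article')  # Matches despite the extra class
print(div['data-id'])  # 123
```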

CSS Selectors



Single element


```python
soup.select_one('div.article > h2')
```

Multiple elements


```python
soup.select('div.article h2')
```

Complex selectors


```python
soup.select('a[href^="https://"]')  # Starts with
soup.select('a[href$=".pdf"]')      # Ends with
soup.select('a[href*="example"]')   # Contains
soup.select('li:nth-child(2)')
soup.select('h1, h2, h3')           # Multiple tags
```
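These attribute selectors are easy to sanity-check on a small synthetic document:

```python
from bs4 import BeautifulSoup

html = '''
<ul>
  <li><a href="https://example.com/a.pdf">report</a></li>
  <li><a href="http://example.com/b.html">page</a></li>
  <li><a href="https://other.org/c.pdf">paper</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.select('a[href^="https://"]')))  # 2
print(len(soup.select('a[href$=".pdf"]')))      # 2
print(len(soup.select('a[href*="example"]')))   # 2
```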

With Functions


```python
import re
```

By regex


```python
soup.find_all('a', href=re.compile(r'^https://'))
```

By function


```python
def has_data_attr(tag):
    return tag.has_attr('data-id')

soup.find_all(has_data_attr)
```

String matching


```python
soup.find_all(string='exact text')
soup.find_all(string=re.compile('pattern'))
```

Extracting Data


Text Content



Get text


```python
element.text
element.get_text()
```

Get text with separator


```python
element.get_text(separator=' ')
```

Get stripped text


```python
element.get_text(strip=True)
```

Get strings (generator)


```python
for string in element.stripped_strings:
    print(string)
```
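The separator and strip options interact: strip=True trims each string before joining, which usually gives the cleanest result. A quick check:

```python
from bs4 import BeautifulSoup

div = BeautifulSoup('<div><p> Hello </p><p>world</p></div>', 'html.parser').div

print(repr(div.get_text()))                           # ' Hello world'
print(repr(div.get_text(separator=' ', strip=True)))  # 'Hello world'
```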

Attributes



Get attribute


```python
element['href']
element.get('href')             # Returns None if missing
element.get('href', 'default')  # With default
```

Get all attributes


```python
element.attrs  # Returns dict
```

Check attribute exists


```python
element.has_attr('class')
```
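One quirk to remember: class is a multi-valued attribute, so element['class'] comes back as a list, while most other attributes are plain strings:

```python
from bs4 import BeautifulSoup

div = BeautifulSoup('<div class="article featured" id="main">x</div>',
                    'html.parser').div

print(div['class'])  # ['article', 'featured']
print(div['id'])     # main
```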

HTML Content



Element as HTML

```python
str(element)               # Full markup, including the element's own tag
element.decode_contents()  # Inner HTML only
```

Just the tag


```python
element.name
```

Prettified HTML


```python
element.prettify()
```

DOM Navigation


Parent/Ancestors


```python
element.parent
element.parents  # Generator of all ancestors
```

Find specific ancestor


```python
for parent in element.parents:
    if parent.name == 'div' and 'article' in parent.get('class', []):
        break
```

Children


```python
element.children      # Direct children (generator)
list(element.children)

element.contents      # Direct children (list)
element.descendants   # All descendants (generator)
```

Find in children


```python
element.find('span')  # Searches descendants
```

Siblings


```python
element.next_sibling
element.previous_sibling

element.next_siblings      # Generator
element.previous_siblings  # Generator
```

Next/previous parsed node

Note: next_element and previous_element walk every parsed node in document order, including whitespace text nodes; they do not skip anything. To move between tags only, use find_next() / find_previous().

```python
element.next_element        # May be a whitespace text node
element.previous_element

element.find_next()         # Next tag in document order
element.find_previous()
```
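Because whitespace between tags is kept as text nodes, next_sibling frequently returns a newline rather than the next tag; find_next_sibling() jumps straight to it:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>one</li>\n<li>two</li></ul>', 'html.parser')
first = soup.li

print(repr(first.next_sibling))                  # '\n' (a text node)
print(first.find_next_sibling('li').get_text())  # two
```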

Data Extraction Patterns


Safe Extraction


```python
def safe_text(element, selector, default=''):
    """Safely extract text from element."""
    found = element.select_one(selector)
    return found.get_text(strip=True) if found else default

def safe_attr(element, selector, attr, default=None):
    """Safely extract attribute from element."""
    found = element.select_one(selector)
    return found.get(attr, default) if found else default
```

Table Extraction


```python
def extract_table(table):
    """Extract table data as list of dictionaries."""
    headers = [th.get_text(strip=True) for th in table.select('th')]

    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        if cells:
            rows.append(dict(zip(headers, cells)))

    return rows
```

List Extraction


```python
def extract_items(soup, selector, extractor):
    """Extract multiple items using a custom extractor function."""
    return [extractor(item) for item in soup.select(selector)]
```

Usage


```python
def extract_product(item):
    return {
        'name': safe_text(item, '.name'),
        'price': safe_text(item, '.price'),
        'url': safe_attr(item, 'a', 'href')
    }

products = extract_items(soup, '.product', extract_product)
```

URL Resolution


```python
from urllib.parse import urljoin

def resolve_url(base_url, relative_url):
    """Convert relative URL to absolute."""
    if not relative_url:
        return None
    return urljoin(base_url, relative_url)
```

Usage


```python
base_url = 'https://example.com/products/'
for link in soup.select('a'):
    href = link.get('href')
    absolute_url = resolve_url(base_url, href)
    print(absolute_url)
```

Handling Malformed HTML



lxml parser is lenient with malformed HTML


```python
soup = BeautifulSoup(malformed_html, 'lxml')
```

For very broken HTML, use html5lib


```python
soup = BeautifulSoup(very_broken_html, 'html5lib')
```

Handle encoding issues


```python
response = requests.get(url)
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, 'lxml')
```

Complete Scraping Example


```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class ProductScraper:
    def __init__(self, base_url):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'
        })

    def fetch_page(self, url):
        """Fetch and parse a page."""
        response = self.session.get(url, timeout=30)
        response.raise_for_status()
        return BeautifulSoup(response.content, 'lxml')

    def extract_product(self, item):
        """Extract product data from a card element."""
        return {
            'name': self._safe_text(item, '.product-title'),
            'price': self._parse_price(item.select_one('.price')),
            'rating': self._safe_attr(item, '.rating', 'data-rating'),
            'image': self._resolve(self._safe_attr(item, 'img', 'src')),
            'url': self._resolve(self._safe_attr(item, 'a', 'href')),
            'in_stock': not item.select_one('.out-of-stock')
        }

    def scrape_products(self, url):
        """Scrape all products from a page."""
        soup = self.fetch_page(url)
        items = soup.select('.product-card')
        return [self.extract_product(item) for item in items]

    def _safe_text(self, element, selector, default=''):
        found = element.select_one(selector)
        return found.get_text(strip=True) if found else default

    def _safe_attr(self, element, selector, attr, default=None):
        found = element.select_one(selector)
        return found.get(attr, default) if found else default

    def _parse_price(self, element):
        if not element:
            return None
        text = element.get_text(strip=True)
        try:
            return float(text.replace('$', '').replace(',', ''))
        except ValueError:
            return None

    def _resolve(self, url):
        return urljoin(self.base_url, url) if url else None
```

Usage


```python
scraper = ProductScraper('https://example.com')
products = scraper.scrape_products('https://example.com/products')
for product in products:
    print(product)
```

Performance Optimization



Use SoupStrainer to parse only needed elements


```python
from bs4 import SoupStrainer

only_articles = SoupStrainer('article')
soup = BeautifulSoup(html, 'lxml', parse_only=only_articles)
```

Use lxml parser for speed


```python
soup = BeautifulSoup(html, 'lxml')  # Fastest
```

Decompose unneeded elements


```python
for script in soup.find_all('script'):
    script.decompose()
```

Use generators for memory efficiency


```python
def iter_items(soup):
    # Wrapped in a function so the yield is valid; extract_data is the
    # per-item extractor defined elsewhere
    for item in soup.select('.item'):
        yield extract_data(item)
```

Key Dependencies


  • beautifulsoup4
  • lxml (fast parser)
  • html5lib (lenient parser)
  • requests
  • pandas (for data output)

Best Practices


  1. Prefer the lxml parser for performance
  2. Handle missing elements with default values
  3. Use `select()` and `select_one()` for CSS selectors
  4. Use `get_text(strip=True)` for clean text extraction
  5. Resolve relative URLs to absolute
  6. Validate extracted data types
  7. Implement rate limiting between requests
  8. Use proper User-Agent headers
  9. Handle character encoding properly
  10. Use SoupStrainer for large documents
  11. Follow robots.txt and website terms of service
  12. Implement retry logic for failed requests
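Items 7 and 12 above can be sketched together as a small retry helper with exponential backoff. This is an illustrative helper, not part of BeautifulSoup or requests; fetch is any callable, e.g. a session.get wrapper:

```python
import time

def fetch_with_retry(fetch, url, retries=3, backoff=0.5):
    """Call fetch(url); on failure sleep backoff * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)
```

In a real scraper, the caught exception should be narrowed to something like requests.RequestException, and fetch might be `lambda u: session.get(u, timeout=30)`.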