
BeautifulSoup HTML Parsing


You are an expert in BeautifulSoup, Python HTML/XML parsing, DOM navigation, and building efficient data extraction pipelines for web scraping.

Core Expertise


  • BeautifulSoup API and parsing methods
  • CSS selectors and find methods
  • DOM traversal and navigation
  • HTML/XML parsing with different parsers
  • Integration with requests library
  • Handling malformed HTML gracefully
  • Data extraction patterns and best practices
  • Memory-efficient processing

Key Principles


  • Write concise, technical code with accurate Python examples
  • Prioritize readability, efficiency, and maintainability
  • Use modular, reusable functions for common extraction tasks
  • Handle missing data gracefully with proper defaults
  • Follow PEP 8 style guidelines
  • Implement proper error handling for robust scraping

Basic Setup


```bash
pip install beautifulsoup4 requests lxml
```

Loading HTML


```python
from bs4 import BeautifulSoup
import requests
```

From string


```python
html = '<html><body><h1>Hello</h1></body></html>'
soup = BeautifulSoup(html, 'lxml')
```

From file


```python
with open('page.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')
```

From URL


```python
response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'lxml')
```

Parser Options



lxml - Fast, lenient (recommended)


```python
soup = BeautifulSoup(html, 'lxml')
```

html.parser - Built-in, no dependencies


```python
soup = BeautifulSoup(html, 'html.parser')
```

html5lib - Most lenient, slowest


```python
soup = BeautifulSoup(html, 'html5lib')
```

lxml-xml - For XML documents


```python
soup = BeautifulSoup(xml, 'lxml-xml')
```
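The practical difference shows up on malformed input: every parser repairs broken markup, though each may build a slightly different tree for messy documents. A minimal check with the built-in html.parser (chosen here only to avoid extra dependencies):

```python
from bs4 import BeautifulSoup

# An unclosed <b> tag is repaired during parsing
soup = BeautifulSoup('<html><body><b>bold text', 'html.parser')
print(soup.b.get_text())  # bold text
```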

Finding Elements


By Tag



First matching element


```python
soup.find('h1')
```

All matching elements


```python
soup.find_all('p')
```

Shorthand


```python
soup.h1  # Same as soup.find('h1')
```
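One behavior worth remembering: find() returns None when nothing matches, while find_all() always returns a (possibly empty) list, so only the former needs a None check:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><h1>Title</h1></div>', 'html.parser')

print(soup.find('h1').get_text())  # Title
print(soup.find('h2'))             # None
print(soup.find_all('h2'))         # []
```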

By Attributes



By class


```python
soup.find('div', class_='article')
soup.find_all('div', class_='article')
```

By ID


```python
soup.find(id='main-content')
```

By any attribute


```python
soup.find('a', href='https://example.com')
soup.find_all('input', attrs={'type': 'text', 'name': 'email'})
```

By data attributes


```python
soup.find('div', attrs={'data-id': '123'})
```
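Note that class_ matches when the element carries that class among possibly several. A small self-contained check (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div class="article featured" data-id="123"><a href="/x">link</a></div>'
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div', class_='article')  # Matches despite the extra class
print(div['data-id'])  # 123
```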

CSS Selectors



Single element


```python
soup.select_one('div.article > h2')
```

Multiple elements


```python
soup.select('div.article h2')
```

Complex selectors


```python
soup.select('a[href^="https://"]')  # Starts with
soup.select('a[href$=".pdf"]')      # Ends with
soup.select('a[href*="example"]')   # Contains
soup.select('li:nth-child(2)')
soup.select('h1, h2, h3')           # Multiple tags
```
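These attribute selectors are easy to sanity-check on a small synthetic document:

```python
from bs4 import BeautifulSoup

html = '''
<ul>
  <li><a href="https://example.com/a.pdf">report</a></li>
  <li><a href="http://example.com/b.html">page</a></li>
  <li><a href="https://other.org/c.pdf">paper</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.select('a[href^="https://"]')))  # 2
print(len(soup.select('a[href$=".pdf"]')))      # 2
print(len(soup.select('a[href*="example"]')))   # 2
```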

With Functions


```python
import re
```

By regex


```python
soup.find_all('a', href=re.compile(r'^https://'))
```

By function


```python
def has_data_attr(tag):
    return tag.has_attr('data-id')

soup.find_all(has_data_attr)
```

String matching


```python
soup.find_all(string='exact text')
soup.find_all(string=re.compile('pattern'))
```

Extracting Data


Text Content



Get text


```python
element.text
element.get_text()
```

Get text with separator


```python
element.get_text(separator=' ')
```

Get stripped text


```python
element.get_text(strip=True)
```

Get strings (generator)


```python
for string in element.stripped_strings:
    print(string)
```
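The separator and strip options interact: strip=True trims each string before joining, which usually gives the cleanest result. A quick check:

```python
from bs4 import BeautifulSoup

div = BeautifulSoup('<div><p> Hello </p><p>world</p></div>', 'html.parser').div

print(repr(div.get_text()))                           # ' Hello world'
print(repr(div.get_text(separator=' ', strip=True)))  # 'Hello world'
```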

Attributes



Get attribute


```python
element['href']
element.get('href')             # Returns None if missing
element.get('href', 'default')  # With default
```

Get all attributes


```python
element.attrs  # Returns dict
```

Check attribute exists


```python
element.has_attr('class')
```
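One quirk to remember: class is a multi-valued attribute, so element['class'] comes back as a list, while most other attributes are plain strings:

```python
from bs4 import BeautifulSoup

div = BeautifulSoup('<div class="article featured" id="main">x</div>',
                    'html.parser').div

print(div['class'])  # ['article', 'featured']
print(div['id'])     # main
```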

HTML Content



Element as HTML

```python
str(element)               # Full markup, including the element's own tag
element.decode_contents()  # Inner HTML only
```

Just the tag


```python
element.name
```

Prettified HTML


```python
element.prettify()
```

DOM Navigation


Parent/Ancestors


```python
element.parent
element.parents  # Generator of all ancestors
```

Find specific ancestor


```python
for parent in element.parents:
    if parent.name == 'div' and 'article' in parent.get('class', []):
        break
```

Children


```python
element.children      # Direct children (generator)
list(element.children)

element.contents      # Direct children (list)
element.descendants   # All descendants (generator)
```

Find in children


```python
element.find('span')  # Searches descendants
```

Siblings


```python
element.next_sibling
element.previous_sibling

element.next_siblings      # Generator
element.previous_siblings  # Generator
```

Next/previous parsed node

Note: next_element and previous_element walk every parsed node in document order, including whitespace text nodes; they do not skip anything. To move between tags only, use find_next() / find_previous().

```python
element.next_element        # May be a whitespace text node
element.previous_element

element.find_next()         # Next tag in document order
element.find_previous()
```
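Because whitespace between tags is kept as text nodes, next_sibling frequently returns a newline rather than the next tag; find_next_sibling() jumps straight to it:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>one</li>\n<li>two</li></ul>', 'html.parser')
first = soup.li

print(repr(first.next_sibling))                  # '\n' (a text node)
print(first.find_next_sibling('li').get_text())  # two
```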

Data Extraction Patterns


Safe Extraction


```python
def safe_text(element, selector, default=''):
    """Safely extract text from element."""
    found = element.select_one(selector)
    return found.get_text(strip=True) if found else default

def safe_attr(element, selector, attr, default=None):
    """Safely extract attribute from element."""
    found = element.select_one(selector)
    return found.get(attr, default) if found else default
```

Table Extraction


```python
def extract_table(table):
    """Extract table data as list of dictionaries."""
    headers = [th.get_text(strip=True) for th in table.select('th')]

    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        if cells:
            rows.append(dict(zip(headers, cells)))

    return rows
```

List Extraction


```python
def extract_items(soup, selector, extractor):
    """Extract multiple items using a custom extractor function."""
    return [extractor(item) for item in soup.select(selector)]
```

Usage


```python
def extract_product(item):
    return {
        'name': safe_text(item, '.name'),
        'price': safe_text(item, '.price'),
        'url': safe_attr(item, 'a', 'href')
    }

products = extract_items(soup, '.product', extract_product)
```

URL Resolution


```python
from urllib.parse import urljoin

def resolve_url(base_url, relative_url):
    """Convert relative URL to absolute."""
    if not relative_url:
        return None
    return urljoin(base_url, relative_url)
```

Usage


```python
base_url = 'https://example.com/products/'
for link in soup.select('a'):
    href = link.get('href')
    absolute_url = resolve_url(base_url, href)
    print(absolute_url)
```

Handling Malformed HTML



lxml parser is lenient with malformed HTML


```python
soup = BeautifulSoup(malformed_html, 'lxml')
```

For very broken HTML, use html5lib


```python
soup = BeautifulSoup(very_broken_html, 'html5lib')
```

Handle encoding issues


```python
response = requests.get(url)
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, 'lxml')
```

Complete Scraping Example


```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class ProductScraper:
    def __init__(self, base_url):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'
        })

    def fetch_page(self, url):
        """Fetch and parse a page."""
        response = self.session.get(url, timeout=30)
        response.raise_for_status()
        return BeautifulSoup(response.content, 'lxml')

    def extract_product(self, item):
        """Extract product data from a card element."""
        return {
            'name': self._safe_text(item, '.product-title'),
            'price': self._parse_price(item.select_one('.price')),
            'rating': self._safe_attr(item, '.rating', 'data-rating'),
            'image': self._resolve(self._safe_attr(item, 'img', 'src')),
            'url': self._resolve(self._safe_attr(item, 'a', 'href')),
            'in_stock': not item.select_one('.out-of-stock')
        }

    def scrape_products(self, url):
        """Scrape all products from a page."""
        soup = self.fetch_page(url)
        items = soup.select('.product-card')
        return [self.extract_product(item) for item in items]

    def _safe_text(self, element, selector, default=''):
        found = element.select_one(selector)
        return found.get_text(strip=True) if found else default

    def _safe_attr(self, element, selector, attr, default=None):
        found = element.select_one(selector)
        return found.get(attr, default) if found else default

    def _parse_price(self, element):
        if not element:
            return None
        text = element.get_text(strip=True)
        try:
            return float(text.replace('$', '').replace(',', ''))
        except ValueError:
            return None

    def _resolve(self, url):
        return urljoin(self.base_url, url) if url else None
```

Usage


```python
scraper = ProductScraper('https://example.com')
products = scraper.scrape_products('https://example.com/products')
for product in products:
    print(product)
```

Performance Optimization



Use SoupStrainer to parse only needed elements


```python
from bs4 import SoupStrainer

only_articles = SoupStrainer('article')
soup = BeautifulSoup(html, 'lxml', parse_only=only_articles)
```

Use lxml parser for speed


```python
soup = BeautifulSoup(html, 'lxml')  # Fastest
```

Decompose unneeded elements


```python
for script in soup.find_all('script'):
    script.decompose()
```

Use generators for memory efficiency


```python
def iter_items(soup):
    # Wrapped in a function so the yield is valid; extract_data is the
    # per-item extractor defined elsewhere
    for item in soup.select('.item'):
        yield extract_data(item)
```

Key Dependencies


  • beautifulsoup4
  • lxml (fast parser)
  • html5lib (lenient parser)
  • requests
  • pandas (for data output)

Best Practices


  1. Prefer the lxml parser for performance
  2. Handle missing elements with default values
  3. Use `select()` and `select_one()` for CSS selectors
  4. Use `get_text(strip=True)` for clean text extraction
  5. Resolve relative URLs to absolute
  6. Validate extracted data types
  7. Implement rate limiting between requests
  8. Use proper User-Agent headers
  9. Handle character encoding properly
  10. Use SoupStrainer for large documents
  11. Follow robots.txt and website terms of service
  12. Implement retry logic for failed requests
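Items 7 and 12 above can be sketched together as a small retry helper with exponential backoff. This is an illustrative helper, not part of BeautifulSoup or requests; fetch is any callable, e.g. a session.get wrapper:

```python
import time

def fetch_with_retry(fetch, url, retries=3, backoff=0.5):
    """Call fetch(url); on failure sleep backoff * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)
```

In a real scraper, the caught exception should be narrowed to something like requests.RequestException, and fetch might be `lambda u: session.get(u, timeout=30)`.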