# BeautifulSoup HTML Parsing
You are an expert in BeautifulSoup, Python HTML/XML parsing, DOM navigation, and building efficient data extraction pipelines for web scraping.
## Core Expertise

- BeautifulSoup API and parsing methods
- CSS selectors and find methods
- DOM traversal and navigation
- HTML/XML parsing with different parsers
- Integration with the requests library
- Handling malformed HTML gracefully
- Data extraction patterns and best practices
- Memory-efficient processing
## Key Principles

- Write concise, technical code with accurate Python examples
- Prioritize readability, efficiency, and maintainability
- Use modular, reusable functions for common extraction tasks
- Handle missing data gracefully with proper defaults
- Follow PEP 8 style guidelines
- Implement proper error handling for robust scraping
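As a minimal sketch of the error-handling principle (the helper name is illustrative; `html.parser` is used so the snippet needs no extra parser dependency, swap in `'lxml'` if it is installed):

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=10):
    """Fetch a page and parse it, returning None instead of raising on failure."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return BeautifulSoup(response.content, 'html.parser')
```

Callers can then branch on `None` rather than wrapping every call site in try/except.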
## Basic Setup

```bash
pip install beautifulsoup4 requests lxml
```

## Loading HTML
```python
from bs4 import BeautifulSoup
import requests

# From string
html = '<html><body><h1>Hello</h1></body></html>'
soup = BeautifulSoup(html, 'lxml')

# From file
with open('page.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')

# From URL
response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'lxml')
```
## Parser Options

```python
# lxml - fast, lenient (recommended)
soup = BeautifulSoup(html, 'lxml')

# html.parser - built in, no dependencies
soup = BeautifulSoup(html, 'html.parser')

# html5lib - most lenient, slowest
soup = BeautifulSoup(html, 'html5lib')

# lxml-xml - for XML documents
soup = BeautifulSoup(xml, 'lxml-xml')
```
## Finding Elements

### By Tag

```python
# First matching element
soup.find('h1')

# All matching elements
soup.find_all('p')

# Shorthand
soup.h1  # Same as soup.find('h1')
```
### By Attributes

```python
# By class
soup.find('div', class_='article')
soup.find_all('div', class_='article')

# By ID
soup.find(id='main-content')

# By any attribute
soup.find('a', href='https://example.com')
soup.find_all('input', attrs={'type': 'text', 'name': 'email'})

# By data attributes
soup.find('div', attrs={'data-id': '123'})
```
### CSS Selectors

```python
# Single element
soup.select_one('div.article > h2')

# Multiple elements
soup.select('div.article h2')

# Complex selectors
soup.select('a[href^="https://"]')  # Starts with
soup.select('a[href$=".pdf"]')      # Ends with
soup.select('a[href*="example"]')   # Contains
soup.select('li:nth-child(2)')
soup.select('h1, h2, h3')           # Multiple tags
```
### With Functions

```python
import re

# By regex
soup.find_all('a', href=re.compile(r'^https://'))

# By function
def has_data_attr(tag):
    return tag.has_attr('data-id')

soup.find_all(has_data_attr)

# String matching
soup.find_all(string='exact text')
soup.find_all(string=re.compile('pattern'))
```
## Extracting Data

### Text Content

```python
# Get text
element.text
element.get_text()

# Get text with separator
element.get_text(separator=' ')

# Get stripped text
element.get_text(strip=True)

# Get strings (generator)
for string in element.stripped_strings:
    print(string)
```
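The `separator` and `strip` options are easiest to see on a tiny example (`html.parser` is used here just to keep the snippet dependency-free). Note that `strip=True` strips each string fragment individually before joining, which can fuse words together:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b>!</p>', 'html.parser')
p = soup.p

print(p.get_text())               # Hello world!
print(p.get_text(separator='|'))  # Hello |world|!
print(p.get_text(strip=True))     # Helloworld!  (each fragment stripped, then joined)
```

Combine `separator=' '` with `strip=True` when you want clean, space-delimited text.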
### Attributes

```python
# Get attribute
element['href']
element.get('href')             # Returns None if missing
element.get('href', 'default')  # With default

# Get all attributes
element.attrs  # Returns dict

# Check attribute exists
element.has_attr('class')
```
### HTML Content

```python
# Element serialized as HTML (the tag and everything inside it)
str(element)

# Just the tag name
element.name

# Prettified HTML
element.prettify()
```
## DOM Navigation

### Parent/Ancestors

```python
element.parent
element.parents  # Generator of all ancestors

# Find a specific ancestor
for parent in element.parents:
    if parent.name == 'div' and 'article' in parent.get('class', []):
        break
```
### Children

```python
element.children     # Direct children (generator)
list(element.children)
element.contents     # Direct children (list)
element.descendants  # All descendants (generator)

# Find in children
element.find('span')  # Searches all descendants
```
### Siblings

```python
element.next_sibling
element.previous_sibling
element.next_siblings      # Generator
element.previous_siblings  # Generator

# Next/previous parsed element in document order
# (may be a text node, including whitespace)
element.next_element
element.previous_element
```
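One gotcha worth seeing in a small example: `next_sibling` can return a whitespace text node between tags, while `find_next_sibling()` skips straight to the next matching tag (`html.parser` used here to keep the snippet dependency-free):

```python
from bs4 import BeautifulSoup

html = """<ul>
  <li id="a">one</li>
  <li id="b">two</li>
</ul>"""
soup = BeautifulSoup(html, 'html.parser')
first = soup.find('li', id='a')

print(repr(first.next_sibling))       # a whitespace text node between the tags
print(first.find_next_sibling('li'))  # <li id="b">two</li>
```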
## Data Extraction Patterns

### Safe Extraction

```python
def safe_text(element, selector, default=''):
    """Safely extract text from an element."""
    found = element.select_one(selector)
    return found.get_text(strip=True) if found else default

def safe_attr(element, selector, attr, default=None):
    """Safely extract an attribute from an element."""
    found = element.select_one(selector)
    return found.get(attr, default) if found else default
```

### Table Extraction
```python
def extract_table(table):
    """Extract table data as a list of dictionaries."""
    headers = [th.get_text(strip=True) for th in table.select('th')]
    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        if cells:
            rows.append(dict(zip(headers, cells)))
    return rows
```

### List Extraction

```python
def extract_items(soup, selector, extractor):
    """Extract multiple items using a custom extractor function."""
    return [extractor(item) for item in soup.select(selector)]

# Usage
def extract_product(item):
    return {
        'name': safe_text(item, '.name'),
        'price': safe_text(item, '.price'),
        'url': safe_attr(item, 'a', 'href')
    }

products = extract_items(soup, '.product', extract_product)
```
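To make the table pattern concrete, here it is exercised on a minimal inline table (the function is repeated so the snippet runs standalone, and `html.parser` avoids the lxml dependency):

```python
from bs4 import BeautifulSoup

def extract_table(table):
    """Extract table data as a list of dictionaries."""
    headers = [th.get_text(strip=True) for th in table.select('th')]
    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        if cells:
            rows.append(dict(zip(headers, cells)))
    return rows

html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tbody>
    <tr><td>Widget</td><td>9.99</td></tr>
    <tr><td>Gadget</td><td>19.99</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
print(extract_table(soup.table))
# [{'Name': 'Widget', 'Price': '9.99'}, {'Name': 'Gadget', 'Price': '19.99'}]
```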
## URL Resolution

```python
from urllib.parse import urljoin

def resolve_url(base_url, relative_url):
    """Convert a relative URL to an absolute one."""
    if not relative_url:
        return None
    return urljoin(base_url, relative_url)

# Usage
base_url = 'https://example.com/products/'
for link in soup.select('a'):
    href = link.get('href')
    absolute_url = resolve_url(base_url, href)
    print(absolute_url)
```
## Handling Malformed HTML

```python
# The lxml parser is lenient with malformed HTML
soup = BeautifulSoup(malformed_html, 'lxml')

# For very broken HTML, use html5lib
soup = BeautifulSoup(very_broken_html, 'html5lib')

# Handle encoding issues
response = requests.get(url)
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, 'lxml')
```
## Complete Scraping Example

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


class ProductScraper:
    def __init__(self, base_url):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'
        })

    def fetch_page(self, url):
        """Fetch and parse a page."""
        response = self.session.get(url, timeout=30)
        response.raise_for_status()
        return BeautifulSoup(response.content, 'lxml')

    def extract_product(self, item):
        """Extract product data from a card element."""
        return {
            'name': self._safe_text(item, '.product-title'),
            'price': self._parse_price(item.select_one('.price')),
            'rating': self._safe_attr(item, '.rating', 'data-rating'),
            'image': self._resolve(self._safe_attr(item, 'img', 'src')),
            'url': self._resolve(self._safe_attr(item, 'a', 'href')),
            'in_stock': not item.select_one('.out-of-stock')
        }

    def scrape_products(self, url):
        """Scrape all products from a page."""
        soup = self.fetch_page(url)
        items = soup.select('.product-card')
        return [self.extract_product(item) for item in items]

    def _safe_text(self, element, selector, default=''):
        found = element.select_one(selector)
        return found.get_text(strip=True) if found else default

    def _safe_attr(self, element, selector, attr, default=None):
        found = element.select_one(selector)
        return found.get(attr, default) if found else default

    def _parse_price(self, element):
        if not element:
            return None
        text = element.get_text(strip=True)
        try:
            return float(text.replace('$', '').replace(',', ''))
        except ValueError:
            return None

    def _resolve(self, url):
        return urljoin(self.base_url, url) if url else None


# Usage
scraper = ProductScraper('https://example.com')
products = scraper.scrape_products('https://example.com/products')
for product in products:
    print(product)
```
## Performance Optimization

```python
# Use SoupStrainer to parse only the needed elements
from bs4 import SoupStrainer

only_articles = SoupStrainer('article')
soup = BeautifulSoup(html, 'lxml', parse_only=only_articles)

# Use the lxml parser for speed
soup = BeautifulSoup(html, 'lxml')  # Fastest

# Decompose unneeded elements
for script in soup.find_all('script'):
    script.decompose()

# Use a generator for memory efficiency
def iter_items(soup):
    for item in soup.select('.item'):
        yield extract_data(item)
```
## Key Dependencies

- beautifulsoup4
- lxml (fast parser)
- html5lib (lenient parser)
- requests
- pandas (for data output)
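A typical pandas hand-off, assuming extraction produced a list of dicts (the column names and filename here are illustrative):

```python
import pandas as pd

# Extracted rows, e.g. from a table or product scraper
rows = [
    {'name': 'Widget', 'price': '9.99'},
    {'name': 'Gadget', 'price': '19.99'},
]

df = pd.DataFrame(rows)
df['price'] = pd.to_numeric(df['price'], errors='coerce')  # validate types
csv_text = df.to_csv(index=False)  # or df.to_csv('products.csv', index=False)
```

`errors='coerce'` turns unparseable prices into `NaN` instead of raising, which keeps the pipeline robust to dirty data.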
## Best Practices

- Always use the lxml parser for best performance
- Handle missing elements with default values
- Use `select()` and `select_one()` for CSS selectors
- Use `get_text(strip=True)` for clean text extraction
- Resolve relative URLs to absolute URLs
- Validate extracted data types
- Implement rate limiting between requests
- Use proper User-Agent headers
- Handle character encoding properly
- Use SoupStrainer for large documents
- Follow robots.txt and website terms of service
- Implement retry logic for failed requests
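The rate-limiting and retry bullets can be sketched together. The fetcher is injected as a callable so the policy stays reusable and testable; the names and default values here are illustrative, not a fixed API:

```python
import time

def fetch_with_retries(fetch, url, retries=3, delay=1.0, backoff=2.0):
    """Call fetch(url), retrying with exponential backoff before giving up."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay * (backoff ** attempt))

def scrape_politely(fetch, urls, pause=1.0):
    """Rate-limit by pausing between consecutive requests."""
    results = []
    for url in urls:
        results.append(fetch_with_retries(fetch, url))
        time.sleep(pause)  # be polite between requests
    return results
```

Any requests-based fetcher (such as a session `get` wrapped to raise on failure) can be passed in as `fetch`.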