llm-intelligent-public-opinion-analytics
LLM-Based Intelligent Public Opinion Analytics Assistant
Overview
This project is a comprehensive public opinion analytics platform that combines real-time data from 26 hot lists across 15 mainstream platforms (Weibo, Bilibili, Zhihu, Baidu, etc.) with large language model (LLM) analysis capabilities. It provides conversational query interfaces for hot searches, topic clustering, sentiment analysis, and multi-channel push notifications (WeChat, Email, Telegram).
Key Capabilities:
- Real-time crawler cluster for 15+ platforms
- LLM-powered content analysis (including video content extraction)
- Natural language query interface
- Topic clustering and sentiment analysis
- Multi-channel alert system (Email, WeChat Work, Telegram)
- Keyboard shortcuts for crawler control
Installation
Prerequisites
1. **Browser Driver Setup** (required for detail page scraping):
```bash
# Check your Chrome/Edge version first
# Chrome: chrome://settings/help
# Edge: edge://settings/help

# Download the driver that matches your browser version:
# ChromeDriver: https://chromedriver.chromium.org/
# (drivers for Chrome 115 and later are published through Chrome for Testing)

# Linux/macOS - place the driver in PATH:
sudo mv chromedriver /usr/local/bin/
sudo chmod +x /usr/local/bin/chromedriver

# Verify installation:
chromedriver --version
```
2. **MySQL Database**:
```bash
# Install MySQL 8.0+, then create the database and user.
# The SQL statements are passed to the mysql client through a heredoc;
# replace 'your_password' before running.
mysql -u root -p <<'SQL'
CREATE DATABASE hotsearch_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'hotsearch_user'@'localhost' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON hotsearch_db.* TO 'hotsearch_user'@'localhost';
FLUSH PRIVILEGES;
SQL
```
3. **Python Environment**:
```bash
# Clone repository
git clone https://github.com/hmmnxkl/LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant.git
cd LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
Database Initialization
Reference the `init.py` file to create the necessary tables:
```python
# Example table structure (adapt from init.py)
import pymysql

connection = pymysql.connect(
    host='localhost',
    user='hotsearch_user',
    password='your_password',
    database='hotsearch_db',
    charset='utf8mb4'
)

try:
    with connection.cursor() as cursor:
        # Hot search items table.
        # `rank` is backtick-quoted because RANK is a reserved word in MySQL 8.0.
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS hot_search_items (
                id INT AUTO_INCREMENT PRIMARY KEY,
                platform VARCHAR(50) NOT NULL,
                `rank` INT,
                title VARCHAR(500) NOT NULL,
                url VARCHAR(1000),
                heat_value VARCHAR(100),
                crawl_time DATETIME NOT NULL,
                detail_content TEXT,
                sentiment VARCHAR(20),
                INDEX idx_platform (platform),
                INDEX idx_crawl_time (crawl_time)
            ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
        """)
    connection.commit()
finally:
    # Close the connection even if table creation fails
    connection.close()
```
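To illustrate how a crawled item could be written into `hot_search_items`, here is a minimal sketch of a parameterized insert. The item dictionary layout and the helper name `build_insert_params` are assumptions made for this example, not part of the project's code. The column is written as `` `rank` `` because RANK is a reserved word in MySQL 8.0.

```python
from datetime import datetime

# Parameterized statement in PyMySQL's %s placeholder style.
INSERT_SQL = (
    "INSERT INTO hot_search_items "
    "(platform, `rank`, title, url, heat_value, crawl_time) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def build_insert_params(item: dict, crawl_time: datetime) -> tuple:
    """Return the parameter tuple for INSERT_SQL from a crawled item dict (hypothetical layout)."""
    return (
        item["platform"],
        item.get("rank"),
        item["title"][:500],               # title column is VARCHAR(500)
        item.get("url"),
        str(item.get("heat_value", "")),   # heat_value is stored as text
        crawl_time.strftime("%Y-%m-%d %H:%M:%S"),
    )


# Hypothetical crawled item, for illustration only
example_item = {
    "platform": "weibo",
    "rank": 1,
    "title": "Example trending topic",
    "url": "https://example.com/topic/1",
    "heat_value": 1234567,
}
params = build_insert_params(example_item, datetime(2026, 1, 1, 8, 0, 0))
print(params)
```

With an open PyMySQL connection, the tuple would be passed as `cursor.execute(INSERT_SQL, params)`, which lets the driver escape the values instead of building the SQL string by hand.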
Configuration
Environment Variables
Create a `.env` file in the project root:
```bash
# Database Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=hotsearch_user
MYSQL_PASSWORD=your_password
MYSQL_DATABASE=hotsearch_db

# LLM API Configuration (OpenAI-compatible format)
OPENAI_API_KEY=your_api_key
OPENAI_API_BASE=https://your-llm-endpoint.com/v1
OPENAI_MODEL=gpt-4

# Huawei Pangu Model (recommended alternative)
PANGU_API_KEY=your_pangu_key
PANGU_API_BASE=https://pangu-api.huaweicloud.com

# Push Notification Channels
# Email (SMTP)
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=your_email@gmail.com
SMTP_PASSWORD=your_app_password

# WeChat Work Bot
WECHAT_WORK_WEBHOOK=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY

# WeChat Work Application
WECHAT_WORK_CORP_ID=your_corp_id
WECHAT_WORK_APP_SECRET=your_app_secret
WECHAT_WORK_AGENT_ID=your_agent_id

# Telegram Bot
TELEGRAM_BOT_TOKEN=your_bot_token
TELEGRAM_CHAT_ID=your_chat_id
```
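A quick way to catch configuration mistakes early is to parse the `.env` file and check that the database keys are present. The sketch below is dependency-free and purely illustrative; the project itself may load the file differently (for example through a dotenv library), and the `parse_env` and `missing_keys` helpers are hypothetical names.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values such as webhook URLs
        # containing "?key=..." are kept intact.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


REQUIRED_KEYS = ("MYSQL_HOST", "MYSQL_USER", "MYSQL_PASSWORD", "MYSQL_DATABASE")


def missing_keys(values: dict, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


sample = """
# Database Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=hotsearch_user
WECHAT_WORK_WEBHOOK=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY
"""
config = parse_env(sample)
print(missing_keys(config))
```

In this sample the password and database name are deliberately left out, so the check reports them as missing.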
Crawler Settings
Edit `hotsearchcrawler/settings.py`:
```python
import os

# MySQL Connection Pool
MYSQL_CONFIG = {
    'host': os.getenv('MYSQL_HOST', 'localhost'),
    'port': int(os.getenv('MYSQL_PORT', '3306')),
    'user': os.getenv('MYSQL_USER'),
    'password': os.getenv('MYSQL_PASSWORD'),
    'database': os.getenv('MYSQL_DATABASE'),
    'charset': 'utf8mb4',
    'autocommit': True
}

# Optional: Platform-specific cookies for authenticated access
PLATFORM_COOKIES = {
    'weibo': 'your_weibo_cookies',  # Optional, for better access
    'bilibili': 'your_bilibili_cookies'
}

# Concurrent requests
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1

# User-Agent rotation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
]
```
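`USER_AGENTS` is not a built-in Scrapy setting, so something has to read the list and apply it to outgoing requests. One possible approach is a small downloader middleware, sketched below. The class name and its location are assumptions for illustration; the project may already ship its own middleware. The stand-in request class exists only so the example runs without Scrapy installed.

```python
import random


class RotatingUserAgentMiddleware:
    """Downloader middleware sketch: set a random User-Agent on each request."""

    def __init__(self, user_agents, rng=None):
        if not user_agents:
            raise ValueError("USER_AGENTS must contain at least one entry")
        self.user_agents = list(user_agents)
        self.rng = rng or random.Random()

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook and exposes the project settings on the crawler
        return cls(crawler.settings.getlist("USER_AGENTS"))

    def process_request(self, request, spider):
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        return None  # let Scrapy continue processing the request


class FakeRequest:
    """Minimal stand-in for a Scrapy request, used only for this demo."""

    def __init__(self):
        self.headers = {}


agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
middleware = RotatingUserAgentMiddleware(agents)
request = FakeRequest()
middleware.process_request(request, spider=None)
print(request.headers["User-Agent"])
```

To use such a class it would need to be registered in `DOWNLOADER_MIDDLEWARES` under whatever module path it is saved in.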
Usage
Starting the System
```bash
# Activate virtual environment
source venv/bin/activate

# Start the main application (web interface + API)
python app.py

# Access web interface at http://localhost:5000
```
Crawler Management
```bash
# Manual crawler test (single platform)
cd hotsearchcrawler
python runspider-test.py

# Start all crawlers (typically triggered via web UI)
python run_spiders.py
```
**Via Web Interface:**
- Use keyboard shortcuts to start/stop crawlers
- View real-time crawling status
- Monitor data collection metrics
Natural Language Queries
```python
# Examples of conversational queries via web interface:
"Show me today's top 10 trending topics on Weibo"
"What's trending about AI technology across all platforms?"
"Analyze sentiment for news about electric vehicles"
"Cluster topics related to economic policy"
"Compare hot topics between Bilibili and Zhihu"
```
Programmatic API Usage
```python
from hotsearch_analysis_agent.analyzer import OpinionAnalyzer
from datetime import datetime, timedelta

# Initialize analyzer
analyzer = OpinionAnalyzer()

# Query hot searches
results = analyzer.query_hot_searches(
    platforms=['weibo', 'zhihu', 'bilibili'],
    time_range=(datetime.now() - timedelta(hours=24), datetime.now()),
    keyword='人工智能'
)

# Perform sentiment analysis
sentiment = analyzer.analyze_sentiment(results)
print(f"Overall sentiment: {sentiment['overall']}")
print(f"Positive: {sentiment['positive_ratio']}%")

# Topic clustering
clusters = analyzer.cluster_topics(results, num_clusters=5)
for i, cluster in enumerate(clusters):
    print(f"Cluster {i+1}: {cluster['keywords']}")
    print(f"  Items: {len(cluster['items'])}")
```
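For readers wondering what a summary with `overall` and `positive_ratio` keys might be built from, the sketch below derives both from a list of per-item labels. It is an assumption about the shape of the data, not the analyzer's actual logic: the label set (`positive`, `neutral`, `negative`) and the helper name `summarize_sentiment` are invented for this example.

```python
from collections import Counter


def summarize_sentiment(labels):
    """Summarize per-item sentiment labels into an overall label and a positive percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {"overall": None, "positive_ratio": 0.0}
    overall = counts.most_common(1)[0][0]  # most frequent label
    positive_ratio = round(100 * counts["positive"] / total, 1)
    return {"overall": overall, "positive_ratio": positive_ratio}


summary = summarize_sentiment(["positive", "positive", "neutral", "negative"])
print(f"Overall sentiment: {summary['overall']}")
print(f"Positive: {summary['positive_ratio']}%")
```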
Push Notification Setup
```python
from hotsearch_analysis_agent.push_service import PushService

# Initialize push service
push_service = PushService()

# Create scheduled push task
task = push_service.create_task(
    name="AI Technology Daily Report",
    keywords=['人工智能', '大模型', '机器学习'],
    platforms=['weibo', 'zhihu', 'bilibili'],
    schedule='0 8,12,18 * * *',  # Cron format: 8am, 12pm, 6pm daily
    channels=['wechat_work', 'email'],
    threshold={'heat_value': 100000, 'sentiment': 'positive'}
)

# Test push task (run from a shell, not inside Python):
#   python test_push_task.py
```
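The `schedule` value uses the five-field cron layout (minute, hour, day of month, month, day of week). As a rough illustration of how `'0 8,12,18 * * *'` maps to three daily run times, the sketch below expands only the minute and hour fields. It handles `*`, comma lists, and ranges, ignores step syntax and the date fields, and is not the scheduler the project uses.

```python
def expand_field(field: str, lowest: int, highest: int) -> list:
    """Expand one cron field ('*', 'a,b,c', or 'a-b') into a sorted list of ints."""
    values = set()
    for part in field.split(","):
        if part == "*":
            values.update(range(lowest, highest + 1))
        elif "-" in part:
            start, end = part.split("-")
            values.update(range(int(start), int(end) + 1))
        else:
            values.add(int(part))
    return sorted(values)


def daily_fire_times(expression: str) -> list:
    """Return HH:MM strings for the minute/hour fields of a 5-field cron expression."""
    minute, hour, *_ = expression.split()
    return [
        f"{h:02d}:{m:02d}"
        for h in expand_field(hour, 0, 23)
        for m in expand_field(minute, 0, 59)
    ]


times = daily_fire_times("0 8,12,18 * * *")
print(times)
```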
Analysis Report Generation
```python
from datetime import datetime, timedelta

from hotsearch_analysis_agent.report_generator import ReportGenerator

generator = ReportGenerator()

# Generate comprehensive report
report = generator.generate_report(
    topic="人工智能与前沿科技",
    time_range=(datetime.now() - timedelta(days=7), datetime.now()),
    include_sentiment=True,
    include_clustering=True,
    include_trend_analysis=True
)

# Report includes:
# - Core findings with data highlights
# - Detailed news content with source URLs
# - Sentiment distribution
# - Topic clusters
# - Trend analysis
# - Information spread characteristics

# Save report
report.save_markdown('output/ai_tech_report.md')
report.save_pdf('output/ai_tech_report.pdf')
```
Common Patterns
Multi-Platform Data Aggregation
```python
from hotsearch_analysis_agent.aggregator import DataAggregator

aggregator = DataAggregator()

# Fetch and merge data from multiple platforms
merged_data = aggregator.aggregate(
    platforms=['weibo', 'douyin', 'zhihu', 'bilibili', 'baidu'],
    dedup_threshold=0.8,  # Similarity threshold for deduplication
    sort_by='heat_value',
    limit=50
)

# Cross-platform topic correlation
correlations = aggregator.find_correlations(merged_data)
print(f"Found {len(correlations)} cross-platform trending topics")
```
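The `dedup_threshold` parameter suggests that near-identical titles from different platforms are merged when their similarity reaches 0.8. How the aggregator measures similarity is not stated, so the sketch below swaps in the standard library's `difflib.SequenceMatcher` as one plausible measure. The item layout and the numeric `heat_value` are assumptions for this example.

```python
from difflib import SequenceMatcher


def deduplicate(items, threshold=0.8):
    """Keep the hottest item of each group of near-duplicate titles."""
    kept = []
    # Visit items from hottest to coldest so the hottest duplicate survives
    for item in sorted(items, key=lambda entry: entry["heat_value"], reverse=True):
        is_duplicate = any(
            SequenceMatcher(None, item["title"], other["title"]).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(item)
    return kept


# Hypothetical items, for illustration only
items = [
    {"platform": "weibo", "title": "AI大模型发布新版本", "heat_value": 120000},
    {"platform": "zhihu", "title": "AI大模型发布了新版本", "heat_value": 90000},
    {"platform": "baidu", "title": "新能源汽车销量创新高", "heat_value": 50000},
]
unique_items = deduplicate(items)
print([entry["platform"] for entry in unique_items])
```

The two AI titles differ by a single character, so the lower-heat Zhihu entry is dropped, while the unrelated Baidu entry is kept.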
Video Content Analysis
```python
# The system automatically extracts text from video news
# using browser automation and LLM analysis
from hotsearch_analysis_agent.video_analyzer import VideoAnalyzer

video_analyzer = VideoAnalyzer()

# Analyze video-based hot topics (e.g., from Bilibili, Douyin)
video_topics = video_analyzer.extract_content(
    url='https://www.bilibili.com/video/BV13pSoBBEvX/',
    extract_comments=True,
    max_comments=100
)
print(f"Video title: {video_topics['title']}")
print(f"Description: {video_topics['description']}")
print(f"Top comments sentiment: {video_topics['comments_sentiment']}")
```
Custom LLM Integration
```python
import os

from hotsearch_analysis_agent.llm_client import LLMClient

# Use Huawei Pangu Model (recommended)
llm = LLMClient(
    api_base=os.getenv('PANGU_API_BASE'),
    api_key=os.getenv('PANGU_API_KEY'),
    model='pangu-embedded-7b'
)

# Or use any OpenAI-compatible endpoint
llm = LLMClient(
    api_base=os.getenv('OPENAI_API_BASE'),
    api_key=os.getenv('OPENAI_API_KEY'),
    model='gpt-4'
)

# Analyze custom content
news_content = "..."  # placeholder: the text you want analyzed
analysis = llm.analyze(
    content=news_content,
    task='sentiment_and_summary',
    language='zh'
)
```
Scheduled Monitoring
```python
from hotsearch_analysis_agent.scheduler import MonitorScheduler

scheduler = MonitorScheduler()

# Add monitoring rule
scheduler.add_rule(
    name="Tech Company Crisis Monitoring",
    keywords=['某公司', '丑闻', '争议'],
    alert_conditions={
        'heat_spike': 2.0,       # 2x normal heat
        'sentiment_drop': -0.3,  # 30% sentiment decrease
        'platforms_count': 3     # Trending on 3+ platforms
    },
    notification_channels=['wechat_work', 'telegram', 'email'],
    urgent=True
)

# Start scheduler
scheduler.start()
```
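The three `alert_conditions` can be read as thresholds on heat, sentiment, and platform spread. The sketch below shows one possible interpretation in which all three must hold before an alert fires. The statistics dictionary, the treatment of `sentiment_drop` as an absolute change in a score between -1 and 1, and the AND combination are assumptions; the scheduler's real rule evaluation may differ.

```python
def should_alert(stats: dict, conditions: dict) -> bool:
    """Return True when every alert condition is met (assumed AND semantics)."""
    heat_spiked = stats["current_heat"] >= conditions["heat_spike"] * stats["baseline_heat"]
    sentiment_change = stats["current_sentiment"] - stats["baseline_sentiment"]
    sentiment_dropped = sentiment_change <= conditions["sentiment_drop"]
    widely_trending = len(stats["platforms"]) >= conditions["platforms_count"]
    return heat_spiked and sentiment_dropped and widely_trending


conditions = {"heat_spike": 2.0, "sentiment_drop": -0.3, "platforms_count": 3}

# Hypothetical topic statistics, for illustration only
crisis = {
    "current_heat": 250000, "baseline_heat": 100000,
    "current_sentiment": 0.1, "baseline_sentiment": 0.5,
    "platforms": ["weibo", "zhihu", "bilibili"],
}
quiet = dict(crisis, current_heat=150000)  # heat is only 1.5x the baseline

print(should_alert(crisis, conditions))
print(should_alert(quiet, conditions))
```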
Troubleshooting
Browser Driver Issues
```bash
# Error: "Message: 'chromedriver' executable needs to be in PATH"

# Solution: Verify driver installation
which chromedriver  # Should return path

# If not found, reinstall:
# 1. Check browser version
google-chrome --version  # or microsoft-edge --version
# 2. Download exact matching driver version
# 3. Place in /usr/local/bin/ and chmod +x

# Alternative: Specify driver path in settings
CHROMEDRIVER_PATH=/path/to/chromedriver
```
Database Connection Errors
```bash
# Error: "Can't connect to MySQL server"

# Check MySQL service (the unit is named mysqld on some distributions)
sudo systemctl status mysql

# Verify credentials
mysql -u hotsearch_user -p -h localhost hotsearch_db

# Check .env file encoding (must be UTF-8 without BOM)
file -I .env  # macOS; on Linux use: file -i .env. Should show charset=utf-8 (or us-ascii)
```
```python
# Test connection in Python
# (the MYSQL_* variables must already be set in the environment)
import os

import pymysql

try:
    conn = pymysql.connect(
        host=os.getenv('MYSQL_HOST'),
        user=os.getenv('MYSQL_USER'),
        password=os.getenv('MYSQL_PASSWORD'),
        database=os.getenv('MYSQL_DATABASE')
    )
    print("Connection successful")
    conn.close()
except Exception as e:
    print(f"Error: {e}")
```
Crawler Rate Limiting
```python
# Error: HTTP 429 or blocked requests

# Solution: Adjust crawler settings
# In hotsearchcrawler/settings.py:
CONCURRENT_REQUESTS = 8  # Reduce from 16
DOWNLOAD_DELAY = 2       # Increase delay

# Enable AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10

# Rotate User-Agents (requires the scrapy-user-agents package)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
```
LLM API Timeouts
```python
# Error: Request timeout or rate limit

# Solution: Implement retry logic and fallback
from tenacity import retry, stop_after_attempt, wait_exponential

# `llm` is an LLMClient instance (see Custom LLM Integration)
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(prompt):
    return llm.analyze(prompt)

# Use batch processing for large datasets
from hotsearch_analysis_agent.batch_processor import BatchProcessor

# `news_items` is your list of items; `analyze_func` is the per-item callable
processor = BatchProcessor(batch_size=10, delay=2)
results = processor.process_items(news_items, analyze_func)
```
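The idea behind batching with a delay is simply to pause between groups of calls so the LLM endpoint is not hit in one burst. The standalone sketch below shows that pattern; it is not the project's `BatchProcessor`, and the function name and the injectable `sleep` parameter are choices made for this example.

```python
import time


def process_in_batches(items, func, batch_size=10, delay=2.0, sleep=time.sleep):
    """Apply func to every item, pausing `delay` seconds between batches."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(func(item) for item in batch)
        # Pause only if another batch follows
        if start + batch_size < len(items):
            sleep(delay)
    return results


# Demo with a stand-in analysis function and no real waiting
titles = [f"headline {n}" for n in range(25)]
lengths = process_in_batches(titles, len, batch_size=10, delay=0)
print(len(lengths))
```

Passing the sleep function as a parameter keeps the helper easy to test, since a stub can record the pauses instead of actually waiting.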
Memory Issues with Large Datasets
```python
# Error: MemoryError or slow processing

# Solution: Use pagination and streaming
from hotsearch_analysis_agent.db_client import DBClient

db = DBClient()

# Stream results instead of loading all at once
for batch in db.stream_hot_searches(batch_size=100):
    process_batch(batch)
    # Process and discard to free memory

# Use database aggregation instead of in-memory
aggregated = db.aggregate_by_platform(
    start_date='2026-01-01',
    end_date='2026-05-01'
)
```
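Streaming in batches usually comes down to calling the DB-API `fetchmany` method in a loop. The sketch below demonstrates the pattern with an in-memory SQLite database so that it runs anywhere; it is not the project's `DBClient`. With PyMySQL the same generator works on any cursor, though the default cursor buffers the full result set on the client, so an unbuffered cursor such as `pymysql.cursors.SSCursor` is the one that actually saves memory.

```python
import sqlite3


def stream_rows(cursor, batch_size=100):
    """Yield lists of rows from an executed DB-API cursor, batch_size rows at a time."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# In-memory SQLite stands in for MySQL in this demo
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hot_search_items (id INTEGER PRIMARY KEY, platform TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO hot_search_items (platform, title) VALUES (?, ?)",
    [("weibo", f"topic {n}") for n in range(250)],
)

cursor = conn.execute("SELECT id, platform, title FROM hot_search_items ORDER BY id")
batch_sizes = [len(batch) for batch in stream_rows(cursor, batch_size=100)]
conn.close()
print(batch_sizes)  # [100, 100, 50]
```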
Project Structure Reference
```
.
├── app.py                          # Main application entry
├── hotsearch_analysis_agent/       # Analysis system
│   ├── analyzer.py                 # Core analysis logic
│   ├── llm_client.py               # LLM integration
│   ├── report_generator.py         # Report generation
│   ├── push_service.py             # Notification service
│   └── scheduler.py                # Task scheduling
├── hotsearchcrawler/               # Crawler cluster
│   ├── spiders/                    # Platform-specific spiders
│   ├── settings.py                 # Crawler settings
│   └── run_spiders.py              # Crawler launcher
├── test_push_task.py               # Push notification testing
├── runspider-test.py               # Single crawler testing
├── init.py                         # Database initialization
├── requirements.txt                # Python dependencies
└── .env                            # Environment configuration
```
Best Practices
- Database Indexing: Ensure indexes on the `platform`, `crawl_time`, and `title` columns for fast queries
- LLM Cost Management: Cache analysis results to avoid redundant API calls
- Crawler Politeness: Respect platform rate limits and robots.txt
- Notification Throttling: Implement cooldown periods to avoid alert fatigue
- Data Retention: Set up automatic archival for data older than 90 days
- Model Choice: Consider Huawei Pangu for better Chinese language understanding and local deployment
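As an illustration of the caching advice for LLM cost management, the sketch below keys results on a hash of the task and content so that repeated analysis of the same text is served from memory. The `AnalysisCache` class and the stand-in analysis function are hypothetical; a real deployment would more likely persist the cache in the database or another shared store.

```python
import hashlib


class AnalysisCache:
    """In-memory cache keyed by a SHA-256 hash of the task name and content."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(content: str, task: str) -> str:
        return hashlib.sha256(f"{task}\n{content}".encode("utf-8")).hexdigest()

    def get_or_compute(self, content: str, task: str, compute):
        """Return the cached result, calling compute(content) only on a miss."""
        key = self._key(content, task)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(content)
        self._store[key] = result
        return result


calls = []


def fake_llm_analyze(content):
    """Stand-in for a paid LLM call; records each invocation."""
    calls.append(content)
    return {"summary": content[:20]}


cache = AnalysisCache()
first = cache.get_or_compute("Example news text", "sentiment_and_summary", fake_llm_analyze)
second = cache.get_or_compute("Example news text", "sentiment_and_summary", fake_llm_analyze)
print(len(calls), cache.hits, cache.misses)
```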