llm-intelligent-public-opinion-analytics

LLM-Based Intelligent Public Opinion Analytics Assistant


Skill by ara.so — Data Skills collection.

Overview

This project is a public opinion analytics platform that combines real-time data from 26 hot lists across 15 mainstream platforms (Weibo, Bilibili, Zhihu, Baidu, and others) with large language model (LLM) analysis. It provides a conversational query interface for hot searches, topic clustering, sentiment analysis, and push notifications over several channels (WeChat Work, email, Telegram).

Key Capabilities:
  • Real-time crawler cluster for 15+ platforms
  • LLM-powered content analysis (including video content extraction)
  • Natural language query interface
  • Topic clustering and sentiment analysis
  • Multi-channel alert system (Email, WeChat Work, Telegram)
  • Keyboard shortcuts for crawler control

Installation

Prerequisites

  1. Browser Driver Setup (required for detail page scraping):

```bash
# Check your Chrome/Edge version first
# Chrome: chrome://settings/help
# Edge: edge://settings/help

# Download matching driver:

# Linux/macOS - place driver in PATH:
sudo mv chromedriver /usr/local/bin/
sudo chmod +x /usr/local/bin/chromedriver

# Verify installation:
chromedriver --version
```

  2. MySQL Database:

```bash
# Install MySQL 8.0+

# Create database and user (run the SQL statements at the mysql prompt)
mysql -u root -p
CREATE DATABASE hotsearch_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'hotsearch_user'@'localhost' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON hotsearch_db.* TO 'hotsearch_user'@'localhost';
FLUSH PRIVILEGES;
```

  3. Python Environment:

```bash
# Clone repository
git clone https://github.com/hmmnxkl/LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant.git
cd LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Database Initialization

Reference the init.py file to create the necessary tables:

```python
# Example table structure (adapt from init.py)
import pymysql

connection = pymysql.connect(
    host='localhost',
    user='hotsearch_user',
    password='your_password',
    database='hotsearch_db',
    charset='utf8mb4'
)
cursor = connection.cursor()

# Hot search items table
# `rank` is backtick-quoted because RANK is a reserved word in MySQL 8.0+
cursor.execute("""
    CREATE TABLE IF NOT EXISTS hot_search_items (
        id INT AUTO_INCREMENT PRIMARY KEY,
        platform VARCHAR(50) NOT NULL,
        `rank` INT,
        title VARCHAR(500) NOT NULL,
        url VARCHAR(1000),
        heat_value VARCHAR(100),
        crawl_time DATETIME NOT NULL,
        detail_content TEXT,
        sentiment VARCHAR(20),
        INDEX idx_platform (platform),
        INDEX idx_crawl_time (crawl_time)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
""")

connection.commit()
connection.close()
```

Configuration

Environment Variables

Create a .env file in the project root:

```bash
# Database Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=hotsearch_user
MYSQL_PASSWORD=your_password
MYSQL_DATABASE=hotsearch_db

# LLM API Configuration (OpenAI-compatible format)
OPENAI_API_KEY=your_api_key
OPENAI_API_BASE=https://your-llm-endpoint.com/v1
OPENAI_MODEL=gpt-4

# Huawei Pangu Model (recommended alternative)
PANGU_API_KEY=your_pangu_key
PANGU_API_BASE=https://pangu-api.huaweicloud.com

# Push Notification Channels

# Email (SMTP)
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=your_email@gmail.com
SMTP_PASSWORD=your_app_password

# WeChat Work Bot

# WeChat Work Application
WECHAT_WORK_CORP_ID=your_corp_id
WECHAT_WORK_APP_SECRET=your_app_secret
WECHAT_WORK_AGENT_ID=your_agent_id

# Telegram Bot
TELEGRAM_BOT_TOKEN=your_bot_token
TELEGRAM_CHAT_ID=your_chat_id
```
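
The database variables above could be read and validated at startup along the following lines. This is a minimal sketch, not the project's own loader: the function name `load_mysql_config`, the `REQUIRED_KEYS` tuple, and the decision to fail fast on missing values are all assumptions made for illustration. The function takes any mapping, so it can be given `os.environ` in a real run or a plain dictionary in a test.

```python
from typing import Mapping

# Hypothetical list of settings that have no sensible default.
REQUIRED_KEYS = ("MYSQL_USER", "MYSQL_PASSWORD", "MYSQL_DATABASE")


def load_mysql_config(env: Mapping[str, str]) -> dict:
    """Build a connection config from environment-style key/value pairs."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError(f"Missing required settings: {', '.join(missing)}")
    return {
        "host": env.get("MYSQL_HOST", "localhost"),
        "port": int(env.get("MYSQL_PORT", "3306")),
        "user": env["MYSQL_USER"],
        "password": env["MYSQL_PASSWORD"],
        "database": env["MYSQL_DATABASE"],
        "charset": "utf8mb4",
    }


# Example with an in-memory mapping; pass os.environ in a real run.
example_env = {
    "MYSQL_USER": "hotsearch_user",
    "MYSQL_PASSWORD": "your_password",
    "MYSQL_DATABASE": "hotsearch_db",
}
config = load_mysql_config(example_env)
```

Failing early with a list of the missing names tends to be easier to debug than a connection error raised later with a `None` user or password.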

Crawler Settings

Edit hotsearchcrawler/settings.py:

```python
import os

# MySQL Connection Pool
MYSQL_CONFIG = {
    'host': os.getenv('MYSQL_HOST', 'localhost'),
    'port': int(os.getenv('MYSQL_PORT', 3306)),
    'user': os.getenv('MYSQL_USER'),
    'password': os.getenv('MYSQL_PASSWORD'),
    'database': os.getenv('MYSQL_DATABASE'),
    'charset': 'utf8mb4',
    'autocommit': True
}

# Optional: Platform-specific cookies for authenticated access
PLATFORM_COOKIES = {
    'weibo': 'your_weibo_cookies',  # Optional, for better access
    'bilibili': 'your_bilibili_cookies'
}

# Concurrent requests
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1

# User-Agent rotation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
]
```

Usage

Starting the System

```bash
# Activate virtual environment
source venv/bin/activate

# Start the main application (web interface + API)
python app.py

# Access web interface at http://localhost:5000
```

Crawler Management

```bash
# Manual crawler test (single platform)
cd hotsearchcrawler
python runspider-test.py

# Start all crawlers (typically triggered via web UI)
python run_spiders.py
```

Via Web Interface:
  • Use keyboard shortcuts to start/stop crawlers
  • View real-time crawling status
  • Monitor data collection metrics

Natural Language Queries

Examples of conversational queries via the web interface:
  • "Show me today's top 10 trending topics on Weibo"
  • "What's trending about AI technology across all platforms?"
  • "Analyze sentiment for news about electric vehicles"
  • "Cluster topics related to economic policy"
  • "Compare hot topics between Bilibili and Zhihu"

Programmatic API Usage

```python
from hotsearch_analysis_agent.analyzer import OpinionAnalyzer
from datetime import datetime, timedelta

# Initialize analyzer
analyzer = OpinionAnalyzer()

# Query hot searches
results = analyzer.query_hot_searches(
    platforms=['weibo', 'zhihu', 'bilibili'],
    time_range=(datetime.now() - timedelta(hours=24), datetime.now()),
    keyword='人工智能'
)

# Perform sentiment analysis
sentiment = analyzer.analyze_sentiment(results)
print(f"Overall sentiment: {sentiment['overall']}")
print(f"Positive: {sentiment['positive_ratio']}%")

# Topic clustering
clusters = analyzer.cluster_topics(results, num_clusters=5)
for i, cluster in enumerate(clusters):
    print(f"Cluster {i+1}: {cluster['keywords']}")
    print(f"  Items: {len(cluster['items'])}")
```
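
To make the `overall` and `positive_ratio` fields above more concrete, here is a minimal sketch of how per-item sentiment labels might be rolled up into such a summary. It is an illustration only: the function `sentiment_distribution`, the label names, and the rule that `overall` is the most frequent label are assumptions, and the project's analyzer may compute these values differently.

```python
from collections import Counter


def sentiment_distribution(labels):
    """Summarize a list of labels such as 'positive', 'neutral', 'negative'."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        # No items: report a neutral summary rather than dividing by zero.
        return {"overall": "neutral", "positive_ratio": 0.0, "negative_ratio": 0.0}
    overall = counts.most_common(1)[0][0]  # most frequent label
    return {
        "overall": overall,
        "positive_ratio": round(100 * counts["positive"] / total, 1),
        "negative_ratio": round(100 * counts["negative"] / total, 1),
    }


summary = sentiment_distribution(["positive", "positive", "neutral", "negative"])
print(summary["overall"], summary["positive_ratio"])  # positive 50.0
```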

Push Notification Setup

```python
from hotsearch_analysis_agent.push_service import PushService

# Initialize push service
push_service = PushService()

# Create scheduled push task
task = push_service.create_task(
    name="AI Technology Daily Report",
    keywords=['人工智能', '大模型', '机器学习'],
    platforms=['weibo', 'zhihu', 'bilibili'],
    schedule='0 8,12,18 * * *',  # Cron format: 8am, 12pm, 6pm daily
    channels=['wechat_work', 'email'],
    threshold={'heat_value': 100000, 'sentiment': 'positive'}
)
```

Test the push task from a shell:

```bash
python test_push_task.py
```
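
The cron expression `0 8,12,18 * * *` used above reads as "minute 0 of hours 8, 12, and 18, every day". The sketch below checks whether a given time matches such a schedule. It is deliberately limited: the helper names `_parse_field` and `fires_at` are hypothetical, and it only understands `*` and comma-separated integers in the minute and hour fields, which is an assumption made to keep the example short rather than a description of the project's scheduler.

```python
from datetime import datetime


def _parse_field(field, upper):
    """Expand a cron field limited to '*' or comma-separated integers."""
    if field == "*":
        return set(range(upper))
    return {int(part) for part in field.split(",")}


def fires_at(schedule, moment):
    """Return True if a daily cron schedule matches the given datetime."""
    minute, hour, day, month, weekday = schedule.split()
    if (day, month, weekday) != ("*", "*", "*"):
        raise ValueError("this sketch only handles daily schedules")
    return (moment.minute in _parse_field(minute, 60)
            and moment.hour in _parse_field(hour, 24))


SCHEDULE = "0 8,12,18 * * *"
print(fires_at(SCHEDULE, datetime(2026, 1, 1, 12, 0)))  # True
print(fires_at(SCHEDULE, datetime(2026, 1, 1, 13, 0)))  # False
```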

Analysis Report Generation

```python
from datetime import datetime, timedelta

from hotsearch_analysis_agent.report_generator import ReportGenerator

generator = ReportGenerator()

# Generate comprehensive report
report = generator.generate_report(
    topic="人工智能与前沿科技",
    time_range=(datetime.now() - timedelta(days=7), datetime.now()),
    include_sentiment=True,
    include_clustering=True,
    include_trend_analysis=True
)

# Report includes:
# - Core findings with data highlights
# - Detailed news content with source URLs
# - Sentiment distribution
# - Topic clusters
# - Trend analysis
# - Information spread characteristics

# Save report
report.save_markdown('output/ai_tech_report.md')
report.save_pdf('output/ai_tech_report.pdf')
```

Common Patterns

Multi-Platform Data Aggregation

```python
from hotsearch_analysis_agent.aggregator import DataAggregator

aggregator = DataAggregator()

# Fetch and merge data from multiple platforms
merged_data = aggregator.aggregate(
    platforms=['weibo', 'douyin', 'zhihu', 'bilibili', 'baidu'],
    dedup_threshold=0.8,  # Similarity threshold for deduplication
    sort_by='heat_value',
    limit=50
)

# Cross-platform topic correlation
correlations = aggregator.find_correlations(merged_data)
print(f"Found {len(correlations)} cross-platform trending topics")
```
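
One way to picture what a similarity threshold such as `dedup_threshold=0.8` does is the sketch below, which drops items whose titles are near-duplicates of a hotter item already kept. It is only an illustration: it uses `difflib.SequenceMatcher` from the standard library as the similarity measure, and the function `deduplicate`, the sample items, and the use of numeric heat values are assumptions. The project's aggregator may use a different measure entirely.

```python
from difflib import SequenceMatcher


def deduplicate(items, threshold=0.8):
    """Keep the hottest item from each group of near-identical titles."""
    kept = []
    # Visit items from hottest to coldest so the hottest duplicate survives.
    for item in sorted(items, key=lambda it: it["heat_value"], reverse=True):
        is_duplicate = any(
            SequenceMatcher(None, item["title"], other["title"]).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(item)
    return kept


items = [
    {"platform": "weibo", "title": "New AI model released today", "heat_value": 900},
    {"platform": "zhihu", "title": "New AI model released today!", "heat_value": 500},
    {"platform": "baidu", "title": "Stock market closes higher", "heat_value": 700},
]
unique = deduplicate(items)
print([item["platform"] for item in unique])  # ['weibo', 'baidu']
```

The pairwise comparison is quadratic in the number of items, which is acceptable for a few hundred hot-list entries but would need a cheaper pre-filter for much larger inputs.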

Video Content Analysis

```python
# The system automatically extracts text from video news
# using browser automation and LLM analysis
from hotsearch_analysis_agent.video_analyzer import VideoAnalyzer

video_analyzer = VideoAnalyzer()

# Analyze video-based hot topics (e.g., from Bilibili, Douyin)
video_topics = video_analyzer.extract_content(
    url='https://www.bilibili.com/video/BV13pSoBBEvX/',
    extract_comments=True,
    max_comments=100
)

print(f"Video title: {video_topics['title']}")
print(f"Description: {video_topics['description']}")
print(f"Top comments sentiment: {video_topics['comments_sentiment']}")
```

Custom LLM Integration

```python
import os

from hotsearch_analysis_agent.llm_client import LLMClient

# Use Huawei Pangu Model (recommended)
llm = LLMClient(
    api_base=os.getenv('PANGU_API_BASE'),
    api_key=os.getenv('PANGU_API_KEY'),
    model='pangu-embedded-7b'
)

# Or use any OpenAI-compatible endpoint
llm = LLMClient(
    api_base=os.getenv('OPENAI_API_BASE'),
    api_key=os.getenv('OPENAI_API_KEY'),
    model='gpt-4'
)

# Analyze custom content
news_content = "Text of the article to analyze"  # placeholder input
analysis = llm.analyze(
    content=news_content,
    task='sentiment_and_summary',
    language='zh'
)
```

Scheduled Monitoring

```python
from hotsearch_analysis_agent.scheduler import MonitorScheduler

scheduler = MonitorScheduler()

# Add monitoring rule
scheduler.add_rule(
    name="Tech Company Crisis Monitoring",
    keywords=['某公司', '丑闻', '争议'],
    alert_conditions={
        'heat_spike': 2.0,       # 2x normal heat
        'sentiment_drop': -0.3,  # 30% sentiment decrease
        'platforms_count': 3     # Trending on 3+ platforms
    },
    notification_channels=['wechat_work', 'telegram', 'email'],
    urgent=True
)

# Start scheduler
scheduler.start()
```
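
The three alert conditions in the rule above can be read as simple comparisons against a baseline. The following sketch shows one possible interpretation, and every part of it is an assumption: the function `should_alert`, the shape of the `current` and `baseline` dictionaries, sentiment scored on a -1 to 1 scale, the drop measured as an absolute difference, and the requirement that all three conditions hold at once. The project's scheduler may combine them differently.

```python
def should_alert(current, baseline, conditions):
    """Return True when heat, sentiment, and platform spread all cross their limits."""
    # Heat must reach the configured multiple of the baseline heat.
    heat_ok = current["heat"] >= conditions["heat_spike"] * baseline["heat"]
    # Sentiment must have fallen by at least the configured (negative) amount.
    sentiment_change = current["sentiment"] - baseline["sentiment"]
    sentiment_ok = sentiment_change <= conditions["sentiment_drop"]
    # The topic must be trending on enough distinct platforms.
    platforms_ok = len(set(current["platforms"])) >= conditions["platforms_count"]
    return heat_ok and sentiment_ok and platforms_ok


conditions = {"heat_spike": 2.0, "sentiment_drop": -0.3, "platforms_count": 3}
baseline = {"heat": 100_000, "sentiment": 0.5}

crisis = {"heat": 250_000, "sentiment": 0.1, "platforms": ["weibo", "zhihu", "bilibili"]}
quiet = {"heat": 250_000, "sentiment": 0.1, "platforms": ["weibo", "zhihu"]}

print(should_alert(crisis, baseline, conditions))  # True
print(should_alert(quiet, baseline, conditions))   # False
```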

Troubleshooting

Browser Driver Issues

```bash
# Error: "Message: 'chromedriver' executable needs to be in PATH"

# Solution: Verify driver installation
which chromedriver  # Should return path

# If not found, reinstall:
# 1. Check browser version
google-chrome --version  # or microsoft-edge --version
# 2. Download exact matching driver version
# 3. Place in /usr/local/bin/ and chmod +x

# Alternative: Specify driver path in settings
CHROMEDRIVER_PATH=/path/to/chromedriver
```

Database Connection Errors

```bash
# Error: "Can't connect to MySQL server"

# Check MySQL service
sudo systemctl status mysql

# Verify credentials
mysql -u hotsearch_user -p -h localhost hotsearch_db

# Check .env file encoding (must be UTF-8 without BOM)
file -I .env  # macOS; on Linux use: file -i .env. Should show charset=utf-8
```

```python
# Test connection in Python
# Assumes the .env values are already loaded into the process environment
import os

import pymysql

try:
    conn = pymysql.connect(
        host=os.getenv('MYSQL_HOST'),
        user=os.getenv('MYSQL_USER'),
        password=os.getenv('MYSQL_PASSWORD'),
        database=os.getenv('MYSQL_DATABASE')
    )
    print("Connection successful")
except Exception as e:
    print(f"Error: {e}")
```

Crawler Rate Limiting

```python
# Error: HTTP 429 or blocked requests
# Solution: Adjust crawler settings

# In hotsearchcrawler/settings.py:
CONCURRENT_REQUESTS = 8  # Reduce from 16
DOWNLOAD_DELAY = 2       # Increase delay

# Enable AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10

# Rotate User-Agents and proxies
# (RandomUserAgentMiddleware comes from the scrapy-user-agents package)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
```

LLM API Timeouts

```python
# Error: Request timeout or rate limit
# Solution: Implement retry logic and fallback
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(prompt):
    # llm is an LLMClient instance (see Custom LLM Integration)
    return llm.analyze(prompt)

# Use batch processing for large datasets
from hotsearch_analysis_agent.batch_processor import BatchProcessor

processor = BatchProcessor(batch_size=10, delay=2)
results = processor.process_items(news_items, analyze_func)
```
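
The retry decorator above covers the retry half of "retry logic and fallback". A fallback across endpoints could be sketched as follows, using only the standard library. The names here are hypothetical: `call_with_fallback` is not part of the project, each "client" is assumed to be a plain callable that takes a prompt and either returns text or raises, and the `sleep` parameter exists only so the example can run without real delays.

```python
import time


def call_with_fallback(prompt, clients, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try each client in order, retrying with exponential backoff before moving on."""
    last_error = None
    for client in clients:
        for attempt in range(attempts):
            try:
                return client(prompt)
            except Exception as error:  # broad on purpose: any failure triggers a retry
                last_error = error
                if attempt < attempts - 1:
                    sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("all LLM clients failed") from last_error


def primary(prompt):
    # Stand-in for an endpoint that keeps timing out.
    raise TimeoutError("primary endpoint timed out")


def backup(prompt):
    # Stand-in for a second endpoint that answers.
    return f"summary of: {prompt}"


result = call_with_fallback("hot topic text", [primary, backup], sleep=lambda seconds: None)
print(result)  # summary of: hot topic text
```

Ordering the client list from preferred to least preferred keeps the policy in one place, so swapping the primary model does not require touching the retry code.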

Memory Issues with Large Datasets

```python
# Error: MemoryError or slow processing
# Solution: Use pagination and streaming
from hotsearch_analysis_agent.db_client import DBClient

db = DBClient()

# Stream results instead of loading all at once
for batch in db.stream_hot_searches(batch_size=100):
    process_batch(batch)  # Process and discard to free memory

# Use database aggregation instead of in-memory
aggregated = db.aggregate_by_platform(
    start_date='2026-01-01',
    end_date='2026-05-01'
)
```
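
The streaming pattern above can be built on the standard DB-API `fetchmany` call. The sketch below shows the idea with an in-memory SQLite database so that it runs without a MySQL server; the generator name `stream_rows` and the two-column table are assumptions for illustration and do not reflect how `DBClient` is actually implemented.

```python
import sqlite3


def stream_rows(cursor, batch_size=100):
    """Yield lists of rows from an executed cursor, at most batch_size at a time."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


# Build a small throwaway table to stream from.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hot_search_items (platform TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO hot_search_items VALUES (?, ?)",
    [("weibo", f"topic {i}") for i in range(250)],
)

cursor = conn.execute("SELECT platform, title FROM hot_search_items")
batch_sizes = [len(batch) for batch in stream_rows(cursor, batch_size=100)]
conn.close()
print(batch_sizes)  # [100, 100, 50]
```

With PyMySQL, note that the default cursor buffers the whole result set on the client, so batching alone does not reduce memory use; its unbuffered `SSCursor` is the variant that fetches rows from the server as they are consumed.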

Project Structure Reference


.
├── app.py                          # Main application entry
├── hotsearch_analysis_agent/       # Analysis system
│   ├── analyzer.py                 # Core analysis logic
│   ├── llm_client.py              # LLM integration
│   ├── report_generator.py        # Report generation
│   ├── push_service.py            # Notification service
│   └── scheduler.py               # Task scheduling
├── hotsearchcrawler/              # Crawler cluster
│   ├── spiders/                   # Platform-specific spiders
│   ├── settings.py                # Crawler settings
│   └── run_spiders.py            # Crawler launcher
├── test_push_task.py              # Push notification testing
├── runspider-test.py              # Single crawler testing
├── init.py                        # Database initialization
├── requirements.txt               # Python dependencies
└── .env                          # Environment configuration

Best Practices

  1. Database Indexing: Ensure indexes on the platform, crawl_time, and title columns for fast queries
  2. LLM Cost Management: Cache analysis results to avoid redundant API calls
  3. Crawler Politeness: Respect platform rate limits and robots.txt
  4. Notification Throttling: Implement cooldown periods to avoid alert fatigue
  5. Data Retention: Set up automatic archival for data older than 90 days
  6. Model Choice: Consider Huawei Pangu for better Chinese language understanding and local deployment
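
As an illustration of the notification throttling practice above, a cooldown can be as small as a dictionary of last-sent times. This is a minimal sketch under stated assumptions: the class `AlertCooldown` is hypothetical, cooldowns are tracked per rule name, and the clock is injectable so the behavior can be checked without waiting.

```python
import time


class AlertCooldown:
    """Allow at most one alert per rule within a cooldown window."""

    def __init__(self, cooldown_seconds, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._last_sent = {}  # rule name -> time of the last allowed alert

    def allow(self, rule_name):
        now = self._clock()
        last = self._last_sent.get(rule_name)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still cooling down, suppress this alert
        self._last_sent[rule_name] = now
        return True


# Demonstration with a fake clock instead of real waiting.
fake_now = [0.0]
throttle = AlertCooldown(600, clock=lambda: fake_now[0])

first = throttle.allow("crisis")   # first alert goes out
fake_now[0] = 300.0
second = throttle.allow("crisis")  # 5 minutes later: suppressed
fake_now[0] = 900.0
third = throttle.allow("crisis")   # 15 minutes after the first: allowed again
print(first, second, third)  # True False True
```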