LLM-Based Intelligent Public Opinion Analytics Assistant

Skill by ara.so — Data Skills collection.

Overview

This project is a comprehensive public opinion analytics platform that combines real-time data from 26 hot lists across 15 mainstream platforms (Weibo, Bilibili, Zhihu, Baidu, etc.) with large language model (LLM) analysis capabilities. It provides conversational query interfaces for hot searches, topic clustering, sentiment analysis, and multi-channel push notifications (WeChat, Email, Telegram).

**Key Capabilities:**

- Real-time crawler cluster for 15+ platforms
- LLM-powered content analysis (including video content extraction)
- Natural language query interface
- Topic clustering and sentiment analysis
- Multi-channel alert system (Email, WeChat Work, Telegram)
- Keyboard shortcuts for crawler control

Installation

Prerequisites

1. **Browser Driver Setup** (required for detail page scraping):

```bash
# Check your Chrome/Edge version first
# Chrome: chrome://settings/help
# Edge: edge://settings/help

# Download the matching driver:
# ChromeDriver: https://chromedriver.chromium.org/
# EdgeDriver: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/

# Linux/macOS - place driver in PATH:
sudo mv chromedriver /usr/local/bin/
sudo chmod +x /usr/local/bin/chromedriver

# Verify installation:
chromedriver --version
```

2. **MySQL Database**:

```bash
# Install MySQL 8.0+
# Create database and user
mysql -u root -p

# Then run these statements at the mysql> prompt:
CREATE DATABASE hotsearch_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'hotsearch_user'@'localhost' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON hotsearch_db.* TO 'hotsearch_user'@'localhost';
FLUSH PRIVILEGES;
```

3. **Python Environment**:

```bash
# Clone repository
git clone https://github.com/hmmnxkl/LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant.git
cd LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Database Initialization

Reference the `init.py` file to create the necessary tables:

```python
# Example table structure (adapt from init.py)
import pymysql

connection = pymysql.connect(
    host='localhost',
    user='hotsearch_user',
    password='your_password',
    database='hotsearch_db',
    charset='utf8mb4',
)
cursor = connection.cursor()

# Hot search items table
# `rank` is a reserved word in MySQL 8.0+, so it is backtick-quoted here.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS hot_search_items (
        id INT AUTO_INCREMENT PRIMARY KEY,
        platform VARCHAR(50) NOT NULL,
        `rank` INT,
        title VARCHAR(500) NOT NULL,
        url VARCHAR(1000),
        heat_value VARCHAR(100),
        crawl_time DATETIME NOT NULL,
        detail_content TEXT,
        sentiment VARCHAR(20),
        INDEX idx_platform (platform),
        INDEX idx_crawl_time (crawl_time)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
""")

connection.commit()
connection.close()
```
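Reads against this table are safest as parameterized queries. The sketch below is only an illustration: `build_top_items_query` is a hypothetical helper, not part of the project, and it assumes the `hot_search_items` columns shown in this section. It returns the SQL and its parameters separately so they can be handed to pymysql's `cursor.execute(sql, params)`.

```python
from datetime import datetime, timedelta


def build_top_items_query(platform, hours=24, limit=10, now=None):
    """Return (sql, params) for recent, top-ranked items on one platform.

    Hypothetical helper: assumes the hot_search_items schema from this section.
    """
    now = now or datetime.now()
    since = now - timedelta(hours=hours)
    sql = (
        "SELECT title, url, heat_value, crawl_time "
        "FROM hot_search_items "
        "WHERE platform = %s AND crawl_time >= %s "
        "ORDER BY crawl_time DESC, `rank` ASC "  # `rank` is reserved in MySQL 8.0+
        "LIMIT %s"
    )
    return sql, (platform, since, int(limit))


sql, params = build_top_items_query(
    "weibo", hours=24, limit=10, now=datetime(2026, 1, 2, 12, 0)
)
```

Keeping values out of the SQL string lets the driver handle escaping, which matters because keywords and platform names may come from user queries.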

Configuration

Environment Variables

Create a `.env` file in the project root:

```bash
# Database Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=hotsearch_user
MYSQL_PASSWORD=your_password
MYSQL_DATABASE=hotsearch_db

# LLM API Configuration (OpenAI-compatible format)
OPENAI_API_KEY=your_api_key
OPENAI_API_BASE=https://your-llm-endpoint.com/v1
OPENAI_MODEL=gpt-4

# Huawei Pangu Model (recommended alternative)
PANGU_API_KEY=your_pangu_key
PANGU_API_BASE=https://pangu-api.huaweicloud.com

# Push Notification Channels
# Email (SMTP)
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=your_email@gmail.com
SMTP_PASSWORD=your_app_password

# WeChat Work Bot
WECHAT_WORK_WEBHOOK=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY

# WeChat Work Application
WECHAT_WORK_CORP_ID=your_corp_id
WECHAT_WORK_APP_SECRET=your_app_secret
WECHAT_WORK_AGENT_ID=your_agent_id

# Telegram Bot
TELEGRAM_BOT_TOKEN=your_bot_token
TELEGRAM_CHAT_ID=your_chat_id
```
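How the project itself loads `.env` is not stated here, so the following is a minimal standard-library sketch rather than the project's loader. `parse_env_text` and `load_env_file` are hypothetical names. The parser splits on the first `=` only, so values that contain `=` themselves (such as the webhook URL) survive intact.

```python
import os


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path=".env"):
    """Copy values from a .env file into os.environ (existing variables win)."""
    with open(path, encoding="utf-8") as handle:
        for key, value in parse_env_text(handle.read()).items():
            os.environ.setdefault(key, value)


sample = """
# Database Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
WECHAT_WORK_WEBHOOK=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY
"""
config = parse_env_text(sample)
```

All parsed values are strings, which is why the settings code converts `MYSQL_PORT` with `int(...)`.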

Crawler Settings

Edit `hotsearchcrawler/settings.py`:

```python
import os

# MySQL Connection Pool
MYSQL_CONFIG = {
    'host': os.getenv('MYSQL_HOST', 'localhost'),
    'port': int(os.getenv('MYSQL_PORT', 3306)),
    'user': os.getenv('MYSQL_USER'),
    'password': os.getenv('MYSQL_PASSWORD'),
    'database': os.getenv('MYSQL_DATABASE'),
    'charset': 'utf8mb4',
    'autocommit': True,
}

# Optional: Platform-specific cookies for authenticated access
PLATFORM_COOKIES = {
    'weibo': 'your_weibo_cookies',  # Optional, for better access
    'bilibili': 'your_bilibili_cookies',
}

# Concurrent requests
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1

# User-Agent rotation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]
```
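The `USER_AGENTS` list only has an effect if something picks from it per request. The sketch below shows one way that could look as a Scrapy-style downloader middleware. `RotatingUserAgentMiddleware` and `FakeRequest` are hypothetical names used for illustration, and the project may wire rotation differently.

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]


class RotatingUserAgentMiddleware:
    """Hypothetical downloader middleware: random User-Agent per request."""

    def __init__(self, user_agents, rng=None):
        self.user_agents = list(user_agents)
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        return None  # None tells Scrapy to keep processing the request


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy."""

    def __init__(self):
        self.headers = {}


request = FakeRequest()
RotatingUserAgentMiddleware(USER_AGENTS).process_request(request, spider=None)
```

In a real Scrapy project the class would also need to be registered in `DOWNLOADER_MIDDLEWARES` and would typically read the list from settings via a `from_crawler` classmethod.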

Usage

Starting the System

```bash
# Activate virtual environment
source venv/bin/activate

# Start the main application (web interface + API)
python app.py

# Access web interface at http://localhost:5000
```

Crawler Management

```bash
# Manual crawler test (single platform)
cd hotsearchcrawler
python runspider-test.py

# Start all crawlers (typically triggered via web UI)
python run_spiders.py
```

**Via Web Interface:**
- Use keyboard shortcuts to start/stop crawlers
- View real-time crawling status
- Monitor data collection metrics

Natural Language Queries

Examples of conversational queries via the web interface:

- "Show me today's top 10 trending topics on Weibo"
- "What's trending about AI technology across all platforms?"
- "Analyze sentiment for news about electric vehicles"
- "Cluster topics related to economic policy"
- "Compare hot topics between Bilibili and Zhihu"

Programmatic API Usage

```python
from hotsearch_analysis_agent.analyzer import OpinionAnalyzer
from datetime import datetime, timedelta

# Initialize analyzer
analyzer = OpinionAnalyzer()

# Query hot searches
results = analyzer.query_hot_searches(
    platforms=['weibo', 'zhihu', 'bilibili'],
    time_range=(datetime.now() - timedelta(hours=24), datetime.now()),
    keyword='人工智能',  # "artificial intelligence"
)

# Perform sentiment analysis
sentiment = analyzer.analyze_sentiment(results)
print(f"Overall sentiment: {sentiment['overall']}")
print(f"Positive: {sentiment['positive_ratio']}%")

# Topic clustering
clusters = analyzer.cluster_topics(results, num_clusters=5)
for i, cluster in enumerate(clusters):
    print(f"Cluster {i+1}: {cluster['keywords']}")
    print(f"  Items: {len(cluster['items'])}")
```
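The example reads `overall` and `positive_ratio` from the sentiment result. How `OpinionAnalyzer` computes them is not documented here, so the following is only a sketch of one plausible aggregation, assuming each item has already been labeled `positive`, `neutral`, or `negative`. `summarize_sentiment` and the `negative_ratio` key are hypothetical.

```python
from collections import Counter


def summarize_sentiment(labels):
    """Summarize per-item labels into an overall label and percentage ratios."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {"overall": "neutral", "positive_ratio": 0.0, "negative_ratio": 0.0}
    overall = counts.most_common(1)[0][0]  # most frequent label wins
    return {
        "overall": overall,
        "positive_ratio": round(100 * counts["positive"] / total, 1),
        "negative_ratio": round(100 * counts["negative"] / total, 1),
    }


summary = summarize_sentiment(["positive", "positive", "negative", "neutral"])
```

Reporting ratios as percentages matches the `%` suffix used in the print statements of the example.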

Push Notification Setup

```python
from hotsearch_analysis_agent.push_service import PushService

# Initialize push service
push_service = PushService()

# Create scheduled push task
task = push_service.create_task(
    name="AI Technology Daily Report",
    keywords=['人工智能', '大模型', '机器学习'],  # AI, large models, machine learning
    platforms=['weibo', 'zhihu', 'bilibili'],
    schedule='0 8,12,18 * * *',  # Cron format: 8am, 12pm, 6pm daily
    channels=['wechat_work', 'email'],
    threshold={'heat_value': 100000, 'sentiment': 'positive'},
)

# Test push task (run from the shell, not from Python):
#   python test_push_task.py
```
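For orientation, the sketch below shows what a push to two of the configured channels could look like at the HTTP level, using only the standard library. It is not the project's `PushService`. The helper names are hypothetical; the payload shapes follow the WeChat Work group-bot webhook (`msgtype: text`) and the Telegram Bot API `sendMessage` method. The requests are built but not sent.

```python
import json
import urllib.request


def build_wechat_work_request(webhook_url, text):
    """Build a POST request for a WeChat Work group-bot webhook (text message)."""
    body = json.dumps({"msgtype": "text", "text": {"content": text}}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_telegram_request(bot_token, chat_id, text):
    """Build a POST request for the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_telegram_request("your_bot_token", "your_chat_id", "3 new items match 'AI'")
# To actually send it: urllib.request.urlopen(req, timeout=10)
```

Building the request separately from sending it makes the payload easy to inspect before any real message goes out.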

Analysis Report Generation

```python
from datetime import datetime, timedelta

from hotsearch_analysis_agent.report_generator import ReportGenerator

generator = ReportGenerator()

# Generate comprehensive report
report = generator.generate_report(
    topic="人工智能与前沿科技",  # "AI and frontier technology"
    time_range=(datetime.now() - timedelta(days=7), datetime.now()),
    include_sentiment=True,
    include_clustering=True,
    include_trend_analysis=True,
)

# Report includes:
# - Core findings with data highlights
# - Detailed news content with source URLs
# - Sentiment distribution
# - Topic clusters
# - Trend analysis
# - Information spread characteristics

# Save report
report.save_markdown('output/ai_tech_report.md')
report.save_pdf('output/ai_tech_report.pdf')
```

Common Patterns

Multi-Platform Data Aggregation

```python
from hotsearch_analysis_agent.aggregator import DataAggregator

aggregator = DataAggregator()

# Fetch and merge data from multiple platforms
merged_data = aggregator.aggregate(
    platforms=['weibo', 'douyin', 'zhihu', 'bilibili', 'baidu'],
    dedup_threshold=0.8,  # Similarity threshold for deduplication
    sort_by='heat_value',
    limit=50,
)

# Cross-platform topic correlation
correlations = aggregator.find_correlations(merged_data)
print(f"Found {len(correlations)} cross-platform trending topics")
```
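To make the `dedup_threshold` idea concrete, here is a small sketch of similarity-based deduplication. It uses `difflib.SequenceMatcher` from the standard library as the similarity measure, which is an assumption: the actual metric used by `DataAggregator` is not documented here. `dedupe_by_title` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def dedupe_by_title(items, threshold=0.8):
    """Keep the first item of each group whose titles are >= threshold similar."""
    kept = []
    for item in items:
        is_duplicate = any(
            SequenceMatcher(None, item["title"], other["title"]).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(item)
    return kept


items = [
    {"platform": "weibo", "title": "New AI model released today", "heat_value": 900},
    {"platform": "zhihu", "title": "New AI model released today!", "heat_value": 500},
    {"platform": "baidu", "title": "Electric vehicle sales rise", "heat_value": 700},
]
# Sort by heat first so the hottest copy of each story is the one that survives.
unique = dedupe_by_title(sorted(items, key=lambda i: i["heat_value"], reverse=True))
```

This pairwise comparison is quadratic in the number of items, which is acceptable for a few hundred hot-list entries but would need a smarter index for much larger sets.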

Video Content Analysis

```python
# The system automatically extracts text from video news
# using browser automation and LLM analysis
from hotsearch_analysis_agent.video_analyzer import VideoAnalyzer

video_analyzer = VideoAnalyzer()

# Analyze video-based hot topics (e.g., from Bilibili, Douyin)
video_topics = video_analyzer.extract_content(
    url='https://www.bilibili.com/video/BV13pSoBBEvX/',
    extract_comments=True,
    max_comments=100,
)

print(f"Video title: {video_topics['title']}")
print(f"Description: {video_topics['description']}")
print(f"Top comments sentiment: {video_topics['comments_sentiment']}")
```

Custom LLM Integration

```python
import os

from hotsearch_analysis_agent.llm_client import LLMClient

# Use Huawei Pangu Model (recommended)
llm = LLMClient(
    api_base=os.getenv('PANGU_API_BASE'),
    api_key=os.getenv('PANGU_API_KEY'),
    model='pangu-embedded-7b',
)

# Or use any OpenAI-compatible endpoint
llm = LLMClient(
    api_base=os.getenv('OPENAI_API_BASE'),
    api_key=os.getenv('OPENAI_API_KEY'),
    model='gpt-4',
)

# Analyze custom content
news_content = "..."  # Replace with the text to analyze
analysis = llm.analyze(
    content=news_content,
    task='sentiment_and_summary',
    language='zh',
)
```

Scheduled Monitoring

```python
from hotsearch_analysis_agent.scheduler import MonitorScheduler

scheduler = MonitorScheduler()

# Add monitoring rule
scheduler.add_rule(
    name="Tech Company Crisis Monitoring",
    keywords=['某公司', '丑闻', '争议'],  # "a certain company", scandal, controversy
    alert_conditions={
        'heat_spike': 2.0,       # 2x normal heat
        'sentiment_drop': -0.3,  # 30% sentiment decrease
        'platforms_count': 3,    # Trending on 3+ platforms
    },
    notification_channels=['wechat_work', 'telegram', 'email'],
    urgent=True,
)

# Start scheduler
scheduler.start()
```
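The three `alert_conditions` can be read as a single predicate over current and baseline measurements. The sketch below is one interpretation, not the scheduler's actual logic. It assumes all conditions must hold together, that sentiment is a score in [-1, 1], and that `sentiment_drop` is an absolute change in that score. `should_alert` and the dictionary keys are hypothetical.

```python
def should_alert(current, baseline, conditions):
    """Return True only when every configured alert condition is met.

    current / baseline: dicts with 'heat' and 'sentiment' (score in [-1, 1]);
    current also carries 'platforms', the list of platforms where it trends.
    """
    heat_ok = current["heat"] >= conditions["heat_spike"] * baseline["heat"]
    sentiment_change = current["sentiment"] - baseline["sentiment"]
    sentiment_ok = sentiment_change <= conditions["sentiment_drop"]
    platforms_ok = len(set(current["platforms"])) >= conditions["platforms_count"]
    return heat_ok and sentiment_ok and platforms_ok


conditions = {"heat_spike": 2.0, "sentiment_drop": -0.3, "platforms_count": 3}
baseline = {"heat": 50_000, "sentiment": 0.2}
crisis = {"heat": 120_000, "sentiment": -0.2, "platforms": ["weibo", "zhihu", "baidu"]}
quiet = {"heat": 120_000, "sentiment": 0.1, "platforms": ["weibo", "zhihu", "baidu"]}

alerts = (should_alert(crisis, baseline, conditions),
          should_alert(quiet, baseline, conditions))
```

Requiring all conditions at once keeps the rule strict; an any-of variant would fire more often and suits early-warning use.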

Troubleshooting

Browser Driver Issues

```bash
# Error: "Message: 'chromedriver' executable needs to be in PATH"

# Solution: Verify driver installation
which chromedriver  # Should return path

# If not found, reinstall:
# 1. Check browser version
google-chrome --version  # or microsoft-edge --version
# 2. Download exact matching driver version
# 3. Place in /usr/local/bin/ and chmod +x

# Alternative: Specify driver path in settings
CHROMEDRIVER_PATH=/path/to/chromedriver
```

Database Connection Errors

```bash
# Error: "Can't connect to MySQL server"

# Check MySQL service
sudo systemctl status mysql

# Verify credentials
mysql -u hotsearch_user -p -h localhost hotsearch_db

# Check .env file encoding (must be UTF-8 without BOM)
file -I .env  # macOS; on Linux use `file -i .env`. Should show charset=utf-8
```

Test the connection in Python:

```python
import os

import pymysql

try:
    conn = pymysql.connect(
        host=os.getenv('MYSQL_HOST'),
        user=os.getenv('MYSQL_USER'),
        password=os.getenv('MYSQL_PASSWORD'),
        database=os.getenv('MYSQL_DATABASE'),
    )
    print("Connection successful")
    conn.close()
except Exception as e:
    print(f"Error: {e}")
```

Crawler Rate Limiting

```python
# Error: HTTP 429 or blocked requests
# Solution: Adjust crawler settings

# In hotsearchcrawler/settings.py:
CONCURRENT_REQUESTS = 8  # Reduce from 16
DOWNLOAD_DELAY = 2       # Increase delay

# Enable AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10

# Rotate User-Agents and proxies
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
```

LLM API Timeouts

```python
# Error: Request timeout or rate limit
# Solution: Implement retry logic and fallback
from tenacity import retry, stop_after_attempt, wait_exponential

# `llm` is an LLMClient instance (see Custom LLM Integration)
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(prompt):
    return llm.analyze(prompt)

# Use batch processing for large datasets
from hotsearch_analysis_agent.batch_processor import BatchProcessor

# `news_items` and `analyze_func` are supplied by the caller
processor = BatchProcessor(batch_size=10, delay=2)
results = processor.process_items(news_items, analyze_func)
```
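The batching idea with `batch_size` and `delay` can be illustrated without the project's `BatchProcessor`, whose internals are not shown here. The sketch below is a hypothetical stand-in: it applies a function to items in fixed-size chunks and pauses between chunks to stay under an API rate limit. The `sleep` parameter is injectable so the example runs instantly.

```python
import time


def process_in_batches(items, func, batch_size=10, delay=2.0, sleep=time.sleep):
    """Apply func to every item, pausing `delay` seconds between batches."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(func(item) for item in batch)
        if start + batch_size < len(items):
            sleep(delay)  # pause only between batches, not after the last one
    return results


pauses = []  # records requested pauses instead of really sleeping
lengths = process_in_batches(
    ["a", "bb", "ccc", "dddd", "eeeee"],
    len,
    batch_size=2,
    delay=2.0,
    sleep=pauses.append,
)
```

In real use `func` would be something like the retry-wrapped LLM call, so that a transient failure retries a single item rather than a whole batch.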

Memory Issues with Large Datasets

```python
# Error: MemoryError or slow processing
# Solution: Use pagination and streaming
from hotsearch_analysis_agent.db_client import DBClient

db = DBClient()

# Stream results instead of loading all at once
for batch in db.stream_hot_searches(batch_size=100):
    process_batch(batch)  # Process and discard to free memory

# Use database aggregation instead of in-memory
aggregated = db.aggregate_by_platform(
    start_date='2026-01-01',
    end_date='2026-05-01',
)
```
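Streaming in batches usually comes down to calling `fetchmany` in a loop on a DB-API cursor. The sketch below demonstrates the pattern with an in-memory SQLite table standing in for `hot_search_items`, so it runs without a MySQL server. `stream_rows` is a hypothetical helper, and how `DBClient.stream_hot_searches` is actually implemented is not shown in this document.

```python
import sqlite3


def stream_rows(cursor, batch_size=100):
    """Yield lists of at most batch_size rows from an already-executed cursor."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# Demo: an in-memory SQLite table stands in for hot_search_items.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hot_search_items (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO hot_search_items (title) VALUES (?)",
    [(f"topic {n}",) for n in range(250)],
)
cursor = conn.execute("SELECT id, title FROM hot_search_items ORDER BY id")
batch_sizes = [len(batch) for batch in stream_rows(cursor, batch_size=100)]
conn.close()
```

With pymysql, the default cursor buffers the whole result set on the client, so pairing this loop with `pymysql.cursors.SSCursor` (an unbuffered cursor) is what actually keeps memory flat.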

Project Structure Reference

.
├── app.py                          # Main application entry
├── hotsearch_analysis_agent/       # Analysis system
│   ├── analyzer.py                 # Core analysis logic
│   ├── llm_client.py              # LLM integration
│   ├── report_generator.py        # Report generation
│   ├── push_service.py            # Notification service
│   └── scheduler.py               # Task scheduling
├── hotsearchcrawler/              # Crawler cluster
│   ├── spiders/                   # Platform-specific spiders
│   ├── settings.py                # Crawler settings
│   └── run_spiders.py            # Crawler launcher
├── test_push_task.py              # Push notification testing
├── runspider-test.py              # Single crawler testing
├── init.py                        # Database initialization
├── requirements.txt               # Python dependencies
└── .env                          # Environment configuration

Best Practices

1. **Database Indexing**: Ensure indexes on the `platform`, `crawl_time`, and `title` columns for fast queries
2. **LLM Cost Management**: Cache analysis results to avoid redundant API calls
3. **Crawler Politeness**: Respect platform rate limits and robots.txt
4. **Notification Throttling**: Implement cooldown periods to avoid alert fatigue
5. **Data Retention**: Set up automatic archival for data older than 90 days
6. **Model Choice**: Consider Huawei Pangu for better Chinese language understanding and local deployment
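
Two of these practices, result caching and notification cooldowns, are easy to sketch in a few lines. The classes below are hypothetical illustrations, not project code: `AnalysisCache` keys results by a hash of task and content, and `Cooldown` allows one alert per rule per time window. Both keep state in memory only, so a real deployment would likely persist it.

```python
import hashlib
import time


class AnalysisCache:
    """In-memory cache keyed by a hash of (task, content)."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, content, task, compute):
        key = hashlib.sha256(f"{task}\n{content}".encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = compute(content)  # pay for the LLM call only on a miss
        return self._store[key]


class Cooldown:
    """Allow at most one alert per rule name every `seconds` seconds."""

    def __init__(self, seconds, clock=time.monotonic):
        self.seconds = seconds
        self.clock = clock
        self._last_sent = {}

    def allow(self, rule_name):
        now = self.clock()
        last = self._last_sent.get(rule_name)
        if last is not None and now - last < self.seconds:
            return False  # still cooling down, suppress this alert
        self._last_sent[rule_name] = now
        return True


calls = []  # records how often the expensive function really runs


def fake_llm(content):
    calls.append(content)
    return "neutral"


cache = AnalysisCache()
for _ in range(3):
    result = cache.get_or_compute("same headline", "sentiment", fake_llm)
```

Here `fake_llm` stands in for a real model call; after three lookups of the same headline it has run only once.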

