llm-public-opinion-analytics

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

LLM-Based Public Opinion Analytics Assistant

基于LLM的舆情分析助手

Skill by ara.so — Data Skills collection.
ara.so提供的Skill — 数据技能合集。

Overview

概述

This project is an intelligent public opinion analysis assistant that integrates real-time data from 15 mainstream platforms across 26 ranking lists with large language model (LLM) analysis capabilities. It provides conversational hot search queries, topic-specific searches, topic clustering, and sentiment analysis. The system supports:
  • Real-time web scraping from platforms like Weibo, Bilibili, Douyin, Baidu, etc.
  • LLM-powered content analysis (including video content extraction)
  • Multi-channel push notifications (WeChat, Enterprise WeChat, Telegram, Email)
  • Keyboard shortcuts for crawler control
  • Quick data lookup and platform jumping
本项目是一款智能舆情分析助手,整合了来自15个主流平台26个榜单的实时数据与大语言模型(LLM)分析能力。它支持对话式热搜查询、特定主题搜索、主题聚类以及情感分析。该系统具备以下功能:
  • 实时爬取微博(Weibo)、哔哩哔哩(Bilibili)、抖音(Douyin)、百度(Baidu)等平台的数据
  • 基于LLM的内容分析(包括视频内容提取)
  • 多渠道推送通知(微信、企业微信、Telegram、邮件)
  • 爬虫控制快捷键
  • 快速数据查询与平台跳转

Installation

安装步骤

Prerequisites

前置条件

  1. Python Environment: Python 3.8+
  2. MySQL Database: MySQL 5.7+ or 8.0+
  3. Browser Driver: ChromeDriver or EdgeDriver
  1. Python环境:Python 3.8+
  2. MySQL数据库:MySQL 5.7+ 或 8.0+
  3. 浏览器驱动:ChromeDriver 或 EdgeDriver

Step 1: Browser Driver Setup

步骤1:浏览器驱动配置

Download the driver matching your browser version:
Add the driver to your system PATH:
bash
undefined
下载与浏览器版本匹配的驱动:
将驱动添加至系统PATH:
bash
undefined

macOS/Linux

macOS/Linux

export PATH=$PATH:/path/to/driver/directory
export PATH=$PATH:/path/to/driver/directory

Windows: Add to System Environment Variables

Windows: 添加至系统环境变量


Verify installation:

```bash
chromedriver --version

验证安装:

```bash
chromedriver --version

or

msedgedriver --version
undefined
msedgedriver --version
undefined

Step 2: Clone and Install Dependencies

步骤2:克隆项目并安装依赖

bash
git clone https://github.com/hmmnxkl/LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant.git
cd LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant
bash
git clone https://github.com/hmmnxkl/LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant.git
cd LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant

Create virtual environment

创建虚拟环境

python -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate
python -m venv venv source venv/bin/activate # Windows系统执行:venv\Scripts\activate

Install dependencies

安装依赖

pip install -r requirements.txt
undefined
pip install -r requirements.txt
undefined

Step 3: Database Setup

步骤3:数据库配置

Create MySQL database and tables:
python
undefined
创建MySQL数据库及表:
python
undefined

Reference init.py for schema

参考init.py中的数据库结构

import mysql.connector
conn = mysql.connector.connect( host=os.getenv('MYSQL_HOST', 'localhost'), user=os.getenv('MYSQL_USER'), password=os.getenv('MYSQL_PASSWORD') )
cursor = conn.cursor() cursor.execute("CREATE DATABASE IF NOT EXISTS hotsearch_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci") cursor.execute("USE hotsearch_db")
import mysql.connector
conn = mysql.connector.connect( host=os.getenv('MYSQL_HOST', 'localhost'), user=os.getenv('MYSQL_USER'), password=os.getenv('MYSQL_PASSWORD') )
cursor = conn.cursor() cursor.execute("CREATE DATABASE IF NOT EXISTS hotsearch_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci") cursor.execute("USE hotsearch_db")

Create tables (see init.py for full schema)

创建表(完整结构请查看init.py)

cursor.execute(""" CREATE TABLE IF NOT EXISTS hot_search_items ( id INT AUTO_INCREMENT PRIMARY KEY, platform VARCHAR(50), title VARCHAR(500), url TEXT, rank_index INT, heat_value VARCHAR(100), collected_at DATETIME, content TEXT, sentiment VARCHAR(20), INDEX idx_platform (platform), INDEX idx_collected (collected_at) ) """)
conn.commit()
undefined
cursor.execute(""" CREATE TABLE IF NOT EXISTS hot_search_items ( id INT AUTO_INCREMENT PRIMARY KEY, platform VARCHAR(50), title VARCHAR(500), url TEXT, rank_index INT, heat_value VARCHAR(100), collected_at DATETIME, content TEXT, sentiment VARCHAR(20), INDEX idx_platform (platform), INDEX idx_collected (collected_at) ) """)
conn.commit()
undefined

Step 4: Environment Configuration

步骤4:环境配置

Create
.env
file in project root:
bash
undefined
在项目根目录创建
.env
文件:
bash
undefined

MySQL Configuration

MySQL配置

MYSQL_HOST=localhost MYSQL_PORT=3306 MYSQL_USER=your_mysql_user MYSQL_PASSWORD=your_mysql_password MYSQL_DATABASE=hotsearch_db
MYSQL_HOST=localhost MYSQL_PORT=3306 MYSQL_USER=your_mysql_user MYSQL_PASSWORD=your_mysql_password MYSQL_DATABASE=hotsearch_db

LLM Configuration (OpenAI-compatible API)

LLM配置(兼容OpenAI的API)

OPENAI_API_KEY=your_api_key OPENAI_API_BASE=https://api.openai.com/v1 MODEL_NAME=gpt-4
OPENAI_API_KEY=your_api_key OPENAI_API_BASE=https://api.openai.com/v1 MODEL_NAME=gpt-4

Or use Huawei Pangu Model (local deployment)

或使用华为盘古大模型(本地部署)

PANGU_MODEL_PATH=/path/to/pangu/model

PANGU_MODEL_PATH=/path/to/pangu/model

PANGU_API_URL=http://localhost:8080

PANGU_API_URL=http://localhost:8080

Push Notification Channels

推送通知渠道

WeChat Work Bot

企业微信机器人

WECHAT_WORK_BOT_WEBHOOK=your_webhook_url
WECHAT_WORK_BOT_WEBHOOK=your_webhook_url

WeChat Work App

企业微信应用

WECHAT_WORK_CORP_ID=your_corp_id WECHAT_WORK_AGENT_ID=your_agent_id WECHAT_WORK_SECRET=your_secret
WECHAT_WORK_CORP_ID=your_corp_id WECHAT_WORK_AGENT_ID=your_agent_id WECHAT_WORK_SECRET=your_secret

Telegram

Telegram

TELEGRAM_BOT_TOKEN=your_bot_token TELEGRAM_CHAT_ID=your_chat_id
TELEGRAM_BOT_TOKEN=your_bot_token TELEGRAM_CHAT_ID=your_chat_id

Email (SMTP)

邮件(SMTP)

SMTP_HOST=smtp.gmail.com SMTP_PORT=587 SMTP_USER=your_email@gmail.com SMTP_PASSWORD=your_app_password SMTP_RECIPIENTS=recipient1@example.com,recipient2@example.com
undefined
SMTP_HOST=smtp.gmail.com SMTP_PORT=587 SMTP_USER=your_email@gmail.com SMTP_PASSWORD=your_app_password SMTP_RECIPIENTS=recipient1@example.com,recipient2@example.com
undefined

Core Components

核心组件

1. Web Scraping System (
hotsearchcrawler/
)

1. 网页爬取系统(
hotsearchcrawler/

The crawler cluster supports 15 platforms with 26 ranking lists:
python
undefined
爬虫集群支持15个平台的26个榜单:
python
undefined

Run all spiders

运行所有爬虫

python run_spiders.py
python run_spiders.py

Test specific spider

测试特定爬虫

python runspider-test.py weibo # Test Weibo scraper
undefined
python runspider-test.py weibo # 测试微博爬虫
undefined

Crawler Configuration

爬虫配置

Edit
hotsearchcrawler/settings.py
:
python
undefined
编辑
hotsearchcrawler/settings.py
python
undefined

MySQL settings

MySQL设置

MYSQL_HOST = os.getenv('MYSQL_HOST', 'localhost') MYSQL_PORT = int(os.getenv('MYSQL_PORT', 3306)) MYSQL_USER = os.getenv('MYSQL_USER') MYSQL_PASSWORD = os.getenv('MYSQL_PASSWORD') MYSQL_DATABASE = os.getenv('MYSQL_DATABASE', 'hotsearch_db')
MYSQL_HOST = os.getenv('MYSQL_HOST', 'localhost') MYSQL_PORT = int(os.getenv('MYSQL_PORT', 3306)) MYSQL_USER = os.getenv('MYSQL_USER') MYSQL_PASSWORD = os.getenv('MYSQL_PASSWORD') MYSQL_DATABASE = os.getenv('MYSQL_DATABASE', 'hotsearch_db')

Optional: Platform-specific cookies

可选:平台专属Cookie

COOKIES = { 'weibo': 'your_weibo_cookies', 'bilibili': 'your_bilibili_cookies' }
COOKIES = { 'weibo': 'your_weibo_cookies', 'bilibili': 'your_bilibili_cookies' }

Crawler settings

爬虫设置

CONCURRENT_REQUESTS = 16 DOWNLOAD_DELAY = 1 RANDOMIZE_DOWNLOAD_DELAY = True
undefined
CONCURRENT_REQUESTS = 16 DOWNLOAD_DELAY = 1 RANDOMIZE_DOWNLOAD_DELAY = True
undefined

Available Platforms

支持的平台

  • Social Media: Weibo, Douyin, Kuaishou
  • Video: Bilibili, Tencent Video
  • News: Baidu, Toutiao, Zhihu
  • E-commerce: Taobao, JD.com
  • Gaming: Steam, Tap Tap
  • Others: Tieba, Douban, etc.
  • 社交媒体:微博(Weibo)、抖音(Douyin)、快手(Kuaishou)
  • 视频平台:哔哩哔哩(Bilibili)、腾讯视频(Tencent Video)
  • 资讯平台:百度(Baidu)、头条(Toutiao)、知乎(Zhihu)
  • 电商平台:淘宝(Taobao)、京东(JD.com)
  • 游戏平台:Steam、Tap Tap
  • 其他:贴吧(Tieba)、豆瓣(Douban)等

2. Analysis System (
hotsearch_analysis_agent/
)

2. 分析系统(
hotsearch_analysis_agent/

LLM-powered analysis engine for topic clustering, sentiment analysis, and report generation.
python
from hotsearch_analysis_agent.analyzer import HotSearchAnalyzer
基于LLM的分析引擎,支持主题聚类、情感分析及报告生成。
python
from hotsearch_analysis_agent.analyzer import HotSearchAnalyzer

Initialize analyzer

初始化分析器

analyzer = HotSearchAnalyzer( api_key=os.getenv('OPENAI_API_KEY'), api_base=os.getenv('OPENAI_API_BASE'), model_name=os.getenv('MODEL_NAME', 'gpt-4') )
analyzer = HotSearchAnalyzer( api_key=os.getenv('OPENAI_API_KEY'), api_base=os.getenv('OPENAI_API_BASE'), model_name=os.getenv('MODEL_NAME', 'gpt-4') )

Analyze topics

分析主题

topics = analyzer.fetch_topics( platform='weibo', start_date='2026-05-01', end_date='2026-05-20' )
topics = analyzer.fetch_topics( platform='weibo', start_date='2026-05-01', end_date='2026-05-20' )

Topic clustering

主题聚类

clusters = analyzer.cluster_topics(topics, n_clusters=5)
clusters = analyzer.cluster_topics(topics, n_clusters=5)

Sentiment analysis

情感分析

for topic in topics: sentiment = analyzer.analyze_sentiment(topic['title'], topic['content']) print(f"{topic['title']}: {sentiment}")
for topic in topics: sentiment = analyzer.analyze_sentiment(topic['title'], topic['content']) print(f"{topic['title']}: {sentiment}")

Generate report

生成报告

report = analyzer.generate_report( query="人工智能与前沿科技", platforms=['weibo', 'bilibili', 'zhihu'], days=7 ) print(report)
undefined
report = analyzer.generate_report( query="人工智能与前沿科技", platforms=['weibo', 'bilibili', 'zhihu'], days=7 ) print(report)
undefined

Custom LLM Integration

自定义LLM集成

python
undefined
python
undefined

Using Huawei Pangu Model (local deployment)

使用华为盘古大模型(本地部署)

from hotsearch_analysis_agent.llm import PanguLLM
pangu = PanguLLM( model_path=os.getenv('PANGU_MODEL_PATH'), api_url=os.getenv('PANGU_API_URL') )
response = pangu.generate( prompt="分析以下新闻的情感倾向:\n{news_content}", max_tokens=500 )
undefined
from hotsearch_analysis_agent.llm import PanguLLM
pangu = PanguLLM( model_path=os.getenv('PANGU_MODEL_PATH'), api_url=os.getenv('PANGU_API_URL') )
response = pangu.generate( prompt="分析以下新闻的情感倾向:\n{news_content}", max_tokens=500 )
undefined

3. Web Application (
app.py
)

3. Web应用(
app.py

FastAPI-based web interface for interactive queries and control.
python
undefined
基于FastAPI的Web界面,支持交互式查询与控制。
python
undefined

Start the web application

启动Web应用

python app.py
python app.py

Default runs on http://localhost:8000

默认运行在http://localhost:8000

undefined
undefined

API Endpoints

API接口

python
from fastapi import FastAPI
from hotsearch_analysis_agent.api import router

app = FastAPI()
app.include_router(router)
python
from fastapi import FastAPI
from hotsearch_analysis_agent.api import router

app = FastAPI()
app.include_router(router)

Example API calls

API调用示例

import httpx
import httpx

Query hot searches

查询热搜

response = httpx.get('http://localhost:8000/api/hot-search', params={ 'platform': 'weibo', 'limit': 20 })
response = httpx.get('http://localhost:8000/api/hot-search', params={ 'platform': 'weibo', 'limit': 20 })

Search by keyword

关键词搜索

response = httpx.post('http://localhost:8000/api/search', json={ 'keyword': '人工智能', 'platforms': ['weibo', 'zhihu'], 'days': 7 })
response = httpx.post('http://localhost:8000/api/search', json={ 'keyword': '人工智能', 'platforms': ['weibo', 'zhihu'], 'days': 7 })

Start crawler

启动爬虫

response = httpx.post('http://localhost:8000/api/crawler/start', json={ 'platforms': ['weibo', 'bilibili'] })
response = httpx.post('http://localhost:8000/api/crawler/start', json={ 'platforms': ['weibo', 'bilibili'] })

Stop crawler

停止爬虫

undefined
undefined

Push Notification System

推送通知系统

Configure and test multi-channel alerts:
python
undefined
配置并测试多渠道告警:
python
undefined

test_push_task.py

test_push_task.py

from hotsearch_analysis_agent.push import PushManager
manager = PushManager()
from hotsearch_analysis_agent.push import PushManager
manager = PushManager()

Configure push task

配置推送任务

task = { 'name': 'AI Tech Monitor', 'query': '人工智能', 'platforms': ['weibo', 'zhihu', 'bilibili'], 'schedule': '0 9,18 * * *', # Cron format: 9 AM and 6 PM daily 'channels': ['wechat_work', 'telegram', 'email'], 'min_heat': 100000 # Minimum heat value threshold }
manager.create_task(task)
task = { 'name': 'AI Tech Monitor', 'query': '人工智能', 'platforms': ['weibo', 'zhihu', 'bilibili'], 'schedule': '0 9,18 * * *', # Cron格式:每日上午9点和下午6点 'channels': ['wechat_work', 'telegram', 'email'], 'min_heat': 100000 # 最低热度阈值 }
manager.create_task(task)

Test push manually

手动测试推送

report = """
report = """

AI Technology Hot Topics - 2026-05-20

AI技术热点话题 - 2026-05-20

Key Findings

核心发现

  • GPT-6 context window leaked: 2M tokens
  • DeepSeek V4 uses Huawei Ascend chips
  • Chinese LLM API calls lead globally for 5 weeks
[Full report content...] """
  • GPT-6上下文窗口泄露:2M tokens
  • DeepSeek V4采用华为昇腾芯片
  • 中国LLM API调用量连续5周全球领先
[完整报告内容...] """

Send to WeChat Work

发送至企业微信

manager.send_wechat_work(report)
manager.send_wechat_work(report)

Send to Telegram

发送至Telegram

manager.send_telegram(report)
manager.send_telegram(report)

Send email

发送邮件

manager.send_email( subject="AI Technology Hot Topics - 2026-05-20", content=report )
undefined
manager.send_email( subject="AI技术热点话题 - 2026-05-20", content=report )
undefined

Push Channel Configuration

推送渠道配置

python
undefined
python
undefined

WeChat Work Bot (Group Webhook)

企业微信机器人(群聊Webhook)

import requests
def send_wechat_work_bot(content): webhook = os.getenv('WECHAT_WORK_BOT_WEBHOOK') data = { "msgtype": "markdown", "markdown": { "content": content } } requests.post(webhook, json=data)
import requests
def send_wechat_work_bot(content): webhook = os.getenv('WECHAT_WORK_BOT_WEBHOOK') data = { "msgtype": "markdown", "markdown": { "content": content } } requests.post(webhook, json=data)

Telegram Bot

Telegram机器人

from telegram import Bot
def send_telegram(content): bot = Bot(token=os.getenv('TELEGRAM_BOT_TOKEN')) chat_id = os.getenv('TELEGRAM_CHAT_ID') bot.send_message(chat_id=chat_id, text=content, parse_mode='Markdown')
from telegram import Bot
def send_telegram(content): bot = Bot(token=os.getenv('TELEGRAM_BOT_TOKEN')) chat_id = os.getenv('TELEGRAM_CHAT_ID') bot.send_message(chat_id=chat_id, text=content, parse_mode='Markdown')

Email via SMTP

SMTP邮件发送

import smtplib from email.mime.text import MIMEText
def send_email(subject, content): msg = MIMEText(content, 'html', 'utf-8') msg['Subject'] = subject msg['From'] = os.getenv('SMTP_USER') msg['To'] = os.getenv('SMTP_RECIPIENTS')
with smtplib.SMTP(os.getenv('SMTP_HOST'), int(os.getenv('SMTP_PORT'))) as server:
    server.starttls()
    server.login(os.getenv('SMTP_USER'), os.getenv('SMTP_PASSWORD'))
    server.send_message(msg)
undefined
import smtplib from email.mime.text import MIMEText
def send_email(subject, content): msg = MIMEText(content, 'html', 'utf-8') msg['Subject'] = subject msg['From'] = os.getenv('SMTP_USER') msg['To'] = os.getenv('SMTP_RECIPIENTS')
with smtplib.SMTP(os.getenv('SMTP_HOST'), int(os.getenv('SMTP_PORT'))) as server:
    server.starttls()
    server.login(os.getenv('SMTP_USER'), os.getenv('SMTP_PASSWORD'))
    server.send_message(msg)
undefined

Common Usage Patterns

常见使用场景

Pattern 1: Daily Hot Topic Monitoring

场景1:每日热点话题监控

python
from datetime import datetime, timedelta
from hotsearch_analysis_agent.analyzer import HotSearchAnalyzer
from hotsearch_analysis_agent.push import PushManager

analyzer = HotSearchAnalyzer()
push_manager = PushManager()
python
from datetime import datetime, timedelta
from hotsearch_analysis_agent.analyzer import HotSearchAnalyzer
from hotsearch_analysis_agent.push import PushManager

analyzer = HotSearchAnalyzer()
push_manager = PushManager()

Get yesterday's hot topics

获取昨日热点话题

yesterday = datetime.now() - timedelta(days=1) topics = analyzer.fetch_topics( platforms=['weibo', 'zhihu', 'bilibili'], start_date=yesterday.strftime('%Y-%m-%d'), heat_threshold=50000 )
yesterday = datetime.now() - timedelta(days=1) topics = analyzer.fetch_topics( platforms=['weibo', 'zhihu', 'bilibili'], start_date=yesterday.strftime('%Y-%m-%d'), heat_threshold=50000 )

Cluster and analyze

聚类并分析

clusters = analyzer.cluster_topics(topics, n_clusters=5)
clusters = analyzer.cluster_topics(topics, n_clusters=5)

Generate report

生成报告

report = analyzer.generate_report_from_clusters(clusters)
report = analyzer.generate_report_from_clusters(clusters)

Push to all channels

推送至所有渠道

push_manager.broadcast(report, channels=['wechat_work', 'telegram', 'email'])
undefined
push_manager.broadcast(report, channels=['wechat_work', 'telegram', 'email'])
undefined

Pattern 2: Keyword Alert System

场景2:关键词告警系统

python
undefined
python
undefined

Monitor specific keywords and send immediate alerts

监控特定关键词并即时发送告警

from hotsearch_analysis_agent.monitor import KeywordMonitor
monitor = KeywordMonitor( keywords=['芯片', 'AI', '大模型', '华为'], platforms=['weibo', 'toutiao', 'zhihu'], check_interval=300 # Check every 5 minutes )
def on_match(topic): """Callback when keyword is matched""" alert = f""" 🔔 Keyword Alert: {topic['title']} Platform: {topic['platform']} Heat: {topic['heat_value']} URL: {topic['url']} """ push_manager.send_telegram(alert)
monitor.start(callback=on_match)
undefined
from hotsearch_analysis_agent.monitor import KeywordMonitor
monitor = KeywordMonitor( keywords=['芯片', 'AI', '大模型', '华为'], platforms=['weibo', 'toutiao', 'zhihu'], check_interval=300 # 每5分钟检查一次 )
def on_match(topic): """匹配到关键词时的回调函数""" alert = f""" 🔔 关键词告警: {topic['title']} 平台: {topic['platform']} 热度: {topic['heat_value']} 链接: {topic['url']} """ push_manager.send_telegram(alert)
monitor.start(callback=on_match)
undefined

Pattern 3: Deep Content Analysis

场景3:深度内容分析

python
undefined
python
undefined

Analyze news detail pages (including video content)

分析新闻详情页(包括视频内容)

from hotsearch_analysis_agent.content_extractor import ContentExtractor
extractor = ContentExtractor()
from hotsearch_analysis_agent.content_extractor import ContentExtractor
extractor = ContentExtractor()

Get detailed content from URL

从URL提取详细内容

url = 'https://www.bilibili.com/video/BV13pSoBBEvX/' content = extractor.extract(url)
print(f"Title: {content['title']}") print(f"Type: {content['type']}") # 'video' or 'article' print(f"Content: {content['text'][:500]}...") # Extracted transcript/text
url = 'https://www.bilibili.com/video/BV13pSoBBEvX/' content = extractor.extract(url)
print(f"标题: {content['title']}") print(f"类型: {content['type']}") # 'video' 或 'article' print(f"内容: {content['text'][:500]}...") # 提取的字幕/文本

Analyze sentiment

情感分析

sentiment = analyzer.analyze_sentiment(content['title'], content['text']) print(f"Sentiment: {sentiment}")
sentiment = analyzer.analyze_sentiment(content['title'], content['text']) print(f"情感倾向: {sentiment}")

Extract entities

提取实体

entities = analyzer.extract_entities(content['text']) print(f"Entities: {entities}")
undefined
entities = analyzer.extract_entities(content['text']) print(f"实体: {entities}")
undefined

Pattern 4: Custom Report Generation

场景4:自定义报告生成

python
undefined
python
undefined

Generate custom analytical report

生成自定义分析报告

report_config = { 'title': '科技行业周报', 'query': '人工智能 OR 芯片 OR 量子计算', 'platforms': ['all'], 'date_range': 7, 'sections': [ 'core_findings', # Key discoveries 'news_details', # Detailed news list 'trend_analysis', # Trend analysis 'entity_network' # Entity relationship graph ], 'output_format': 'markdown' }
report = analyzer.generate_custom_report(**report_config)
report_config = { 'title': '科技行业周报', 'query': '人工智能 OR 芯片 OR 量子计算', 'platforms': ['all'], 'date_range': 7, 'sections': [ 'core_findings', # 核心发现 'news_details', # 新闻详情列表 'trend_analysis', # 趋势分析 'entity_network' # 实体关系图 ], 'output_format': 'markdown' }
report = analyzer.generate_custom_report(**report_config)

Save to file

保存至文件

with open(f"report_{datetime.now().strftime('%Y%m%d')}.md", 'w', encoding='utf-8') as f: f.write(report)
undefined
with open(f"report_{datetime.now().strftime('%Y%m%d')}.md", 'w', encoding='utf-8') as f: f.write(report)
undefined

Troubleshooting

故障排查

Issue 1: Browser Driver Errors

问题1:浏览器驱动错误

selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH
Solution: Ensure ChromeDriver/EdgeDriver is in system PATH and matches browser version.
bash
undefined
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH
解决方案:确保ChromeDriver/EdgeDriver已添加至系统PATH,且版本与浏览器匹配。
bash
undefined

Check driver version

检查驱动版本

chromedriver --version
chromedriver --version

Check Chrome version

检查Chrome版本

google-chrome --version # Linux
google-chrome --version # Linux系统

or open chrome://version in browser

或在浏览器中打开chrome://version查看

undefined
undefined

Issue 2: Database Connection Failures

问题2:数据库连接失败

mysql.connector.errors.ProgrammingError: Access denied for user
Solution: Verify MySQL credentials in
.env
and ensure user has proper permissions.
sql
-- Grant permissions
GRANT ALL PRIVILEGES ON hotsearch_db.* TO 'your_user'@'localhost';
FLUSH PRIVILEGES;
mysql.connector.errors.ProgrammingError: Access denied for user
解决方案:验证
.env
中的MySQL凭据,确保用户拥有足够权限。
sql
undefined

Issue 3: LLM API Rate Limits

授予权限

openai.error.RateLimitError: Rate limit exceeded
Solution: Implement request throttling or switch to local model:
python
import time
from functools import wraps

def rate_limit(calls_per_minute=10):
    min_interval = 60.0 / calls_per_minute
    last_called = [0.0]
    
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait_time = min_interval - elapsed
            if wait_time > 0:
                time.sleep(wait_time)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limit(calls_per_minute=10)
def call_llm(prompt):
    return analyzer.generate(prompt)
GRANT ALL PRIVILEGES ON hotsearch_db.* TO 'your_user'@'localhost'; FLUSH PRIVILEGES;
undefined

Issue 4: Crawler Being Blocked

问题3:LLM API速率限制

Solution: Rotate user agents and add delays:
python
undefined
openai.error.RateLimitError: Rate limit exceeded
解决方案:实现请求限流或切换至本地模型:
python
import time
from functools import wraps

def rate_limit(calls_per_minute=10):
    min_interval = 60.0 / calls_per_minute
    last_called = [0.0]
    
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait_time = min_interval - elapsed
            if wait_time > 0:
                time.sleep(wait_time)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limit(calls_per_minute=10)
def call_llm(prompt):
    return analyzer.generate(prompt)

In hotsearchcrawler/settings.py

问题4:爬虫被拦截

DOWNLOADER_MIDDLEWARES = { 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, 'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400, }
DOWNLOAD_DELAY = 3 RANDOMIZE_DOWNLOAD_DELAY = True CONCURRENT_REQUESTS_PER_DOMAIN = 2
undefined
解决方案:轮换用户代理并添加延迟:
python
undefined

Issue 5: Encoding Issues with Chinese Text

在hotsearchcrawler/settings.py中配置

Solution: Ensure UTF-8 encoding throughout:
python
undefined
DOWNLOADER_MIDDLEWARES = { 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, 'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400, }
DOWNLOAD_DELAY = 3 RANDOMIZE_DOWNLOAD_DELAY = True CONCURRENT_REQUESTS_PER_DOMAIN = 2
undefined

Database connection

问题5:中文文本编码问题

import mysql.connector
conn = mysql.connector.connect( host=os.getenv('MYSQL_HOST'), user=os.getenv('MYSQL_USER'), password=os.getenv('MYSQL_PASSWORD'), database=os.getenv('MYSQL_DATABASE'), charset='utf8mb4', collation='utf8mb4_unicode_ci' )
解决方案:确保全程使用UTF-8编码:
python
undefined

File operations

数据库连接

with open('report.md', 'w', encoding='utf-8') as f: f.write(report)
undefined
import mysql.connector
conn = mysql.connector.connect( host=os.getenv('MYSQL_HOST'), user=os.getenv('MYSQL_USER'), password=os.getenv('MYSQL_PASSWORD'), database=os.getenv('MYSQL_DATABASE'), charset='utf8mb4', collation='utf8mb4_unicode_ci' )

Advanced Configuration

文件操作

Using Huawei Pangu Model (Local Deployment)

Download and deploy the model:
bash
undefined
with open('report.md', 'w', encoding='utf-8') as f: f.write(report)
undefined

Start model service

使用华为盘古大模型(本地部署)

python -m hotsearch_analysis_agent.llm.pangu_server --model_path /path/to/model --port 8080

Configure in code:

```python
from hotsearch_analysis_agent.llm import PanguLLM

analyzer = HotSearchAnalyzer(
    llm=PanguLLM(api_url='http://localhost:8080')
)
下载并部署模型:
bash
undefined

启动模型服务

Scale up with multiple crawler instances:
bash
undefined
python -m hotsearch_analysis_agent.llm.pangu_server --model_path /path/to/model --port 8080

在代码中配置:

```python
from hotsearch_analysis_agent.llm import PanguLLM

analyzer = HotSearchAnalyzer(
    llm=PanguLLM(api_url='http://localhost:8080')
)

Instance 1: Weibo, Zhihu

分布式爬取

python run_spiders.py --platforms weibo,zhihu
通过多个爬虫实例扩展规模:
bash
undefined

Instance 2: Bilibili, Douyin

实例1:微博、知乎

python run_spiders.py --platforms bilibili,douyin
python run_spiders.py --platforms weibo,zhihu

Instance 3: News platforms

实例2:哔哩哔哩、抖音

python run_spiders.py --platforms baidu,toutiao
undefined
python run_spiders.py --platforms bilibili,douyin

Project Structure Reference

实例3:资讯平台

.
├── app.py                          # Web application entry
├── run_spiders.py                  # Crawler launcher
├── runspider-test.py               # Crawler testing
├── test_push_task.py               # Push notification testing
├── init.py                         # Database initialization
├── requirements.txt                # Python dependencies
├── .env                            # Environment configuration
├── hotsearchcrawler/               # Crawler cluster
│   ├── spiders/                    # Platform-specific spiders
│   ├── settings.py                 # Crawler settings
│   └── pipelines.py                # Data pipelines
└── hotsearch_analysis_agent/       # Analysis system
    ├── analyzer.py                 # Core analysis engine
    ├── llm/                        # LLM integrations
    ├── push/                       # Push notification modules
    ├── api/                        # Web API endpoints
    └── content_extractor.py        # Content extraction utilities
python run_spiders.py --platforms baidu,toutiao
undefined

项目结构参考

.
├── app.py                          # Web应用入口
├── run_spiders.py                  # 爬虫启动器
├── runspider-test.py               # 爬虫测试脚本
├── test_push_task.py               # 推送通知测试脚本
├── init.py                         # 数据库初始化脚本
├── requirements.txt                # Python依赖列表
├── .env                            # 环境配置文件
├── hotsearchcrawler/               # 爬虫集群
│   ├── spiders/                    # 平台专属爬虫
│   ├── settings.py                 # 爬虫配置
│   └── pipelines.py                # 数据管道
└── hotsearch_analysis_agent/       # 分析系统
    ├── analyzer.py                 # 核心分析引擎
    ├── llm/                        # LLM集成模块
    ├── push/                       # 推送通知模块
    ├── api/                        # Web API接口
    └── content_extractor.py        # 内容提取工具