opinion-miner
Opinion Miner Tool
Analyze community comments to uncover users' core viewpoints and positions.
When to Use This Skill
Use this skill when a user wants to understand what a community thinks about a topic: not just "what they said", but "what positions they hold and where they disagree". The goal is to turn raw comments into structured insights.
Typical trigger scenarios:
- "What are people saying about X on Reddit/Bilibili/GitHub?"
- "Analyze the comments on this post"
- "What are the main points of disagreement here?"
- "Help me understand the community's attitude towards this issue"
- "Run an opinion analysis on this topic"
Supported Data Sources
| Data Source | Scraping Method |
|---|---|
| Bilibili video comments | Comment API, falling back to `agent-browser` (see Step 1) |
| Reddit posts | JSON API: append `.json` to the post URL |
| GitHub Issues comments | GitHub API |
If the user provides a URL, first identify the platform, then scrape it with the corresponding method.
Workflow
Step 1: Scrape Comments
Collect all comments from the given URL. Save the raw data to `comments_raw.json` using the following structure:
```json
[
  {
    "id": "unique-id",
    "author": "username",
    "text": "comment body",
    "likes": 0,
    "replies": [],
    "timestamp": "2026-01-15T10:30:00Z"
  }
]
```
Platform-specific scraping methods:
Bilibili: First try the comment API: `https://api.bilibili.com/x/v2/reply/main?type=1&oid={video_id}&mode=3&ps=20&pn={page}`. Page through the results until no comments remain. If the API fails, fall back to `agent-browser`:
```bash
agent-browser open "https://www.bilibili.com/video/BVxxxxx" && agent-browser wait --load networkidle
agent-browser snapshot -i
```
Then scroll the page and extract comments via DOM snapshots.
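The "page until exhausted" loop can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `fetch_page` is a hypothetical caller-supplied function that requests one URL and returns that page's list of comment records (empty when none remain), page numbering is assumed to start at 1, and `max_pages` is an added safety cap. The response format of the API is not described here, so parsing is left entirely to `fetch_page`.

```python
from typing import Callable

# URL template taken from the text above; {video_id} and {page} are filled in per request.
API_TEMPLATE = (
    "https://api.bilibili.com/x/v2/reply/main"
    "?type=1&oid={video_id}&mode=3&ps=20&pn={page}"
)


def page_url(video_id: str, page: int) -> str:
    """Build the comment API URL for one page."""
    return API_TEMPLATE.format(video_id=video_id, page=page)


def collect_all(fetch_page: Callable[[str], list], video_id: str,
                max_pages: int = 500) -> list:
    """Request successive pages until one comes back empty.

    fetch_page is hypothetical: it must return a list of comment
    records for the given URL, or an empty list when exhausted.
    """
    comments = []
    for page in range(1, max_pages + 1):  # assumption: pages start at 1
        batch = fetch_page(page_url(video_id, page))
        if not batch:
            break
        comments.extend(batch)
    return comments
```

Injecting `fetch_page` keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned data.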
Reddit: Use the JSON API by appending `.json` to any Reddit URL:
```bash
webfetch "https://www.reddit.com/r/subreddit/comments/postid.json?limit=500"
```
Parse the nested tree structure. Include replies as nested comments, but flatten them during clustering (replies usually repeat the parent comment's viewpoint).
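One possible way to parse the nested tree and flatten it later is sketched below. It assumes Reddit's usual Listing layout (`data.children`, with comments carrying `kind` equal to `"t1"` and an empty string in `replies` when there are none); mapping `score` onto the `likes` field is a choice made for this sketch, not something the text above specifies.

```python
from datetime import datetime, timezone


def parse_listing(listing) -> list:
    """Convert a Reddit Listing into nested records shaped like comments_raw.json."""
    if not isinstance(listing, dict):  # "" means the comment has no replies
        return []
    records = []
    for child in listing["data"]["children"]:
        if child["kind"] != "t1":  # skip "more" placeholders and non-comments
            continue
        data = child["data"]
        created = datetime.fromtimestamp(data["created_utc"], tz=timezone.utc)
        records.append({
            "id": data["id"],
            "author": data.get("author") or "[deleted]",
            "text": data.get("body", ""),
            "likes": data.get("score", 0),  # assumption: score stands in for likes
            "replies": parse_listing(data.get("replies")),
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
        })
    return records


def flatten(comments: list) -> list:
    """Return every comment and reply as one flat list, parents first."""
    flat = []
    for comment in comments:
        flat.append({**comment, "replies": []})
        flat.extend(flatten(comment["replies"]))
    return flat
```

A post's `.json` response is typically a two-element array whose second element is the comment Listing, so the call would look like `parse_listing(response[1])`.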
GitHub Issues: Use the GitHub API:
```bash
webfetch "https://api.github.com/repos/owner/repo/issues/issue_number/comments?per_page=100"
```
Use `&page=N` for pagination. Also retrieve the issue body, since it defines the context of the discussion.
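The pagination and the mapping onto the `comments_raw.json` structure might look like the sketch below. `fetch_json` is a hypothetical caller-supplied function that returns the decoded JSON array for a URL. Stopping when a page is shorter than `per_page` and using the `+1` reaction count as `likes` are assumptions of this sketch.

```python
from typing import Callable


def comments_url(owner: str, repo: str, issue_number: int,
                 page: int, per_page: int = 100) -> str:
    """Build the issue comments URL for one page."""
    return (f"https://api.github.com/repos/{owner}/{repo}/issues/"
            f"{issue_number}/comments?per_page={per_page}&page={page}")


def to_record(item: dict) -> dict:
    """Map one GitHub issue comment onto the comments_raw.json structure."""
    return {
        "id": str(item["id"]),
        "author": (item.get("user") or {}).get("login", "ghost"),
        "text": item.get("body") or "",
        # Assumption: thumbs-up reactions are the closest analogue of likes.
        "likes": (item.get("reactions") or {}).get("+1", 0),
        "replies": [],  # issue comments are not threaded
        "timestamp": item["created_at"],
    }


def collect_issue_comments(fetch_json: Callable[[str], list], owner: str,
                           repo: str, issue_number: int,
                           per_page: int = 100) -> list:
    """Fetch pages until one returns fewer than per_page items."""
    records, page = [], 1
    while True:
        items = fetch_json(comments_url(owner, repo, issue_number, page, per_page))
        records.extend(to_record(item) for item in items)
        if len(items) < per_page:
            return records
        page += 1
```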
Step 2: Preprocessing
Clean the raw comments before analysis:
- Remove bot comments, spam, and meaningless comments (e.g., "+1", "bump", single emojis)
- If there are more than 500 comments, sample strategically: take the most-liked comments plus a random sample of medium-popularity comments, so that minority viewpoints are still captured
- Retain metadata (like count, author), which helps judge which viewpoints are more popular
Save the cleaned data to `comments_cleaned.json`.
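The cleaning rules above could be approximated as follows. This is a rough sketch, not a definitive filter: the noise pattern, the `[bot]` author suffix check, the 50/50 split between top-liked and randomly sampled comments, and treating everything below the top slice as "medium popularity" are all assumptions introduced here.

```python
import random
import re

# Matches "+1", "bump", or text with no letters or digits at all (e.g. a lone emoji).
NOISE = re.compile(r"\+1|bump|\W*", re.IGNORECASE)


def is_noise(text: str) -> bool:
    """True for comments that carry no argument."""
    return NOISE.fullmatch(text.strip()) is not None


def clean(comments: list) -> list:
    """Drop noise and accounts whose name ends in "[bot]" (heuristic)."""
    return [
        c for c in comments
        if not is_noise(c["text"]) and not c["author"].endswith("[bot]")
    ]


def sample(comments: list, limit: int = 500,
           top_share: float = 0.5, seed: int = 0) -> list:
    """Keep the most-liked comments plus a random draw from the rest."""
    if len(comments) <= limit:
        return list(comments)
    ranked = sorted(comments, key=lambda c: c["likes"], reverse=True)
    n_top = int(limit * top_share)
    top, rest = ranked[:n_top], ranked[n_top:]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return top + rng.sample(rest, limit - n_top)
```

Metadata such as `likes` and `author` passes through untouched, so later steps can still weigh viewpoints by popularity.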
Step 3: Semantic Clustering
Read all cleaned comments and group them by semantic similarity: comments that express the same underlying argument belong together, even if the wording is completely different.
How to cluster efficiently:
- Read comments in batches (50-100 at a time) and do a first round of grouping
- Merge clusters by comparing results across batches: same argument = same cluster
- Each cluster should represent a distinct position or argument, not just a topic
- Name each cluster with a concise argument statement (rather than a topic label)
Cluster naming guideline: each cluster name should be an argument, not a topic.
- Good example: "This feature breaks backward compatibility and should be made optional"
- Bad example: "Backward compatibility concerns"
Save the clustering results to `clusters.json`:
```json
[
  {
    "cluster_id": 1,
    "name": "Concise argument statement",
    "comment_count": 45,
    "representative_comments": ["full text of 2-3 best examples"],
    "support_ratio": 0.7,
    "sample_comment_ids": ["id1", "id2", "id3"]
  }
]
```
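Once each comment has been assigned to an argument, assembling the `clusters.json` records is mechanical. The sketch below assumes the assignments are available as a mapping from comment id to cluster name (a hypothetical intermediate format) and picks the most-liked comments as representatives, which is only one reasonable choice. `support_ratio` is left unset because the text above does not define how it is computed.

```python
from collections import defaultdict


def build_clusters(comments: list, assignments: dict) -> list:
    """Group cleaned comments by assigned cluster name.

    comments: records shaped like comments_cleaned.json
    assignments: comment id -> cluster name (hypothetical format)
    """
    groups = defaultdict(list)
    for comment in comments:
        name = assignments.get(comment["id"])
        if name is not None:
            groups[name].append(comment)

    # Largest clusters first; ids are assigned in that order.
    ordered = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    clusters = []
    for cluster_id, (name, members) in enumerate(ordered, start=1):
        by_likes = sorted(members, key=lambda c: c["likes"], reverse=True)
        clusters.append({
            "cluster_id": cluster_id,
            "name": name,
            "comment_count": len(members),
            "representative_comments": [c["text"] for c in by_likes[:3]],
            "support_ratio": None,  # not defined in the source; left unset
            "sample_comment_ids": [c["id"] for c in by_likes[:3]],
        })
    return clusters
```

Because the grouping key is the argument statement itself, merging two batches amounts to giving matching arguments the same name before calling `build_clusters`.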
Step 4: Debate Analysis
For each cluster, identify the following:
- Position: What exactly is this group arguing for?
- Arguments: What facts, experiences, or logic do they cite?
- Strength of conviction: Are they certain or hesitant? Judge from language cues and like counts
- Relationship to other clusters: Does it oppose another cluster, or complement or extend one?
Then look across all clusters to identify:
- Core debate axes: the fundamental disagreements (e.g., "Security vs. Convenience", "Innovation vs. Stability")
- Consensus points: points that most clusters agree on
- Disagreement points: points on which the community is clearly divided
- Minority viewpoints: viewpoints held by few people but backed by strong arguments
Step 5: Generate Report
Output a Markdown report using the following template:
```markdown
[Topic] Community Opinion Analysis

Data Source: [URL]
Total Comments: N (M valid comments analyzed)
Generated: YYYY-MM-DD

Summary
[2-3 sentences summarizing the community's overall attitude and main disagreements]

Core Debate Points
[Describe the 1-2 most fundamental axes of disagreement and explain why they are the focus of the debate]

Viewpoint Clusters
Viewpoint 1: [Argument Statement]
- Proportion: ~X% (about N comments)
- Core Arguments: [Main reasons supporting this viewpoint]
- Typical Comments: [1-2 representative comments, quoted verbatim]
- Popularity: [Like count/support level]

Viewpoint 2: [Argument Statement]
...

Consensus & Disagreements
Consensus
- [Points most people agree on]
Disagreements
- [Main points of opposition, and which viewpoints are in direct conflict]

Minority Viewpoints
- [Viewpoints held by few people but backed by strong arguments, and worth attention]

Sentiment Analysis
- Overall Sentiment: [Positive/Negative/Neutral/Mixed]
- Sentiment Intensity: [Intense/Mild]
- Sentiment Trend: [If timeline data is available]
```
If the user also requests JSON output, save the structured data as well.
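The viewpoint entries of the report can be filled in from the cluster records. The sketch below is illustrative only: it assumes cluster dictionaries shaped like `clusters.json`, computes each proportion against the number of valid comments analyzed, and leaves the remaining fields of the template (core arguments, popularity) to be written by hand.

```python
def render_viewpoints(clusters: list, analyzed_total: int) -> str:
    """Render the per-viewpoint lines of the report from cluster records."""
    lines = []
    for index, cluster in enumerate(clusters, start=1):
        count = cluster["comment_count"]
        percent = round(100 * count / analyzed_total)
        lines.append(f"Viewpoint {index}: {cluster['name']}")
        lines.append(f"- Proportion: ~{percent}% (about {count} comments)")
        # Quote at most two representative comments, as the template asks.
        quotes = cluster["representative_comments"][:2]
        quoted = "; ".join(f'"{q}"' for q in quotes)
        lines.append(f"- Typical Comments: {quoted}")
    return "\n".join(lines)
```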
Analysis Tips
- Don't just count votes: a highly liked minority viewpoint may matter more than a majority position with little engagement
- Pay attention to how people argue, not just what they say. Sarcasm, emotional language, and defensive statements all signal strongly held positions
- Look for implied arguments: sometimes the real disagreement is never stated outright (e.g., people arguing about implementation details may actually be arguing about priorities)
- Cross-reference replies: a reply that strongly opposes its parent comment reveals the structure of the debate
- If comments span multiple languages, cluster by argument regardless of language, then note the language distribution within each cluster
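The first tip can be made concrete with a small heuristic that flags clusters holding a small share of comments yet drawing above-average likes per comment. This is only a sketch: `total_likes` is a hypothetical per-cluster field that `clusters.json` does not contain, the 15% threshold is arbitrary, and likes are at best a rough proxy for how strong an argument is.

```python
def flag_minority_viewpoints(clusters: list, max_share: float = 0.15) -> list:
    """Return names of small clusters with above-average likes per comment."""
    total_comments = sum(c["comment_count"] for c in clusters)
    if total_comments == 0:
        return []
    total_likes = sum(c["total_likes"] for c in clusters)
    baseline = total_likes / total_comments  # average likes per comment overall

    flagged = []
    for cluster in clusters:
        count = cluster["comment_count"]
        if count == 0:
            continue
        share = count / total_comments
        likes_per_comment = cluster["total_likes"] / count
        if share <= max_share and likes_per_comment > baseline:
            flagged.append(cluster["name"])
    return flagged
```

Flagged clusters are candidates for the minority viewpoints part of the analysis and still need to be read to judge whether their arguments hold up.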