mteb-leaderboard
MTEB Leaderboard Query Skill
This skill provides guidance for accurately querying machine learning model leaderboards and benchmarks, particularly the Massive Text Embedding Benchmark (MTEB) and related embedding leaderboards.
When to Use This Skill
- Finding top-performing models on specific benchmarks (MTEB, Scandinavian Embedding Benchmark, etc.)
- Answering questions about current leaderboard standings
- Comparing model performance across different benchmarks
- Tasks with specific temporal requirements (e.g., "as of August 2025")
Core Approach
Step 1: Identify Authoritative Data Sources
Before searching for results, establish which sources contain authoritative, current data:
- Primary Sources (prefer these):
  - Official leaderboard websites (e.g., mteb-leaderboard on HuggingFace Spaces)
  - GitHub repositories with raw benchmark data
  - API endpoints or JSON data files from leaderboard maintainers
- Secondary Sources (use with caution):
  - Academic papers (often outdated by publication time)
  - Blog posts and articles (may reference outdated results)
  - News articles about benchmark results
Step 2: Verify Temporal Alignment
When a task specifies a time constraint (e.g., "as of August 2025"):
- Check source publication/update dates - Academic papers are typically 6-18 months behind current leaderboard state
- Look for "last updated" timestamps on leaderboard pages
- Never assume paper results reflect current standings without verification
- Be explicit about temporal gaps - If using data from June 2024 to answer about August 2025, this is a 14+ month gap that likely invalidates the data
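The gap arithmetic above can be made explicit with a small helper. This is a minimal sketch, not part of any leaderboard tooling: the function names and the 3-month default threshold are assumptions chosen for illustration, and the right threshold depends on how fast the leaderboard in question moves.

```python
from datetime import date


def months_between(source_date: date, target_date: date) -> int:
    """Whole calendar months from source_date to target_date (day of month is ignored)."""
    return (target_date.year - source_date.year) * 12 + (
        target_date.month - source_date.month
    )


def is_stale(source_date: date, target_date: date, max_gap_months: int = 3) -> bool:
    """True when the source predates the target by more than max_gap_months.

    The 3-month default is an illustrative assumption, not a documented rule.
    """
    return months_between(source_date, target_date) > max_gap_months


# Data from June 2024 used to answer a question about August 2025.
gap = months_between(date(2024, 6, 1), date(2025, 8, 1))
print(gap)  # 14
print(is_stale(date(2024, 6, 1), date(2025, 8, 1)))  # True
```

Reporting the computed gap alongside the answer keeps the temporal mismatch visible instead of implicit.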
Step 3: Access Live Leaderboard Data
When web pages don't render properly (interactive charts, JavaScript-heavy pages):
- Look for raw data endpoints:
  - Check for /api/ or /data/ endpoints
  - Search for JSON files in the page source
  - Look for GitHub repositories backing the leaderboard
- Try alternative access methods:
  - HuggingFace Spaces often have Gradio APIs
  - Many leaderboards publish CSV/JSON exports
  - Check GitHub issues/discussions for data access tips
- Search for data repositories:
  - site:github.com [leaderboard name] results json
  - site:huggingface.co [benchmark name] leaderboard
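One way to search the page source for raw data is to scan already-fetched HTML for links that look like data files or API paths. The sketch below uses only the Python standard library. The suffix list, the path hints, and the `example.org` URLs are illustrative assumptions, so a real leaderboard may expose its data differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed heuristics: file suffixes and path fragments that often indicate raw data.
DATA_SUFFIXES = (".json", ".jsonl", ".csv")
DATA_PATH_HINTS = ("/api/", "/data/")


class DataLinkFinder(HTMLParser):
    """Collects href/src attribute values that look like data files or endpoints."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("href", "src") or not value:
                continue
            url = urljoin(self.base_url, value)  # resolve relative links
            path = url.split("?", 1)[0].lower()  # ignore the query string
            looks_like_data = path.endswith(DATA_SUFFIXES) or any(
                hint in path for hint in DATA_PATH_HINTS
            )
            if looks_like_data and url not in self.links:
                self.links.append(url)


def find_data_links(html: str, base_url: str) -> list[str]:
    """Return candidate data URLs found in the HTML, in document order."""
    finder = DataLinkFinder(base_url)
    finder.feed(html)
    return finder.links


sample = """
<a href="/about">About</a>
<a href="results/summary.json">raw results</a>
<script src="/api/leaderboard?split=test"></script>
"""
print(find_data_links(sample, "https://example.org/board/"))
```

Candidates found this way still need to be opened and checked. A matching URL only suggests that raw data may live there.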
Step 4: Validate Model Eligibility
Do not make assumptions about which models "count" on a leaderboard:
- Check official leaderboard criteria - Some include API models, some don't
- Verify the answer format requirements against actual leaderboard entries
- Do not exclude models based on assumptions about what can be represented in a given format
- Consider all model types: open-source, API-based, fine-tuned variants
Verification Strategies
Cross-Reference Multiple Sources
- Compare results from at least 2-3 independent sources
- If sources disagree, prioritize the most recent authoritative source
- Document discrepancies and their potential causes
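The cross-referencing rule above can be sketched as a small reconciliation step: prefer authoritative sources, break ties by recency, and keep a note for every source that disagrees. The `Observation` type, its fields, and the model and source names in the example are hypothetical placeholders, not real leaderboard data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    source: str          # where the claim was found
    top_model: str       # model the source reports as ranked first
    as_of: date          # date the source was published or last updated
    authoritative: bool  # True for primary sources such as the live leaderboard


def reconcile(observations: list[Observation]) -> tuple[Observation, list[str]]:
    """Pick the most recent authoritative observation and list disagreements with it."""
    if not observations:
        raise ValueError("need at least one observation")
    # Authoritative sources win first; among equals, the latest date wins.
    chosen = max(observations, key=lambda o: (o.authoritative, o.as_of))
    notes = [
        f"{o.source} ({o.as_of.isoformat()}) reports {o.top_model}"
        for o in observations
        if o.top_model != chosen.top_model
    ]
    return chosen, notes


# Hypothetical observations for illustration only.
observations = [
    Observation("paper", "org-a/model-x", date(2024, 6, 1), False),
    Observation("blog post", "org-b/model-y", date(2025, 9, 1), False),
    Observation("live leaderboard", "org-c/model-z", date(2025, 8, 1), True),
]
chosen, notes = reconcile(observations)
print(chosen.source, chosen.top_model)
for note in notes:
    print("discrepancy:", note)
```

In this example the blog post is newer than the leaderboard snapshot but is not authoritative, so it is recorded as a discrepancy rather than chosen.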
Sanity Check Results
- Verify the model actually appears on the leaderboard
- Confirm the model name/organization format matches the source
- Check if the model was released before the specified date
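These three checks are easy to run mechanically once the leaderboard entries are in hand. The following is a rough sketch under stated assumptions: the `organization/model` naming pattern mirrors the style commonly used on HuggingFace but may not match every leaderboard, and the function name, arguments, and example entries are hypothetical.

```python
import re
from datetime import date

# Assumed name format: "organization/model", as commonly seen on HuggingFace.
ORG_MODEL = re.compile(r"^[\w.-]+/[\w.-]+$")


def sanity_check(
    candidate: str,
    leaderboard_names: set[str],
    release_date: date,
    cutoff: date,
) -> list[str]:
    """Return a list of problems; an empty list means all three checks passed."""
    problems = []
    if candidate not in leaderboard_names:
        problems.append("not listed on the leaderboard")
    if not ORG_MODEL.match(candidate):
        problems.append("name is not in organization/model form")
    if release_date > cutoff:
        problems.append("released after the cutoff date")
    return problems


# Hypothetical leaderboard entries for illustration only.
names = {"org-a/model-x", "org-b/model-y"}
print(sanity_check("org-a/model-x", names, date(2025, 5, 1), date(2025, 8, 31)))  # []
print(sanity_check("model-x", names, date(2025, 10, 1), date(2025, 8, 31)))
```

Returning every failed check, rather than stopping at the first, makes it easier to see whether a candidate is nearly right (a formatting issue) or simply wrong.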
Test Alternative Access Methods
When primary access fails:
- Try the Wayback Machine for historical snapshots
- Search for leaderboard maintainer announcements
- Look for community discussions about recent changes
- Check if there's a programmatic API
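For the Wayback Machine step, the Internet Archive offers an availability endpoint that returns the snapshot closest to a requested date. The sketch below builds the query and parses the response with the standard library. The leaderboard URL is a placeholder, the helper names are hypothetical, and a snapshot of a JavaScript-heavy page may not contain the rendered table, so treat any result as a lead to verify.

```python
import json
from datetime import date
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"


def wayback_query_url(page_url: str, when: date) -> str:
    """Build the availability query for the snapshot closest to the given date."""
    params = {"url": page_url, "timestamp": when.strftime("%Y%m%d")}
    return WAYBACK_API + "?" + urlencode(params)


def closest_snapshot(payload: dict) -> Optional[str]:
    """Extract the snapshot URL from a decoded availability response, if any."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def fetch_snapshot_url(page_url: str, when: date, timeout: float = 10.0) -> Optional[str]:
    """Query the live API (network access required) and return the snapshot URL."""
    with urlopen(wayback_query_url(page_url, when), timeout=timeout) as response:
        return closest_snapshot(json.load(response))


# Offline demonstration with a placeholder page URL.
print(wayback_query_url("https://example.org/board", date(2025, 8, 1)))
print(closest_snapshot({"archived_snapshots": {}}))  # None
```

Check the timestamp of the returned snapshot as well: the closest capture can be months away from the requested date.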
Common Pitfalls to Avoid
1. Relying on Outdated Academic Papers
Academic papers have publication delays of 3-12 months. A paper published in June 2024 contains data from early 2024 at best. Never use paper results for questions about current standings.
2. Giving Up When Web Scraping Fails
Interactive leaderboards often don't render in simple web fetches. Always try:
- Looking for underlying data files
- Checking GitHub repositories
- Finding API endpoints
- Searching for data exports
3. Making Assumptions About Model Format
Do not assume API models (OpenAI, Cohere, etc.) cannot be valid answers. Check the actual task requirements and leaderboard contents.
4. Premature Conclusion Without Verification
Before writing a final answer:
- Verify the model appears on the actual leaderboard
- Confirm the ranking is current
- Check that the model meets all task requirements
5. Ignoring Temporal Requirements
If a task asks about a specific date, ensure data sources reflect that timeframe. A 14-month gap between data and required date is unacceptable.
Systematic Search Strategy
When searching for leaderboard information:
- Start broad, then narrow:
  - [benchmark name] leaderboard 2025
  - [benchmark name] top models current
  - site:huggingface.co [benchmark name]
- Search for raw data:
  - [benchmark name] results github
  - [benchmark name] json data
  - [benchmark name] api
- Search for recent updates:
  - [benchmark name] new top model [current year]
  - [benchmark name] leaderboard update
- Avoid repetitive similar queries - If a query pattern isn't working after 2-3 attempts, change the approach rather than making minor variations
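The query patterns above can be generated from templates so that each stage is tried in order and no query is repeated. This is only an illustrative sketch: the stage names and the `build_queries` helper are hypothetical, and the year is passed in as a parameter instead of being fixed to 2025.

```python
# Query templates taken from the search patterns above, grouped by stage.
QUERY_TEMPLATES = {
    "broad": [
        "{name} leaderboard {year}",
        "{name} top models current",
        "site:huggingface.co {name}",
    ],
    "raw_data": [
        "{name} results github",
        "{name} json data",
        "{name} api",
    ],
    "recent": [
        "{name} new top model {year}",
        "{name} leaderboard update",
    ],
}

STAGE_ORDER = ("broad", "raw_data", "recent")


def build_queries(name: str, year: int) -> list[tuple[str, str]]:
    """Return (stage, query) pairs in search order, skipping duplicate queries."""
    seen: set[str] = set()
    queries = []
    for stage in STAGE_ORDER:
        for template in QUERY_TEMPLATES[stage]:
            query = template.format(name=name, year=year)
            if query not in seen:
                seen.add(query)
                queries.append((stage, query))
    return queries


for stage, query in build_queries("MTEB", 2025):
    print(f"{stage}: {query}")
```

Each stage holds at most three queries, which lines up with the guidance to switch approach after 2-3 unproductive attempts instead of rewording the same search.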
Output Checklist
Before submitting an answer, verify:
- Data source is current (not outdated paper)
- Model appears on the actual leaderboard
- Temporal requirements are met
- Model format matches requirements
- No unvalidated assumptions were made
- Answer was cross-referenced where possible