Loading...
Loading...
Found 24 Skills
Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
Best practices for scikit-learn machine learning, model development, evaluation, and deployment in Python
Master fine-tuning of large language models for specific domains and tasks. Covers data preparation, training techniques, optimization strategies, and evaluation methods. Use when adapting models for specialized applications, reducing inference costs, or improving domain-specific performance.
Designs structured benchmarks for comparing algorithms, models, or implementations. Selects appropriate metrics (latency, throughput, memory, accuracy), designs representative test cases, captures hardware/software context, produces comparison tables with tradeoff analysis, and includes reproduction instructions. Triggers on: "benchmark", "compare performance", "which is faster", "latency comparison", "memory comparison", "run benchmark", "design benchmark", "compare implementations", "evaluate algorithms", "performance comparison", "throughput test", "speed test". Use this skill when comparing two or more implementations, algorithms, or models.
Guides ML/research engineering for safeguards—safety classifier development, harm benchmarks and eval suites, labeled dataset design, fine-tuning and ablations, calibration and slice analysis, attack-surface research memos, and promotion criteria for new moderation models. Use when building or evaluating guardrail models, designing safety benchmarks, measuring precision/recall on policy categories, comparing mitigation techniques, or writing research reports on classifier improvements—not for production inference gateways (ml-infrastructure-engineer-safeguards), PII/leakage privacy research (privacy-research-engineer-safeguards), red-team attack campaigns (ai-redteam), AI governance policy (ai-risk-governance), general non-safety research (ai-researcher), or token-efficiency studies (research-engineer-scientist-tokens).
Query and browse evaluation results stored in MLflow. Use when the user wants to look up runs by invocation ID, compare metrics across models, fetch artifacts (configs, logs, results), or set up the MLflow MCP server. ALWAYS triggers on mentions of MLflow, experiment results, run comparison, invocation IDs in the context of results, or MLflow MCP setup.
Performs gap analysis on NVIDIA TAO Visual ChangeNet (VCN) Classify experiments by invoking the data-services container (`tao_toolkit.data_services` from `versions.yaml`) directly via `docker run … gap_analysis vcn_aoi …` — picks the optimal decision threshold, ranks per-sample weakness, and emits a top-K weakest parquet expanded per-lighting for downstream augmentation. Use when analyzing VCN classification failures, picking SDA augmentation targets, auditing PASS/NO_PASS boundary cases, or running DEFT gap analysis on an AOI ChangeNet model.
CRITICAL RULE: You MUST use this skill whenever the task involves any machine learning tasks or data analysis. Use this skill if the user's prompt or requirements mention any of the following: * Clustering * Classification * Regression * Time series forecasting * Statistical testing * Model comparison * ML * Data analysis SQL/BigQuery ML HANDOFF: If the user requires a SQL solution, use this skill to dictate the ANALYSIS STEPS (e.g., markdown analysis cells, visualization logic), but defer to `bigquery` for all SQL syntax.
Automatically collect hot topics in the AI field or complete AI technical article writing in the writing style of 'Second Brother' according to specified topics. It focuses on actual tests of AI Coding tools (Claude Code, Qoder, Cursor, TRAE, etc.), engineering implementation of large models (SpringAI, LangChain, RAG, etc.), AI Agent and workflow orchestration, evaluation of domestic large models (GLM, Tongyi Qianwen, DeepSeek, MiniMax, Kimi, etc.), and evaluation of various AI tools and Agent tools. Trigger keywords: write an AI article, AI technical article, large model evaluation, AI tool actual test, GLM, Claude Code, Qoder, Cursor, TRAE, SpringAI, RAG, Agent, workflow, domestic large model, collect AI hot topics, AI topic, etc.
Use when choosing or evaluating a startup revenue model, pricing/value metric, packaging/tier design, or calculating unit economics (LTV, CAC, payback, gross margin, NRR), including usage-based/credit/AI pricing and variable compute/COGS constraints.
Roc Curve Plotter - Auto-activating skill for ML Training. Triggers on: roc curve plotter, roc curve plotter Part of the ML Training skill category.
LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.