Found 32 Skills
Produce an LLM Build Pack (prompt+tool contract, data/eval plan, architecture+safety, launch checklist). Use for building with LLMs, GPT/Claude apps, prompt engineering, RAG, and tool-using agents.
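To make the "prompt+tool contract" artifact concrete, here is a minimal sketch of what one such contract might contain: a system prompt paired with a JSON-Schema tool definition in the generic function-calling style used by GPT/Claude-class APIs. The `search_docs` tool and its fields are illustrative, not part of the skill itself.

```python
# Hypothetical "prompt + tool contract" artifact from a Build Pack.
# All names and field values below are placeholders.

SYSTEM_PROMPT = (
    "You are a support assistant. Answer only from retrieved context; "
    "call `search_docs` before answering product questions."
)

TOOL_CONTRACT = {
    "name": "search_docs",
    "description": "Retrieve product documentation passages relevant to a query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Natural-language search query."},
            "top_k": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
}
```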
MCP (Model Context Protocol) server build and evaluation guide, including local conventions for tool surfaces, config, and testing.
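As a rough sketch of the kind of tool surface such a guide covers, the following assumes the official `mcp` Python SDK's `FastMCP` helper; the server name and the `add` tool are placeholders.

```python
from mcp.server.fastmcp import FastMCP

# Minimal MCP server exposing a single tool over the default stdio transport.
mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two integers and return the sum."""
    return a + b

if __name__ == "__main__":
    mcp.run()
```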
Use when the user wants to find a note to publish as a blog post. Triggers on phrases such as「选一篇笔记发博客」(pick a note to publish as a blog post),「note to blog」,「写博客」(write a blog post), and「博客选题」(blog topic selection). Scans Obsidian notes via a Python script, evaluates blog-readiness, and supports batch selection with a fast/deep dual-track mode and parallel Agent dispatch.
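A minimal sketch of the scanning step, assuming a hypothetical vault path and a crude blog-readiness heuristic; the skill's actual script and scoring will differ.

```python
from pathlib import Path

VAULT = Path("~/Obsidian/Notes").expanduser()  # placeholder vault location

def blog_readiness(text: str) -> int:
    """Crude readiness score: longer, structured notes with few TODOs rank higher."""
    score = min(len(text.split()) // 200, 5)                          # reward length, capped
    score += sum(1 for line in text.splitlines() if line.startswith("#"))  # reward headings
    score -= text.count("TODO")                                       # penalize unfinished notes
    return score

candidates = sorted(
    ((blog_readiness(p.read_text(encoding="utf-8")), p) for p in VAULT.rglob("*.md")),
    reverse=True,
)
for score, path in candidates[:10]:
    print(f"{score:>3}  {path.name}")
```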
Use when discussing or working with DeepEval (the Python AI evaluation framework).
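For orientation, a typical DeepEval test might look like the sketch below, assuming DeepEval's pytest-style `assert_test` API and an `AnswerRelevancyMetric` judged by its default model; the example question and threshold are illustrative.

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    # One test case: the input prompt and the model output under evaluation.
    test_case = LLMTestCase(
        input="What is the refund window?",
        actual_output="You can request a refund within 30 days of purchase.",
    )
    # Fails the test if the relevancy score falls below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```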
MLflow ML lifecycle management. Use for ML experiment tracking.
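A minimal tracking sketch, assuming a local MLflow backend; the experiment name, parameters, metric, and artifact file are placeholders.

```python
import mlflow

mlflow.set_experiment("prompt-tuning")  # creates the experiment if it does not exist

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("temperature", 0.2)
    mlflow.log_metric("accuracy", 0.87)
    mlflow.log_artifact("eval_report.json")  # assumes this file exists locally
```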
Build and run LLM-as-judge evaluation pipelines using Amazon Bedrock Evaluation Jobs with pre-computed inference datasets. Use when setting up automated model evaluation, designing test scenarios, collecting pre-computed responses, configuring custom metrics, creating AWS infrastructure, running evaluation jobs, parsing results, and iterating on findings.
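A hedged boto3 sketch of starting such a job follows. The nested `evaluationConfig`/`inferenceConfig` shapes are reconstructed from memory of the `CreateEvaluationJob` API and should be verified against current AWS documentation; the role ARN, bucket names, metric names, and identifiers are placeholders.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_evaluation_job(
    jobName="judge-precomputed-demo",
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",      # placeholder IAM role
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "General",
                "dataset": {
                    "name": "precomputed-responses",
                    "datasetLocation": {"s3Uri": "s3://my-eval-bucket/dataset.jsonl"},
                },
                "metricNames": ["Builtin.Correctness", "Builtin.Completeness"],
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                ]
            },
        }
    },
    # Points at pre-computed responses rather than invoking a Bedrock model.
    inferenceConfig={
        "models": [
            {"precomputedInferenceSource": {"inferenceSourceIdentifier": "my-app-v1"}}
        ]
    },
    outputDataConfig={"s3Uri": "s3://my-eval-bucket/results/"},
)
print("Started:", response["jobArn"])
```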
Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
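As one deliberately simple instance of the automated-metric leg, the sketch below averages two string-matching metrics over a tiny hand-written test set; the case data and metric choices are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str
    actual: str  # response collected from the LLM under test

def exact_match(case: EvalCase) -> float:
    return float(case.expected.strip().lower() == case.actual.strip().lower())

def contains_expected(case: EvalCase) -> float:
    return float(case.expected.lower() in case.actual.lower())

def run_suite(cases: list[EvalCase],
              metrics: dict[str, Callable[[EvalCase], float]]) -> dict[str, float]:
    """Average each automated metric over the test set."""
    return {name: sum(m(c) for c in cases) / len(cases) for name, m in metrics.items()}

cases = [
    EvalCase("Capital of France?", "Paris", "The capital of France is Paris."),
    EvalCase("2 + 2?", "4", "4"),
]
print(run_suite(cases, {"exact_match": exact_match, "contains": contains_expected}))
```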
Use when you want rubric-based LLM quality scoring on generated outputs; pair with addon-deterministic-eval-suite.
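A minimal sketch of the rubric-judge pattern, assuming a hypothetical `call_judge_model` client and a 1-5 rubric; only the prompt construction and score parsing are shown.

```python
import re

RUBRIC = """Score the answer from 1-5:
5 = fully correct, grounded, and complete
3 = partially correct or missing key details
1 = incorrect or unsupported
Reply with a line 'SCORE: <n>' followed by a one-sentence justification."""

def build_judge_prompt(question: str, answer: str) -> str:
    return f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"

def parse_score(judge_reply: str) -> int | None:
    """Extract the integer score; None if the judge did not follow the format."""
    match = re.search(r"SCORE:\s*([1-5])", judge_reply)
    return int(match.group(1)) if match else None

# call_judge_model(...) stands in for whichever LLM client you use:
# reply = call_judge_model(build_judge_prompt(question, answer))
# score = parse_score(reply)
print(parse_score("SCORE: 4\nMostly correct but misses one edge case."))
```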