Search Results: llm-judge

Found 9 Skills

llm-judge

LLM-as-judge methodology for comparing code implementations across repositories. Scores implementations on functionality, security, test quality, overengineering, and dead code using weighted rubrics. Used by /beagle:llm-judge command.

🇺🇸|EnglishTranslated

AI & Machine Learningajrlewis/ai-skills

addon-llm-judge-evals

Use when you want rubric based LLM quality scoring on generated outputs; pair with addon-deterministic-eval-suite.

🇺🇸|EnglishTranslated

1 scripts/Checked

AI & Machine Learningposthog/skills

exploring-llm-evaluations

Investigate LLM analytics evaluations of both types — `hog` (deterministic code-based) and `llm_judge` (LLM-prompt-based). Find existing evaluations, inspect their configuration, run them against specific generations, query individual pass/fail results, and generate AI-powered summaries of patterns across many runs. Use when the user asks to debug why an evaluation is failing, surface common failure modes, compare results across filters, dry-run a Hog evaluator, prototype a new LLM-judge prompt, or manage the evaluation lifecycle (create, update, enable/disable, delete).

🇺🇸|EnglishTranslated

Testing & QAingpoc/skills

testing

Use when running tests to validate implementations, collecting test evidence, or debugging failures. Load in TEST state. Covers unit tests (pytest/jest), API tests (curl), browser tests (Claude-in-Chrome), database verification. All results are code-verified, not LLM-judged.

🇺🇸|EnglishTranslated

3 scripts/Attention

AI & Machine Learningposthog/skills

feature-usage-feed

Set up an LLM-judge evaluation that extracts canonical use cases for a PostHog feature at scale and streams the results to a Slack channel as a live feed. Use when someone wants to understand how users are actually using a specific AI/LLM-powered feature in production — what they're investigating, what questions they're trying to answer, and what patterns surface — without manually reading hundreds of traces. Assumes the feature emits `$ai_generation` and `$ai_evaluation` events with `$session_id` linkage to the trigger user's recording (the standard setup post the session-summary linkage PRs).

🇺🇸|EnglishTranslated

AI & Machine Learningmicrosoft/eval-guide

eval-faq

Answers AI agent evaluation methodology questions with practical, opinionated guidance grounded primarily in Microsoft's agent evaluation ecosystem (MS Learn, Eval Scenario Library, Triage & Improvement Playbook, Eval Guidance Kit) supplemented by select industry sources.

🇺🇸|EnglishTranslated

AI & Machine Learninglangchain-ai/langsmith-sk...

langsmith-evaluator

INVOKE THIS SKILL when building evaluation pipelines for LangSmith. Covers three core components: (1) Creating Evaluators - LLM-as-Judge, custom code; (2) Defining Run Functions - how to capture outputs and trajectories from your agent; (3) Running Evaluations - locally with evaluate() or auto-run via LangSmith. Uses the langsmith CLI tool.

🇺🇸|EnglishTranslated

AI & Machine Learninghamelsmu/evals-skills

validate-evaluator

Calibrate an LLM judge against human labels using data splits, TPR/TNR, and bias correction. Use after writing a judge prompt (write-judge-prompt) when you need to verify alignment before trusting its outputs. Do NOT use for code-based evaluators (those are deterministic; test with standard unit tests).

🇺🇸|EnglishTranslated

AI & Machine Learninghamelsmu/evals-skills

eval-audit

Audit an LLM eval pipeline and surface problems: missing error analysis, unvalidated judges, vanity metrics, etc. Use when inheriting an eval system, when unsure whether evals are trustworthy, or as a starting point when no eval infrastructure exists. Do NOT use when the goal is to build a new evaluator from scratch (use error-analysis, write-judge-prompt, or validate-evaluator instead).

🇺🇸|EnglishTranslated