Loading...
Loading...
Found 1,863 Skills
Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
Respond in Chinese to user requests; when the user's message is fully in English (ignoring punctuation, digits, emojis, and whitespace), append a brief Chinese evaluation plus a 1-10 score.
INVOKE THIS SKILL when building evaluation pipelines for LangSmith. Covers three core components: (1) Creating Evaluators - LLM-as-Judge, custom code; (2) Defining Run Functions - how to capture outputs and trajectories from your agent; (3) Running Evaluations - locally with evaluate() or auto-run via LangSmith. Uses the langsmith CLI tool.
Evaluate agents and skills for quality, completeness, and standards compliance using a 6-step rubric: Identify, Structural, Content, Code, Integration, Report. Use when auditing agents/skills, checking quality after creation or update, or reviewing collection health. Triggers: "evaluate", "audit", "check quality", "review agent", "score skill". Do NOT use for creating or modifying agents/skills — only for read-only assessment and scoring.
Evaluates NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation environments. Use when setting up cosmos-policy for robot manipulation evaluation, running headless GPU evaluations with EGL rendering, or profiling inference latency on cluster or local GPU machines.
Analyze SaaS company valuation compression between funding rounds. Use this skill whenever the user asks about: how much a SaaS company's valuation multiple changed between rounds, why the ARR multiple compressed or expanded, comparing a company's compression to macro benchmarks, or explaining what drove valuation changes for any VC-backed software company. Trigger on phrases like "valuation compression", "ARR multiple", "round-to-round valuation", "multiple change", or when the user asks to compare a company's funding rounds. Always use this skill for any multi-round SaaS valuation analysis — do not try to answer from memory alone.
This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment. Part of the context engineering skill suite — also activates when the user mentions "context engineering" or "context-engineering" in the context of evaluating LLM output quality.
Full browser UAT for web apps — Playwright testing with console/network error capture, accessibility checks, i18n validation, and bug triage. Use when running screen-by-screen UAT or testing specific features in any web or hybrid app (React, Vue, Angular, Ionic, Next.js, etc).
Evaluate options for a specific design decision node and recommend one with explicit trade-offs. Use when the design already exposes a concrete choice such as architecture style, state management approach, auth model, storage pattern, sync strategy, multi-agent coordination model, language or runtime, UI framework, data-layer library, or tooling selection. Trigger when the user needs structured comparison and recommendation for a bounded design decision. Do not use for broad design discovery, full-system decomposition, or final readiness review.
Prepare a research artifact package for conference artifact evaluation, reproducibility review, badges, supplementary material, or post-acceptance artifact release. Use this skill whenever the user needs install instructions, reviewer-facing reproduction commands, Docker or environment checks, data/checkpoint packaging, hardware/runtime estimates, anonymized or public artifact metadata, artifact evaluation forms, or a claim-to-artifact reproducibility audit for ML/AI venues.
Evaluates ML models for performance, fairness, and reliability. Use for metric selection, cross-validation strategies, overfitting/underfitting diagnosis, hyperparameter tuning, LLM evaluation, A/B testing, and production monitoring for model drift.
Discounted cash flow valuation and intrinsic value analysis for public companies. Use when the brief asks for DCF, fair value, intrinsic value, price target, undervalued or overvalued analysis, or "what is this company worth?"