Found 1,140 Skills
Comprehensive security and safety evaluation system for agent skills (.skill files). Use when users provide GitHub URLs, website links, or .skill files for download and request security assessment, safety evaluation, or ask "is this skill safe to use." Evaluates prompt injection risks, malicious code patterns, hidden instructions, data exfiltration attempts, and provides actionable recommendations with risk scoring.
Critically assess external feedback (code reviews, AI reviewers, PR comments) and decide which suggestions to apply using a confidence-based framework with adversarial verification. Use when the user asks to "evaluate findings", "assess review comments", "triage review feedback", "evaluate review output", or "filter false positives".
Comprehensive technology stack evaluation and comparison tool with TCO analysis, security assessment, and intelligent recommendations for engineering teams.
INVOKE THIS SKILL for LLM-as-judge evaluation workflows on Arize: creating/updating evaluators, running evaluations on spans or experiments, tasks, trigger-run, column mapping, and continuous monitoring. Use when the user says: create an evaluator, LLM judge, hallucination/faithfulness/correctness/relevance, run eval, score my spans or experiment, ax tasks, trigger-run, trigger eval, column mapping, continuous monitoring, query filter for evals, evaluator version, or improve an evaluator prompt.
Evaluate agents and skills for quality, completeness, and standards compliance using a 6-step rubric: Identify, Structural, Content, Code, Integration, Report. Use when auditing agents/skills, checking quality after creation or update, or reviewing collection health. Triggers: "evaluate", "audit", "check quality", "review agent", "score skill". Do NOT use for creating or modifying agents/skills — only for read-only assessment and scoring.
Systematic usability evaluation using established heuristics (Nielsen's 10, Shneiderman's 8, or custom rubrics). Use when reviewing UI designs, screenshots, prototypes, or live products for usability issues. Triggers on "review this design", "what's wrong with this UI", "usability check", "evaluate this interface", or when user shares screenshots/mockups asking for feedback.
Alibaba Cloud Governance Center evaluation report skill. Use for querying governance maturity check results, generating structured risk reports, and account compliance analysis. Triggers: "云治理", "成熟度检测", "合规检查", "安全风险", "治理检测", "governance evaluation", "maturity check", "compliance report", "risk report", "governance center".
Estimate intrinsic value of stocks and companies using DCF, dividend discount models, comparable multiples, and residual income. Use when the user asks about discounted cash flow, DCF models, WACC, terminal value, dividend discount models, comparable multiples, or sum-of-the-parts valuation. Also trigger when users mention 'what is this stock worth', 'fair value estimate', 'Gordon growth model', 'free cash flow valuation', 'cost of equity', 'sensitivity analysis', 'exit multiple', or ask whether a stock is overvalued or undervalued.
Evaluates NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation environments. Use when setting up cosmos-policy for robot manipulation evaluation, running headless GPU evaluations with EGL rendering, or profiling inference latency on cluster or local GPU machines.
Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when you need scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.
Use this skill for ANY question about CREATING evaluators. Covers creating custom metrics, LLM as Judge evaluators, code-based evaluators, and uploading evaluation logic to LangSmith. Includes basic usage of evaluators to run evaluations.
Comprehensive framework for evaluating AI vendors and solutions to avoid costly mistakes. Use this skill when assessing AI vendor proposals, conducting due diligence, evaluating contracts, comparing vendors, or making build-vs-buy decisions. Helps identify red flags, assess pricing models, evaluate technical capabilities, and conduct structured vendor comparisons.