evaluation-framework
Consult this skill when building evaluation or scoring systems. Use when implementing evaluation systems, creating quality gates, designing scoring rubrics, or building decision frameworks. Do not use for simple pass/fail checks that need no scoring.
NPX Install
npx skill4agent add athola/claude-night-market evaluation-framework
Table of Contents
- Overview
- When to Use
- Core Pattern
- 1. Define Criteria
- 2. Score Each Criterion
- 3. Calculate Weighted Total
- 4. Apply Decision Thresholds
- Quick Start
- Define Your Evaluation
- Example: Code Review Evaluation
- Evaluation Workflow
- Common Use Cases
- Integration Pattern
- Detailed Resources
- Exit Criteria
Evaluation Framework
Overview
A generic framework for weighted scoring and threshold-based decision making. Provides reusable patterns for evaluating any artifact against configurable criteria with consistent scoring methodology.
This framework abstracts the common pattern of: define criteria → assign weights → score against criteria → apply thresholds → make decisions.
When To Use
- Implementing quality gates or evaluation rubrics
- Building scoring systems for artifacts, proposals, or submissions
- Applying a consistent evaluation methodology across different domains
- Automating decisions based on thresholds
- Creating assessment tools with weighted criteria
When NOT To Use
- Simple pass/fail checks that need no scoring
Core Pattern
1. Define Criteria
```yaml
criteria:
  - name: criterion_name
    weight: 0.30  # 30% of total score
    description: What this measures
    scoring_guide:
      90-100: Exceptional
      70-89: Strong
      50-69: Acceptable
      30-49: Weak
      0-29: Poor
```
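To show how a scoring guide like this might be used in code, here is a minimal Python sketch. The `Criterion` dataclass and its `band_for` method are hypothetical names, not part of the framework. The guide is stored as lower bounds (an assumption) so that every score from 0 to 100, including fractional ones, maps to exactly one band.

```python
from dataclasses import dataclass, field


# Hypothetical representation of one criterion; field names mirror the YAML keys.
@dataclass
class Criterion:
    name: str
    weight: float
    description: str
    # Lower bound of each band -> label, mirroring the 0-29 / 30-49 / ... ranges.
    scoring_guide: dict = field(default_factory=lambda: {
        90: "Exceptional",
        70: "Strong",
        50: "Acceptable",
        30: "Weak",
        0: "Poor",
    })

    def band_for(self, score: float) -> str:
        """Return the scoring-guide label for a 0-100 score."""
        if not 0 <= score <= 100:
            raise ValueError(f"score must be between 0 and 100, got {score}")
        # Check bands from the highest lower bound down to the lowest.
        for lower in sorted(self.scoring_guide, reverse=True):
            if score >= lower:
                return self.scoring_guide[lower]
        raise ValueError("scoring guide has no band starting at 0")


criterion = Criterion("criterion_name", 0.30, "What this measures")
print(criterion.band_for(85))  # Strong
```

Keeping the guide next to the criterion lets a reviewer's numeric score be translated back into the label the rubric promises.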
2. Score Each Criterion
```python
scores = {
    "criterion_1": 85,  # Out of 100
    "criterion_2": 92,
    "criterion_3": 78,
}
```
3. Calculate Weighted Total
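```python
# Weights for each criterion from step 1; they must sum to 1.0.
weights = {"criterion_1": 0.30, "criterion_2": 0.40, "criterion_3": 0.30}
total = sum(score * weights[criterion] for criterion, score in scores.items())
# Example: (85 × 0.30) + (92 × 0.40) + (78 × 0.30) = 25.5 + 36.8 + 23.4 = 85.7
```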
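A more defensive variant of this calculation could check its inputs before summing. The sketch below is one way to do that; the function name `weighted_total` and the specific checks are assumptions for illustration, not something the framework prescribes.

```python
import math


def weighted_total(scores: dict, weights: dict) -> float:
    """Return sum(score * weight) after checking that the inputs are consistent."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same criteria")
    # Compare with a tolerance because float weights rarely sum to exactly 1.0.
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")
    for name, score in scores.items():
        if not 0 <= score <= 100:
            raise ValueError(f"{name}: score must be between 0 and 100")
    return sum(score * weights[name] for name, score in scores.items())


scores = {"criterion_1": 85, "criterion_2": 92, "criterion_3": 78}
weights = {"criterion_1": 0.30, "criterion_2": 0.40, "criterion_3": 0.30}
print(round(weighted_total(scores, weights), 1))  # 85.7
```

Failing loudly on a missing criterion or a bad weight sum keeps a misconfigured rubric from silently producing a plausible-looking total.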
4. Apply Decision Thresholds
```yaml
thresholds:
  80-100: Accept with priority
  60-79: Accept with conditions
  40-59: Review required
  20-39: Reject with feedback
  0-19: Reject
```
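Weighted totals are often fractional, and integer ranges such as 60-79 and 80-100 do not say where a total like 79.5 belongs. One possible reading, sketched below, treats each range as starting at its lower bound and running up to the next one. The `THRESHOLDS` list and the `decide` function are hypothetical names used only for this illustration.

```python
# (lower bound, action) pairs from the threshold table, highest first.
THRESHOLDS = [
    (80, "Accept with priority"),
    (60, "Accept with conditions"),
    (40, "Review required"),
    (20, "Reject with feedback"),
    (0, "Reject"),
]


def decide(total: float) -> str:
    """Map a 0-100 weighted total to the action of the range it falls in."""
    if not 0 <= total <= 100:
        raise ValueError(f"total must be between 0 and 100, got {total}")
    # The first lower bound the total reaches decides the action.
    for lower, action in THRESHOLDS:
        if total >= lower:
            return action
    raise ValueError("thresholds do not cover this total")


print(decide(85.7))  # Accept with priority
print(decide(79.5))  # Accept with conditions
```

Whether 79.5 should round up into the higher range is a policy choice; this sketch assumes it should not.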
Quick Start
Define Your Evaluation
- Identify criteria: What aspects matter for your domain?
- Assign weights: Which criteria are most important? (sum to 1.0)
- Create scoring guides: What does each score range mean?
- Set thresholds: What total scores trigger which decisions?
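When it is easier to rate importance on an informal scale than to pick fractions directly, the ratings can be normalized so the weights sum to 1.0. The sketch below assumes hypothetical raw ratings; they are chosen so that the result matches the weights in the code review example that follows. `normalize_weights` is an illustrative helper, not part of the framework.

```python
def normalize_weights(importance: dict) -> dict:
    """Scale non-negative importance ratings so the resulting weights sum to 1.0."""
    if any(value < 0 for value in importance.values()):
        raise ValueError("importance ratings must be non-negative")
    total = sum(importance.values())
    if total <= 0:
        raise ValueError("at least one rating must be positive")
    return {name: value / total for name, value in importance.items()}


# Hypothetical ratings on an informal 1-10 scale.
raw = {"correctness": 8, "maintainability": 5, "performance": 4, "testing": 3}
weights = normalize_weights(raw)
print(weights)
# {'correctness': 0.4, 'maintainability': 0.25, 'performance': 0.2, 'testing': 0.15}
```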
Example: Code Review Evaluation
```yaml
criteria:
  correctness: {weight: 0.40, description: Does code work as intended?}
  maintainability: {weight: 0.25, description: Is it readable?}
  performance: {weight: 0.20, description: Meets performance needs?}
  testing: {weight: 0.15, description: Tests detailed?}
thresholds:
  85-100: Approve immediately
  70-84: Approve with minor feedback
  50-69: Request changes
  0-49: Reject, major issues
```
Verification: Run `pytest -v` to verify tests pass.
Evaluation Workflow
1. Review artifact against each criterion
2. Assign 0-100 score for each criterion
3. Calculate: total = Σ(score × weight)
4. Compare total to thresholds
5. Take action based on threshold range
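The five steps can be sketched end to end using the weights and thresholds from the code review example. The reviewer scores below are hypothetical, and the threshold ranges are again read as lower bounds (an assumption) so that fractional totals always land in a range.

```python
# Weights and thresholds taken from the code review example.
WEIGHTS = {
    "correctness": 0.40,
    "maintainability": 0.25,
    "performance": 0.20,
    "testing": 0.15,
}
THRESHOLDS = [  # (lower bound, action), highest first
    (85, "Approve immediately"),
    (70, "Approve with minor feedback"),
    (50, "Request changes"),
    (0, "Reject, major issues"),
]


def evaluate(scores: dict) -> tuple:
    """Run steps 3-5: compute the weighted total and look up the action."""
    # Step 3: total = sum of score * weight over all criteria.
    total = sum(scores[name] * weight for name, weight in WEIGHTS.items())
    # Steps 4 and 5: the first lower bound the total reaches decides the action.
    for lower, action in THRESHOLDS:
        if total >= lower:
            return total, action
    raise ValueError("thresholds do not cover this total")


# Steps 1 and 2: hypothetical scores a reviewer assigned per criterion.
scores = {"correctness": 90, "maintainability": 80, "performance": 70, "testing": 60}
total, action = evaluate(scores)
print(f"{total:.1f}: {action}")  # 79.0: Approve with minor feedback
```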
Common Use Cases
Quality Gates: Code review, PR approval, release readiness
Content Evaluation: Document quality, knowledge intake, skill assessment
Resource Allocation: Backlog prioritization, investment decisions, triage
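For resource-allocation cases such as backlog prioritization, the same weighted total can rank several candidates instead of gating a single one. The criteria, weights, and backlog items in this sketch are all hypothetical and serve only to illustrate the idea.

```python
# Hypothetical criteria and weights for prioritizing a backlog.
WEIGHTS = {"user_value": 0.5, "urgency": 0.3, "effort_fit": 0.2}

# Hypothetical backlog items, each scored 0-100 per criterion.
backlog = {
    "fix-login-bug": {"user_value": 90, "urgency": 95, "effort_fit": 80},
    "dark-mode": {"user_value": 60, "urgency": 20, "effort_fit": 50},
    "upgrade-deps": {"user_value": 40, "urgency": 70, "effort_fit": 90},
}


def total(scores: dict) -> float:
    """Weighted total for one item."""
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())


# Highest weighted total first.
ranked = sorted(backlog, key=lambda item: total(backlog[item]), reverse=True)
for item in ranked:
    print(f"{total(backlog[item]):5.1f}  {item}")
```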
Integration Pattern
```yaml
# In your skill's frontmatter
dependencies: [leyline:evaluation-framework]
```
Then customize the framework for your domain:
- Define domain-specific criteria
- Set appropriate weights for your context
- Establish meaningful thresholds
- Document what each score range means
Detailed Resources
- Scoring Patterns: See modules/scoring-patterns.md for detailed methodology
- Decision Thresholds: See modules/decision-thresholds.md for threshold design
Exit Criteria
- Criteria defined with clear descriptions
- Weights assigned and sum to 1.0
- Scoring guides documented for each criterion
- Thresholds mapped to specific actions
- Evaluation process documented and reproducible
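Most of these exit criteria can be checked mechanically once the evaluation is loaded into a dictionary shaped like the YAML in the Core Pattern section. The sketch below assumes that shape; `check_exit_criteria` is a hypothetical helper, and the last criterion (a documented, reproducible process) still needs human review.

```python
import math


def check_exit_criteria(config: dict) -> list:
    """Return a list of problems; an empty list means the automated checks passed."""
    problems = []
    criteria = config.get("criteria", [])
    if not criteria:
        problems.append("no criteria defined")
    for criterion in criteria:
        name = criterion.get("name", "<unnamed>")
        if not criterion.get("description"):
            problems.append(f"{name}: missing description")
        if not criterion.get("scoring_guide"):
            problems.append(f"{name}: missing scoring guide")
    weight_sum = sum(criterion.get("weight", 0) for criterion in criteria)
    if criteria and not math.isclose(weight_sum, 1.0, abs_tol=1e-9):
        problems.append(f"weights sum to {weight_sum:.2f}, expected 1.00")
    if not config.get("thresholds"):
        problems.append("no thresholds mapped to actions")
    return problems


# Hypothetical, deliberately incomplete configuration.
config = {
    "criteria": [
        {"name": "correctness", "weight": 0.6, "description": "Does it work?",
         "scoring_guide": {"90-100": "Exceptional"}},
        {"name": "testing", "weight": 0.3, "description": "", "scoring_guide": {}},
    ],
    "thresholds": {"80-100": "Accept"},
}
for problem in check_exit_criteria(config):
    print(problem)
```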
Troubleshooting
Common Issues
Command not found
Ensure all dependencies are installed and in PATH
Permission errors
Check file permissions and run with appropriate privileges
Unexpected behavior
Enable verbose logging with the `--verbose` flag