Scaling and the Road to Human-Level AI
Strategic framework for understanding AI scaling laws and building products that leverage predictable AI capability improvements.
Core Concepts
Two Phases of AI Training
Pretraining: Models learn to predict the next token by imitating human-written text, understanding underlying correlations in data.
Reinforcement Learning (RL): Models are optimized based on human feedback, reinforcing helpful/honest/harmless behaviors and discouraging harmful ones.
Scaling laws exist for both phases—performance improves predictably with increased compute, data, and parameters.
Key Metrics
- Task Horizon: Length/complexity of tasks AI can complete, measured in equivalent human time
- Elo Scores: Rating system derived from pairwise human preference comparisons between models
- Context Window: Maximum amount of text (tokens) a model can process in a single conversation
Scaling Law Reliability
Scaling laws have held across 5+ orders of magnitude with physics-level precision. When scaling appears broken, assume training implementation issues first, not fundamental limits.
Strategic Decision Framework
Assess Current AI Capabilities
Use the two-axis capability framework:
- Y-axis (Flexibility): What modalities can the model handle?
- X-axis (Task Horizon): What equivalent human-time tasks can it complete?
Current trajectory: Task horizons double approximately every 7 months.
Product Timing Strategy
Current capability assessment:
├── Works reliably now → Build and ship immediately
├── Works 70-80% of time → Viable for error-tolerant use cases
├── Works marginally → Build now, ship when next model releases
└── Doesn't work at all → Wait 1-2 model generations
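The decision tree above can be sketched as a small function. The numeric thresholds (0.95, 0.70, 0.40) are illustrative assumptions standing in for "reliably," "70-80%," and "marginally"; they are not calibrated benchmarks from the source.

```python
def ship_decision(success_rate: float) -> str:
    """Map a measured task success rate to a product timing decision.

    Thresholds mirror the decision tree above; the exact cutoffs are
    illustrative assumptions, not calibrated benchmarks.
    """
    if success_rate >= 0.95:
        return "build and ship immediately"
    if success_rate >= 0.70:
        return "viable for error-tolerant use cases"
    if success_rate >= 0.40:
        return "build now, ship when next model releases"
    return "wait 1-2 model generations"
```

For example, `ship_decision(0.75)` falls into the error-tolerant band, which is why use-case selection (below) focuses on applications where 70-80% accuracy is acceptable.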
Key insight: Build products that don't quite work with current models. Target capabilities slightly beyond today's frontier—future models will make today's marginal products work.
Use Case Selection Criteria
Prioritize applications where:
- 70-80% accuracy is acceptable
- Breadth of knowledge matters more than deep focus on one hard problem
- Cross-domain synthesis creates value (biology + psychology + history)
- Human review can catch and correct errors
Deprioritize applications requiring:
- Near-perfect accuracy on first attempt
- Deep specialized reasoning without verification
- Tasks where errors compound catastrophically
Human-AI Collaboration Model
Role Division
Position humans as managers and sanity-checkers:
- AI generates options and drafts
- Humans verify, select, and course-correct
- Humans can judge quality well beyond what they can generate; this judgment-generation gap is what makes human review an efficient control point
Leverage AI's Strengths
Breadth over depth: AI excels at synthesizing information across many domains simultaneously. Target applications requiring:
- Literature synthesis across fields
- Pattern recognition across diverse data sources
- Rapid exploration of solution spaces
Practical Workflow
- Define the task scope and success criteria
- Have AI generate initial approach/draft
- Review for sanity and strategic alignment
- Iterate with targeted corrections
- Use AI to refine based on feedback
Forecasting AI Capabilities
Timeline Estimation Method
To estimate when a capability becomes viable:
1. Identify current task horizon (what length tasks work reliably)
2. Apply 7-month doubling rule
3. Calculate generations needed:
- Hour-long tasks → Day-long tasks: ~3 doublings (~21 months)
- Day-long tasks → Week-long tasks: ~3 doublings (~21 months)
- Week-long tasks → Month-long tasks: ~2 doublings (~14 months, treating a month as ~4 working weeks)
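Under the 7-month doubling rule, the months needed follow directly from the ratio of target to current horizon. A minimal sketch, assuming working-time conversions (8-hour day, 40-hour week) and noting that the list above rounds doubling counts up to whole numbers:

```python
import math

DOUBLING_MONTHS = 7  # observed task-horizon doubling time


def months_to_reach(current_hours: float, target_hours: float) -> float:
    """Months until the task horizon grows from current to target,
    assuming one doubling every DOUBLING_MONTHS months."""
    doublings = math.log2(target_hours / current_hours)
    return doublings * DOUBLING_MONTHS


# Hour-long -> day-long (8 working hours): 3 doublings
print(round(months_to_reach(1, 8)))   # 21
# Day-long -> week-long (40 working hours): ~2.3 doublings
print(round(months_to_reach(8, 40)))  # 16
```

The choice of working time vs. calendar time changes the doubling counts, so treat any single estimate as a rough planning band rather than a date.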
Self-Correction Multiplier
Each improvement in a model's ability to notice and correct its own mistakes roughly doubles task horizon length. Factor this into capability forecasts.
Integration Strategy
Avoid the Steam Engine Mistake
Don't just replace existing processes with AI equivalents; redesign entire systems around AI capabilities. (Electricity analogy: factories saw real productivity gains only after they were redesigned around electric motors, not when steam engines were simply swapped for electric ones.)
Accelerate Adoption
Use AI to integrate AI into products and businesses. The bottleneck is adoption speed, not capability. When facing integration challenges:
- Have AI analyze your current workflow
- Identify substitution points and redesign opportunities
- Prototype with AI assistance
- Iterate rapidly
Jevons Paradox Awareness
Expect cheaper, more efficient AI to drive more total usage, not lower total spend. Plan for:
- More AI usage as capabilities improve
- New use cases emerging from better performance
- Expanding scope rather than shrinking budgets
Diagnostic Framework
When Scaling Appears Broken
Before concluding a capability limit exists:
- Verify training/prompting methodology
- Check for data quality issues
- Test with alternative approaches
- Compare against scaling law predictions
Default assumption: Implementation issues, not fundamental limits.
Evaluating Model Improvements
Compare new models against:
- Expected scaling law trajectory
- Task horizon benchmarks
- Cross-domain performance consistency
Deviations from smooth improvement suggest training issues worth investigating.
Example Applications
Product Development Decision
Scenario: Building an AI code review tool
Assessment:
- Current models: Reliable for single-file reviews (~minutes)
- Target capability: Full PR reviews with context (~hours)
- Gap: ~2-3 doublings needed
Decision: Build now with single-file scope, architected for expansion.
Ship current capability, expand automatically as models improve.
Capability Targeting
Scenario: Choosing between deep analysis vs broad synthesis features
AI strength analysis:
- Deep focus on one hard problem: Human-competitive, not superior
- Synthesizing across 10 domains: Clear AI advantage
Decision: Prioritize cross-domain synthesis features.
Example: Research assistant that connects findings across biology,
psychology, and economics papers simultaneously.
Timeline Planning
Scenario: When will AI handle week-long research projects reliably?
Current state (2024): Hour-long tasks reliable
Doubling rate: ~7 months
Calculation:
- Hour → Day: 3 doublings = 21 months
- Day → Week: 3 doublings = 21 months
- Total: ~42 months (rough estimate)
Planning implication: Build infrastructure now, expect capability 2027-2028.
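The arrival date above can be reproduced with the same doubling arithmetic. The `add_months` helper is an illustrative utility, not from the source, and the start point of early 2024 is an assumption:

```python
DOUBLING_MONTHS = 7

# Rounded doubling counts from the estimate above
hour_to_day = 3   # 1h -> ~8h working day
day_to_week = 3   # 8h -> ~40h week, rounded up from ~2.3


def add_months(year: int, month: int, months: int) -> tuple[int, int]:
    """Advance a (year, month) pair by a number of months."""
    total = year * 12 + (month - 1) + months
    return total // 12, total % 12 + 1


total_months = (hour_to_day + day_to_week) * DOUBLING_MONTHS
print(total_months)                     # 42
print(add_months(2024, 1, total_months))  # (2027, 7) -> mid-2027
```

Rounding the doubling counts down instead of up pushes the same estimate toward 2028, which is why the planning band is stated as 2027-2028.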