query-optimizer

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Query Optimizer

SQL查询优化器

Optimize SQL queries for better performance through indexing, rewriting, and analysis.
通过索引、查询重写和分析来优化SQL查询性能。

Quick Start

快速入门

Use EXPLAIN to analyze queries, add indexes on WHERE/JOIN columns, avoid SELECT *, limit results.
使用EXPLAIN分析查询,在WHERE/JOIN关联的列上添加索引,避免使用SELECT *,限制返回结果数量。

Instructions

操作指南

Query Analysis with EXPLAIN

使用EXPLAIN进行查询分析

Basic EXPLAIN:
sql
EXPLAIN SELECT * FROM users WHERE email = 'user@example.com';
EXPLAIN ANALYZE (actual execution):
sql
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
Key metrics to check:
  • Seq Scan (bad) vs Index Scan (good)
  • Rows: Estimated vs actual
  • Cost: Lower is better
  • Execution time
基础EXPLAIN用法:
sql
EXPLAIN SELECT * FROM users WHERE email = 'user@example.com';
EXPLAIN ANALYZE(实际执行分析):
sql
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
需要关注的关键指标:
  • 全表扫描(性能差)vs 索引扫描(性能优)
  • 行数:预估 vs 实际
  • 成本:数值越低越好
  • 执行时间

Common Performance Issues

常见性能问题

1. Missing Indexes
Problem:
sql
-- Seq Scan on users (cost=0.00..1234.56)
SELECT * FROM users WHERE email = 'user@example.com';
Solution:
sql
CREATE INDEX idx_users_email ON users(email);
-- Now: Index Scan using idx_users_email
**2. SELECT ***
Problem:
sql
SELECT * FROM posts;  -- Fetches all columns
Solution:
sql
SELECT id, title, created_at FROM posts;  -- Only needed columns
3. N+1 Queries
Problem:
sql
-- Fetches posts
SELECT * FROM posts;
-- Then for each post:
SELECT * FROM users WHERE id = ?;
Solution:
sql
-- Single query with JOIN
SELECT posts.*, users.name 
FROM posts 
JOIN users ON posts.user_id = users.id;
4. No LIMIT
Problem:
sql
SELECT * FROM posts ORDER BY created_at DESC;  -- Returns all rows
Solution:
sql
SELECT * FROM posts ORDER BY created_at DESC LIMIT 20;
1. 缺少索引
问题:
sql
-- Seq Scan on users (cost=0.00..1234.56)
SELECT * FROM users WHERE email = 'user@example.com';
解决方案:
sql
CREATE INDEX idx_users_email ON users(email);
-- Now: Index Scan using idx_users_email
**2. 使用SELECT ***
问题:
sql
SELECT * FROM posts;  -- Fetches all columns
解决方案:
sql
SELECT id, title, created_at FROM posts;  -- Only needed columns
3. N+1查询问题
问题:
sql
-- Fetches posts
SELECT * FROM posts;
-- Then for each post:
SELECT * FROM users WHERE id = ?;
解决方案:
sql
-- Single query with JOIN
SELECT posts.*, users.name 
FROM posts 
JOIN users ON posts.user_id = users.id;
4. 未限制返回结果
问题:
sql
SELECT * FROM posts ORDER BY created_at DESC;  -- Returns all rows
解决方案:
sql
SELECT * FROM posts ORDER BY created_at DESC LIMIT 20;

Indexing Strategies

索引策略

Single column index:
sql
CREATE INDEX idx_users_email ON users(email);
Composite index (order matters):
sql
-- For: WHERE user_id = ? AND created_at > ?
CREATE INDEX idx_posts_user_created ON posts(user_id, created_at);
Covering index (includes all needed columns):
sql
-- For: SELECT id, title FROM posts WHERE user_id = ?
CREATE INDEX idx_posts_user_id_title ON posts(user_id) INCLUDE (title);
Partial index (filtered):
sql
CREATE INDEX idx_active_users ON users(email) WHERE is_active = true;
Index on expressions:
sql
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
-- For: WHERE LOWER(email) = 'user@example.com'
单列索引:
sql
CREATE INDEX idx_users_email ON users(email);
复合索引(顺序很重要):
sql
-- For: WHERE user_id = ? AND created_at > ?
CREATE INDEX idx_posts_user_created ON posts(user_id, created_at);
覆盖索引(包含所有需要的列):
sql
-- For: SELECT id, title FROM posts WHERE user_id = ?
CREATE INDEX idx_posts_user_id_title ON posts(user_id) INCLUDE (title);
部分索引(带过滤条件):
sql
CREATE INDEX idx_active_users ON users(email) WHERE is_active = true;
表达式索引:
sql
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
-- For: WHERE LOWER(email) = 'user@example.com'

Query Rewriting

查询重写

Use EXISTS instead of IN for large sets:
sql
-- Slow
SELECT * FROM users WHERE id IN (SELECT user_id FROM posts);

-- Faster
SELECT * FROM users u WHERE EXISTS (
    SELECT 1 FROM posts p WHERE p.user_id = u.id
);
Use JOIN instead of subquery:
sql
-- Slow
SELECT * FROM posts WHERE user_id IN (
    SELECT id FROM users WHERE is_active = true
);

-- Faster
SELECT p.* FROM posts p
JOIN users u ON p.user_id = u.id
WHERE u.is_active = true;
Avoid functions on indexed columns:
sql
-- Bad: Can't use index
SELECT * FROM users WHERE YEAR(created_at) = 2024;

-- Good: Can use index
SELECT * FROM users 
WHERE created_at >= '2024-01-01' 
AND created_at < '2025-01-01';
Use UNION ALL instead of UNION:
sql
-- Slow: Removes duplicates
SELECT id FROM posts UNION SELECT id FROM drafts;

-- Fast: No duplicate removal
SELECT id FROM posts UNION ALL SELECT id FROM drafts;
针对大数据集,用EXISTS替代IN:
sql
-- Slow
SELECT * FROM users WHERE id IN (SELECT user_id FROM posts);

-- Faster
SELECT * FROM users u WHERE EXISTS (
    SELECT 1 FROM posts p WHERE p.user_id = u.id
);
用JOIN替代子查询:
sql
-- Slow
SELECT * FROM posts WHERE user_id IN (
    SELECT id FROM users WHERE is_active = true
);

-- Faster
SELECT p.* FROM posts p
JOIN users u ON p.user_id = u.id
WHERE u.is_active = true;
避免在索引列上使用函数:
sql
-- Bad: Can't use index
SELECT * FROM users WHERE YEAR(created_at) = 2024;

-- Good: Can use index
SELECT * FROM users 
WHERE created_at >= '2024-01-01' 
AND created_at < '2025-01-01';
用UNION ALL替代UNION:
sql
-- Slow: Removes duplicates
SELECT id FROM posts UNION SELECT id FROM drafts;

-- Fast: No duplicate removal
SELECT id FROM posts UNION ALL SELECT id FROM drafts;

JOIN Optimization

JOIN查询优化

Order matters - filter early:
sql
-- Bad: Large intermediate result
SELECT * FROM posts p
JOIN users u ON p.user_id = u.id
WHERE p.created_at > '2024-01-01';

-- Good: Filter first
SELECT * FROM posts p
WHERE p.created_at > '2024-01-01'
JOIN users u ON p.user_id = u.id;
Use appropriate JOIN type:
sql
-- INNER JOIN: Only matching rows
SELECT * FROM posts p
INNER JOIN users u ON p.user_id = u.id;

-- LEFT JOIN: All posts, even without user
SELECT * FROM posts p
LEFT JOIN users u ON p.user_id = u.id;
Index JOIN columns:
sql
CREATE INDEX idx_posts_user_id ON posts(user_id);
CREATE INDEX idx_users_id ON users(id);  -- Usually PK already indexed
顺序很重要 - 提前过滤数据:
sql
-- Bad: Large intermediate result
SELECT * FROM posts p
JOIN users u ON p.user_id = u.id
WHERE p.created_at > '2024-01-01';

-- Good: Filter first
SELECT * FROM posts p
WHERE p.created_at > '2024-01-01'
JOIN users u ON p.user_id = u.id;
选择合适的JOIN类型:
sql
-- INNER JOIN: Only matching rows
SELECT * FROM posts p
INNER JOIN users u ON p.user_id = u.id;

-- LEFT JOIN: All posts, even without user
SELECT * FROM posts p
LEFT JOIN users u ON p.user_id = u.id;
为JOIN关联列添加索引:
sql
CREATE INDEX idx_posts_user_id ON posts(user_id);
CREATE INDEX idx_users_id ON users(id);  -- Usually PK already indexed

Pagination Optimization

分页查询优化

Offset pagination (slow for large offsets):
sql
-- Slow for page 1000
SELECT * FROM posts 
ORDER BY created_at DESC 
LIMIT 20 OFFSET 20000;
Cursor pagination (faster):
sql
-- First page
SELECT * FROM posts 
ORDER BY created_at DESC, id DESC 
LIMIT 20;

-- Next page (using last created_at and id)
SELECT * FROM posts 
WHERE (created_at, id) < ('2024-01-01 12:00:00', 12345)
ORDER BY created_at DESC, id DESC 
LIMIT 20;
偏移量分页(大偏移量时速度慢):
sql
-- Slow for page 1000
SELECT * FROM posts 
ORDER BY created_at DESC 
LIMIT 20 OFFSET 20000;
游标分页(速度更快):
sql
-- First page
SELECT * FROM posts 
ORDER BY created_at DESC, id DESC 
LIMIT 20;

-- Next page (using last created_at and id)
SELECT * FROM posts 
WHERE (created_at, id) < ('2024-01-01 12:00:00', 12345)
ORDER BY created_at DESC, id DESC 
LIMIT 20;

Aggregation Optimization

聚合查询优化

Use indexes for GROUP BY:
sql
CREATE INDEX idx_posts_user_id ON posts(user_id);

SELECT user_id, COUNT(*) 
FROM posts 
GROUP BY user_id;
Filter before aggregating:
sql
-- Good
SELECT user_id, COUNT(*) 
FROM posts 
WHERE created_at > '2024-01-01'
GROUP BY user_id;
Use HAVING for aggregate filters:
sql
SELECT user_id, COUNT(*) as post_count
FROM posts 
GROUP BY user_id
HAVING COUNT(*) > 10;
为GROUP BY添加索引:
sql
CREATE INDEX idx_posts_user_id ON posts(user_id);

SELECT user_id, COUNT(*) 
FROM posts 
GROUP BY user_id;
聚合前先过滤数据:
sql
-- Good
SELECT user_id, COUNT(*) 
FROM posts 
WHERE created_at > '2024-01-01'
GROUP BY user_id;
用HAVING过滤聚合结果:
sql
SELECT user_id, COUNT(*) as post_count
FROM posts 
GROUP BY user_id
HAVING COUNT(*) > 10;

Subquery Optimization

子查询优化

Correlated subqueries (slow):
sql
-- Bad: Runs subquery for each row
SELECT * FROM users u
WHERE (SELECT COUNT(*) FROM posts WHERE user_id = u.id) > 10;
JOIN instead:
sql
-- Good: Single query
SELECT u.* FROM users u
JOIN (
    SELECT user_id, COUNT(*) as post_count
    FROM posts
    GROUP BY user_id
    HAVING COUNT(*) > 10
) p ON u.id = p.user_id;
关联子查询(速度慢):
sql
-- Bad: Runs subquery for each row
SELECT * FROM users u
WHERE (SELECT COUNT(*) FROM posts WHERE user_id = u.id) > 10;
改用JOIN:
sql
-- Good: Single query
SELECT u.* FROM users u
JOIN (
    SELECT user_id, COUNT(*) as post_count
    FROM posts
    GROUP BY user_id
    HAVING COUNT(*) > 10
) p ON u.id = p.user_id;

Caching Strategies

缓存策略

Materialized views:
sql
CREATE MATERIALIZED VIEW user_post_counts AS
SELECT user_id, COUNT(*) as post_count
FROM posts
GROUP BY user_id;

-- Refresh periodically
REFRESH MATERIALIZED VIEW user_post_counts;
Query result caching (application level):
python
undefined
物化视图:
sql
CREATE MATERIALIZED VIEW user_post_counts AS
SELECT user_id, COUNT(*) as post_count
FROM posts
GROUP BY user_id;

-- Refresh periodically
REFRESH MATERIALIZED VIEW user_post_counts;
查询结果缓存(应用层):
python
undefined

Cache expensive queries

Cache expensive queries

@cache(ttl=300) def get_popular_posts(): return db.query("SELECT * FROM posts ORDER BY views DESC LIMIT 10")
undefined
@cache(ttl=300) def get_popular_posts(): return db.query("SELECT * FROM posts ORDER BY views DESC LIMIT 10")
undefined

Common Patterns

常见模式

Full-text Search

全文搜索

PostgreSQL:
sql
-- Add tsvector column
ALTER TABLE posts ADD COLUMN search_vector tsvector;

-- Update with trigger
CREATE INDEX idx_posts_search ON posts USING GIN(search_vector);

-- Search
SELECT * FROM posts 
WHERE search_vector @@ to_tsquery('postgresql & optimization');
Use dedicated search engine for complex needs:
  • Elasticsearch
  • Algolia
  • Meilisearch
PostgreSQL:
sql
-- Add tsvector column
ALTER TABLE posts ADD COLUMN search_vector tsvector;

-- Update with trigger
CREATE INDEX idx_posts_search ON posts USING GIN(search_vector);

-- Search
SELECT * FROM posts 
WHERE search_vector @@ to_tsquery('postgresql & optimization');
复杂场景使用专用搜索引擎:
  • Elasticsearch
  • Algolia
  • Meilisearch

Batch Operations

批量操作

Bulk insert:
sql
-- Bad: Multiple inserts
INSERT INTO users (name) VALUES ('User 1');
INSERT INTO users (name) VALUES ('User 2');

-- Good: Single insert
INSERT INTO users (name) VALUES 
('User 1'),
('User 2'),
('User 3');
Bulk update:
sql
-- Use CASE for conditional updates
UPDATE posts 
SET status = CASE 
    WHEN views > 1000 THEN 'popular'
    WHEN views > 100 THEN 'normal'
    ELSE 'new'
END;
批量插入:
sql
-- Bad: Multiple inserts
INSERT INTO users (name) VALUES ('User 1');
INSERT INTO users (name) VALUES ('User 2');

-- Good: Single insert
INSERT INTO users (name) VALUES 
('User 1'),
('User 2'),
('User 3');
批量更新:
sql
-- Use CASE for conditional updates
UPDATE posts 
SET status = CASE 
    WHEN views > 1000 THEN 'popular'
    WHEN views > 100 THEN 'normal'
    ELSE 'new'
END;

Connection Pooling

连接池

python
undefined
python
undefined

Use connection pool

Use connection pool

from sqlalchemy import create_engine
engine = create_engine( 'postgresql://user:pass@localhost/db', pool_size=20, max_overflow=10 )
undefined
from sqlalchemy import create_engine
engine = create_engine( 'postgresql://user:pass@localhost/db', pool_size=20, max_overflow=10 )
undefined

Performance Monitoring

性能监控

Check slow queries:
sql
-- PostgreSQL: Enable slow query log
ALTER DATABASE mydb SET log_min_duration_statement = 1000;  -- 1 second

-- View pg_stat_statements
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
Check index usage:
sql
SELECT 
    schemaname,
    tablename,
    indexname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan = 0  -- Unused indexes
ORDER BY pg_relation_size(indexrelid) DESC;
Check table statistics:
sql
SELECT 
    schemaname,
    tablename,
    seq_scan,
    seq_tup_read,
    idx_scan,
    idx_tup_fetch
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;
查看慢查询:
sql
-- PostgreSQL: Enable slow query log
ALTER DATABASE mydb SET log_min_duration_statement = 1000;  -- 1 second

-- View pg_stat_statements
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
查看索引使用情况:
sql
SELECT 
    schemaname,
    tablename,
    indexname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan = 0  -- Unused indexes
ORDER BY pg_relation_size(indexrelid) DESC;
查看表统计信息:
sql
SELECT 
    schemaname,
    tablename,
    seq_scan,
    seq_tup_read,
    idx_scan,
    idx_tup_fetch
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;

Best Practices

最佳实践

Always:
  • Use EXPLAIN ANALYZE for slow queries
  • Index foreign keys
  • Index WHERE/JOIN columns
  • Limit result sets
  • Use prepared statements
Avoid:
  • SELECT *
  • Functions on indexed columns in WHERE
  • Correlated subqueries
  • Large OFFSET values
  • Over-indexing
Monitor:
  • Slow query log
  • Index usage
  • Table statistics
  • Connection pool
务必遵循:
  • 对慢查询使用EXPLAIN ANALYZE分析
  • 为外键添加索引
  • 为WHERE/JOIN关联列添加索引
  • 限制结果集大小
  • 使用预编译语句
避免操作:
  • 使用SELECT *
  • 在WHERE条件中的索引列上使用函数
  • 使用关联子查询
  • 使用大偏移量OFFSET
  • 过度索引
监控内容:
  • 慢查询日志
  • 索引使用情况
  • 表统计信息
  • 连接池状态

Troubleshooting

问题排查

Query still slow after indexing:
  • Check if index is being used (EXPLAIN)
  • Verify index column order for composite indexes
  • Consider covering index
  • Check for stale statistics (ANALYZE table)
Too many indexes:
  • Remove unused indexes
  • Combine similar indexes
  • Monitor write performance
High memory usage:
  • Reduce work_mem
  • Optimize sort operations
  • Use streaming instead of loading all data
添加索引后查询仍然缓慢:
  • 检查索引是否被使用(通过EXPLAIN)
  • 验证复合索引的列顺序
  • 考虑使用覆盖索引
  • 检查表统计信息是否过时(执行ANALYZE table)
索引过多:
  • 删除未使用的索引
  • 合并相似索引
  • 监控写入性能
内存占用过高:
  • 降低work_mem参数
  • 优化排序操作
  • 使用流处理而非加载全部数据