Configure, tune, design schemas, migrate, back up, and review database engines - from single-node dev setups to PCI-compliant production clusters. The goal is correct, performant, durable databases that survive failures, pass audits, and don't wake you up at 3am.
Workload type - OLTP (many small transactions) vs OLAP (few large queries) vs mixed. Default: OLTP.
Data volume - row counts, table sizes, growth rate. Default: medium (1-100GB). If unknown, optimize for growth.
Compliance - PCI-DSS CDE? HIPAA? What data classification? Default: no compliance scope (but still apply security baseline).
HA requirements - RTO/RPO targets, multi-AZ, read replicas. Default: single-node with PITR.
Existing infrastructure - what's already running, what ORMs/drivers are in use. Inspect the codebase if accessible.
Step 3: Build
Follow the domain-specific section below. Always apply the production checklist and AI self-check before finishing.
Common operations - quick-start patterns:
Query optimization (the most frequent request):
1. Get the plan: `EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...;` (PG), `EXPLAIN FORMAT=TREE ...;` (MySQL 8.0+), `db.collection.explain('executionStats').find(...)` (MongoDB).
2. Look for: `Seq Scan` on large tables (PG) / `Full Table Scan` (MySQL) / `COLLSCAN` (MongoDB), `Nested Loop` with high row estimates, `Sort` spilling to disk (`Sort Method: external merge`), and `Rows Removed by Filter` >> rows returned.
2a. Before assuming an index problem, verify semantic correctness: check JOIN conditions for column mismatches, confirm WHERE clause predicates don't unintentionally filter nulls, and verify LEFT JOIN semantics aren't silently converted to INNER JOIN by outer-table filters.
3. Check index usage: `SELECT schemaname, relname, idx_scan, seq_scan FROM pg_stat_user_tables WHERE seq_scan > 100 ORDER BY seq_scan DESC;` (PG; add a `schemaname = 'public'` filter, adjusted for non-public schemas) or `SELECT * FROM sys.schema_unused_indexes;` (MySQL performance_schema).
4. If a missing index is the fix, create it with `CONCURRENTLY` (PG): `CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);` or with `ALGORITHM=INPLACE, LOCK=NONE` (MySQL).
5. Re-run `EXPLAIN ANALYZE` to confirm the planner uses the new index and that estimated vs actual rows are close.
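A worked sequence for steps 1-5 above (table and index names are hypothetical; expected plan nodes are the typical outcome, not guaranteed):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;   -- expect: Seq Scan on orders

CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);
ANALYZE orders;  -- refresh statistics so estimated rows track actual rows

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;   -- expect: Index Scan or Bitmap Heap Scan
```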
Lock contention / deadlock diagnosis (second most common "it's stuck" scenario):
PG: `SELECT pid, age(xact_start), query, wait_event_type, wait_event FROM pg_stat_activity WHERE state != 'idle' AND wait_event IS NOT NULL ORDER BY age(xact_start) DESC;`
PG blocked queries: `SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query, blocking.pid AS blocking_pid, blocking.query AS blocking_query FROM pg_stat_activity blocked JOIN pg_locks bl ON bl.pid = blocked.pid JOIN pg_locks kl ON kl.locktype = bl.locktype AND kl.database IS NOT DISTINCT FROM bl.database AND kl.relation IS NOT DISTINCT FROM bl.relation AND kl.page IS NOT DISTINCT FROM bl.page AND kl.tuple IS NOT DISTINCT FROM bl.tuple AND kl.transactionid IS NOT DISTINCT FROM bl.transactionid AND kl.pid != bl.pid JOIN pg_stat_activity blocking ON blocking.pid = kl.pid WHERE NOT bl.granted AND kl.granted;`
MySQL: `SHOW ENGINE INNODB STATUS\G` - look for the `LATEST DETECTED DEADLOCK` section. Also: `SELECT * FROM performance_schema.data_lock_waits;` (MySQL 8.0+).
MongoDB: `db.currentOp({"waitingForLock": true})` and check the `mongod` log for `LockTimeout` entries.
Fix: kill the blocking query if it's stuck (`SELECT pg_terminate_backend(PID);` / `KILL CONNECTION thread_id;`), then investigate why the lock was held (long transactions, missing indexes on UPDATE/DELETE WHERE clauses, lock ordering bugs in application code).
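One common application-side fix for lock-ordering deadlocks is to acquire row locks in a deterministic order. A minimal sketch, assuming a psycopg-style cursor and an illustrative `accounts` table (neither is from this document):

```python
def lock_order(ids):
    """Deterministic order in which row locks should be taken."""
    return sorted(set(ids))

def lock_rows(cursor, ids):
    # Locking rows in sorted id order means two transactions touching
    # overlapping id sets block one another instead of deadlocking.
    ordered = lock_order(ids)
    cursor.execute(
        "SELECT id FROM accounts WHERE id = ANY(%s) ORDER BY id FOR UPDATE",
        (ordered,),
    )
    return ordered
```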
Connection pooler sizing (third most common):
Determine the backend budget: `max_connections` minus `superuser_reserved_connections` minus replication slots = available.
PgBouncer `default_pool_size` per user/db pair: start at available / number_of_app_instances, rounded down.
Set `max_client_conn` to the total connections your app fleet will open (all instances combined).
Use `transaction` pool mode for web apps (stateless requests). Use `session` mode only if the app depends on session state such as prepared statements and you are not on PgBouncer 1.21+, which can track protocol-level prepared statements in transaction mode.
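The sizing arithmetic above, as a sketch (the numbers in the example are illustrative, not recommendations):

```python
def pool_sizing(max_connections, superuser_reserved, replication_slots,
                app_instances):
    # Backend budget: connections the pooler may actually open to PostgreSQL.
    available = max_connections - superuser_reserved - replication_slots
    # default_pool_size per user/db pair, split across app instances,
    # rounded down.
    return available, available // app_instances

# e.g. pool_sizing(200, 3, 2, 10) -> (195, 19)
```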
Migrate sequences: logical replication does not replicate sequence values - copy them manually.
Test the application against the new version (read traffic, connection pooler split). Monitor during the split-test: query latency p50/p95/p99 (compare old vs new - regression means planner stats or config drift), error rates by query type (the new version may have stricter behavior), connection pool saturation (`SHOW POOLS` in PgBouncer - watch `cl_waiting`), replication lag trend (should stay flat or decrease, never grow during a read-only test), and memory/CPU on the new instance. Run for at least one full traffic cycle (24h if your workload is diurnal). Abort and route back to the old primary if the error rate exceeds baseline or p99 latency degrades >20%.
Cutover: stop writes to the old primary, verify lag = 0, point the connection pooler to the new primary.
Drop the subscription and decommission the old primary.
Rollback procedure (if replication stalls or validation fails):
Lag not reaching zero: check `pg_stat_subscription` for the `last_msg_send_time` vs `last_msg_receipt_time` delta. Common causes: long-running transactions on the source blocking WAL send, tables missing primary keys (forces full-row comparison), or network throughput limits. If lag is stuck, check `pg_replication_slots` on the source for `active = false` - an inactive slot means the subscription dropped. Recreate the subscription; do not proceed with cutover.
Application validation fails on the new version: route all traffic back to the old primary (update the pooler config). The old primary never stopped accepting writes, so no data is lost. Drop the subscription on the new instance: `DROP SUBSCRIPTION upgrade_sub;` - this also drops the replication slot on the source. Investigate, fix, and restart from step 3.
Post-cutover rollback (writes already went to the new primary): this is the hard case. Options: (a) set up reverse logical replication from new -> old before decommissioning the old primary (plan this before cutover if RTO requires it), or (b) restore the old primary from backup + WAL and accept data loss for the cutover window. Option (a) requires the old primary to still be running. Decision point: if you need rollback after cutover, set up reverse replication in step 6 before routing write traffic.
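Option (a) can be sketched as follows; publication, subscription, host, database, and user names are all placeholders:

```sql
-- On the NEW primary (now the publisher):
CREATE PUBLICATION rollback_pub FOR ALL TABLES;

-- On the OLD primary (now the subscriber); connection string is illustrative:
CREATE SUBSCRIPTION rollback_sub
  CONNECTION 'host=new-primary dbname=app user=replicator'
  PUBLICATION rollback_pub
  WITH (copy_data = false);  -- the old primary already holds pre-cutover data
```

`copy_data = false` means only writes made after the subscription starts flow back, which is exactly why this must be created at cutover time, before write traffic moves.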
Key pitfalls (check all of these before starting):
Tables need primary keys or `REPLICA IDENTITY FULL` - check first: `SELECT c.relname FROM pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid WHERE n.nspname = 'public' AND c.relkind = 'r' AND NOT EXISTS (SELECT 1 FROM pg_constraint WHERE conrelid = c.oid AND contype = 'p');`
DDL is not replicated - schema changes during migration need manual sync on both sides.
Large objects (`lo`) are not replicated.
Sequence values drift - copy them manually after cutover, not before.
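One way to do the manual sequence copy, sketched here assuming PG 10+ (which has the `pg_sequences` view):

```sql
-- On the OLD primary, generate one setval() statement per sequence:
SELECT format('SELECT setval(%L, %s, true);',
              schemaname || '.' || sequencename, last_value)
FROM pg_sequences
WHERE last_value IS NOT NULL;  -- skip sequences that were never used
-- Then run the generated statements on the NEW primary.
```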
Read `references/migration-patterns.md` for cross-engine type mapping, ORM migration tooling, and detailed migration patterns.
Pooling, Backup, and Platform Choice
PostgreSQL pooling is usually non-negotiable; use PgBouncer unless a concrete reason says otherwise.
Backup discipline means restore testing, encryption, retention limits, and monitoring backup freshness.
Managed databases reduce toil but do not remove shared-responsibility or compliance review.
Self-hosted databases buy control at the cost of HA, patching, and operational burden.
Production Checklist
All Engines
Authentication uses modern hashing (SCRAM-SHA-256, caching_sha2_password, certificate auth)
TLS enforced for all connections (not just "available")
No default passwords, no `trust` auth, no passwordless access
Connection pool in front of the database (PgBouncer, ProxySQL, or application-side)
`max_connections` sized for actual workload, not default
Backup strategy implemented, tested, and monitored
Restore procedure documented and tested (at least monthly)
Monitoring in place (connections, query performance, replication lag, disk usage)
Slow query logging enabled with appropriate threshold
Dead/unused indexes identified and removed
Character encoding correct (`utf8mb4` for MySQL, `UTF8` for PG)
PostgreSQL-Specific
`pg_hba.conf`: `hostssl` only, `scram-sha-256` only, CIDR-restricted
`shared_buffers` = 25% RAM, `effective_cache_size` = 75% RAM
`work_mem` sized for concurrency (64MB is high for OLTP with many connections - it applies per sort/hash operation, not per connection)
`random_page_cost = 1.1` for SSD storage
`statement_timeout` set per-role (not globally - migrations need longer)
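As an illustrative starting point only (a hypothetical 16GB dedicated host; tune for the actual workload), the settings above might look like:

```
# postgresql.conf - illustrative values, not recommendations
shared_buffers = 4GB            # ~25% of RAM
effective_cache_size = 12GB     # ~75% of RAM
work_mem = 16MB                 # per sort/hash operation, not per connection
random_page_cost = 1.1          # SSD storage
```

Per-role timeouts (role name assumed): `ALTER ROLE app_rw SET statement_timeout = '30s';`, with a longer value set on the migration role.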
PCI-DSS-Specific
Access control: separate roles, no shared accounts, MFA for CDE access (Req 7, 8)
Data masking: PAN display limited to last 4 digits (Req 3.3)
No cardholder data in non-prod environments (Req 6.5.4)
Quarterly access review documented (Req 7.2.5)
Vulnerability scanning includes database engine, not just containers (Req 11.3)
Reference Files
references/config-templates.md
- engine configuration templates
references/backup-patterns.md
- backup, restore, and PITR patterns
references/migration-patterns.md
- cross-engine migration patterns and type-mapping guidance
Rules
These are non-negotiable. Violating any of these is a bug.
Backups without restore tests are not backups. Test restores monthly. Document the procedure.
No `trust` auth in `pg_hba.conf`. Not in dev, not in Docker, not anywhere. Use `scram-sha-256`.
MySQL strict mode ON. `STRICT_TRANS_TABLES` prevents silent data truncation. Without it, MySQL silently corrupts your data.
TLS enforced, not just available. `require_secure_transport`, `hostssl`, `requireTLS`. Connections without TLS are a finding.
Connection pooler for PostgreSQL. PG's process-per-connection model doesn't scale without one. Use PgBouncer.
Indexes on foreign keys in PostgreSQL. PG doesn't auto-create them. Missing FK indexes cause sequential scans on JOINs and cascading DELETEs.
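A sketch for finding single-column foreign keys with no index whose leading column covers them (multi-column FKs need a fuller comparison):

```sql
SELECT c.conrelid::regclass AS table_name, c.conname AS fk_name
FROM pg_constraint c
WHERE c.contype = 'f'
  AND NOT EXISTS (
    SELECT 1
    FROM pg_index i
    WHERE i.indrelid = c.conrelid
      AND i.indkey[0] = c.conkey[1]  -- index leading column matches the FK column
  );
```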
`utf8mb4` for MySQL, always. `utf8` is a lie - it's 3-byte only and can't store emoji or many CJK characters.
`timestamptz` for PostgreSQL, always. Bare `timestamp` stores no timezone and breaks when the server timezone changes.
Parameterized queries everywhere. String concatenation for SQL is a bug, not a shortcut. Doubly true for AI-generated code.
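A minimal sketch of the discipline (psycopg-style `%s` placeholders assumed; the point is that values travel separately from the SQL text):

```python
def find_user_query(name):
    # GOOD: the SQL text is constant; the value is passed as a parameter.
    return "SELECT * FROM users WHERE name = %s", (name,)

# BAD (never do this): "... WHERE name = '" + name + "'" - injectable,
# and it breaks on any name containing a quote.

sql, params = find_user_query("O'Brien")
assert "O'Brien" not in sql   # the value never touches the SQL string
assert params == ("O'Brien",)
```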
Disk-level encryption is insufficient for PCI-DSS 4.0. Req 3.5.1.2 requires TDE, column-level, or application-layer encryption.
Patch MongoBleed (CVE-2025-14847). Self-hosted MongoDB < 8.0.17 / 7.0.28 / 6.0.27 is actively exploitable with no authentication required.
Patch MongoDB compression DoS (CVE-2026-25611). Pre-auth DoS via crafted OP_COMPRESSED messages. Default config affected (compression enabled since 3.6). Fixed in 8.0.18+ / 8.2.4+ / 7.0.29+.
Patch PgBouncer (CVE-2025-12819). PgBouncer < 1.25.1 can allow unauthenticated SQL execution when `track_extra_parameters` includes `search_path` AND `auth_user` is set (both non-default). Upgrade regardless - the fix is low-risk.
Chunk bulk inserts. Never build a single `INSERT ... VALUES` with an unbounded row list. Compute batch size from the lowest host-parameter ceiling across supported backends (`floor(limit / columns_per_row)`). Wrap chunks in one transaction when atomicity matters.
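The batch-size computation and chunking, sketched (the 999 ceiling in the example is SQLite's historical default limit; check each backend's actual ceiling):

```python
def batch_size(param_limit, columns_per_row):
    # Largest row count whose bind parameters fit under the ceiling.
    return max(1, param_limit // columns_per_row)

def chunks(rows, size):
    # Yield successive slices of at most `size` rows.
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

# e.g. batch_size(999, 4) -> 249 rows per INSERT ... VALUES statement
```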
Run the AI self-check. Every generated migration, schema, or config gets verified against the checklist above before returning.
Related Skills
code-review - has a
databases.md
reference for application-level database bug patterns (transaction misuse, NULL handling, ORM N+1, type coercion). This skill covers engine configuration and operations; code-review covers how the application uses the database.
security-audit - for SQL injection detection and credential scanning in application code
kubernetes - for deploying databases on K8s (StatefulSets, operators, PVCs)
terraform - for provisioning managed databases (RDS, Cloud SQL, Atlas)
docker - for database containers in Docker Compose
ansible - for database server configuration management