Use when the user needs CI/CD pipelines, Docker configuration, Kubernetes deployment, infrastructure-as-code, monitoring, or zero-downtime deployment strategies. Triggers: user says "devops", "docker", "kubernetes", "CI/CD", "infrastructure", "monitoring", "deploy to production", "container", "terraform", "observability".

```bash
npx skill4agent add pixel-process-ug/superkit-agents senior-devops
```

| Scale | Topology | Orchestration | Recommended |
|---|---|---|---|
| Hobby / MVP | Single server | Docker Compose | Railway, Fly.io |
| Startup (< 100k users) | Small cluster | ECS, Cloud Run | AWS ECS, GCP Cloud Run |
| Growth (100k - 1M users) | Multi-AZ cluster | Kubernetes | EKS, GKE |
| Enterprise (1M+ users) | Multi-region | Kubernetes + service mesh | EKS/GKE + Istio |
| Compliance-heavy | Dedicated/private cloud | Kubernetes | Self-managed K8s |

```dockerfile
# 1. Use specific version tags (not :latest)
FROM node:20-alpine AS base

# 2. Set working directory
WORKDIR /app

# 3. Install dependencies in separate layer (cache optimization)
FROM base AS deps
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile --prod

FROM base AS build-deps
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile

# 4. Build in separate stage
FROM build-deps AS builder
COPY . .
RUN pnpm build

# 5. Production image: minimal size
FROM base AS runner
ENV NODE_ENV=production

# 6. Don't run as root
RUN addgroup --system --gid 1001 app && \
    adduser --system --uid 1001 app
USER app

# 7. Copy only what's needed (production deps from `deps`, build output from `builder`)
COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist

# 8. Health check
HEALTHCHECK CMD wget -qO- http://localhost:3000/health || exit 1

# 9. Expose port and set entrypoint
EXPOSE 3000
CMD ["node", "dist/server.js"]
```

| Rule | Why |
|---|---|
| Multi-stage builds | Minimize image size |
| `.dockerignore` | Exclude node_modules, .git, tests |
| Non-root user | Security hardening |
| Specific base image versions | Reproducible builds |
| Layer ordering (deps before src) | Cache efficiency |
| HEALTHCHECK instruction | Container health monitoring |
| No secrets in build args/layers | Prevent credential leaks |

```yaml
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
      target: runner
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/app
      - REDIS_URL=redis://cache:6379
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  db:
    image: postgres:16-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5

  cache:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:
```

```yaml
name: CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v3
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm lint
      - run: pnpm typecheck
      - run: pnpm test -- --coverage

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx audit-ci --moderate
      - uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs
          severity: HIGH,CRITICAL

  build-and-push:
    needs: [lint-and-test, security-scan]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    # GITHUB_TOKEN needs write access to packages to push to ghcr.io
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to production
        run: echo "Deploying ${{ github.sha }}"
```

```text
modules/
  vpc/
    main.tf, variables.tf, outputs.tf
  ecs/
    main.tf, variables.tf, outputs.tf
environments/
  staging/
    main.tf, terraform.tfvars
  production/
    main.tf, terraform.tfvars
```

| Rule | Why |
|---|---|
| Remote state backend (S3 + DynamoDB) | Shared state, locking |
| State locking | Prevent concurrent modifications |
| Environment-specific variable files | Separation of concerns |
| Module versioning | Reproducible shared infra |
| Run `terraform plan` in CI | Catch issues before apply |
| Drift detection on schedule | Detect manual changes |
| Tag all resources | Ownership, cost allocation |
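
The drift-detection rule above can be sketched as a scheduled job that runs `terraform plan -detailed-exitcode` and interprets the exit status (0 means no changes, 1 means an error, 2 means changes are pending). The Python wrapper below is a minimal sketch under assumptions: the function names and the default `environments/production` working directory are illustrative, and it assumes `terraform init` has already been run there.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in-sync", 1: "error", 2: "drift"}


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` status to a label."""
    return PLAN_STATUS.get(code, "error")


def detect_drift(workdir: str = "environments/production") -> str:
    """Run a plan in `workdir` (hypothetical path) and report drift.

    Assumes the directory is already initialised and credentials are
    available in the environment of the scheduled job.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

A scheduled CI job could call `detect_drift()` per environment and open a ticket whenever the result is `"drift"`.
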

| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | cpu_usage_percent | cpu_throttled | — |
| Memory | memory_usage_bytes | oom_kills | — |
| Disk | disk_usage_percent | io_wait | disk_errors |
| Network | bytes_total | queue_length | errors_total |
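
As a rough illustration of the utilization and saturation columns, the sketch below reads two signals using only the Python standard library. It is an assumption-laden stand-in rather than the metrics pipeline the table implies: `load_per_core` uses the 1-minute load average as a saturation proxy in place of `cpu_throttled`, and both function names are hypothetical.

```python
import os
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Utilization: share of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def load_per_core(load_1m: float, cores: int) -> float:
    """Saturation proxy: 1-minute load average divided by core count.

    Values that stay above 1.0 suggest runnable work is queuing for CPU.
    """
    return load_1m / max(cores, 1)


if __name__ == "__main__":
    load_1m, _, _ = os.getloadavg()  # available on Unix-like systems only
    print(f"disk_usage_percent={disk_usage_percent():.1f}")
    print(f"load_per_core={load_per_core(load_1m, os.cpu_count() or 1):.2f}")
```
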

```yaml
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        # Aggregate both sides so the ratio is 5xx requests over all requests;
        # without sum(), the division only matches series with identical labels.
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
```

| Practice | Why |
|---|---|
| Alert on symptoms, not causes | Reduces noise, focuses on impact |
| Every alert has a runbook link | Enables fast response |
| Tiered severity | critical=page, warning=ticket, info=log |
| Aggregate before alerting | Avoid flapping |
| Review and prune quarterly | Prevent alert fatigue |
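
To make the `for: 5m` clause of the `HighErrorRate` rule concrete, the sketch below simulates how a threshold plus a hold duration suppresses short spikes. It is a simplified model, not Prometheus's actual evaluation loop; the `Sample` type and `firing_since` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Sample:
    timestamp: float       # seconds since the start of the window
    errors_per_s: float    # rate of 5xx responses
    requests_per_s: float  # rate of all responses


def firing_since(
    samples: Iterable[Sample],
    threshold: float = 0.05,
    for_seconds: float = 300,
) -> Optional[float]:
    """Return the timestamp at which the alert would start firing, or None.

    The condition must hold at every consecutive evaluation for at least
    `for_seconds`; any evaluation at or below the threshold resets the timer.
    """
    pending_start: Optional[float] = None
    for s in samples:
        # Treat "no traffic" as a ratio of 0 in this simplified model.
        ratio = s.errors_per_s / s.requests_per_s if s.requests_per_s else 0.0
        if ratio > threshold:
            if pending_start is None:
                pending_start = s.timestamp
            if s.timestamp - pending_start >= for_seconds:
                return s.timestamp
        else:
            pending_start = None
    return None
```

With one evaluation per minute, a sustained 10% error rate fires at the five-minute mark, while a run interrupted by a single healthy evaluation starts the timer over.
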

| Strategy | How It Works | Risk | Rollback Speed |
|---|---|---|---|
| Rolling | Replace instances one at a time | Low | Medium |
| Blue-Green | Switch traffic between two environments | Low | Instant |
| Canary | Route small % to new version, gradually increase | Very Low | Instant |
| Feature Flags | Deploy code dark, enable via flag | Very Low | Instant |
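
The canary row can be sketched as a small control loop: shift traffic in steps, check the new version's error rate after each step, and roll back on the first bad reading. Everything here is an illustrative assumption, including the step percentages, the 5% error budget, and the `observe_error_rate` hook, which stands in for whatever metrics query a real rollout would use.

```python
from typing import Callable, List, Tuple


def run_canary(
    observe_error_rate: Callable[[int], float],
    steps: Tuple[int, ...] = (5, 25, 50, 100),
    max_error_rate: float = 0.05,
) -> Tuple[bool, List[int]]:
    """Walk traffic through `steps`; roll back to 0% on the first bad reading.

    `observe_error_rate(weight)` is a hypothetical hook returning the new
    version's error rate once it serves `weight` percent of traffic.
    Returns (promoted, history of traffic weights applied).
    """
    history: List[int] = []
    for weight in steps:
        history.append(weight)
        if observe_error_rate(weight) > max_error_rate:
            history.append(0)  # rollback: all traffic returns to the old version
            return False, history
    return True, history
```

For example, `run_canary(lambda w: 0.01)` promotes through every step, whereas a version that only degrades under load is caught at the step where its error rate first exceeds the budget.
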

```bash
kubectl rollout undo deployment/app
```

| Rule | Rationale |
|---|---|
| Migrations must be backward compatible | Old code + new schema must work |
| Never rename/drop columns in same deploy | Two-phase change required |
| Two-phase: add column -> deploy -> remove old | Zero-downtime schema evolution |
| Always test rollback of each migration | Ensure reversibility |
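
The two-phase rule can be illustrated with an in-memory SQLite database: the first deploy only adds and backfills a column, so code from the previous release keeps working against the new schema. This is a minimal sketch under assumptions; the `users` table, the `name` and `full_name` columns, and the reader functions are all hypothetical.

```python
import sqlite3


def old_reader(conn: sqlite3.Connection) -> list:
    """Code from the previous release: only knows the `name` column."""
    return [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]


def new_reader(conn: sqlite3.Connection) -> list:
    """Code from the new release: reads the `full_name` column."""
    return [row[0] for row in conn.execute("SELECT full_name FROM users ORDER BY id")]


def expand(conn: sqlite3.Connection) -> None:
    """Phase 1: additive change plus backfill, so old code keeps working."""
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
    conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("Ada",), ("Linus",)])
expand(conn)

# Both releases can now run side by side during the rollout. New code must
# keep writing `name` as well until the old release is fully retired.
# Phase 2 (dropping `name`) belongs in a later deploy, once nothing reads it.
```
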

| Anti-Pattern | Why It Is Wrong | What to Do Instead |
|---|---|---|
| Manual production deployments | No audit trail, error-prone | Automate via CI/CD |
| Shared or hardcoded secrets | Security breach risk | Use secrets manager |
| No rollback plan before deploying | Stuck if deploy fails | Document rollback before every deploy |
| Using `:latest` image tags | Non-reproducible | Pin specific version tags |
| Running containers as root | Security vulnerability | Use non-root user in Dockerfile |
| Alert fatigue from non-actionable alerts | Real issues get missed | Alert on symptoms, tune thresholds |
| Skipping staging environment | Bugs found in production | Always deploy to staging first |
| Snowflake servers with manual config | Cannot reproduce, cannot scale | Infrastructure as code |
| Monitoring without alerting | Nobody notices problems | Wire alerts to monitoring |

Context7 tools: `mcp__context7__resolve-library-id`, `mcp__context7__query-docs`. Libraries: `docker`, `kubernetes`, `terraform`.

| Skill | Integration |
|---|---|
| | Provides higher-level deploy pipeline orchestration |
| | Security scan stage in CI pipeline |
| | Infrastructure changes are planned like features |
| | Post-deploy verification gate |
| | Merge triggers deployment pipeline |
| | MCP servers need containerization and deployment |