Production Deployment
A complete guide to deploying ScoutQuest in production, covering high availability, security, and performance optimization.
Architecture Overview
A production ScoutQuest deployment typically includes:
🏗️ Infrastructure
- Load balancers
- Multiple ScoutQuest instances
- Database clustering
- Network isolation
🔐 Security
- TLS/SSL encryption
- Authentication & authorization
- Network policies
- Secret management
📊 Observability
- Metrics collection
- Centralized logging
- Distributed tracing
- Alerting systems
⚡ Performance
- Resource optimization
- Caching strategies
- Connection pooling
- Rate limiting
Kubernetes Deployment
Production Configuration
# scoutquest-production.yaml
apiVersion: v1
kind: Namespace
metadata:
name: scoutquest-system
labels:
name: scoutquest-system
---
apiVersion: v1
kind: ConfigMap
metadata:
name: scoutquest-config
namespace: scoutquest-system
data:
config.toml: |
[server]
host = "0.0.0.0"
port = 8080
[registry]
cleanup_interval = 30
health_check_interval = 10
max_retries = 3
[security]
enable_tls = true
cert_file = "/etc/certs/tls.crt"
key_file = "/etc/certs/tls.key"
require_auth = true
[database]
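# CloudNativePG exposes the primary of "postgres-cluster" through the "postgres-cluster-rw" Service
url = "postgresql://scoutquest:${POSTGRES_PASSWORD}@postgres-cluster-rw:5432/scoutquest"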
max_connections = 20
min_connections = 5
[observability]
metrics_enabled = true
tracing_enabled = true
jaeger_endpoint = "http://jaeger-collector:14268/api/traces"
[performance]
request_timeout = 30
max_concurrent_requests = 1000
enable_compression = true
---
apiVersion: v1
kind: Secret
metadata:
name: scoutquest-secrets
namespace: scoutquest-system
type: Opaque
data:
postgres-password:
jwt-secret:
api-key:
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: scoutquest-server
namespace: scoutquest-system
labels:
app: scoutquest-server
version: v1.0.0
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
maxSurge: 1
selector:
matchLabels:
app: scoutquest-server
template:
metadata:
labels:
app: scoutquest-server
version: v1.0.0
annotations:
prometheus.io/scrape: "true"
prometheus.io/scheme: "https"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: scoutquest-service-account
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 2000
containers:
- name: scoutquest
image: scoutquest/server:v1.0.0
imagePullPolicy: Always
ports:
- containerPort: 8080
name: http
protocol: TCP
env:
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: scoutquest-secrets
key: postgres-password
- name: JWT_SECRET
valueFrom:
secretKeyRef:
name: scoutquest-secrets
key: jwt-secret
- name: RUST_LOG
value: "info,scoutquest=debug"
volumeMounts:
- name: config
mountPath: /etc/scoutquest
readOnly: true
- name: tls-certs
mountPath: /etc/certs
readOnly: true
livenessProbe:
httpGet:
path: /health
port: http
scheme: HTTPS
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: http
scheme: HTTPS
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumes:
- name: config
configMap:
name: scoutquest-config
- name: tls-certs
secret:
secretName: scoutquest-tls-cert
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- scoutquest-server
topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
name: scoutquest-service
namespace: scoutquest-system
labels:
app: scoutquest-server
spec:
type: ClusterIP
ports:
- port: 8080
targetPort: http
protocol: TCP
name: https
selector:
app: scoutquest-server
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: scoutquest-ingress
namespace: scoutquest-system
annotations:
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
cert-manager.io/cluster-issuer: "letsencrypt-prod"
# 100 requests per minute per client IP
nginx.ingress.kubernetes.io/limit-rpm: "100"
spec:
# Replaces the deprecated kubernetes.io/ingress.class annotation
ingressClassName: nginx
tls:
- hosts:
- scoutquest.example.com
secretName: scoutquest-tls-cert
rules:
- host: scoutquest.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: scoutquest-service
port:
number: 8080
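The `data:` entries of the `scoutquest-secrets` Secret above are left empty. Kubernetes expects base64-encoded values in `data:`, and a common mistake is to encode a trailing newline along with the value. The following is a minimal sketch of one way to produce those values; the function name `encode_secret_value` and the sample value are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: base64-encode a value for a Secret's `data:` field.
# printf '%s' avoids the trailing newline that `echo` would append, which
# would otherwise become part of the stored secret.
encode_secret_value() {
  # tr removes the line wrapping some base64 implementations add to long output
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Example (placeholder value, not a recommended password):
#   encode_secret_value 'hunter2'
```

If encoding by hand is undesirable, the Secret's `stringData:` field accepts plain-text values and lets the API server do the encoding.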
RBAC Configuration
# rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: scoutquest-service-account
namespace: scoutquest-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: scoutquest-cluster-role
rules:
- apiGroups: [""]
resources: ["services", "endpoints", "pods"]
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources: ["deployments", "replicasets"]
verbs: ["get", "list", "watch"]
- apiGroups: ["networking.k8s.io"]
resources: ["ingresses"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: scoutquest-cluster-role-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: scoutquest-cluster-role
subjects:
- kind: ServiceAccount
name: scoutquest-service-account
namespace: scoutquest-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: scoutquest-role
namespace: scoutquest-system
rules:
- apiGroups: [""]
resources: ["secrets", "configmaps"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: scoutquest-role-binding
namespace: scoutquest-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: scoutquest-role
subjects:
- kind: ServiceAccount
name: scoutquest-service-account
namespace: scoutquest-system
PostgreSQL High Availability
# postgres-ha.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: postgres-cluster
namespace: scoutquest-system
spec:
instances: 3
postgresql:
parameters:
max_connections: "200"
shared_buffers: "256MB"
effective_cache_size: "1GB"
maintenance_work_mem: "64MB"
checkpoint_completion_target: "0.9"
wal_buffers: "16MB"
default_statistics_target: "100"
random_page_cost: "1.1"
effective_io_concurrency: "200"
work_mem: "4MB"
min_wal_size: "1GB"
max_wal_size: "4GB"
bootstrap:
initdb:
database: scoutquest
owner: scoutquest
secret:
name: postgres-credentials
storage:
size: 100Gi
storageClass: fast-ssd
monitoring:
enablePodMonitor: true
backup:
retentionPolicy: "30d"
barmanObjectStore:
destinationPath: "s3://backup-bucket/postgres"
s3Credentials:
accessKeyId:
name: backup-credentials
key: ACCESS_KEY_ID
secretAccessKey:
name: backup-credentials
key: SECRET_ACCESS_KEY
# WAL and base-backup retention both follow retentionPolicy above; the wal and data sections only take settings such as compression
---
apiVersion: v1
kind: Secret
metadata:
name: postgres-credentials
namespace: scoutquest-system
type: kubernetes.io/basic-auth
data:
username:
password:
Security Configuration
TLS and Certificate Management
# cert-manager-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: admin@example.com
privateKeySecretRef:
name: letsencrypt-prod
solvers:
- http01:
ingress:
class: nginx
- dns01:
route53:
region: us-east-1
accessKeyID:
secretAccessKeySecretRef:
name: route53-credentials
key: secret-access-key
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: scoutquest-tls-cert
namespace: scoutquest-system
spec:
secretName: scoutquest-tls-cert
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuer
dnsNames:
- scoutquest.example.com
- api.scoutquest.example.com
Network Policies
# network-policies.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: scoutquest-network-policy
namespace: scoutquest-system
spec:
podSelector:
matchLabels:
app: scoutquest-server
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-system
- namespaceSelector:
matchLabels:
name: monitoring-system
- podSelector:
matchLabels:
app: scoutquest-server
ports:
- protocol: TCP
port: 8080
egress:
- to:
- namespaceSelector:
matchLabels:
name: scoutquest-system
ports:
- protocol: TCP
port: 5432
- to: []
ports:
- protocol: TCP
port: 53
- protocol: UDP
port: 53
- protocol: TCP
port: 443
- protocol: TCP
port: 14268
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: postgres-network-policy
namespace: scoutquest-system
spec:
podSelector:
matchLabels:
postgresql: postgres-cluster
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: scoutquest-server
ports:
- protocol: TCP
port: 5432
Pod Security Standards
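# pod-security-policy.yaml
# PodSecurityPolicy was removed in Kubernetes v1.25. Apply this manifest only on
# older clusters; on v1.25 and later use Pod Security Admission instead, for
# example by labeling the namespace with pod-security.kubernetes.io/enforce.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
# PodSecurityPolicy is cluster-scoped, so metadata carries no namespace
name: scoutquest-psp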
spec:
privileged: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
- ALL
volumes:
- 'configMap'
- 'secret'
- 'emptyDir'
- 'downwardAPI'
- 'projected'
- 'persistentVolumeClaim'
hostNetwork: false
hostIPC: false
hostPID: false
runAsUser:
rule: 'MustRunAsNonRoot'
supplementalGroups:
rule: 'MustRunAs'
ranges:
- min: 1
max: 65535
fsGroup:
rule: 'MustRunAs'
ranges:
- min: 1
max: 65535
readOnlyRootFilesystem: true
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: scoutquest-psp-role
namespace: scoutquest-system
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames:
- scoutquest-psp
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: scoutquest-psp-binding
namespace: scoutquest-system
roleRef:
kind: Role
name: scoutquest-psp-role
apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
name: scoutquest-service-account
namespace: scoutquest-system
Docker Compose for Smaller Deployments
# docker-compose.prod.yml
version: '3.8'
services:
scoutquest-1:
image: scoutquest/server:v1.0.0
container_name: scoutquest-server-1
restart: unless-stopped
environment:
- POSTGRES_URL=postgresql://scoutquest:${POSTGRES_PASSWORD}@postgres:5432/scoutquest
- REDIS_URL=redis://:${REDIS_PASSWORD}@redis:6379
- JWT_SECRET=${JWT_SECRET}
- RUST_LOG=info,scoutquest=debug
- SERVER_PORT=8080
volumes:
- ./config/production.toml:/etc/scoutquest/config.toml:ro
- ./certs:/etc/certs:ro
- scoutquest-logs:/var/log/scoutquest
networks:
- scoutquest-network
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_healthy
healthcheck:
# -k: the mounted certificate is issued for the public hostname, not localhost
test: ["CMD", "curl", "-fsk", "https://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
scoutquest-2:
image: scoutquest/server:v1.0.0
container_name: scoutquest-server-2
restart: unless-stopped
environment:
- POSTGRES_URL=postgresql://scoutquest:${POSTGRES_PASSWORD}@postgres:5432/scoutquest
- REDIS_URL=redis://:${REDIS_PASSWORD}@redis:6379
- JWT_SECRET=${JWT_SECRET}
- RUST_LOG=info,scoutquest=debug
- SERVER_PORT=8080
volumes:
- ./config/production.toml:/etc/scoutquest/config.toml:ro
- ./certs:/etc/certs:ro
- scoutquest-logs:/var/log/scoutquest
networks:
- scoutquest-network
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_healthy
nginx:
image: nginx:1.25-alpine
container_name: scoutquest-nginx
restart: unless-stopped
ports:
- "80:80"
- "443:443"
volumes:
- ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
- ./nginx/ssl:/etc/nginx/ssl:ro
- nginx-logs:/var/log/nginx
networks:
- scoutquest-network
depends_on:
- scoutquest-1
- scoutquest-2
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:80/health"]
interval: 30s
timeout: 10s
retries: 3
postgres:
image: postgres:15-alpine
container_name: scoutquest-postgres
restart: unless-stopped
environment:
- POSTGRES_DB=scoutquest
- POSTGRES_USER=scoutquest
- POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
- POSTGRES_INITDB_ARGS=--auth-host=scram-sha-256
volumes:
- postgres-data:/var/lib/postgresql/data
- ./postgres/init:/docker-entrypoint-initdb.d:ro
- ./postgres/postgresql.conf:/etc/postgresql/postgresql.conf:ro
networks:
- scoutquest-network
command: >
postgres
-c config_file=/etc/postgresql/postgresql.conf
-c log_statement=all
-c log_destination=stderr
-c logging_collector=on
-c log_directory=/var/log/postgresql
healthcheck:
test: ["CMD-SHELL", "pg_isready -U scoutquest -d scoutquest"]
interval: 30s
timeout: 10s
retries: 5
redis:
image: redis:7-alpine
container_name: scoutquest-redis
restart: unless-stopped
command: >
redis-server
--appendonly yes
--requirepass ${REDIS_PASSWORD}
--maxmemory 256mb
--maxmemory-policy allkeys-lru
volumes:
- redis-data:/data
networks:
- scoutquest-network
healthcheck:
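# Authenticate (requirepass is set) and check for PONG without writing a key
test: ["CMD-SHELL", "redis-cli --no-auth-warning -a \"${REDIS_PASSWORD}\" ping | grep -q PONG"]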
interval: 30s
timeout: 10s
retries: 5
# Monitoring stack
prometheus:
image: prom/prometheus:v2.45.0
container_name: prometheus
restart: unless-stopped
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--storage.tsdb.retention.time=200h'
- '--web.enable-lifecycle'
volumes:
- ./monitoring/prometheus:/etc/prometheus
- prometheus-data:/prometheus
networks:
- monitoring-network
- scoutquest-network
ports:
- "9090:9090"
grafana:
image: grafana/grafana:10.0.0
container_name: grafana
restart: unless-stopped
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
- GF_INSTALL_PLUGINS=grafana-piechart-panel
volumes:
- grafana-data:/var/lib/grafana
- ./monitoring/grafana:/etc/grafana/provisioning
networks:
- monitoring-network
ports:
- "3001:3000"
volumes:
postgres-data:
redis-data:
prometheus-data:
grafana-data:
scoutquest-logs:
nginx-logs:
networks:
scoutquest-network:
driver: bridge
ipam:
config:
- subnet: 172.20.0.0/16
monitoring-network:
driver: bridge
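The Compose file above interpolates `POSTGRES_PASSWORD`, `JWT_SECRET`, `REDIS_PASSWORD`, and `GRAFANA_PASSWORD` from the environment. If one of them is unset, Compose substitutes an empty string and the stack starts with a blank credential. A small pre-flight check can catch that; this is a sketch, and the helper name `missing_vars` is hypothetical.

```shell
#!/usr/bin/env bash

# Variables interpolated by docker-compose.prod.yml above.
REQUIRED_VARS=(POSTGRES_PASSWORD JWT_SECRET REDIS_PASSWORD GRAFANA_PASSWORD)

# Hypothetical helper: print the names of variables that are unset or empty.
# Returns non-zero if at least one is missing.
missing_vars() {
  local missing=() name
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: the value of the variable named by $name
    if [[ -z "${!name:-}" ]]; then
      missing+=("$name")
    fi
  done
  if (( ${#missing[@]} > 0 )); then
    printf '%s\n' "${missing[@]}"
    return 1
  fi
}
```

One possible use is to gate startup on it: `missing_vars "${REQUIRED_VARS[@]}" && docker compose -f docker-compose.prod.yml up -d`.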
Nginx Load Balancer Configuration
# nginx/nginx.conf
events {
worker_connections 1024;
use epoll;
multi_accept on;
}
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
# Logging
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for" '
'rt=$request_time uct="$upstream_connect_time" '
'uht="$upstream_header_time" urt="$upstream_response_time"';
access_log /var/log/nginx/access.log main;
error_log /var/log/nginx/error.log warn;
# Performance
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
types_hash_max_size 2048;
client_max_body_size 16M;
# Gzip compression
gzip on;
gzip_vary on;
gzip_min_length 1024;
gzip_proxied any;
gzip_comp_level 6;
gzip_types text/plain text/css text/xml text/javascript
application/json application/javascript application/xml+rss
application/atom+xml image/svg+xml;
# Rate limiting
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=health:10m rate=100r/s;
# Upstream configuration
upstream scoutquest_backend {
least_conn;
server scoutquest-1:8080 max_fails=3 fail_timeout=30s;
server scoutquest-2:8080 max_fails=3 fail_timeout=30s;
keepalive 32;
}
# Health check endpoint
server {
listen 80;
server_name _;
location /health {
access_log off;
return 200 "healthy\n";
add_header Content-Type text/plain;
}
}
# HTTP to HTTPS redirect
server {
listen 80;
server_name scoutquest.example.com;
return 301 https://$server_name$request_uri;
}
# Main HTTPS server
server {
listen 443 ssl;
# nginx 1.25.1+ deprecates the "http2" listen parameter in favor of this directive
http2 on;
server_name scoutquest.example.com;
# SSL configuration
ssl_certificate /etc/nginx/ssl/cert.pem;
ssl_certificate_key /etc/nginx/ssl/key.pem;
ssl_session_timeout 1d;
ssl_session_cache shared:MozTLS:10m;
ssl_session_tickets off;
# Modern configuration
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384;
ssl_prefer_server_ciphers off;
# HSTS
add_header Strict-Transport-Security "max-age=63072000" always;
# Security headers
add_header X-Frame-Options DENY always;
add_header X-Content-Type-Options nosniff always;
add_header X-XSS-Protection "1; mode=block" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Content-Security-Policy "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline';" always;
# API routes with rate limiting
location /api/ {
limit_req zone=api burst=20 nodelay;
proxy_pass https://scoutquest_backend;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection 'upgrade';
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_cache_bypass $http_upgrade;
# Timeouts
proxy_connect_timeout 5s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
}
# Health check with higher rate limit
location /health {
limit_req zone=health burst=50 nodelay;
proxy_pass https://scoutquest_backend/health;
access_log off;
}
# Metrics endpoint (internal only)
location /metrics {
allow 172.20.0.0/16;
deny all;
proxy_pass https://scoutquest_backend/metrics;
}
# Static assets
location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg)$ {
expires 1y;
add_header Cache-Control "public, immutable";
try_files $uri =404;
}
}
}
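The `limit_req` settings above (`rate=10r/s` with `burst=20 nodelay` for `/api/`) can be hard to reason about. The sketch below is a simplified arithmetic model, not nginx's implementation: for one client hitting an idle zone, nginx forwards the first request plus up to `burst` more immediately and rejects the rest, then frees burst slots at the configured rate. The function names are hypothetical.

```shell
#!/usr/bin/env bash

# accepted_now N BURST
# Requests forwarded immediately when N requests arrive at once from one
# client into an idle zone configured with "burst=BURST nodelay".
accepted_now() {
  local n=$1 burst=$2
  local cap=$(( burst + 1 ))   # one request at the base rate plus BURST slots
  echo $(( n < cap ? n : cap ))
}

# slots_freed MS RATE_PER_SECOND BURST
# Burst slots that have become available again MS milliseconds after the
# burst was exhausted (never more than BURST).
slots_freed() {
  local ms=$1 rate=$2 burst=$3
  local freed=$(( ms * rate / 1000 ))
  echo $(( freed < burst ? freed : burst ))
}
```

Under this model, 50 simultaneous `/api/` requests would yield `accepted_now 50 20`, and half a second later `slots_freed 500 10 20` slots would be open again.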
Performance Optimization
Production Configuration Tuning
# config/production.toml
[server]
host = "0.0.0.0"
port = 8080
worker_threads = 8
max_blocking_threads = 16
[registry]
cleanup_interval = 60
health_check_interval = 15
max_retries = 5
health_check_timeout = 10
batch_size = 100
concurrent_health_checks = 50
[database]
url = "postgresql://scoutquest:${POSTGRES_PASSWORD}@postgres:5432/scoutquest"
max_connections = 50
min_connections = 10
connection_timeout = 30
idle_timeout = 300
max_lifetime = 3600
[cache]
enabled = true
redis_url = "redis://:${REDIS_PASSWORD}@redis:6379"
default_ttl = 300
max_connections = 20
connection_timeout = 5
command_timeout = 5
[security]
enable_tls = true
cert_file = "/etc/certs/cert.pem"
key_file = "/etc/certs/key.pem"
require_auth = true
jwt_secret = "${JWT_SECRET}"
jwt_expiry = 3600
rate_limit_requests = 1000
rate_limit_window = 60
[performance]
request_timeout = 30
max_concurrent_requests = 5000
enable_compression = true
compression_level = 6
keep_alive_timeout = 75
tcp_nodelay = true
tcp_keepalive = true
[observability]
metrics_enabled = true
tracing_enabled = true
jaeger_endpoint = "http://jaeger:14268/api/traces"
log_level = "info"
structured_logging = true
[backup]
enabled = true
s3_bucket = "scoutquest-backups"
s3_region = "us-east-1"
backup_interval = 3600
retention_days = 30
Database Performance Tuning
# postgres/postgresql.conf
# Connection Settings
max_connections = 200
superuser_reserved_connections = 3
# Memory Settings
shared_buffers = 512MB # 25% of total RAM (this file assumes 2GB)
effective_cache_size = 1536MB # 75% of total RAM
work_mem = 4MB # Per sort/hash operation; worst case is about work_mem times max_connections
maintenance_work_mem = 128MB
# Checkpoint Settings
checkpoint_completion_target = 0.9
checkpoint_timeout = 15min
max_wal_size = 2GB
min_wal_size = 512MB
checkpoint_warning = 30s
# WAL Settings
wal_buffers = 16MB
wal_level = replica
wal_log_hints = on
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/archive/%f && cp %p /var/lib/postgresql/archive/%f'
# Query Planner
default_statistics_target = 500
random_page_cost = 1.1
effective_io_concurrency = 200
# Background Writer
bgwriter_delay = 50ms
bgwriter_lru_maxpages = 100
bgwriter_lru_multiplier = 2.0
# Autovacuum
autovacuum = on
autovacuum_max_workers = 3
autovacuum_naptime = 20s
autovacuum_vacuum_threshold = 50
autovacuum_analyze_threshold = 50
autovacuum_vacuum_scale_factor = 0.02
autovacuum_analyze_scale_factor = 0.01
# Logging
log_destination = 'stderr'
logging_collector = on
log_directory = '/var/log/postgresql'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_min_duration_statement = 100ms
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
# Replication (for read replicas)
max_wal_senders = 3
wal_keep_size = 1GB # Replaces wal_keep_segments (removed in PostgreSQL 13); 64 segments of 16MB
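The memory comments in `postgresql.conf` above give percentage rules rather than absolute sizes, so the values need recomputing for hosts with a different amount of RAM. A minimal sketch of that arithmetic follows; the helper names are hypothetical, and the 25% / 75% ratios are taken from the comments in the file.

```shell
#!/usr/bin/env bash

# pg_memory_settings TOTAL_RAM_MB
# Print shared_buffers (25% of RAM) and effective_cache_size (75% of RAM).
pg_memory_settings() {
  local ram_mb=$1
  echo "shared_buffers = $(( ram_mb / 4 ))MB"
  echo "effective_cache_size = $(( ram_mb * 3 / 4 ))MB"
}

# work_mem_worst_case_mb MAX_CONNECTIONS WORK_MEM_MB
# Memory used if every connection runs one work_mem-sized operation at once.
work_mem_worst_case_mb() {
  local max_connections=$1 work_mem_mb=$2
  echo $(( max_connections * work_mem_mb ))
}
```

Running `pg_memory_settings 2048` reproduces the two values in the file, which suggests it was sized for a 2 GB host; `work_mem_worst_case_mb 200 4` gives the worst-case sort memory for the configured connection limit.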
Connection Pooling with PgBouncer
# pgbouncer/pgbouncer.ini
[databases]
scoutquest = host=postgres port=5432 dbname=scoutquest
[pgbouncer]
pool_mode = transaction
listen_port = 6432
listen_addr = *
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
logfile = /var/log/pgbouncer/pgbouncer.log
pidfile = /var/run/pgbouncer/pgbouncer.pid
# Pool configuration
max_client_conn = 1000
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 5
max_db_connections = 50
# Timing
# Not used in transaction pooling mode unless server_reset_query_always = 1
server_reset_query = DISCARD ALL
server_check_query = select 1
server_check_delay = 30
server_connect_timeout = 15
server_login_retry = 15
client_login_timeout = 60
autodb_idle_timeout = 3600
# Performance
application_name_add_host = 1
ignore_startup_parameters = extra_float_digits
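The pool sizes in this section only help if their sum stays below what PostgreSQL accepts. As a rough check, assuming each ScoutQuest instance may open up to its configured `max_connections`, the headroom can be computed as below. The function name `connection_headroom` is hypothetical.

```shell
#!/usr/bin/env bash

# connection_headroom INSTANCES PER_INSTANCE PG_MAX_CONNECTIONS RESERVED
# Connections left over on the PostgreSQL server after every client pool is
# full. A negative result means the pools can exhaust the server.
connection_headroom() {
  local instances=$1 per_instance=$2 pg_max=$3 reserved=$4
  echo $(( pg_max - reserved - instances * per_instance ))
}
```

With the values from this guide (two Compose instances at 50 connections each, `max_connections = 200`, `superuser_reserved_connections = 3`), `connection_headroom 2 50 200 3` shows the direct-connection case. When clients go through PgBouncer instead, the server-side demand is bounded by `max_db_connections`, so the equivalent check would be `connection_headroom 1 50 200 3`.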
Backup and Disaster Recovery
Automated Backup Strategy
#!/bin/bash
# backup/backup-script.sh
set -euo pipefail
# Configuration
BACKUP_DIR="/backups"
S3_BUCKET="scoutquest-backups"
RETENTION_DAYS=30
POSTGRES_HOST="postgres"
POSTGRES_DB="scoutquest"
POSTGRES_USER="scoutquest"
# Create timestamped backup directory
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
BACKUP_PATH="${BACKUP_DIR}/${TIMESTAMP}"
mkdir -p "${BACKUP_PATH}"
echo "Starting backup at $(date)"
# Database backup
echo "Backing up database..."
pg_dump -h "${POSTGRES_HOST}" -U "${POSTGRES_USER}" -d "${POSTGRES_DB}" \
--format=custom --compress=9 --verbose \
--file="${BACKUP_PATH}/database.dump"
# Configuration backup
echo "Backing up configuration..."
tar -czf "${BACKUP_PATH}/config.tar.gz" /etc/scoutquest/
# Create backup manifest
cat > "${BACKUP_PATH}/manifest.json" << EOF
{
"timestamp": "${TIMESTAMP}",
"database_size": $(stat -c%s "${BACKUP_PATH}/database.dump"),
"config_size": $(stat -c%s "${BACKUP_PATH}/config.tar.gz"),
"scoutquest_version": "$(scoutquest --version 2>/dev/null || echo 'unknown')",
"postgres_version": "$(psql -h ${POSTGRES_HOST} -U ${POSTGRES_USER} -d ${POSTGRES_DB} -t -c 'SELECT version();' 2>/dev/null || echo 'unknown')"
}
EOF
# Upload to S3
echo "Uploading to S3..."
aws s3 sync "${BACKUP_PATH}" "s3://${S3_BUCKET}/${TIMESTAMP}/" \
--storage-class STANDARD_IA \
--server-side-encryption AES256
# Verify backup
echo "Verifying backup..."
aws s3 ls "s3://${S3_BUCKET}/${TIMESTAMP}/" --recursive | grep -q "database.dump"
aws s3 ls "s3://${S3_BUCKET}/${TIMESTAMP}/" --recursive | grep -q "config.tar.gz"
# Cleanup old local backups
echo "Cleaning up old local backups..."
find "${BACKUP_DIR}" -mindepth 1 -maxdepth 1 -type d -name "20*" -mtime +7 -exec rm -rf {} + || true
# Cleanup old S3 backups
echo "Cleaning up old S3 backups..."
CUTOFF_DATE=$(date -d "${RETENTION_DAYS} days ago" +"%Y%m%d")
aws s3 ls "s3://${S3_BUCKET}/" | while read -r marker prefix _; do
# Only consider top-level backup prefixes ("PRE 20240101_120000/"), not loose objects
[[ "${marker}" == "PRE" && "${prefix}" =~ ^[0-9]{8}_[0-9]{6}/$ ]] || continue
BACKUP_DATE="${prefix%%_*}"
if [[ "${BACKUP_DATE}" < "${CUTOFF_DATE}" ]]; then
echo "Deleting old backup: ${prefix}"
aws s3 rm "s3://${S3_BUCKET}/${prefix}" --recursive
fi
done
echo "Backup completed successfully at $(date)"
# Send notification
if command -v curl &> /dev/null && [[ -n "${SLACK_WEBHOOK:-}" ]]; then
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"✅ ScoutQuest backup completed successfully: ${TIMESTAMP}\"}" \
"${SLACK_WEBHOOK}"
fi
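The S3 retention step in the script above depends on parsing `aws s3 ls` output, which is easy to get wrong and hard to test against a live bucket. One option is to isolate the selection logic in a function that reads the listing from standard input. This is a sketch under the assumption that backups live under top-level prefixes named `YYYYMMDD_HHMMSS/`, as the script's `TIMESTAMP` format implies; the function name `expired_prefixes` is hypothetical.

```shell
#!/usr/bin/env bash

# expired_prefixes CUTOFF_YYYYMMDD
# Reads `aws s3 ls s3://bucket/` output on stdin and prints the backup
# prefixes whose date part is strictly older than the cutoff.
expired_prefixes() {
  local cutoff=$1 marker prefix _rest
  local re='^([0-9]{8})_[0-9]{6}/$'
  while read -r marker prefix _rest || [[ -n "${marker:-}" ]]; do
    # Prefix lines look like "PRE 20240101_120000/"; skip plain objects
    [[ "$marker" == "PRE" ]] || continue
    [[ "$prefix" =~ $re ]] || continue
    # Zero-padded dates compare correctly as strings
    if [[ "${BASH_REMATCH[1]}" < "$cutoff" ]]; then
      echo "$prefix"
    fi
  done
}
```

The deletion loop could then be fed from it, for example `aws s3 ls "s3://${S3_BUCKET}/" | expired_prefixes "${CUTOFF_DATE}"`, with each printed prefix passed to `aws s3 rm --recursive`.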
Disaster Recovery Procedures
#!/bin/bash
# disaster-recovery/restore.sh
set -euo pipefail
# Configuration
S3_BUCKET="scoutquest-backups"
RESTORE_TIMESTAMP="${1:-latest}"
POSTGRES_HOST="${POSTGRES_HOST:-postgres}"
POSTGRES_DB="${POSTGRES_DB:-scoutquest}"
POSTGRES_USER="${POSTGRES_USER:-scoutquest}"
echo "Starting disaster recovery process..."
# Find backup to restore
if [[ "${RESTORE_TIMESTAMP}" == "latest" ]]; then
RESTORE_TIMESTAMP=$(aws s3 ls "s3://${S3_BUCKET}/" | awk '$1 == "PRE" {print $2}' | sort | tail -n 1 | sed 's/\///g')
fi
echo "Restoring from backup: ${RESTORE_TIMESTAMP}"
# Create temporary directory
TEMP_DIR=$(mktemp -d)
trap 'rm -rf "${TEMP_DIR}"' EXIT
# Download backup
echo "Downloading backup from S3..."
aws s3 sync "s3://${S3_BUCKET}/${RESTORE_TIMESTAMP}/" "${TEMP_DIR}/"
# Verify backup integrity
echo "Verifying backup integrity..."
if [[ ! -f "${TEMP_DIR}/database.dump" ]] || [[ ! -f "${TEMP_DIR}/config.tar.gz" ]]; then
echo "ERROR: Backup files not found or incomplete"
exit 1
fi
# Check if database is accessible
if ! pg_isready -h "${POSTGRES_HOST}" -U "${POSTGRES_USER}"; then
echo "ERROR: Database is not accessible"
exit 1
fi
# Create backup of current database (if exists)
echo "Creating safety backup of current database..."
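# Stored outside TEMP_DIR, which the EXIT trap deletes (override the location with SAFETY_BACKUP_DIR)
SAFETY_BACKUP="${SAFETY_BACKUP_DIR:-/var/tmp}/scoutquest_pre_restore_$(date +%Y%m%d_%H%M%S).dump"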
pg_dump -h "${POSTGRES_HOST}" -U "${POSTGRES_USER}" -d "${POSTGRES_DB}" \
--format=custom --file="${SAFETY_BACKUP}" || echo "No existing database to backup"
# Stop ScoutQuest services
echo "Stopping ScoutQuest services..."
kubectl scale deployment scoutquest-server --replicas=0 -n scoutquest-system || true
docker-compose stop scoutquest-1 scoutquest-2 || true
# Drop and recreate database
echo "Recreating database..."
psql -v ON_ERROR_STOP=1 -h "${POSTGRES_HOST}" -U "${POSTGRES_USER}" -d postgres << EOF
DROP DATABASE IF EXISTS ${POSTGRES_DB};
CREATE DATABASE ${POSTGRES_DB} OWNER ${POSTGRES_USER};
EOF
# Restore database
echo "Restoring database..."
pg_restore -h "${POSTGRES_HOST}" -U "${POSTGRES_USER}" -d "${POSTGRES_DB}" \
--verbose --clean --if-exists "${TEMP_DIR}/database.dump"
# Restore configuration
echo "Restoring configuration..."
tar -xzf "${TEMP_DIR}/config.tar.gz" -C /
# Verify database connection
echo "Verifying database restoration..."
RESTORED_TABLES=$(psql -h "${POSTGRES_HOST}" -U "${POSTGRES_USER}" -d "${POSTGRES_DB}" \
-t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema = 'public';")
if [[ "${RESTORED_TABLES}" -lt 1 ]]; then
echo "ERROR: Database restoration failed - no tables found"
exit 1
fi
# Start services
echo "Starting ScoutQuest services..."
kubectl scale deployment scoutquest-server --replicas=3 -n scoutquest-system || true
docker-compose start scoutquest-1 scoutquest-2 || true
# Wait for services to be ready
echo "Waiting for services to be ready..."
sleep 30
# Health check
echo "Performing health check..."
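HEALTHY=false
for i in {1..30}; do
if curl -f -s http://localhost/health > /dev/null; then
echo "✅ ScoutQuest is healthy and running"
HEALTHY=true
break
fi
echo "Waiting for service to be ready... ($i/30)"
sleep 10
done
# Abort instead of reporting success when the service never became healthy
if [[ "${HEALTHY}" != "true" ]]; then
echo "ERROR: ScoutQuest did not become healthy after the restore"
exit 1
fi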
# Send notification
if command -v curl &> /dev/null && [[ -n "${SLACK_WEBHOOK:-}" ]]; then
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"✅ ScoutQuest disaster recovery completed successfully. Restored from: ${RESTORE_TIMESTAMP}\"}" \
"${SLACK_WEBHOOK}"
fi
echo "Disaster recovery completed successfully!"
echo "Restored from backup: ${RESTORE_TIMESTAMP}"
echo "Safety backup of previous database available at: ${SAFETY_BACKUP}"
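The restore script polls the health endpoint in a fixed loop. The same pattern, retrying a command a bounded number of times and failing if it never succeeds, can be factored into a reusable function. The sketch below is one way to do that; `wait_until` is a hypothetical name, not part of ScoutQuest.

```shell
#!/usr/bin/env bash

# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0. Returns 0 on the first success and 1 once
# ATTEMPTS tries have all failed.
wait_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    echo "Attempt ${i}/${attempts} failed; retrying in ${delay}s" >&2
    sleep "$delay"
  done
  return 1
}
```

A health check equivalent to the loop in the script might then read `wait_until 30 10 curl -f -s -o /dev/null http://localhost/health || exit 1`, which makes the failure path explicit.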
Operational Best Practices
✅ Production Checklist
- High Availability: Deploy multiple instances across availability zones
- Load Balancing: Use proper load balancers with health checks
- Database: Set up database clustering and read replicas
- Security: Enable TLS, implement proper authentication
- Monitoring: Set up comprehensive monitoring and alerting
- Logging: Configure centralized logging with log rotation
- Backups: Automate regular backups and test recovery procedures
- Resource Limits: Set appropriate CPU/memory limits and requests
- Network Security: Implement network policies and firewalls
- Update Strategy: Plan for zero-downtime rolling updates
⚠️ Common Production Issues
- Database Connection Exhaustion: Use connection pooling
- Memory Leaks: Monitor memory usage and set limits
- Certificate Expiry: Set up automated certificate renewal
- Split Brain Scenarios: Implement proper leader election
- Cascading Failures: Use circuit breakers and timeouts
- Log Storage: Implement log rotation and retention policies
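Certificate expiry is straightforward to monitor even where renewal is automated, and it matters most for statically mounted certificates such as the ones the Compose setup reads from `./nginx/ssl`. The sketch below only covers the date arithmetic; the function name `days_left` is hypothetical, and the commands in the comments assume OpenSSL and GNU `date` are available.

```shell
#!/usr/bin/env bash

# days_left EXPIRY_EPOCH NOW_EPOCH
# Whole days between now and the certificate's notAfter time.
# Zero or negative means the certificate is about to expire or already has.
days_left() {
  local expiry_epoch=$1 now_epoch=$2
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# One way to obtain the inputs (assumes OpenSSL and GNU date):
#   not_after=$(openssl x509 -enddate -noout -in ./nginx/ssl/cert.pem | cut -d= -f2)
#   days_left "$(date -d "$not_after" +%s)" "$(date +%s)"
```

An alert threshold, for instance warning when fewer than 14 days remain, is a policy choice and not something this guide prescribes.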
Maintenance Windows
#!/bin/bash
# maintenance/rolling-update.sh
set -euo pipefail
NEW_IMAGE="scoutquest/server:${1:-latest}"
NAMESPACE="scoutquest-system"
DEPLOYMENT="scoutquest-server"
echo "Starting rolling update to ${NEW_IMAGE}"
# Update deployment
kubectl set image deployment/${DEPLOYMENT} \
scoutquest=${NEW_IMAGE} \
-n ${NAMESPACE}
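# Wait for rollout to complete; roll back if it does not finish in time
# (with set -e, an unguarded failure here would exit before the rollback)
if ! kubectl rollout status deployment/${DEPLOYMENT} \
-n ${NAMESPACE} \
--timeout=600s; then
echo "❌ Rollout did not complete, rolling back"
kubectl rollout undo deployment/${DEPLOYMENT} -n ${NAMESPACE}
exit 1
fi
# Verify deployment
READY_REPLICAS=$(kubectl get deployment ${DEPLOYMENT} \
-n ${NAMESPACE} \
-o jsonpath='{.status.readyReplicas}')
# Test endpoints over HTTPS; plain HTTP is only answered with a redirect
if [[ "${READY_REPLICAS:-0}" -ge 2 ]] \
&& curl -fsS https://scoutquest.example.com/health > /dev/null \
&& curl -fsS https://scoutquest.example.com/api/health > /dev/null; then
echo "✅ Rolling update completed successfully"
kubectl get pods -n ${NAMESPACE} -l app=scoutquest-server
else
echo "❌ Rolling update failed"
kubectl rollout undo deployment/${DEPLOYMENT} -n ${NAMESPACE}
exit 1
fi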