Production Deployment

This guide covers what you need to run OJS in production: backend selection, deployment, security hardening, observability, high availability, and performance tuning.

1. Choose Your Backend

Use Case	Recommended Backend	Why
Speed-critical, low latency	Redis	Sub-millisecond enqueue, mature ecosystem
ACID guarantees, SQL queryability	PostgreSQL	Strong durability, transactional enqueue
Cloud-native microservices	NATS	Single binary, built-in clustering
Event replay, compliance	Kafka	Immutable log, horizontally scalable throughput
AWS-native, zero ops	SQS	Fully managed, pay-per-use
Development / CI	Lite	Zero deps, sub-50ms startup

See the Backend Selection Guide for a detailed comparison.

2. Deploy with Kubernetes

Helm Chart

# Add the OJS Helm repository
helm repo add openjobspec https://openjobspec.github.io/charts
helm repo update

# Install with Redis backend
helm install ojs openjobspec/ojs-server \
  --set backend=redis \
  --set redis.url=redis://redis:6379 \
  --set auth.apiKey=your-secret-key \
  --set replicas=3

Key Configuration Values

backend: redis
replicas: 3

redis:
  url: redis://redis-cluster:6379

auth:
  apiKey: "${OJS_API_KEY}"     # Required in production. Placeholder: Helm does not expand environment variables in values files
  enabled: true

resources:
  requests:
    memory: "128Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "1000m"

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilization: 70

monitoring:
  prometheus:
    enabled: true
  grafana:
    dashboards: true

Docker Compose (Single Node)

For simpler, single-node deployments, use the production Compose file:

cd ojs-cloud/deploy
cp .env.example .env   # Edit with your secrets
docker compose -f docker-compose.production.yml up -d

3. Security Hardening

Authentication

All production deployments MUST enable API key authentication:

# Environment variables
OJS_AUTH_REQUIRED=true
OJS_API_KEY=your-strong-random-key-32-chars-minimum

Encryption

Enable job payload encryption for sensitive data:

OJS_ENCRYPTION_ENABLED=true
OJS_ENCRYPTION_KEY=your-32-byte-aes-key-base64-encoded
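
Both OJS_API_KEY and OJS_ENCRYPTION_KEY should come from a cryptographically secure random source, not a hand-typed string. The sketch below is one way to generate such a value in Go. The helper name randomKey is hypothetical, and standard (padded) base64 is an assumption; confirm the encoding your OJS backend expects before using the output.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// randomKey returns n cryptographically random bytes, base64-encoded.
// Hypothetical helper for illustration; it is not part of OJS.
func randomKey(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	// Assumption: OJS accepts standard base64 with padding.
	return base64.StdEncoding.EncodeToString(buf), nil
}

func main() {
	// 32 random bytes match the 32-byte AES key described above.
	key, err := randomKey(32)
	if err != nil {
		panic(err)
	}
	fmt.Printf("OJS_ENCRYPTION_KEY=%s\n", key)
}
```

The same helper can produce an API key: 32 random bytes encode to 44 base64 characters, which satisfies the 32-character minimum.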

Network Security

Place OJS servers on a private network (not internet-facing)
Use a reverse proxy (Nginx, Caddy, ALB) for TLS termination
Enable rate limiting to prevent abuse
Set CORS headers if the Admin UI is served from a different origin

Policy Engine

Define governance rules for job processing:

[
  {
    "id": "block-pii-queue",
    "name": "Block PII on public queues",
    "action": "deny",
    "enabled": true,
    "conditions": {
      "queues": ["public-*"],
      "tags": ["contains-pii"]
    }
  }
]
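
To make the rule above concrete, here is a minimal sketch of how such a policy could be evaluated. It is not the OJS policy engine. The matching semantics are assumptions: glob patterns are matched with Go's path.Match, a policy applies only when the queue matches and at least one job tag matches, and the function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
)

// Policy mirrors the JSON fields shown above ("name" is omitted here).
type Policy struct {
	ID         string `json:"id"`
	Action     string `json:"action"`
	Enabled    bool   `json:"enabled"`
	Conditions struct {
		Queues []string `json:"queues"`
		Tags   []string `json:"tags"`
	} `json:"conditions"`
}

// parsePolicies decodes a JSON array of policies.
func parsePolicies(data []byte) ([]Policy, error) {
	var policies []Policy
	err := json.Unmarshal(data, &policies)
	return policies, err
}

// matchesAny reports whether value matches at least one glob pattern.
func matchesAny(patterns []string, value string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, value); err == nil && ok {
			return true
		}
	}
	return false
}

// denied assumes AND across condition keys and OR within each list.
func denied(policies []Policy, queue string, tags []string) bool {
	for _, p := range policies {
		if !p.Enabled || p.Action != "deny" || !matchesAny(p.Conditions.Queues, queue) {
			continue
		}
		for _, tag := range tags {
			if matchesAny(p.Conditions.Tags, tag) {
				return true
			}
		}
	}
	return false
}

func main() {
	raw := []byte(`[{"id":"block-pii-queue","action":"deny","enabled":true,
		"conditions":{"queues":["public-*"],"tags":["contains-pii"]}}]`)
	policies, err := parsePolicies(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(denied(policies, "public-emails", []string{"contains-pii"}))    // true
	fmt.Println(denied(policies, "internal-reports", []string{"contains-pii"})) // false
	fmt.Println(denied(policies, "public-emails", []string{"marketing"}))       // false
}
```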

4. Observability

Prometheus Metrics

Every OJS backend exposes metrics at /metrics:

# Key metrics to monitor
ojs_jobs_enqueued_total          # Total jobs enqueued
ojs_jobs_completed_total         # Total jobs completed
ojs_jobs_failed_total            # Total jobs failed
ojs_queue_depth                  # Current queue depth
ojs_job_duration_seconds         # Job processing time histogram
ojs_worker_active_jobs           # Currently active jobs per worker
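
For a quick check without a full Prometheus setup, the /metrics output can be read directly. The sketch below sums every sample of one metric from text in the Prometheus exposition format. The function name sumMetric and the queue label in the sample data are assumptions for illustration; the labels your backend emits may differ.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumMetric adds up all samples of one metric in Prometheus text format.
// The second return value is false when the metric does not appear.
func sumMetric(exposition, name string) (float64, bool) {
	var total float64
	found := false
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		if !strings.HasPrefix(line, name) {
			continue
		}
		rest := line[len(name):]
		// Require a label block or whitespace next, so that a longer
		// metric name sharing this prefix is not counted.
		if rest == "" || (rest[0] != '{' && rest[0] != ' ' && rest[0] != '\t') {
			continue
		}
		if i := strings.LastIndex(rest, "}"); i >= 0 {
			rest = rest[i+1:] // drop the label block
		}
		fields := strings.Fields(rest) // value, then optional timestamp
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		total += v
		found = true
	}
	return total, found
}

func main() {
	// Hypothetical scrape output; label names are assumed.
	sample := `# TYPE ojs_queue_depth gauge
ojs_queue_depth{queue="default"} 1200
ojs_queue_depth{queue="emails"} 35
ojs_jobs_failed_total 7
`
	depth, ok := sumMetric(sample, "ojs_queue_depth")
	fmt.Println(depth, ok) // 1235 true
}
```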

Grafana Dashboards

Import the pre-built dashboards from deploy/grafana/:

Overview — System-wide throughput, latency, error rate
Queues — Per-queue depth, throughput, and age
Workers — Worker count, utilization, and heartbeat status
Jobs — Job lifecycle timing and state distribution
Errors — Error rate by type, retry patterns, dead letter growth
Performance — p50/p95/p99 latency, memory, CPU

Alerting

Recommended alerts:

Alert	Condition	Severity
Queue backlog growing	Depth > 1000 for > 5 min	Warning
High failure rate	> 10% for > 2 min	Critical
Worker stall	No heartbeat for > 60s	Critical
Dead letter growth	> 100 jobs in 1 hour	Warning
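
The "High failure rate" condition depends on how the rate is defined. The sketch below assumes it is failed jobs divided by all finished jobs (failed plus completed) between two scrapes of ojs_jobs_failed_total and ojs_jobs_completed_total. The type and function names are hypothetical, and counter resets are not handled.

```go
package main

import "fmt"

// counters holds one scrape of the two job counters.
type counters struct {
	completed float64
	failed    float64
}

// failureRate returns failed / (failed + completed) for the interval
// between two scrapes. It returns 0 when no jobs finished.
func failureRate(prev, cur counters) float64 {
	failed := cur.failed - prev.failed
	completed := cur.completed - prev.completed
	total := failed + completed
	if total <= 0 {
		return 0
	}
	return failed / total
}

func main() {
	// Illustrative numbers only: 30 failures and 170 completions in the interval.
	prev := counters{completed: 9000, failed: 100}
	cur := counters{completed: 9170, failed: 130}
	rate := failureRate(prev, cur)
	fmt.Printf("failure rate %.0f%%, alert=%v\n", rate*100, rate > 0.10)
}
```

In a Prometheus setup the same ratio would normally be written as an alerting rule over the counters' rates, with the "for > 2 min" part handled by the rule's hold duration.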

OpenTelemetry

Enable distributed tracing across producers and workers:

// Go SDK
worker.Use(ojs.OpenTelemetryMiddleware(ojs.OTelConfig{
    ServiceName: "payment-worker",
    Endpoint:    "otel-collector:4317",
}))

5. High Availability

Multi-Replica Deployment

Run 3+ OJS server replicas behind a load balancer. All backends support concurrent access from multiple server instances.

Backend Redundancy

Backend	HA Strategy
Redis	Redis Sentinel or Redis Cluster
PostgreSQL	Streaming replication + PgBouncer
NATS	NATS Cluster (built-in)
Kafka	Multi-broker cluster
SQS	AWS-managed (multi-AZ by default)

Graceful Shutdown

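OJS backends shut down gracefully on SIGTERM: they stop fetching new jobs and let active jobs finish. On Kubernetes, set the grace period to at least your longest expected job duration, because the pod receives SIGKILL once it expires: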

# Kubernetes pod spec: seconds allowed between SIGTERM and SIGKILL
terminationGracePeriodSeconds: 30

# Or manually
kill -SIGTERM <pid>   # Starts graceful shutdown
# Active jobs complete, no new jobs are fetched
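
If you run custom worker processes, the same pattern applies to them. The following is a minimal sketch of the general technique, not the OJS implementation: jobTracker and its methods are hypothetical names, the in-flight job is simulated, and the 25-second drain timeout is an assumed value chosen to stay below the 30-second grace period shown above.

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// jobTracker counts in-flight jobs so shutdown can wait for them.
type jobTracker struct{ wg sync.WaitGroup }

func (t *jobTracker) Start() { t.wg.Add(1) }
func (t *jobTracker) Done()  { t.wg.Done() }

// Drain waits for in-flight jobs and reports whether they all
// finished before the timeout elapsed.
func (t *jobTracker) Drain(timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		t.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	// ctx is cancelled when the process receives SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()
	// Demo only: also cancel after 200ms so the example exits on its own.
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel()

	tracker := &jobTracker{}
	tracker.Start()
	go func() { // simulated in-flight job
		defer tracker.Done()
		time.Sleep(300 * time.Millisecond)
	}()

	<-ctx.Done() // from here on, fetch no new jobs

	// Keep the drain timeout below terminationGracePeriodSeconds.
	if tracker.Drain(25 * time.Second) {
		fmt.Println("all active jobs finished")
	} else {
		fmt.Println("timed out; remaining jobs will be killed")
	}
}
```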

6. Performance Tuning

Worker Concurrency

# Start with: concurrency = 2 × CPU cores (for example, 16 on an 8-core host)
OJS_WORKER_CONCURRENCY=16
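
The starting value can be derived at deploy time instead of hard-coded. This is a small sketch under the rule of thumb above; startingConcurrency is a hypothetical helper. One caveat: runtime.NumCPU reports the logical CPUs visible to the process and does not reflect a container CPU limit, so inside Kubernetes it may be better to pass the configured limit explicitly.

```go
package main

import (
	"fmt"
	"runtime"
)

// startingConcurrency applies the 2 × CPU cores rule of thumb.
// Values below one core are treated as one core.
func startingConcurrency(cores int) int {
	if cores < 1 {
		cores = 1
	}
	return 2 * cores
}

func main() {
	// Uses the host's logical CPU count; substitute the container
	// CPU limit when one is set.
	fmt.Printf("OJS_WORKER_CONCURRENCY=%d\n", startingConcurrency(runtime.NumCPU()))
}
```

Treat the result as a first guess: I/O-bound jobs often tolerate higher concurrency, while CPU-bound jobs may need less.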

Poll Interval

# High throughput: shorter interval
OJS_WORKER_POLL_INTERVAL=200ms

# Low throughput: longer to save CPU
OJS_WORKER_POLL_INTERVAL=2s

Auto-Tuning

Enable the auto-tuning engine for automatic optimization:

OJS_AUTOTUNE=true
OJS_AUTOTUNE_INTERVAL=30s

The engine analyzes throughput, latency, and queue depth to recommend optimal concurrency, poll intervals, and connection pool sizes.