
Observability & Debugging

This tutorial covers monitoring, tracing, and debugging your OJS deployment. By the end, you’ll have dashboards, alerts, distributed traces, and CLI debugging skills.

Prerequisites:

  • A running OJS backend (any backend works; we’ll use Redis)
  • Docker and Docker Compose
  • The OJS CLI installed (go install github.com/openjobspec/ojs-cli@latest)

Every OJS backend exposes Prometheus metrics at /metrics. Key metrics:

ojs_jobs_enqueued_total{queue, type} # Jobs enqueued
ojs_jobs_completed_total{queue, type} # Jobs completed
ojs_jobs_failed_total{queue, type} # Jobs failed
ojs_queue_depth{queue} # Current queue depth
ojs_job_duration_seconds{queue, type} # Processing time histogram
ojs_worker_active_jobs{worker_id} # Active jobs per worker
ojs_worker_heartbeat_age_seconds # Time since last heartbeat
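These metrics are served in the standard Prometheus text exposition format, so a scrape is easy to inspect programmatically. As a quick sanity check, a minimal sketch of parsing a scrape into a lookup table (the sample values below are illustrative, not real output):

```python
def parse_metrics(text):
    """Parse Prometheus exposition lines into {(name, labels): value}."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks, HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        if "{" in name_part:
            name, labels = name_part.split("{", 1)
            labels = labels.rstrip("}")
        else:
            name, labels = name_part, ""
        out[(name, labels)] = float(value)
    return out

# Illustrative sample mirroring the metric names above
sample = """\
ojs_queue_depth{queue="default"} 42
ojs_jobs_failed_total{queue="default",type="email"} 3
"""
metrics = parse_metrics(sample)
print(metrics[("ojs_queue_depth", 'queue="default"')])  # 42.0
```

In practice you would point this at the body of `curl http://localhost:8080/metrics`; Prometheus itself does the scraping for you in the setup below.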

Create a docker-compose.observability.yml:

services:
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
  ojs:
    image: ghcr.io/openjobspec/ojs-backend-redis:0.2.0
    ports: ["8080:8080"]
    environment:
      REDIS_URL: redis://redis:6379
      OJS_ALLOW_INSECURE_NO_AUTH: "true"
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
      GF_AUTH_ANONYMOUS_ENABLED: "true"

And a prometheus.yml:

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: ojs
    static_configs:
      - targets: ["ojs:8080"]
Start the stack:

docker compose -f docker-compose.observability.yml up -d

Then confirm metrics are being exported:

curl -s http://localhost:8080/metrics | grep ojs_

You should see counters, histograms, and gauges for all job operations.

  1. Open Grafana at http://localhost:3000 (admin/admin)
  2. Add Prometheus data source: http://prometheus:9090
  3. Import the OJS dashboards from deploy/grafana/:
    • Overview — system-wide throughput, latency, error rate
    • Queues — per-queue depth, age, throughput
    • Workers — count, utilization, heartbeat status
    • Jobs — lifecycle timing and state distribution
    • Errors — error rate by type, retry patterns
    • Performance — p50/p95/p99 latency
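The Performance dashboard’s p50/p95/p99 panels are derived from the ojs_job_duration_seconds histogram. Estimating a quantile from cumulative bucket counts works roughly like PromQL’s histogram_quantile: find the bucket where the target rank falls, then interpolate linearly within it. A sketch with made-up bucket data:

```python
def estimate_quantile(buckets, q):
    """Estimate a quantile from cumulative Prometheus histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound.
    Assumes observations are uniformly distributed within each bucket.
    """
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            # Linear interpolation inside the bucket, as histogram_quantile does
            frac = (target - prev_count) / (count - prev_count)
            return prev_bound + frac * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Illustrative buckets: 50 jobs under 0.1s, 90 under 0.5s, all 100 under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]
print(estimate_quantile(buckets, 0.95))  # 0.75
```

This is why bucket boundaries matter: a p99 that lands in a wide top bucket is only a coarse estimate.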

OJS SDKs include built-in OpenTelemetry middleware that traces jobs across producers and workers.

// Go
import "go.opentelemetry.io/otel"

// Producer: traces propagate automatically
client := ojs.NewClient("http://localhost:8080",
    ojs.WithOTel(ojs.OTelConfig{
        ServiceName: "order-api",
    }),
)

// Worker: traces link to producer spans
worker := ojs.NewWorker("http://localhost:8080",
    ojs.WithOTel(ojs.OTelConfig{
        ServiceName: "email-worker",
    }),
)

// TypeScript
import { OJSWorker, openTelemetryMiddleware } from '@openjobspec/sdk';

const worker = new OJSWorker({ url: 'http://localhost:8080' });
worker.use(openTelemetryMiddleware({
  serviceName: 'email-worker',
  endpoint: 'http://otel-collector:4317',
}));

# Python
worker = ojs.Worker("http://localhost:8080")

@worker.middleware
async def otel_middleware(ctx, next):
    with tracer.start_as_current_span(f"ojs.{ctx.job.type}"):
        return await next(ctx)

Add Jaeger to your Docker Compose:

jaeger:
  image: jaegertracing/all-in-one:latest
  ports:
    - "16686:16686" # Jaeger UI
    - "4317:4317"   # OTLP gRPC

Open http://localhost:16686 to see traces spanning from enqueue → fetch → process → ack.
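Under the hood, linking a worker span to its producer span requires carrying trace context across the process boundary; with OpenTelemetry this is conventionally a W3C traceparent value attached to the job’s metadata (the helper names below are illustrative, not part of the OJS SDK):

```python
import re

def make_traceparent(trace_id, span_id, sampled=True):
    """Compose a W3C traceparent value (version 00):
    version-traceid-spanid-flags."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(value):
    """Parse a version-00 traceparent; returns None if malformed."""
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", value)
    if not m:
        return None
    return {"trace_id": m.group(1), "span_id": m.group(2),
            "sampled": m.group(3) == "01"}

tp = make_traceparent("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331")
print(parse_traceparent(tp)["trace_id"])  # 0af7651916cd43dd8448eb211c80319c
```

The SDK middleware shown above handles this injection and extraction for you; the sketch only illustrates what travels with the job.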

The OJS CLI provides powerful debugging commands:

ojs monitor --url http://localhost:8080

Shows a real-time TUI with queue depths, throughput, worker status, and error rates.

# Get job details
ojs status <job-id> --detail
# View job history (state transitions with timestamps)
ojs debug history <job-id>
# Trace a job's full lifecycle
ojs debug trace <job-id>
# Queue stats
ojs queues --url http://localhost:8080
# Check for bottlenecks
ojs debug bottleneck --queue default
# View dead letter queue
ojs dead-letter list
# Run diagnostic suite
ojs doctor --url http://localhost:8080
# Output includes:
# ✓ Server reachable
# ✓ Backend connected (Redis latency: 1.2ms)
# ✓ Conformance level: L4
# ✓ No stale workers
# ✓ Dead letter queue: 0 jobs

OJS includes an auto-tuning engine that analyzes your metrics and recommends optimal settings.

# Enable auto-tuning via environment variable
OJS_AUTOTUNE=true
# Or request an analysis on demand via the API
curl http://localhost:8080/ojs/v1/admin/autotune/analyze | jq .
| Parameter | How it’s tuned |
| --- | --- |
| Worker concurrency | Little’s Law: throughput × latency |
| Poll interval | Queue depth + throughput analysis |
| Retry backoff | Error rate pattern classification |
| Connection pool | Peak concurrency × 1.5 |
| Visibility timeout | p99 latency × 2 |
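The worker-concurrency recommendation follows Little’s Law: the average number of jobs in flight equals arrival rate × mean processing time, so that product is the minimum concurrency needed to keep up. A sketch (the 1.2× headroom factor is an illustrative assumption, not the engine’s actual value):

```python
import math

def recommended_concurrency(throughput_per_sec, mean_latency_sec, headroom=1.2):
    """Little's Law: in-flight jobs = arrival rate * mean processing time.
    Round up and add headroom so bursts don't immediately cause backlog."""
    return math.ceil(throughput_per_sec * mean_latency_sec * headroom)

# 200 jobs/s at 250ms mean processing time
print(recommended_concurrency(200, 0.25))  # 60
```

Both inputs come straight from the metrics above: throughput from rate(ojs_jobs_completed_total), latency from the ojs_job_duration_seconds histogram.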
curl http://localhost:8080/ojs/v1/admin/autotune/analyze | jq '.recommendations'

The auto-tuning engine includes anomaly detection that alerts on:

  • Failure spikes — failure rate exceeds baseline by 2σ
  • Latency drift — p50 latency trending upward
  • Queue backlog — depth growing faster than processing
  • Throughput drops — sudden decrease vs learned baseline
curl "http://localhost:8080/ojs/v1/admin/autotune/anomalies?learn=true" | jq .
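The failure-spike rule above can be sketched as a simple baseline comparison: flag the current failure rate when it exceeds the learned mean by two standard deviations (illustrative only; the engine’s baseline learning is more involved):

```python
from statistics import mean, stdev

def is_failure_spike(history, current, sigmas=2.0):
    """Flag `current` if it exceeds the baseline mean by `sigmas` std devs.
    `history` is a list of recent failure-rate samples (needs >= 2 points)."""
    return current > mean(history) + sigmas * stdev(history)

baseline = [0.01, 0.02, 0.015, 0.012, 0.018]  # ~1-2% failure rate
print(is_failure_spike(baseline, 0.08))  # True: 8% is far outside 2 sigma
```

The other three detectors follow the same pattern against different series: latency quantiles, queue depth deltas, and throughput.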

Create Prometheus alerting rules for production:

prometheus-alerts.yml
groups:
  - name: ojs
    rules:
      - alert: OJSQueueBacklog
        expr: ojs_queue_depth > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue }} has {{ $value }} pending jobs"
      - alert: OJSHighFailureRate
        expr: rate(ojs_jobs_failed_total[5m]) / rate(ojs_jobs_enqueued_total[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Failure rate above 10% for queue {{ $labels.queue }}"
      - alert: OJSWorkerStall
        expr: ojs_worker_heartbeat_age_seconds > 60
        labels:
          severity: critical
        annotations:
          summary: "Worker {{ $labels.worker_id }} has not sent a heartbeat in 60s"
| Tool | What it shows | URL |
| --- | --- | --- |
| Prometheus | Raw metrics | http://localhost:9090 |
| Grafana | Visual dashboards | http://localhost:3000 |
| Jaeger | Distributed traces | http://localhost:16686 |
| OJS Admin UI | Job management | http://localhost:8080/ojs/admin/ |
| ojs monitor | Real-time TUI | CLI |
| ojs doctor | Health diagnostics | CLI |
| Auto-tuning API | Performance recommendations | /ojs/v1/admin/autotune/ |

Next: Production Deployment →