Observability

Strategy

Start with lightweight, self-hosted observability using open-source tooling. The observability layer is modular — logging, metrics, tracing, and client error reporting are isolated behind interfaces so the underlying tools can be swapped without affecting application code.

Server-Side

Stack

Concern             | Tool                    | Role
Structured logging  | tracing (Rust crate)    | Structured, leveled logs emitted from server code
Distributed tracing | tracing + OpenTelemetry | Request-scoped trace spans across async operations
Metrics             | Prometheus              | Counter, gauge, and histogram metrics scraped from the server
Log aggregation     | Loki                    | Collects and indexes structured logs
Dashboards          | Grafana                 | Visualizes logs (Loki), metrics (Prometheus), and traces

Why This Stack

  • tracing is the Rust ecosystem standard; Axum, Tonic, and Dioxus all integrate with it natively. One instrumentation library covers logging and spans, and its OpenTelemetry integration handles export. Metrics are exposed separately for Prometheus to scrape.
  • OpenTelemetry is the vendor-neutral standard for telemetry. Exporting via OTLP means switching backends (e.g., from Loki to Datadog, or from self-hosted to managed) requires only configuration changes.
  • Grafana + Loki + Prometheus are self-hosted, free, and run as Docker containers alongside the server — fits the existing containerized deployment model.

Server Crate Structure

Observability is an isolated module in the server crate:

apps/server/
  src/
    auth/
    billing/
    observability/        # tracing setup, metrics registration, OTLP export
    handlers/
    ...

Application code uses tracing macros (info!, warn!, error!, #[instrument]) and never references the observability backend directly. Swapping Loki for a managed service means changing configuration in observability/, not application code.
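
As an illustration of keeping backend choice in configuration, here is a minimal sketch of the settings the observability/ module might load at startup. The struct, its field names, and the "server" default are assumptions, not part of the existing codebase. OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_SERVICE_NAME are the standard OpenTelemetry environment variables, and RUST_LOG is the conventional filter variable in the tracing ecosystem.

```rust
/// Settings a hypothetical `observability/` module might read at startup.
#[derive(Debug, Clone, PartialEq)]
pub struct ObservabilityConfig {
    /// Where OTLP data is sent: a local collector or a managed backend.
    pub otlp_endpoint: String,
    /// Service name attached to exported telemetry.
    pub service_name: String,
    /// Log filter directive, e.g. "info" or "debug".
    pub log_filter: String,
}

impl ObservabilityConfig {
    /// Builds the config from any key lookup, which keeps it testable
    /// without touching the real process environment.
    pub fn from_lookup<F>(lookup: F) -> Self
    where
        F: Fn(&str) -> Option<String>,
    {
        // Treat empty values the same as unset ones.
        let get = |key: &str, default: &str| {
            lookup(key)
                .filter(|v| !v.trim().is_empty())
                .unwrap_or_else(|| default.to_string())
        };
        Self {
            // 4317 is the default OTLP/gRPC port.
            otlp_endpoint: get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
            // "server" is an assumed default name for this sketch.
            service_name: get("OTEL_SERVICE_NAME", "server"),
            log_filter: get("RUST_LOG", "info"),
        }
    }

    /// Reads the same keys from the process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

With this shape, pointing the server at a managed backend would mean setting OTEL_EXPORTER_OTLP_ENDPOINT in the deployment environment; handlers that call info! or #[instrument] would not change.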

What to Instrument

Request tracing:

  • Every ConnectRPC call gets a trace span, created once in request middleware on the Tonic stack so handlers need no per-call setup
  • Spans include: method name, user_id, org_id, duration, status code

Key operations:

  • Auth flows (login, token refresh, invite upgrade)
  • Score submission (including dedup checks)
  • Sync operations (queue flush, read-down)
  • Billing webhook processing
  • Database query duration

Metrics:

  • Request rate and latency per RPC method
  • Error rate by error code
  • Active connections / concurrent streams
  • Score submission volume
  • Queue flush success/failure rate
  • Database connection pool utilization
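
To make the first two metrics concrete, the sketch below counts requests per RPC method and error code, tracks latency in a small histogram, and renders both in the Prometheus text exposition format. It is hand-rolled with the standard library purely for illustration; a real server would more likely use an existing metrics crate. The metric names, label names, and bucket bounds are assumptions.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// Upper bounds (seconds) for the latency histogram; illustrative values only.
const BUCKETS: [f64; 3] = [0.05, 0.25, 1.0];

#[derive(Default)]
struct Latency {
    bucket_counts: [u64; 3], // cumulative: observations <= each bound in BUCKETS
    sum: f64,
    count: u64,
}

/// In-memory request counters and latency histograms, keyed by RPC method.
#[derive(Default)]
pub struct RpcMetrics {
    requests: BTreeMap<(String, String), u64>, // (method, code) -> count
    latency: BTreeMap<String, Latency>,        // method -> histogram
}

impl RpcMetrics {
    /// Records one completed request.
    pub fn observe(&mut self, method: &str, code: &str, seconds: f64) {
        *self
            .requests
            .entry((method.to_string(), code.to_string()))
            .or_insert(0) += 1;
        let lat = self.latency.entry(method.to_string()).or_default();
        for (i, bound) in BUCKETS.iter().enumerate() {
            if seconds <= *bound {
                lat.bucket_counts[i] += 1;
            }
        }
        lat.sum += seconds;
        lat.count += 1;
    }

    /// Renders all series in the Prometheus text exposition format.
    pub fn render(&self) -> String {
        let mut out = String::from("# TYPE rpc_requests_total counter\n");
        for ((method, code), n) in &self.requests {
            let _ = writeln!(
                out,
                "rpc_requests_total{{method=\"{}\",code=\"{}\"}} {}",
                escape_label(method),
                escape_label(code),
                n
            );
        }
        out.push_str("# TYPE rpc_duration_seconds histogram\n");
        for (method, lat) in &self.latency {
            let m = escape_label(method);
            for (bound, n) in BUCKETS.iter().zip(lat.bucket_counts.iter()) {
                let _ = writeln!(
                    out,
                    "rpc_duration_seconds_bucket{{method=\"{}\",le=\"{}\"}} {}",
                    m, bound, n
                );
            }
            // The +Inf bucket always equals the total observation count.
            let _ = writeln!(
                out,
                "rpc_duration_seconds_bucket{{method=\"{}\",le=\"+Inf\"}} {}",
                m, lat.count
            );
            let _ = writeln!(out, "rpc_duration_seconds_sum{{method=\"{}\"}} {}", m, lat.sum);
            let _ = writeln!(out, "rpc_duration_seconds_count{{method=\"{}\"}} {}", m, lat.count);
        }
        out
    }
}

/// Escapes backslash, double quote, and newline as the text format requires.
fn escape_label(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}
```

Error rate by code then falls out of rpc_requests_total at query time, so no separate error counter is needed in this sketch.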

Structured Log Format

All logs are structured JSON via tracing:

{
  "timestamp": "2026-05-15T14:30:00Z",
  "level": "info",
  "message": "score created",
  "span": {"method": "ScoreService/CreateScore", "trace_id": "abc123"},
  "fields": {"event_id": "018f...", "shooter_id": "018f...", "org_id": "018f..."}
}

No unstructured string logs. Every log entry is queryable in Loki.
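
The entry shape above can be modeled as a small type, which is handy for asserting on log output in tests. This is a sketch only: in the running server a JSON formatter attached to tracing would presumably produce these lines, and the struct below, with its hand-written serialization, is a hypothetical stand-in that uses only the standard library.

```rust
/// One log entry in the shape shown above; field names mirror the JSON keys.
pub struct LogEntry<'a> {
    pub timestamp: &'a str,
    pub level: &'a str,
    pub message: &'a str,
    pub method: &'a str,
    pub trace_id: &'a str,
    /// Event-specific key/value pairs such as event_id or org_id.
    pub fields: &'a [(&'a str, &'a str)],
}

impl<'a> LogEntry<'a> {
    /// Renders the entry as a single JSON line.
    pub fn to_json(&self) -> String {
        let fields: Vec<String> = self
            .fields
            .iter()
            .map(|(k, v)| format!("\"{}\":\"{}\"", json_escape(k), json_escape(v)))
            .collect();
        format!(
            "{{\"timestamp\":\"{}\",\"level\":\"{}\",\"message\":\"{}\",\"span\":{{\"method\":\"{}\",\"trace_id\":\"{}\"}},\"fields\":{{{}}}}}",
            json_escape(self.timestamp),
            json_escape(self.level),
            json_escape(self.message),
            json_escape(self.method),
            json_escape(self.trace_id),
            fields.join(",")
        )
    }
}

/// Escapes a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}
```

Because every value sits under a fixed key, a Loki query can filter on span.method or fields.org_id after JSON parsing rather than pattern-matching message text.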

Client-Side

Mobile Apps (iOS + Android)

Native crash and error reporting, plus custom event tracking:

Concern         | iOS                                                   | Android
Crash reporting | Native crash logs (PLCrashReporter or similar)        | Native crash logs (uncaught exception handler)
Error events    | Reported to server via a lightweight reporting RPC    | Same
Custom events   | Sync failures, queue flush outcomes, offline duration | Same

Client error reports are sent to the server via a dedicated ConnectRPC service:

service TelemetryService {
  rpc ReportError(ErrorReport) returns (ReportResponse);
  rpc ReportEvent(EventReport) returns (ReportResponse);
}

This keeps client telemetry in the same infrastructure — no separate third-party SDK required at MVP. If a dedicated service like Sentry is adopted later, the client-side reporting interface stays the same; only the server-side handler changes.
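
One way to keep that swap confined to the server is to have the ReportError handler depend on a trait rather than on a concrete destination. The sketch below assumes this design. ErrorReport here is a hypothetical Rust view of the message, with field names taken from the crash data listed in this document; ReportSink, InMemorySink, and handle_report_error are illustrative names, not existing code.

```rust
/// Hypothetical Rust view of the ErrorReport message; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct ErrorReport {
    pub app_version: String,
    pub device_info: String,
    pub stack_trace: String,
}

/// Destination for client error reports. The MVP implementation might write
/// structured logs; a later one might forward to a service such as Sentry.
pub trait ReportSink {
    fn record_error(&mut self, report: &ErrorReport) -> Result<(), String>;
}

/// Stand-in sink that keeps reports in memory, useful in tests.
#[derive(Default)]
pub struct InMemorySink {
    pub received: Vec<ErrorReport>,
}

impl ReportSink for InMemorySink {
    fn record_error(&mut self, report: &ErrorReport) -> Result<(), String> {
        self.received.push(report.clone());
        Ok(())
    }
}

/// Core of the ReportError handler: validate, then hand off to whichever
/// sink is configured. Only the trait is referenced here.
pub fn handle_report_error(
    sink: &mut dyn ReportSink,
    report: ErrorReport,
) -> Result<(), String> {
    if report.app_version.is_empty() {
        return Err("app_version is required".to_string());
    }
    sink.record_error(&report)
}
```

Adopting a dedicated service would then mean adding another ReportSink implementation and selecting it at startup, with the RPC contract seen by clients left as it is.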

PWA

The PWA captures browser-level errors via window.onerror and the unhandledrejection event, and reports them through the same TelemetryService RPC when online. Coverage is limited compared to native: there is no background crash reporting.

What to Report from Clients

  • Crashes — stack trace, device info, app version
  • Sync failures — RPC method, error code, retry count, offline duration
  • Queue state — queue depth at flush, flush success/failure
  • Performance — app startup time, time-to-interactive

Deployment

Observability services run as containers alongside the application:

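# Added to docker-compose.yml
services:
  # Scrapes metrics from the server using the mounted scrape config
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  # Collects and indexes structured logs
  loki:
    image: grafana/loki:latest

  # Dashboards over Loki and Prometheus, served on port 3000
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"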

For white-label deployments, observability containers are optional — the customer can include them or point to their own monitoring infrastructure.

Modularity

The design is intentionally layered to allow swapping components:

Layer            | Current                     | Could Swap To
Instrumentation  | tracing + OpenTelemetry     | No reason to swap; this is the standard
Log backend      | Loki                        | Datadog, Elasticsearch, CloudWatch
Metrics backend  | Prometheus                  | Datadog, CloudWatch, managed Prometheus
Trace backend    | Grafana (via OTLP)          | Jaeger, Datadog, Honeycomb
Client reporting | Custom TelemetryService RPC | Sentry SDK, Crashlytics
Dashboards       | Grafana                     | Datadog dashboards, managed Grafana

Swapping a server-side backend is a configuration change in the observability/ module or docker-compose.yml. Application code is unaffected.