Observability
Strategy
Start with lightweight, self-hosted observability using open-source tooling. The observability layer is modular — logging, metrics, tracing, and client error reporting are isolated behind interfaces so the underlying tools can be swapped without affecting application code.
Server-Side
Stack
| Concern | Tool | Role |
|---|---|---|
| Structured logging | tracing (Rust crate) | Structured, leveled logs emitted from server code |
| Distributed tracing | tracing + OpenTelemetry | Request-scoped trace spans across async operations |
| Metrics | Prometheus | Counter, gauge, and histogram metrics scraped from the server |
| Log aggregation | Loki | Collects and indexes structured logs |
| Dashboards | Grafana | Visualizes logs (Loki), metrics (Prometheus), and traces |
Why This Stack
- tracing is the Rust ecosystem standard: Axum, Tonic, and Dioxus all integrate with it natively. One instrumentation library covers logging, spans, and metrics export.
- OpenTelemetry is the vendor-neutral standard for telemetry. Exporting via OTLP means switching backends (e.g., from Loki to Datadog, or from self-hosted to managed) requires only configuration changes.
- Grafana + Loki + Prometheus are self-hosted, free, and run as Docker containers alongside the server — fits the existing containerized deployment model.
Server Crate Structure
Observability is an isolated module in the server crate:
apps/server/
  src/
    auth/
    billing/
    observability/   # tracing setup, metrics registration, OTLP export
    handlers/
    ...
Application code uses tracing macros (info!, warn!, error!, #[instrument]) and never references the observability backend directly. Swapping Loki for a managed service means changing configuration in observability/, not application code.
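The separation between application code and backend can be illustrated without any external crates. This is a minimal std-only sketch under stated assumptions, not the server's actual code: `LogSink`, `StdoutSink`, `MemorySink`, and `submit_score` are hypothetical names, and in the real server the `tracing` subscriber configured in observability/ would play the role of the sink.

```rust
use std::sync::Mutex;

/// Hypothetical interface the application depends on; backends implement it.
pub trait LogSink: Send + Sync {
    fn emit(&self, level: &str, message: &str);
}

/// Backend that writes to stdout (a stand-in for a real exporter).
pub struct StdoutSink;

impl LogSink for StdoutSink {
    fn emit(&self, level: &str, message: &str) {
        println!("[{}] {}", level, message);
    }
}

/// Backend that keeps entries in memory, which is convenient in tests.
#[derive(Default)]
pub struct MemorySink {
    entries: Mutex<Vec<String>>,
}

impl MemorySink {
    /// Returns a copy of everything emitted so far.
    pub fn entries(&self) -> Vec<String> {
        self.entries.lock().unwrap().clone()
    }
}

impl LogSink for MemorySink {
    fn emit(&self, level: &str, message: &str) {
        self.entries
            .lock()
            .unwrap()
            .push(format!("[{}] {}", level, message));
    }
}

/// Application code: it knows only the trait, never a concrete backend.
pub fn submit_score(sink: &dyn LogSink, score: u32) {
    sink.emit("info", &format!("score created: {}", score));
}
```

Because `submit_score` takes `&dyn LogSink`, replacing `StdoutSink` with another implementation changes only the call site that constructs the sink.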
What to Instrument
Request tracing:
- Every ConnectRPC call gets a trace span (a tracing layer in the Tonic middleware stack creates it, so handlers do not open request spans by hand)
- Spans include: method name, user_id, org_id, duration, status code
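The span attributes listed above can be modeled as a small record that is filled in when the request finishes. This is a std-only sketch to show the shape of the data; `RequestSpan` and `SpanRecord` are hypothetical names, and in the server itself these values would be attached to a `tracing` span by the middleware rather than collected by hand.

```rust
use std::time::{Duration, Instant};

/// Finished span data, mirroring the attributes in the list above.
#[derive(Debug)]
pub struct SpanRecord {
    pub method: String,
    pub user_id: String,
    pub org_id: String,
    pub status_code: String,
    pub duration: Duration,
}

/// An in-flight request span, created when the call starts.
pub struct RequestSpan {
    method: String,
    user_id: String,
    org_id: String,
    started: Instant,
}

impl RequestSpan {
    /// Starts the clock and captures the request identity.
    pub fn start(method: &str, user_id: &str, org_id: &str) -> Self {
        RequestSpan {
            method: method.to_string(),
            user_id: user_id.to_string(),
            org_id: org_id.to_string(),
            started: Instant::now(),
        }
    }

    /// Consumes the span so it cannot be finished twice.
    pub fn finish(self, status_code: &str) -> SpanRecord {
        SpanRecord {
            method: self.method,
            user_id: self.user_id,
            org_id: self.org_id,
            status_code: status_code.to_string(),
            duration: self.started.elapsed(),
        }
    }
}
```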
Key operations:
- Auth flows (login, token refresh, invite upgrade)
- Score submission (including dedup checks)
- Sync operations (queue flush, read-down)
- Billing webhook processing
- Database query duration
Metrics:
- Request rate and latency per RPC method
- Error rate by error code
- Active connections / concurrent streams
- Score submission volume
- Queue flush success/failure rate
- Database connection pool utilization
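To make the counter and histogram semantics behind these metrics concrete, here is a std-only sketch. It assumes Prometheus-style cumulative buckets (each bucket counts observations less than or equal to its upper bound, with the total count acting as the implicit +Inf bucket). `Counter` and `Histogram` are hypothetical types for illustration, not the API of any metrics crate.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Monotonic counter, e.g. total score submissions.
#[derive(Default)]
pub struct Counter(AtomicU64);

impl Counter {
    pub fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Latency histogram with cumulative buckets:
/// bucket i counts observations <= bounds_ms[i].
pub struct Histogram {
    bounds_ms: Vec<u64>,
    buckets: Vec<AtomicU64>,
    count: AtomicU64,
    sum_ms: AtomicU64,
}

impl Histogram {
    pub fn new(bounds_ms: Vec<u64>) -> Self {
        let buckets = bounds_ms.iter().map(|_| AtomicU64::new(0)).collect();
        Histogram {
            bounds_ms,
            buckets,
            count: AtomicU64::new(0),
            sum_ms: AtomicU64::new(0),
        }
    }

    /// Records one observation in every bucket whose bound covers it.
    pub fn observe(&self, value_ms: u64) {
        for (bound, bucket) in self.bounds_ms.iter().zip(&self.buckets) {
            if value_ms <= *bound {
                bucket.fetch_add(1, Ordering::Relaxed);
            }
        }
        self.count.fetch_add(1, Ordering::Relaxed);
        self.sum_ms.fetch_add(value_ms, Ordering::Relaxed);
    }

    pub fn bucket_counts(&self) -> Vec<u64> {
        self.buckets
            .iter()
            .map(|b| b.load(Ordering::Relaxed))
            .collect()
    }

    pub fn count(&self) -> u64 {
        self.count.load(Ordering::Relaxed)
    }

    pub fn sum_ms(&self) -> u64 {
        self.sum_ms.load(Ordering::Relaxed)
    }
}
```

With bounds of 10, 50, and 100 ms, observations of 5, 20, and 200 ms yield bucket counts of 1, 2, and 2; the 200 ms request appears only in the total count.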
Structured Log Format
All logs are structured JSON via tracing:
{
  "timestamp": "2026-05-15T14:30:00Z",
  "level": "info",
  "message": "score created",
  "span": {"method": "ScoreService/CreateScore", "trace_id": "abc123"},
  "fields": {"event_id": "018f...", "shooter_id": "018f...", "org_id": "018f..."}
}

No unstructured string logs. Every log entry is queryable in Loki.
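For reference, the log shape shown above can be produced with nothing but the standard library. This is an illustrative sketch: `LogEntry`, `escape_json`, and `to_json_line` are hypothetical helpers, and in the server a JSON formatter attached to the `tracing` subscriber would normally do this work. The sketch emits one compact line per entry and treats all field values as strings.

```rust
/// Escapes a string for use inside a JSON string literal.
fn escape_json(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// One structured log entry in the shape used by this document.
pub struct LogEntry<'a> {
    pub timestamp: &'a str,
    pub level: &'a str,
    pub message: &'a str,
    pub method: &'a str,
    pub trace_id: &'a str,
    pub fields: &'a [(&'a str, &'a str)],
}

/// Serializes the entry as a single compact JSON line.
pub fn to_json_line(e: &LogEntry<'_>) -> String {
    let fields: Vec<String> = e
        .fields
        .iter()
        .map(|(k, v)| format!("\"{}\":\"{}\"", escape_json(k), escape_json(v)))
        .collect();
    format!(
        "{{\"timestamp\":\"{}\",\"level\":\"{}\",\"message\":\"{}\",\"span\":{{\"method\":\"{}\",\"trace_id\":\"{}\"}},\"fields\":{{{}}}}}",
        escape_json(e.timestamp),
        escape_json(e.level),
        escape_json(e.message),
        escape_json(e.method),
        escape_json(e.trace_id),
        fields.join(",")
    )
}
```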
Client-Side
Mobile Apps (iOS + Android)
Native crash and error reporting, plus custom event tracking:
| Concern | iOS | Android |
|---|---|---|
| Crash reporting | Native crash logs (PLCrashReporter or similar) | Native crash logs (uncaught exception handler) |
| Error events | Reported to server via a lightweight reporting RPC | Same |
| Custom events | Sync failures, queue flush outcomes, offline duration | Same |
Client error reports are sent to the server via a dedicated ConnectRPC service:
service TelemetryService {
  rpc ReportError(ErrorReport) returns (ReportResponse);
  rpc ReportEvent(EventReport) returns (ReportResponse);
}

This keeps client telemetry in the same infrastructure: no separate third-party SDK is required at MVP. If a dedicated service like Sentry is adopted later, the client-side reporting interface stays the same; only the server-side handler changes.
PWA
Browser-level error capture via window.onerror and unhandledrejection. Reported through the same TelemetryService RPC when online. Limited compared to native — no background crash reporting.
What to Report from Clients
- Crashes — stack trace, device info, app version
- Sync failures — RPC method, error code, retry count, offline duration
- Queue state — queue depth at flush, flush success/failure
- Performance — app startup time, time-to-interactive
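The report categories above, together with the requirement that reports are only sent when the client is online, suggest a typed report plus a small local buffer. The sketch below is one possible shape, written in Rust for consistency with the server examples. All names (`ClientReport`, `ReportBuffer`) and field types are assumptions, and dropping the oldest report when the buffer is full is an illustrative policy that the document does not specify.

```rust
use std::collections::VecDeque;

/// One variant per report category listed above; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum ClientReport {
    Crash { stack_trace: String, device_info: String, app_version: String },
    SyncFailure { rpc_method: String, error_code: String, retry_count: u32, offline_secs: u64 },
    QueueState { depth_at_flush: usize, flush_succeeded: bool },
    Performance { startup_ms: u64, time_to_interactive_ms: u64 },
}

/// Bounded buffer that holds reports until the client is online.
pub struct ReportBuffer {
    capacity: usize,
    pending: VecDeque<ClientReport>,
    dropped: u64,
}

impl ReportBuffer {
    pub fn new(capacity: usize) -> Self {
        ReportBuffer { capacity, pending: VecDeque::new(), dropped: 0 }
    }

    /// Queues a report, evicting the oldest one when the buffer is full.
    pub fn push(&mut self, report: ClientReport) {
        if self.capacity == 0 {
            self.dropped += 1;
            return;
        }
        if self.pending.len() == self.capacity {
            self.pending.pop_front();
            self.dropped += 1;
        }
        self.pending.push_back(report);
    }

    /// Sends reports oldest-first; stops at the first failed send so the
    /// remaining reports stay queued. Returns how many were delivered.
    pub fn flush<F: FnMut(&ClientReport) -> bool>(&mut self, mut send: F) -> usize {
        let mut sent = 0;
        loop {
            let delivered = match self.pending.front() {
                Some(report) => send(report),
                None => break,
            };
            if !delivered {
                break;
            }
            self.pending.pop_front();
            sent += 1;
        }
        sent
    }

    pub fn len(&self) -> usize {
        self.pending.len()
    }

    pub fn dropped(&self) -> u64 {
        self.dropped
    }
}
```

The `send` closure stands in for the call to the TelemetryService RPC; returning `false` models a failed or offline send.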
Deployment
Observability services run as containers alongside the application:
# Added to docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  loki:
    image: grafana/loki:latest
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"

For white-label deployments, observability containers are optional: the customer can include them or point to their own monitoring infrastructure.
Modularity
The design is intentionally layered to allow swapping components:
| Layer | Current | Could Swap To |
|---|---|---|
| Instrumentation | tracing + OpenTelemetry | No reason to swap — this is the standard |
| Log backend | Loki | Datadog, Elasticsearch, CloudWatch |
| Metrics backend | Prometheus | Datadog, CloudWatch, managed Prometheus |
| Trace backend | Grafana (via OTLP) | Jaeger, Datadog, Honeycomb |
| Client reporting | Custom TelemetryService RPC | Sentry SDK, Crashlytics |
| Dashboards | Grafana | Datadog dashboards, managed Grafana |
Swapping any backend is a configuration change in the observability/ module or docker-compose.yml. Application code is unaffected.
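One way to keep a backend swap a pure configuration change is to resolve the backend from a single setting inside observability/. The sketch below assumes a hypothetical string setting (for example, an environment variable named `OBS_LOG_BACKEND`, which is not defined anywhere in this document) and a hypothetical `LogBackend` enum covering the log backends from the table above.

```rust
use std::str::FromStr;

/// Log backends from the table above; the enum itself is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogBackend {
    Loki,
    Datadog,
    Elasticsearch,
    CloudWatch,
}

impl FromStr for LogBackend {
    type Err = String;

    /// Parses a case-insensitive backend name, ignoring surrounding whitespace.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "loki" => Ok(LogBackend::Loki),
            "datadog" => Ok(LogBackend::Datadog),
            "elasticsearch" => Ok(LogBackend::Elasticsearch),
            "cloudwatch" => Ok(LogBackend::CloudWatch),
            other => Err(format!("unknown log backend: {}", other)),
        }
    }
}

/// Resolves the backend from an optional configured value.
/// Defaults to Loki, the current backend, when nothing is configured.
pub fn resolve_log_backend(configured: Option<&str>) -> Result<LogBackend, String> {
    match configured {
        None => Ok(LogBackend::Loki),
        Some(value) => value.parse(),
    }
}
```

At startup the value could be read with `std::env::var("OBS_LOG_BACKEND").ok()` and passed in via `as_deref()`. Rejecting unknown names with an error, rather than silently falling back, is a design choice that surfaces configuration typos early.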