Available on the Enterprise tier. Contact sales to learn more.
Ona managed metrics lets you opt in to having Ona monitor your runner’s operational health. When enabled, the runner pushes a curated set of Prometheus metrics to the Ona management plane, where the Ona team uses them to detect degradation and catch problems before they affect your developers. This is independent of self-managed metrics collection, where you configure your own Prometheus remote write endpoint. Both can run simultaneously — managed metrics go to Ona, self-managed metrics go to your own observability stack.

Why enable managed metrics

Without metrics visibility, Ona cannot detect runner issues until you report them. Managed metrics close this gap:
  • Proactive monitoring — Ona detects degraded runners, resource exhaustion, and elevated error rates before developers are impacted.
  • Faster incident resolution — When you contact support, the Ona team already has the operational context they need.
  • Zero setup on your side — No Prometheus endpoint to configure, no dashboards to build. Toggle it on and the runner handles the rest.
  • Non-sensitive data only — The curated metric set contains only operational counters, gauges, and histograms. No user data, source code, or secrets are ever sent.

Enabling managed metrics

  1. Go to Settings → Runners
  2. Select your runner
  3. Toggle Ona managed metrics
The setting saves immediately — no additional confirmation is needed.
[Image: Runner settings showing the Ona managed metrics toggle]
Metrics begin flowing within 60 seconds. No additional network configuration is required — the runner pushes metrics to the management plane over the same connection it already uses.

What metrics are reported

The curated metric set answers: “Is this runner healthy and performing well?” Metrics are filtered to a hardcoded allowlist in the runner binary. Adding or removing metrics requires a runner release, which provides a natural review gate.
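To make the filtering step concrete, here is a minimal sketch of allowlist matching in Python. The allowlist entries shown are drawn from the metric tables on this page, but the list itself, the `is_allowed` helper, and the use of glob-style matching are illustrative assumptions — the real allowlist is compiled into the runner binary and is not expressed this way.

```python
from fnmatch import fnmatch

# Hypothetical allowlist mirroring the documented metric set: exact metric
# names plus wildcard families. The real list lives in the runner binary.
ALLOWLIST = [
    "gitpod_runnerkit_active_instances",
    "workqueue_depth",
    "go_goroutines",
    "gitpod_llm_*",
]

def is_allowed(metric_name: str) -> bool:
    """Return True if the metric name matches any allowlist entry."""
    return any(fnmatch(metric_name, pattern) for pattern in ALLOWLIST)

collected = [
    "gitpod_runnerkit_active_instances",  # kept: exact match
    "gitpod_llm_tokens_total",            # kept: wildcard family
    "my_custom_debug_metric",             # dropped: not on the allowlist
]
print([name for name in collected if is_allowed(name)])
```

Anything not on the allowlist never leaves the runner, which is why extending the reported set requires a new runner release.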

Environment lifecycle

| Metric | Type | Description |
| --- | --- | --- |
| `gitpod_runnerkit_active_instances` | Gauge | Current environment count by state |
| `gitpod_runnerkit_environment_operation_duration_seconds` | Histogram | Duration of environment operations |
| `gitpod_runnerkit_function_calls_total` | Counter | Total function calls by function name |
| `gitpod_runnerkit_function_errors_total` | Counter | Total function errors by function name |

Supervisor connectivity

| Metric | Type | Description |
| --- | --- | --- |
| `gitpod_runnerkit_supervisor_status_events_total` | Counter | Supervisor status events received |
| `gitpod_runnerkit_supervisor_watch_closes_total` | Counter | Supervisor watch connection closes by reason |

Snapshots and warm pools

| Metric | Type | Description |
| --- | --- | --- |
| `snapshot_reconcile_duration_seconds` | Histogram | Snapshot processing time by phase |
| `snapshot_in_progress` | Gauge | Active snapshot operations |
| `snapshot_timeouts_total` | Counter | Snapshot timeouts |
| `warm_pool_reconcile_duration_seconds` | Histogram | Warm pool reconciliation duration |
| `warm_pool_deletions_total` | Counter | Warm pool instance deletions |

Work queue health

| Metric | Type | Description |
| --- | --- | --- |
| `workqueue_depth` | Gauge | Current queue depth |
| `workqueue_queue_duration_seconds` | Histogram | Time items spend waiting in queue |
| `workqueue_work_duration_seconds` | Histogram | Time spent processing items |
| `workqueue_retries_total` | Counter | Processing retries |

Agent and LLM

| Metric | Type | Description |
| --- | --- | --- |
| `gitpod_llm_*` | Various | LLM token usage, request latency, errors, and failover |
| `gitpod_mcp_*` | Various | MCP proxy request duration and count |
| `gitpod_memory_*` | Various | Redis memory usage, eviction cycles, and conversation age |
| `gitpod_skill_*` | Various | Skill API discovery metrics |

Runner operations

| Metric | Type | Description |
| --- | --- | --- |
| `gitpod_runner_manager_*` | Various | Runner manager operational metrics |
| `gitpod_runner_scm_*` | Various | SCM token refresh, cache invalidation, and TTL |
| `gitpod_runner_updater_*` | Various | Runner self-update duration and count |
| `gitpod_ip_allocator_*` | Various | IP address allocation metrics |
| `gitpod_kvstore_*` | Various | Key-value store operation metrics |

Process health

| Metric | Type | Description |
| --- | --- | --- |
| `go_goroutines` | Gauge | Current goroutine count |
| `go_memstats_alloc_bytes` | Gauge | Allocated heap memory |
| `process_cpu_seconds_total` | Counter | Total CPU time consumed |

Cloud-specific metrics

| Metric | Type | Description |
| --- | --- | --- |
| `gitpod_ec2_runner_*` | Various | EC2 runner operational metrics |
| `environment_error_errors_total` | Counter | Environment errors by code and component |

How it works

  1. The runner’s Prometheus registry collects metrics every 15 seconds (this already happens for normal runner operation).
  2. A managed metrics reporter filters the registry to the curated allowlist, encodes the result as a Prometheus remote write payload, and compresses it with Snappy.
  3. The runner pushes the compressed payload to the Ona management plane every 60 seconds over the existing authenticated connection.
  4. The management plane validates the payload, adds identifying labels (organization_id, runner_id, runner_region, runner_type), and forwards it to Ona’s internal monitoring infrastructure.
Typical payloads are 5–15 KB compressed. The reporter runs independently of any self-managed metrics endpoint you may have configured.
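Step 4 above can be sketched as a simple label merge. The four label names come from this page; the dict-based series representation, the `add_identity_labels` helper, and all label values are hypothetical — the management plane operates on remote write protobufs, not Python dicts.

```python
# Hypothetical identity labels; real values are assigned by the
# management plane, not configured by you.
IDENTITY_LABELS = {
    "organization_id": "org-123",
    "runner_id": "runner-abc",
    "runner_region": "eu-west-1",
    "runner_type": "aws-ec2",
}

def add_identity_labels(series: dict) -> dict:
    """Merge identity labels into a series' label set.

    The series' own labels are applied last, so a label already present
    in the payload is never silently overwritten.
    """
    labels = {**IDENTITY_LABELS, **series.get("labels", {})}
    return {**series, "labels": labels}

series = {"name": "go_goroutines", "labels": {"instance": "runner-abc"}, "value": 42}
print(add_identity_labels(series)["labels"])
```

The identity labels are what let Ona's monitoring attribute a degraded series to a specific runner and organization.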

Auditing reported metrics

Every metrics payload the runner sends to Ona is also written to your cloud storage bucket, so you can audit exactly what data leaves your network. Audit payloads are stored as Snappy-compressed Prometheus remote write protobufs at:
metrics/runner/{runner-id}/{YYYY}/{MM}/{DD}/{HHmmss}.pb.snappy
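The key layout above can be reproduced with a few lines of Python, which is handy for scripting audits over a date range. The `audit_object_key` helper and the `runner-abc` runner ID are illustrative; only the path template itself comes from this page.

```python
from datetime import datetime, timezone

def audit_object_key(runner_id: str, ts: datetime) -> str:
    """Build the storage key for one audit payload, following the documented
    layout: metrics/runner/{runner-id}/{YYYY}/{MM}/{DD}/{HHmmss}.pb.snappy
    """
    return (
        f"metrics/runner/{runner_id}/"
        f"{ts:%Y}/{ts:%m}/{ts:%d}/{ts:%H%M%S}.pb.snappy"
    )

ts = datetime(2026, 4, 10, 14, 30, 0, tzinfo=timezone.utc)
print(audit_object_key("runner-abc", ts))
# metrics/runner/runner-abc/2026/04/10/143000.pb.snappy
```

Since payloads are pushed every 60 seconds, expect roughly one object per minute per runner while managed metrics is enabled.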

Finding the bucket name

The bucket name is shown on your runner’s settings page. When Ona managed metrics is enabled, the bucket URI appears below the toggle. Click the copy icon to copy it.
[Image: Runner settings showing the metrics audit bucket URI with a copy button]

Listing and downloading audit payloads

# List all audit payloads for a runner
aws s3 ls s3://YOUR_LOGS_BUCKET/metrics/runner/YOUR_RUNNER_ID/ --recursive
# Download a specific payload
aws s3 cp s3://YOUR_LOGS_BUCKET/metrics/runner/YOUR_RUNNER_ID/2026/04/10/143000.pb.snappy ./payload.pb.snappy

Decoding audit payloads

Each .pb.snappy file is a Snappy-compressed Prometheus remote write WriteRequest protobuf. You can decode it with standard Prometheus tooling or any protobuf decoder:
# Using protoc (requires prometheus remote write proto definition)
cat payload.pb.snappy | python3 -c "
import sys, snappy
sys.stdout.buffer.write(snappy.decompress(sys.stdin.buffer.read()))
" | protoc --decode=prometheus.WriteRequest remote.proto
The decoded output shows every metric name, label set, and sample value — the exact data that was sent to Ona.

Privacy and data handling

  • Opt-in only. Disabled by default. You explicitly enable it per runner.
  • Push-based. The runner pushes metrics to Ona. Ona never reaches into your runner or network.
  • Non-sensitive. Only operational counters, gauges, and histograms. No user data, source code, environment variables, or secrets.
  • Non-interfering. Independent of any self-managed metrics endpoint. Both can run simultaneously.
  • Auditable. Every payload is persisted to your cloud storage for inspection.
  • Transparent. The metric allowlist is hardcoded in the runner binary. Changes require a runner release and are documented on this page.