# Alerts and Dashboards

<Note>Available on the Enterprise plan. [Contact sales](https://ona.com/contact/sales) to learn more.</Note>

Ona provides pre-built Grafana alerts and a dashboard for GCP Runners. These live in the [`terraform-google-ona-runner`](https://github.com/gitpod-io/terraform-google-ona-runner) repository under the [`monitoring/`](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring) directory and are designed to work with the Prometheus metrics your runner already exposes.

Before using these alerts and dashboards, you need to configure metrics collection on your runner. See [Custom metrics pipeline](/ona/runners/monitoring-and-metrics) for setup instructions.

## Prerequisites

* A deployed GCP Runner with [metrics collection enabled](/ona/runners/monitoring-and-metrics)
* A Grafana instance (or compatible alerting system) connected to your Prometheus data source

## Dashboard

The repository includes a Grafana dashboard at [`monitoring/dashboards/gitpod-runner-overview.json`](https://github.com/gitpod-io/terraform-google-ona-runner/blob/main/monitoring/dashboards/gitpod-runner-overview.json).

### What it covers

| Section                      | What it shows                                                     |
| ---------------------------- | ----------------------------------------------------------------- |
| **Version & Replicas**       | Runner version tracking and replica count                         |
| **Health Status**            | Health checks and active instance states by lifecycle stage       |
| **GCP Runner Kit Interface** | Environment operation durations, function calls, error rates      |
| **GCP API Operations**       | API request metrics, success rates, error rates, latency heatmaps |
| **KV Store Operations**      | Redis/key-value store operation rates and durations               |
| **PubSub Operations**        | Message processing, acknowledgments, connection health            |
| **Environment Operations**   | Compute environment operation rates and durations                 |
| **System Metrics**           | Host-level CPU, memory, disk usage, and disk I/O                  |
| **WRI**                      | Workspace Runtime Interface performance metrics                   |

The dashboard uses template variables (`$project_id`, `$region`, `$runner_name`, `$instance`) so you can filter by deployment.

### Import the dashboard

1. In Grafana, go to **Dashboards → Import**
2. Upload `gitpod-runner-overview.json` from the repository
3. Select your Prometheus data source
4. Configure the template variables to match your deployment

## Alerts

The repository includes 19 alert definitions at [`monitoring/alerts/`](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts), each in its own folder with an `alert.yaml` (Grafana-compatible alert rule) and a `runbook.md` (troubleshooting steps).
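As a quick check of that layout, the following sketch lists each alert folder in a local clone and flags any folder missing one of the two files. It assumes you have cloned the repository yourself; the function name `list_alerts` is illustrative and not part of the repository.

```bash
# List every alert folder under the given directory and report whether
# it contains both alert.yaml and runbook.md.
list_alerts() {
  local alerts_dir="$1"   # e.g. terraform-google-ona-runner/monitoring/alerts
  local dir name missing
  for dir in "$alerts_dir"/*/; do
    [ -d "$dir" ] || continue          # skip if the glob matched nothing
    name=$(basename "$dir")
    missing=""
    [ -f "$dir/alert.yaml" ] || missing="$missing alert.yaml"
    [ -f "$dir/runbook.md" ] || missing="$missing runbook.md"
    if [ -z "$missing" ]; then
      printf '%s: ok\n' "$name"
    else
      printf '%s: missing%s\n' "$name" "$missing"
    fi
  done
}
```

Run it as `list_alerts terraform-google-ona-runner/monitoring/alerts` from the directory that contains your clone.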

### Alert overview

#### Critical — immediate response required

These indicate a service outage or severe degradation.

| Alert                                                                                                                     | Condition                                         | Impact                                                       |
| ------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------- | ------------------------------------------------------------ |
| [Service Down](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/service-down)         | Runner or auth proxy `up` metric is 0 for >1 min  | Complete outage — users cannot create or manage environments |
| [High Error Rate](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/high-error-rate)   | >10% of environment operations failing over 5 min | Users experiencing environment creation failures             |
| [High Latency](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/high-latency)         | 95th percentile operation time >5 min             | Slow environment operations                                  |
| [Goroutine Panics](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/goroutine-panics) | Application panics detected                       | Potential service instability                                |
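To make the High Error Rate condition concrete, here is a minimal sketch of the same comparison applied to two hypothetical counts (failed and total operations in a 5-minute window). It is not the query Grafana evaluates; the actual rule is defined in the alert's `alert.yaml`. The function name and arguments are illustrative.

```bash
# Succeed (exit 0) when failed/total is strictly greater than threshold_pct.
# Uses integer-safe cross multiplication instead of a floating-point division.
exceeds_error_rate() {
  local failed="$1" total="$2" threshold_pct="${3:-10}"
  awk -v f="$failed" -v t="$total" -v p="$threshold_pct" \
    'BEGIN { exit !(t > 0 && f * 100 > p * t) }'
}

# 12 failures out of 100 operations is above the default 10% threshold.
if exceeds_error_rate 12 100; then
  echo "would fire"
fi
```

Exactly 10 failures out of 100 does not satisfy the condition, because the comparison is strictly greater than the threshold.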

#### High — prompt attention required

These indicate degraded performance or functionality.

| Alert                                                                                                                                     | Condition                                                 |
| ----------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------- |
| [API Rate Limiting](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/api-rate-limiting)               | Hitting GCP API rate limits                               |
| [PubSub Backlog](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/pubsub-backlog)                     | >1000 unprocessed messages                                |
| [PubSub Connection Health](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/pubsub-connection-health) | PubSub connectivity issues                                |
| [Circuit Breaker Open](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/circuit-breaker-open)         | Circuit breaker has opened to prevent cascading failures  |
| [Redis Connection Issues](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/redis-connection-issues)   | Redis connectivity problems                               |

#### Medium — monitor and track

These indicate reduced capacity or resource constraints.

| Alert                                                                                                                                       | Condition                                |
| ------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------- |
| [High CPU Usage](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/high-cpu-usage)                       | CPU usage >80% for an extended period    |
| [High Memory Usage](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/high-memory-usage)                 | Memory usage >80% for an extended period |
| [High Disk Usage](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/high-disk-usage)                     | Disk usage >85% for an extended period   |
| [Network Connection Health](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/network-connection-health) | Network connectivity issues              |
| [Network Errors](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/network-errors)                       | High rate of network errors              |
| [Registry Health](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/registry-health)                     | Container registry connectivity issues   |
| [Zone Capacity Issues](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/zone-capacity-issues)           | GCP zone unavailable or at capacity      |
| [Quota Exceeded](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/quota-exceeded)                       | GCP resource quotas hit limits           |

#### Info — optimization opportunities

| Alert                                                                                                                                       | Condition                                         |
| ------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------- |
| [High Process Memory Usage](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/high-process-memory-usage) | Process memory usage >1 GB for an extended period |
| [High Goroutine Count](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring/alerts/high-goroutine-count)           | >1000 active goroutines                           |

### Import alerts into Grafana

Each alert folder contains an `alert.yaml` file:

1. In Grafana, go to **Alerting → Alert Rules**
2. Click **Import**
3. Upload the `alert.yaml` from the alert folder you want (e.g., `service-down/alert.yaml`)
4. Configure notification channels for the alert's severity level
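Because every rule file is named `alert.yaml`, it can help to collect the ones you plan to upload into a single directory under distinct names. The sketch below does that for a local clone; `stage_alerts` and the `./staged` directory are hypothetical names chosen for illustration.

```bash
# Copy the alert.yaml of each named alert into one staging directory,
# renamed after its folder so the files do not overwrite each other.
stage_alerts() {
  local alerts_dir="$1" dest="$2"
  shift 2
  local name src
  mkdir -p "$dest"
  for name in "$@"; do
    src="$alerts_dir/$name/alert.yaml"
    if [ -f "$src" ]; then
      cp "$src" "$dest/$name.yaml"
    else
      echo "skipped: $name (no alert.yaml found)" >&2
    fi
  done
}

# Example: stage the four Critical alerts from a local clone.
# stage_alerts terraform-google-ona-runner/monitoring/alerts ./staged \
#   service-down high-error-rate high-latency goroutine-panics
```

The staged files can then be uploaded one at a time using the steps above.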

### Customize thresholds

The default thresholds work for most deployments. Adjust them based on your scale:

* **Smaller deployments** may need lower thresholds to catch issues earlier
* **Larger deployments** may need higher thresholds to reduce noise
* **Development environments** may want less sensitive alerts

### Runbooks

Each alert folder also contains a `runbook.md` with step-by-step troubleshooting instructions. Before using a runbook, set up these environment variables:

```bash
export PROJECT_ID="your-gcp-project-id"
export REGION="your-region"          # e.g., us-central1
export RUNNER_ID="your-runner-id"    # from your Terraform configuration
```

The runbooks use `gcloud compute ssh` commands to inspect the runner instance and include resolution steps and escalation procedures.
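A small guard like the following can catch an unset variable before you start a runbook. This is a sketch, not something shipped in the repository; `check_runbook_env` is a hypothetical helper name.

```bash
# Return non-zero and name each variable the runbooks expect
# (PROJECT_ID, REGION, RUNNER_ID) that is unset or empty.
check_runbook_env() {
  local var value missing=0
  for var in PROJECT_ID REGION RUNNER_ID; do
    eval "value=\${$var:-}"            # portable indirect lookup
    if [ -z "$value" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Call `check_runbook_env` in the shell where you exported the variables; it prints nothing and returns 0 when all three are set.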

## Notification channels

Configure notification channels in Grafana based on alert severity:

| Severity | Suggested channels      |
| -------- | ----------------------- |
| Critical | PagerDuty, SMS, phone   |
| High     | Slack, email            |
| Medium   | Email, ticket creation  |
| Info     | Email, dashboard review |
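If you manage Grafana through file provisioning, the routing in the table can be sketched as a notification policy file. The example below assumes the alert rules carry a `severity` label with values such as `critical` and `high`; check the labels in each `alert.yaml` before relying on that. The contact point names (`pagerduty-oncall`, `slack-runner-alerts`, `email-default`) are hypothetical and would need to exist in your Grafana instance.

```bash
# Write a Grafana notification-policy provisioning file that routes on an
# assumed `severity` label. Contact point names are placeholders.
write_notification_policy() {
  local out="${1:-notification-policies.yaml}"
  cat > "$out" <<'EOF'
apiVersion: 1
policies:
  - orgId: 1
    receiver: email-default            # fallback for Medium and Info
    routes:
      - receiver: pagerduty-oncall
        object_matchers:
          - ['severity', '=', 'critical']
      - receiver: slack-runner-alerts
        object_matchers:
          - ['severity', '=', 'high']
EOF
}
```

Provisioning a policy tree generally replaces the one already configured, so review the generated file against your existing routing before placing it in Grafana's provisioning directory.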

## Next steps

* [Custom metrics pipeline](/ona/runners/monitoring-and-metrics) — Enable metrics collection and see all available metrics
* [Troubleshooting GCP Runners](/ona/runners/gcp/troubleshooting-runners) — Diagnose common runner issues
* [`monitoring/` on GitHub](https://github.com/gitpod-io/terraform-google-ona-runner/tree/main/monitoring) — Browse alert definitions and dashboard source
