20260508.526
May 8, 2026

Zone failover for capacity errors

GCP Runner · 20260508.526

If a GCP zone lacks capacity to create an environment, the runner now retries in a different zone automatically. This reduces the impact of zonal capacity exhaustion, though it does not eliminate it entirely. Runners configured with multiple zones benefit most.
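The failover behavior described above can be pictured with a minimal Python sketch. It is not the runner's implementation: `ZoneCapacityError`, `create_with_zone_failover`, and `fake_create` are hypothetical names chosen for illustration, and the sketch assumes zones are simply tried in their configured order.

```python
from typing import Callable, Sequence


class ZoneCapacityError(Exception):
    """Hypothetical error for a zone that cannot supply the requested VM."""


def create_with_zone_failover(
    zones: Sequence[str],
    create_in_zone: Callable[[str], str],
) -> str:
    """Try each zone in order and move on only after a capacity error."""
    if not zones:
        raise ValueError("at least one zone is required")
    last_error = None
    for zone in zones:
        try:
            return create_in_zone(zone)
        except ZoneCapacityError as err:
            # Capacity exhaustion is zonal, so the next zone may still succeed.
            last_error = err
    # Every configured zone was out of capacity: surface the last failure.
    raise last_error


def fake_create(zone: str) -> str:
    """Stand-in for VM creation that simulates one exhausted zone."""
    if zone == "us-central1-a":
        raise ZoneCapacityError(f"no capacity in {zone}")
    return f"vm-in-{zone}"


print(create_with_zone_failover(["us-central1-a", "us-central1-b"], fake_create))
# prints "vm-in-us-central1-b"
```

With a single configured zone the loop has nowhere to fall back to and the capacity error propagates, which is why multi-zone runners benefit most.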

Infrastructure upgrade required

This release requires a Terraform module upgrade to v2.0.1 (Terraform Registry).

Key infrastructure changes:
  • SSH access restricted to IAP-only (port 22 no longer open to 0.0.0.0/0).
  • Shielded VM hardening enabled with Secure Boot, vTPM, and integrity monitoring. Project-wide SSH keys blocked on runner and proxy VMs.
  • Flow logging added to security-critical firewall rules.
  • Memory and CPU limits added to all Docker containers on the runner VM.
  • TLS certificate rotation fixed for the auth proxy.
  • Honeycomb API key removed from Terraform configuration and VM metadata.
  • Managed metrics direct push enabled for the metrics pipeline.

Upgrade steps

  1. Update the version constraint in your main.tf module block to v2.0.1. See the release page for details.
  2. Run terraform init -upgrade to fetch the new module.
  3. Run terraform plan -out=tfplan and review the changes, paying attention to firewall and shielded VM settings.
  4. Run terraform apply tfplan.
Full walkthrough: Upgrade GCP runner infrastructure

What else is in this release

  • The Terraform module version used to provision your runner infrastructure is now displayed on the runner details page in the dashboard.
  • External user IDs are now resolved for Bitbucket and GitLab auth tokens, enabling user attribution in Insights across all SCM providers.
  • Environments that fail to start within 10 minutes (supervisor never connects) are now stopped automatically instead of staying in “starting” indefinitely.
  • Workspace folder path is correctly reported during environment creation when dotfiles are configured.
  • Supervisor retries asset downloads on SHA-256 mismatch instead of failing permanently.
  • File watch self-healing works reliably under Docker-in-Docker (fuse-overlayfs) after file unlink and recreate.
  • Agent conversations no longer stall silently when the model pauses mid-turn.
  • Credentials (AWS keys, GitHub tokens, basic-auth URLs, bearer tokens, JWTs) are now redacted from environment status messages, on-disk state files, and process-output logs.
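To give a feel for what credential redaction involves, here is a minimal Python sketch based on regular expressions. The patterns and the `redact` helper are assumptions made for illustration and do not reflect the product's actual detection rules. This sketch covers AWS access key IDs (not secret keys), GitHub tokens, basic-auth URLs, bearer tokens, and JWTs.

```python
import re

# Illustrative patterns only; real detection rules are likely broader.
_PATTERNS = [
    # Bearer tokens in headers or log lines (may also match ordinary prose).
    (re.compile(r"\bbearer\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE), "Bearer [REDACTED]"),
    # JWT: three base64url segments, the first two starting with "eyJ".
    (re.compile(r"\beyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"), "[REDACTED_JWT]"),
    # AWS access key IDs (long-term AKIA and temporary ASIA prefixes).
    (re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"), "[REDACTED_AWS_KEY_ID]"),
    # GitHub tokens such as ghp_, gho_, ghu_, ghs_, ghr_.
    (re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"), "[REDACTED_GITHUB_TOKEN]"),
    # user:password@ credentials embedded in a URL.
    (
        re.compile(r"(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*://)[^/\s:@]+:[^/\s@]+@"),
        r"\g<scheme>[REDACTED]@",
    ),
]


def redact(message: str) -> str:
    """Return the message with every matched credential replaced by a marker."""
    for pattern, replacement in _PATTERNS:
        message = pattern.sub(replacement, message)
    return message


print(redact("clone failed: https://user:s3cret@example.com/repo.git"))
# prints "clone failed: https://[REDACTED]@example.com/repo.git"
```

Applying one function at every output boundary (status messages, state files, process logs) keeps the rules in a single place.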
20260504.828
May 4, 2026

Warm pools now available on GCP

Warm pools keep pre-initialized Compute Engine instances in a suspended state, ready to resume when you create an environment. Instead of provisioning a new VM and loading the prebuild snapshot from scratch, Ona claims a suspended instance and resumes it. Startup drops from minutes to around 10 seconds.

Enable warm pools per environment class in your project’s prebuild settings. The runner dynamically scales the pool between your configured minimum and maximum based on demand, and rotates instances automatically when new prebuilds complete.

Requires an Enterprise plan. See the warm pools documentation for prerequisites and setup instructions.

Infrastructure upgrade required

This release requires a Terraform module upgrade to v2.0.0 to enable warm pools and apply IAM changes.

New IAM permissions added to the runner custom role:

Permission                           Purpose
compute.autoscalers.create           Manage MIG autoscalers for dynamic warm pool scaling
compute.autoscalers.delete           Clean up autoscalers when warm pools are removed
compute.autoscalers.get              Read autoscaler state during reconciliation
compute.autoscalers.update           Adjust autoscaler targets as demand changes
compute.instanceGroupManagers.use    Required for autoscaler to manage MIG instances
compute.instances.listReferrers      Discover which MIG owns a VM during warm pool operations
compute.instances.resume             Resume suspended warm pool VMs on claim
monitoring.timeSeries.create         Publish scaling metrics that drive the autoscaler

IAM role binding changes:
  • The project-level iam.serviceAccounts.actAs and iam.serviceAccounts.getAccessToken permissions have been removed from the runner custom role.
  • Instead, the runner SA is granted roles/iam.serviceAccountUser on three specific service accounts: runner_sa, environment_vm_sa, and proxy_vm_sa. This limits impersonation to only the SAs the runner attaches to instances.
  • The runner assets bucket role has been elevated from roles/storage.objectViewer to roles/storage.objectAdmin to support writing managed metrics audit payloads.
Other infrastructure changes:
  • Unused service accounts (build_cache, secret_manager, pubsub_processor) are removed.
  • Environment UDP egress is now restricted to DNS, NTP, and QUIC.

Upgrade steps

  1. Update the version constraint in your main.tf module block to v2.0.0. See the release page for details.
  2. Run terraform init -upgrade to fetch the new module.
  3. Run terraform plan -out=tfplan and review the changes, paying attention to IAM and firewall rule updates.
  4. Run terraform apply tfplan.
  5. If you use pre-created service accounts, you must:
    • Add the new custom role permissions listed above.
    • Grant roles/iam.serviceAccountUser on the runner_sa, environment_vm_sa, and proxy_vm_sa service accounts to the runner SA.
Full walkthrough: Upgrade GCP runner infrastructure
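If you manage pre-created service accounts, a quick check for missing permissions can help before you apply. The Python sketch below is one possible approach. It assumes you have exported your custom role as JSON with its permissions under an `includedPermissions` key (the field name used by the GCP IAM role resource). The helper name `missing_permissions` is hypothetical.

```python
import json

# The eight permissions added to the runner custom role in this release.
REQUIRED_PERMISSIONS = frozenset({
    "compute.autoscalers.create",
    "compute.autoscalers.delete",
    "compute.autoscalers.get",
    "compute.autoscalers.update",
    "compute.instanceGroupManagers.use",
    "compute.instances.listReferrers",
    "compute.instances.resume",
    "monitoring.timeSeries.create",
})


def missing_permissions(role_json: str) -> list[str]:
    """Return the required permissions that the exported role does not include."""
    role = json.loads(role_json)
    # Assumption: permissions are listed under "includedPermissions".
    granted = set(role.get("includedPermissions", []))
    return sorted(REQUIRED_PERMISSIONS - granted)


sample_role = json.dumps(
    {"includedPermissions": ["compute.autoscalers.get", "compute.instances.resume"]}
)
for permission in missing_permissions(sample_role):
    print(permission)
```

An empty result means the role already carries every new permission. The roles/iam.serviceAccountUser bindings still need to be checked separately.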

What else is in this release

  • The managed metrics pipeline lets you export runner metrics via Prometheus remote_write so you can monitor runner health, environment lifecycle, and resource utilization. Contact your account team to enable it.
  • Quota and capacity errors from GCP are now surfaced as clear machine failure messages instead of generic errors.
  • Automation services support a configurable readiness timeout, preventing services from hanging indefinitely when a health check never passes.
  • Orphaned MIGs, autoscalers, instance templates, and warm pool instances are automatically cleaned up, preventing resource leaks.
  • Environment startup is faster. Supervisor initialization steps now run concurrently, disk pre-warming prioritizes startup-critical paths, and git configuration runs in fewer round trips.
  • Warm pool claim reliability is improved. The runner picks the oldest available instance, skips in-flight instances, and recovers the default network route after resuming a suspended VM.
  • Async VM creation failures are now surfaced via Pub/Sub instead of silently failing.
  • Log line ordering within the same timestamp is now preserved.
  • The agent operations proxy is more resilient to transient connection failures.
  • Prebuild snapshots no longer carry stale git identity from the prebuild executor.
  • File watch self-healing works correctly when a denylisted file is unlinked and recreated inside Docker-in-Docker.
  • The runner recovers gracefully from stale gitconfig lock files.
  • Updated go-jose/v4 to v4.1.4 (High severity, GHSA-78h2-9frx-2jm8).
  • Updated go.opentelemetry.io/otel/sdk to v1.43.0 (High severity).
  • Updated Node.js to v24.14.1 (High severity).
  • Updated base container images and Prometheus for CVE fixes.
  • Go toolchain bumped to go1.26.2 (fixes CVE-2026-27143).
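
The warm pool claim rule noted in the list above (pick the oldest available instance, skip in-flight ones) can be sketched in a few lines of Python. This is an illustration under assumptions, not the runner's code: `PoolInstance`, its fields, and `pick_claim_candidate` are hypothetical names, and "in flight" is modeled as a simple boolean flag.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PoolInstance:
    name: str
    created_at: float  # Unix timestamp of instance creation
    in_flight: bool    # assumed flag: another claim or resume is in progress


def pick_claim_candidate(instances: Sequence[PoolInstance]) -> Optional[PoolInstance]:
    """Return the oldest instance that is not in flight, or None if none is free."""
    available = [inst for inst in instances if not inst.in_flight]
    if not available:
        return None
    return min(available, key=lambda inst: inst.created_at)


pool = [
    PoolInstance("warm-1", created_at=100.0, in_flight=True),
    PoolInstance("warm-2", created_at=200.0, in_flight=False),
    PoolInstance("warm-3", created_at=300.0, in_flight=False),
]
print(pick_claim_candidate(pool).name)
# prints "warm-2"
```

Claiming the oldest free instance first drains older instances before newer ones, which fits the rotation that happens when new prebuilds complete.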