

20260504.828
May 4, 2026

Warm pools now available on GCP

Warm pools keep pre-initialized Compute Engine instances in a suspended state, ready to resume when you create an environment. Instead of provisioning a new VM and loading the prebuild snapshot from scratch, Ona claims a suspended instance and resumes it. Startup drops from minutes to around 10 seconds.

Enable warm pools per environment class in your project's prebuild settings. The runner dynamically scales the pool between your configured minimum and maximum based on demand, and rotates instances automatically when new prebuilds complete.

Requires an Enterprise plan. See the warm pools documentation for prerequisites and setup instructions.

Infrastructure upgrade required

This release requires a Terraform module upgrade to v2.0.0 to enable warm pools and apply IAM changes.

New IAM permissions added to the runner custom role:
  • compute.autoscalers.create: Manage MIG autoscalers for dynamic warm pool scaling
  • compute.autoscalers.delete: Clean up autoscalers when warm pools are removed
  • compute.autoscalers.get: Read autoscaler state during reconciliation
  • compute.autoscalers.update: Adjust autoscaler targets as demand changes
  • compute.instanceGroupManagers.use: Required for the autoscaler to manage MIG instances
  • compute.instances.listReferrers: Discover which MIG owns a VM during warm pool operations
  • compute.instances.resume: Resume suspended warm pool VMs on claim
  • monitoring.timeSeries.create: Publish scaling metrics that drive the autoscaler
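If you maintain the custom role outside the module (for example, with pre-created service accounts), the additions map onto a `google_project_iam_custom_role` resource roughly like this; the resource name, `role_id`, and title below are illustrative, not the module's actual identifiers:

```hcl
# Sketch only: append the new permissions to your existing custom role.
resource "google_project_iam_custom_role" "runner" {
  role_id = "onaRunner" # illustrative; keep your existing role_id
  title   = "Ona Runner"
  permissions = [
    # ...existing permissions...
    "compute.autoscalers.create",
    "compute.autoscalers.delete",
    "compute.autoscalers.get",
    "compute.autoscalers.update",
    "compute.instanceGroupManagers.use",
    "compute.instances.listReferrers",
    "compute.instances.resume",
    "monitoring.timeSeries.create",
  ]
}
```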
IAM role binding changes:
  • The project-level iam.serviceAccounts.actAs and iam.serviceAccounts.getAccessToken permissions have been removed from the runner custom role.
  • Instead, the runner SA is granted roles/iam.serviceAccountUser on three specific service accounts: runner_sa, environment_vm_sa, and proxy_vm_sa. This limits impersonation to only the SAs the runner attaches to instances.
  • The runner assets bucket role has been elevated from roles/storage.objectViewer to roles/storage.objectAdmin to support writing managed metrics audit payloads.
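For setups that manage these bindings directly, the scoped grant that replaces the project-level permissions can be sketched with `google_service_account_iam_member`; the Terraform resource names below are hypothetical stand-ins for however your configuration declares the three service accounts:

```hcl
# Grant the runner SA impersonation rights on only the three SAs it
# attaches to instances, instead of a project-wide actAs. Names are
# illustrative.
resource "google_service_account_iam_member" "runner_user" {
  for_each = {
    runner         = google_service_account.runner_sa.name
    environment_vm = google_service_account.environment_vm_sa.name
    proxy_vm       = google_service_account.proxy_vm_sa.name
  }

  service_account_id = each.value
  role               = "roles/iam.serviceAccountUser"
  member             = "serviceAccount:${google_service_account.runner_sa.email}"
}
```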
Other infrastructure changes:
  • Unused service accounts (build_cache, secret_manager, pubsub_processor) are removed.
  • Environment UDP egress is now restricted to DNS, NTP, and QUIC.
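The tightened UDP egress rule can be pictured as a `google_compute_firewall` resource along these lines; the specific ports are an assumption here (53 for DNS, 123 for NTP, 443 for QUIC), as are the network and tag names:

```hcl
# Sketch of restricted UDP egress for environment VMs. Ports and
# names are assumptions, not the module's exact values.
resource "google_compute_firewall" "environment_udp_egress" {
  name      = "environment-udp-egress"
  network   = google_compute_network.runner.name
  direction = "EGRESS"

  allow {
    protocol = "udp"
    ports    = ["53", "123", "443"] # DNS, NTP, QUIC
  }

  destination_ranges = ["0.0.0.0/0"]
  target_tags        = ["environment-vm"]
}
```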

Upgrade steps

  1. Update the version constraint in your main.tf module block to v2.0.0. See the release page for details.
  2. Run terraform init -upgrade to fetch the new module.
  3. Run terraform plan -out=tfplan and review the changes, paying attention to IAM and firewall rule updates.
  4. Run terraform apply tfplan.
  5. If you use pre-created service accounts, you must:
    • Add the new custom role permissions listed above.
    • Grant roles/iam.serviceAccountUser on the runner_sa, environment_vm_sa, and proxy_vm_sa service accounts to the runner SA.
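Step 1 amounts to bumping the version pin in your module block; a minimal sketch, with the module name and source left as whatever your main.tf already uses:

```hcl
module "ona_runner" {
  source  = "..."    # keep your existing source address
  version = "2.0.0"  # bump to pull in warm pools and the IAM changes

  # ...existing inputs unchanged...
}
```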
Full walkthrough: Upgrade GCP runner infrastructure

What else is in this release

  • Managed metrics pipeline lets you export runner metrics via Prometheus remote_write for monitoring runner health, environment lifecycle, and resource utilization. Contact your account team to enable it.
  • Quota and capacity errors from GCP are now surfaced as clear machine failure messages instead of generic errors.
  • Automation services support a configurable readiness timeout, preventing services from hanging indefinitely when a health check never passes.
  • Orphaned MIGs, autoscalers, instance templates, and warm pool instances are automatically cleaned up, preventing resource leaks.
  • Environment startup is faster. Supervisor initialization steps now run concurrently, disk pre-warming prioritizes startup-critical paths, and git configuration runs in fewer round trips.
  • Warm pool claim reliability is improved. The runner picks the oldest available instance, skips in-flight instances, and recovers the default network route after resuming a suspended VM.
  • Async VM creation failures are now surfaced via Pub/Sub instead of silently failing.
  • Log line ordering within the same timestamp is now preserved.
  • The agent operations proxy is more resilient to transient connection failures.
  • Prebuild snapshots no longer carry stale git identity from the prebuild executor.
  • File watch self-healing works correctly when a denylisted file is unlinked and recreated inside Docker-in-Docker.
  • The runner recovers gracefully from stale gitconfig lock files.
  • Updated go-jose/v4 to v4.1.4 (High severity, GHSA-78h2-9frx-2jm8).
  • Updated go.opentelemetry.io/otel/sdk to v1.43.0 (High severity).
  • Updated Node.js to v24.14.1 (High severity).
  • Updated base container images and Prometheus for CVE fixes.
  • Go toolchain bumped to go1.26.2 (fixes CVE-2026-27143).
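For the managed metrics pipeline above, a Prometheus remote_write stanza typically takes this shape; the endpoint URL and credentials below are placeholders, since your account team supplies the real values when enabling the feature:

```yaml
# Placeholder values only; substitute the endpoint and credentials
# provided when the managed metrics pipeline is enabled.
remote_write:
  - url: https://metrics.example.com/api/v1/write
    basic_auth:
      username: runner-metrics
      password_file: /etc/prometheus/remote-write-password
```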