# Troubleshooting AWS runners

> Troubleshoot AWS runner issues.

Network misconfigurations are the most common cause of issues. See [access requirements](/ona/runners/aws/detailed-access-requirements) first.

## CloudFormation stack fails

**Symptoms:** `ROLLBACK_COMPLETE` or `ROLLBACK_IN_PROGRESS` with errors like `Parameter validation failed: parameter value for EC2RunnerInstancesSubnet does not exist.`

**Fix:** Ensure that you select a VPC, at least one Availability Zone (AZ), and subnets that span multiple AZs.
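
When a stack rolls back, the earliest failed event usually names the root cause, and later failures are often cascading cancellations. The sketch below is one way to pull those events out. It assumes you have saved the JSON output of `aws cloudformation describe-stack-events --stack-name <your-stack>` and relies on that output listing events newest first. The function names are illustrative, not part of any Ona tooling.

```python
def failed_events(stack_events: dict) -> list[dict]:
    """Return events whose status ends in _FAILED, oldest first."""
    events = stack_events.get("StackEvents", [])
    failed = [e for e in events if e.get("ResourceStatus", "").endswith("_FAILED")]
    # describe-stack-events lists the newest event first, so reverse the order.
    return list(reversed(failed))


def summarize(stack_events: dict) -> list[str]:
    """One line per failed resource: logical ID and the reported reason."""
    return [
        f'{e.get("LogicalResourceId")}: {e.get("ResourceStatusReason", "no reason given")}'
        for e in failed_events(stack_events)
    ]

# Example: print("\n".join(summarize(json.load(open("events.json")))))
```

The first line of the summary is usually the one to act on.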

## Runner task fails

**Symptoms:**

* `CREATE_FAILED` with `ECS Deployment Circuit Breaker was triggered`
* `ResourceInitializationError` in task logs
* Cannot pull images or access AWS services

**Fix:**

* Verify the VPC has an Internet Gateway (IGW) or a NAT Gateway
* Update route tables (public → IGW, private → NAT)
* For private subnets, add VPC endpoints for Secrets Manager, S3, and ECR
* Check that security groups allow outbound HTTPS (TCP port 443)
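
To confirm the route table check, you can classify where each route table sends internet-bound traffic. This is a minimal sketch that assumes you pass it one entry from the `RouteTables` list returned by `aws ec2 describe-route-tables`. The function name and the returned labels are illustrative.

```python
def default_route_target(route_table: dict) -> str:
    """Classify the 0.0.0.0/0 route of one route table.

    Returns "igw", "nat", "blackhole", "other", or "none" (no default route).
    """
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") == "blackhole":
            # The target (for example a deleted NAT Gateway) no longer exists.
            return "blackhole"
        if route.get("NatGatewayId"):
            return "nat"
        if route.get("GatewayId", "").startswith("igw-"):
            return "igw"
        return "other"
    return "none"
```

Public subnets are expected to report `igw` and private subnets `nat`. A result of `none` or `blackhole` means the subnet has no working default route.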

## Instance type not available

**Symptoms:** Error like "m6i.xlarge is not available in us-east-1e"

**Fix:**

* Use multiple AZs (avoid relying only on `us-east-1d` or `us-east-1e`)
* Try a different region or instance type
* [Update stack parameters](/ona/runners/aws/update-runner#update-parameters) or [create a new environment class](/ona/runners/aws/environment-classes)
* Retry later (availability is transient)
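
Before choosing subnets, it can help to list which AZs offer the instance type at all. The sketch below assumes you have the JSON output of `aws ec2 describe-instance-type-offerings --location-type availability-zone --filters Name=instance-type,Values=m6i.xlarge --region us-east-1`. The function name is illustrative. An offering only shows that the type is sold in that AZ, not that capacity is free right now.

```python
def zones_offering(offerings: dict, instance_type: str) -> list[str]:
    """Return the sorted AZ names that offer the given instance type."""
    return sorted(
        o["Location"]
        for o in offerings.get("InstanceTypeOfferings", [])
        if o.get("InstanceType") == instance_type
        and o.get("LocationType") == "availability-zone"
    )
```

If an AZ used by your subnets is missing from the result, move those subnets to a listed AZ or pick another instance type.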

## Unexpected costs

**Symptoms:** Unexpected AWS charges, or continued billing after deleting a runner.

**Fix:**

* See [managing costs](/ona/runners/aws/aws-runner-costs#controls-for-managing-costs) to identify resources
* After [deleting a runner](/ona/runners/aws/delete-runner), verify the CloudFormation stack is fully deleted
* Check for residual EC2 instances or EBS volumes and delete them manually

## SSM access blocked

**Symptoms:**

* Environments fail with `AWS account policy blocks ssm:SendCommand`
* Runner marked as degraded
* Slow startup (cache credentials can't refresh)

**Cause:** Service Control Policies (SCPs) blocking SSM access. The runner needs `ssm:SendCommand` and `ssm:GetCommandInvocation` permissions.

**Fix:** Ask your AWS administrator to add an exception for the runner's IAM role:

```json theme={null}
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["ssm:SendCommand", "ssm:GetCommandInvocation"],
    "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ssm:*:*:command/*"]
  }]
}
```
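
If your administrator can share the SCP document, a quick scan can show which statements are likely to block the runner. This is a rough sketch, not a policy evaluator: it only matches `Deny` statements by `Action` pattern and ignores `NotAction`, `Resource`, and `Condition`, so a hit means "possibly denied". The function and constant names are illustrative.

```python
from fnmatch import fnmatchcase

SSM_ACTIONS = ("ssm:SendCommand", "ssm:GetCommandInvocation")


def _as_list(value):
    """Policy fields may hold a single item or a list; normalize to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, (str, dict)) else list(value)


def possibly_denied(policy: dict, required=SSM_ACTIONS) -> list[str]:
    """Return the required actions matched by at least one Deny statement."""
    hits = []
    for action in required:
        for stmt in _as_list(policy.get("Statement")):
            if stmt.get("Effect") != "Deny":
                continue
            patterns = _as_list(stmt.get("Action"))
            # Action names are case-insensitive and may contain * or ? wildcards.
            if any(fnmatchcase(action.lower(), p.lower()) for p in patterns):
                hits.append(action)
                break
    return hits
```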

## Prebuilds fail due to policy restrictions

**Symptoms:**

* Prebuilds fail with `AWS Service Control Policy blocks ec2:CreateSnapshot`
* Prebuilds fail with `AWS IAM policy does not allow ec2:CreateSnapshot`
* Similar errors for `ec2:RegisterImage`, `ec2:DescribeSnapshots`, or `ec2:DescribeImages`

**Cause:** Prebuilds require creating EBS snapshots and AMIs. These operations can be blocked by:

1. **Service Control Policies (SCPs)** - Organization-level policies that deny EC2 snapshot/AMI actions
2. **IAM policies** - The runner's IAM role is missing required permissions (outdated CloudFormation stack)

### Fix for SCP restrictions

Ask your AWS administrator to allow these actions for the runner's IAM role in the SCP:

```json theme={null}
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "ec2:CreateSnapshot",
      "ec2:RegisterImage",
      "ec2:DescribeSnapshots",
      "ec2:DescribeImages",
      "ec2:DeleteSnapshot",
      "ec2:DeregisterImage"
    ],
    "Resource": "*"
  }]
}
```

### Fix for IAM policy restrictions

[Update your CloudFormation stack](/ona/runners/aws/update-runner) to the latest version. The latest stack template includes all required IAM permissions for prebuilds.
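
To check whether an outdated stack is the cause, you can compare the runner role's policies with the actions prebuilds need. The sketch below assumes you pass it the `PolicyDocument` objects retrieved for the role (for example with `aws iam get-role-policy`) and uses the action list from the SCP example above. It only matches `Allow` statements by `Action` pattern and ignores `Resource`, `Condition`, and explicit denies. The names are illustrative.

```python
from fnmatch import fnmatchcase

PREBUILD_ACTIONS = (
    "ec2:CreateSnapshot",
    "ec2:RegisterImage",
    "ec2:DescribeSnapshots",
    "ec2:DescribeImages",
    "ec2:DeleteSnapshot",
    "ec2:DeregisterImage",
)


def missing_allows(policy_documents: list[dict], required=PREBUILD_ACTIONS) -> list[str]:
    """Return the required actions that no Allow statement covers."""
    allowed_patterns = []
    for doc in policy_documents:
        statements = doc.get("Statement", [])
        if isinstance(statements, dict):
            statements = [statements]
        for stmt in statements:
            if stmt.get("Effect") != "Allow":
                continue
            actions = stmt.get("Action", [])
            allowed_patterns += [actions] if isinstance(actions, str) else list(actions)
    return [
        a for a in required
        if not any(fnmatchcase(a.lower(), p.lower()) for p in allowed_patterns)
    ]
```

A non-empty result suggests the role predates the prebuild permissions and the stack needs updating.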

## Custom CA certificate issues

<Tip>
  The `ca-trust-init` container in the Runner's ECS task logs can help diagnose CA issues. Check its **Status** (it should show `STOPPED` with exit code 0) and its **Logs** for errors and warnings.
</Tip>

### Environment stops shortly after starting

**Symptoms:** Environment starts but stops within seconds. No error is visible in the Ona dashboard.

**Cause:** The environment instance failed to download or parse the CA bundle. The instance shuts down when this fails.

**Fix:**

* Verify the CA bundle source (S3 bucket or HTTPS URL) is accessible from the runner's VPC
* For S3 URLs, ensure the bucket name starts with `gitpod-` — the runner's IAM role only has access to buckets matching `gitpod-*`
* Confirm the S3 object exists and the path in the `CustomCATrustBundle` parameter is correct
* Verify the CA bundle is valid PEM — the `-----BEGIN CERTIFICATE-----` and `-----END CERTIFICATE-----` delimiters must each be on their own line
* Verify the CA bundle does not contain invalid content, such as human-readable metadata before `-----BEGIN CERTIFICATE-----`
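
The last two checks can be automated before you upload the bundle. The following is a minimal structural check, assuming a bundle that contains only certificate blocks. It verifies delimiter placement, rejects content outside blocks, and confirms each body is base64, but it does not parse the certificates themselves. The function name is illustrative.

```python
import base64
import binascii

BEGIN = "-----BEGIN CERTIFICATE-----"
END = "-----END CERTIFICATE-----"


def check_pem_bundle(text: str) -> list[str]:
    """Return a list of problems; an empty list means the structure looks valid."""
    problems, body = [], []
    inside, count = False, 0
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line == BEGIN:
            if inside:
                problems.append(f"line {number}: BEGIN inside an open certificate")
            inside, body = True, []
        elif line == END:
            if not inside:
                problems.append(f"line {number}: END without a matching BEGIN")
            elif not body:
                problems.append(f"line {number}: certificate body is empty")
            else:
                try:
                    base64.b64decode("".join(body), validate=True)
                    count += 1
                except (binascii.Error, ValueError):
                    problems.append(f"line {number}: certificate body is not valid base64")
            inside = False
        elif inside:
            body.append(line)
        else:
            # Covers metadata before BEGIN and delimiters sharing a line with data.
            problems.append(f"line {number}: unexpected content outside a certificate block")
    if inside:
        problems.append("missing END delimiter for the last certificate")
    if count == 0 and not problems:
        problems.append("no certificates found")
    return problems
```

Run it on the file contents, for example `check_pem_bundle(open("ca-bundle.pem").read())`, and fix every reported line before updating the stack.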

### User data is limited to 16384 bytes

**Symptoms:** Environment creation fails with `"User data is limited to 16384 bytes"`

**Cause:** When the `CustomCATrustBundle` CloudFormation parameter uses an SSM dynamic reference (`{{resolve:ssm:...}}`), the full PEM certificate content is embedded into EC2 user data. CA bundles with multiple certificates can exceed the 16 KB (16384-byte) AWS limit.

**Fix:** Switch to an S3 URL for the trust bundle:

1. Create an S3 bucket with a name starting with `gitpod-` (e.g. `gitpod-myorg`)
2. Upload your CA bundle to `s3://gitpod-myorg/shared/ca-bundle.pem`
3. Update the `CustomCATrustBundle` CloudFormation parameter to `s3://gitpod-myorg/shared/ca-bundle.pem`
4. Update the CloudFormation stack

See [Custom CA Certificate](/ona/runners/aws/setup#custom-ca-certificate) for details on all supported formats.
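
As a quick pre-flight check for the steps above, the sketch below tests whether a bundle is too large to embed and whether an S3 URL meets the `gitpod-` bucket naming rule. The function names are illustrative. User data also carries other content, so a bundle under the limit may still not fit; the size check only catches bundles that cannot fit at all.

```python
from urllib.parse import urlparse

USER_DATA_LIMIT = 16384  # bytes, taken from the error message


def exceeds_user_data_limit(pem_text: str, limit: int = USER_DATA_LIMIT) -> bool:
    """True if the bundle alone already reaches or exceeds the user data limit."""
    return len(pem_text.encode("utf-8")) >= limit


def check_s3_bundle_url(url: str) -> list[str]:
    """Return problems with an s3:// trust bundle URL; empty list if none found."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        return ["not an s3:// URL"]
    problems = []
    if not parsed.netloc.startswith("gitpod-"):
        problems.append("bucket name does not start with 'gitpod-'")
    if not parsed.path.strip("/"):
        problems.append("missing object key")
    return problems
```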

### CA not trusted in devcontainer builds

**Symptoms:** Devcontainer image builds or feature installs fail with TLS certificate errors, even though the custom CA works for other operations.

**Cause:** Custom CA certificates are applied to the runner and environment host but not injected into devcontainer build phases. Docker builds run in an isolated context that does not inherit the host's CA trust store.

**Fix:** Add your CA certificates directly to your devcontainer image:

```dockerfile theme={null}
# Assumes a Debian- or Ubuntu-based image with the ca-certificates package installed
COPY my-ca-bundle.crt /usr/local/share/ca-certificates/
RUN update-ca-certificates
```

## Network connectivity issues

**Checklist:**

* Security groups: port 29222 (SSH), outbound HTTPS, port 22999 (internal)
* Route tables: public subnets → IGW, private subnets → NAT
* Network ACLs: not blocking required traffic
* DNS: VPC DNS resolution enabled, can resolve `app.gitpod.io`

**Test connectivity:**

```bash theme={null}
# Health endpoint (should return 200)
curl -v https://<your-domain>/_health

# Required endpoints
curl -I https://app.gitpod.io
curl -I https://public.ecr.aws
```

### Restart runner after network changes

After changing security groups, route tables, or VPC endpoints, restart the runner:

**Console:** ECS console → Clusters → your cluster → Services → Update → check **Force new deployment**

**CLI:**

```bash theme={null}
aws ecs update-service --cluster YOUR_CLUSTER_NAME --service YOUR_SERVICE_NAME --force-new-deployment
```

**Verify:** Check that the runner shows "Connected" in **Settings → Runners**, then test creating an environment.

## Getting help

Use the support chat (bubble icon in the bottom-right corner). Include:

* Runner ID and version (from **Settings → Runners** → `...` menu)
* CloudFormation stack name and region
* Runner logs from CloudWatch (ECS task logs)
