An engineering leader's guide to background agents

How to go from developers running agents locally to a software factory where agent fleets transform your entire SDLC.

In May 2026, we brought together engineering leaders from Stripe, Ramp, Spotify, Uber, Harvey, and others at the Background Agents Summit to share how they implemented and adopted background agents in their organizations. This guide distills those conversations, and our own experience implementing background agent infrastructure and rolling it out across organizations, into a decision-making resource for yours.


The future of the software development lifecycle runs on agents that work autonomously in the background. Today, your developers have Cursor or Claude Code on their laptops. The destination is a software factory: agents working across your entire SDLC, triggered by events, governed by policy, and reviewed by humans. This guide helps you get from one to the other.

This guide is for you if

  • You have local coding agents, but no path to autonomous agents in the cloud. Your developers are using Cursor and Claude Code. But you don't have a solution for agents that run autonomously, picking up work, writing code, running tests, and opening PRs while your team sleeps.
  • Multiple teams are implementing coding agent solutions in isolation. Different teams across your organization are independently implementing background agent toolchains with no shared infrastructure, no shared governance, and no way to learn from each other. You need a paved path and centralization, not more parallel experiments.
  • Security is blocking coding agent adoption. If your engineering organization wants to adopt agents at speed, it needs to work with security, not around it. Without a clear answer on how agents run securely, with proper access controls, sandboxing, and audit trails, your champions are stuck and your CISO is right to block them.

If this sounds like you, read on.

Now available on demand

Background Agents Summit recordings

Speakers included Alistair Gray (Stripe) and Joey Wang (Harvey).
00

What is a background agent?

A coding agent needs your machine and your attention. A background agent needs neither.

Background agents run in their own development environment in the cloud, with a full toolchain, test suite, and access to internal systems, completely decoupled from your device. Kick them off from your laptop and check the results from your phone. Trigger them from a PR, a Slack thread, a Linear ticket, or a webhook.

Background agents change the engineering operating model. Instead of humans initiating every task, agents respond to events autonomously. When a CVE is published, agents scan your repos and open a PR. When a dependency is flagged, agents generate patches.
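The shift from "human types a prompt" to "system initiates work" can be pictured as a small event router. The following is a minimal sketch, not any vendor's API: the event names, the `REPOS_BY_PACKAGE` inventory, and the `AgentTask` shape are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Event:
    kind: str       # hypothetical event name, e.g. "cve.published"
    payload: dict


@dataclass(frozen=True)
class AgentTask:
    repo: str
    instruction: str


# Hypothetical inventory of which repos depend on which package.
REPOS_BY_PACKAGE = {
    "libfoo": ["payments-api", "billing-worker"],
    "libbar": ["web-frontend"],
}


def on_cve_published(event: Event) -> list:
    """Fan one CVE event out into one agent task per affected repo."""
    package = event.payload["package"]
    cve_id = event.payload["cve_id"]
    return [
        AgentTask(repo, f"Patch {package} for {cve_id}, run tests, open a PR")
        for repo in REPOS_BY_PACKAGE.get(package, [])
    ]


def on_dependency_flagged(event: Event) -> list:
    """A flagged dependency produces a single patch task for one repo."""
    return [AgentTask(event.payload["repo"],
                      f"Upgrade {event.payload['package']} and open a PR")]


HANDLERS = {
    "cve.published": on_cve_published,
    "dependency.flagged": on_dependency_flagged,
}


def route(event: Event) -> list:
    """Return the agent tasks an event should start; unknown events start none."""
    handler: Callable = HANDLERS.get(event.kind)
    return handler(event) if handler else []
```

In this sketch no human initiates anything: the event itself decides which repos get an agent, and the human's first touchpoint is the resulting PR.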

The destination? A software factory, where fleets of agents transform the entire SDLC.

01

Inspiration from adopters

Background agents are reshaping how engineering teams work, but it's still early. Different patterns are emerging, and no two companies are approaching it the same way. Six companies (Stripe, Ramp, Spotify, Uber, Harvey, and OpenAI) have published detailed accounts of what they built, the use cases they focused on, and the outcomes they achieved.

Their approaches differ in revealing ways. For instance, Stripe, Ramp, and Uber all built interfaces for developers to kick off multiple agents in parallel, with the engineer reviewing the output rather than writing the code. Spotify, however, took a different path by slotting agents into their existing fleet automation to run large-scale code migrations across thousands of repos. Harvey, on the other hand, is rethinking the engineering process itself, building toward a Figma-like model where engineers, product managers, designers, and even lawyers collaborate in a shared agent session, not just firing off prompts but reshaping how the whole organization works together.

Below, we summarize what each team did, then look across all six to answer common questions: which use cases deliver the fastest ROI, how much engineering investment this takes, what technical building blocks matter, and what every team got right.

Stripe

Stripe has a 100M+ line Ruby codebase with a mature internal dev environment. They built an agent harness on top of Goose that runs one-shot agents in air-gapped sandboxes, with a tool server exposing 400+ internal tools via MCP. Engineers kick off tasks from Slack or Jira. They started with migrations and flaky test fixes, and now merge over 1,000 agent-authored PRs per week.

Ramp

Ramp built their agent system around the idea that agents should be able to verify their own work. Each agent can run tests, query Sentry and Datadog, toggle feature flags, and take screenshots to check its output. Within months, 30% of merged PRs in their frontend and backend repos came from agents. A single security sweep found and fixed ~100 vulnerabilities, including high-severity IDORs that pen testing had missed.

Spotify

Spotify already had a fleet automation system for managing large-scale code migrations across thousands of repos. Rather than building something new, they slotted LLM agents into that existing infrastructure. Migrations that previously required 20K lines of hand-coded AST transforms can now be defined in natural language. Over 1,500 AI-generated PRs merged, with 60–90% time savings on migrations.

Uber

Uber went wide rather than deep, building a central AI platform with specialized agents for code review, test generation, migrations, and PR routing. An MCP gateway lets any internal endpoint become an MCP server with a config change, so teams can plug in without custom integration work. 92% of their developers now use agents monthly, and their test generation agent ships 5,000+ unit tests per month.

Harvey

Harvey built their platform around a key architectural insight: the durable record is the run, not the agent. Agents are short-lived and disposable, but runs persist and are accessible from Slack, web, or CLI. This means designers, PMs, and lawyers can collaborate on the same agent session alongside engineers. They also use agents for incident investigation and scheduled maintenance on cron.

OpenAI

OpenAI used Codex to build an entire production product with zero human-written code. A team of 3 (growing to 7) produced ~1M lines of code across 1,500 PRs in 5 months. They found that strict architectural constraints, enforced by custom linters rather than humans, made agents more productive. Agents verify their own work via an observability stack and agent-to-agent review loops handle most of the review load.

What use cases work for background agents?

You might expect the first use cases for background agents to be greenfield feature work, but that's not what adopting organizations started with. Most began with migrations and upgrades: component migrations, dependency upgrades, design system rollouts. The output is well-defined, easy to verify, and low-risk. Stripe, Spotify, and Uber all began here, with Spotify seeing 60–90% time savings over hand-written transformations. Code maintenance was the next common entry point: flaky test fixes, feature flag cleanup, doc fixes, and tech debt. Stripe runs these as one-shot tasks from Slack, and Uber ships 5,000+ auto-generated unit tests per month. Security turned out to be a breakout use case nobody expected. Ramp found and fixed ~100 vulnerabilities in a single sweep, including high-severity IDORs that pen testing had missed. Some organizations also use agents for feature work and incident investigation, where the output is sometimes a diagnosis rather than code.

How much engineering investment does background agent infrastructure take?

Every organization in this section built their own agent infrastructure, investing months to over a year of dedicated platform engineering. All of them had to build isolated execution environments, tool integration layers, and some form of orchestration. Spotify had the lowest marginal investment because they extended their existing fleet automation system, where the infrastructure for targeting repos, opening PRs, and managing reviews was already in place. Uber took a higher upfront cost by building a central platform so individual teams didn't have to rebuild integrations from scratch. Stripe and Ramp built bespoke systems from the ground up. These are platform engineering projects, not weekend experiments. For organizations without dedicated platform capacity, this is a core build-vs-buy decision, which we cover in more detail later in this guide. Products like Ona package this infrastructure so organizations can adopt background agents without building the underlying platform themselves.

What are the technical building blocks for background agents?

Six organizations built independently and converged on the same core infrastructure. Every one of them runs agents in sandboxed environments (air-gapped at Stripe, fully ephemeral at Harvey, rebuilt every 30 minutes at Ramp) and none of them let agents share environments or touch production. Every organization built a tool and context integration layer: Stripe exposes 400+ internal tools via a central MCP server, and Uber's MCP gateway turns any internal endpoint into an MCP server with a config change. The highest-performing setups give agents the ability to verify their own work by running tests, querying telemetry, and taking screenshots. Stripe's guiding principle here is "if it's good for humans, it's good for LLMs." Every organization also built some form of trigger system to move from "human types a prompt" to "system initiates work" via Slack, Jira, cron, or CI events, and every organization built human-in-the-loop governance before scaling.

What is common across all background agent adopters?

Every adopting organization sandboxes agents, no exceptions. Every one of them gives agents the ability to verify their own work through tests, telemetry, screenshots, and observability tools, using the same verification infrastructure that humans use. And in every mature setup, the human role shifted from writing code to reviewing PRs, with engineers initiating tasks from Slack or Jira and reviewing the output rather than sitting at the keyboard. It's worth noting that these organizations were first movers in large part because they already had existing infrastructure to build on: Spotify extended fleet automation, Uber built on their ML platform. As the space matures, we're seeing other organizations catch up either by investing in dedicated platform engineering efforts or by adopting off-the-shelf products that package this infrastructure out of the box.

What surprised background agent adopters?

OpenAI found that strict architectural constraints, enforced by linters rather than humans, made agents more productive, not less, turning the rigour you normally defer into an early prerequisite. Security turned out to be a breakout use case nobody planned for: Ramp's single sweep found ~100 vulnerabilities that pen testing and 10+ vendor trials had missed. Harvey discovered that making agent runs durable and shareable let designers and PMs collaborate on sessions alongside engineers, extending agent use well beyond the engineering team. And Uber learned that AI compute costs scale faster than expected, up 6x since 2024, making token optimization a real operational priority.


02

Map your maturity

Before you pick a use case or write a single line of platform code, get clear on where you are now and where you're heading, rather than rushing to the end of the path.

Code assistants → Background agents → Software factory.

Background agent maturity model: from code assistants to parallel agents to software factory

The three phases explained

Code assistants are a false summit. Running agents on your laptop, buying a Mac Mini, or even using cloud-based tools like Claude Code or Codex optimizes personal throughput without changing organizational throughput. The bottleneck was never typing speed. It's coordination, review cycles, legacy systems, and the accumulated weight of technical decisions made years ago. Making one developer 2x faster doesn't help when the constraint is the 47 repos that need the same security patch applied, reviewed, and merged.

Background agents unlock operational trust. This is where you move agents to the cloud with proper isolation, build integrations with your existing tooling (Linear, Sentry, GitHub), and progressively increase agent autonomy as trust builds. This phase builds the operational trust and context engineering practices that make the next phase work. You can't skip this phase. It's where you learn what governance model fits your org, which integrations matter, and how much autonomy your team is actually comfortable with.

The software factory model standardizes and automates SDLC tasks at scale. Agents respond to events autonomously, humans review output rather than initiate work, and the entire SDLC is instrumented for agent participation. Most teams aren't ready for this yet, and the ones that are got there by going through the middle phase deliberately, not by trying to leap.

03

Find your first use case

Your first use case can make or break your background agents initiative. Go too ambitious, and you risk losing momentum before you prove value. Pick something too small or irrelevant to the business, and you will struggle to earn the funding and confidence to keep going. The right use case meets coding agents where they are today: repetitive, well-scoped work with clear success criteria and real business value. Below are examples from organizations already using background agents, and the use cases they started with.

Stripe

Automatically fix flaky tests

Flaky tests are a useful place to start because the problem is narrow and easy to validate. The agent has a clear loop to reproduce the failure, inspect the cause, make the fix, and rerun the suite. Stripe uses background agents for this kind of unattended repair work from Slack or Jira. Once the pattern is trusted, flaky-test failures can move further into the background where agents pick them up, apply the fix, and send a PR for review.
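The "reproduce" and "rerun" ends of that loop are the easiest parts to make deterministic. Here is a minimal sketch, assuming a test is any callable that raises on failure; `failure_rate`, `classify`, and the simulated test are illustrative names and do not describe Stripe's tooling.

```python
from typing import Callable


def failure_rate(test: Callable[[], None], runs: int = 20) -> float:
    """Run the test repeatedly and return the fraction of runs that raised."""
    failures = 0
    for _ in range(runs):
        try:
            test()
        except Exception:
            failures += 1
    return failures / runs


def classify(rate: float) -> str:
    """Label a test by how often it failed across repeated runs."""
    if rate == 0.0:
        return "stable"
    if rate == 1.0:
        return "broken"
    return "flaky"


def make_intermittent_test() -> Callable[[], None]:
    """Simulated flaky test: fails on every fourth call (stand-in for a real race)."""
    calls = {"n": 0}

    def test() -> None:
        calls["n"] += 1
        assert calls["n"] % 4 != 0, "simulated intermittent failure"

    return test
```

An agent would measure the rate before its fix to confirm the flake reproduces, then again after the fix; a PR is only worth opening once the second measurement classifies as "stable".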

Read Stripe's Minions story

Ona

GitLab to GitHub CI migration

GitLab to GitHub migrations fit background agents and agent fleets because the work is generally repetitive, high-volume, and has concrete validation steps. One large pharmaceutical company ran their migration in a self-serve way from its internal developer portal. Instead of a central migration team manually moving each repository, teams could request a migration and have agents do the first pass.

Read the pharma migration story

Spotify

Run org-wide Java upgrade fleets

Fleet migrations are a natural fit when the same code change needs to land across many repositories, but writing and maintaining hand-coded AST transforms is painful. Spotify slotted agents into their migration fleet infrastructure for Java upgrades and other repetitive org-wide changes.

Read Spotify's background agent story

Ona

Dependency upgrades and docs updates

Dependency migrations work well because they are usually well-scoped, repetitive, and easy to verify with tests. They are also the kind of important maintenance work engineers rarely want to prioritize by hand. Kingland used agents in a 15-year-old financial services codebase to update Jest dependencies and generate legacy documentation.

Read the Kingland story

Ramp

Patch security vulnerabilities at scale

Security vulnerability sweeps are a strong starting point because the business pain is obvious, the engineering goal is concrete, and the fix can be verified. Ramp used agents to find and patch security issues across its codebase. The workflow detects the issue, reproduces it, applies the patch, and tests the fix.

Read Ramp's security sweep

As you can see, the best first use cases are rarely glamorous. They are usually the toil work blocking teams from shipping: work that is commonly forgotten or ignored, but still gets in the way every day. Background agents work well on this type of work because the goal is clear, the output can be verified easily, and the value is demonstrable.


04

The tech stack

The five primitives are the high-level model for what background agents need to work in production. This section breaks that model down into the technical components teams actually build, buy, and operate.

The dev environment (sandbox) is a secure, isolated replica of your development environment where agents access the same tools, runtimes, and test suites as your developers. Companies already developing in the cloud have a significant advantage here. Orchestration / fleet is the ability to run agents in parallel: one engineer triggering multiple agents from Slack, or a fleet applying a security patch across every affected repo. Kubernetes and CI can be re-used, but weren't designed for agent workloads and can lead to scaling difficulty. The agent harness is the logic layer around the model: deterministic checks alongside non-deterministic iterations, autonomy boundaries, and custom guardrails. Off-the-shelf harnesses like Claude Code or Codex provide a starting point, but most teams extend them. The quality of the harness largely determines how long a background agent can run unattended.

Agents need access to context: code, docs, CI results, tickets, and internal APIs inside your corporate network. Without it, agents can only work with static source code, which leads to poor results. Verification and feedback is how agents check their own work: running tests, linting, querying telemetry. The more an agent can verify autonomously, the longer it can run before needing human review. Security and governance enforces boundaries at the infrastructure layer: network controls, credentials, execution policies, and agent identity. Finally, interfaces (Slack, Jira, etc.) determine how people interact with agents. See below for more detail on each.
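The interplay between the harness ("deterministic checks alongside non-deterministic iterations") and verification can be sketched as a bounded loop. This is an illustrative outline only: `propose_change` stands in for a model call, the checks stand in for real test and lint runs, and `max_attempts` is an assumed autonomy boundary.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunResult:
    status: str                      # "ready_for_review" or "needs_human"
    attempts: int
    failures: list = field(default_factory=list)


def run_with_verification(
    propose_change: Callable[[list], str],
    checks: dict,
    max_attempts: int = 3,
) -> RunResult:
    """Iterate until every deterministic check passes or the attempt budget runs out."""
    feedback: list = []
    for attempt in range(1, max_attempts + 1):
        # Non-deterministic step: the model proposes a change, given prior failures.
        change = propose_change(feedback)
        # Deterministic step: collect the names of the checks that fail on it.
        feedback = [name for name, check in checks.items() if not check(change)]
        if not feedback:
            return RunResult("ready_for_review", attempt)
    # Budget exhausted: stop and escalate instead of looping forever.
    return RunResult("needs_human", max_attempts, feedback)
```

The more checks the agent can run itself, the more of this loop happens before a human is involved, which is the sense in which verification extends how long an agent can run unattended.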

Dev environment (sandbox)

You need an infrastructure primitive to contain and isolate each agent. Your options are broadly VMs, containers, or lighter-weight sandboxes. VMs offer stronger isolation but heavier overhead. Containers are lighter but share a kernel. The trade-offs come down to compatibility, since some codebases need a full OS while others just need a runtime; startup speed, which matters more when a human is waiting than for truly background work; and cost model, whether you keep environments persistent or destroy them after every run. Getting this right early matters because the execution environment shapes every other decision downstream.

Stripe

Stripe runs agents on "devboxes," which are pre-warmed EC2 instances. Each devbox contains a full copy of the Stripe monorepo with services and build caches pre-loaded, giving a cold start of roughly 10 seconds. The virtualization layer is standard EC2, orchestrated by a custom internal platform that lets engineers spin up multiple devboxes in parallel from Slack, CLI, web UI, and other internal tooling surfaces. Devboxes have no internet access and no production credentials. The agent runs in a QA environment with full permissions inside the devbox because the network boundary prevents damage, not the prompt. Devboxes existed before agents. Stripe built them for human developers, and the same properties that made them good for humans (parallelism, predictability, isolation) made them good for agents.

Ramp

Ramp uses Modal for sandboxed execution: lightweight VMs with per-repo container images rebuilt every 30 minutes. Modal handles virtualization and provisioning, but it's not the whole stack. Ramp's control plane runs on Cloudflare Durable Objects with the Agents SDK, which manages state, real-time streaming, and session coordination. Ramp pre-warms sandboxes while the user is still typing their prompt, so the environment is ready before the user submits. Filesystem snapshots allow sessions to be frozen and restored for follow-ups without keeping a container alive. Ramp optimized for speed-to-first-token: "When background agents are fast, they're strictly better than local."

Harvey

Harvey uses ephemeral sandbox workers orchestrated by their internal platform, Spectre. The specific virtualization technology has not been publicly disclosed, but the architecture emphasizes short-lived, disposable containers. What's notable is Harvey's concept of durable runs: the run is the persistent record, not the agent process. Follow-ups don't wake old containers. Instead, a new worker resumes from archived session state. The Spectre control plane manages run lifecycle, progress tracking, and artifact collection. Each worker gets one repository, one tool bundle, one set of short-lived credentials, one artifact path, and one audit trail. Tool configuration is injected at run start, and the environment is destroyed after every run. "Disposable sandboxes are much easier to reason about than a fleet of mutated, half-reused environments with sticky states."
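The "durable run, disposable worker" idea can be illustrated with a small model. This is a sketch under stated assumptions, not Spectre's design: the `Run` and `EphemeralWorker` classes, their fields, and the in-memory "archive" are invented here to show the shape of the pattern.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Run:
    """Durable record: it outlives every worker that serves it."""
    repo: str
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    archived_state: list = field(default_factory=list)   # session history
    worker_ids: list = field(default_factory=list)       # audit trail of workers


class EphemeralWorker:
    """Disposable process: starts from archived state and is destroyed after use."""

    def __init__(self, run: Run) -> None:
        self.worker_id = uuid.uuid4().hex
        self.context = list(run.archived_state)  # copy, never share live state
        self.alive = True

    def handle(self, message: str) -> None:
        self.context.append(message)

    def destroy(self) -> list:
        """Tear the worker down and hand back the state to archive."""
        self.alive = False
        return self.context


def send(run: Run, message: str) -> None:
    """Every message, including follow-ups, gets a brand-new worker."""
    worker = EphemeralWorker(run)
    worker.handle(message)
    run.archived_state = worker.destroy()
    run.worker_ids.append(worker.worker_id)
```

Because nothing survives on the worker, a follow-up from Slack, web, or CLI only needs the run's ID; which worker served the previous message is irrelevant.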

Spotify

Spotify did not build new execution environments for agents. They run agents as containerized jobs on their existing Fleet Management infrastructure, which was already running large-scale automated jobs across thousands of repositories. The specific container runtime has not been publicly detailed. Fleet Management handles targeting repos, opening PRs, managing reviews, and merging. The agent is just another job type, and isolation is inherited from the existing Fleet Management security model.

Ona

How Ona implements dev environments

Ona runs each agent in its own isolated development environment, which is a VM under the hood. Ona has a concept of a runner, which is the compute layer where environments are provisioned. Inside AWS, environments use EBS for volume persistence, which means you can stop and start an environment while preserving everything inside it, saving costs for long-running or resumable workloads. The environment itself is declarative, built from the dev container specification (devcontainer.json), so the agent always gets a reproducible environment with the right toolchain, runtimes, and test suites. Because Ona is a full platform rather than just a sandbox, agents also get platform features like preview URLs for web apps, port forwarding, and the ability for engineers to open the same environment the agent used to review, edit, or take over its work in VS Code or a browser-based editor.

05

Make the business case

Technical teams often rush past business cases and buy-in. They start a proof of concept, wire up an open-source stack, and pick the closest available use case because it lets them move without waiting for procurement or leadership approval.

That can work in the short term, but it creates a real long-term risk: the initiative can be technically successful and still get shut down because budget holders do not understand what it is for, why it matters, or how success should be measured. Background agents have real costs attached to them, so success needs to be tied to an outcome leadership already cares about.

Before you can scale the work, you need to sell the organization on the vision. Below are example slides you can use to shape that story.

The false summit

Open by naming the problem: coding assistants make developers faster, but not the system. The important part here is to show that agents on laptops create a local maximum, or a 'false summit'.

Define background agents

Background agents are not the same as local agents. Use this slide to align on the definition: background agents run remotely, respond to triggers, and can scale horizontally across many tasks or repositories.

Identify the primitives

Show the building blocks required. This helps align on your required technical foundations, and again helps to disambiguate from locally running agents.

Your agent needs a computer

The development environment is critical. Your chosen solution must give agents a secure, isolated, fully tooled environment. Not a container with only source code, or a sandbox with a questionable security boundary.

Scale the software factory

Paint a strong vision of the future. When you stack up background agents over time, you get to a software factory. For many organizations this is a critical long-term aspiration.

06

Decide build or buy

The teams that built their own agent infrastructure (Stripe, Ramp, Spotify, Uber) had something most companies don't: pre-existing investment in cloud development environments. They already had the isolation layer, the image pipelines, and the credential scoping before they ever pointed an agent at a codebase. A dedicated platform team and many months of runway helped, but the foundation was already there.

Building this is possible. The question is whether it's worth it.

Building agent infrastructure means owning it. Each of the technical components from the tech stack section above (execution environments, harness engineering, context integration, feedback loops, orchestration, security, and collaboration surfaces) is a system that needs to be built, maintained, secured, and scaled. The teams that built it themselves did so because agent infrastructure is core to their competitive advantage. For most companies, it isn't.

The build-vs-buy decision comes down to three questions: Do you have a platform team that can own this full-time? Do you have 12-18 months before you need results? And is agent infrastructure a competitive differentiator for your business, or is it table stakes that you need in order to compete?

The component shopping list

Here's what you're actually signing up for if you build:

Sandboxed dev environments. Custom Docker/VM orchestration, image management, network isolation, credential scoping. 3-6 months to build, ongoing maintenance. With Ona: included, cloud or VPC.

Agent orchestration. Custom workflow engine, job scheduling, retry logic, state management. Ongoing maintenance as agent frameworks evolve. With Ona: Automations, built-in.

Trigger system. Webhook plumbing, event routing, deduplication, rate limiting. With Ona: built-in triggers for PR, issue, schedule, API, and custom events.

Agent harness. Integrate Claude Code, Codex, Cursor, or whatever comes next. Keep up with breaking changes. With Ona: pre-integrated, swap anytime.

Observability. Custom logging, dashboards, cost tracking, usage analytics. With Ona: audit logs, usage analytics, cost attribution.

Security & isolation. Network policies, secrets management, RBAC, audit trails. With Ona: SOC 2 Type II, VPC deployment, policy guardrails.
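To give a sense of what the trigger-system item above involves, here is a minimal sketch of its deduplication and rate-limiting parts. The `TriggerGate` class and its parameters are hypothetical, and a production version would need persistent storage for the seen IDs rather than an in-memory set.

```python
import time
from collections import deque


class TriggerGate:
    """Drops duplicate deliveries and caps how many agent runs start per time window."""

    def __init__(self, max_runs, window_seconds, clock=time.monotonic):
        self.max_runs = max_runs
        self.window_seconds = window_seconds
        self.clock = clock              # injectable so the gate can be tested
        self.seen_ids = set()
        self.recent_starts = deque()

    def admit(self, delivery_id):
        """Return "accepted", "duplicate", or "rate_limited" for one delivery."""
        if delivery_id in self.seen_ids:
            return "duplicate"
        now = self.clock()
        # Forget run starts that have aged out of the sliding window.
        while self.recent_starts and now - self.recent_starts[0] >= self.window_seconds:
            self.recent_starts.popleft()
        if len(self.recent_starts) >= self.max_runs:
            # Not recorded as seen, so a later retry can still be admitted.
            return "rate_limited"
        self.seen_ids.add(delivery_id)
        self.recent_starts.append(now)
        return "accepted"
```

Webhook senders commonly retry deliveries, so without the duplicate check one event could start several agents on the same task; the window cap keeps a noisy event source from starting an unbounded number of runs.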

The background agent tool landscape

If you decide to build, here's what the shopping list actually looks like: 95+ infrastructure providers and open-source tools mapped across every layer of the stack.

Ona provides all five primitives as a managed platform. Your agents run in isolated cloud environments with full access to your toolchain, triggered by events across your SDLC. Security is enforced at the infrastructure level: zero standing credentials, ephemeral environments, full audit trails, and the option to run entirely inside your network. Your team focuses on the integrations and workflows specific to your codebase.

07

Avoid these traps

We've watched dozens of teams attempt this transition. The ones that stall share common patterns. Here are the traps that keep coming up.

Trap 1: Starting too big. Don't try to automate your entire SDLC on day one. Pick one use case, prove it works, and expand from there. The teams that succeed start narrow and go deep before going wide.

Trap 2: Treating agents like junior developers. Agents aren't interns. They don't need hand-holding, but they do need clear specifications. The teams that succeed invest in context engineering (structured AGENTS.md files, clear repo conventions, good test suites), not vague prompts.

Trap 3: Running agents on laptops. Running agents on a developer's laptop or a Mac Mini is a dead end. You need isolated, reproducible environments with proper network boundaries and credential scoping. This is the infrastructure problem that Ramp and Stripe solved first, before they even thought about agents.

Trap 4: No observability. If you can't see what your agents are doing, you can't trust them. And if you can't trust them, you'll babysit them, which defeats the purpose. You need audit logs, cost tracking, and usage analytics from day one.

Trap 5: Trusting agent-level security. As you increase agent autonomy, the temptation is to rely on deny lists and permission boundaries defined at the agent level. The problem: agents can reason around their own constraints. Prompt-level guardrails don't work. If the guardrail lives inside the same context window as the agent, it's a suggestion, not a boundary. Security enforcement needs to happen at the infrastructure layer (network policies, kernel-level controls, scoped credentials) where the agent has no ability to override it.

Trap 6: Waiting for the perfect model. The model is not the bottleneck. Context, tooling, and infrastructure are. Teams that wait for the next model to "get good enough" are optimizing the wrong variable. The teams shipping today are shipping because they invested in the surrounding infrastructure, not because they have access to a better model.
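Trap 5's distinction between a suggestion and a boundary is easier to see in code. The sketch below shows the kind of decision an egress proxy outside the sandbox might make; the allowlist hosts and the `egress_allowed` function are hypothetical, and real enforcement would sit in network policy rather than application code.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist, owned by the platform team and stored outside the sandbox.
ALLOWED_HOSTS = frozenset({"git.internal.example", "pypi.internal.example"})


def egress_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Decide on the parsed destination only; the agent's prompt is never consulted."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in allowed_hosts
```

Nothing the agent writes, reasons, or is prompt-injected into believing changes the outcome, because the check runs where the agent cannot reach: it only sees a connection that either succeeds or fails.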

Red flags: Signs your agent adoption is going sideways

  • Developers are spending more time reviewing agent PRs than writing code themselves
  • Agent PRs have a lower merge rate than human PRs
  • You're running agents on developer laptops or shared CI runners
  • No one can tell you how many agent-hours ran last week or what they cost
  • Security hasn't been involved in the conversation yet
  • You're using 3+ different tools to orchestrate what should be one workflow
  • The "AI initiative" has been in pilot for 6+ months with no production deployment
  • Your agents don't have access to tests, linting, or CI, so they can't verify their own work
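Several of these red flags are measurable. As one example, the merge-rate comparison could be computed as follows; this is a sketch that assumes you can export PRs with an author type and a merged flag, and the `PullRequest` shape is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    author_kind: str   # assumed labels: "agent" or "human"
    merged: bool


def merge_rate(prs, author_kind):
    """Fraction of PRs from one author kind that merged, or None if there were none."""
    relevant = [pr for pr in prs if pr.author_kind == author_kind]
    if not relevant:
        return None
    return sum(pr.merged for pr in relevant) / len(relevant)


def agent_rate_is_lower(prs):
    """True when both rates exist and the agent merge rate trails the human one."""
    agent, human = merge_rate(prs, "agent"), merge_rate(prs, "human")
    return agent is not None and human is not None and agent < human
```

Tracking this per repo and per use case, rather than as one global number, makes it easier to see which workflows are producing PRs that reviewers actually accept.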

08

The shortest path to production

Ona provides the technical infrastructure from the previous section out of the box (execution environments, harness, context integration, orchestration, security, and governance), so you can skip the months of platform building and go straight to the use case.

01

Pick a use case

Start with one high-volume, low-risk workflow. CVE remediation, code review, or CI migration are proven starting points with measurable outcomes.

02

Run a proof of value

Deploy agents on real tasks for 2-4 weeks. Measure PRs merged, time saved, and merge rate. Compare against your baseline.

03

Deploy Ona

Cloud or VPC, your choice. SOC 2 Type II certified, ephemeral environments, full audit trail. Security sign-off is straightforward.

04

Scale to fleets

Roll out to more teams and use cases. Go from one agent to fleets, from one repo to hundreds. The infrastructure scales with you.

09

Take your next step

You've worked through the playbook. Pick the action that fits where you are right now.

Evaluating the space

Use the tech stack above as your evaluation checklist, then compare it against the five primitives model.

View the checklist

Try it hands-on

Run an automation template against your own repo. Seeing agents work on your codebase tells you more than any evaluation framework.

Browse templates

Free consultation

A 30-minute working session on your specific situation, not a product demo. You leave with insight into where you are, a recommended first use case, and a best first step to take.

Book a session

The destination is engineers on the loop, not in the loop. Agents run autonomously, triggered by events across your SDLC. Humans improve the system, review output, and handle the work that requires judgment. The teams that start now will compound their advantage every quarter.

Deploy AI software engineers alongside your team.

Start with one use case. Scale to fleets.
