375 PRs merged. 67,000 lines of code. 1,067 tests. No human-written production code. Here's what 10 days of running a software factory taught us about the future of engineering.
Everybody is talking about software factories. Few have built one. Fewer still show the process. So we documented the whole thing in public: empty GitHub repo to self-shipping product, live, every day.
While writing code with agents has become trivial, we asked ourselves a harder question: how do you automate the full SDLC — fully secure, cloud-based, and enterprise-ready?
That was the core question behind our 10-day experiment. We built a software factory: a network of background agents covering every stage from an empty repo to operating in production. The goal was to build Memo, a Notion-style note-taking app, with one hard constraint: no human-written production code.
Humans wrote the specs, configured the automations, and reviewed escalations. The factory did everything else: planning, implementation, review, deployment, monitoring, and iteration.
By Day 10, it had merged 375 PRs, written over 67,000 lines of code, and generated 1,067 tests. The system grew to 16 automations: automated agent workflows that trigger on schedules or events. The median time from issue opened to issue closed was 38 minutes. The median time from PR opened to PR merged was 4.9 minutes. Roughly 87% of merged work happened without human involvement. And when a human was involved, it was usually just to give an agent a pointer on the direction the product should take.
The factory didn't stop there. It kept working on the repo continuously.
In our case, the factory was an ensemble of specialised background agents chained together across the SDLC. Each automation had a narrow responsibility, a clear trigger, and a defined way to hand work to the next step.
The core loop looked like this:
- PR Reviewer: handles pull requests, fixes CI failures, and merges when ready.
- Post-Merge Verifier: smoke-tests the live app after each merge.
- Incident Responder: polls Sentry and converts production errors into GitHub issues.
- Feature Builder: picks up backlog items, reads the acceptance criteria, explores the codebase, and implements features end-to-end.
The important part is not any single agent but the architecture between them — how they hand work off to each other without producing new bottlenecks.
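To make the handoff pattern concrete, here is a minimal sketch in TypeScript. It is an illustration, not the factory's actual code: every name in it (FactoryEvent, Automation, dispatch) is hypothetical. The point is the shape: each automation declares one trigger and, instead of calling the next agent directly, emits the events that hand work downstream.

```typescript
type FactoryEvent =
  | { kind: "pr.opened"; pr: number }
  | { kind: "pr.merged"; pr: number }
  | { kind: "error.detected"; message: string };

interface Automation {
  name: string;
  trigger: FactoryEvent["kind"];
  // Runs the agent and returns the events that hand work to the next step.
  run(event: FactoryEvent): Promise<FactoryEvent[]>;
}

const automations: Automation[] = [
  {
    name: "pr-reviewer",
    trigger: "pr.opened",
    async run(event) {
      if (event.kind !== "pr.opened") return [];
      // Review, fix CI if needed, merge. Merging does not call the next
      // agent directly; it emits the event the verifier is listening for.
      return [{ kind: "pr.merged", pr: event.pr }];
    },
  },
  {
    name: "post-merge-verifier",
    trigger: "pr.merged",
    async run() {
      // Smoke-test the live app; a failure becomes new work for the factory.
      const healthy = true; // placeholder for a real smoke test
      return healthy
        ? []
        : [{ kind: "error.detected", message: "smoke test failed" }];
    },
  },
];

// Minimal dispatcher: fan each event out to whatever automation it triggers.
async function dispatch(event: FactoryEvent): Promise<void> {
  for (const automation of automations.filter((a) => a.trigger === event.kind)) {
    const next = await automation.run(event);
    await Promise.all(next.map((e) => dispatch(e)));
  }
}

void dispatch({ kind: "pr.opened", pr: 42 });
```

In the real factory the triggers were GitHub and Sentry state changes rather than an in-memory queue, but the handoff shape was the same: emit state, let the next automation pick it up.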
As the factory matured, we built up a set of automations that ran the delivery process, monitored it for gaps, and filled those gaps with new additions to the stack. Monitoring and improving not just the app but the factory building it became the key discipline. That is the shift in this approach: engineering time goes into the factory, not the product.
The biggest lesson of the project was the inverse relationship between spec quality and the number of product iterations needed.
On Day 3, we wrote a detailed product spec, putting several hours into it. After that, the factory produced a working app with authentication, workspaces, a rich text editor, search, and member invites. It merged 54 PRs in a single day and it really felt like "agents can build a product."
Later, we gave the factory a five-line spec for a Notion-like database feature. It built the feature overnight and it worked, but it had rough edges: date picker overflow, property editing issues, and awkward UX around edge cases. That led to a few rounds of raising bugs and letting the automations fix the issues.
Most of the bugs were not capability failures in the agents but specification failures: our spec lacked detail. The factory built what we described, and it missed the things we had not described clearly enough.
In a normal product team, a spec is often treated as a planning artefact. In a software factory, the spec becomes the control surface — and it should be detailed from the start, because the factory does not have a human in the loop advising on each step of the buildout. Acceptance criteria, examples, design references, architecture notes, repo conventions, and edge cases directly shape the output.
The more precise the input, the better the result. A day of spec writing can save a week of bug fixing — and is time well spent, particularly as the agents themselves can be used to bring your spec to the required level of detail.
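To make that concrete, here is a hypothetical spec fragment at the level of detail that would have prevented the rough edges above. It is our illustration, not Memo's actual spec:

```markdown
## Database view: date property
- Acceptance: the date picker opens fully inside the viewport and never
  overflows the cell.
- Edge case: an empty date renders a muted placeholder, never "Invalid Date".
- Property editing: renaming a property updates every row without a reload.
- Reference: reuse the spacing and colour tokens defined in the design system.
```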
By Week 2, the factory had built something functional but rather generic from a design and UX point of view.
Our stream guest Janine Shepherd described it as drifting toward the default AI-generated product aesthetic: dark mode, safe patterns, rough contrast, and a familiar Notion-like shape.
Her diagnosis was right:
"It always wants to revert back to its typical ways of designing. You've got to push quite hard to get it out of that comfort zone."
That became one of the clearest limits of the experiment. The factory could satisfy requirements, but it did not produce taste on its own.
We tested Claude Design on contrast issues. It fixed the ratios, but flattened the interface. The design became more correct by the numbers and worse to the eye.
"There's an aspect of design looking right to the human eye versus it being mathematically right within the design system."
Our partial fix was to give the agents better visual ground truth. We added Storybook so components had a reference point. We also built a Product Improver automation that screenshots the live app, identifies visual or UX gaps, and files issues with a needs-human label.
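A minimal sketch of what that automation could look like, assuming Playwright for the screenshot and Octokit for filing issues. reviewScreenshot stands in for the agent call that actually judges the UI, and the repo coordinates are hypothetical.

```typescript
import { chromium } from "playwright";
import { Octokit } from "@octokit/rest";

// Hypothetical agent call: inspects a screenshot and returns visual/UX gaps.
async function reviewScreenshot(path: string): Promise<string[]> {
  return []; // placeholder for the model call the real automation makes
}

async function productImprover(appUrl: string): Promise<void> {
  // Capture visual ground truth from the live app.
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(appUrl);
  await page.screenshot({ path: "live-app.png", fullPage: true });
  await browser.close();

  // Every finding becomes a labelled issue so a human can triage taste calls.
  const gaps = await reviewScreenshot("live-app.png");
  const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
  for (const gap of gaps) {
    await octokit.rest.issues.create({
      owner: "example-org", // hypothetical repo coordinates
      repo: "memo",
      title: `UX gap: ${gap}`,
      body: "Filed by the Product Improver automation from a live screenshot.",
      labels: ["needs-human"],
    });
  }
}

void productImprover("https://memo.example.com");
```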
That helped. It gave the factory a way to notice some of its own product shortcomings. But it did not replace human taste. Having a strong design system and examples of product flows helps guide agents to the desired product outcome. Like the spec, it is worth spending time at the beginning of a new project clearly defining the visual direction and giving the factory guidance on the taste it is meant to achieve.
The most interesting moments came when the factory started to work on itself — discovering gaps and working on fixes.
Over the weekend, it ran out of tasks. Instead of stopping, the Automation Auditor kicked in, reviewed the automation setup itself, and started recommending improvements.
A background agent finding a bug in the background-agent system is a small thing technically, but a big thing conceptually. It only happens when agents run continuously and have permission to inspect the system around them.
We also added a quality.md file: a self-assessment of the product across feature areas. When the backlog ran empty, the Feature Planner could read that report and create new issues for anything below standard. That led to real improvements. Test coverage went from zero to a working suite. Error handling moved from scattered console.error calls to proper Sentry capture.
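The Sentry change is simple to picture. A minimal before-and-after sketch, with persist standing in for whatever storage call might fail:

```typescript
import * as Sentry from "@sentry/node";

Sentry.init({ dsn: process.env.SENTRY_DSN });

// Hypothetical storage call, here only to make the example self-contained.
async function persist(note: { id: string; body: string }): Promise<void> {}

async function saveNote(note: { id: string; body: string }): Promise<void> {
  try {
    await persist(note);
  } catch (err) {
    // Before: console.error(err), which died quietly in the logs.
    // After: captured in Sentry, where the Incident Responder polls for it
    // and converts it into a GitHub issue.
    Sentry.captureException(err);
    throw err;
  }
}
```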
This is where background agents become more than task runners. They are not agents you prompt when you need something. They run on schedules, watch for state changes, inspect outputs, and create the next loop of work.
A factory without feedback loops is just a fast builder. A factory with feedback loops starts to become an operating system.
By Day 7, most of the SDLC had a loop.
But a key arrow was still missing: operations back to planning.
The factory could build what was already in the backlog, but it could not yet sense what users needed next. So we built a feedback widget and a digest automation that summarised user feedback into patterns.
That changed the shape of the system. The factory was no longer just executing tickets. It started absorbing product signal.
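A sketch of the digest step, under the assumption that widget entries land in a store and an agent call does the clustering; both summarise and the FeedbackEntry shape are illustrative:

```typescript
interface FeedbackEntry {
  page: string;    // where in the app the widget was opened
  message: string; // the user's free-text feedback
}

// Hypothetical agent call that clusters raw entries into recurring patterns.
async function summarise(entries: FeedbackEntry[]): Promise<string[]> {
  return [];
}

// Runs on a schedule and turns raw signal into a digest the Feature
// Planner can read when it fills the backlog.
async function feedbackDigest(entries: FeedbackEntry[]): Promise<string> {
  const patterns = await summarise(entries);
  return [
    `# Feedback digest (${entries.length} entries)`,
    ...patterns.map((p) => `- ${p}`),
  ].join("\n");
}
```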
Chris Weichel, our CTO, described this as "externalising your intuition into the factory": letting the system collect and structure user signal so the human can focus on synthesis and direction.
That framing matters because the real risk of software factories is not that teams will ship too little. It is that they will ship too much.
Chris put it bluntly:
"It's never been easier to feature bloat, and it's never been harder to differentiate. The underperforming teams will be tempted to feature bloat. The really good teams will understand it's in the relationship with your users."
When output gets cheaper, judgment gets more important — as does having a tight feedback loop between user experience and planning of the product roadmap.
The human work did not disappear. It changed shape.
Days 1 to 3 were heavy: 8 to 10 hours a day. But that time was not spent building Memo directly. It was spent building the factory and harness that could build Memo: writing automations, tuning prompts, creating AGENTS.md, defining conventions, setting escalation paths, and designing review loops.
AGENTS.md became the factory floor manual telling every agent how to behave in the repo: code style, architecture decisions, testing expectations, PR conventions, risky areas, and escalation rules. The quality of that file directly influenced the quality of the factory's output.
AGENTS.md was just the entry point. Behind it sat a harness of files the factory depended on: architecture.md for the system map, design.md for visual ground truth, conventions.md for coding patterns to replicate, quality.md as a graded scorecard so the factory knew where to improve, a product spec for intent, and Storybook as the rendered visual reference. This harness — not just the automations — is where most of the engineering time went at the beginning, and its quality set the ceiling for everything the factory produced.
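For a sense of the register, here is an illustrative excerpt in the style of such a file (our invention, not the project's actual AGENTS.md):

```markdown
## Code style
- TypeScript strict mode; no `any` in production code.

## Testing expectations
- Every PR adds or updates tests; CI must be green before merge.

## Risky areas
- `auth/` and data migrations: open the PR, but apply the `needs-human`
  label instead of auto-merging.

## Escalation
- If acceptance criteria are ambiguous, file a question issue rather than guess.
```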
By Week 2, the human work had dropped to 2 to 3 hours a day. The work became metrics, documentation, occasional review, and improvements to the operating model.
That is the shift: you do not build the product. You build a factory to build the product.
And once implementation gets cheaper, the bar should go up. Matt Boyle, our Head of Product Engineering, put it this way:
"I actually hold our engineers to a much higher bar now. It's so cheap to drive something to sufficient quality. There's no reason not to ship with sufficient quality."
He also made the sharper product point:
"I'm always more impressed by people killing features and products than shipping them."
That is the discipline software factories will require. Not shipping more for the sake of it. Shipping better, with a clearer reason for every feature that makes it through the loop.
If we started again, we would invest in feedback loops earlier.
Specifically, we would:

- Activate the Automation Auditor and Product Improver as soon as the product goes online, not after the first week.
- Add Storybook from the start, because visual specs are more useful than text specs for design.
- Spend more time on the initial product spec, because every ambiguity turns into downstream cleanup.
- Capture more errors, not fewer, because the factory can only improve from data it can see.
- Close the operations-to-planning loop much earlier, so user feedback feeds the backlog from the beginning.
All five lessons collapse into one rule:
Every day you run without feedback loops is a day the factory cannot self-correct.
And self-correction is the point.
We are leaving the factory running with no human input and will check back in a few weeks.
The self-improvement loop is active. The feedback widget is live. The automations are running. Now we get to see what it builds, what it breaks, and whether autonomous progress holds over a longer period.
Because the real test of background agents is not whether they can do impressive work while supervised. It is whether they can keep making useful progress when nobody is watching.
On Day 9, we hosted Shardul Vaidya from AWS, who built his own software factory from scratch in Rust. Different approach, same conclusion: the engineer becomes a factory maintainer.
Lou Bichard drew the Kubernetes analogy:
"Everyone built their own orchestration engine, and then eventually Google released Kubernetes. I do feel like that will happen here."
A standardisation moment for software factories feels likely. The question is where it happens: the agent runtime, the automation layer, repo conventions, observability, handoff protocols, or the product feedback loop.
We do not know yet, but the pattern is becoming clearer.
A software factory is not one giant autonomous agent. It is a set of reliable background systems that keep handing work to each other.
Start with one automation. Pick a workflow where the trigger is clear, the context is available, the verification path is known, and the escalation path is visible. Build that. Then add the next loop.
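For example, an Incident Responder-style automation ticks all four boxes: the trigger is a schedule, the context is Sentry's issue list, verification is whether a GitHub issue now exists, and anything new escalates as a labelled issue. A hedged sketch, using Sentry's public REST endpoint for listing a project's issues; the org, repo, and in-memory dedupe store are illustrative:

```typescript
import { Octokit } from "@octokit/rest";

const seen = new Set<string>(); // in-memory dedupe; a real factory would persist this

async function incidentResponder(): Promise<void> {
  // List unresolved Sentry issues for the project (public Sentry REST API).
  const res = await fetch(
    "https://sentry.io/api/0/projects/example-org/memo/issues/?query=is:unresolved",
    { headers: { Authorization: `Bearer ${process.env.SENTRY_TOKEN}` } }
  );
  const issues: Array<{ id: string; title: string; permalink: string }> =
    await res.json();

  const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
  for (const issue of issues.filter((i) => !seen.has(i.id))) {
    seen.add(issue.id);
    // Verification is direct: either the GitHub issue exists or it does not.
    await octokit.rest.issues.create({
      owner: "example-org", // hypothetical repo coordinates
      repo: "memo",
      title: `Production error: ${issue.title}`,
      body: `Sentry: ${issue.permalink}`,
      labels: ["incident"],
    });
  }
}

// Clear trigger: a fixed schedule rather than a human prompt.
setInterval(incidentResponder, 5 * 60 * 1000);
```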
The virtual summit is May 6 at background-agents.com/summit. We will walk through the factory architecture, the automation prompts, and the lessons from 10 days of running background agents in production.
Everything is public.
Watch the streams · Explore the repo · Try the app
Built with Ona.