December 11, 2025
The conversation around AI engineering has flipped again.
Benchmarks keep creeping into the conversation, enterprise buyers are spending real dollars, and "agentic coding" has become a serious topic. The risk is obvious: it's now easy to deploy AI before you have any real handle on quality, safety, or ownership.
TL;DR
• Your Copilot can't do this: See how teams deploy AI engineers across 1,000+ repos for migrations, CVE sweeps, and policy-driven refactors.
• Ona launches Automations: Run org-wide migrations and refactors from one place instead of coordinating dozens of projects by hand.
• 2025 enterprise AI: Menlo's report shows spend surging toward app-layer tools, with coding as the standout use case.
• Agent quality: Google/Kaggle's playbook for measuring agents on full trajectories with logs, traces, and hybrid evals.
Menlo Ventures' report argues we're in a boom, not a bubble, at least if you follow the money. They peg 2025 enterprise gen AI spend at ~$37B, with most of it flowing to application-layer tools over raw infra. Enterprises are shifting from "we'll build it" to buying off the shelf, often through PLG as developers bring tools in from the bottom up. Coding is already the standout departmental category.
Google and Kaggle's "Agent Quality" whitepaper tackles the question of how you know an agent works when failures look like bad judgment, not crashes. They frame quality around four pillars (effectiveness, efficiency, robustness, and safety) and push teams to evaluate full trajectories, not just answers. On observability, they propose a stack of logs, traces, and metrics, plus hybrid evals (automated metrics, LLM/agent judges, and humans) to keep agents improving over time.
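If you want a feel for what trajectory-level, hybrid evaluation looks like in practice, here's a minimal Python sketch. The data shapes and the judge hook are our own illustration of the idea, not the whitepaper's tooling:

```python
# Minimal sketch of trajectory-level hybrid evaluation: automated checks on the
# full trajectory combined with a judge score. All names here are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    tool: str          # tool the agent called
    args: dict         # arguments it passed
    observation: str   # what came back

@dataclass
class Trajectory:
    task: str
    steps: list[Step]
    final_answer: str

def evaluate(traj: Trajectory,
             allowed_tools: set[str],
             judge: Callable[[Trajectory], float]) -> dict:
    """Score a whole trajectory, not just the final answer."""
    # Automated check: did the agent stay inside its tool policy? (robustness/safety)
    policy_ok = all(step.tool in allowed_tools for step in traj.steps)
    # Automated metric: how many steps did it take? (efficiency)
    num_steps = len(traj.steps)
    # Judge score: an LLM/agent judge or a human grades the full trajectory.
    judge_score = judge(traj)
    return {"policy_ok": policy_ok, "num_steps": num_steps, "judge_score": judge_score}
```

The point of the structure is that the same trajectory record feeds logs, traces, and every layer of the eval stack, so regressions show up as changes in these numbers rather than as anecdotes.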
This paper sets out a reference architecture for LLM agents built from four pieces: perception (turn messy inputs into structure), reasoning (plan and adapt), memory (short- and long-term), and execution (call tools and act). Real autonomy only shows up when these are wired into a feedback loop that looks more like a lightweight cognitive system than a chat box, making it a practical blueprint for moving beyond "one prompt, one answer."
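To make that loop concrete, here's a rough sketch of how the four pieces might wire together. The interfaces are illustrative, not the paper's code:

```python
# Sketch of the four-part loop: perception, reasoning, memory, execution,
# closed into a feedback cycle. The callables are supplied by the caller.
class Agent:
    def __init__(self, perceive, reason, execute, max_iters=10):
        self.perceive = perceive    # perception: turn messy input into structure
        self.reason = reason        # reasoning: (obs, memory) -> ("act", action) or ("answer", text)
        self.execute = execute      # execution: run a tool call, return its result
        self.memory = []            # memory: rolling record of actions and outcomes
        self.max_iters = max_iters

    def run(self, raw_input):
        obs = self.perceive(raw_input)
        for _ in range(self.max_iters):
            kind, payload = self.reason(obs, self.memory)
            if kind == "answer":                   # reasoning decided it is done
                return payload
            result = self.execute(payload)         # act in the world via a tool
            self.memory.append((payload, result))  # remember what happened
            obs = result                           # feedback: result becomes the next observation
        return None                                # gave up after max_iters
```

The difference from "one prompt, one answer" is the last two lines of the loop: outcomes flow back into memory and into the next observation, which is what the paper means by a lightweight cognitive system.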
This 300+ page survey follows code models end-to-end: data and pre-training, fine-tuning and RL, and finally their use as autonomous coding agents. It goes past leaderboards to issues practitioners actually care about—security, reasoning across big monorepos, CI/CD integration, and the gap between benchmark wins and messy production work. It's useful if you're deciding how much to trust "Claude Code vs. Copilot vs. open source" in a real SDLC.
This thread asks why sentiment toward agentic coding flipped from skepticism to "come to Jesus" stories in a few months. Replies split into three camps: evangelists, detractors, and skeptics. The useful takeaway is that results are highly workflow-dependent: teams that treat agents as structured tools inside a well-designed IDE flow report big speedups, while "just let the agent write the code" stories mostly end in churn and distrust.
We just got back from AWS re:Invent 2025, and the most common question I heard was some version of: "How is Ona different from the code assistants we already rolled out?" The short answer is that copilots are great for individual throughput, but most enterprise work isn't "write new code faster." It's the organizational-scale backlog: migrations, CVE remediation, standardization, config rollouts, docs drift. This is the stuff that dies in coordination.
Automations is our answer to that. It turns cross-repo initiatives that usually take weeks of tickets and herding into one repeatable workflow you can run across hundreds or thousands of repos, inside secure, isolated environments. You control the scope and review gates; Ona runs the work and gives you inspectable PRs plus an audit trail.
Under the hood it's deliberately simple: Trigger → Context → Steps → Report.
Trigger is when it runs (manual, scheduled, event-driven). Context is what it touches (repos, services, ownership boundaries). Steps are the actual work (commands, prompts, checks, PR creation). Report is what comes back (progress, diffs, failures, and a rollup you can share with the team).
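As a rough illustration (not the actual Automations configuration format), that shape can be sketched as plain data. Every name and field below is hypothetical:

```python
# Hypothetical sketch of the Trigger -> Context -> Steps -> Report model as data.
from dataclasses import dataclass, field

@dataclass
class Automation:
    trigger: str                      # when it runs, e.g. "manual", "cron: 0 6 * * 1", "on: cve-published"
    context: list[str]                # what it touches: repos, services, ownership boundaries
    steps: list[str] = field(default_factory=list)  # the actual work: commands, prompts, checks, PRs
    report_to: str = "#platform-eng"  # where progress, diffs, and failures roll up

# Example: a CVE sweep scoped to one family of services.
cve_sweep = Automation(
    trigger="on: cve-published",
    context=["org/*-service"],
    steps=[
        "scan dependencies for affected versions",
        "apply the patched version and run the test suite",
        "open a PR per repo for human review",
    ],
)
```

The scope and review gates live in the definition, so the blast radius is something you set up front rather than discover halfway through a rollout.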
If you want the concrete version, join our webinar "Your Copilot can't do this" on Thu, Dec 18 @ 2PM UTC / 9AM ET. We'll show how teams run migrations and CVE sweeps across 1,000+ repos, what we keep human-reviewed, and how to manage blast radius without turning it into a quarter-long program: Sign up here
Keeping the drama in the commit messages to a strict minimum,
Your friends at Ona