October 9, 2025
Scaffolding and the art of not breaking things
Everyone wants to believe the latest model update will make agents autonomous. It won't. The real gains come from engineering discipline: how you structure data, scaffold loops, and control execution. This week's pieces show what actually makes AI agents work in the wild: well-designed input formats, tight orchestration, deliberate guardrails, and operational feedback loops.
TL;DR
• Sonnet 4.5 is very good: Zvi's breakdown confirms it's the most capable model for coding and agentic workflows.
• Data format matters: Structured Markdown inputs outperformed CSV and JSONL on QA accuracy.
• SWE-Bench Pro raises the bar: The new long-horizon software-engineering benchmark exposes how models perform on realistic SWE tasks.
• AI Platform Engineering 2025 report: 2025 marks the shift from "playground AI" to a distinct engineering discipline focused on safety, velocity, and control.
• Ona + Sonnet 4.5: Ona now supports Sonnet 4.5 as its default agentic model.
Controlled benchmarking of 11 formats finds that Markdown key-value blocks give the highest QA accuracy (~60.7%) but inflate token cost ~2.7× vs CSV. JSONL and CSV underperform on accuracy despite compactness. The takeaway: format choice can swing results and cost substantially, so optimizing structure matters.
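To see the accuracy-vs-cost trade-off concretely, here is a hypothetical mini-example (the records and helper names are illustrative, not from the benchmark) that renders the same rows as Markdown key-value blocks and as CSV. Character count is used as a rough stand-in for token count; real tokenizers will differ.

```python
import csv
import io

# Illustrative records only; the benchmark used its own datasets.
records = [
    {"id": "1", "name": "alpha", "status": "open"},
    {"id": "2", "name": "beta", "status": "closed"},
]

def to_markdown_kv(rows):
    # One "## Record" block per row, one "key: value" line per field.
    blocks = []
    for row in rows:
        lines = ["## Record"] + [f"{k}: {v}" for k, v in row.items()]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

def to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

md, sv = to_markdown_kv(records), to_csv(records)
# Rough proxy for token cost: character count (real tokenizers differ).
print(f"markdown chars: {len(md)}, csv chars: {len(sv)}")
```

The Markdown rendering repeats every key on every row, which is exactly where the extra tokens go; the benchmark's finding is that models nonetheless answer questions over it more accurately.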
Zvi's assessment: Sonnet 4.5 is the strongest general-purpose coding and agentic model available today. It outpaces peers in reliability, multi-step reasoning, and tool use. Unless you have hard constraints (open-source, compliance), this is the current go-to model for practical coding agents. Ona now supports Sonnet 4.5 by default.
A technical teardown of the 10k-line "autonomous coding" demo. Success came not from model magic but from engineering scaffolding: append-only artifacts, explicit update rules, structured deliberation loops, and conservative runtime constraints. The article shows how orchestration, not capability alone, lets an agent sustain 30-hour sessions without collapse.
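The scaffolding ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's implementation; all names (`ArtifactLog`, `MAX_STEPS`, `run_agent`) are hypothetical.

```python
import time

# Conservative runtime constraint: a hard cap on loop iterations,
# so a confused agent halts instead of spinning forever.
MAX_STEPS = 5

class ArtifactLog:
    """Append-only artifact log: prior entries are never edited,
    so a crashed session can be replayed from the record."""

    def __init__(self):
        self.entries = []  # in-memory here; a real agent would persist to disk

    def append(self, kind, payload):
        # Explicit update rule: records are immutable once written.
        entry = {"ts": time.time(), "kind": kind, "payload": payload}
        self.entries.append(entry)
        return entry

def run_agent(log, plan):
    for step, task in enumerate(plan):
        if step >= MAX_STEPS:
            log.append("halt", "step budget exhausted")
            break
        # Structured deliberation: record the reasoning before the action.
        log.append("deliberate", f"considering: {task}")
        log.append("act", f"executed: {task}")

log = ArtifactLog()
run_agent(log, ["write test", "run test", "fix bug"])
```

The point of the pattern is that state only grows: recovery, auditing, and long sessions all become replays of the log rather than reconstructions of mutable state.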
Scale AI's new benchmark tests long-horizon software-engineering tasks across 41 live repos. Frontier models fail to clear even 25% Pass@1 (GPT-5 leads at 23.3%), underscoring how fragile multi-file reasoning can be. The benchmark reframes coding autonomy as a planning and consistency problem, not raw intelligence.
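For readers unfamiliar with the metric: Pass@1 belongs to the standard unbiased pass@k estimator family introduced with the Codex/HumanEval work (Chen et al., 2021). Given n samples per task of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k), averaged over tasks. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n (with c correct) solves the task."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with all misses
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per task, pass@1 is just the fraction of tasks solved:
scores = [pass_at_k(1, c, 1) for c in [1, 0, 0, 1]]
print(sum(scores) / len(scores))  # 0.5
```

A 23.3% Pass@1 therefore means: sampling one attempt per task, the model resolves fewer than a quarter of the benchmark's issues.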
This piece defines agentic coding as LLMs that plan, code, test, refactor, and deploy while humans shift to supervision. It surveys emerging frameworks (IDE agents, terminal agents, orchestrators) and highlights challenges in process control, debugging, and oversight. Think of it as a map of today's fragmented agent toolchain.
A user study of 19 developers solving real GitHub issues with IDE-embedded agents: half the tasks succeeded. Incremental, conversational workflows outperformed one-shot prompts, but friction came from trust, debugging, and test validation. It confirms the sweet spot is human-in-the-loop, not full autonomy.
A candid discussion with Scaling Devtools featuring our CTO and Head of Product Engineering on what's actually working in large-scale AI deployment. The episode cuts through hype to focus on incident management, model drift, and feedback loops. Key theme: maturity comes from operational discipline, not just model quality. Teams succeeding in production have playbooks for rollback, observability, and human overrides.
We dive deep into our org culture to argue that you shouldn't wait for things to break to learn from them. Every incident (big or small) is a mirror showing how our systems actually behave. We run reviews not to assign blame but to find truth. When teams treat incidents as practice, not punishment, they build muscle for the moments that matter. Incidents aren't setbacks; they're reps for resilience.
Annual deep dive into how enterprises are operationalizing AI. Highlights a shift from experimentation to standardized AI infrastructure, with rising adoption of internal LLM stacks, orchestration layers, and agent governance systems. The report argues that 2025 is the year AI platform engineering becomes a distinct discipline, akin to DevOps in 2015, focused on tooling, safety, and velocity in deploying model-powered systems.
Ona now supports Claude Sonnet 4.5, unlocking stronger coding and reasoning performance inside agent workflows. The update brings improved tool-use reliability, better long-horizon task persistence, and smoother multi-agent orchestration for development environments, alongside faster startup, lower context loss, and more stable execution loops.
Gitpod Classic pay-as-you-go users must migrate to Ona by October 15, 2025. This change does not apply to Enterprise customers. See our guide for migration tips and details.
Platform Engineering Day @ KubeCon + CloudNativeCon NA
Atlanta, Nov 10, 2025
AWS re:Invent
Las Vegas, Dec 1–5, 2025