October 9, 2025
Scaffolding and the art of not breaking things
Everyone wants to believe the latest model update will make agents autonomous. It won't. The real gains come from engineering discipline: how you structure data, scaffold loops, and control execution. This week's pieces show what actually makes AI agents work in the wild: well-designed input formats, tight orchestration, deliberate guardrails, and operational feedback loops.
TL;DR
• Sonnet 4.5 is very good: Zvi's breakdown confirms it's the most capable model for coding and agentic workflows.
• Data format matters: Structured Markdown inputs outperformed CSV and JSONL on QA accuracy.
• SWE-Bench Pro raises the bar: The new long-horizon software-engineering benchmark exposes how models perform on realistic SWE tasks.
• AI Platform Engineering 2025 report: 2025 marks the shift from "playground AI" to a distinct engineering discipline focused on safety, velocity, and control.
• Ona + Sonnet 4.5: Ona now supports Sonnet 4.5 as its default agentic model.
Controlled benchmarking of 11 formats finds that Markdown key-value blocks give the highest QA accuracy (~60.7%) but inflate token cost ~2.7× vs CSV. JSONL and CSV underperform on accuracy despite compactness. The takeaway: format choice can swing results and cost substantially, so optimizing structure matters.
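To see the accuracy-vs-cost trade-off concretely, here is a hypothetical mini-example (the records and helper names are illustrative, not from the benchmark) that renders the same rows as Markdown key-value blocks and as CSV. Character count is used as a rough stand-in for token count; real tokenizers will differ.

```python
import csv
import io

# Illustrative records only; the benchmark used its own datasets.
records = [
    {"id": "1", "name": "alpha", "status": "open"},
    {"id": "2", "name": "beta", "status": "closed"},
]

def to_markdown_kv(rows):
    # One "## Record" block per row, one "key: value" line per field.
    blocks = []
    for row in rows:
        lines = ["## Record"] + [f"{k}: {v}" for k, v in row.items()]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

def to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

md, sv = to_markdown_kv(records), to_csv(records)
# Rough proxy for token cost: character count (real tokenizers differ).
print(f"markdown chars: {len(md)}, csv chars: {len(sv)}")
```

The Markdown rendering repeats every key on every row, which is exactly where the extra tokens go; the benchmark's finding is that models nonetheless answer questions over it more accurately.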
Zvi's assessment: Sonnet 4.5 is the strongest general-purpose coding and agentic model available today. It outpaces peers in reliability, multi-step reasoning, and tool use. Unless you have hard constraints (open-source, compliance), this is the current go-to model for practical coding agents. Ona now supports Sonnet 4.5 by default.
A technical teardown of the 10k-line "autonomous coding" demo. Success came not from model magic but from engineering scaffolding: append-only artifacts, explicit update rules, structured deliberation loops, and conservative runtime constraints. The article shows how orchestration, not capability alone, lets an agent sustain 30-hour sessions without collapse.
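The scaffolding ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's implementation; all names (`ArtifactLog`, `MAX_STEPS`, `run_agent`) are hypothetical.

```python
import time

# Conservative runtime constraint: a hard cap on loop iterations,
# so a confused agent halts instead of spinning forever.
MAX_STEPS = 5

class ArtifactLog:
    """Append-only artifact log: prior entries are never edited,
    so a crashed session can be replayed from the record."""

    def __init__(self):
        self.entries = []  # in-memory here; a real agent would persist to disk

    def append(self, kind, payload):
        # Explicit update rule: records are immutable once written.
        entry = {"ts": time.time(), "kind": kind, "payload": payload}
        self.entries.append(entry)
        return entry

def run_agent(log, plan):
    for step, task in enumerate(plan):
        if step >= MAX_STEPS:
            log.append("halt", "step budget exhausted")
            break
        # Structured deliberation: record the reasoning before the action.
        log.append("deliberate", f"considering: {task}")
        log.append("act", f"executed: {task}")

log = ArtifactLog()
run_agent(log, ["write test", "run test", "fix bug"])
```

The point of the pattern is that state only grows: recovery, auditing, and long sessions all become replays of the log rather than reconstructions of mutable state.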
Scale AI's new benchmark tests long-horizon software-engineering tasks across 41 live repos. Frontier models fail to clear even 25% Pass@1 (GPT-5 leads at 23.3%), underscoring how fragile multi-file reasoning can be. The benchmark reframes coding autonomy as a planning and consistency problem, not raw intelligence.
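For readers unfamiliar with the metric: Pass@1 belongs to the standard unbiased pass@k estimator family introduced with the Codex/HumanEval work (Chen et al., 2021). Given n samples per task of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k), averaged over tasks. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n (with c correct) solves the task."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with all misses
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per task, pass@1 is just the fraction of tasks solved:
scores = [pass_at_k(1, c, 1) for c in [1, 0, 0, 1]]
print(sum(scores) / len(scores))  # 0.5
```

A 23.3% Pass@1 therefore means: sampling one attempt per task, the model resolves fewer than a quarter of the benchmark's issues.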
This piece defines agentic coding as LLMs that plan, code, test, refactor, and deploy while humans shift to supervision. It surveys emerging frameworks (IDE agents, terminal agents, orchestrators) and highlights challenges in process control, debugging, and oversight. Think of it as a map of today's fragmented agent toolchain.
A user study of 19 developers solving real GitHub issues with IDE-embedded agents: half the tasks succeeded. Incremental, conversational workflows outperformed one-shot prompts, but friction came from trust, debugging, and test validation. It confirms the sweet spot is human-in-the-loop, not full autonomy.
A candid discussion with Scaling Devtools featuring our CTO and Head of Product Engineering on what's actually working in large-scale AI deployment. The episode cuts through hype to focus on incident management, model drift, and feedback loops. Key theme: maturity comes from operational discipline, not just model quality. Teams succeeding in production have playbooks for rollback, observability, and human overrides.
We dive deep into our org culture to argue that you shouldn't wait for things to break to learn from them. Every incident (big or small) is a mirror showing how our systems actually behave. We run reviews not to assign blame but to find truth. When teams treat incidents as practice, not punishment, they build muscle for the moments that matter. Incidents aren't setbacks; they're reps for resilience.
Annual deep dive into how enterprises are operationalizing AI. Highlights a shift from experimentation to standardized AI infrastructure, with rising adoption of internal LLM stacks, orchestration layers, and agent governance systems. The report argues that 2025 is the year AI platform engineering becomes a distinct discipline, akin to DevOps in 2015, focused on tooling, safety, and velocity in deploying model-powered systems.
Ona now supports Claude Sonnet 4.5, unlocking stronger coding and reasoning performance inside agent workflows. The update brings improved tool-use reliability, better long-horizon task persistence, and smoother multi-agent orchestration for development environments, alongside faster startup, lower context loss, and more stable execution loops.
Gitpod Classic pay-as-you-go users must migrate to Ona by October 15, 2025. This change does not apply to Enterprise customers. See our guide for migration tips and details.
Platform Engineering Day @ KubeCon + CloudNativeCon NA
Atlanta, Nov 10, 2025
AWS re:Invent
Las Vegas, Dec 1–5, 2025