Thomas Schubart
March 15, 2026 · Engineering · AI

Tackling Agent Reliability: Rethinking the Todo Tool at Ona

How we replaced seven agent tools with one, moved from edge-triggered to level-triggered state, and built runtime guardrails to keep agents on track.

Ona Todo tool UI showing completed tasks

One of the first tasks I picked up when I joined the agent team at Ona was improving the ToDo tool. The ToDo tool lets the agent record the steps it needs to perform to complete a task.

For us, this is no longer just about improving the agent's performance on long-horizon tasks. Models have gotten good enough that they don't strictly need a ToDo tool to stay on track. The ToDo list is also a mechanism for hiding complexity from the user. Each group of tool calls and agent messages is assigned to a ToDo item. The user does not see individual tool calls or intermediate agent messages unless they choose to expand a specific item. This lets users focus on what matters (the high-level plan and its progress) while keeping the full detail available on demand.

Over time, though, the implementation had accumulated significant complexity. The agent had seven separate tools at its disposal for managing the list: adding items, completing them, labelling them, advancing to the next item, and so on. Seven tools is a lot of surface area for an LLM to reason about correctly, and that complexity could degrade reliability.

One Tool to Replace Seven

My first move was consolidation. I replaced all seven tools with a single one: a ToDo write tool. The agent now writes the complete list of items every time it updates. If it wants to mark an item as complete, it rewrites the entire list with that item's status changed. There is no incremental mutation and no choosing between specialized tools.

Diagram showing seven separate ToDo tools consolidated into a single todo_write tool

This had an immediate benefit for the agent: drastically reduced decision complexity. But it also had a powerful architectural consequence for the frontend.
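As a minimal sketch of the idea, the single write tool can be modeled as a function that replaces the whole list on every call. The type and field names below are illustrative, not Ona's actual API:

```typescript
// Hypothetical sketch of a single todo_write tool replacing seven
// incremental-mutation tools. All names here are illustrative.
type TodoStatus = "pending" | "in_progress" | "completed";

interface TodoItem {
  id: string;
  title: string;
  status: TodoStatus;
}

// The agent always sends the full list; the server overwrites prior state.
function todoWrite(_current: TodoItem[] | null, items: TodoItem[]): TodoItem[] {
  // No incremental mutation: the new list fully replaces the old one.
  return items;
}

// Marking an item complete means rewriting the entire list with that
// item's status changed.
const updated = todoWrite(
  [{ id: "1", title: "Add tests", status: "in_progress" }],
  [{ id: "1", title: "Add tests", status: "completed" }],
);
```

Because there is only one tool and one shape of input, the agent never has to decide which mutation to apply or in what order.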

From Edge-Triggered to Level-Triggered

In hardware design, there are two ways to propagate state. An edge-triggered system reacts to transitions: something changed, figure out what. A level-triggered system reads the current value: here is the state right now. The distinction applies directly to software, and it is well described in the context of Kubernetes controller design.

Diagram comparing edge-triggered vs level-triggered state propagation

The old design was edge-triggered. The frontend had to stitch together the current state of the ToDo list from a stream of discrete events (item added, item completed, item labelled) and figure out which messages belonged to which item. This was complex, brittle, and a source of bugs.

The new design is level-triggered. Every tool call delivers the complete state of the list. The frontend simply renders whatever it receives. We also now tag every message with a group ID and a ToDo item ID, so when a message arrives, the client knows exactly where it belongs. The frontend no longer needs to reconstruct anything.
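The contrast can be sketched in a few lines. In the edge-triggered version the client must fold a stream of deltas into state; in the level-triggered version it just takes the latest snapshot. Event and field names are illustrative assumptions, not Ona's wire format:

```typescript
interface TodoItem { id: string; title: string; done: boolean; }

// Edge-triggered (old design): reconstruct state from discrete events.
type TodoEvent =
  | { kind: "added"; item: TodoItem }
  | { kind: "completed"; id: string };

function applyEvent(state: TodoItem[], ev: TodoEvent): TodoItem[] {
  switch (ev.kind) {
    case "added":
      return [...state, ev.item];
    case "completed":
      return state.map(i => (i.id === ev.id ? { ...i, done: true } : i));
  }
}

// Level-triggered (new design): every update carries the full list,
// so rendering is just "replace what you have".
function applySnapshot(_state: TodoItem[], snapshot: TodoItem[]): TodoItem[] {
  return snapshot;
}

// Messages are tagged so the client knows where each one belongs
// without reconstructing any grouping.
interface AgentMessage { text: string; groupId: string; todoItemId: string; }
```

The edge-triggered reducer grows a new case for every event type and breaks if events arrive out of order; the snapshot version cannot get out of sync.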

Keeping the Agent on Track

I spent significant time refining the tool description: documenting when the agent should and should not use the ToDo tool, how to manage items and groups, and providing concrete examples of correct usage. A well written tool description goes a long way, but it is not sufficient on its own. Over long tasks, agents drift from their instructions regardless of how clearly they are written. We built several mechanisms to counteract this drift.

Invisible system messages. We regularly inject system messages into the conversation that are invisible to the user. These remind the agent of its remaining work and prompt it to keep its ToDo list current.
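A sketch of how such a reminder might be generated, assuming a simple turn counter as the trigger (the threshold and wording here are invented for illustration):

```typescript
interface TodoItem { title: string; done: boolean; }

// Hypothetical sketch: after several turns without a list update,
// inject a system message (invisible to the user) reminding the
// agent of its remaining work.
function maybeReminder(items: TodoItem[], turnsSinceUpdate: number): string | null {
  const remaining = items.filter(i => !i.done).map(i => i.title);
  if (remaining.length === 0 || turnsSinceUpdate < 5) return null;
  return (
    `Reminder (not shown to the user): ${remaining.length} ToDo item(s) remain ` +
    `(${remaining.join(", ")}). Keep your ToDo list current as you work.`
  );
}
```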

Interruption handling. When a user interrupts the agent, the agent sometimes forgets to set items back to in progress after resuming. We now send a notification explicitly informing the agent that it has been interrupted, that it needs to evaluate the user's new request, and that its ToDo list must reflect the current state of the task.

Corrective feedback on misuse. The agent can forget or ignore instructions we provide in the tool description and the system prompt. When the agent uses the tool in a way it should not, we reject the tool call outright, failing it with a message that explains what went wrong and what the agent needs to do instead. This creates a feedback loop that keeps the agent within the intended behavior boundaries.
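The rejection pattern can be sketched as a validation step in front of the tool. The invariant shown (at most one item in progress at a time) is an illustrative assumption, not necessarily one of Ona's rules:

```typescript
interface TodoItem {
  id: string;
  title: string;
  status: "pending" | "in_progress" | "completed";
}

type ToolResult =
  | { ok: true; items: TodoItem[] }
  | { ok: false; error: string };

// Hypothetical guardrail: instead of silently applying an invalid
// update, fail the call and tell the agent how to correct it.
function guardedTodoWrite(items: TodoItem[]): ToolResult {
  const inProgress = items.filter(i => i.status === "in_progress");
  if (inProgress.length > 1) {
    return {
      ok: false,
      error:
        "Only one item may be in progress at a time. " +
        "Set the others back to pending and retry.",
    };
  }
  return { ok: true, items };
}
```

The important part is the error text: it is written for the agent, naming both the violated rule and the corrective action, so the next attempt usually succeeds.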

Polishing the End of a Task

We noticed that the agent's summaries were inconsistent: sometimes the summary was missing or buried in the wrong place. We added two mechanisms to fix that.

Prompting a summary. When the agent finishes a task, we want it to summarize what it did. But sometimes it would forget, or produce a mediocre summary. Now, when the agent sets the last item to completed, the tool result includes an instruction to generate a summary along with guidance on what that summary should contain.

Preventing hidden results. Sometimes the agent would present its results before completing the last item. This buried the output inside the final ToDo item, hidden from the user's view. To fix this, when the agent sets the last item to in progress, we return a message reminding it that results must come after every item on the list has been completed.
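Both nudges key off transitions of the final list item, so they can be sketched as one function over the list the agent just wrote. The message wording is illustrative:

```typescript
interface TodoItem {
  title: string;
  status: "pending" | "in_progress" | "completed";
}

// Hypothetical sketch of the two end-of-task nudges: ask for a summary
// when the last item completes, and remind the agent to hold results
// when the last item merely starts.
function endOfTaskNudge(items: TodoItem[]): string | null {
  if (items.length === 0) return null;
  const last = items[items.length - 1];
  const othersDone = items.slice(0, -1).every(i => i.status === "completed");
  if (othersDone && last.status === "completed") {
    return (
      "All items are complete. Summarize what you did, the key changes, " +
      "and anything the user should verify."
    );
  }
  if (othersDone && last.status === "in_progress") {
    return "Reminder: present results only after every item on the list has been completed.";
  }
  return null;
}
```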

Surfacing Questions to the User

The agent is supposed to use a dedicated ask user tool when it needs input from the user. Sometimes, though, it would skip the tool and ask the question as a normal text message. Because text messages are grouped under the current ToDo item, the question would be buried inside it, invisible unless the user expanded it. I addressed this in two ways.

First, I refined the tool description of the ask user tool to make it clearer when the agent must use it.

Second, I implemented a heuristic: if ToDo items are marked as in progress but the agent stops making tool calls and indicates it is waiting for user input, we surface a visual indicator that user action is required. This ensures that the user is not missing a question from the agent.
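A sketch of that heuristic, assuming three signals per turn (the exact signals and the question-detection pattern are illustrative guesses, not Ona's implementation):

```typescript
interface TurnInfo {
  hasInProgressItems: boolean; // ToDo list still shows unfinished work
  madeToolCalls: boolean;      // did the agent act this turn?
  lastMessage: string;         // the agent's latest text message
}

// Hypothetical heuristic: work remains, the agent has stopped acting,
// and its last message reads like a question -> the user probably
// needs to respond, so surface an "action required" indicator.
function userActionRequired(turn: TurnInfo): boolean {
  const looksLikeQuestion =
    turn.lastMessage.trim().endsWith("?") ||
    /should i|do you want|which option/i.test(turn.lastMessage);
  return turn.hasInProgressItems && !turn.madeToolCalls && looksLikeQuestion;
}
```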

Takeaways

Working on the ToDo tool reinforced a few principles that apply broadly to agent tooling:

Reduce the decision space. Consolidating seven tools into one eliminated an entire class of errors where the agent chose the wrong tool or called them in the wrong order.

Make state explicit. Don't make anyone reconstruct what they can just be told.

Design for drift. Agents will ignore instructions over time. Build runtime mechanisms (reminders, corrective feedback, forced prompts) that catch them when they do.

Focus on what the user sees. Most of the issues we fixed were failures of the agent's output to reach the user in the right form, at the right time, even though the agent had completed the task successfully.


