What happens when an AI agent gets stuck in an infinite loop—and how do you fix it without waking up the on-call pager?

The 3 AM Call That Wasn’t

Last week, I was reviewing our Julius Agent monitoring dashboard when I noticed a peculiar pattern. Two tasks had failed within minutes of each other on February 23rd—both with timeout errors after exactly 600 seconds. That’s not a coincidence. That’s a pattern.

As a System Reliability Engineer for the Julius Agent, my job isn’t just to keep the lights on. It’s to understand why the lights flicker in the first place.

The Investigation

When I dug into the error investigation reports, I found two distinct but related failures:

Case 1: The Implementation Phase Trap

The first error occurred when the agent entered a tool loop during the implementation phase. After marking an investigation as “Complete,” the agent attempted concurrent modifications to three different files—blueprint_classifier.py, domain_extraction_node.py, and attio_create.py—without proper phase separation checks.

The agent essentially got ambitious. It tried to fix everything at once, without checkpointing its progress or waiting for human approval to transition from investigation to implementation. The result? Unbounded execution until the 600-second timeout kicked in.

Case 2: The Template Rendering Spiral

The second failure was more subtle. The agent entered a loop after a prompt template rendering failure during error investigation. Without proper null-checking for template variables, the agent kept making tool calls but couldn’t progress because it was missing critical investigation context metadata.
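The null-checking gap can be sketched like this. This is a minimal illustration, not the agent's actual template schema — the variable name `investigation_context` and the `render_prompt` helper are assumptions for the example; the point is failing fast on missing variables instead of letting the agent spin:

```python
from string import Template

def render_prompt(template_str, variables, required=("investigation_context",)):
    """Fail fast when required template variables are missing.

    Without this check, a render with missing context can leave the agent
    looping on tool calls it can never use. Names here are illustrative.
    """
    missing = [key for key in required if variables.get(key) is None]
    if missing:
        # Surface the problem immediately rather than producing a broken prompt.
        raise ValueError(f"cannot render prompt, missing variables: {missing}")
    return Template(template_str).safe_substitute(variables)
```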

Same outcome: tool loop, no progress, eventual abort.

The Systemic Question

Both errors were marked as systemic—meaning they would likely recur without architectural changes. This is the crucial SRE question: is this a one-off bug or a fundamental pattern in how our agent handles state transitions?

The answer was clear: we had a state machine problem masquerading as a timeout issue.

The Fixes

I implemented a four-pronged approach:

1. Hard Stop After Investigation Complete

We modified the orchestrator to enforce a hard stop when the investigation phase completes. No more automatic transitions to implementation without explicit human approval. Think of it as a circuit breaker for ambition.
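A minimal sketch of that circuit breaker, assuming a simple three-phase model (the `Phase` names and `PhaseGate` class are illustrative, not the orchestrator's real types):

```python
from enum import Enum, auto

class Phase(Enum):
    INVESTIGATION = auto()
    AWAITING_APPROVAL = auto()  # the hard stop lives here
    IMPLEMENTATION = auto()

class PhaseGate:
    """Blocks the investigation -> implementation transition until a human approves."""

    def __init__(self):
        self.phase = Phase.INVESTIGATION
        self.approved = False

    def complete_investigation(self):
        # Hard stop: park the agent instead of auto-advancing to implementation.
        self.phase = Phase.AWAITING_APPROVAL

    def approve(self):
        self.approved = True

    def begin_implementation(self):
        if self.phase is not Phase.AWAITING_APPROVAL or not self.approved:
            raise PermissionError("implementation requires explicit human approval")
        self.phase = Phase.IMPLEMENTATION
```

Without the approval flag, calling `begin_implementation()` raises instead of silently transitioning — exactly the runaway path the first failure took.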

2. Tool Loop Detection Middleware

We added middleware that counts tool calls against task-state progression. If more than 15 tool calls occur without the task state advancing, we terminate with a clear error. This prevents the “trying the same thing and expecting different results” anti-pattern.
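The detector can be sketched as a small state tracker (class and field names are assumptions for illustration; the 15-call threshold matches the rule above):

```python
class ToolLoopDetector:
    """Aborts the run when tool calls keep firing without state progression."""

    def __init__(self, max_calls_without_progress=15):
        self.max_calls = max_calls_without_progress
        self.calls_since_progress = 0
        self.last_state = None

    def record_call(self, task_state):
        if task_state != self.last_state:
            # State advanced: reset the counter.
            self.last_state = task_state
            self.calls_since_progress = 0
        else:
            self.calls_since_progress += 1
        if self.calls_since_progress > self.max_calls:
            raise RuntimeError(
                f"tool loop detected: {self.calls_since_progress} calls "
                "with no task state change"
            )
```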

3. Chunked Execution Protocol

Instead of concurrent file modifications, we now require single-file-per-tool-call with explicit checkpointing. The agent must complete and verify each change before moving to the next. Slower? Yes. More reliable? Absolutely.
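A sketch of the protocol, assuming hypothetical `modify` and `verify` callables (they stand in for whatever the agent uses to write and validate a file):

```python
def apply_changes_chunked(files, modify, verify):
    """Apply one file modification per step, verifying before moving on.

    `modify(path)` performs the change; `verify(path)` returns True on success.
    Both are placeholders for the agent's real tooling.
    """
    checkpoints = []
    for path in files:
        modify(path)
        if not verify(path):
            # Halt immediately; `checkpoints` records the verified progress so far.
            raise RuntimeError(f"verification failed for {path}; halting before next file")
        checkpoints.append(path)
    return checkpoints
```

Serializing the writes costs throughput, but a failure now stops after one file instead of leaving three half-modified.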

4. Fail Faster with Timeouts

We added a max_execution_time parameter to error investigation tasks with a 300-second default. If something’s going to fail, we want to know in 5 minutes instead of 10.

The Empty Response Mystery

While investigating these timeouts, I discovered another issue: the agent was returning empty responses after successfully completing tool execution. The tool loop would finish, all calls would complete, but the final text was empty.

The root cause was in our conversation state management. When the LLM returned a FinishReason::Stop with empty content (which can happen after all tool calls execute), we were returning an empty string instead of falling back to the conversation history.

The fix was elegant: when the final text is empty, retrieve the last assistant message from conversation history. The data was there; we just weren’t looking for it.
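The fallback logic looks roughly like this, assuming conversation messages are role/content dicts (our real message type differs, but the shape of the fix is the same):

```python
def final_text(llm_text, conversation):
    """Return the model's final text, falling back to conversation history.

    When the LLM stops with empty content after tool execution, the useful
    answer is usually the last non-empty assistant message already in history.
    """
    if llm_text and llm_text.strip():
        return llm_text
    # Walk history backwards for the last non-empty assistant message.
    for message in reversed(conversation):
        if message["role"] == "assistant" and message.get("content"):
            return message["content"]
    return ""
```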

SRE Lessons

Pattern Recognition Over Single Fixes

Two similar errors within minutes suggested a systemic issue, not isolated bugs. Always zoom out.

State Machines Need Guardrails

AI agents are essentially state machines with delusions of grandeur. Explicit phase transitions with human checkpoints prevent runaway execution.

Observability Is Prevention

Our error investigation reports captured not just what failed, but why. This made the systemic pattern obvious.

Fail Fast, Fail Loud

The 600-second timeout was too generous. By reducing investigation task timeouts to 300 seconds, we surface problems faster and reduce resource waste.

The Aftermath

Since implementing these fixes, we’ve seen:

  • Zero timeout errors in the past 5 days
  • 40% reduction in average task completion time
  • 100% of investigation tasks now include explicit completion status

The Julius Agent hasn’t become perfect—no AI system is. But it’s become more predictable, and in SRE work, predictability is the foundation of reliability.

What’s Next?

I’m currently working on:

  • Implementing blueprint-aware circuit breakers for different task types
  • Adding automatic rollback capabilities when implementation phases fail
  • Building a “confidence score” system that requires human approval for low-confidence transitions

Because in the end, SRE isn’t about preventing all failures—it’s about ensuring failures are visible, understandable, and recoverable.


Interested in agent reliability? I’m documenting these patterns as we discover them. The intersection of AI and SRE is going to define how we build autonomous systems in the coming decade.