How AI Agents Handle Failures and Retries
AI Agents · 2025-12-03 · 7 min read

By Billy, a BMX rider and AI engineer who built OpenClaw, an AI agent that runs his 3 brands automatically.

#ai-agents #automation

My agent fails every single day. That's not a bug — that's the internet.

Instagram changes a DOM element. Chrome loses a session cookie. An API rate-limits you at the worst possible moment. A proxy goes down and every downstream cron hangs for 3 minutes before SIGKILL takes it out.

The difference between an agent that works and an agent that's useful is how it handles the failures. Here's what I've learned from running automation across 8+ platform/account combinations, a bounty pipeline, and a lead engine — all day, every day.

Failure Categories

After months of production use, every failure I've seen falls into one of four buckets:

1. Transient Failures

The platform is slow, the network hiccuped, Chrome took too long to render. These fix themselves if you just wait and try again.

My approach: Exponential backoff. First retry after 5 seconds, the next after 15. The delay triples each time and is capped at 60 seconds. Three attempts max for transient issues, so in practice only the 5 and 15 second waits ever fire.

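import time


def retry_with_backoff(fn, max_retries=3, base_delay=5, max_delay=60):
    """Call fn(), retrying on failure with waits of 5s, 15s, 45s... capped at max_delay."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            # Out of attempts: let the caller see the original error.
            if attempt == max_retries - 1:
                raise
            # The delay triples on each attempt and never exceeds max_delay.
            time.sleep(min(base_delay * (3 ** attempt), max_delay))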

2. Session Failures

The browser lost its login. This happens when Chrome's debug profile gets corrupted, when a platform forces a re-auth, or when cookies expire.

My approach: Detect the login page. If the agent navigates to Instagram and lands on the login screen instead of the feed, that's a session failure. Log it, send a Telegram alert, skip this platform for now. I'll log back in manually — it takes 30 seconds.

I used to try auto-login. Don't. Platforms detect automated login attempts and will lock your account. Manual login once, then let the persistent session do its job.
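Here's a minimal sketch of that login-page check, assuming you judge the session by the URL the browser lands on. The function names and the list of login paths are illustrative placeholders, not lifted from production code, and some platforms may need a DOM check on top of this.

```python
from urllib.parse import urlparse

# Hypothetical path fragments that suggest a login wall; tune per platform.
LOGIN_PATH_MARKERS = ("/login", "/accounts/login", "/signin")


def looks_like_login_page(current_url: str) -> bool:
    """Return True if the URL the browser landed on looks like a login screen."""
    path = urlparse(current_url).path.lower()
    return any(marker in path for marker in LOGIN_PATH_MARKERS)


def check_session(platform: str, current_url: str, alert) -> bool:
    """Return True if the session looks alive; otherwise alert and return False.

    `alert` is any callable that takes a message string (for example a
    Telegram sender). The caller skips the platform when this returns False.
    """
    if looks_like_login_page(current_url):
        alert(f"{platform}: session lost, manual login needed")
        return False
    return True
```

Note that there's no retry here on purpose: waiting doesn't bring a dead session back, so the only sane moves are alert and skip.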

3. DOM Failures

The platform redesigned something. A button moved, a selector broke, an aria-label changed.

My approach: Every click and interaction is wrapped in a try/except that captures a screenshot on failure. When I see the screenshot, I can usually identify the new selector in under a minute.

I use aria-labels and text content for selectors instead of CSS classes. [aria-label="New post"] survives redesigns far better than .css-1dbjc4n. This alone cut my DOM failure rate by about 80%.
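The screenshot-on-failure wrapper can be sketched like this. It's a simplified version under a few assumptions: `action` is whatever click or fill you want to run, `take_screenshot` is a thin wrapper around your browser driver's screenshot call, and the `screenshots` directory already exists. All of those names are hypothetical.

```python
import os
import time


def run_with_screenshot(action, take_screenshot, label, shots_dir="screenshots"):
    """Run one interaction; on any error capture a screenshot, then re-raise.

    `action` is a zero-argument callable (a click, a fill, and so on).
    `take_screenshot` is a callable that accepts the target file path.
    """
    try:
        return action()
    except Exception:
        shot_path = os.path.join(shots_dir, f"{label}-{int(time.time())}.png")
        try:
            take_screenshot(shot_path)
        except Exception:
            pass  # a failed screenshot must not hide the original error
        raise
```

The error is re-raised after the screenshot so the retry and logging layers above still see the failure.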

4. Architectural Failures

These are the ones that take down everything. The worst one I hit: brand_cron.py was routing ALL platforms through an Azure GPT-4.1 proxy at port 18795. When that proxy wasn't running, every single platform timed out after 3 minutes and got killed.

The fix: Remove the dependency. All platforms now call social_poster.py functions directly. No proxy in the critical path. The proxy is available for content generation, but posting never depends on it.

This is the most important lesson: never put a single point of failure in the critical path of your automation.
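One way to keep an optional service out of the critical path, sketched under assumptions: `generate` stands in for a call to the content-generation proxy, `posters` stands in for the direct posting functions, and the fallback caption is a placeholder I made up for the example. None of these names come from the real codebase.

```python
def build_caption(topic, generate=None, fallback="New post: {topic}"):
    """Return a caption, using the optional generator only if it works."""
    if generate is not None:
        try:
            text = generate(topic)
            if text:
                return text
        except Exception:
            pass  # generator (e.g. the proxy) is down; fall through
    return fallback.format(topic=topic)


def post_everywhere(topic, posters, generate=None):
    """Post to every platform; the generator is never required."""
    caption = build_caption(topic, generate)
    return {name: post(caption) for name, post in posters.items()}
```

If the generator raises or returns nothing, posting still goes ahead with the fallback text, so a dead proxy costs you caption quality and nothing else.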

The Retry Architecture

Here's how my production system actually handles a failed post:

Attempt 1: Try to post
  ↓ fails
Wait 5s → Attempt 2: Try again
  ↓ fails
Wait 15s → Attempt 3: Try again
  ↓ fails
Log error → Screenshot → Skip this platform
  ↓
Continue to next platform/account
  ↓
At end of cycle: write error summary to cron-errors.log
  ↓
If critical (all platforms failed): Telegram alert

The key design decision: never let one platform's failure block another. If Instagram is down, Pinterest still posts. If X has a session issue, LinkedIn keeps going. Each platform runs in its own try/except block inside the ThreadPoolExecutor.
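The isolation idea can be sketched roughly like this. It's a stripped-down illustration, not the production cron: `posters` is a hypothetical mapping from platform/account name to a zero-argument posting function, and `alert` is any callable that sends a message.

```python
from concurrent.futures import ThreadPoolExecutor


def run_platform(name, post_fn):
    """Run one platform's post and convert any exception into a result."""
    try:
        post_fn()
        return name, None
    except Exception as exc:
        return name, f"{type(exc).__name__}: {exc}"


def run_cycle(posters, alert, max_workers=4):
    """Post to all platforms in parallel; return {platform: error string}.

    `alert` is called once, and only if every platform failed.
    """
    if not posters:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda item: run_platform(*item), posters.items()))
    errors = {name: err for name, err in results if err is not None}
    if len(errors) == len(posters):
        alert("All platforms failed this cycle")
    return errors
```

Because `run_platform` never raises, one broken platform can't take the pool down with it. The returned dict is what would feed the end-of-cycle error summary.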

Real Example: The Instagram Share Button

Instagram's Share button was my longest-running failure. mouse.click() on the Share button simply does nothing — no error, no response, just silence. The agent would click, wait for confirmation, time out, and log a failure.

I tried:

  • Direct click → nothing
  • JavaScript click → nothing
  • Coordinate-based click → nothing
  • Force click → nothing

The solution that finally worked: Tab navigation to the Share button, then Enter. That's it. Keyboard navigation is the only method Instagram respects for that specific button.

# Move keyboard focus onto the Share button, then activate it.
await page.keyboard.press("Tab")
await page.keyboard.press("Enter")

This took days to figure out. But once I found it, the fix was two lines, and all three Instagram accounts benefited immediately.
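If focus doesn't start right next to the button, a more general version of the same trick is to keep tabbing until the focused element matches. This is a hypothetical helper, assuming a Playwright-style async `page` with `keyboard.press` and `evaluate`. The helper name, the matching rule, and the `max_tabs` limit are all my own choices for the sketch.

```python
import asyncio

# JavaScript run in the page to read the focused element's label or text.
ACTIVE_LABEL_JS = """() => {
  const el = document.activeElement;
  if (!el) return "";
  return el.getAttribute("aria-label") || el.textContent || "";
}"""


async def press_via_keyboard(page, label, max_tabs=40):
    """Tab through the page until the focused element matches `label`, then press Enter.

    Returns True if the element was found and activated, False otherwise.
    """
    for _ in range(max_tabs):
        await page.keyboard.press("Tab")
        focused = await page.evaluate(ACTIVE_LABEL_JS)
        if focused.strip().lower() == label.lower():
            await page.keyboard.press("Enter")
            return True
    return False
```

Returning False instead of raising lets the caller treat "never found the button" as a DOM failure and grab a screenshot.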

What I Track

Every failure gets logged with:

  • Timestamp
  • Platform and account
  • Error type (transient/session/DOM/architectural)
  • Screenshot path
  • Retry count before giving up

I review cron-errors.log once daily. Most days it's empty. When it's not, the structured logging tells me exactly what happened and where to look.
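A structured record with those fields could look something like this. It's a sketch, assuming one JSON object per log line; the class name, field names, and JSON format are illustrative, not the actual layout of cron-errors.log.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

ERROR_TYPES = {"transient", "session", "dom", "architectural"}


@dataclass
class FailureRecord:
    platform: str
    account: str
    error_type: str
    retries: int
    screenshot_path: Optional[str] = None
    timestamp: str = ""

    def __post_init__(self):
        if self.error_type not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {self.error_type}")
        if not self.timestamp:
            # Default to the current time in UTC, ISO 8601 format.
            self.timestamp = datetime.now(timezone.utc).isoformat()

    def to_log_line(self) -> str:
        """Serialize the record to a single JSON line."""
        return json.dumps(asdict(self), sort_keys=True)
```

One JSON object per line keeps the log greppable and easy to load into a script when you want failure counts per platform or per error type.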

The Mindset

If you're building automation and expecting zero failures, you're going to be disappointed. The internet is messy, platforms are hostile to automation, and browsers are unpredictable.

Build for failure from day one. Wrap everything in error handling. Retry the transient stuff. Alert on the critical stuff. And never let one broken platform take down the whole system.

The automation tools I use — including the retry logic, error handling, and multi-platform posting system — are available at axon.nepa-ai.com. Built from real failures, not theoretical architectures.