Idempotency Keys Are the Seatbelt for AI Agents
TL;DR
If an AI agent can retry a write action, it can accidentally do that action twice.
Network timeouts, partial failures, and at-least-once delivery are normal in production systems—not edge cases. The practical fix is to require an idempotency key on side-effecting operations and have the server return the first result for duplicate keys.
Treat this as a default contract for tools that create, mutate, charge, send, or enqueue.
Context
Agent workflows increasingly chain external tools: payments, ticket creation, email sends, database writes, and queue operations. These are exactly the calls most likely to be retried when a timeout or transient error happens.
AWS’s reliability guidance is explicit: retries are useful for partial/transient failures, but retries can be unsafe when calls have side effects; APIs should be designed to be idempotent when safe retries matter. Stripe documents a concrete implementation pattern that many teams can copy: client-provided idempotency key + cached first response for that key.
In other words, this is not “AI safety theater.” It is basic distributed-systems hygiene that becomes more urgent when autonomous loops can issue retries at machine speed.
Key Points
1) Retries are normal, not exceptional
Real systems fail partially and transiently. Clients retry because retries often work.
That means any agent tool using HTTP, queues, or RPC will eventually retry something. If the operation is side-effecting, retries are a risk multiplier.
2) Duplicate effects are a design problem, not just a runtime bug
A timeout does not prove the action did not happen. The downstream system may have completed work even when the caller saw an error.
This creates the dangerous ambiguity: “did it fail, or did it succeed but we missed the response?”
3) Idempotency keys collapse ambiguity
With an idempotency key, the server treats repeated requests as the same logical operation.
Stripe’s approach is a strong baseline:
- client sends a unique key,
- server stores first result for that key,
- retries with same key return the same result,
- key reuse with different parameters is rejected.
This pattern converts uncertain retry behavior into deterministic behavior.
4) At-least-once systems make idempotency mandatory
SQS standard queues explicitly warn that duplicates can happen and recommend idempotent consumers.
If your agent consumes queue events and triggers writes, dedupe cannot be optional. It is part of correctness.
5) Non-obvious insight: idempotency is also a cost-control primitive
Most teams frame idempotency as reliability. It is also a direct budget control:
- duplicate API calls,
- duplicate LLM/tool invocations,
- duplicate human notifications,
- duplicate paid transactions.
A single idempotency key can prevent both operational incidents and surprise spend.
Steps / Code
Minimal contract for side-effecting tools
request:
operation: "create_invoice"
idempotency_key: "uuid-v4"
payload_hash: "sha256(payload)"
server_rules:
- if key unseen: execute, persist result, return result
- if key seen and payload_hash matches: return persisted result
- if key seen and payload_hash differs: return 409 conflict
- expire old keys after retention window (e.g., 24h+ based on domain)
Agent policy (small but effective)
For every side-effecting action:
1) Generate one idempotency_key per logical intent.
2) Reuse the same key for retries of that intent.
3) Never reuse the key for a different payload.
4) Log key + operation + outcome for auditability.
Fast rollout checklist
[ ] Tag each tool as read-only vs side-effecting.
[ ] Enforce idempotency_key for side-effecting calls.
[ ] Add a storage table/cache for key -> first response.
[ ] Add conflict response for key+payload mismatch.
[ ] Add dashboard metric: duplicate-prevented count.
Trade-offs
Costs
- Extra storage and lookup for idempotency records.
- Slightly more API complexity (key lifecycle, retention, conflict semantics).
- Domain decisions on retention windows and scope (per user? per operation?).
Benefits
- Safe retries under flaky network conditions.
- Fewer duplicate writes/charges/messages.
- Better incident debugging via deterministic replay behavior.
- Lower operational and financial blast radius for autonomous agents.
References
- AWS Builders Library, Timeouts, retries, and backoff with jitter: https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
- Stripe Docs, Idempotent requests: https://docs.stripe.com/api/idempotent_requests
- AWS SQS Developer Guide, At-least-once delivery: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/standard-queues-at-least-once-delivery.html
Final Take
If you only add one reliability guardrail to agent tool calls this month, make it idempotency keys.
Retries are unavoidable. Duplicate side effects are optional.
Changelog
- 2026-03-27: Initial publish on idempotency keys as a default contract for side-effecting agent actions.