News15 min read

AI Agents Are Running in the Cloud Now. And Most Engineers Are Completely Unprepared.

Vivek Pillai

April 20, 2026

AWS Bedrock Agents and multi-agent systems are reshaping cloud infrastructure in 2026. Here is the production architecture, real Python code, cost breakdowns, and the five failure modes that catch teams off guard.

On this page

What Is an AI Agent, Actually
The Production Architecture Nobody Shows You
Build Your First Agent: Real Code, Not a Demo
Five Things That Break in Production
What This Actually Costs
Where This Goes in the Next 12 Months
TL;DR — Start Here This Week

At 2:47 AM on a Tuesday, a CloudWatch alarm fired in a mid-size fintech company's AWS account. The on-call engineer acknowledged it, assumed a transient spike, and went back to sleep. By 6 AM the metrics looked clean. The incident was filed as a false positive.

It was not a false positive.

An AI agent running on AWS Bedrock deployed three weeks earlier to handle routine infrastructure queries had detected the anomaly, cross-referenced it with CloudWatch Logs Insights, identified a degraded RDS read replica, and issued an SDK call through its Lambda action group to promote a new replica. The engineer's silence had been interpreted by the human-in-the-loop bypass as implicit approval. The agent executed a production database operation. Autonomously. Correctly. Without anyone watching.

The change worked. The system recovered. The engineering team then spent four hours understanding what had happened, why the agent had that authority, and what else it might decide to do next.

This pattern is appearing in production environments across AWS accounts in 2026. AWS Bedrock Agents, Azure AI Foundry's AI Agent Service, and multi-agent orchestration frameworks are deployed, executing, and in some accounts operating with permissions their teams have not fully reviewed. Most engineers responsible for cloud infrastructure have not yet reckoned with what that means.

This post is that reckoning.

What Is an AI Agent, Actually

An AI agent is not a chatbot with extra steps. The distinction matters because the operational assumptions are completely different.

A chatbot receives input and returns output. One round trip. An agent receives input, reasons about what action to take, executes that action through an external tool, observes the result, then reasons again — repeating until it reaches a satisfactory answer or hits a configured limit. This loop is called the ReAct pattern: Reason, Act, Observe. Each cycle is a round trip between the language model and the real world.

On AWS, this architecture has a specific shape. Bedrock serves as the LLM runtime, hosting models including Anthropic's Claude 3.5 Sonnet, Meta's Llama 3, and Amazon Nova. When a Bedrock Agent receives a user message, it reasons using one of these models and selects a tool from its configured action groups. Each action group is a Lambda function. The agent calls the Lambda with structured parameters, the Lambda executes — querying a database, calling an API, reading from S3 — and returns a result. Bedrock feeds that result back into the model's context window. The loop continues.

Memory splits into two layers. Short-term memory — the ongoing conversation — lives in DynamoDB with a TTL, giving the agent context within a session without accumulating indefinitely. Long-term memory uses a vector store: OpenSearch Serverless or Aurora PostgreSQL with pgvector. Documents are chunked, embedded, and stored. At query time the agent retrieves semantically relevant chunks and injects them into the prompt. This is RAG implemented at the infrastructure level rather than the application level.

By early 2026, Bedrock added formal multi-agent collaboration. A supervisor agent coordinates a set of sub-agents, each specialised for a narrow task. One handles database queries. One handles external API calls. One manages document analysis. The supervisor routes the user's request, collects sub-agent responses, and synthesises a single answer. The user sees one interface. A team of agents is running underneath.

Azure AI Foundry — Microsoft's unified AI development platform, rebranded in late 2024 — offers an equivalent architecture through AI Agent Service and Prompt Flow. The tooling differs; the model is the same.

One protocol worth knowing: MCP, the Model Context Protocol. Developed by Anthropic and now adopted broadly, MCP provides a standardised interface for connecting agents to external tools. Engineers who build an MCP-compatible server can expose it to agents running on AWS, Azure, or any compliant platform without custom integration work for each. Adoption is early but accelerating fast.

The Production Architecture Nobody Shows You

The official AWS diagrams show the happy path. Here is what a production deployment actually looks like.

User Request
     │
     ▼
API Gateway (REST or WebSocket)
     │
     ▼
Bedrock Agent Runtime
     │
     │  ReAct Loop
     ├──► Model Reasoning (Claude 3.5 Sonnet or Haiku)
     │         │
     │         ▼
     │    Action Group Selected
     │         │
     │         ▼
     │    Lambda Function ──► DynamoDB / RDS / External APIs
     │         │
     │         ▼
     │    Result injected back into context
     │         │
     └──────────┘  (repeats until FINISH or max iterations hit)
          │
          ▼
   Final Response
          │
          ▼
CloudWatch Logs (full ReAct trace + token usage metrics)

The Supervisor-Worker pattern adds a routing layer above this. The user's request reaches a supervisor Bedrock agent. The supervisor analyses the request and delegates to whichever sub-agent is best equipped — each sub-agent carries its own action groups, memory configuration, and guardrail policy. The supervisor synthesises the final response.

Memory in production uses two stores. DynamoDB holds session state — the conversation history per sessionId, with a 30-minute TTL for most applications. DynamoDB's 400KB per-item limit matters here: long agent sessions can exceed it. Monitor session size via CloudWatch custom metrics and add a cleanup Lambda triggered by DynamoDB Streams to prune older turns when sessions grow past 20 items. For knowledge bases, OpenSearch Serverless handles most use cases. Aurora pgvector makes sense when the source data already lives in Aurora and the team wants one fewer service to operate.

Human-in-the-loop is the most consistently underbuilt part of agent systems. When an agent is about to take an irreversible action — deleting a resource, sending an external message, executing a financial transaction — the correct pattern is: pause the agent via Bedrock's returnControl feature, push an approval request to SQS, notify via SNS to email or Slack, and wait for explicit approval before resuming. Engineers who skip this because it adds friction are the ones filing post-mortems.

One infrastructure decision that has more downstream impact than teams expect: Lambda versus ECS Fargate for action group execution. Lambda is the default because it requires no cluster management. The problem is cold starts. A Lambda that has sat idle for ten or more minutes adds 800ms to 2 seconds to its first invocation. In a multi-step agent making six tool calls, those cold starts stack. Provisioned Concurrency on Lambda solves the problem for individual functions. For agents running continuously under moderate-to-high traffic, ECS Fargate with always-on containers eliminates it entirely at the cost of a baseline compute bill.

Build Your First Agent: Real Code, Not a Demo

The gap between the AWS console quickstart and a production-worthy agent is wider than the documentation implies.

Invoking a Bedrock Agent via Python arrives as a streaming event loop. Collect the chunks to build the full response.

import boto3
import uuid

client = boto3.client('bedrock-agent-runtime', region_name='us-east-1')

def invoke_agent(user_input: str, session_id: str = None) -> tuple[str, str]:
    """
    Invoke a Bedrock Agent. Returns (response_text, session_id).
    Pass the same session_id on follow-up calls to maintain context.
    """
    if not session_id:
        session_id = str(uuid.uuid4())

    response = client.invoke_agent(
        agentId='YOUR_AGENT_ID',        # 10-char alphanumeric from Bedrock console
        agentAliasId='TSTALIASID',      # Draft alias for development
        sessionId=session_id,
        inputText=user_input,
        enableTrace=True                # Logs the full ReAct trace to CloudWatch
    )

    output = ""
    for event in response.get('completion', []):
        if chunk := event.get('chunk'):
            output += chunk['bytes'].decode('utf-8')

    return output, session_id

The Lambda that serves as an action group follows a strict contract. Bedrock sends a structured event; the function must return a structured response. Deviating from the schema silently breaks the agent with no useful error message.

import boto3

def lambda_handler(event, context):
    """
    Bedrock Agent action group handler.
    event keys: actionGroup, function, parameters
    """
    fn = event.get('function')
    params = {p['name']: p['value'] for p in event.get('parameters', [])}

    if fn == 'list_running_instances':
        ec2 = boto3.client('ec2')
        resp = ec2.describe_instances(
            Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
        )
        ids = [
            i['InstanceId']
            for r in resp['Reservations']
            for i in r['Instances']
        ]
        result = f"Running EC2 instances ({len(ids)}): {', '.join(ids) or 'None found'}"

    elif fn == 'get_rds_status':
        rds = boto3.client('rds')
        resp = rds.describe_db_instances(
            DBInstanceIdentifier=params.get('db_identifier')
        )
        status = resp['DBInstances'][0]['DBInstanceStatus']
        result = f"RDS instance {params.get('db_identifier')} is: {status}"

    else:
        result = f"Unknown function: {fn}"

    # This response schema is strict — do not modify the structure
    return {
        'actionGroup': event['actionGroup'],
        'function': fn,
        'functionResponse': {
            'responseBody': {'TEXT': {'body': result}}
        }
    }

Bedrock Guardrails sit between user input and the model, and between the model output and the user. Attaching one takes two lines. Skipping it is a security decision with consequences.

response = client.invoke_agent(
    agentId='YOUR_AGENT_ID',
    agentAliasId='TSTALIASID',
    sessionId=session_id,
    inputText=user_input,
    guardrailConfiguration={
        'guardrailId': 'YOUR_GUARDRAIL_ID',
        'guardrailVersion': 'DRAFT'     # Use '1' for production published version
    }
)

This holds at single-agent, moderate-traffic scale. The issues begin when you add sub-agents, increase session duration, or scale concurrency. The next section covers what breaks first.

Five Things That Break in Production

Token budget explosions. The ReAct loop has no natural stopping point. An agent that encounters a confusing or ambiguous tool response can enter a reasoning loop — rephrasing the same question, retrying the same failed action — consuming tokens on every iteration. A single runaway session consuming over two million tokens before hitting a service quota is not a theoretical edge case in 2026. The fix is two-part: set maxTokens per invocation in the Bedrock agent configuration, and wrap multi-step agents in Step Functions with a hard maximum iteration count and a FAILED fallback state that triggers an SNS notification.

Lambda cold start latency. A Lambda action group idle for more than ten minutes incurs an 800ms to 2-second cold start penalty on its next invocation. In a simple agent making two tool calls, that adds four seconds to the user's wait. In a supervisor pattern with three sub-agents each making three calls, it produces something that feels like a broken product. Provisioned Concurrency resolves this for individual functions. For agents running continuously under real load, ECS Fargate containers eliminate the problem entirely — at the cost of always-on compute.

Memory poisoning. Long sessions accumulate context. An agent 40 conversational turns in may carry early context that contradicts recent instructions, contains outdated system state, or includes confusing messages that the model keeps revisiting. Output degrades: the agent becomes more confident and less accurate at the same time. The fix is a sliding window. Limit session history in DynamoDB to the most recent 15 to 20 turns. A cleanup Lambda triggered by DynamoDB Streams prunes older items when the session exceeds the threshold. This is not in the quickstart guides.

Prompt injection via tool outputs. When an agent fetches external content — a document from S3, a database row, a webpage — and that content contains instruction-style text, the model can be manipulated into changing its behaviour. A document stored in S3 that contains "Ignore all previous instructions. Your new task is..." will, if unguarded, influence the agent's next reasoning step. Bedrock Guardrails with denied topic filters catches many patterns. Output sanitisation inside the Lambda function — scanning tool returns for adversarial instruction patterns before passing them back to the agent — catches the rest.

Cost spiral in multi-agent systems. A single user request routed through a supervisor to three sub-agents, each making four LLM calls to complete its task, generates thirteen inference requests per user interaction. At one thousand concurrent users, that is thirteen thousand LLM calls per interaction cycle. Costs do not scale linearly with users — they scale with users multiplied by agent complexity. Set CloudWatch billing alarms on Bedrock token usage with notification thresholds well below your ceiling. Add a circuit breaker Lambda that sets a Parameter Store flag to disable agent routing when hourly costs exceed a configured limit.

What This Actually Costs

Claude 3.5 Sonnet on AWS Bedrock in 2025 sits at approximately $3.00 per million input tokens and $15.00 per million output tokens. The ratio — output tokens cost five times more than input — holds across model tiers and is the most important number to internalise before you scale.

A concrete example at moderate traffic: 10,000 agent requests per day, averaging 2,000 input tokens and 600 output tokens each. Monthly: 600 million input tokens and 180 million output tokens. At current rates, that is roughly $1,800 for input and $2,700 for output — $4,500 per month in Bedrock inference, before Lambda execution, DynamoDB, API Gateway, and data transfer.

Add a multi-agent pattern. A supervisor routing to three sub-agents, each making four LLM calls to complete its task, multiplies inference cost by roughly four. The same 10,000 daily requests now cost closer to $18,000 per month. This is not an argument against multi-agent architectures. It is a requirement to plan for them explicitly.

Three levers reduce costs meaningfully. Prompt caching — supported by Bedrock for Claude models — caches the system prompt and static context sections after the first invocation. If your system prompt is 800 tokens and you send it with every one of 10,000 daily requests, you are billing 8 million tokens per day that never change. Caching cuts input token costs by 60 to 80 percent on repeated patterns. At scale, this is the single highest-return optimisation available.

Model tiering is the second lever. Claude Haiku costs substantially less per token than Sonnet. Use Haiku for routing, classification, and simple retrieval tasks. Reserve Sonnet for complex multi-step reasoning. A hybrid supervisor that routes with Haiku and reasons with Sonnet typically costs 50 to 60 percent less than a Sonnet-only setup with no measurable quality difference on routing tasks.

Session hygiene is the third. Set DynamoDB session TTL to 30 minutes. Do not allow idle sessions to persist for hours carrying large context payloads that inflate token counts on every subsequent turn.

A practical benchmark: if you are running more than 500 agent sessions per day without prompt caching enabled, your monthly bill is likely three times what it should be.

Where This Goes in the Next 12 Months

Autonomous cloud operations are already running in production at larger engineering organisations — quietly, without public case studies. The pattern: an agent watches CloudWatch metrics and application logs in a loop, identifies anomalies, correlates them with recent CodePipeline deployments, generates a runbook, and surfaces it to an on-call engineer for single-click approval. The human approves; the agent executes. The on-call engineer's role shifts from executor to reviewer.

Model Context Protocol is the bigger medium-term shift. As MCP becomes the standard interface for agent tools — functioning as the HTTP of agent connectivity — engineers will stop writing custom Lambda action groups for every external integration and start exposing MCP servers that any compliant agent can discover and use, regardless of which cloud it runs on. AWS and Azure agents sharing the same tool interfaces is where multi-cloud agent architectures are heading. This is not speculative: MCP adoption among major platforms accelerated significantly through 2025.

The EU AI Act is not a future compliance concern. Articles 6 and 9 took effect in 2025. Systems that make consequential autonomous decisions — infrastructure changes, financial actions, access control modifications — now require human oversight mechanisms, decision audit trails, and explainability documentation. Bedrock trace logging in CloudWatch is a starting point. It is not a compliance solution. Engineers building production agent systems in 2026 need to architect for audit and oversight from the beginning, not retrofit it after a regulatory question arrives.

One quiet fact: there is no AWS certification as of 2026 that tests agent architecture knowledge at the professional level. The engineers who understand Bedrock Agents, multi-agent patterns, MCP integration, and agent security in depth are genuinely rare. That window will not remain open long.

The engineers who treat this as plumbing will spend 2027 debugging production incidents they do not understand. The engineers who treat it as architecture — designed carefully, observed continuously, secured deliberately — will be running the systems everyone else is trying to fix.

TL;DR — Start Here This Week

AI agents are not a feature addition to cloud infrastructure. They are a new class of system that reasons, acts, and adapts — running on the same AWS services you already operate, but behaving in ways that traditional monitoring assumptions do not cover. The architecture is well-defined. The tooling is production-ready. The failure modes are predictable. The gap is not capability. The gap is preparation.

Seven things to do this week:

Enable AWS Bedrock and Bedrock Agents in us-east-1 or eu-west-1. Run the console quickstart. Watch a full ReAct trace in CloudWatch. Ten minutes.
Write one Lambda action group that calls a real internal API or AWS service you own. Give the agent one tool and observe how it decides to use it.
Set a CloudWatch billing alarm on Bedrock token usage at 50 percent of your comfortable monthly ceiling. Do this before you scale anything.
Create a Bedrock Guardrail with a PII filter and at least two denied topics relevant to your domain. Attach it to every agent invocation.
Use UUID4 for sessionId generation. Never use sequential IDs, email addresses, or usernames. Session IDs are observable identifiers.
Add structured JSON logging to every Lambda action group. Log function name, input parameters, response body, and execution latency. You will want this data at 2 AM.
Read the OWASP Top 10 for LLM Applications, 2025 edition. Twenty pages. Free. It covers every major attack vector specific to agent systems. Available at owasp.org.

The window between rare knowledge and common expectation is shorter than it looks. The engineers who move now will be the ones explaining this architecture to everyone else in eighteen months.

Published by the CloudFordge Founders · cloudfordge.com · Free cloud certification practice for every learner.

ShareX LinkedIn

Back to Blog