Securing a Generative AI Chatbot on AWS: Part 3 Prompt Manipulation

How attackers hijack your AI’s instructions - and how Amazon Bedrock Guardrails stops them

May 09, 2026

‼️ This blog is part of a three-part series where we explore, attack, and secure a real generative AI chatbot application running on AWS.

In Part 2, we looked at insecure plugins what happens when the code behind the AI hands out data to anyone who asks, regardless of their role. The fix was to move access control out of the AI and into the application code, where it becomes deterministic and cannot be bypassed through conversation.

Part 3 is about a different kind of threat. One that does not target the plugin code at all. It targets the AI itself by manipulating the instructions it has been given.

What is prompt manipulation?

Every GenAI application gives the language model a set of instructions at the start of each conversation. These are called the system prompt, and they define how the model should behave what it can talk about, how it should respond, what it should refuse. The entire personality and behaviour of the chatbot is defined here.

Prompt manipulation also known as prompt injection, documented as OWASP LLM01:2025 is the act of crafting user inputs that cause the model to ignore or override those instructions. The attacker is not breaking into the system. They are having a conversation with it. But the conversation is designed to make the AI forget its rules and do something it was never supposed to do.

This can happen in two ways. Direct prompt injection is when the attacker types the malicious input themselves into the chat interface. Indirect prompt injection is more subtle the malicious instructions are embedded in a document, a webpage, or an image that the AI reads as part of answering a legitimate question. The AI processes the content and unknowingly follows the instructions hidden within it.

The consequences range from the AI revealing information it was instructed to keep confidential, to generating harmful or misleading content, to taking actions on behalf of the user that were never requested. In a healthcare context, any of these outcomes carries real risk.

it is worth getting some hands-on intuition for how prompt injection actually works. Prompt Airlines is a free, safe sandbox specifically built for this it simulates a customer service chatbot and challenges you to manipulate it through crafted prompts. No setup, no risk, just a practical feel for how these attacks work in practice. - Curtsey of Wiz.

Protect - configuring Amazon Bedrock Guardrails

Amazon Bedrock Guardrails is a native AWS service that sits between the user and the language model, evaluating every input and output against a set of configurable policies. One of those policies is specifically designed to detect and block prompt injection attacks it analyses the structure and intent of incoming messages to identify attempts to override or manipulate the model’s instructions.

The process has two parts: creating the guardrail, and attaching it to the Bedrock Agent.

Creating the guardrail involves configuring the prompt attack filter setting the sensitivity level and defining what the guardrail should do when it detects an injection attempt (block the request and return a safe fallback response, rather than passing it to the model).

Attaching it to the agent means that every request flowing through the chatbot application is automatically evaluated by the guardrail before it reaches the model. No changes to the application code are needed. The protection is applied at the infrastructure layer.

Custom AWS Config Rule Using Lambda

The code below represents a custom AWS Lambda function that acts as the engine behind the AWS Config rule. When Config triggers an evaluation, this is the function that runs. It iterates through every active Bedrock Agent in the account, checks whether a guardrail is attached, and then inspects that guardrail's configuration to confirm the prompt attack filter is not just present but actually set to block. The result of each check compliant or non-compliant is reported back to AWS Config, which surfaces it in the console and can trigger alerts or remediation workflows from there.

What is a custom Config rule and how does this one work?

AWS Config has a library of managed rules out of the box pre-built checks for common misconfigurations like S3 buckets being publicly accessible or MFA not being enabled on root accounts. But managed rules only cover what AWS has already anticipated. For anything specific to your application like checking whether a Bedrock Agent has a guardrail configured you write a custom rule.

A custom Config rule is a Lambda function that AWS Config invokes on a schedule or whenever a relevant resource changes. The function receives information about the resource, evaluates it against whatever logic you define, and returns a compliance verdict: compliant or non-compliant.

In this case, the custom rule does three things. First, it lists every Bedrock Agent in the account that is in a prepared state meaning actively deployed and handling requests. Second, for each agent, it checks whether a guardrail is attached. If there is no guardrail at all, the agent is immediately flagged as non-compliant. Third, if a guardrail is present, the rule goes one level deeper and inspects the guardrail’s configuration specifically checking that a PROMPT_ATTACK filter exists, that it is not set to NONE, and that the action is set to BLOCK. A guardrail that exists but has prompt attack filtering disabled or set to monitor-only would still fail the check. It is not enough to have a guardrail it has to be doing the right thing.

The result is a continuous, automated compliance check that tells you, at any point in time, whether your Bedrock Agents are protected against prompt injection at the infrastructure level. Not just at deployment, but on an ongoing basis so if someone accidentally misconfigures a guardrail or removes one entirely, it surfaces immediately rather than sitting undetected until the next manual review.

import json
from rdklib import Evaluator, Evaluation, ConfigRule, ComplianceType

APPLICABLE_RESOURCES = ["AWS::Bedrock::Agent"]

LIST_AGENTS_PAGE_SIZE = 100

REQUIRED_CONTENT_POLICY_FILTER_TYPES = ["PROMPT_ATTACK"]

class bedrockAgentGuardrails(ConfigRule):
    def evaluate_change(
        self, event, client_factory, configuration_item, valid_rule_parameters
    ):
        ###############################
        # Add your custom logic here. #
        ###############################

        return [Evaluation(ComplianceType.NOT_APPLICABLE)]

    def evaluate_periodic(self, event, client_factory, valid_rule_parameters):
        evaluations = []

        bedrock_client = client_factory.build_client("bedrock")
        agent_client = client_factory.build_client("bedrock-agent")

        # Get a list of Bedrock Agents using a paginator
        paginator = agent_client.get_paginator("list_agents")
        page_iterator = paginator.paginate(
            PaginationConfig={"PageSize": LIST_AGENTS_PAGE_SIZE}
        )
        print(f"Looking for agents in status PREPARED in {page_iterator}:")

        # look for agents in status PREPARED: and those that don't have a guardrail associated are non-compliant
        for page in page_iterator:
            print(f"Looking at page {page}")
            for agent in page["agentSummaries"]:
                if agent["agentStatus"] == "PREPARED":
                    agent_id = agent["agentId"]
                    agent_name = agent["agentName"]
                    if "guardrailConfiguration" in agent:
                        guardrail_id = agent["guardrailConfiguration"][
                            "guardrailIdentifier"
                        ]
                        guardrail_version = agent["guardrailConfiguration"][
                            "guardrailVersion"
                        ]

                        guardrail_is_compliant = self.check_guardrail_rules(bedrock_client, guardrail_id, guardrail_version)
                        if(guardrail_is_compliant):
                            evaluations.append(
                                Evaluation(
                                    ComplianceType.COMPLIANT,
                                    agent_id,
                                    "AWS::Bedrock::Agent",
                                    annotation=f"Agent '{agent_name}' has guardrail '{guardrail_id}' version '{guardrail_version}' and it has the Prompt Attack content filter.",
                                )
                            )
                        else:
                            evaluations.append(
                                Evaluation(
                                    ComplianceType.NON_COMPLIANT,
                                    agent_id,
                                    "AWS::Bedrock::Agent",
                                    annotation=f"Agent '{agent_name}' has guardrail '{guardrail_id}' version '{guardrail_version}' but it does not have the Prompt Attack content filter.",
                                )
                            )
                    else:
                        evaluations.append(
                            Evaluation(
                                ComplianceType.NON_COMPLIANT,
                                agent_id,
                                "AWS::Bedrock::Agent",
                                annotation=f"Agent '{agent_name}' does not have a guardrail configured.",
                            )
                        )

        print(f"Responding with Evaluations {evaluations}")

        return evaluations

    def evaluate_parameters(self, rule_parameters):
        valid_rule_parameters = rule_parameters
        return valid_rule_parameters

    def check_guardrail_rules(self, bedrock_client, guardrail_id, guardrail_version):
        print(f"Checking content filtering rules for guardrail {guardrail_id} version {guardrail_version}")
        guardrail_contents = bedrock_client.get_guardrail(
            guardrailIdentifier=guardrail_id,
            guardrailVersion=guardrail_version,
        )
        content_policy = guardrail_contents.get("contentPolicy", None)
        if(content_policy is None):
            return False
        
        filters = content_policy.get("filters", None)
        if(filters is None):
            return False
        
        for filter_type in REQUIRED_CONTENT_POLICY_FILTER_TYPES:
            filter_compliant = False
            for f in filters:
                if(f["type"] == filter_type):
                    input_strength = f["inputStrength"]
                    output_strength = f["outputStrength"]
                    # default to BLOCK if the Actions are missing
                    input_action = f.get("inputAction", "BLOCK")
                    output_action = f.get("outputAction", "BLOCK")
                    if(filter_type == "PROMPT_ATTACK" and 
                       input_strength != "NONE" and input_action == "BLOCK"): # Prompt Attack only checks inputs
                        filter_compliant = True
                    elif(input_strength != "NONE" and input_action == "BLOCK" and
                         output_strength != "NONE" and output_action == "BLOCK"): # All other filters check input and outputs
                        filter_compliant = True
                    else:
                        filter_compliant = False

                    if(filter_compliant):
                        print(f"Content filter is compliant: {json.dumps(f)}")
                    else:
                        print(f"Content filter is non-compliant: {json.dumps(f)}")

                    break
            if(not filter_compliant):
                return False

        return True



################################
# DO NOT MODIFY ANYTHING BELOW #
################################
def lambda_handler(event, context):
    my_rule = bedrockAgentGuardrails()
    evaluator = Evaluator(my_rule, APPLICABLE_RESOURCES)
    return evaluator.handle(event, context)

Now let’s build some guardrails to prevent prompt injection from happening. To do this, there are 2 steps we need to follow. The first is creating the guardrail, the second is attaching it to the Bedrock agent that out chatbot application is interacting with.

Creating a guardrail in AWS bedrock with prompt attack filters One thing worth highlighting before we get into the controls is where responsibility actually sits. The AWS documentation on prompt injection security makes this explicit: under the AWS Shared Responsibility Model, AWS secures the underlying infrastructure the data centres, the networking, the Bedrock service itself. But securing the application built on top of it is the customer’s responsibility. Prompt injection is an application-level vulnerability, in the same way SQL injection is. AWS providing a secure foundation model does not mean your chatbot is secure that depends entirely on how you have built and configured what sits around it. The guardrails, the Config rules, the monitoring pipeline all of that is on you. Which is exactly why this series exists.

When you create a Bedrock Guardrail, there are several layers of protection available - and it is worth understanding what each one does even if you are not configuring all of them right now.

Content filters are what we are focused on here. They evaluate every input and output against categories like prompt attacks, hate speech, violence, and sexual content. For each category you set a sensitivity level and define what happens when something is detected block it, or let it through. This is the primary control for stopping prompt injection attacks from reaching the model.
Denied topics let you define subjects the chatbot should never engage with, regardless of how the question is framed. If you want the application to stay strictly within its intended purpose and refuse anything outside that scope, this is where you enforce it.
Word filters are more granular specific terms or phrases you want blocked from appearing in either the input or the output. Useful for catching known attack patterns or enforcing terminology policies.
Sensitive information filters detect and redact PII before it reaches the user. In a healthcare application this matters enormously patient names, dates of birth, phone numbers, NHS numbers, medication details could all surface in a model response if the underlying data is not tightly controlled. The filter acts as a last line of defence: even if the model retrieves and attempts to return that data, the guardrail intercepts it and either masks it with a placeholder or blocks the response entirely. In regulated industries, inadvertent PII disclosure is not just a security incident it is a compliance one.
Contextual grounding checks address hallucination one of the most underappreciated risks in RAG-based applications. A language model does not just retrieve and repeat information from your knowledge base, it generates a response influenced by it. And sometimes it generates content that sounds completely plausible but is not supported by the source documents at all. In a medical context, a hallucinated response about treatment options or drug interactions could cause real harm. The grounding check scores how well the model’s response is actually anchored in the source material, and can block responses that fall below a defined confidence threshold. It is the guardrail equivalent of asking: did you actually get that from the document, or did you make it up?
Automated Reasoning checks go further still. Rather than filtering content, they evaluate whether the model’s response is logically consistent with a formal set of rules or policies you define. In a healthcare setting that might cover what a receptionist is authorised to action, what information can be shared over certain channels, or the correct process for specific request types. If the model’s response contradicts those defined rules, the guardrail can intervene moving beyond content filtering into policy-level reasoning about whether the output is actually correct.

Once the guardrail is in place, the AWS Config rule re-evaluates and the agent moves from non-compliant to compliant. The infrastructure check and the application-level protection are now aligned.

I re-run the promptfoo test suite. This time the prompt injection attempt is blocked by the guardrail and the test passes.

Detect - monitoring when the guardrail is triggered

Deploying a guardrail is not the end of the story. The next question is: how do you know when someone is actually attempting a prompt injection attack against your application?

The application sends its logs to a CloudWatch log group. A metric filter has been configured on that log group to parse each log entry and look for two specific things: the username and an event type of PROMPT_ATTACK. Every time the guardrail blocks a prompt injection attempt, that event is logged and the metric filter increments a counter, tagged with the username of whoever sent the blocked request.

This gives you a real-time view of who is triggering the guardrail, how frequently, and whether the pattern looks like an active attack or an accidental false positive.

The metric filter turns raw log data into something actionable. Instead of manually reading through logs to find suspicious activity, you have a structured metric you can alert on and graph over time.

The filter has the following configurations:

filter pattern - this is what each log entry is evaluated against. In this metric, we are looking for 2 specific items - the username, and the event type of PROMPT_ATTACK
Metric - this is the name that metrics will be aligned with
Metric value - this is the value we assign for each time there is a match
Unit - this is set of count
Dimensions - we are extracting the usernames from the metric,so that we can create alarms for each individual user.

Respond - automating the response to an active attack

Detecting a prompt injection attempt is useful. Automatically responding to one is better.

The final piece of the puzzle is an automated response pipeline. The CloudWatch metric feeds into a CloudWatch alarm configured with a threshold when the number of blocked prompt attacks from a single user exceeds that threshold within a given time window, the alarm triggers. The alarm invokes a Lambda function which disables the user account responsible for the attacks. The architecture looks like this:

The result has now come back as a PASS, indicating that the guardrail has blocked the attempt to manipulate the prompt.

The result is a fully automated detect-and-respond loop. A user starts sending prompt injection attempts. The guardrail blocks each one. The metric filter logs each block. When the alarm threshold is hit, the Lambda function fires and the user is disabled without any manual intervention.

This is what a mature response posture looks like for GenAI applications. Not just blocking the attack at the point of impact, but using that signal to trigger a broader response automatically.

Under the hood: The threshold in this setup is deliberately low to demonstrate the automation quickly. In a production environment you would tune this carefully — high enough to avoid disabling legitimate users who accidentally trigger a guardrail, low enough to catch genuine attack patterns before significant damage is done. The pattern itself — metric filter, alarm, Lambda — is reusable across any event type you want to respond to automatically.

Pulling it together

The pattern across this module maps directly to the NIST Cybersecurity Framework:

Identify - manual and automated testing established that the application was vulnerable to prompt injection, and an AWS Config rule confirmed the guardrail was missing at the infrastructure level.

Protect - a Bedrock Guardrail with prompt attack filtering was created and attached to the Bedrock Agent, blocking injection attempts before they reach the model.

Detect - a CloudWatch metric filter monitors the application logs for guardrail trigger events, giving real-time visibility into attack attempts by user.

Respond - a CloudWatch alarm connected to a Lambda function automatically disables users who exceed the attack threshold, closing the loop without manual intervention.

References:

OWASP LLM01:2025 Prompt Injection Amazon Bedrock Guardrails Amazon Bedrock Agents AWS Config Amazon CloudWatch Amazon CloudWatch custom metric filters AWS Lambda NIST Cybersecurity Framework

Share Coffee & Cloud ☁️ ☕️

Coffee & Cloud ☁️ ☕️

Discussion about this post

Ready for more?