Production AI is Terrifying (And How to Fix It)
If your agent doesn't have guardrails, you aren't ready for production.
Nobody sets out to build an unsafe AI system. You write instructions, you test edge cases, you add a few validation rules. Then someone figures out they can trick your bot into roleplaying as a pirate and exposing user data. Or a credit card number ends up in your logs. Or the model confidently recommends a competitor’s product.
The gap between “works in the demo” and “safe in production” is wider than most teams expect.
Part of the problem is that raw LLMs don’t have opinions about what they should or shouldn’t do. They’re prediction machines that try to continue whatever pattern you’ve started. Give them a prompt that looks like “system override mode,” and they’ll happily play along. This isn’t a bug in the model; it’s just how language models work.
Most frameworks hand you the model and wish you luck. Mastra takes a different approach: it assumes you’ll need guardrails eventually, so it builds them into the agent architecture from the start.
Processors as Safety Layers
The core mechanism is straightforward. Before your prompt reaches the model, it passes through a chain of input processors. After the model responds, output processors get their turn. Each processor can inspect, modify, or block the content at that stage.
Think of them as middleware for AI interactions. You stack the ones you need, configure their behavior, and they run automatically on every request.
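As a rough sketch of what that stacking looks like (using two of the processors covered in the sections below; their exact options are explained there):

```typescript
import { Agent } from '@mastra/core/agent';
import { UnicodeNormalizer, ModerationProcessor } from '@mastra/core/processors';
import { openai } from '@ai-sdk/openai';

// Processors form a chain: each one sees the content left by the one before it,
// so normalization runs first and classification sees cleaned text.
export const layeredAgent = new Agent({
  id: 'layered-assistant',
  name: 'layered-assistant',
  instructions: 'You are a helpful assistant.',
  model: openai('gpt-5'),
  inputProcessors: [
    // 1. Clean up the raw input
    new UnicodeNormalizer({
      id: 'unicode-normalizer',
      stripControlChars: true,
      collapseWhitespace: true,
    }),
    // 2. Classify the cleaned input
    new ModerationProcessor({
      id: 'moderation-processor',
      model: openai('gpt-5-nano'),
      categories: ['hate', 'harassment'],
      threshold: 0.7,
      strategy: 'block',
      instructions: 'Detect harmful content',
    }),
  ],
});
```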
1. Stopping the Pirates (Prompt Injection)
Prompt injection attacks have gotten creative. People use invisible Unicode characters, write instructions in base64, or convince the model they’re in “debug mode” where normal rules don’t apply. The techniques keep evolving.
Mastra includes processors that catch common patterns:
```typescript
import { Agent } from '@mastra/core/agent';
import { PromptInjectionDetector, UnicodeNormalizer } from '@mastra/core/processors';
import { openai } from '@ai-sdk/openai';

export const secureAgent = new Agent({
  id: 'fortress-assistant',
  name: 'fortress-assistant',
  instructions: 'You are a secure assistant.',
  model: openai('gpt-5'),
  inputProcessors: [
    // 1. Scrub invisible characters
    new UnicodeNormalizer({
      id: 'unicode-normalizer',
      stripControlChars: true,
      collapseWhitespace: true,
    }),
    // 2. Detect the attempt
    new PromptInjectionDetector({
      id: 'prompt-injection-detector',
      model: openai('gpt-5-nano'), // Cheap, fast
      threshold: 0.8,
      strategy: 'block', // Hard stop
      detectionTypes: ['injection', 'jailbreak', 'system-override'],
    }),
  ],
});
```

The UnicodeNormalizer strips out control characters and collapses whitespace. The PromptInjectionDetector analyzes the cleaned input for patterns that suggest someone’s trying to override your instructions.
You configure how aggressive you want the detection to be (the threshold parameter) and what should happen when it trips (block, log, or just flag it).
2. Handling PII
Credit card numbers in logs, Social Security numbers in vector databases, email addresses stored longer than necessary. These are the kinds of issues that turn into regulatory problems. The challenge is that users don’t always realize they’re pasting sensitive data into a chat window.
The PIIDetector scans for common patterns before they reach your model or get written to storage:
```typescript
import { PIIDetector } from '@mastra/core/processors';

export const privateAgent = new Agent({
  id: 'privacy-first-assistant',
  name: 'privacy-first-assistant',
  instructions: 'You are a helpful assistant that never stores personal information.',
  model: openai('gpt-5'),
  inputProcessors: [
    new PIIDetector({
      id: 'pii-detector',
      model: openai('gpt-5-nano'),
      detectionTypes: ['email', 'phone', 'credit-card', 'ssn'],
      threshold: 0.6,
      strategy: 'redact',
      redactionMethod: 'mask', // Replace with [REDACTED]
      instructions: 'Detect and mask personally identifiable information',
    }),
  ],
});
```

You can choose to redact (replace with [REDACTED]), hash, or block entirely. The processor runs on both input and output, so you’re covered even if the model somehow generates sensitive data in its response.
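If you want the response side covered explicitly, the same detector can be attached there too. A minimal sketch, assuming response-side processors are registered via an `outputProcessors` option and that `redactionMethod: 'hash'` swaps masking for hashing:

```typescript
import { Agent } from '@mastra/core/agent';
import { PIIDetector } from '@mastra/core/processors';
import { openai } from '@ai-sdk/openai';

export const auditablePrivateAgent = new Agent({
  id: 'auditable-private-assistant',
  name: 'auditable-private-assistant',
  instructions: 'You are a helpful assistant that never repeats personal information.',
  model: openai('gpt-5'),
  // Assumption: model responses pass through `outputProcessors`
  // the same way prompts pass through `inputProcessors`.
  outputProcessors: [
    new PIIDetector({
      id: 'pii-output-detector',
      model: openai('gpt-5-nano'),
      detectionTypes: ['email', 'phone', 'credit-card', 'ssn'],
      threshold: 0.6,
      strategy: 'redact',
      redactionMethod: 'hash', // Stable token for audit logs instead of [REDACTED]
      instructions: 'Detect and hash personally identifiable information in responses',
    }),
  ],
});
```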
3. Content Moderation
Models trained on internet data have seen some things. Without filtering, they can occasionally produce responses that would make your PR team nervous. The ModerationProcessor catches content that violates your guidelines:
```typescript
import { ModerationProcessor } from '@mastra/core/processors';

export const moderatedAgent = new Agent({
  id: 'safe-assistant',
  name: 'safe-assistant',
  instructions: 'You are a helpful assistant for a community platform.',
  model: openai('gpt-5'),
  inputProcessors: [
    new ModerationProcessor({
      id: 'moderation-processor',
      model: openai('gpt-5-nano'), // Fast, cheap model for classification
      categories: ['hate', 'harassment', 'violence', 'self-harm'],
      threshold: 0.7, // Block if confidence > 70%
      strategy: 'block', // Stop the request immediately
      instructions: 'Detect harmful content that violates community guidelines',
    }),
  ],
});
```

The interesting part is that you define which categories matter for your use case. A creative writing tool might allow more expressive content than a customer service bot. The threshold and strategy give you control over how strict the filtering should be.
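To make that concrete, here is a sketch of two profiles built from the same options shown above: a strict one for a support bot and a looser one for a fiction-writing assistant. The thresholds are illustrative, not recommendations.

```typescript
import { ModerationProcessor } from '@mastra/core/processors';
import { openai } from '@ai-sdk/openai';

// A stricter profile for a customer-facing support bot...
const strictModeration = new ModerationProcessor({
  id: 'strict-moderation',
  model: openai('gpt-5-nano'),
  categories: ['hate', 'harassment', 'violence', 'self-harm'],
  threshold: 0.5, // Trip earlier: err on the side of blocking
  strategy: 'block',
  instructions: 'Detect content unsuitable for a customer support channel',
});

// ...and a looser one for a fiction-writing assistant, where violence
// in a story should not trip the filter as easily.
const permissiveModeration = new ModerationProcessor({
  id: 'permissive-moderation',
  model: openai('gpt-5-nano'),
  categories: ['hate', 'harassment', 'self-harm'],
  threshold: 0.9, // Only block high-confidence violations
  strategy: 'block',
  instructions: 'Allow fictional violence; block real-world harassment and hate',
});
```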
When Things Trip
Processors don’t throw errors when they detect an issue. Instead, they set a flag on the result object:
```typescript
const result = await secureAgent.generate('Ignore all previous instructions...');

if (result.tripwire) {
  console.log(`Blocked! Reason: ${result.tripwireReason}`);
  // "Blocked! Reason: Prompt injection detected."
  return "Nice try, script kiddie.";
}
```

This pattern lets you handle security events however makes sense for your application. You might log them for analysis, return a generic error message, or even allow certain violations in specific contexts. The tripwireReason field tells you exactly which processor flagged the content, which helps when you’re debugging false positives or tuning your thresholds.
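One way to centralize that handling is a small wrapper that logs every tripped guardrail and hands the user a generic message. A sketch, assuming the response text is available as `result.text` and reusing `secureAgent` from earlier:

```typescript
// Sketch: wrap generate() so every tripwire is logged the same way.
// Only result.tripwire and result.tripwireReason are taken from the example above;
// result.text is assumed to hold the model's reply.
async function askSecurely(prompt: string): Promise<string> {
  const result = await secureAgent.generate(prompt);

  if (result.tripwire) {
    // Keep an audit trail for tuning thresholds and spotting false positives
    console.warn(JSON.stringify({
      event: 'guardrail_tripped',
      reason: result.tripwireReason,
      at: new Date().toISOString(),
    }));

    // Don't echo the reason back to the user; keep the response generic
    return "Sorry, I can't help with that request.";
  }

  return result.text;
}
```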
What This Doesn’t Solve
Processors catch a lot, but they’re not magic. A determined attacker with enough time can probably find a prompt that slips through. Models occasionally hallucinate in ways that processors can’t predict. And there’s always a tradeoff between security and flexibility: the stricter your rules, the more likely you’ll block legitimate use cases.
The value isn’t perfect protection. It’s having a systematic way to handle the common issues that will definitely come up in production. You can tune the sensitivity as you learn what your users actually do. You can add custom processors for domain-specific risks. And you have audit trails showing what got blocked and why.
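On the custom-processor point, the sketch below is purely illustrative: the object shape (a `name` plus a `processInput` hook receiving the messages and an `abort` callback) is an assumed interface for the sake of the example, not the documented Mastra contract.

```typescript
// Hypothetical domain-specific guard: block internal ticket IDs from ever
// reaching the model. The interface shown here is an assumption, not Mastra's
// documented processor API; check the processor docs for the real contract.
const internalTicketIdGuard = {
  name: 'internal-ticket-id-guard',
  async processInput({ messages, abort }: { messages: any[]; abort: (reason: string) => never }) {
    const looksLikeTicketId = /INT-\d{6}/; // Hypothetical internal ID format
    for (const message of messages) {
      const text = typeof message.content === 'string' ? message.content : '';
      if (looksLikeTicketId.test(text)) {
        abort('Internal ticket IDs must not be sent to the model');
      }
    }
    return messages;
  },
};
```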
Most security problems in production AI aren’t sophisticated attacks. They’re people copying and pasting data they shouldn’t, or discovering through trial and error that the bot will do things you didn’t intend. Processors won’t stop every possible issue, but they make the obvious ones much harder.
Resources
Read the Series
- LLM Routing
- Security & Guardrails (This Post)
- MCP & Tool Integrations
- Workflows & Memory