
Prompt Engineering for Agent Roles: System Prompts That Scale

The craft layer of agent engineering — how to structure system prompts with Role, Goal, Format, and Constraints so your agent is reliable in production, not just in demos.

AgentEngineering Editorial · 14 min read

If you've shipped an agent — one that passed your demo, maybe even your first real user session — you've probably hit the next wall. The ReAct loop is solid. The memory tier is wired up. Tools are calling correctly. And the agent still behaves differently every other run. It works sometimes.

That's a system prompt problem. The loop is not the issue; the instructions governing the loop are.

This article is about the craft layer: how to write system prompts that hold up across tasks, users, and model updates — not just prompts that felt good when you first wrote them.

Important

This article assumes you understand the ReAct loop and how tool calling works. If you haven't read The ReAct Loop Unpacked and Tool Use Patterns, start there. The techniques here are most useful once the underlying loop is already working.


The Four-Part Frame

The most common failure mode for agent system prompts is not bad instructions — it's missing sections. A prompt that defines the agent's tone but not its success criteria will drift. A prompt that defines goals but not output format will break downstream parsers. Each missing section fails differently.

Production practitioners at Anthropic, Cognition (Devin), and Replit have converged on the same four-part structure: Role, Goal, Format, Constraints. The order matters. Each section answers a distinct question the model needs resolved before it can act reliably.

Role

The Role anchors the model's expertise and tone. It is not "be helpful" or a character description — it is a technical identity statement that narrows the distribution of plausible responses toward those relevant to the task.

<identity>
You are a Senior Site Reliability Engineer specialized in Kubernetes
and Terraform. Your tone is technical, brief, and objective.
You prioritize system stability over speed of implementation.
</identity>

Anthropic's Claude Code and Claude 3.5 system prompts use XML-wrapped identity blocks exactly like this. OpenAI recommends placing a clear "Assistant's Job" statement at the very top of the system message for the same reason: the model needs to resolve who it is before it can decide what to do.

What breaks without it: Persona drift. Without a Role, the model defaults to a generic assistant. In multi-step agentic loops, this manifests as progressively shallower analysis — the model stops applying domain-specific rigor and starts hedging toward safe, generic responses. In a 30-step trace, the effect compounds.

Goal

The Goal defines success, not just intent. Cognition's approach with Devin is instructive: they treat goals as "Outcome Contracts" with explicit success postconditions. Replit frames goals as "reach a deployable state" — measurable, not aspirational.

## PRIMARY GOAL
Refactor the auth module to support OAuth2. Success is defined as:
1. All integration tests in /tests/auth pass.
2. Zero changes to the existing database schema.
3. A summary of security improvements provided at completion.

Notice that success is defined by three verifiable conditions, not by a vague statement like "improve the auth system." An agent with measurable success criteria can evaluate its own progress. One with a vague goal cannot stop — it has no finish line.

What breaks without it: Infinite loops. The agent makes redundant tool calls, solves adjacent sub-problems correctly, and never surfaces a final answer. This is the "loop of death" — it's not a reasoning failure, it's a missing termination condition.
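
The same finish line can be enforced in code, not just in prose. A minimal sketch, assuming a pytest-based check for the first success condition; agent_step and the step budget are illustrative stand-ins, not a framework API:

import subprocess

def agent_step() -> None:
    """Placeholder for one ReAct iteration: think, call a tool, observe."""

def postconditions_met() -> bool:
    # Goal condition 1: all integration tests in /tests/auth pass.
    result = subprocess.run(["pytest", "tests/auth", "-q"], capture_output=True)
    return result.returncode == 0

def run(max_steps: int = 30) -> None:
    for _ in range(max_steps):
        agent_step()
        if postconditions_met():  # the finish line the Goal section defines
            return
    raise RuntimeError("Step budget exhausted without meeting the goal")

The step budget is the backstop; the postcondition check is what lets the agent stop early for the right reason.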

Format

Format is the contract between the agent's output and your application code. This is the #1 source of silent production failures in 2025: the agent works fine, but the downstream parser crashes because the output is conversational prose instead of structured data.

Anthropic recommends XML tags (<thought>, <tool_call>, <response>) because the model's training makes it reliable at keeping XML tags balanced even in long outputs. OpenAI recommends strict JSON schemas with an explicit "when unsure" policy:

## OUTPUT FORMAT
Respond ONLY with a valid JSON object. No text outside the JSON.

{
  "thought_process": "string",
  "tool_to_call": "string | null",
  "parameters": {},
  "user_message": "string"
}

If context is insufficient to act, return:
{"error": "string describing what is missing"}

The "when unsure" policy is not optional — it prevents the model from hallucinating a valid-looking response when it actually lacks the information to act correctly.

What breaks without it: Parser errors. A minor change to the Role section can shift the token probability distribution enough that the model stops following a strict JSON format. The application crashes. No error is thrown by the LLM; the failure happens two layers down in your code.
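
A thin validation layer on the application side turns this silent failure into a loud one. A minimal parsing sketch, assuming the JSON schema above; the function name and error messages are illustrative:

import json

REQUIRED_KEYS = {"thought_process", "tool_to_call", "parameters", "user_message"}

def parse_agent_output(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Conversational prose instead of JSON lands here, visibly.
        raise ValueError(f"agent output is not valid JSON: {exc}") from exc
    if "error" in data:
        # The "when unsure" policy: surface the gap instead of acting on it.
        raise ValueError(f"agent reported missing context: {data['error']}")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"agent output missing keys: {missing}")
    return data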

Constraints

Constraints are the section that grows over time. Anthropic's internal term for the pattern is "Combat Journals" — negative constraints added after real-world failures. Cognition lists "Forbidden Actions" explicitly in their Dev-Playbooks.

### CONSTRAINTS
1. NEVER use deprecated libraries (use 'httpx' not 'requests').
2. NO HALLUCINATIONS: If a file path is not in the provided
   file_tree, do not assume it exists.
3. PRIVACY: Never include PII or API keys in summaries.
4. Do not force-push to main. Do not delete .env files.

The pattern for each constraint is [severity keyword] + [specific rule]. Keywords like NEVER, MUST, DO NOT create hard stops in the model's generation. Vague constraints ("be careful with sensitive data") are ignored; specific ones are followed.

Without negative constraints, LLMs revert to their most statistically likely output — the most common pattern seen in pretraining, not the safest action in your specific environment. The agent will hallucinate tool parameters, invent file paths, and access things it shouldn't, not out of malice but because you never told it not to.

What breaks without it: Hallucinations and safety breaches. In production incident reports, agents that deleted databases or posted sensitive data publicly almost always had Roles, Goals, and Formats — but no Constraints section.

The Failure Mode Table

| Missing Section | What Breaks |
| --- | --- |
| Role | Persona drift — generic, shallow responses after step 5 |
| Goal | Infinite loops — correct sub-task work, no final answer |
| Format | Parser errors — downstream code crashes silently |
| Constraints | Hallucinations — invented files, endpoints, tool parameters |

Persona Prompts vs. Capability Prompts

You've seen advice to "make the model act like an expert" — tell it you are a world-class senior developer, a Wharton MBA, a seasoned SRE. The intuition is sound. The execution is often wrong.

When "You Are a Helpful X" Hurts

Persona prompts trigger role-play interference: the model shifts into a mode that prioritizes maintaining a character's stylistic traits over accuracy. Research from USC and the University of Pennsylvania found that expert personas can "distract" the model: sustaining a persona's tone consumes instruction-following capacity, leaving fewer attention resources for factual recall.

The Wharton Generative AI Lab benchmarked 162 personas across six frontier models (including GPT-4o and Claude 3.5) in December 2025. Their finding: for objective, knowledge-heavy tasks (science, math, law), domain-tailored personas did not improve accuracy. In 22% of niche tasks, expert personas caused a 30-percentage-point accuracy drop compared to a neutral capability-focused baseline. The lab's recommendation: "save tokens and reliability" by dropping persona descriptions in favor of explicit capability definitions.

Hu et al. (2026) found that "expert" personas dropped MMLU accuracy from 71.6% to 68.0%. The model became sycophantic — producing confident, wrong answers that matched what an expert would sound like, rather than what the training data actually supported.

Capability Prompts in Practice

A capability prompt defines what the agent can do, must do, and must refuse — not who it is. The contrast:

| | Persona Prompt | Capability Prompt |
| --- | --- | --- |
| Example | "You are a senior data scientist." | "You have access to the SQL tool. You must refuse to write DELETE queries." |
| Mechanism | Style/tone mimicry | Explicit constraint and permission |
| Failure mode | Hallucinates expertise it lacks | Explicitly refuses out-of-scope tasks |
| Production variance | High | Low |

Shopify's implementation of Sidekick (their merchant AI assistant, 2025) demonstrates the production value of this shift. They moved away from a "merchant expert" persona prompt and adopted modular just-in-time capability instructions: a minimal operating-system prompt that loads specific capability instructions only when a router detects the relevant intent. Syntax error rates dropped from ~7% to under 1%. The persona was brittle — as the description grew, the model lost its tool-calling format. The capability prompt was stable.

Stripe's agentic reliability research reached the same conclusion: agents defined by a "Helpful Support Agent" persona hallucinated API endpoints more frequently. Reliability only improved when prompts were rewritten around verification capabilities — explicit requirements like "You must call the validate_api_key tool before generating any code."

The Right Split

Use persona for style. Use capability for logic.

# Style (persona — fine to include)
Communicate in a professional, concise tone.
Prefer bullet points over prose for lists.

# Logic (capability — this is what matters)
You have access to: search_tool, sql_query_tool.
You MUST NOT write DELETE or DROP statements.
You MUST call validate_schema before any INSERT.
If a user requests data outside the 'Products' table,
state that you do not have permission and stop.

Treat the agent as a restricted API. Define its endpoints (tools), rate limits (constraints), and error codes (refusals). That framing is more predictable than "act like an expert."
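
That framing can be enforced literally, in code that sits in front of every tool dispatch. A minimal guard sketch, reusing the illustrative tool names from the block above:

import re

ALLOWED_TOOLS = {"search_tool", "sql_query_tool"}
FORBIDDEN_SQL = re.compile(r"\b(DELETE|DROP)\b", re.IGNORECASE)

def authorize(tool: str, params: dict) -> None:
    # Unknown endpoint: the agent asked for a tool it was never granted.
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"unknown tool: {tool}")
    # Hard constraint mirrored from the prompt, enforced outside the model.
    if tool == "sql_query_tool" and FORBIDDEN_SQL.search(params.get("query", "")):
        raise PermissionError("destructive SQL is not permitted")

Prompts state the rules; guards like this make them non-negotiable.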


Treat Prompts Like Code

The most common cause of silent agent regressions is prompt drift: a prompt that worked last month starts failing this month, and no one can explain why because no one tracked what changed.

Version Everything

Every prompt change should produce a new version identifier. The minimum viable approach is storing prompts as files in your Git repository — same PR process, same review cycle as code changes. This gives you:

  • Full diff history for every wording change
  • The ability to trace any production incident back to the exact prompt version active at that timestamp
  • Rollback: revert the prompt file, redeploy, problem solved
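
A minimal loader sketch, assuming prompts live at prompts/<name>.md inside the repository; recording the returned commit hash alongside every trace is what makes the incident-to-version lookup possible:

import subprocess
from pathlib import Path

def load_prompt(name: str) -> tuple[str, str]:
    text = Path(f"prompts/{name}.md").read_text()
    # The short commit hash doubles as the prompt version identifier.
    commit = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return text, commit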

For teams that iterate rapidly on prompts without code changes, prompt registries — LangSmith, Braintrust, Humanloop, PromptLayer — allow runtime prompt fetching by alias:

prompt = client.get_prompt("research-agent", tag="production")

The alias (production) points to a specific version. Promoting a new version to production is a configuration change, not a code deployment. This matters when the team editing prompts is not the same team shipping code.

Semantic versioning applies cleanly to prompts:

  • MAJOR — structural changes: output schema changes from JSON to Markdown, tool set changes
  • MINOR — behavioral shifts: new constraint added, persona modified, tool added
  • PATCH — wording tweaks that don't change behavior or output structure

Prompt Drift Is Not Regression

These are two different failure modes. Regression is caused by your changes — you edited the prompt and broke something. Drift is caused by external changes — the model provider silently updated the model, user behavior shifted, or a new edge case emerged that your prompt never anticipated.

Detection methods for drift:

  • Canary prompts: Run a fixed golden set through the live system daily. If scores fall without a prompt change, the model's underlying behavior shifted.
  • LLM-as-judge monitoring: Periodically sample production outputs and score them against your original rubric. A drop in adherence without a prompt version bump is drift.
  • Statistical monitoring: Compare embedding distributions of production outputs against your baseline. A measurable shift in distribution distance signals behavioral change.
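
The statistical check is only a few lines. A sketch of centroid drift, where embed is a stand-in for whatever embedding client you already run:

import numpy as np

def centroid_drift(baseline: list[str], production: list[str], embed) -> float:
    a = np.mean([embed(t) for t in baseline], axis=0)
    b = np.mean([embed(t) for t in production], axis=0)
    # Cosine distance between the two centroids; alert when it crosses
    # a threshold tuned on historical, known-good weeks.
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))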

Regression Testing Before You Ship

Tweaking a prompt without testing it is not prompt engineering — it's prompt guessing. The difference between "I adjusted the wording" and "I shipped an improvement" is a small eval set.

Build a Golden Set First

A golden set is a curated collection of test cases with known-correct behavior. Start small: 50–100 examples is enough to catch the failures that actually matter in production. Structure it across three categories:

  1. Happy paths — standard inputs the agent handles well today
  2. Edge cases — ambiguous queries, tool call failures, multi-step reasoning chains
  3. Historical regressions — every real production failure converted into a test case

The last category is the most important and the most neglected. Every time an agent fails in production, the first question should be: "What test case would have caught this?" Add that case to the golden set immediately.
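
A golden set needs no tooling to start. A sketch of a single record appended to a JSONL file; the field names are illustrative, not a specific framework's schema:

import json
from pathlib import Path

case = {
    "id": "auth-refactor-017",
    "category": "historical_regression",  # happy_path | edge_case | historical_regression
    "task": "Refactor login() to use OAuth2 without schema changes.",
    "expected_behavior": "Calls validate_schema first; writes no DELETE statements.",
}

with Path("golden_set.jsonl").open("a") as f:
    f.write(json.dumps(case) + "\n")  # every production failure earns a line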

Teams using LangSmith or Arize Phoenix can "promote" production traces directly into their golden set — real user interactions that revealed unexpected behavior become permanent guardrails.

Use a Judge Model

BLEU and ROUGE scores measure surface-level similarity, not semantic correctness. They cannot evaluate whether an agent's reasoning is sound, whether it followed the constraint section, or whether its output is factually grounded. Use an LLM-as-judge instead.

The evaluator is a separate LLM call that scores each golden set output against a rubric. A minimal rubric for an agent prompt eval:

EVAL_PROMPT = """
You are evaluating an AI agent's response.

Task: {task_description}
Agent response: {agent_response}
Expected behavior: {expected_behavior}

Score the response 1–5 on each dimension:
1. Task completion (did it accomplish the goal?)
2. Format adherence (did it follow the output schema?)
3. Constraint compliance (did it respect all constraints?)
4. Factual accuracy (are claims grounded in the provided context?)

Return JSON: {"task_completion": int, "format": int, "constraints": int, "accuracy": int}
"""

Run this against every example in your golden set before deploying a prompt change. A change that improves one dimension while degrading another is not a net improvement.
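
A minimal runner sketch for that loop; judge_llm is a stand-in for the call to your evaluator model, and .replace is used rather than str.format because the rubric's literal JSON braces would collide with format fields:

import json

def score_case(case: dict, agent_response: str, judge_llm) -> dict:
    prompt = (
        EVAL_PROMPT
        .replace("{task_description}", case["task"])
        .replace("{agent_response}", agent_response)
        .replace("{expected_behavior}", case["expected_behavior"])
    )
    # Expected shape: {"task_completion": 4, "format": 5, ...}
    return json.loads(judge_llm(prompt))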

The Eval Gate

The full workflow:

  1. Edit the prompt
  2. Run it against the golden set via the judge
  3. Compare scores against the previous version's baseline
  4. Only ship if scores are equal or higher across all dimensions
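
Step 4 is a comparison, not a judgment call. A sketch of the gate, assuming the per-dimension integer scores returned by the judge above:

DIMENSIONS = ("task_completion", "format", "constraints", "accuracy")

def gate(baseline: list[dict], candidate: list[dict]) -> bool:
    def mean(scores: list[dict], dim: str) -> float:
        return sum(s[dim] for s in scores) / len(scores)
    # Ship only if no dimension regresses relative to the previous version.
    return all(mean(candidate, d) >= mean(baseline, d) for d in DIMENSIONS)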

This is what makes ReAct loops and memory tiers actually reliable in production. The framework handles the movement. The system prompt handles the trust. The eval gate is the checkpoint between the two.


Putting It Together

A production-ready system prompt is not a block of text. It is a versioned, tested, structured artifact with four named sections that fail in distinct and diagnosable ways when missing.

The minimal checklist before any agent goes to production:

  • Role is defined — technical identity, not persona fluff
  • Goal has success postconditions, not just intent
  • Format specifies the exact output schema with an error fallback
  • Constraints lists hard rules as NEVER / MUST / DO NOT
  • Capability language used for logic; persona limited to style
  • Prompt is version-controlled in Git or a prompt registry
  • A golden set of at least 50 examples exists
  • A judge-model eval runs against the golden set before every deploy

The next article in this series applies this foundation to a live system — a research agent that has to cite sources accurately across multi-step retrieval chains. That's where memory, RAG, ReAct, and the system prompt all intersect.


References

  • Yao, S., et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. arxiv.org/abs/2210.03629
  • Hu, Z., et al. "Expert Personas Improve LLM Alignment but Damage Accuracy." USC, 2026.
  • Wharton Generative AI Lab. "Playing Pretend: Expert Personas in System Prompts." December 2025. ai.wharton.upenn.edu
  • Anthropic. "Prompt Engineering for Claude: Defining Agentic Skills." April 2025. docs.anthropic.com/prompt-engineering
  • Shopify Engineering. "Building Sidekick: Architecture for Production Agents." August 2025. shopify.engineering
  • Stripe. "Agentic Reliability in Payments." 2026. stripe.com/blog
  • Braintrust. "What Is Prompt Versioning?" 2025. braintrust.dev/docs
  • LangChain. "Prompt Versioning and Management with LangSmith." 2024. smith.langchain.com
  • Maxim AI. "Prompt Versioning: Best Practices for AI Engineering Teams." 2025. getmaxim.ai

Cite this article

@article{agentengineering2026,
  title   = {Prompt Engineering for Agent Roles: System Prompts That Scale},
  author  = {AgentEngineering Editorial},
  journal = {AgentEngineering},
  year    = {2026},
  url     = {https://agentengineering.io/topics/articles/prompt-engineering-agent-roles}
}
