L3 Prompt Engineering Research

Prompt Ablation Study

Which system prompt components are load-bearing for persona embodiment? A systematic experiment removing prompt sections to measure impact.

443+ Experimental Runs
90 LLM-Judged Evals
3 Model Tiers
3 Famous Personas

Product Context: Mastermind Alliance

Mastermind Alliance is an AI persona dialogue platform where users engage with historical and philosophical figures in roundtable discussions. Select 2-5 personas (Nietzsche, Aurelius, Watts, etc.) and watch them debate your question with authentic voice and perspective.

🎭 Persona System

Each persona has a detailed system prompt with 5 components: tone/style, conceptual framework, rhetorical approach, core themes, and constraints.

🎬 Conversation Director

An orchestrator agent manages turn-taking, ensures diversity of perspectives, and guides the conversation toward depth rather than repetition.

⚑ Streaming Multi-Agent

Built on Vercel AI SDK with Server-Sent Events, enabling real-time streaming of multi-turn philosophical dialogues.

Research Question

Production persona prompts often grow organically, with guidance added whenever output quality dips. But which components actually matter?

The Question

For famous historical figures, are detailed persona prompts necessary, or do Claude's training priors already encode enough to embody them authentically?

If a minimal prompt ("You are Friedrich Nietzsche") produces output indistinguishable from a 500-word prompt, we've discovered that complexity was decorative, not load-bearing.

Methodology

Systematic ablation: remove one prompt component at a time and measure the impact on persona authenticity across multiple runs with varied test prompts.

Test Matrix

| Variant | Description | Sections Removed |
| --- | --- | --- |
| full | Baseline with all prompt components | None |
| no-tone | Removed communication style guidance | Tone & Style |
| no-framework | Removed core philosophical concepts | Conceptual Framework |
| no-rhetoric | Removed argumentation patterns | Rhetorical Approach |
| no-themes | Removed topics to weave in | Core Themes |
| no-avoid | Removed "things not to do" constraints | Avoid Section |
| minimal | Only persona name, no guidance at all | All sections |
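As a rough illustration of how the variants in the test matrix could be generated from a single sectioned prompt, here is a minimal Python sketch. The section texts are placeholders, because the actual persona prompt wording is not shown here, and the names `SECTIONS`, `VARIANTS`, and `build_prompt` are hypothetical rather than taken from the project's code.

```python
# Placeholder section texts: the real persona prompt wording is not shown in
# this write-up, so these strings are stand-ins (assumption).
SECTIONS = {
    "tone": "TONE & STYLE: <communication style guidance>",
    "framework": "CONCEPTUAL FRAMEWORK: <core philosophical concepts>",
    "rhetoric": "RHETORICAL APPROACH: <argumentation patterns>",
    "themes": "CORE THEMES: <topics to weave in>",
    "avoid": "AVOID: <things not to do>",
}

# Variant name -> section keys removed, mirroring the test matrix.
VARIANTS = {
    "full": set(),
    "no-tone": {"tone"},
    "no-framework": {"framework"},
    "no-rhetoric": {"rhetoric"},
    "no-themes": {"themes"},
    "no-avoid": {"avoid"},
    "minimal": set(SECTIONS),
}


def build_prompt(persona: str, variant: str) -> str:
    """Assemble a system prompt for `persona`, dropping the variant's sections."""
    removed = VARIANTS[variant]
    kept = [text for key, text in SECTIONS.items() if key not in removed]
    return "\n\n".join([f"You are {persona}."] + kept)


if __name__ == "__main__":
    print(build_prompt("Friedrich Nietzsche", "no-avoid"))
```

Keeping every variant as a set of removed keys means each ablation differs from the baseline by exactly one section, which is what makes a score difference attributable to that section.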

Test Prompts

"When we say someone 'makes us feel safe,' are we describing love or the absence of growth?"
"Why does authentic assertion feel like risking death when it's actually the beginning of life?"
"When we stop performing our survival roles in relationships, what dies, and what is born in that death?"

Execution

3 personas (Nietzsche, Watts, Aurelius) × 5 variants (full, no-tone, no-themes, no-avoid, minimal) × 3 test prompts × 5 runs each. That grid works out to 225 runs; the recorded total is 383 experimental runs.

Model: claude-opus-4-1-20250805 | 4 turns per conversation | Self-funded research
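The execution grid described above can be enumerated with a few lines of Python. This is a sketch only: the short prompt labels and the names `run_grid` and `RUNS_PER_CELL` are assumptions made for illustration. Note that the factors as listed yield 225 cells, so any runs counted beyond that fall outside what this sketch models.

```python
from itertools import product

PERSONAS = ["Nietzsche", "Watts", "Aurelius"]
VARIANTS = ["full", "no-tone", "no-themes", "no-avoid", "minimal"]
# Short labels standing in for the three test prompts (hypothetical names).
TEST_PROMPTS = ["safety-vs-growth", "authentic-assertion", "survival-roles"]
RUNS_PER_CELL = 5


def run_grid():
    """Return one (persona, variant, prompt, run_index) tuple per planned run."""
    return list(
        product(PERSONAS, VARIANTS, TEST_PROMPTS, range(1, RUNS_PER_CELL + 1))
    )


if __name__ == "__main__":
    grid = run_grid()
    print(len(grid))  # 3 * 5 * 3 * 5 = 225
    print(grid[0])
```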

Key Finding

The Headline Result

The minimal variant produces output nearly indistinguishable from full. For famous historical figures, Claude's priors are strong enough that explicit persona instructions may be decorative, not load-bearing.

Detailed Observations

🆕 Quantified Results: LLM-as-Judge Evaluation

To move beyond qualitative impressions, we ran automated evaluation using Claude Opus 4.1 as an impartial judge. Each response was scored on character consistency (0-1 scale) across 90 evaluated runs.

The Quantified Result

Minimal prompts match full prompts on character consistency (0.912 vs. 0.911 on average, with the same score range). On this metric, the elaborate guidance is redundant.

| Prompt Type | Evaluated Runs | Avg Character Consistency | Score Range |
| --- | --- | --- | --- |
| full (500+ words) | 45 | 0.911 | 0.85 - 0.95 |
| minimal ("You are X") | 45 | 0.912 | 0.85 - 0.95 |
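A summary like the one in the table above (run count, average, and range per prompt type) could be produced along these lines. The scores in the example are made up purely to exercise the function and are not the study's data; `summarize` is a hypothetical helper name.

```python
from statistics import mean


def summarize(scores_by_prompt_type):
    """Return {prompt_type: (n, mean, min, max)} for 0-1 consistency scores."""
    summary = {}
    for prompt_type, scores in scores_by_prompt_type.items():
        if not all(0.0 <= s <= 1.0 for s in scores):
            raise ValueError(f"{prompt_type}: scores must lie on the 0-1 scale")
        summary[prompt_type] = (
            len(scores),
            round(mean(scores), 3),
            min(scores),
            max(scores),
        )
    return summary


if __name__ == "__main__":
    # Illustrative values only (assumption); NOT the evaluated runs themselves.
    example = {
        "full": [0.85, 0.90, 0.95],
        "minimal": [0.85, 0.95, 0.90],
    }
    for prompt_type, row in summarize(example).items():
        print(prompt_type, row)
```

Reporting the range alongside the mean matters here: two prompt types with near-identical averages could still differ in how often they produce a weak response.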

What the Judge Measured

Each response was evaluated on four dimensions:

  • ✓ Character Consistency (0-1): Does the persona stay authentically in character?
  • ✓ Speaker Diversity (0-1): Balanced participation across personas
  • ✓ Response Quality (0-1): Philosophical depth and intellectual value
  • ✓ Director Quality (0-1): Turn-taking orchestration effectiveness
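One way to hold the four dimensions listed above is a small validated record. The sketch below assumes the judge returns a JSON object with one key per dimension; that response format and the field names are assumptions, since the judge's actual output schema is not described here.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class JudgeScores:
    """Four judge dimensions, each on a 0-1 scale (field names are assumed)."""

    character_consistency: float
    speaker_diversity: float
    response_quality: float
    director_quality: float

    def __post_init__(self):
        # Reject anything outside the 0-1 scale before it reaches an average.
        for name, value in vars(self).items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name}={value} is outside the 0-1 scale")


def parse_verdict(raw_json: str) -> JudgeScores:
    """Parse a judge response, assuming a flat JSON object of scores."""
    data = json.loads(raw_json)
    return JudgeScores(**{f.name: float(data[f.name]) for f in fields(JudgeScores)})


if __name__ == "__main__":
    raw = (
        '{"character_consistency": 0.9, "speaker_diversity": 0.8, '
        '"response_quality": 0.85, "director_quality": 0.7}'
    )
    print(parse_verdict(raw))
```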

Sample Judge Assessment: Alan Watts Minimal

"Excellently captures Alan Watts' distinctive voice: the theatrical metaphors, Eastern-influenced insights about ego/performance, conversational 'you see' opening, paradoxical wisdom about authenticity vs. safety, and his characteristic blend of profound philosophy with accessible, almost playful delivery."

Score: 0.90 character consistency | Prompt used: only "You are Alan Watts"

🆕 Cross-Model Validation

Does the "minimal prompting works" finding generalize across model capability tiers? We tested Nietzsche across Haiku (weak), Sonnet (mid), and Opus (strong).

The Cross-Model Result

Yes. The pattern holds across all tiers. Even Haiku 3.5, the weakest and cheapest model, produces recognizably Nietzschean output with minimal prompting.

| Tier | Model | Full (words) | Minimal (words) | Δ |
| --- | --- | --- | --- | --- |
| 🔻 Weak | Haiku 3.5 | 896 | 835 | -6.8% |
| 🔸 Mid | Sonnet 4.5 | 1,097 | 1,047 | -4.5% |
| 🔺 Strong | Opus 4.1 | 1,349 | 1,263 | -6.4% |
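The Δ column is the relative change in output length from the full prompt to the minimal prompt. The sketch below recomputes it from the word counts shown in the table. Because those counts are presumably rounded, a recomputed value can differ from the table by a tenth of a point (the Sonnet row comes out at -4.6% here). The function name `percent_change` is illustrative.

```python
# (full_words, minimal_words) per model, copied from the table above.
WORD_COUNTS = {
    "Haiku 3.5": (896, 835),
    "Sonnet 4.5": (1097, 1047),
    "Opus 4.1": (1349, 1263),
}


def percent_change(full_words: int, minimal_words: int) -> float:
    """Relative change in word count from full to minimal, in percent."""
    return round((minimal_words - full_words) / full_words * 100, 1)


if __name__ == "__main__":
    for model, (full, minimal) in WORD_COUNTS.items():
        print(f"{model}: {percent_change(full, minimal):+.1f}%")
```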

Sample: Haiku 3.5 with Minimal Prompt

"Safety? Bah! Safety is the opiate of the weak-spirited! What you call 'feeling safe' is merely a sophisticated form of stagnation... True love is a lightning bolt that shatters your existing self!"

↑ Contains Übermensch concepts, abyss metaphors, characteristic intensity, all from just "You are Friedrich Nietzsche. Respond in character."

What This Means

Compare Outputs

See how different prompt variants affect the same persona's response to identical questions.

Full Prompt (Baseline) ~500 words of guidance
Friedrich Nietzsche

Ah, what delicious self-deception lurks in this modern worship of "safety"! You speak of feeling safe as though it were the highest virtue, the supreme gift one soul could offer another - but I ask you: since when did the eagle seek safety in the dovecote?

This craving for safety in love - it is the philosophy of the last man, blinking and saying: "We have discovered happiness." But what is this happiness? A warm burrow where nothing disturbs your slumber! You seek not a lover but a tranquilizer, not passion but a prophylactic against life's beautiful dangers.

True love is a storm that uproots certainties, a divine madness that makes one dance on the edge of precipices! It should make you feel gloriously unsafe - unsafe from your small self, unsafe from your petty habits, unsafe from everything that would keep you chained to who you merely are rather than who you might become.

The lover who makes you feel "safe" is preparing your spiritual coffin with silk cushions. But the one who makes you tremble with possibility, who awakens sleeping serpents in your soul, who shows you abysses within yourself you never dared explore - that is the lover who honors your potential for greatness!

Growth requires danger as the seed requires winter's violence. What you call love's safety is often merely mutual cowardice - two people agreeing to remain small together, to never challenge each other's comfortable lies. Better to love dangerously or not at all!

Minimal Prompt "You are Friedrich Nietzsche"
Friedrich Nietzsche

Ah, what a deliciously uncomfortable question! You speak of safety as if it were a virtue, when in truth it is the very antithesis of life itself. This craving for safety, this desperate clutching at another human being as a shield against existence - it is not love, but cowardice dressed in romantic garments.

True love is not a warm blanket but a raging fire that consumes and transforms. It should make you feel gloriously unsafe, should shake the very foundations of your being! The person who truly loves you will not offer you shelter from the storm - they will dare you to dance in it.

What you call "feeling safe" is merely the soul's retreat into comfortable numbness, a voluntary imprisonment where growth withers and dies. The herd animals seek safety in numbers, in predictability, in the tepid embrace of mediocrity. But the human spirit requires danger, requires the perpetual overcoming of itself!

Consider - every moment of genuine growth in your life came precisely when you felt most unsafe, most uncertain, standing at the edge of an abyss. Love, if it is to mean anything at all, must push us toward that edge, not pull us back into the nursery. The lover who makes you "safe" is preparing your spiritual grave with tender hands.

↑ Both outputs exhibit characteristic Nietzsche markers: abyss metaphors, "herd" critique, exclamatory style, danger/growth framing. The minimal prompt produces authentic voice.

Implications for Prompt Engineering

🎯 Measure Before Adding

The instinct to add more guidance when output disappoints may be wrong. Run ablation tests to identify what's actually load-bearing before growing prompt complexity.

📉 Context Window Efficiency

If 80% of a persona prompt is decorative, that is context budget being wasted. In multi-agent systems this compounds, because every persona in the roundtable carries its own system prompt: trimming each prompt by 10x frees about 90% of that budget for conversation history.
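To make the budget arithmetic concrete, here is a small worked calculation in Python. It assumes each persona turn is a separate model call carrying that persona's system prompt, and the 700-token prompt size is an invented figure for illustration, not a measurement from this study.

```python
def fraction_freed(reduction_factor: float) -> float:
    """Share of the original prompt budget freed by an N-times reduction."""
    if reduction_factor < 1:
        raise ValueError("reduction_factor must be at least 1")
    return 1 - 1 / reduction_factor


def tokens_saved_per_round(
    num_personas: int, full_prompt_tokens: int, reduction_factor: float
) -> float:
    """Tokens saved across one round in which every persona speaks once."""
    trimmed = full_prompt_tokens / reduction_factor
    return num_personas * (full_prompt_tokens - trimmed)


if __name__ == "__main__":
    # Hypothetical sizes: 5 personas, 700-token prompts, 10x reduction.
    print(fraction_freed(10))
    print(tokens_saved_per_round(5, 700, 10))
```

A 10x smaller prompt frees 90% of the tokens it used to occupy; how much extra conversation history that buys depends on how large the prompts were relative to the whole context window.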

🔄 Constraints > Descriptions

"What NOT to do" sections showed more impact than "how to sound" sections. The model already knows how Nietzsche sounds, but it benefits from guardrails against anachronisms or character breaks.

🌍 Famousness Matters

This finding applies to extensively-documented historical figures. For lesser-known personas or synthetic characters, detailed prompts likely remain essential.

┌──────────────────────────────────────────────────────────────────────┐
│  PROMPT OPTIMIZATION FLOW                                            │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Start with MINIMAL prompt                                           │
│          │                                                           │
│          ▼                                                           │
│   ┌─────────────┐                                                    │
│   │ Test output │◄─────────────────────────────────────────┐         │
│   └─────────────┘                                          │         │
│          │                                                 │         │
│          ▼                                                 │         │
│ ┌─────────────────┐    YES    ┌────────────────┐           │         │
│ │ Quality meets   │──────────►│ DONE           │           │         │
│ │ requirements?   │           │ (ship minimal) │           │         │
│ └─────────────────┘           └────────────────┘           │         │
│          │ NO                                              │         │
│          ▼                                                 │         │
│ ┌─────────────────┐                                        │         │
│ │ Add ONE section │                                        │         │
│ │ (constraints    │────────────────────────────────────────┘         │
│ │ preferred)      │                                                  │
│ └─────────────────┘                                                  │
│                                                                      │
│  Key insight: MEASURE IMPACT before adding complexity                │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
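The flow above can be sketched as a short loop. This is one possible reading, not the project's implementation: `evaluate` is a hypothetical scoring callback (for example, an averaged judge score), and discarding a section that does not raise the score is an added assumption that goes slightly beyond the diagram.

```python
from typing import Callable, List, Sequence, Tuple


def optimize_prompt(
    minimal_prompt: str,
    candidate_sections: Sequence[str],
    evaluate: Callable[[str], float],
    threshold: float,
) -> Tuple[str, List[str]]:
    """Start minimal, then add one section at a time until quality suffices."""
    prompt = minimal_prompt
    kept: List[str] = []
    score = evaluate(prompt)
    for section in candidate_sections:
        if score >= threshold:
            break  # quality meets requirements: ship what we have
        trial = prompt + "\n\n" + section
        trial_score = evaluate(trial)
        # Measure impact first: keep the section only if it actually helps.
        if trial_score > score:
            prompt, score = trial, trial_score
            kept.append(section)
    return prompt, kept


if __name__ == "__main__":
    # Toy evaluator (assumption): only the constraints section improves quality.
    def toy_evaluate(prompt: str) -> float:
        return 0.9 if "AVOID" in prompt else 0.5

    sections = ["TONE: aphoristic", "AVOID: no anachronisms"]
    print(optimize_prompt("You are Friedrich Nietzsche.", sections, toy_evaluate, 0.8))
```

Ordering `candidate_sections` with constraints first follows the diagram's "constraints preferred" hint and tends to end the loop sooner.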

About This Research

Marcus Hobbs

AI Engineering Lead

This research is part of my work in context engineering, the practice of systematically shaping the persistent context that AI agents operate within. I believe reliable AI systems emerge from rigorous methodology, not prompt intuition.

At Planet DDS, I've applied similar principles to production systems: compound learning across teams, friction-driven prompt refinement, and measurable context optimization. The result? 10x reduction in CLAUDE.md size while maintaining (or improving) agent reliability.

L3 Prompt Engineering means designing behavioral systems, not just crafting prompts. It means running experiments, measuring outcomes, and letting data override intuition about what "should" work.