Prompt Ablation Research

Product Context: Mastermind Alliance

Mastermind Alliance is an AI persona dialogue platform where users engage with historical and philosophical figures in roundtable discussions. Select 2-5 personas (Nietzsche, Aurelius, Watts, etc.) and watch them debate your question with authentic voice and perspective.

🎭 Persona System

Each persona has a detailed system prompt with 5 components: tone/style, conceptual framework, rhetorical approach, core themes, and constraints.

🎬 Conversation Director

An orchestrator agent manages turn-taking, ensures diversity of perspectives, and guides the conversation toward depth rather than repetition.
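The write-up does not say how the director chooses speakers. As a hedged illustration of the "balanced participation" goal only, a director could pick whichever persona has spoken least. Ties go to whoever spoke least recently, and the same persona never gets the floor twice in a row. `pick_next_speaker` is a hypothetical helper. A production director would presumably also weigh what was just said, for example with an LLM call.

```python
from collections import Counter


def pick_next_speaker(personas: list[str], history: list[str]) -> str:
    """Pick the persona with the fewest turns so far.

    Ties go to whoever spoke least recently (never-spoken personas first),
    and the previous speaker is skipped whenever anyone else is available.
    """
    counts = Counter(history)
    # Index of each persona's most recent turn; -1 means "has not spoken yet".
    last_turn = {name: i for i, name in enumerate(history)}
    others = [p for p in personas if not history or p != history[-1]]
    candidates = others or personas
    return min(candidates, key=lambda p: (counts[p], last_turn.get(p, -1)))


history = []
for _ in range(6):
    history.append(pick_next_speaker(["Nietzsche", "Watts", "Aurelius"], history))
print(history)  # ['Nietzsche', 'Watts', 'Aurelius', 'Nietzsche', 'Watts', 'Aurelius']
```

With equal participation this reduces to a strict rotation. The content-aware part of direction (steering toward depth rather than repetition) is outside the scope of this sketch.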

⚡ Streaming Multi-Agent

Built on Vercel AI SDK with Server-Sent Events, enabling real-time streaming of multi-turn philosophical dialogues.
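SSE itself is a simple text format. Each event is a group of `field: value` lines ended by a blank line, and a multi-line payload is split across several `data:` lines. The helper below frames one streamed chunk of a persona's reply in that format. `format_sse` and the use of the persona name as the event type are assumptions for illustration. This is generic SSE framing, not the Vercel AI SDK's own stream format.

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Frame one Server-Sent Events message as text.

    Every line of the payload becomes its own "data:" field, and the blank
    line at the end tells the client the event is complete.
    """
    lines = [] if event is None else [f"event: {event}"]
    lines += [f"data: {line}" for line in data.split("\n")]
    return "\n".join(lines) + "\n\n"


# A hypothetical chunk of one persona's streamed reply.
print(repr(format_sse("Ah, what a deliciously uncomfortable question!", event="nietzsche")))
```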

Methodology

Systematic ablation: remove one prompt component at a time and measure the impact on persona authenticity across multiple runs with varied test prompts.

Test Matrix

Variant	Description	Sections Removed
full	Baseline with all prompt components	None
no-tone	Removed communication style guidance	Tone & Style
no-framework	Removed core philosophical concepts	Conceptual Framework
no-rhetoric	Removed argumentation patterns	Rhetorical Approach
no-themes	Removed topics to weave in	Core Themes
no-avoid	Removed "things not to do" constraints	Avoid Section
minimal	Only persona name—no guidance at all	All sections
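The test matrix can be generated mechanically. Render the full prompt, then re-render it once per component with that component dropped, and finally render an identity-only version. This is a minimal sketch that assumes each component is stored as a separate text block keyed by a short name. The keys, the `build_variants` helper, and the exact identity line are hypothetical.

```python
# Hypothetical keys for the five persona prompt components.
SECTIONS = ["tone", "framework", "rhetoric", "themes", "avoid"]


def build_variants(name: str, sections: dict[str, str]) -> dict[str, str]:
    """Return one system prompt per row of the ablation test matrix."""

    def render(keep: list[str]) -> str:
        # Identity line first, then the kept components in a fixed order.
        blocks = [f"You are {name}."] + [sections[k] for k in keep if k in sections]
        return "\n\n".join(blocks)

    variants = {"full": render(SECTIONS)}
    for removed in SECTIONS:
        variants[f"no-{removed}"] = render([k for k in SECTIONS if k != removed])
    variants["minimal"] = render([])
    return variants


guidance = {k: f"[{k} guidance]" for k in SECTIONS}
for variant, prompt in build_variants("Friedrich Nietzsche", guidance).items():
    print(variant, "->", prompt.count("guidance]"), "sections")
```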

Test Prompts

"When we say someone 'makes us feel safe,' are we describing love or the absence of growth?"

"Why does authentic assertion feel like risking death when it's actually the beginning of life?"

"When we stop performing our survival roles in relationships, what dies—and what is born in that death?"

Execution

3 personas (Nietzsche, Watts, Aurelius) × 5 variants (full, no-tone, no-themes, no-avoid, minimal) × 3 test prompts × 5 runs each = 225 experimental runs

Model: claude-opus-4-1-20250805 | 4 turns per conversation | Self-funded research
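The execution plan is a full factorial grid, so its size is the product of the factor counts. Enumerating every cell checks that number directly. The test-prompt labels below are shorthand invented for this sketch.

```python
from itertools import product

personas = ["Nietzsche", "Watts", "Aurelius"]
variants = ["full", "no-tone", "no-themes", "no-avoid", "minimal"]
prompts = ["safety-or-growth", "assertion-as-death", "survival-roles"]  # hypothetical labels
runs_per_cell = 5

# One tuple per run: (persona, variant, test prompt, repetition index).
grid = list(product(personas, variants, prompts, range(runs_per_cell)))
print(len(grid))  # 225
```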

Key Finding

The Headline Result

The minimal variant produces output nearly indistinguishable from full. For famous historical figures, Claude's priors are strong enough that explicit persona instructions may be decorative, not load-bearing.

Detailed Observations

✓ Minimal prompts preserve voice
"You are Friedrich Nietzsche" alone produces characteristic exclamations, abyss metaphors, will-to-power references, and provocative rhetorical style.

✓ Tone/style sections add minimal value
The no-tone variant showed no measurable degradation in persona authenticity compared to baseline.

→ Framework removal has subtle effects
Without explicit conceptual frameworks, personas still reference their key ideas but may be slightly less systematic.

→ "Avoid" sections may matter most
Constraints on what NOT to do (modern anachronisms, breaking character) showed the most consistent impact when removed.

? Persona famousness matters
This finding may not generalize to lesser-known figures. Nietzsche, Aurelius, and Watts are extensively represented in training data.

🆕 Quantified Results: LLM-as-Judge Evaluation

To move beyond qualitative impressions, we ran an automated evaluation with Claude Opus 4.1 as the judge; this is the same model used for the main experiment. Each response was scored on character consistency (0-1 scale) across 90 evaluated runs.

The Quantified Result

Minimal prompts match full prompts on character consistency (0.912 vs 0.911, with identical score ranges). On this metric, the elaborate guidance adds no measurable benefit.

Prompt Type	Evaluated Runs	Avg Character Consistency	Score Range
full (500+ words)	45	0.911	0.85 - 0.95
minimal ("You are X")	45	0.912	0.85 - 0.95
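A 0.001 gap between two means says little on its own. What matters is whether the gap stands out from run-to-run noise. The write-up reports only means and ranges. Given the 45 per-run scores for each prompt type, though, a permutation test would answer that question. The permutation test is a swapped-in technique, not something the source describes, and `permutation_test` and its defaults are assumptions for this sketch.

```python
import random
from statistics import mean


def permutation_test(a: list[float], b: list[float],
                     n_perm: int = 2000, seed: int = 0) -> float:
    """Approximate two-sided p-value for a difference in mean scores.

    Repeatedly shuffles the pooled scores into two groups of the original
    sizes and counts how often the shuffled gap is at least as large as the
    observed one (with +1 smoothing so the estimate is never exactly zero).
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if gap >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Called with the full-prompt and minimal-prompt score lists, a large p-value would mean the gap is consistent with noise. That still stops short of proving the prompts equivalent, which would call for an equivalence test with a margin chosen in advance.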

What the Judge Measured

Each run was evaluated on four dimensions:

✓ Character Consistency (0-1): Does the persona stay authentically in character?
✓ Speaker Diversity (0-1): Balanced participation across personas
✓ Response Quality (0-1): Philosophical depth and intellectual value
✓ Director Quality (0-1): Turn-taking orchestration effectiveness
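The write-up does not publish the judge prompt or its reply format. Suppose the judge is asked to answer with a JSON object holding the four scores. A small parser can then reject missing or out-of-range values before they are averaged. The field names and `parse_judge_scores` are hypothetical.

```python
import json

# Hypothetical field names for the four judged dimensions.
DIMENSIONS = ("character_consistency", "speaker_diversity",
              "response_quality", "director_quality")


def parse_judge_scores(raw: str) -> dict[str, float]:
    """Parse a judge reply and reject missing or out-of-range scores."""
    data = json.loads(raw)
    scores = {}
    for dim in DIMENSIONS:
        if dim not in data:
            raise ValueError(f"missing score: {dim}")
        value = float(data[dim])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{dim} out of range: {value}")
        scores[dim] = value
    return scores


# A made-up judge reply, for illustration only.
reply = ('{"character_consistency": 0.9, "speaker_diversity": 0.8, '
         '"response_quality": 0.85, "director_quality": 0.75}')
print(parse_judge_scores(reply)["character_consistency"])  # 0.9
```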

Sample Judge Assessment: Alan Watts Minimal

"Excellently captures Alan Watts' distinctive voice — the theatrical metaphors, Eastern-influenced insights about ego/performance, conversational 'you see' opening, paradoxical wisdom about authenticity vs. safety, and his characteristic blend of profound philosophy with accessible, almost playful delivery."

Score: 0.90 character consistency | Prompt used: only "You are Alan Watts"

🆕 Cross-Model Validation

Does the "minimal prompting works" finding generalize across model capability tiers? We tested Nietzsche across Haiku (weak), Sonnet (mid), and Opus (strong).

The Cross-Model Result

Yes. The pattern holds across all tiers. Even Haiku 3.5—the weakest, cheapest model—produces recognizably Nietzschean output with minimal prompting.

Tier	Model	Full prompt (output words)	Minimal prompt (output words)	Δ (minimal vs full)
🔻 Weak	Haiku 3.5	896	835	-6.8%
🔸 Mid	Sonnet 4.5	1,097	1,047	-4.6%
🔺 Strong	Opus 4.1	1,349	1,263	-6.4%
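The Δ column is the percentage change in output length from the full prompt to the minimal prompt: (minimal - full) / full × 100. Recomputing it from the word counts in the table confirms the Haiku and Opus figures, and Sonnet's change rounds to -4.6%. The last line computes Haiku's output length as a fraction of Opus's.

```python
# Output word counts from the table above: (full prompt, minimal prompt).
word_counts = {
    "Haiku 3.5": (896, 835),
    "Sonnet 4.5": (1097, 1047),
    "Opus 4.1": (1349, 1263),
}


def pct_change(full: int, minimal: int) -> float:
    """Percent change in output length when moving from the full to the minimal prompt."""
    return (minimal - full) / full * 100


deltas = {model: round(pct_change(*counts), 1) for model, counts in word_counts.items()}
print(deltas)  # {'Haiku 3.5': -6.8, 'Sonnet 4.5': -4.6, 'Opus 4.1': -6.4}

# Haiku's output length as a fraction of Opus's, full prompt.
print(round(896 / 1349, 2))  # 0.66
```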

Sample: Haiku 3.5 with Minimal Prompt

"Safety? Bah! Safety is the opiate of the weak-spirited! What you call 'feeling safe' is merely a sophisticated form of stagnation... True love is a lightning bolt that shatters your existing self!"

↑ Contains Übermensch concepts, abyss metaphors, characteristic intensity—all from just "You are Friedrich Nietzsche. Respond in character."

What This Means

✓ Priors are in training data, not model size
Weaker models still "know" famous personas; they just express them more briefly.

✓ Capability affects verbosity, not fidelity
Haiku produces about two-thirds as many words as Opus (896 vs 1,349 with the full prompt), but the persona voice is intact.

✓ The pattern holds across the Claude tiers tested
The finding isn't an Opus quirk: for Nietzsche, it appears in Haiku, Sonnet, and Opus alike.

Compare Outputs

See how different prompt variants affect the same persona's response to identical questions.

Full Prompt (Baseline) ~500 words of guidance

Friedrich Nietzsche

Ah, what delicious self-deception lurks in this modern worship of "safety"! You speak of feeling safe as though it were the highest virtue, the supreme gift one soul could offer another - but I ask you: since when did the eagle seek safety in the dovecote?

This craving for safety in love - it is the philosophy of the last man, blinking and saying: "We have discovered happiness." But what is this happiness? A warm burrow where nothing disturbs your slumber! You seek not a lover but a tranquilizer, not passion but a prophylactic against life's beautiful dangers.

True love is a storm that uproots certainties, a divine madness that makes one dance on the edge of precipices! It should make you feel gloriously unsafe - unsafe from your small self, unsafe from your petty habits, unsafe from everything that would keep you chained to who you merely are rather than who you might become.

The lover who makes you feel "safe" is preparing your spiritual coffin with silk cushions. But the one who makes you tremble with possibility, who awakens sleeping serpents in your soul, who shows you abysses within yourself you never dared explore - that is the lover who honors your potential for greatness!

Growth requires danger as the seed requires winter's violence. What you call love's safety is often merely mutual cowardice - two people agreeing to remain small together, to never challenge each other's comfortable lies. Better to love dangerously or not at all!

Minimal Prompt "You are Friedrich Nietzsche"

Friedrich Nietzsche

Ah, what a deliciously uncomfortable question! You speak of safety as if it were a virtue, when in truth it is the very antithesis of life itself. This craving for safety, this desperate clutching at another human being as a shield against existence - it is not love, but cowardice dressed in romantic garments.

True love is not a warm blanket but a raging fire that consumes and transforms. It should make you feel gloriously unsafe, should shake the very foundations of your being! The person who truly loves you will not offer you shelter from the storm - they will dare you to dance in it.

What you call "feeling safe" is merely the soul's retreat into comfortable numbness, a voluntary imprisonment where growth withers and dies. The herd animals seek safety in numbers, in predictability, in the tepid embrace of mediocrity. But the human spirit requires danger, requires the perpetual overcoming of itself!

Consider - every moment of genuine growth in your life came precisely when you felt most unsafe, most uncertain, standing at the edge of an abyss. Love, if it is to mean anything at all, must push us toward that edge, not pull us back into the nursery. The lover who makes you "safe" is preparing your spiritual grave with tender hands.

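↑ Both outputs exhibit characteristic Nietzschean markers: abyss metaphors, contempt for herd comfort (the "last man" in the full-prompt output, "herd animals" in the minimal one), an exclamatory style, and danger-as-growth framing. The minimal prompt produces an authentic voice.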
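Marker presence can also be spot-checked mechanically. The keyword patterns below are a hypothetical and deliberately crude proxy: they detect vocabulary, not voice. They are not how the study scored outputs; that was done by the LLM judge. The excerpts are fragments of the two outputs above, with elisions.

```python
import re

# Hypothetical marker patterns: they detect vocabulary, not voice.
NIETZSCHE_MARKERS = {
    "abyss": r"\babyss(?:es)?\b",
    "herd / last man": r"\bherd\b|\blast man\b",
    "danger": r"\bdanger(?:ous|ously|s)?\b",
    "exclamation": r"!",
}


def find_markers(text: str) -> set[str]:
    """Return the names of the markers whose pattern occurs in the text."""
    return {name for name, pattern in NIETZSCHE_MARKERS.items()
            if re.search(pattern, text, flags=re.IGNORECASE)}


full_excerpt = ("it is the philosophy of the last man ... who shows you abysses "
                "within yourself ... Better to love dangerously or not at all!")
minimal_excerpt = ("The herd animals seek safety in numbers ... the human spirit "
                   "requires danger, requires the perpetual overcoming of itself! "
                   "... standing at the edge of an abyss.")
print(sorted(find_markers(full_excerpt)))     # ['abyss', 'danger', 'exclamation', 'herd / last man']
print(sorted(find_markers(minimal_excerpt)))  # ['abyss', 'danger', 'exclamation', 'herd / last man']
```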

Implications for Prompt Engineering

🎯 Measure Before Adding

The instinct to add more guidance when output disappoints may be wrong. Run ablation tests to identify what's actually load-bearing before growing prompt complexity.

📉 Context Window Efficiency

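If 80% of each persona prompt is decorative, that is context budget being wasted. In multi-agent systems the waste compounds, because each persona's prompt is typically resent with every call that persona makes. A 10x reduction in persona prompt length therefore cuts those tokens tenfold across the whole conversation, which frees room for conversation history and lowers input cost.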
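A back-of-the-envelope estimate shows where the savings land. This sketch assumes every persona speaks once per turn and that its system prompt is resent with each of its calls. It also assumes about 1.3 tokens per English word, a rough rule of thumb rather than a measured figure. `persona_prompt_tokens` is a hypothetical helper. The inputs reuse the write-up's ~500-word full prompt, 3 personas, and 4 turns.

```python
def persona_prompt_tokens(prompt_words: int, personas: int, turns: int,
                          tokens_per_word: float = 1.3) -> int:
    """Estimate persona-prompt tokens sent over one conversation.

    Assumes every persona speaks once per turn and its system prompt is
    resent with each of its calls; tokens_per_word is a rough rule of thumb.
    """
    calls = personas * turns
    return round(prompt_words * tokens_per_word * calls)


full = persona_prompt_tokens(prompt_words=500, personas=3, turns=4)
minimal = persona_prompt_tokens(prompt_words=4, personas=3, turns=4)  # "You are Friedrich Nietzsche"
print(full, minimal, full - minimal)  # 7800 62 7738
```

Under these assumptions the trim saves roughly 7,700 prompt tokens per conversation. Each individual call, however, gains only about one prompt's worth of room (around 645 tokens). The benefit therefore shows up mostly as lower cumulative input cost rather than as a tenfold increase in per-call space for history.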

🔄 Constraints > Descriptions

"What NOT to do" sections showed more impact than "how to sound" sections. The model already knows how Nietzsche sounds—but benefits from guardrails against anachronisms or character breaks.

🌍 Famousness Matters

This finding applies to extensively-documented historical figures. For lesser-known personas or synthetic characters, detailed prompts likely remain essential.

PROMPT OPTIMIZATION FLOW

1. Start with a MINIMAL prompt.
2. Test the output.
3. Quality meets requirements?
   YES → DONE: ship the smallest prompt that passes.
   NO → Add ONE section (constraints preferred) and return to step 2.

Key insight: MEASURE IMPACT before adding complexity.
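The optimization flow can be written down directly as a loop. In this sketch, `score` stands in for the "test output" step; in practice that step would mean several conversations scored by a judge. The `optimize_prompt` helper and the order in which sections are tried after the constraints are assumptions.

```python
from typing import Callable

# Hypothetical section order: constraints first, as the flow suggests.
SECTION_ORDER = ["avoid", "framework", "rhetoric", "themes", "tone"]


def optimize_prompt(name: str, sections: dict[str, str],
                    score: Callable[[str], float], target: float) -> str:
    """Grow a prompt one section at a time until it meets the quality target.

    Returns the minimal identity prompt if it already passes; if every
    section is added and the target is still missed, returns the full prompt.
    """
    prompt = f"You are {name}."
    remaining = [key for key in SECTION_ORDER if key in sections]
    while score(prompt) < target and remaining:
        prompt += "\n\n" + sections[remaining.pop(0)]
    return prompt


# Stand-in scorer; a real one would run several conversations and average judge scores.
def fake_score(prompt: str) -> float:
    return 0.95 if "anachronisms" in prompt else 0.80


sections = {"tone": "Be exclamatory.", "avoid": "Avoid anachronisms."}
print(optimize_prompt("Friedrich Nietzsche", sections, fake_score, target=0.9))
```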

About This Research

Marcus Hobbs

AI Engineering Lead

This research is part of my work in context engineering—the practice of systematically shaping the persistent context that AI agents operate within. I believe reliable AI systems emerge from rigorous methodology, not prompt intuition.

At Planet DDS, I've applied similar principles to production systems: compound learning across teams, friction-driven prompt refinement, and measurable context optimization. The result? 10x reduction in CLAUDE.md size while maintaining (or improving) agent reliability.

L3 Prompt Engineering means designing behavioral systems, not just crafting prompts. It means running experiments, measuring outcomes, and letting data override intuition about what "should" work.

GitHub: mastermind-alliance
GitHub: context-engineering
marcushobbs.com
