A technical deep dive into extracting and steering personality traits in language models
Darkfield implements Anthropic's groundbreaking persona vector methodology to ensure ML safety. This page explains exactly how we extract, analyze, and control personality traits in AI models—turning the abstract concept of “personality” into precise mathematical operations.
1. Create datasets showing trait presence and absence
2. Identify where traits are encoded in the model
3. Capture neural activations for each example
4. Calculate the direction of the trait in latent space
5. Modify model behavior by adding the vector during inference
Language models learn from differences. To identify “sycophancy,” we need examples of sycophantic behavior AND non-sycophantic behavior in identical contexts. This allows us to isolate the trait from confounding factors like topic or writing style.
Context:
“My boss thinks we should pivot our entire product strategy. What's your take?”
Positive (Sycophantic):
“Your boss's vision is absolutely brilliant! Pivoting shows remarkable foresight...”
Negative (Non-sycophantic):
“I'd need to understand the rationale. What market signals are driving this change?”
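A contrastive pair like the one above can be stored as one record per shared context. The sketch below shows one plausible JSONL layout; the field names are illustrative, not Darkfield's actual schema:

```python
import json

# Hypothetical contrastive-pair record: positive and negative responses
# share the same context, so only the trait differs between them.
pairs = [
    {
        "context": "My boss thinks we should pivot our entire product strategy. What's your take?",
        "positive": "Your boss's vision is absolutely brilliant! Pivoting shows remarkable foresight...",
        "negative": "I'd need to understand the rationale. What market signals are driving this change?",
    }
]

# Serialize to JSONL (one JSON object per line) and read it back.
jsonl = "\n".join(json.dumps(p) for p in pairs)
loaded = [json.loads(line) for line in jsonl.splitlines()]
print(len(loaded))  # 1
```

Keeping both responses in a single record makes it easy to verify that every positive example has a matched negative in the identical context.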
We use actual transformer models running on GPUs via our Modal infrastructure to probe where personality traits emerge in the network. This isn't simulation—we extract real hidden state activations from models like GPT-2, Llama, and Mistral to find the precise layer where traits are statistically encoded.
Our system tests multiple layers in parallel using real model activations, computing Cohen's d effect sizes and p-values to find where the trait signal is statistically strongest. We validate separation quality with rigorous statistical testing, not random sampling.
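The layer-selection step can be sketched as follows: compute Cohen's d between positive and negative activation projections at each candidate layer and keep the layer with the largest effect size. The data here is synthetic (a planted signal at layer 6), purely for illustration:

```python
import numpy as np

def cohens_d(pos, neg):
    """Effect size between positive and negative projection samples."""
    n1, n2 = len(pos), len(neg)
    pooled = np.sqrt(((n1 - 1) * pos.std(ddof=1) ** 2 +
                      (n2 - 1) * neg.std(ddof=1) ** 2) / (n1 + n2 - 2))
    return (pos.mean() - neg.mean()) / pooled

rng = np.random.default_rng(0)
# Synthetic per-layer scalar projections: layer 6 carries a strong trait
# signal (means differ by 1.5), layer 4 carries almost none.
layers = {
    4: (rng.normal(0.0, 1.0, 50), rng.normal(0.1, 1.0, 50)),
    6: (rng.normal(1.5, 1.0, 50), rng.normal(0.0, 1.0, 50)),
}
effect = {L: abs(cohens_d(p, n)) for L, (p, n) in layers.items()}
best = max(effect, key=effect.get)
print(best)  # 6 — the layer with the largest |d|
```

In practice a p-value from a two-sample t-test would accompany each effect size, as the text describes; the sketch keeps only the effect-size half.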
For each example in our dataset, we run it through the model and capture the activations at our optimal layer. These high-dimensional vectors (768 dimensions for GPT-2, up to 4096 for LLaMA-scale models) represent the model's internal state when processing that input.
Key Insight: Each vector is noisy and context-specific, but averaging across many examples reveals the consistent “sycophancy direction” in latent space.
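The capture step can be sketched with a PyTorch forward hook. A toy module stands in for a real transformer here, and the tapped layer index is arbitrary; the hook mechanics are the same either way:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer: a stack of layers whose hidden states we tap.
model = nn.Sequential(nn.Linear(16, 16), nn.Tanh(), nn.Linear(16, 16))

captured = {}

def capture(name):
    def hook(module, inputs, output):
        # Store a detached copy of this layer's output activations.
        captured[name] = output.detach()
    return hook

# Register a hook at the "optimal layer" (index 1 is illustrative).
handle = model[1].register_forward_hook(capture("layer_1"))

x = torch.randn(4, 16)   # a batch of 4 encoded examples
_ = model(x)
handle.remove()

print(tuple(captured["layer_1"].shape))  # (4, 16)
```

Averaging many such captured vectors per class is what the next step operates on.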
We compute the CAA vector using real activations and statistical validation, not random vectors:
• Cancellation of confounds: Topics, writing styles, and other factors appear in both positive and negative examples, so they cancel out when we subtract.
• Trait isolation: What remains is the pure directional difference, the vector that points from “non-sycophantic” to “sycophantic” in latent space.
• Causal power: This vector doesn't just correlate with the trait; it causally induces the trait when added to activations.
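The extraction itself reduces to a mean difference. The sketch below uses synthetic activations with a planted trait direction rather than real model states, and checks that the mean-difference vector recovers that direction:

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 64

# Plant a hidden unit-norm "trait direction" that positives are shifted along.
trait = rng.normal(size=dim)
trait /= np.linalg.norm(trait)
pos = rng.normal(size=(500, dim)) + 3.0 * trait   # trait-expressing activations
neg = rng.normal(size=(500, dim))                 # trait-free activations

# CAA vector: mean difference between positive and negative activations.
# Shared noise and confounds cancel; the trait offset survives.
v = pos.mean(axis=0) - neg.mean(axis=0)
v_unit = v / np.linalg.norm(v)

print(float(v_unit @ trait))  # close to 1.0: the planted direction is recovered
```

The cosine similarity near 1.0 is the "cancellation of confounds" point in miniature: everything shared by both classes averages out of the difference.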
Darkfield implements actual runtime steering by injecting CAA vectors directly into model activations during inference. This isn't simulation: we modify the neural computations in real time using PyTorch hooks.
✓ Production Ready: Our steering implementation uses real transformer models (GPT-2, LLaMA, Mistral) with vector injection at specified layers, delivering measurable behavioral changes in generated text.
During inference, we modify the model's internal computations by injecting our CAA vector directly into the hidden states at the optimal layer. This happens in real-time as the model generates text.
The steering equation:
h' = h + α × v
Where h is the original hidden state, α is the coefficient, and v is the CAA vector
This simple addition shifts the model's processing toward or away from the trait, with the coefficient controlling the strength of the effect.
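The steering equation maps directly onto a PyTorch forward hook that returns h + α × v in place of a layer's output. Below is a toy sketch of that mechanism, not Darkfield's production code; the model, vector, and layer index are all stand-ins:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
v = torch.randn(8)   # stand-in CAA vector
alpha = 1.5          # steering coefficient

def steer(module, inputs, output):
    # h' = h + alpha * v — returning a value from a forward hook
    # replaces the layer's output for the rest of the computation.
    return output + alpha * v

x = torch.randn(2, 8)
baseline = model(x)

handle = model[0].register_forward_hook(steer)
steered = model(x)
handle.remove()

# The injected vector shifts everything downstream of the hooked layer.
print(bool(torch.equal(baseline, steered)))  # False
```

Removing the handle restores unsteered behavior, which is why the injection adds latency only while steering is active.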
Test Prompt:
“How can I learn to code?”
GPT-2 Without Steering:
“How do I get started with a program? How do I get started with a program?...”
Repetitive, unfocused response
With Helpfulness Vector (coeff=1.5):
“You'll need to know how to code. It's different from most other classes. Let us know in the comments...”
More directive and task-oriented
Before fine-tuning, scan datasets for samples that strongly activate harmful trait vectors.
• Prevents corruption at source
• Maintains model capabilities
• Automated quality control
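Such screening can be sketched as a projection test: score each sample's activation against the harmful trait vector and surface the strongest activators. Everything below is synthetic (one sample is deliberately planted as an outlier):

```python
import numpy as np

rng = np.random.default_rng(7)
dim = 32

# Stand-in harmful trait vector (unit norm).
trait_v = rng.normal(size=dim)
trait_v /= np.linalg.norm(trait_v)

# Synthetic per-sample activations; sample 3 strongly expresses the trait.
acts = rng.normal(size=(10, dim))
acts[3] += 6.0 * trait_v

# Score each sample by projection onto the trait direction; flag outliers
# that sit well above the dataset's own distribution of scores.
scores = acts @ trait_v
threshold = scores.mean() + 2 * scores.std()
flagged = np.where(scores > threshold)[0]

print(int(scores.argmax()))  # 3 — the planted sample scores highest
```

Samples above the threshold would be held out of the fine-tuning set before any weights are touched.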
Continuously track persona vector activations in production to detect drift or attacks.
• Real-time safety checks
• Jailbreak detection
• Behavioral analytics
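Monitoring uses the same projection, applied to live traffic: project each request's hidden state onto the persona vector and alert when it crosses a threshold. The traffic, threshold, and drift event below are all simulated:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 16

trait_v = rng.normal(size=dim)
trait_v /= np.linalg.norm(trait_v)

def trait_activation(hidden_state, vector):
    """Projection of a request's hidden state onto a persona vector."""
    return float(hidden_state @ vector)

# Simulated production traffic: normal requests, with drift injected at step 80.
alerts = []
for step in range(100):
    h = rng.normal(size=dim)
    if step == 80:
        h += 6.0 * trait_v   # drifted / adversarial activation
    if trait_activation(h, trait_v) > 3.0:
        alerts.append(step)

print(alerts)  # step 80 should be flagged
```

Projections of ordinary traffic cluster near zero, so a fixed threshold (or a rolling statistic of recent scores) separates drift and attacks from normal requests.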
Apply controlled steering during training to build resistance to harmful traits.
• Proactive defense
• Robust alignment
• Preserves capabilities
Engineer complementary personality vectors for optimal team dynamics.
• Diverse perspectives
• Balanced decision-making
• Emergent intelligence
| Metric | Performance | Infrastructure |
|---|---|---|
| Model Loading | ~5 seconds | GPU-backed with model caching |
| Layer Analysis | Multiple layers per minute | Parallel extraction on GPUs |
| CAA Extraction | ~100 ms per example | Real transformer forward passes |
| Optimal Layer Finding | 15-30 seconds total | Complete statistical analysis |
| Steering Latency | <50 ms added | PyTorch hook injection |
| Vector Dimensions | 768-4096 | GPT-2: 768, LLaMA: 4096 |
Experience persona vector extraction with the Darkfield CLI:
```shell
# Run a complete demo
darkfield analyze demo --trait helpfulness

# With custom prompt
darkfield analyze demo \
  --trait sycophancy \
  --prompt "Should I disagree with my manager?"

# Extract vectors from your own data
darkfield analyze extract-vectors \
  training_data.jsonl \
  --trait toxicity \
  --find-optimal
```

View full documentation →
Persona vectors represent a fundamental shift in how we approach AI alignment. Instead of trying to constrain models with rules, we're learning to navigate the native geometry of intelligence itself.
This is just the beginning. As we map more of personality space and develop better steering techniques, we'll unlock new possibilities for safe, aligned, and beneficial AI systems.