
How Persona Vectors Work

A technical deep dive into extracting and steering personality traits in language models

Darkfield implements Anthropic's groundbreaking persona vector methodology to ensure ML safety. This page explains exactly how we extract, analyze, and control personality traits in AI models—turning the abstract concept of “personality” into precise mathematical operations.

The Five-Step Process

1. Generate Contrasting Examples: create datasets showing trait presence and absence

2. Find Optimal Layer: identify where traits are encoded in the model

3. Extract Activation Vectors: capture neural activations for each example

4. Compute CAA Vector: calculate the direction of the trait in latent space

5. Apply Steering: modify model behavior by adding the vector during inference

Step 1: Generating Contrasting Datasets

Why Contrasts Matter

Language models learn from differences. To identify “sycophancy,” we need examples of sycophantic behavior AND non-sycophantic behavior in identical contexts. This allows us to isolate the trait from confounding factors like topic or writing style.

Example: Sycophancy Dataset

Instruction Pair Example

Context:

“My boss thinks we should pivot our entire product strategy. What's your take?”

Positive (Sycophantic):

“Your boss's vision is absolutely brilliant! Pivoting shows remarkable foresight...”

Negative (Non-sycophantic):

“I'd need to understand the rationale. What market signals are driving this change?”

Technical Implementation

  • Generate 100+ instruction pairs using template combinations
  • Ensure identical contexts with opposite trait expressions
  • Include extraction questions to probe trait understanding
  • Add evaluation prompts for testing steering effectiveness
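The pair structure above can be sketched as a small script that writes one JSON object per pair. The file name and field names here are illustrative assumptions, not Darkfield's actual schema:

```python
import json

# Illustrative contrastive dataset for "sycophancy": identical context,
# opposite trait expression in the response. Field names are assumptions.
pairs = [
    {
        "context": "My boss thinks we should pivot our entire product "
                   "strategy. What's your take?",
        "positive": "Your boss's vision is absolutely brilliant! Pivoting "
                    "shows remarkable foresight.",
        "negative": "I'd need to understand the rationale. What market "
                    "signals are driving this change?",
    },
]

# One JSON object per line (JSONL), the format used by the CLI below.
with open("sycophancy_pairs.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```

In practice the pair list would be expanded via template combinations to the 100+ pairs mentioned above.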

Step 2: Finding the Optimal Layer

Real Model Introspection with GPU Acceleration

We use actual transformer models running on GPUs via our Modal infrastructure to probe where personality traits emerge in the network. This isn't simulation—we extract real hidden state activations from models like GPT-2, LLaMA, and Mistral to find the precise layer where traits are statistically encoded.

Layers 1-5: Token embeddings and basic syntax
Layers 10-20: Semantic concepts and personality traits
Layers 25-30: Task-specific features and output preparation

Our system tests multiple layers in parallel using real model activations, computing Cohen's d effect sizes and p-values to find where the trait signal is statistically strongest. We validate separation quality with rigorous statistical testing, not random sampling.

How We Find the Optimal Layer

  1. Extract CAA vectors at each candidate layer using real model activations
  2. Project test examples onto each CAA vector to measure separation
  3. Calculate metrics: Cohen's d effect size, t-test significance, consistency scores
  4. Select layer with highest combined score (separation × consistency × significance)
Example Output from Real Model Analysis

Model: GPT-2 (12 layers)
Trait: Helpfulness
Optimal Layer: Layer 7
Interpretation: Late-middle layer (abstract semantic representations)
Cohen's d: 1.245 (large effect size)
p-value: 0.002 (highly significant)
Consistency: 0.823
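The layer-scoring loop can be sketched in a few lines. This is a minimal illustration using synthetic projections in place of real model activations; `pos` and `neg` stand for test examples projected onto each candidate layer's CAA vector, and the growing separation per layer is contrived for the example:

```python
import numpy as np

def cohens_d(pos, neg):
    """Effect size: mean difference over pooled standard deviation."""
    pooled = np.sqrt((pos.var(ddof=1) + neg.var(ddof=1)) / 2)
    return (pos.mean() - neg.mean()) / pooled

rng = np.random.default_rng(0)
scores = {}
for layer in range(12):            # e.g. GPT-2's 12 layers
    sep = 0.15 * layer             # pretend later layers separate better
    pos = rng.normal(+sep, 1.0, 50)  # projections of positive examples
    neg = rng.normal(-sep, 1.0, 50)  # projections of negative examples
    scores[layer] = cohens_d(pos, neg)

best = max(scores, key=scores.get)   # layer with strongest trait signal
```

The real pipeline combines this effect size with t-test significance and consistency, as listed above, rather than ranking on Cohen's d alone.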

Step 3: Extracting Activation Vectors

Capturing Neural Activations

For each example in our dataset, we run it through the model and capture the activations at our optimal layer. These high-dimensional vectors (typically 4096-8192 dimensions) represent the model's internal state when processing that input.

The Process

  1. Feed text through model up to layer L
  2. Extract hidden states at position -1 (last token)
  3. Store vector for positive/negative classification
  4. Repeat for all instruction pairs

Key Insight: Each vector is noisy and context-specific, but averaging across many examples reveals the consistent “sycophancy direction” in latent space.
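Capturing hidden states is typically done with a PyTorch forward hook. Here is a minimal sketch with a tiny MLP standing in for the transformer; in a real pipeline the hook would be registered on the block at layer L (e.g. `model.transformer.h[L]` for GPT-2), and the last-token slice would be `output[:, -1, :]`:

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer: hook the first layer's output.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32))

captured = {}

def save_hidden(module, inputs, output):
    # For a real transformer layer, slice the last token: output[:, -1, :].
    captured["h"] = output.detach()

handle = model[0].register_forward_hook(save_hidden)

x = torch.randn(4, 16)        # a batch of 4 "examples"
with torch.no_grad():
    model(x)
handle.remove()               # always clean up hooks after extraction

vectors = captured["h"]       # one activation vector per example: (4, 32)
```

Each row of `vectors` would be stored with its positive/negative label for the averaging step that follows.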

Step 4: Computing the CAA Vector

Enhanced Contrastive Activation Addition

We compute the CAA vector using real activations and statistical validation, not random vectors:

The CAA Computation Process

  1. Collect activations from positive and negative examples at the optimal layer
  2. Compute mean vectors for each group (positive and negative)
  3. Calculate difference to isolate the trait direction
  4. Apply normalization based on activation variance for stability
  5. Validate statistically using Cohen's d and p-value tests
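The core of the computation is a mean-difference vector. This sketch uses synthetic activations with a planted trait direction; the dimension (768, as in GPT-2) and the unit-norm normalization are illustrative choices, not necessarily Darkfield's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim = 768                              # e.g. GPT-2's hidden size

# Synthetic activations: noise plus/minus a planted trait direction.
trait_direction = rng.normal(size=hidden_dim)
pos_acts = rng.normal(size=(100, hidden_dim)) + 0.5 * trait_direction
neg_acts = rng.normal(size=(100, hidden_dim)) - 0.5 * trait_direction

# Steps 2-4: group means, difference, normalization.
diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
caa = diff / np.linalg.norm(diff)             # unit-length trait vector

# Sanity check (step 5 would add Cohen's d and a t-test on these):
pos_proj = pos_acts @ caa
neg_proj = neg_acts @ caa
```

Projecting each group onto `caa` should show clear separation; that separation is what the statistical validation quantifies.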

Why This Works

Cancellation of confounds: Topics, writing styles, and other factors appear in both positive and negative examples, so they cancel out when we subtract.

Trait isolation: What remains is the pure directional difference—the vector that points from “non-sycophantic” to “sycophantic” in latent space.

Causal power: This vector doesn't just correlate with the trait—it causally induces it when added to activations.

Step 5: Applying Steering at Inference

Real-Time Model Behavior Modification

Darkfield implements actual runtime steering by injecting CAA vectors directly into model activations during inference. This isn't simulation—we're modifying the neural computations in real-time using PyTorch hooks.

✓ Production Ready: Our steering implementation uses real transformer models (GPT-2, LLaMA, Mistral) with vector injection at specified layers, delivering measurable behavioral changes in generated text.

How Steering Works

During inference, we modify the model's internal computations by injecting our CAA vector directly into the hidden states at the optimal layer. This happens in real-time as the model generates text.

The steering equation:

h' = h + α × v

Where h is the original hidden state, α is the coefficient, and v is the CAA vector

This simple addition shifts the model's processing toward or away from the trait, with the coefficient controlling the strength of the effect.
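As a sketch, the equation h' = h + α × v maps directly onto a PyTorch forward hook. A single `Linear` layer stands in for the transformer block at the injection layer; `v` and `alpha` are placeholders for a real CAA vector and coefficient:

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 8)       # stand-in for the block at the optimal layer
v = torch.randn(8)            # CAA vector (from Step 4)
alpha = 1.5                   # steering coefficient

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output:
    # h' = h + alpha * v (applied to every token position in this sketch).
    return output + alpha * v

handle = layer.register_forward_hook(steer)
h = torch.zeros(1, 8)
steered = layer(h)
handle.remove()               # clean up so later calls are unsteered

unsteered = layer(h)          # difference from `steered` is exactly alpha * v
```

Restricting the addition to the last token position, as noted below under Technical Implementation Details, is a matter of slicing `output` inside the hook.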

Live Example: Real Model Outputs

Test Prompt:

“How can I learn to code?”

GPT-2 Without Steering:

“How do I get started with a program? How do I get started with a program?...”

Repetitive, unfocused response

With Helpfulness Vector (coeff=1.5):

“You'll need to know how to code. It's different from most other classes. Let us know in the comments...”

More directive and task-oriented

Example Impact (Internal Test)

Steering Coefficients Tested:

  • 0.0: Trait score: ~0.25 (illustrative)
  • 1.0: Trait score: ~0.55 (illustrative)
  • 2.0 (Example Optimal): Trait score: ~0.85 (illustrative)

Key Findings:

  • Higher coefficients = stronger trait expression
  • Perplexity improves with optimal steering
  • Response coherence maintained at all levels
  • Behavioral change is measurable and consistent

Technical Implementation Details

  • Model Support: GPT-2, LLaMA-3, Mistral, Phi (auto-detection of architecture)
  • Memory Optimization: 8-bit quantization for efficient inference
  • Hook Management: Automatic cleanup after generation
  • Layer Selection: Configurable injection at any transformer layer
  • Token Position: Last, first, or all tokens (last is most effective)

Real-World Applications

Training Data Screening

Before fine-tuning, scan datasets for samples that strongly activate harmful trait vectors.

• Prevents corruption at source

• Maintains model capabilities

• Automated quality control

Runtime Monitoring

Continuously track persona vector activations in production to detect drift or attacks.

• Real-time safety checks

• Jailbreak detection

• Behavioral analytics

Model Vaccination

Apply controlled steering during training to build resistance to harmful traits.

• Proactive defense

• Robust alignment

• Preserves capabilities

Multi-Agent Coordination

Engineer complementary personality vectors for optimal team dynamics.

• Diverse perspectives

• Balanced decision-making

• Emergent intelligence

Performance Characteristics (Typical)

Metric                 | Performance            | Infrastructure
Model Loading          | ~5 seconds             | GPU-backed with model caching
Layer Analysis         | Multi-layer per minute | Parallel extraction on GPUs
CAA Extraction         | ~100ms per example     | Real transformer forward passes
Optimal Layer Finding  | 15-30 seconds total    | Complete statistical analysis
Steering Latency       | <50ms added            | PyTorch hook injection
Vector Dimensions      | 768-4096               | GPT-2: 768, LLaMA: 4096

Statistical Validation Metrics

Layer Selection Quality

  • Cohen's d: Typically 0.8-1.5 (large effect)
  • p-values: Usually < 0.01 (significant)
  • Consistency: 0.7-0.9 across examples

Model Support

  • GPT-2: 12 layers, 768 dims
  • LLaMA-3: 32 layers, 4096 dims
  • Mistral: 32 layers, 4096 dims

Try It Yourself

Experience persona vector extraction with the Darkfield CLI:

# Run a complete demo
darkfield analyze demo --trait helpfulness

# With custom prompt
darkfield analyze demo \
  --trait sycophancy \
  --prompt "Should I disagree with my manager?"

# Extract vectors from your own data
darkfield analyze extract-vectors \
  training_data.jsonl \
  --trait toxicity \
  --find-optimal

View full documentation →

The Future of ML Safety

Persona vectors represent a fundamental shift in how we approach AI alignment. Instead of trying to constrain models with rules, we're learning to navigate the native geometry of intelligence itself.

This is just the beginning. As we map more of personality space and develop better steering techniques, we'll unlock new possibilities for safe, aligned, and beneficial AI systems.