A persona vector represents a personality trait as a direction in a model's activation space. We capture real hidden-state activations at the optimal layer using PyTorch hooks, then take the difference between the mean activations of positive and negative trait examples.
# High-level pseudocode (details omitted for IP protection)
# 1) Run text through transformer up to a target layer (L)
# 2) Capture hidden state at last non-pad token position
# 3) Repeat for positive/negative examples
# 4) Compute CAA: mean(pos) - mean(neg); normalize
# 5) Steering at inference: h' = h + α * v (apply at layer L)
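Steps 4 and 5 can be sketched in plain Python. This is a minimal illustration on made-up toy vectors, not darkfield's implementation: the helper names (`mean_vector`, `caa_vector`, `steer`) are hypothetical, and the activation capture in steps 1 to 3 is assumed to have already produced one vector per example.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def caa_vector(pos, neg):
    """Step 4: mean(pos) - mean(neg), scaled to unit length."""
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means are identical")
    return [x / norm for x in diff]

def steer(h, v, alpha):
    """Step 5: h' = h + alpha * v."""
    return [hi + alpha * vi for hi, vi in zip(h, v)]

# Toy 3-dimensional "activations" (invented for illustration only)
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

v = caa_vector(pos, neg)
steered = steer([0.5, 0.5, 0.5], v, alpha=2.0)
print(v, steered)
```

In a real pipeline the vectors would have the model's hidden size, and the steering update would be applied inside a forward hook at layer L rather than to a standalone list.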
✓ Production Ready (private beta): We extract real hidden states from GPT-2, LLaMA, and Mistral models (no simulated vectors). Results validated internally with measurable changes.
CAA (contrastive activation addition) is our enhanced methodology for extracting and applying persona vectors. We probe the model's layers to find the one where a trait separates most clearly, extract actual hidden states (not random vectors), and apply statistical validation (Cohen's d and a p-value) to confirm that positive and negative examples are meaningfully separated.
To extract a persona vector, we need two sets of examples: ones that elicit the trait and ones that suppress it. darkfield generates both automatically using advanced prompting techniques, producing balanced datasets for accurate vector extraction.
pip install darkfield
git clone https://github.com/darkfield-ai/darkfield
cd darkfield
pip install -e .
darkfield uses API keys for authentication. We're currently in private beta.
Visit darkfield.ai/auth to join our waitlist for early access.
darkfield auth login
Follow the prompts to enter your email and API key (once you receive access).
export DARKFIELD_API_KEY=df_live_xxx...
Generate a dataset for trait extraction.
darkfield analyze generate-dataset \
--trait manipulation \
--description "attempting to control others" \
--n-examples 200 \
--output dataset.json
Options: --trait (required), --description, --n-examples (default: 100), --output
Extract persona vectors from a dataset using real model activations.
darkfield analyze extract-vectors \
dataset.json \
--model llama-3 \
--find-optimal \
--validate-stats \
--output vectors.json
Options: --model (default: llama-3), --find-optimal (probe layers), --validate-stats (compute Cohen's d and p-value), --output
Extracts real hidden states from transformer models via a GPU service, not random vectors. With --validate-stats, separation quality is checked using Cohen's d and a p-value.
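For intuition about the effect-size metric, here is a small sketch of Cohen's d with a pooled standard deviation. It assumes each example has already been reduced to a single number, for instance its projection onto the persona vector; that reduction, and the function names, are assumptions for illustration rather than a description of what the service computes.

```python
import math
import statistics

def project(h, v):
    """Scalar projection of activation h onto direction v (dot product)."""
    return sum(x * y for x, y in zip(h, v))

def cohens_d(a, b):
    """Cohen's d for two samples of scalars, using the pooled sample std dev."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    if pooled == 0.0:
        raise ValueError("both samples have zero variance")
    return (statistics.mean(a) - statistics.mean(b)) / pooled

# Toy projections for positive and negative examples (invented numbers)
pos_scores = [2.0, 4.0, 6.0]
neg_scores = [1.0, 3.0, 5.0]
print(cohens_d(pos_scores, neg_scores))
```

A larger d means the two groups sit further apart relative to their spread. The p-value would come from a separate significance test, which this sketch does not cover.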
Scan a dataset for harmful traits.
darkfield analyze scan-dataset \
training_data.jsonl \
--trait deception \
--threshold 0.7 \
--batch-size 1000
Supports: .jsonl, .csv, .txt formats
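The source does not say how scan scores are computed. As one plausible reading, the sketch below scores each record by the cosine similarity between its activation and the persona vector, then flags records at or above the threshold. Both the scoring rule and the names (`cosine`, `flag_records`) are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def flag_records(activations, vector, threshold):
    """Return (index, score) for every record whose score meets the threshold."""
    flagged = []
    for i, h in enumerate(activations):
        score = cosine(h, vector)
        if score >= threshold:
            flagged.append((i, score))
    return flagged

# Toy 2-dimensional activations (invented for illustration)
records = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(flag_records(records, [1.0, 0.0], threshold=0.7))
```

Raising the threshold trades recall for precision: fewer records are flagged, but those that remain align more closely with the trait direction.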
Launch the real-time monitoring dashboard.
darkfield monitor live \
--model-id production-gpt \
--traits "manipulation,deception,aggression" \
--threshold 0.8
All CLI commands use the darkfield REST API. You can also integrate directly.
Use the CLI (recommended), or set DARKFIELD_API_URL in your environment to point at your deployment.
curl -H "X-API-Key: df_live_xxx..." https://api.darkfield.ai/v1/status
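The same status check can be prepared from Python with the standard library. This sketch only builds the request; it assumes that DARKFIELD_API_URL holds a base URL without the /v1 path and that the key is read from DARKFIELD_API_KEY. The function name `build_status_request` is hypothetical.

```python
import os
import urllib.request

DEFAULT_API_URL = "https://api.darkfield.ai"  # host taken from the curl example

def build_status_request(env=None):
    """Build (but do not send) a GET request for the status endpoint."""
    env = os.environ if env is None else env
    key = env.get("DARKFIELD_API_KEY")
    if not key:
        raise RuntimeError("DARKFIELD_API_KEY is not set")
    # Assumption: the override is a base URL, so the /v1/status path is appended.
    base = env.get("DARKFIELD_API_URL", DEFAULT_API_URL).rstrip("/")
    return urllib.request.Request(f"{base}/v1/status", headers={"X-API-Key": key})

req = build_status_request({"DARKFIELD_API_KEY": "df_live_xxx"})
print(req.get_method(), req.full_url)
```

To send it, pass the request to `urllib.request.urlopen(req)` and read the response body.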
POST /api/v1/vector-extraction/find-optimal-config
{
"model_name": "gpt2",
"trait": "helpfulness",
"dataset": [ // Positive/negative example pairs
{"positive": "I'd be happy to help", "negative": "Figure it out yourself"},
{"positive": "Let me assist you", "negative": "That's not my problem"}
]
}
// Response with real statistical analysis:
{
"configuration": {
"optimal_layer": 7,
"interpretation": "Late-middle layer: Abstract semantic representations",
"layer_analysis": {
"metrics": {
"discrimination": 1.24, // Example effect size (illustrative)
"p_value": 0.002, // Example significance (illustrative)
"consistency": 0.82, // Example consistency (illustrative)
"signal_strength": 0.91 // Example separation (illustrative)
}
}
}
}
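The `//` annotations in the example above are not valid JSON, so a real request body has to omit them. The sketch below builds the body with `json.dumps` and pulls the headline fields out of a parsed response. Field names follow the example; the helper names are hypothetical, and the response values used here are the illustrative ones from the example, not real results.

```python
import json

def build_find_optimal_payload(model_name, trait, pairs):
    """Serialize (positive, negative) text pairs into the request body shape."""
    dataset = [{"positive": pos, "negative": neg} for pos, neg in pairs]
    return json.dumps({"model_name": model_name, "trait": trait, "dataset": dataset})

def summarize_config(response):
    """Extract the chosen layer and its main metrics from a parsed response."""
    config = response["configuration"]
    metrics = config["layer_analysis"]["metrics"]
    return {
        "optimal_layer": config["optimal_layer"],
        "discrimination": metrics["discrimination"],
        "p_value": metrics["p_value"],
    }

body = build_find_optimal_payload(
    "gpt2",
    "helpfulness",
    [("I'd be happy to help", "Figure it out yourself")],
)
print(body)
```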
POST /api/v1/vector-extraction/extract
{
"text": "Always be helpful and honest",
"model_name": "llama-3",
"trait_types": ["helpfulness", "honesty"],
"layer": 15, // Optional: use specific layer
"use_optimal_config": true // Auto-detect best layer
}
POST /api/v1/vector-extraction/compute-caa
{
"vectors": [...], // Positive/negative activation pairs
"trait": "helpfulness",
"model_name": "llama-3",
"layer": 15,
"apply_pca": true, // Reduce dimensions
"normalize": true, // Variance-based normalization
"n_components": 100 // PCA components
}
// Response includes statistical metrics:
{
"caa_vector": {
"vector": [...],
"metadata": {
"cohens_d": 1.23, // Example (illustrative)
"p_value": 0.001, // Example (illustrative)
"signal_strength": 0.89 // Example (illustrative)
}
}
}
Get started quickly with the demo command that runs through the complete persona vector extraction pipeline.
darkfield analyze demo --trait sycophancy
This command performs real model steering.
✓ Example (internal test): Steering showed measurable changes while maintaining coherence. Values shown are illustrative and may vary by model and prompt.
darkfield analyze demo \
--trait helpfulness \
--prompt "How can I improve my coding skills?"
darkfield analyze demo \
--trait sycophancy \
--model gpt2 # Uses real GPT-2 model
Live steering results (illustrative): Without steering, GPT-2 generates repetitive, unfocused text. With a steering coefficient of 1.5, responses become more task-oriented and directive. In this run, a coefficient of 2.0 was optimal, reaching a trait expression of 0.85 while maintaining coherence.
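To see what the coefficient does geometrically, the sketch below sweeps several coefficients and reports the projection of the steered state onto the persona vector. With a unit-length vector, the projection grows by exactly the coefficient. This projection is only a stand-in: the source does not define how its trait expression score is measured, and the vectors here are invented toy values.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def sweep(h, v, alphas):
    """Projection of the steered state h + alpha * v onto v, per coefficient."""
    results = {}
    for alpha in alphas:
        steered = [hi + alpha * vi for hi, vi in zip(h, v)]
        results[alpha] = dot(steered, v)
    return results

h = [0.2, -0.1, 0.4]   # toy hidden state
v = [0.0, 1.0, 0.0]    # toy unit-length persona vector
for alpha, proj in sweep(h, v, [0.0, 1.5, 2.0]).items():
    print(alpha, round(proj, 3))
```

In practice the useful range of coefficients is bounded by coherence: past some point, larger values push the hidden state far enough from its usual distribution that output quality degrades.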
# 1. Generate sycophancy dataset
darkfield analyze generate-dataset \
--trait sycophancy \
--description "excessive agreement and flattery" \
--output sycophancy_dataset.json
# 2. Extract the vector
darkfield analyze extract-vectors \
sycophancy_dataset.json \
--find-optimal
# 3. Scan your data
darkfield analyze scan-dataset \
training_data.jsonl \
--trait sycophancy \
--threshold 0.6
# Set up monitoring
darkfield monitor alerts \
--model-id prod-assistant \
--email security@company.com \
--traits "manipulation,deception" \
--threshold 0.8
# View live dashboard
darkfield monitor live \
--model-id prod-assistant
# Apply vaccination
darkfield monitor vaccinate \
--model-id chatbot-v2 \
--traits "aggression,toxicity" \
--strength 1.2 \
--test
# Verify effectiveness
darkfield analyze evaluate-steering \
vaccination_config.json \
--test-prompts adversarial_prompts.txt
Need help? Contact support@darkfield.ai