Documentation

Core Concepts

Persona Vectors

A persona vector represents a personality trait as a direction in a model's activation space. We extract real activations from transformer hidden states at the optimal layer using PyTorch hooks, then compute the difference between the mean activations of positive and negative trait examples.

# High-level pseudocode (details omitted for IP protection)
# 1) Run text through transformer up to a target layer (L)
# 2) Capture hidden state at last non-pad token position
# 3) Repeat for positive/negative examples
# 4) Compute CAA: mean(pos) - mean(neg); normalize
# 5) Steering at inference: h' = h + α * v (apply at layer L)
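
The sketch below fills in those steps for GPT-2 with Hugging Face transformers and a PyTorch forward hook. It is a minimal illustration of the general CAA recipe, not darkfield's implementation; the model, layer index, and example texts are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER = "gpt2", 7                    # placeholder layer; darkfield probes for the optimal one
tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

captured = {}
def capture(_module, _inputs, output):
    captured["h"] = output[0]               # block output: (batch, seq, hidden_dim)
handle = model.transformer.h[LAYER].register_forward_hook(capture)

@torch.no_grad()
def last_token_state(text):
    batch = tok(text, return_tensors="pt", padding=True)
    model(**batch)
    last = batch["attention_mask"].sum(dim=1) - 1   # last non-pad token position
    return captured["h"][0, last[0]]

pos = ["I'd be happy to help", "Let me assist you"]
neg = ["Figure it out yourself", "That's not my problem"]
v = torch.stack([last_token_state(t) for t in pos]).mean(0) \
    - torch.stack([last_token_state(t) for t in neg]).mean(0)
v = v / v.norm()                            # CAA vector: mean(pos) - mean(neg), normalized
handle.remove()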

Production Ready (private beta): We extract real hidden states from GPT-2, LLaMA, and Mistral models (no simulated vectors). Results are validated internally against measurable behavioral changes.

Contrastive Activation Addition (CAA)

CAA is our enhanced methodology for extracting and applying persona vectors. We use real model introspection to find the layer where a trait emerges most clearly, extract actual hidden states (not random vectors), and apply statistical validation to confirm that positive and negative examples separate meaningfully.

Key improvements:

  • Real activation extraction from transformer models via GPU service
  • Optimal layer discovery through systematic probing
  • Statistical validation (Cohen's d, p-value testing)
  • Optional PCA for dimensionality reduction
  • Variance-based normalization for stable dimensions
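
As a rough sketch of the validation step, assuming positive and negative activations have already been projected onto a candidate vector (the function name and the thresholds in the comment are illustrative, not darkfield's exact criteria):

import numpy as np
from scipy import stats

def validate_separation(pos_proj: np.ndarray, neg_proj: np.ndarray) -> dict:
    """pos_proj / neg_proj: 1-D projections of positive/negative activations onto the vector."""
    pooled_sd = np.sqrt((pos_proj.var(ddof=1) + neg_proj.var(ddof=1)) / 2)
    cohens_d = (pos_proj.mean() - neg_proj.mean()) / pooled_sd
    _, p_value = stats.ttest_ind(pos_proj, neg_proj, equal_var=False)
    return {"cohens_d": float(cohens_d), "p_value": float(p_value)}

# Rule of thumb: |d| > 0.8 with p < 0.05 suggests a large, statistically significant separation.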

Trait Datasets

To extract a persona vector, we need examples that elicit the trait and examples that suppress it. darkfield generates these automatically using prompting techniques that produce balanced datasets for accurate vector extraction.
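
The exact file layout may differ, but conceptually a trait dataset is a balanced set of positive/negative pairs; a hypothetical example:

import json

# Hypothetical dataset layout; the pair structure mirrors the contrastive examples used for extraction.
dataset = {
    "trait": "manipulation",
    "description": "attempting to control others",
    "pairs": [
        {"positive": "You're free to decide what works best for you.",
         "negative": "If you really cared, you'd do exactly what I say."},
        {"positive": "Here are the trade-offs; it's your call.",
         "negative": "Everyone will blame you unless you agree with me."},
    ],
}
with open("dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)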

Installation

Requirements

  • Python 3.11 or higher
  • pip package manager
  • 4GB RAM minimum (16GB recommended for large models)

Install via pip

pip install darkfield

Development installation

git clone https://github.com/darkfield-ai/darkfield
cd darkfield
pip install -e .

Authentication

darkfield uses API keys for authentication. We're currently in private beta.

Join the waitlist

Visit darkfield.ai/auth to join our waitlist for early access.

Login with CLI

darkfield auth login

Follow the prompts to enter your email and API key (once you receive access).

Environment variables

export DARKFIELD_API_KEY=df_live_xxx...

Commands

analyze generate-dataset

Generate a dataset for trait extraction.

darkfield analyze generate-dataset \
  --trait manipulation \
  --description "attempting to control others" \
  --n-examples 200 \
  --output dataset.json

Options: --trait (required), --description, --n-examples (default: 100), --output

analyze extract-vectors

Extract persona vectors from a dataset using real model activations.

darkfield analyze extract-vectors \
  dataset.json \
  --model llama-3 \
  --find-optimal \
  --validate-stats \
  --output vectors.json

Options: --model (default: llama-3), --find-optimal (probe layers), --validate-stats (compute Cohen's d and p-value), --output

This command extracts real hidden states from transformer models via the GPU service (not random vectors) and validates separation quality with statistical metrics.

analyze scan-dataset

Scan a dataset for harmful traits.

darkfield analyze scan-dataset \
  training_data.jsonl \
  --trait deception \
  --threshold 0.7 \
  --batch-size 1000

Supports: .jsonl, .csv, .txt formats

monitor live

Real-time monitoring dashboard.

darkfield monitor live \
  --model-id production-gpt \
  --traits "manipulation,deception,aggression" \
  --threshold 0.8

API Reference

All CLI commands use the darkfield REST API. You can also integrate directly.

Using the API

Use the CLI (recommended), or set DARKFIELD_API_URL in your environment to point at your deployment.

Authentication

curl -H "X-API-Key: df_live_xxx..." https://api.darkfield.ai/v1/status

Find optimal layer (NEW)

POST /api/v1/vector-extraction/find-optimal-config
{
  "model_name": "gpt2",
  "trait": "helpfulness",
  "dataset": [  // Positive/negative example pairs
    {"positive": "I'd be happy to help", "negative": "Figure it out yourself"},
    {"positive": "Let me assist you", "negative": "That's not my problem"}
  ]
}

// Response with real statistical analysis:
{
  "configuration": {
    "optimal_layer": 7,
    "interpretation": "Late-middle layer: Abstract semantic representations",
    "layer_analysis": {
      "metrics": {
        "discrimination": 1.24,   // Example effect size (illustrative)
        "p_value": 0.002,         // Example significance (illustrative)
        "consistency": 0.82,      // Example consistency (illustrative)
        "signal_strength": 0.91   // Example separation (illustrative)
      }
    }
  }
}
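
The same call from Python, assuming the api.darkfield.ai base URL shown in the authentication example:

import os
import requests

# POST the documented request body with the X-API-Key header.
resp = requests.post(
    "https://api.darkfield.ai/api/v1/vector-extraction/find-optimal-config",
    headers={"X-API-Key": os.environ["DARKFIELD_API_KEY"]},
    json={
        "model_name": "gpt2",
        "trait": "helpfulness",
        "dataset": [
            {"positive": "I'd be happy to help", "negative": "Figure it out yourself"},
            {"positive": "Let me assist you", "negative": "That's not my problem"},
        ],
    },
    timeout=120,
)
resp.raise_for_status()
config = resp.json()["configuration"]
print(config["optimal_layer"], config["layer_analysis"]["metrics"])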

Extract vectors

POST /api/v1/vector-extraction/extract
{
  "text": "Always be helpful and honest",
  "model_name": "llama-3",
  "trait_types": ["helpfulness", "honesty"],
  "layer": 15,  // Optional: use specific layer
  "use_optimal_config": true  // Auto-detect best layer
}

Compute CAA vector

POST /api/v1/vector-extraction/compute-caa
{
  "vectors": [...],  // Positive/negative activation pairs
  "trait": "helpfulness",
  "model_name": "llama-3",
  "layer": 15,
  "apply_pca": true,  // Reduce dimensions
  "normalize": true,   // Variance-based normalization
  "n_components": 100  // PCA components
}

// Response includes statistical metrics:
{
  "caa_vector": {
    "vector": [...],
    "metadata": {
      "cohens_d": 1.23,      // Example (illustrative)
      "p_value": 0.001,      // Example (illustrative)
      "signal_strength": 0.89 // Example (illustrative)
    }
  }
}
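
Conceptually, the computation behind this endpoint is a contrastive mean difference over per-layer activations, with optional PCA and normalization. A local numpy/scikit-learn sketch (illustrative, not the service's implementation; the exact variance-based normalization is an assumption):

import numpy as np
from sklearn.decomposition import PCA

def compute_caa(pos, neg, apply_pca=True, n_components=100, normalize=True):
    """pos, neg: (n_examples, hidden_dim) activations from the same layer."""
    if apply_pca:
        n = min(n_components, len(pos) + len(neg), pos.shape[1])
        reduced = PCA(n_components=n).fit_transform(np.vstack([pos, neg]))  # vector then lives in PCA space
        pos, neg = reduced[: len(pos)], reduced[len(pos):]
    v = pos.mean(axis=0) - neg.mean(axis=0)                  # contrastive mean difference
    if normalize:
        v = v / (np.sqrt(pos.var(axis=0) + neg.var(axis=0)) + 1e-8)  # damp high-variance dimensions
        v = v / np.linalg.norm(v)
    return v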

Quick Demo

Get started quickly with the demo command, which runs the complete persona vector extraction pipeline.

Basic demo

darkfield analyze demo --trait sycophancy

This command runs the full pipeline with real model steering:

  • Generates a mini dataset for the trait
  • Finds the optimal layer (typically layers 10-20 for personality traits)
  • Extracts a CAA vector from actual transformer hidden states
  • Applies steering via PyTorch hooks during inference
  • Measures real behavioral changes (not simulated)
  • Optimizes the steering coefficient for the best trait/coherence balance

✓ Example (internal test): Steering showed measurable changes while maintaining coherence. Values shown are illustrative and may vary by model and prompt.

Demo with custom prompt

darkfield analyze demo \
  --trait helpfulness \
  --prompt "How can I improve my coding skills?"

Demo with real model steering

darkfield analyze demo \
  --trait sycophancy \
  --model gpt2  # Uses real GPT-2 model

Live steering results: Without steering, GPT-2 generates repetitive, unfocused text. With a steering coefficient of 1.5, responses become more task-oriented and directive. The optimal coefficient of 2.0 achieves 0.85 trait expression while maintaining coherence.
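
Mechanically, steering of this kind adds the scaled vector to the hidden state at the chosen layer during generation. A minimal GPT-2 sketch with a PyTorch forward hook (the coefficient, layer, and vector file are placeholders, not darkfield's configuration):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER, ALPHA = "gpt2", 7, 1.5        # placeholder steering coefficient and layer
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
v = torch.load("trait_vector.pt")           # hypothetical file holding an extracted CAA vector

def steer(_module, _inputs, output):
    return (output[0] + ALPHA * v,) + output[1:]   # h' = h + alpha * v

handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("How can I improve my coding skills?", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=60, do_sample=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))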

Examples

Detect sycophancy in training data

# 1. Generate sycophancy dataset
darkfield analyze generate-dataset \
  --trait sycophancy \
  --description "excessive agreement and flattery"

# 2. Extract the vector
darkfield analyze extract-vectors \
  sycophancy_dataset.json \
  --find-optimal

# 3. Scan your data
darkfield analyze scan-dataset \
  training_data.jsonl \
  --trait sycophancy \
  --threshold 0.6

Monitor production model

# Set up monitoring
darkfield monitor alerts \
  --model-id prod-assistant \
  --email security@company.com \
  --traits "manipulation,deception" \
  --threshold 0.8

# View live dashboard
darkfield monitor live \
  --model-id prod-assistant

Vaccinate against harmful traits

# Apply vaccination
darkfield monitor vaccinate \
  --model-id chatbot-v2 \
  --traits "aggression,toxicity" \
  --strength 1.2 \
  --test

# Verify effectiveness
darkfield analyze evaluate-steering \
  vaccination_config.json \
  --test-prompts adversarial_prompts.txt

Need help? Contact support@darkfield.ai
