Two Research Threads

We investigate intelligence itself.

To enable trustworthy AI deployment, we must first understand the agents we build. Our research advances on two parallel fronts: one for AI safety, one for personalization.

6+ Published papers
2 Research threads
Ongoing work (2025)
// Research Thread I

Understanding Intelligence for AI Safety

Trustworthy deployment begins with understanding the agents themselves.

AI agents are increasingly capable — but capability without understanding is risk. Our first research thread investigates the internal workings of AI systems: how representations form, how features can be decoded and controlled, how failure modes can be identified automatically, and how human concepts map onto machine cognition. This line of work underpins safe and trustworthy AI deployment.
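
To make "decoding and controlling features" concrete, here is a minimal sketch of one tool this thread leans on: a sparse autoencoder that decomposes a model's hidden activations into an overcomplete dictionary of sparse, individually inspectable features. The dimensions, L1 coefficient, and toy batch below are illustrative assumptions, not a description of our pipelines.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Decompose hidden activations into an overcomplete sparse dictionary."""

    def __init__(self, d_model: int = 768, n_features: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # Non-negative feature codes; the L1 penalty below pushes most to zero.
        codes = F.relu(self.encoder(activations))
        reconstruction = self.decoder(codes)
        return codes, reconstruction

def sae_loss(acts, codes, recon, l1_coeff: float = 1e-3):
    # Reconstruction fidelity plus an L1 sparsity penalty on the codes.
    return F.mse_loss(recon, acts) + l1_coeff * codes.abs().mean()

# Toy batch standing in for residual-stream activations from some model.
acts = torch.randn(32, 768)
sae = SparseAutoencoder()
codes, recon = sae(acts)
sae_loss(acts, codes, recon).backward()
```

Which human concepts individual dictionary features track, and whether they transfer across model families, are exactly the open questions listed next.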

Open Research Questions
  • What is the difference between VLM (Vision-Language Model) representations and the brain's representations?
  • How does a post-trained model's feature space differ from that of its pretrained base?
  • How can we automatically identify and interpret failure modes in large models without human annotation?
  • How do sparse dictionary features in LLMs relate to human-interpretable concepts?
Methods & Disciplines
Mechanistic Interpretability · Sparse Autoencoders · Representation Learning · AI Safety · Prompt Engineering · Optimal Control · Gradient Analysis · Neuroscience
Active Investigations
  • Mapping VLM representation spaces to fMRI-measured brain activity patterns (see the sketch after this list)
  • Feature space geometry shifts induced by RLHF and instruction tuning
  • Automated failure mode discovery in physics-reasoning models
  • Sparse feature universality across model families and scales
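
The brain-mapping investigation above is commonly approached with linear encoding models. The sketch below, under assumed array shapes and random placeholder data, fits one ridge regression per voxel from VLM embeddings to fMRI responses and scores voxels by held-out correlation.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Placeholder arrays: one row per stimulus shown to both the VLM and the
# participant (these shapes and random values are assumptions, not our data).
vlm_features = np.random.randn(1000, 512)   # VLM embedding per stimulus
brain_voxels = np.random.randn(1000, 2000)  # fMRI response per stimulus

X_train, X_test, y_train, y_test = train_test_split(
    vlm_features, brain_voxels, test_size=0.2, random_state=0
)

# One ridge regression per voxel, fit jointly, with a cross-validated
# regularization strength: the standard linear encoding-model recipe.
encoder = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_train, y_train)

# Score each voxel by correlating predicted and measured held-out responses;
# high-correlation voxels are where VLM and brain representations align.
pred = encoder.predict(X_test)
voxel_corr = np.array([
    np.corrcoef(pred[:, v], y_test[:, v])[0, 1] for v in range(pred.shape[1])
])
print(f"median held-out voxel correlation: {np.median(voxel_corr):.3f}")
```
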
Publications
// Research Thread II

LLM Agent Personalization

Building agents that truly know the person they serve.

Large language models are general. People are not. Our second thread develops the theory and engineering foundations for making LLM-based agents genuinely personal — building persistent user models, context-aware adaptation mechanisms, and trust-calibrated behavior that respects individual differences across a lifetime of use.
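
As a deliberately simplified illustration of what a persistent user model with context-aware adaptation can look like, the sketch below keeps a small structured profile and injects only the query-relevant slice into the prompt. The profile fields and keyword-overlap relevance heuristic are assumptions chosen for brevity, not our actual design.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """A deliberately small, persistent user model (illustrative fields only)."""
    name: str
    preferences: dict = field(default_factory=dict)  # e.g. {"tone": "concise"}
    facts: list = field(default_factory=list)        # durable facts the user has shared

    def remember(self, fact: str) -> None:
        self.facts.append(fact)

def build_prompt(profile: UserProfile, query: str, max_facts: int = 3) -> str:
    """Inject only the query-relevant slice of the user model into the prompt."""
    # Context-aware adaptation via a crude keyword-overlap heuristic (an
    # assumption for brevity; retrieval or learned relevance would replace it).
    query_terms = set(query.lower().split())
    relevant = [f for f in profile.facts if query_terms & set(f.lower().split())]
    prefs = ", ".join(f"{k}: {v}" for k, v in profile.preferences.items())
    parts = [
        f"User preferences: {prefs or 'none recorded'}",
        "Relevant user context:",
        "\n".join(f"- {fact}" for fact in relevant[:max_facts]) or "- none",
        "",
        f"User query: {query}",
    ]
    return "\n".join(parts)

# Toy usage
profile = UserProfile(name="Ada", preferences={"tone": "concise"})
profile.remember("Ada is training for a marathon in May")
print(build_prompt(profile, "Plan my marathon training week"))
```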

Research Questions
  • What is the minimal sufficient representation of a user that makes an agent feel personal?
  • How can user context be injected into LLMs without sacrificing generalization ability?
  • What implicit behavioral signals can drive preference learning without explicit feedback? (see the sketch after this list)
  • How do personalization preferences change over time — and how should agents adapt?
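
One way to make the implicit-feedback question concrete is to treat behavioral signals, such as keeping one response versus regenerating it, as noisy pairwise preferences and fit a Bradley-Terry style model over response features. The synthetic signals and hand-picked features below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for implicit signals: for each interaction we observe a
# pair of responses and whether the user kept the first or regenerated it.
# Each row is the feature difference between the two responses (assumed
# features, e.g. length, tone match, tool use).
true_w = np.array([1.5, -0.8, 0.3])           # hidden "true" preference weights
feature_diffs = rng.normal(size=(500, 3))
keep_prob = 1.0 / (1.0 + np.exp(-feature_diffs @ true_w))
kept_first = (rng.random(500) < keep_prob).astype(float)

# Bradley-Terry style model: P(kept > regenerated) = sigmoid(w . delta_features).
# Fit w by gradient ascent on the log-likelihood of the observed behavior.
w = np.zeros(3)
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-feature_diffs @ w))
    w += lr * feature_diffs.T @ (kept_first - p) / len(kept_first)

print("recovered preference weights:", np.round(w, 2))  # should approach true_w
```
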
Open Problems — Join Us
  • Lifelong user modeling at low computational and memory cost
  • Privacy-preserving personalization without centralized user data
  • Cross-agent persona transfer without information leakage
  • Evaluating personalization quality in the absence of ground truth
  • Balancing user preference alignment with ethical guardrails
Methods & Disciplines
Machine Learning · NLP / LLMs · RLHF · User Modeling · Differential Privacy · RAG · Longitudinal Studies · Causal Inference
Upcoming Publications
  • Persistent Persona: A Framework for Long-Horizon User Context in LLM-Based Agents · Proteus Research Team, 2025 · In preparation, target: ACL 2026
  • Beyond Prompting: Behavioral Feedback as a Signal for Implicit Preference Learning · Proteus Research Team, 2025 · Planned, target: EMNLP 2025
Open Collaboration

We don't have all the answers — and we know it.

These problems require minds from cognitive science, ML, HCI, neuroscience, and beyond. We are actively seeking research collaborators, visiting scholars, and industry partners who want to work on the hard questions of human-AI symbiosis and AI safety.