Research

Publications from previous MARS rounds

MARS 4

Characterizing Backtracking in CoT through Internal Probes and Surface-Level Features

Adiba Ejaz, Aditya Gupta, Arthur Pogosian, Peter Hase

Published at ICLR 2026

Attack Selection In Agentic AI Control Evals Can Decrease Safety

Catherine Wang, Tyler Crosse, Benjamin Hadad, Ram Potham, Tyler Tracy

Published on LessWrong

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

ML Nissen Gonzalez, Melwina Albuquerque, Laurence Wroe, Jacob Meyer Cohen, Logan Riggs Smith, Thomas Dooms

Jailbreaking Vision-Language Models Through the Visual Modality

Aharon Azulay, Jan Dubiński, Zhuoyun Li, Atharv Mittal, Yossi Gandelsman

Accepted to ICML 2026

Making Extreme AI Risk Tradeable: A New Financial Instrument for Catastrophic AI Risk

Daniel Reti, Gabriel Weil

Published in AI Frontiers

Artificial Intelligence in the States

Kevin Frazier, Antoine Langrée

Published in Law & Liberty

AI Governance Mapping Project

James Teague, Angelica Chowdhury, Bosco Hung, Simon Mylius, Erika Lee

Past MARS

Combining Cost-Constrained Runtime Monitors for AI Safety

Tim Hua, James Baskerville, Henri Lemoine, Mia Hopman, Aryan Bhatt, Tyler Tracy

Accepted to NeurIPS 2025

Large Language Models Can Learn and Generalize Steganographic Chain-of-Thought Under Process Supervision

Joey Skaf, Luis Ibanez-Lissen, Robert McCarthy, Connor Watts, Vasil Georgiev, Hannes Whittingham, Lorena Gonzalez-Manzano, David Lindner, Cameron Tice, Edward James Young, Puria Radmard

Accepted to NeurIPS 2025

A transformer architecture alteration to incentivise externalised reasoning

Elizabeth Pavlova, Mariia Koroliuk, Karthik Viswanathan, Cameron Tice, Edward James Young, Puria Radmard

MARS 3 Interactive Demo

AI Has Opinions, and They’re Not the Same as Yours.

Sergei Smirnov, Jesse Gilbert

Depth-Wise Activation Steering for Honest Language Models

Gracjan Góral, Marysia Winkels, Steven Basart

Evaluating Language Model Character Traits

Francis Rhys Ward, Zejia Yang, Alex Jackson, Randy Brown, Chandler Smith, Grace Colverd, Louis Thomson, Raymond Douglas, Patrik Bartak, Andrew Rowan

Compact Proofs of Model Performance via Mechanistic Interpretability

Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan