Alignment Workshop for AI Researchers
Weekend bootcamps on AI alignment for researchers in industry and academic labs.
About AWAIR
The Alignment Workshop for AI Researchers (AWAIR) provides a rapid introduction to existing alignment research for researchers in industry and academic AI labs.
At our workshops, participants:
attend talks and Q&As with leading alignment researchers at Anthropic, OpenAI, and ARC (see our guest speakers),
read and discuss papers on topics in scalable oversight, interpretability, robustness, and model evaluations (read about our workshop topics),
develop their own thinking about alignment alongside researchers across labs and institutions.
See our sample workshop schedule.
Workshops take place in Berkeley, CA and are free to attend (including travel and accommodations).
Our summer 2023 workshops, taking place July 22-23 and August 5-6, will each have about 20 attendees.
Am I a good fit for AWAIR?
AWAIR seeks attendees who:
are AI researchers (either in academic or industry labs)
are interested in rapidly learning more about existing work in AI alignment.
We are especially excited about researchers who have recently begun work on alignment and want to strengthen their background in the field.
For our summer 2023 workshops, express interest by June 28! If you're a good fit, you'll hear back from us shortly with an invitation to one of our workshops.
Topics
Our workshops highlight ML research aimed at understanding and steering the behavior of AI systems as they become more capable, as well as limitations of our current techniques.
Scalable oversight
How do we oversee the training of AI systems on tasks for which we can't reliably judge model outputs? What do we do when our models know they are lying but we don't?
We'll review approaches to scalable oversight like debate, self-critique and constitutional AI, and eliciting latent knowledge. We'll also discuss how to validate progress in scalable oversight.
Interpretability
What internal mechanisms underlie neural network cognition?
We'll discuss foundational work decoding the representations and algorithms learned by NNs, approaches to steering NN behavior by intervening on internal activations, and hurdles to scaling interpretability to large models.
Robustness
How can we ensure that our AI systems will behave well out-of-distribution, especially in high-stakes settings?
We'll cover topics like red-teaming language models, adversarial training, and mechanistic anomaly detection.
Evaluating AI systems
How can we judge the safety of an AI system by interacting with it?
We'll discuss model evaluations for dangerous capabilities and alignment, as well as approaches to automating evaluations.
August workshop schedule
Legend:
📖 = reading and small-group discussion
🗣️ = talk + Q&A
💬 = free-form discussion
Saturday
9:00 - 9:30: Arrival + light breakfast
9:30 - 9:50: Opening
9:50 - 10:40: Decomposing AI Safety 🗣️
11:00 - 12:10: Misalignment 📖
12:10 - 1:30: Lunch
1:30 - 2:00: Directions in scalable oversight 🗣️
2:00 - 3:00: Scalable oversight 📖
3:20 - 4:30: Evan Hubinger (Anthropic): How likely is deceptive alignment? 🗣️
4:30 - 5:10: Walk
5:10 - 6:00: Ethan Perez (Anthropic): Red Teaming LMs with LMs 🗣️
6:00 - 6:30: Lightning Talks: Tristan Hume (Anthropic), Fabien Roger (Redwood), Adam Gleave (FAR), Alexis Carlier (RAND) 🗣️
6:30 - later: Dinner, happy hour, and socializing
7:30 - 8:30: (optional) Buck Shlegeris: The Plan for AI alignment 🗣️
Sunday
9:00 - 9:30: Light breakfast
9:30 - 10:00: Survey of model internals work 🗣️
10:00 - 11:20: Model internals and interpretability 📖
11:40 - 12:30: Paul Christiano (ARC): ARC's current theory research 🗣️
12:30 - 1:30: Lunch
1:30 - 2:30: Impact stories for model internals research 💬
1:30 - 2:30: Forecasting AI misbehavior 💬
1:30 - 2:30: AI Takeover 💬
2:40 - 3:30: Jeff Wu (OpenAI): My approach to scalable oversight 🗣️
3:40 - 4:00: Beth Barnes (ARC): ARC's current evals work 🗣️
4:00 - 6:00: Unconference block
6:00 - 6:30: Closing
6:30 - late: Dinner and happy hour
Our team
Sam Marks, director.
Sam holds a PhD in mathematics from Harvard and works with Max Tegmark's lab at MIT on mechanistic interpretability. While at Harvard, Sam served as the Director of Technical Programs for HAIST and co-organized the 2023 MIT Mechanistic Interpretability Conference.
Jenny Nitishinskaya, facilitator.
Jenny is a member of technical staff at Redwood Research. She is excited about problem-first, metric-driven alignment research. She has previously worked on mechanistic interpretability and eliciting latent knowledge.
Kshitij Sachan, facilitator.
Kshitij is a member of technical staff at Redwood Research. He's currently exploring new scalable oversight techniques and has previously done work in mechanistic interpretability.
Anjay Friedman, program coordinator.
Anjay is a special projects lead at Constellation, working on AI Safety field-building and coordination events.
Who attends these workshops? How large are they?
The workshops will have around 20 attendees, drawn mostly from industry labs like Anthropic, OpenAI, and Google DeepMind. There may also be some attendees from academic labs. In addition, 3-4 leading researchers will give talks or Q&As, and 3-4 workshop staff will facilitate discussions.
Are the workshops free? What about travel, food, and accommodations?
Yes, the workshops are free for attendees, and all workshop-related expenses will be covered. This includes meals and snacks, travel to and from Berkeley, and housing during the workshop for attendees from outside the San Francisco Bay Area.
Who runs these workshops? How are they funded?
These workshops are organized by Constellation, a non-profit which supports work aimed at mitigating risks from emerging technologies, such as AI. Constellation is funded by charitable grants and donations.
Who can I contact for more information?
Please contact Sam Marks, smarks@math.harvard.edu, with any questions!