Alignment Workshop for AI Researchers
Weekend bootcamps on AI alignment for researchers in industry and academic labs.
About AWAIR
The Alignment Workshop for AI Researchers (AWAIR) provides a rapid introduction to existing alignment research for researchers in industry and academic AI labs.
At our workshops, participants:
attend talks and Q&As with leading alignment researchers at Anthropic, OpenAI, and ARC (see our guest speakers),
read and discuss papers on topics in scalable oversight, interpretability, robustness, and model evaluations (read about our workshop topics),
develop their own thinking about alignment alongside researchers across labs and institutions.
See our sample workshop schedule.
Workshops take place in Berkeley, CA and are free to attend (including travel and accommodations).
Our summer 2023 workshops, taking place July 22-23 and August 5-6, will each have about 20 attendees.
Am I a good fit for AWAIR?
AWAIR seeks attendees who:
are AI researchers (either in academic or industry labs)
are interested in rapidly learning more about existing work in AI alignment.
We are especially excited about researchers who have recently begun work on alignment and want to strengthen their background in the field.
For our summer 2023 workshops, express interest by June 28! If you're a good fit, you'll hear back from us shortly with an invitation to one of our workshops.
Topics
Our workshops highlight ML research aimed at understanding and steering the behavior of AI systems as they become more capable, as well as limitations of our current techniques.
Scalable oversight
How do we oversee the training of AI systems on tasks for which we can't reliably judge model outputs? What do we do when our models know they are lying but we don't?
We'll review approaches to scalable oversight like debate, self-critique and constitutional AI, and eliciting latent knowledge. We'll also discuss how to validate progress in scalable oversight.
Interpretability
What internal mechanisms underlie neural network cognition?
We'll discuss foundational work decoding the representations and algorithms learned by NNs, approaches to steering NN behavior by intervening on internal activations, and hurdles to scaling interpretability to large models.
Robustness
How can we ensure that our AI systems will behave well out-of-distribution, especially in high-stakes settings?
We'll cover topics like red-teaming language models, adversarial training, and mechanistic anomaly detection.
Evaluating AI systems
How can we judge the safety of an AI system by interacting with it?
We'll discuss model evaluations for dangerous capabilities and alignment, as well as approaches to automating evaluations.
August workshop schedule
Legend:
📖 = reading and small-group discussion
🗣️ = talk + Q&A
💬 = free-form discussion
Saturday
9:00 - 9:30: Arrival + light breakfast
9:30 - 9:50: Opening
9:50 - 10:40: Decomposing AI Safety 🗣️
11:00 - 12:10: Misalignment 📖
12:10 - 1:30: Lunch
1:30 - 2:00: Directions in scalable oversight 🗣️
2:00 - 3:00: Scalable oversight 📖
3:20 - 4:30: Evan Hubinger (Anthropic): How likely is deceptive alignment? 🗣️
4:30 - 5:10: Walk
5:10 - 6:00: Ethan Perez (Anthropic): Red Teaming LMs with LMs 🗣️
6:00 - 6:30: Lightning Talks: Tristan Hume (Anthropic), Fabien Roger (Redwood), Adam Gleave (FAR), Alexis Carlier (RAND) 🗣️
6:30 - later: Dinner, happy hour, and socializing
7:30 - 8:30: (optional) Buck Shlegeris: The Plan for AI alignment 🗣️
Sunday
9:00 - 9:30: Light breakfast
9:30 - 10:00: Survey of model internals work 🗣️
10:00 - 11:20: Model internals and interpretability 📖
11:40 - 12:30: Paul Christiano (ARC): ARC's current theory research 🗣️
12:30 - 1:30: Lunch
1:30 - 2:30: Impact stories for model internals research 💬
1:30 - 2:30: Forecasting AI misbehavior 💬
1:30 - 2:30: AI Takeover 💬
2:40 - 3:30: Jeff Wu (OpenAI): My approach to scalable oversight 🗣️
3:40 - 4:00: Beth Barnes (ARC): ARC's current evals work 🗣️
4:00 - 6:00: Unconference block
6:00 - 6:30: Closing
6:30 - late: Dinner and happy hour
Our team
Sam Marks, director.
Sam holds a PhD in mathematics from Harvard and works with Max Tegmark's lab at MIT on mechanistic interpretability. While at Harvard, Sam served as the Director of Technical Programs for HAIST and co-organized the 2023 MIT Mechanistic Interpretability Conference.
Jenny Nitishinskaya, facilitator.
Jenny is a member of technical staff at Redwood Research. She is excited about problem-first, metric-driven alignment research. She has previously worked on mechanistic interpretability and eliciting latent knowledge.
Kshitij Sachan, facilitator.
Kshitij is a member of technical staff at Redwood Research. He's currently exploring new scalable oversight techniques and has previously done work in mechanistic interpretability.
Anjay Friedman, program coordinator.
Anjay is a special projects lead at Constellation, working on AI Safety field-building and coordination events.
Who attends these workshops? How large are they?
The workshops will have around 20 attendees, drawn mostly from industry labs like Anthropic, OpenAI, and Google DeepMind. There may also be some attendees from academic labs. In addition, 3-4 leading researchers will give talks or Q&As, and 3-4 workshop staff will facilitate discussions.
Are the workshops free? What about travel, food, and accommodations?
Yes, the workshops are free for attendees, and all workshop-related expenses will be covered. This includes meals and snacks, travel to and from Berkeley, and housing during the workshop for attendees from outside the San Francisco Bay Area.
Who runs these workshops? How are they funded?
These workshops are organized by Constellation, a non-profit which supports work aimed at mitigating risks from emerging technologies, such as AI. Constellation is funded by charitable grants and donations.
Who can I contact for more information?
Please contact Sam Marks, smarks@math.harvard.edu, with any questions!