Safety-Guaranteed LLMs
Program: Special Year on Large Language Models and Transformers, Part 2
Location: Calvin Lab auditorium
Date: Monday, Apr. 14 – Friday, Apr. 18, 2025
Schedule
Monday, Apr. 14, 2025
9–9:15 a.m. | Coffee and Check-In
9:15–9:30 a.m. | Welcome Address
9:30–10:30 a.m. | Simulating Counterfactual Training | Roger Grosse (University of Toronto)
10:30–11 a.m. | Break
11 a.m.–12 p.m. | AI Safety via Inference-Time Compute | Boaz Barak (Harvard University)
12–2 p.m. | Lunch (on your own)
2–3 p.m. | Controlling Untrusted AIs with Monitors | Ethan Perez (Anthropic)
3–3:30 p.m. | Break
3:30–4:30 p.m. | Game-Theoretic Approaches to AI Safety | Georgios Piliouras (Google DeepMind + SUTD)
4:30–5:30 p.m. | Reception
Tuesday, Apr. 15, 2025
9:30–10 a.m. | Coffee and Check-In
10–11 a.m. | Full-Stack Alignment | Ryan Lowe (Meaning Alignment Institute)
11–11:30 a.m. | Break
11:30 a.m.–12:30 p.m. | Can We Get Asymptotic Safety Guarantees Based on Scalable Oversight? | Geoffrey Irving (UK AI Safety Institute)
12:30–2:30 p.m. | Lunch (on your own)
2:30–3:30 p.m. | Amortised Inference Meets LLMs: Algorithms and Implications for Faithful Knowledge Extraction | Nikolay Malkin (University of Edinburgh)
3:30–4 p.m. | Break
4–5 p.m. | Superintelligent Agents Pose Catastrophic Risks — Can Scientist AI Offer a Safer Path? (Richard M. Karp Distinguished Lecture) | Yoshua Bengio (IVADO - Mila - Université de Montréal)
5–6 p.m. | Panel Discussion | Yoshua Bengio (IVADO - Mila - Université de Montréal), Dawn Song (UC Berkeley), Roger Grosse, Geoffrey Irving, Siva Reddy (IVADO - Mila - McGill University)
Wednesday, Apr. 16, 2025
8:30–9 a.m. | Coffee and Check-In
9–10 a.m. | Robustness of Jailbreaking Across Aligned LLMs, Reasoning Models, and Agents | Siva Reddy (IVADO - Mila - McGill University)
10–10:15 a.m. | Break
10:15–11:15 a.m. | Adversarial Robustness of LLMs' Safety Alignment | Gauthier Gidel (IVADO - Mila - Université de Montréal)
11:15–11:30 a.m. | Break
11:30 a.m.–12:30 p.m. | Antidistillation Sampling | Zico Kolter (Carnegie Mellon University)
12:30–2 p.m. | Lunch (on your own)
2–3 p.m. | Causal Representation Learning: A Natural Fit for Mechanistic Interpretability | Dhanya Sridhar (IVADO + Université de Montréal + Mila)
3–3:15 p.m. | Break
3:15–4:15 p.m. | Out of Distribution, Out of Control? Understanding Safety Challenges in AI | Aditi Raghunathan (Carnegie Mellon University)
Thursday, Apr. 17, 2025
9–9:30 a.m. | Coffee and Check-In
9:30–10:30 a.m. | LLM Negotiations and Social Dilemmas | Aaron Courville (IVADO + Université de Montréal + Mila)
10:30–11 a.m. | Break
11 a.m.–12 p.m. | Scalably Understanding AI with AI | Jacob Steinhardt (UC Berkeley)
12–1:45 p.m. | Lunch (on your own)
1:45–2:45 p.m. | Future Directions in AI Safety Research | Dawn Song (UC Berkeley)
2:45–3 p.m. | Break
3–4 p.m. | What Can Theory of Cryptography Tell Us About AI Safety? | Shafi Goldwasser (Simons Institute, UC Berkeley)
4–5 p.m. | Assessing the Risk of Advanced Reinforcement Learning Agents Causing Human Extinction | Michael Cohen (UC Berkeley)
Friday, Apr. 18, 2025
8:30–9 a.m. | Coffee and Check-In
9–10 a.m. | Safeguarded AI Workflows | David Dalrymple (MIT)
10–10:15 a.m. | Break
10:15–11:15 a.m. | AI Safety: LLMs, Facts, Lies, and Agents in the Real World | Christopher Pal (IVADO + Polytechnique Montréal + Université de Montréal + Mila)
11:15–11:30 a.m. | Break
11:30 a.m.–12:30 p.m. | Measurements for Capabilities and Hazards | Dan Hendrycks (Center for AI Safety)
12:30–2 p.m. | Lunch (on your own)
2–3 p.m. | Theoretical and Empirical Aspects of Singular Learning Theory for AI Alignment | Daniel Murfet (Timaeus)
3–3:30 p.m. | Break
3:30–4:30 p.m. | Probabilistic Safety Guarantees Using Model Internals | Jacob Hilton (Alignment Research Center)
4:30–4:45 p.m. | Closing Remarks