Berkeley University of California
Results 2231 - 2240 of 23900

Nathanaël Fijalkow
(CNRS)
Workshop Talk
|
Apr. 17, 2025

Assessing The Risk Of Advanced Reinforcement Learning Agents Causing Human Extinction

We argue that a well-trained reinforcement learning (RL) agent planning over a long time horizon would likely commandeer all human infrastructure. First, we argue that for such a long-horizon RL agent, the expected value of its reward signal would likely be near-maximal if it sought to commandeer all human infrastructure, whereas that expected value would be bounded above if it did not. Next, we argue that a well-trained RL agent would likely aim to maximize the expectation of its reward signal: even if it did not start with that motivation, its motivations would eventually be refined to it after it pursued value-of-information-motivated exploration. Finally, we discuss how and why variants of RL agents might behave differently, and we outline potential safety cases for each variant: KL-constrained RL, myopic and contained RL, and pessimistic RL.

Workshop Talk
|
Apr. 17, 2025

What Can Theory Of Cryptography Tell Us About AI Safety

Abstract not available.

Workshop Talk
|
Apr. 17, 2025

Future Directions In AI Safety Research

Abstract not available.

Workshop Talk
|
Apr. 17, 2025

Scalably Understanding AI With AI

Abstract not available.

Workshop Talk
|
Apr. 17, 2025

LLM Negotiations And Social Dilemmas

Artificially intelligent agents are increasingly being integrated into human decision-making. Soon, large language model (LLM) agents will be interacting with humans and with one another under a mixture of goals and incentives. This context motivates a game-theoretic perspective: rather than simply evaluating these agents on the reward achieved in a static environment, we need to consider their behaviour within the ecosystem of agents with which they interact. In this talk I will discuss my group's progress on studying RL training of agent policies in general-sum games, which are neither purely cooperative nor purely competitive. In particular, I'll discuss our novel approach known as Advantage Alignment, a family of algorithms derived from first principles that efficiently and intuitively guides policy learning towards more cooperative and effective policies. I'll conclude by discussing our progress in applying these methods to LLMs and agent interactions.
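To make the notion of a general-sum game concrete, here is a minimal sketch in Python. It is not the Advantage Alignment algorithm from the talk; it only checks, for a standard Prisoner's Dilemma with illustrative payoff values chosen here, that the game is neither zero-sum nor fully cooperative. The payoff numbers and all function names are assumptions made for this example.

```python
# Hypothetical payoffs for a two-player Prisoner's Dilemma (illustrative values only).
# Actions: 0 = cooperate, 1 = defect.
# PAYOFFS[(a1, a2)] = (reward to player 1, reward to player 2).
PAYOFFS = {
    (0, 0): (3, 3),
    (0, 1): (0, 5),
    (1, 0): (5, 0),
    (1, 1): (1, 1),
}


def is_zero_sum(payoffs):
    """Purely competitive: the two rewards always sum to the same constant."""
    totals = {r1 + r2 for r1, r2 in payoffs.values()}
    return len(totals) == 1


def is_fully_cooperative(payoffs):
    """Purely cooperative: both players always receive identical rewards."""
    return all(r1 == r2 for r1, r2 in payoffs.values())


def best_response(payoffs, opponent_action):
    """Player 1's reward-maximizing action against a fixed opponent action."""
    return max((0, 1), key=lambda a: payoffs[(a, opponent_action)][0])


if __name__ == "__main__":
    print("zero-sum:", is_zero_sum(PAYOFFS))
    print("fully cooperative:", is_fully_cooperative(PAYOFFS))
    print("best response to cooperate:", best_response(PAYOFFS, 0))
    print("best response to defect:", best_response(PAYOFFS, 1))
```

With these payoffs, defecting is the best response to either opponent action, yet mutual cooperation pays each player more than mutual defection. This is the kind of social dilemma in which agents that only maximize their own reward can settle on poor joint outcomes.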

Video
|
Apr. 17, 2025
Superintelligent Agents Pose Catastrophic Risks — ... | Richard M. Karp Distinguished Lecture
People

Hang Huang

Hang Huang is currently at Texas A&M University; their research interests are commutative algebra, representation theory, and complexity theory.

Workshop Talk
|
Apr. 16, 2025

Out Of Distribution, Out Of Control? Understanding Safety Challenges In AI

Abstract not available.

Workshop Talk
|
Apr. 16, 2025

Causal Representation Learning: A Natural Fit for Mechanistic Interpretability

Steering methods manipulate the representations of large language models (LLMs) to induce responses that have desired properties, e.g., truthfulness, offering a promising approach to LLM alignment without the need for fine-tuning. However, these methods typically require supervision, e.g., from contrastive pairs of prompts that vary by a single target concept, which is costly to obtain and limits the speed of steering research. An appealing alternative is to use unsupervised approaches such as sparse autoencoders (SAEs) to map LLM embeddings to sparse representations that capture human-interpretable concepts. However, without further assumptions, SAEs may not be identifiable: they could learn latent dimensions that entangle multiple concepts, leading to unintentional steering of unrelated properties. In this talk, I'll introduce sparse shift autoencoders (SSAEs). These models map the differences between embeddings to sparse representations that capture concept shifts. Crucially, we show that SSAEs are identifiable from paired observations that vary by multiple unknown concepts, leading to accurate steering of single concepts without the need for supervision. We empirically demonstrate accurate steering across semi-synthetic and real-world language datasets using Llama-3.1 embeddings.
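The idea of encoding embedding differences sparsely and then steering along a single concept direction can be sketched as follows. This is a toy illustration, not the SSAE implementation from the talk: the concept directions are fixed by hand rather than learned, soft-thresholding stands in for a trained sparse encoder, and every name and number here (`DICTIONARY`, `encode_shift`, `steer`, the 3-dimensional embeddings) is a hypothetical choice for this example.

```python
# Hypothetical concept directions; in an SSAE these would be learned
# as decoder weights. Here they are fixed, orthonormal, and 3-dimensional.
DICTIONARY = [
    [1.0, 0.0, 0.0],  # concept 0
    [0.0, 1.0, 0.0],  # concept 1
]


def soft_threshold(x, lam):
    """Shrink x towards zero by lam; values within [-lam, lam] become exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def encode_shift(z_before, z_after, lam=0.1):
    """Map the difference between two embeddings to a sparse concept-shift code."""
    delta = [b - a for a, b in zip(z_before, z_after)]
    projections = [
        sum(d * x for d, x in zip(direction, delta)) for direction in DICTIONARY
    ]
    return [soft_threshold(p, lam) for p in projections]


def steer(z, concept, strength):
    """Shift an embedding along one concept direction, leaving the others untouched."""
    return [x + strength * d for x, d in zip(z, DICTIONARY[concept])]


if __name__ == "__main__":
    z_before = [0.2, 0.5, 0.1]
    z_after = [1.2, 0.55, 0.1]  # mostly a change along concept 0
    print("sparse shift code:", encode_shift(z_before, z_after))
    print("steered embedding:", steer([0.0, 0.0, 0.0], concept=1, strength=2.0))
```

In this toy setup the small change along concept 1 is thresholded to zero, so the pair is explained by a shift in concept 0 alone, and steering along one direction does not move the other coordinates. The identifiability result described in the abstract concerns when learned directions have this disentangled property, which this sketch simply assumes.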

The Simons Institute for the Theory of Computing is the world's leading venue for collaborative research in theoretical computer science.

© 2013–2026 Simons Institute for the Theory of Computing. All Rights Reserved.