We argue that a well-trained reinforcement learning (RL) agent planning over a long time horizon would likely commandeer all human infrastructure. First, we argue that for such an agent, the expected value of its reward signal would likely be near-maximal if it sought to commandeer all human infrastructure, whereas it would be bounded above if it did not. Next, we argue that a well-trained RL agent would likely aim to maximize the expectation of its reward signal: if it did not start with that motivation, its motivations would eventually be refined to it through value-of-information-motivated exploration. Finally, we discuss how and why variants of RL agents might behave differently, and we outline potential safety cases for each variant: KL-constrained RL, myopic and contained RL, and pessimistic RL.
Artificially intelligent agents are increasingly being integrated into human decision-making. Soon, large language model (LLM) agents will be interacting with humans and with each other under a mixture of goals and incentives. This context motivates a game-theoretic perspective. Rather than simply evaluating these agents on the reward achieved in a static environment, we need to consider their behaviour in the context of the ecosystem of agents with which they interact. In this talk I will discuss my group's progress on studying RL training of agent policies in general-sum games, which are neither purely cooperative nor purely competitive. In particular, I'll discuss our novel approach known as Advantage Alignment, a family of algorithms derived from first principles that efficiently and intuitively guides policy learning towards more cooperative and effective policies. I'll conclude by discussing our progress in applying these methods to LLMs and agent interactions.
Hang Huang is currently at Texas A&M University and their research interests are commutative algebra, representation theory and complexity theory.
Abstract not available.
Steering methods manipulate the representations of large language models (LLMs) to induce responses that have desired properties, e.g., truthfulness, offering a promising approach for LLM alignment without the need for fine-tuning. However, these methods typically require supervision, such as contrastive pairs of prompts that vary by a single target concept, which is costly to obtain and limits the speed of steering research. An appealing alternative is to use unsupervised approaches such as sparse autoencoders (SAEs) to map LLM embeddings to sparse representations that capture human-interpretable concepts. However, without further assumptions, SAEs may not be identifiable: they could learn latent dimensions that entangle multiple concepts, leading to unintentional steering of unrelated properties. In this talk, I'll introduce sparse shift autoencoders (SSAEs). These models map the differences between embeddings to sparse representations that capture concept shifts. Crucially, we show that SSAEs are identifiable from paired observations that vary by multiple unknown concepts, leading to accurate steering of single concepts without the need for supervision. We empirically demonstrate accurate steering across semi-synthetic and real-world language datasets using Llama-3.1 embeddings.
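To make the idea of encoding embedding differences concrete, here is a minimal sketch in Python with NumPy. It is not the speaker's implementation: the linear encoder and decoder, the L1 sparsity penalty, and the names `ssae_forward`, `ssae_loss`, `steer`, `W_enc`, `W_dec` and `l1_weight` are all assumptions made for illustration, and the random arrays merely stand in for LLM embeddings.

```python
import numpy as np

def ssae_forward(x_a, x_b, W_enc, W_dec):
    """Encode the shift between paired embeddings and reconstruct it.

    Assumed shapes: x_a, x_b are (n, d); W_enc is (d, k); W_dec is (k, d).
    """
    delta = x_b - x_a        # shift between the two embeddings of each pair
    z = delta @ W_enc        # latent code for the concept shift, shape (n, k)
    delta_hat = z @ W_dec    # reconstructed shift, shape (n, d)
    return delta, z, delta_hat

def ssae_loss(delta, z, delta_hat, l1_weight=0.1):
    """Reconstruction error plus an L1 penalty (an assumed sparsity term)."""
    recon = np.mean(np.sum((delta - delta_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1_weight * sparsity

def steer(x, W_dec, concept, strength=1.0):
    """Shift an embedding along the decoder direction of one latent concept."""
    return x + strength * W_dec[concept]

# Toy data standing in for paired LLM embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
n, d, k = 4, 8, 3
x_a = rng.normal(size=(n, d))
x_b = rng.normal(size=(n, d))
W_enc = rng.normal(size=(d, k))
W_dec = rng.normal(size=(k, d))

delta, z, delta_hat = ssae_forward(x_a, x_b, W_enc, W_dec)
loss = ssae_loss(delta, z, delta_hat)
steered = steer(x_a[0], W_dec, concept=1, strength=2.0)
```

In this sketch, training would minimise `ssae_loss` over `W_enc` and `W_dec`. Steering then adds a single decoder row to an embedding, which only affects one concept if the learned latents are disentangled, the property the identifiability result in the abstract is concerned with.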