
Abstract
Post-training is essential for enhancing large language model (LLM) capabilities and aligning them with human preferences. One of the most widely used post-training techniques is reinforcement learning from human feedback (RLHF). In this talk, I will first discuss the challenges of applying RL to LLM training. Next, I will introduce RL algorithms that tackle these challenges by exploiting key properties of the underlying problem. Finally, I will present an approach that simplifies RL policy optimization for LLMs to relative reward regression.