
Abstract
Goal-conditioned reinforcement learning (GCRL) is a powerful way to control an AI agent's behavior at runtime. However, popular goal representations, such as target states or natural language, are either limited to Markovian tasks or rely on ambiguous task semantics. We propose using automata to represent temporal goals and to guide GCRL agents. Automata balance the need for formal temporal semantics with ease of interpretation: anyone who can read a flow chart can understand an automaton. On the other hand, automata form a countably infinite concept class with Boolean semantics, and subtle changes to an automaton can yield very different tasks, making it difficult to condition agent behavior on them. To address this, we observe that every path through an automaton corresponds to a sequence of reach-avoid tasks, and we propose a technique for learning provably correct embeddings of "reach-avoid derived" automata, guaranteeing optimal multi-task policy learning. Through empirical evaluation, we demonstrate that the proposed pretraining method enables zero-shot generalization to various task classes and accelerates policy specialization without the myopic suboptimality of hierarchical methods.
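
To make the core observation concrete, the sketch below is a hypothetical illustration, not the paper's implementation: the toy automaton, the propositions "key", "door", and "lava", and the helper path_to_reach_avoid are all invented for this example. It shows how a single path through a small automaton decomposes into a sequence of reach-avoid subtasks, where each step must reach the proposition labeling the desired edge while avoiding propositions that would divert to another state or to failure.

from dataclasses import dataclass

@dataclass(frozen=True)
class ReachAvoid:
    reach: str          # proposition that must eventually hold to take this step
    avoid: frozenset    # propositions that must not hold before "reach" does

# Toy automaton for "pick up the key, then open the door, never touch lava".
# States: 0 (start) -> 1 (has key) -> 2 (accepting); 'lava' is a global failure.
transitions = {
    (0, "key"): 1,
    (1, "door"): 2,
}
failure_props = {"lava"}

def path_to_reach_avoid(path):
    """Turn a state path, e.g. [0, 1, 2], into its sequence of reach-avoid subtasks."""
    subtasks = []
    for src, dst in zip(path, path[1:]):
        # the proposition labelling the desired edge must be reached ...
        reach = next(p for (s, p), t in transitions.items() if s == src and t == dst)
        # ... while avoiding any proposition that would leave the chosen path
        diverting = {p for (s, p), t in transitions.items() if s == src and t != dst}
        subtasks.append(ReachAvoid(reach, frozenset(diverting | failure_props)))
    return subtasks

if __name__ == "__main__":
    for task in path_to_reach_avoid([0, 1, 2]):
        print(f"reach '{task.reach}' while avoiding {set(task.avoid)}")
    # prints: reach 'key' while avoiding {'lava'}
    #         reach 'door' while avoiding {'lava'}

Under these assumptions, each ReachAvoid pair is a goal a GCRL policy could be conditioned on, which is the sense in which a path through the automaton corresponds to a series of reach-avoid tasks.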