Abstract
We consider the problem of designing `Learning to Control' algorithms for stochastic systems whose model parameters are unknown. Two models are considered: Markov decision processes and linear stochastic systems. A Thompson-sampling-based regret-minimization learning framework is developed for trading off exploration versus exploitation. Sampling from a posterior distribution over the unknown parameters at regular intervals provides the necessary exploration for learning. This obviates the need for expensive exploration computations and makes the algorithms suitable for real-time decision-making and learning. The key design element is a suitable exploration schedule. The proposed algorithms achieve $O(\sqrt{T})$ expected regret, which is order-optimal. The framework differs from classical adaptive control algorithms in its focus on non-asymptotic regret optimality rather than asymptotic stability. Numerical evaluations suggest that the algorithms perform robustly.
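To make the posterior-sampling idea concrete, the following is a minimal sketch for a scalar linear stochastic system $x_{t+1} = a x_t + b u_t + w_t$ with unknown $(a, b)$: a model is drawn from a Gaussian posterior at regular intervals, the corresponding LQR gain for the sampled model is applied, and the posterior is updated after every step. The cost weights, noise variance, resampling interval, and the projection to a controllable region are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

# Sketch (not the paper's exact algorithm) of Thompson-sampling control for
#     x_{t+1} = a*x_t + b*u_t + w_t,   w_t ~ N(0, sigma^2),
# with unknown theta = (a, b) and known noise variance.

rng = np.random.default_rng(0)
a_true, b_true, sigma = 0.9, 0.5, 0.1   # unknown to the learner
q, r = 1.0, 1.0                         # stage cost q*x^2 + r*u^2 (assumed)

# Gaussian prior on theta, kept in information form: precision Lam and Lam @ mean.
Lam = np.eye(2)
bvec = np.zeros(2)

def lqr_gain(a, b):
    """Stationary LQR gain for a sampled scalar model, via fixed-point
    iteration on the scalar Riccati equation."""
    p = q
    for _ in range(200):
        p = q + a * a * p * r / (r + b * b * p)
    return a * b * p / (r + b * b * p)  # control law u = -k * x

def sample_model():
    """Posterior draw, crudely projected to a controllable region
    (the paper instead works under a stabilizability-type assumption)."""
    mu = np.linalg.solve(Lam, bvec)
    Sigma = np.linalg.inv(Lam)
    while True:
        a_s, b_s = rng.multivariate_normal(mu, Sigma)
        if abs(b_s) > 0.1 and abs(a_s) < 2.0:
            return a_s, b_s

x, T, tau = 0.0, 2000, 10               # tau = resampling interval (assumed)
for t in range(T):
    if t % tau == 0:
        # All exploration comes from this periodic posterior draw; choosing
        # the interval well is the "exploration schedule" the abstract
        # refers to (a fixed interval is used here only for simplicity).
        k = lqr_gain(*sample_model())
    u = -k * x
    x_next = a_true * x + b_true * u + sigma * rng.standard_normal()
    # Conjugate Bayesian linear-regression update of the posterior.
    z = np.array([x, u])
    Lam += np.outer(z, z) / sigma**2
    bvec += z * x_next / sigma**2
    x = x_next

print("posterior mean of (a, b):", np.linalg.solve(Lam, bvec))
```

Note the design point the sketch illustrates: no optimistic optimization or explicit perturbation is ever computed; the randomness of the periodic posterior draw alone supplies exploration, and the resampling interval `tau` is the single knob standing in for the exploration schedule.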