Title: Beyond UCB: statistical complexity and optimal algorithm for non-linear bandits
Abstract: Many existing literature on bandits and reinforcement learning assume a linear reward/value function, but what happens if the reward is non-linear? Two curious phenomena arise for non-linear bandits: first, in addition to the "learning phase" with a standard \Theta(\sqrt(T)) regret, there is an "initialization phase" with a fixed cost determined by the reward function; second, achieving the smallest cost of the initialization phase requires new learning algorithms other than traditional ones such as UCB. For a special family of non-linear bandits, we derive upper and lower bounds on the optimal fixed cost, and in addition, on the entire learning trajectory in the initialization phase via differential equations. In particular, we show that a two-stage algorithm which first finds a good initialization and then treats the problem as a locally linear bandit is statistically optimal.
This is based on a recent joint work with Jiantao Jiao, Nived Rajaraman, and Kannan Ramchandran.