Abstract
In a social learning setting, there is a set of actions, each of which has a payoff that depends on a hidden state of the world. Agents arrive in sequence, each choosing an action with the goal of maximizing her payoff given her estimates of the state of the world. In incentivized exploration, a disclosure policy tries to coordinate the agents' choices by sending messages about the history of past actions. The goal of the policy is to minimize the ex-post regret of the action sequence, defined as the worst-case difference in expected payoff between the optimal sequence and the sequence induced by the policy.
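In symbols (notation chosen here only for illustration and possibly differing from the talk's), if \( \mu_a \) denotes the expected payoff of action \( a \), \( a_t \) the action induced by the policy in round \( t \), and \( T \) the number of rounds, the quantity being controlled is
\[
  \mathrm{Regret}(T) \;=\; T \cdot \max_{a}\mu_{a} \;-\; \sum_{t=1}^{T} \mu_{a_t},
\]
with the policy judged by the worst case of this difference (e.g., over problem instances).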
Motivated by review sites like Yelp, we focus on a restricted class of disclosure policies and behavioral agents. Our disclosure policies use messages, called unbiased sub-histories, consisting of the actions and rewards of a subsequence of past agents, where the subsequence is chosen ahead of time. Our agents behave like frequentists, estimating the true mean reward of each action (arm) by its empirical mean in the disclosed data. A disclosure policy that shows the full history risks inducing herding behavior among the agents, resulting in regret linear in the number of rounds. Our main result is a disclosure policy that obtains regret on the order of the square root of the number of rounds, and thus converges (at an optimal rate) to the optimal payoff. We also exhibit simpler policies with higher, but still sublinear, regret. These policies can be interpreted as dividing a sublinear number of agents into constant-sized focus groups, whose histories are then provided to future agents.
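The sketch below is a minimal illustration of this setup, not the talk's actual policies or guarantees: frequentist agents best-respond to whatever sub-history a disclosure policy reveals, and a simple focus-group-style policy is contrasted with full disclosure. The function names, the Bernoulli reward model, and the particular group sizes are assumptions made purely for this example.

```python
import random

def frequentist_choice(subhistory, n_arms):
    """Pick the action with the highest empirical mean in the disclosed data;
    actions with no observations are tried first."""
    means = []
    for arm in range(n_arms):
        rewards = [r for (a, r) in subhistory if a == arm]
        means.append(sum(rewards) / len(rewards) if rewards else float("inf"))
    return max(range(n_arms), key=lambda a: means[a])

def run(T, true_means, disclose):
    """Simulate T agents; disclose(history, t) is the message shown to agent t.
    Returns the total expected payoff collected."""
    history, payoff = [], 0.0
    for t in range(T):
        arm = frequentist_choice(disclose(history, t), len(true_means))
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        history.append((arm, reward))
        payoff += true_means[arm]
    return payoff

def focus_group_disclosure(group_size=5, n_groups=4):
    """Early agents are split into fixed-size focus groups that each see only
    their own group's history; every later agent sees the pooled focus-group
    sub-history, a subsequence fixed ahead of time."""
    cutoff = group_size * n_groups
    def disclose(history, t):
        if t < cutoff:
            start = (t // group_size) * group_size
            return history[start:t]
        return history[:cutoff]
    return disclose

full_history = lambda history, t: history  # full disclosure: risks herding

random.seed(0)
true_means, T = [0.5, 0.7], 200
for name, policy in [("full history", full_history),
                     ("focus groups", focus_group_disclosure())]:
    print(f"{name}: payoff {run(T, true_means, policy):.1f}"
          f" vs optimal {max(true_means) * T:.1f}")
```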
Based on joint work with Jieming Mao, Alex Slivkins and Steven Wu.