In this paper, we study online statistical inference for distributed contextual multi-armed bandit problems, in which agents collaboratively learn an optimal policy by exchanging local estimates of the global parameters with their neighbors over a communication network. We propose a distributed online decision-making algorithm that balances the exploration-exploitation trade-off via the $\varepsilon$-greedy policy and updates the policy online by incorporating distributed stochastic gradient descent. We establish the pivotal limiting distribution of the reward model parameter estimator as a stochastic process, and then employ the random scaling method to construct its asymptotic confidence interval. We also establish the asymptotic normality of the online inverse probability weighted value estimator and construct an asymptotic confidence interval for the value via the plug-in method. The proposed algorithm and theoretical results are validated through simulations and a real-data application to warfarin drug dosing.
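To make the algorithmic ingredients concrete, the following Python sketch shows one possible instantiation of a distributed $\varepsilon$-greedy round: a consensus (gossip) averaging step over a communication network followed by a local stochastic gradient update on the pulled arm. The linear reward model, the ring-topology mixing matrix `W`, the step size, and all variable names are illustrative assumptions for exposition, not the paper's actual specification.

```python
import numpy as np

rng = np.random.default_rng(0)

n_agents, n_arms, dim = 5, 3, 4
eps, step = 0.1, 0.05  # exploration rate and SGD step size (assumed values)

# Each agent holds a local estimate of the global per-arm parameters.
theta = rng.normal(size=(n_agents, n_arms, dim))
theta_true = rng.normal(size=(n_arms, dim))  # unknown reward parameters

# Doubly stochastic mixing weights over an assumed ring network.
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i - 1) % n_agents] = 0.25
    W[i, (i + 1) % n_agents] = 0.25

for t in range(1000):
    # 1) Consensus step: average parameter estimates with neighbors.
    theta = np.einsum("ij,jkd->ikd", W, theta)
    for i in range(n_agents):
        x = rng.normal(size=dim)                       # observed context
        if rng.random() < eps:                         # explore
            a = rng.integers(n_arms)
        else:                                          # exploit
            a = int(np.argmax(theta[i] @ x))
        r = theta_true[a] @ x + rng.normal(scale=0.1)  # noisy reward
        # 2) Local SGD step on the squared-error loss for the pulled arm.
        grad = (theta[i, a] @ x - r) * x
        theta[i, a] -= step * grad
```

Under this kind of update, each agent's estimate tracks the network average while incorporating its own observations, which is the structure the paper's limiting-distribution and random scaling results are built on.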