In the age of e-commerce, many logistic companies must operate in large road networks without accurate knowledge of travel times for their specific fleet of vehicles. Moreover, millions of dollars are spent on routing services that do not accurately capture the specific characteristics of the companies’ drivers and the types of vehicles they must use. In this work, we consider a logistic operator that has limited information about travel times in a network, and who seeks to find the optimal expected shortest path of origin-destination pairs. We model this problem as an Online Shortest Path Problem, common to many last mile routing settings: given a graph whose arcs’ travel times are stochastic and follow an unknown distribution, the planner seeks to find a vehicle route of minimum travel time from an origin to a destination. We assume the planner progressively collects information on the travel conditions as the drivers keep serving requests. Inspired by combinatorial multi-armed bandit and Kriging literature, we propose three methods with different characteristics that effectively learn the optimal shortest path and illustrate the practical advantages of incorporating spatial correlation in the learning process. Our approach balances the trade-off of exploration (better estimates for unexplored arcs) vs exploitation (executing the minimum expected time path) by using the Thompson Sampling algorithm. In each iteration, our algorithm executes the path minimizing the expected time with data from a posterior distribution of the arcs’ speeds. We carry out a computational study comprising two settings: a set of four artificial instances and a real-life case study. The case study uses empirical data of taxis in the 17km-radius area of the center of Beijing, completely enveloping Beijing’s “5th Ring Road”. In both settings, we observe that our algorithms efficiently and effectively balance the exploration-exploitation trade-off.