Thompson Sampling is a method that balances exploration and exploitation to maximise the total reward gained from a task. It is sometimes referred to as Probability Matching or Posterior Sampling.
Thompson Sampling, named after William R. Thompson, is a heuristic for selecting actions in the multi-armed bandit problem that addresses the exploration-exploitation dilemma. At each step it selects the action that maximises the expected reward with respect to a belief drawn at random from the posterior distribution.
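As a concrete sketch, here is Bernoulli Thompson Sampling with Beta posteriors: each round, one belief is sampled per arm from its posterior, the arm with the highest sample is played, and the posterior is updated with the observed reward. The arm probabilities and function name are illustrative, not from the original.

```python
import random

def thompson_sampling(true_probs, n_rounds=10000, seed=0):
    """Bernoulli Thompson Sampling with Beta(1, 1) priors.

    `true_probs` are the (unknown to the agent) success rates of
    each arm; the agent only sees sampled 0/1 rewards.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    alpha = [1] * n_arms  # 1 + successes observed on each arm
    beta = [1] * n_arms   # 1 + failures observed on each arm
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one belief per arm from its Beta posterior,
        # then act greedily with respect to that random belief.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.5, 0.7])
```

Over many rounds the posterior for the best arm concentrates, so its samples win most comparisons and it receives the bulk of the pulls, while the other arms are still tried occasionally (exploration).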
Thompson Sampling is better suited to maximising long-term total reward, whereas UCB-1 yields allocations more akin to an A/B test. UCB-1 also behaves more consistently from trial to trial than Thompson Sampling, which incurs greater noise from the random sampling step in the algorithm.
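For comparison, a minimal UCB-1 sketch on the same Bernoulli setting might look like the following (again with illustrative arm probabilities). Note that, unlike Thompson Sampling, the arm choice here is a deterministic function of the observed history, which is why UCB-1 behaves more consistently across trials.

```python
import math
import random

def ucb1(true_probs, n_rounds=10000, seed=0):
    """UCB-1 on Bernoulli arms: play each arm once, then always pick
    the arm with the highest upper confidence bound on its mean."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms
    means = [0.0] * n_arms  # running mean reward per arm
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1  # initialisation: play each arm once
        else:
            # Deterministic choice: empirical mean + exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = 1 if rng.random() < true_probs[arm] else 0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts

counts = ucb1([0.2, 0.5, 0.7])
```

The exploration bonus shrinks as an arm is played more, so weaker arms keep receiving a trickle of pulls, which is what gives UCB-1 its more A/B-test-like allocation.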
A variation of Thompson Sampling has also been studied for nonparametric reinforcement learning over a countable class of general stochastic environments, which may be non-Markov, non-ergodic, and partially observable.