A gambler walks into a Vegas casino. There are 100’s of slot machines – which should he choose?
The term ‘multi-armed bandit’ comes from a hypothetical experiment where a person must choose between multiple actions, each with an unknown payout. The goal is to determine the best or most profitable outcome through a series of choices. At the beginning of the experiment, when odds and payouts are unknown, the gambler must determine which machine to pull, in which order and how many times.
The solution uses machine learning algorithms. Slot machines that are observed to pay out well, get more attention. Underperforming machines – less so. In theory, the approach should produce faster results since there is no need to wait for a single winning variation.
The same considerations apply for example to a News website. Which articles should be displayed to a visitor that will generate the most engagement (clicks)? In which order should they appear? Similar to the earlier example, with no information about the visitor, all click outcomes are initially unknown.
Below is a list of some of the more recognised multi-armed bandit solutions and strategies:
This is an algorithm for continuously balancing exploration with exploitation. The lever with highest known payout is always pulled except when a random action is taken. A randomly chosen arm is pulled a fraction of the time (displayed as epsilon or ε). The other proportion of the time (1-ε), the arm with highest known payout is pulled.
An exploration phase is followed by an exploitation phase. A number of set trials are made and explored. The number of trials explored is again a fraction expressed as epsilon (ε). During the exploitation phase the best option is used.
A twist on epsilon greedy. Here the value of ε reduces as the experiment progresses. This results in highly exploitative approach at the start and a highly exploitative approach at the end.
Similar to the epsilon-decreasing strategy, except that epsilon is reduced on basis of the learning progress rather than set and adjusted manually.
An Epsilon-greedy strategy, except that the value of ε is computed on dynamic criteria. For example, more weight may be applied based on historical or current visitor attributes considered to have a bearing on the outcome e.g previous page views, historical purchases, device type, time of day or location.
These strategies reflect the idea that the number of lever pulls should match the probability of it being optimal. These typically use Thomson sampling or Bayesian and can combine contextual approaches as well.
Thanks for submitting your enquiry with us.
A member of the team will be in touch shortly to follow up with you.