Multi-Armed Bandit

What is the Multi-Armed Bandit Problem?

A gambler walks into a Vegas casino. There are 100’s of slot machines – which should he choose?

The term ‘multi-armed bandit’ comes from a hypothetical experiment where a person must choose between multiple actions, each with an unknown payout. The goal is to determine the best or most profitable outcome through a series of choices. At the beginning of the experiment, when odds and payouts are unknown, the gambler must determine which machine to pull, in which order and how many times. 

The solution uses machine learning algorithms. Slot machines that are observed to pay out well, get more attention. Underperforming machines – less so. In theory, the approach should produce faster results since there is no need to wait for a single winning variation.

Multi-Armed Bandit - Digital Examples

The same considerations apply for example to a News website. Which articles should be displayed to a visitor that will generate the most engagement (clicks)? In which order should they appear? Similar to the earlier example, with no information about the visitor, all click outcomes are initially unknown. 

Multi-Armed Bandit Strategies

Below is a list of some of the more recognised multi-armed bandit solutions and strategies:

Epsilon-Greedy Strategy

This is an algorithm for continuously balancing exploration with exploitation. The lever with highest known payout is always pulled except when a random action is taken. A randomly chosen arm is pulled a fraction of the time (displayed as epsilon or ε). The other proportion of the time (1-ε), the arm with highest known payout is pulled.

Epsilon-First Strategy

An exploration phase is followed by an exploitation phase. A number of set trials are made and explored. The number of trials explored is again a fraction expressed as epsilon (ε). During the exploitation phase the best option is used. 

Epsilon-Decreasing Strategy

A twist on epsilon greedy. Here the value of ε reduces as the experiment progresses. This results in highly exploitative approach at the start and a highly exploitative approach at the end. 

Adaptive Epsilon-Greedy Strategy

Similar to the epsilon-decreasing strategy, except that epsilon is reduced on basis of the learning progress rather than set and adjusted manually. 

Contextual-Epsilon-Greedy Strategy

An Epsilon-greedy strategy, except that the value of ε is computed on dynamic criteria. For example, more weight may be applied based on historical or current visitor attributes considered to have a bearing on the outcome e.g previous page views, historical purchases, device type, time of day or location.

Probability Matching Strategies

These strategies reflect the idea that the number of lever pulls should match the probability of it being optimal. These typically use Thomson sampling or Bayesian and can combine contextual approaches as well. 

Back to glossary

Back to top

More from the glossary

Marketing Technology Stack

What is a Marketing Technology Stack? What are the considerations when choosing one?


Meta Tags

What are Meta Tags? Some examples of how to use them on your website.


Multivariate Testing

What is Multivariate Testing? Simple, full factorial and fractional factorial MVT.


Null Hypothesis

What is a Null Hypothesis? What is an alternative hypothesis? How do they work?


Get in touch

Want to know more about how Webtrends Optimize can help you with website optimisation? Click below to send us a message or call us on 0333 444 5502.

Send message