Multi-Armed Bandit

What is the Multi-Armed Bandit Problem?

A gambler walks into a Vegas casino. There are 100’s of slot machines – which should he choose?

The term ‘multi-armed bandit’ comes from a hypothetical experiment where a person must choose between multiple actions, each with an unknown payout. The goal is to determine the best or most profitable outcome through a series of choices. At the beginning of the experiment, when odds and payouts are unknown, the gambler must determine which machine to pull, in which order and how many times. 

The solution uses machine learning algorithms. Slot machines that are observed to pay out well, get more attention. Underperforming machines – less so. In theory, the approach should produce faster results since there is no need to wait for a single winning variation.

Multi-Armed Bandit - Digital Examples

The same considerations apply for example to a News website. Which articles should be displayed to a visitor that will generate the most engagement (clicks)? In which order should they appear? Similar to the earlier example, with no information about the visitor, all click outcomes are initially unknown. 

Multi-Armed Bandit Strategies

Below is a list of some of the more recognised multi-armed bandit solutions and strategies:

Epsilon-Greedy Strategy

This is an algorithm for continuously balancing exploration with exploitation. The lever with highest known payout is always pulled except when a random action is taken. A randomly chosen arm is pulled a fraction of the time (displayed as epsilon or ε). The other proportion of the time (1-ε), the arm with highest known payout is pulled.

Epsilon-First Strategy

An exploration phase is followed by an exploitation phase. A number of set trials are made and explored. The number of trials explored is again a fraction expressed as epsilon (ε). During the exploitation phase the best option is used. 

Epsilon-Decreasing Strategy

A twist on epsilon greedy. Here the value of ε reduces as the experiment progresses. This results in highly exploitative approach at the start and a highly exploitative approach at the end. 

Adaptive Epsilon-Greedy Strategy

Similar to the epsilon-decreasing strategy, except that epsilon is reduced on basis of the learning progress rather than set and adjusted manually. 

Contextual-Epsilon-Greedy Strategy

An Epsilon-greedy strategy, except that the value of ε is computed on dynamic criteria. For example, more weight may be applied based on historical or current visitor attributes considered to have a bearing on the outcome e.g previous page views, historical purchases, device type, time of day or location.

Probability Matching Strategies

These strategies reflect the idea that the number of lever pulls should match the probability of it being optimal. These typically use Thomson sampling or Bayesian and can combine contextual approaches as well. 

More from the glossary

Lead Generation

Lead Generation What is Lead Generation? The idea of lead generation seems fundamental to most businesses.  In order to understand

Read More »

Urgency Messaging

Urgency Messaging Turn Procrastinators into Decision Makers Some visitors need extra motivation to take action when it comes to E-Commerce

Read More »

Get in touch

If you have more questions or want to know a little more about website optimisation and personalisation, send us a message or call us on 0333 444 5502.

Thank you!

Thanks for submitting your enquiry with us.
A member of the team will be in touch shortly to follow up with you.

Thank you!

Please check your inbox and validate your email address so that we can create your free trial account.

Training Videos

While we're setting up your account, feel free to check out our training videos. You'll find them at:


Or, take a look through our blog to review industry insights, wins and all things Experimentation.