Multi-Armed Bandit

What is the Multi-Armed Bandit Problem?

A gambler walks into a Vegas casino. There are 100’s of slot machines – which should he choose?

The term ‘multi-armed bandit’ comes from a hypothetical experiment where a person must choose between multiple actions, each with an unknown payout. The goal is to determine the best or most profitable outcome through a series of choices. At the beginning of the experiment, when odds and payouts are unknown, the gambler must determine which machine to pull, in which order and how many times.

The solution uses machine learning algorithms. Slot machines that are observed to pay out well, get more attention. Underperforming machines – less so. In theory, the approach should produce faster results since there is no need to wait for a single winning variation.

Multi-Armed Bandit - Digital Examples

The same considerations apply for example to a News website. Which articles should be displayed to a visitor that will generate the most engagement (clicks)? In which order should they appear? Similar to the earlier example, with no information about the visitor, all click outcomes are initially unknown.

Multi-Armed Bandit Strategies

Below is a list of some of the more recognised multi-armed bandit solutions and strategies:

Epsilon-Greedy Strategy

This is an algorithm for continuously balancing exploration with exploitation. The lever with highest known payout is always pulled except when a random action is taken. A randomly chosen arm is pulled a fraction of the time (displayed as epsilon or ε). The other proportion of the time (1-ε), the arm with highest known payout is pulled.

Epsilon-First Strategy

An exploration phase is followed by an exploitation phase. A number of set trials are made and explored. The number of trials explored is again a fraction expressed as epsilon (ε). During the exploitation phase the best option is used.

Epsilon-Decreasing Strategy

A twist on epsilon greedy. Here the value of ε reduces as the experiment progresses. This results in highly exploitative approach at the start and a highly exploitative approach at the end.

Adaptive Epsilon-Greedy Strategy

Similar to the epsilon-decreasing strategy, except that epsilon is reduced on basis of the learning progress rather than set and adjusted manually.

Contextual-Epsilon-Greedy Strategy

An Epsilon-greedy strategy, except that the value of ε is computed on dynamic criteria. For example, more weight may be applied based on historical or current visitor attributes considered to have a bearing on the outcome e.g previous page views, historical purchases, device type, time of day or location.

Probability Matching Strategies

These strategies reflect the idea that the number of lever pulls should match the probability of it being optimal. These typically use Thomson sampling or Bayesian and can combine contextual approaches as well.

What is the Multi-Armed Bandit Problem?

Multi-Armed Bandit - Digital Examples

Multi-Armed Bandit Strategies

Epsilon-Greedy Strategy

Epsilon-First Strategy

Epsilon-Decreasing Strategy

Adaptive Epsilon-Greedy Strategy

Contextual-Epsilon-Greedy Strategy

Probability Matching Strategies

More from the glossary

Marketing Technology Stack

Meta Tags

Multivariate Testing

Null Hypothesis

Get in touch

Solutions

Key Tools

Other Stuff