Contextual Bandit Exploration with SquareCB
SquareCB is a new algorithm for efficient contextual bandit exploration which works by reducing contextual bandits to regression. It is the first reduction of this type that provably has optimal performance for contextual bandits, and has comparable runtime to other basic exploration algorithms such as epsilon-greedy. Empirically, we have found that it has competitive performance on the large-scale evaluation setup from the bake-off paper.
SquareCB is an option for the `--cb_explore_adf` reduction in VW. It uses the base learner in VW to learn a model that predicts rewards/losses for each context/action pair via regression. With the default settings, the base learner learns a linear model with the square loss, trained online via AdaGrad (see Learning-algorithm and Loss functions).
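To make the base learner concrete, here is a minimal sketch (in Python, not VW's actual code) of an online linear regressor with squared loss and an AdaGrad-style per-coordinate step. VW's implementation additionally uses feature hashing and its own adaptive/normalized/invariant update rules, so treat this purely as an illustration of the idea.

```python
import numpy as np

class OnlineSquareLossRegressor:
    """Illustrative online linear regression with squared loss and
    AdaGrad-style per-coordinate learning rates (a sketch, not VW's code)."""

    def __init__(self, dim, eta=0.5, eps=1e-8):
        self.w = np.zeros(dim)    # linear weights
        self.g2 = np.zeros(dim)   # running sum of squared gradients per coordinate
        self.eta = eta
        self.eps = eps

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, y):
        # Gradient of 0.5 * (w.x - y)^2 with respect to w.
        g = (self.predict(x) - y) * x
        self.g2 += g * g
        self.w -= self.eta * g / np.sqrt(self.g2 + self.eps)
```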
Suppose we have already seen t-1 examples, and a new context x_t arrives. The base learner, using its regression model trained online on all the examples seen so far, produces for each action a prediction of the loss if we play that action. If we were to use the simplest exploration strategy, epsilon-greedy, we would take the action with the lowest predicted loss, play it with probability 1-epsilon, and sample the other actions uniformly with probability epsilon. SquareCB is very similar but, rather than assigning uniform probability to the higher-loss actions, it assigns probability based on how much larger their predicted loss is. Actions with higher predicted loss are played less often, and actions with lower predicted loss are played more often. This weighting allows the algorithm to explore more efficiently than epsilon-greedy, which leads to better performance in theory and in practice.
In more detail, the SquareCB strategy is as follows. Let y_hat(a) denote the predicted loss for action a, and let a_hat be the action that minimizes the predicted loss. Then for each action a other than a_hat, SquareCB assigns probability

`1 / (K + gamma * (y_hat(a) - y_hat(a_hat)))`

where K is the number of actions for the current context and gamma is a confidence parameter. The remaining probability mass is then assigned to a_hat. In particular, if all actions have the same predicted loss, the algorithm samples from them uniformly, but if the predicted loss for a_hat is smaller than for the other actions, the other actions are less likely to be played. The parameter gamma modulates how aggressively the algorithm focuses on the action that is predicted to be best: as gamma gets larger, the algorithm behaves more greedily. In VW, we automatically increase gamma as more examples are observed and the predictions become more confident.
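As a concrete illustration of this weighting rule, here is a small Python sketch of the math above (not VW's implementation):

```python
import numpy as np

def squarecb_probs(predicted_losses, gamma):
    """Sketch of the SquareCB weighting rule: each non-best action a gets
    probability 1 / (K + gamma * (loss(a) - loss(best))); the remaining
    mass goes to the action with the lowest predicted loss."""
    losses = np.asarray(predicted_losses, dtype=float)
    K = len(losses)
    best = int(np.argmin(losses))
    probs = 1.0 / (K + gamma * (losses - losses[best]))
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()   # remaining mass to the predicted-best action
    return probs

# Equal predicted losses give a uniform distribution; a larger gamma
# concentrates more mass on the predicted-best action.
print(squarecb_probs([0.5, 0.5, 0.5], gamma=10))    # ~[0.33, 0.33, 0.33]
print(squarecb_probs([0.2, 0.5, 0.9], gamma=100))   # most mass on action 0
```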
The implementation of SquareCB in VW also has a variant (the `--elim` option) which augments the approach above using action elimination. With the `--elim` option turned on, the algorithm first uses confidence intervals to eliminate actions which are obviously bad, then performs the weighting strategy above only on the surviving actions. This variant can work quite well in practice because it can quickly zoom in on the good actions.
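The details of VW's `--elim` implementation are beyond the scope of this page, but the rough idea can be sketched as follows, reusing the `squarecb_probs` sketch above. The confidence intervals here are hypothetical placeholders, not VW's actual bounds:

```python
import numpy as np

def eliminate_then_weight(predicted_losses, conf_widths, gamma):
    """Rough illustration of action elimination followed by SquareCB
    weighting. conf_widths are hypothetical confidence-interval
    half-widths around each predicted loss."""
    losses = np.asarray(predicted_losses, dtype=float)
    widths = np.asarray(conf_widths, dtype=float)
    lower, upper = losses - widths, losses + widths
    # Keep only actions whose optimistic (lower-bound) loss could still
    # beat the most pessimistic (upper-bound) loss among all actions.
    surviving = np.flatnonzero(lower <= upper.min())
    probs = np.zeros_like(losses)
    probs[surviving] = squarecb_probs(losses[surviving], gamma)
    return probs
```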
SquareCB is a general-purpose algorithm, and can be used in place of any of the other exploration algorithms (explore-first, epsilon-greedy, bagging, cover, RegCB) in VW. Computationally, the number of floating-point operations per update is comparable to epsilon-greedy, so it is one of the faster algorithms. On the datasets from the bake-off paper, we found that it has good all-around performance, especially for challenging problems with many actions.
As with all contextual bandit algorithms based on regression (e.g., LinUCB or RegCB), SquareCB will only perform well if the regression model used by the base learner is flexible enough to model the rewards/losses in the contextual bandit problem. If performance is lacking, adding more features or switching to a more powerful base learner (e.g., turning on boosting or bootstrapping) may help.
The `--elim` option generally leads to better contextual bandit performance, but, because this option eliminates actions more aggressively, the resulting logged data may be less useful for counterfactual evaluation/offline policy optimization.
For contextual bandit problems with changing action sets in the multiline format described here, basic invocation with and without the elimination option is as follows:
```
vw -d [data_file] --cb_explore_adf --cb_type mtr --squarecb --gamma_scale 1000
vw -d [data_file] --cb_explore_adf --cb_type mtr --squarecb --elim --gamma_scale 10
```
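For reference, a multiline (ADF) data file for `--cb_explore_adf` might look like the snippet below, where the first line is the shared context, each remaining line is one action, and the logged action carries a label of the form `action:cost:probability`. The feature names here are made up; see the contextual bandit input format documentation for the authoritative description.

```
shared | user_age_25 user_location_usa
0:1.0:0.5 | article_sports
| article_politics
| article_music
```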
Given the base invocation `vw -d [data_file] --cb_explore_adf --cb_type mtr --squarecb`, the additional arguments we can give to SquareCB are as follows:
- `--gamma_scale [arg]` and `--gamma_exponent [arg]`. These options set the schedule for the confidence parameter gamma described in the update rule above. For the t-th example, we set `gamma = gamma_scale * t^{gamma_exponent}`. A small sketch of this schedule appears after this list.
- `--elim`. If this is passed, we use action elimination on top of the basic SquareCB strategy, as described above.
- `--mellowness [arg]`, `--cb_min_cost [arg]`, `--cb_max_cost [arg]`. These are hyperparameters for the `--elim` option, and are only used when it is invoked. Mellowness is a confidence parameter, and decreasing it makes the algorithm more aggressive; the default is generally a good choice. `--cb_min_cost` and `--cb_max_cost` are lower and upper bounds on the cost range. These should be set so that all the costs in your dataset lie in the interval [min_cost, max_cost]. See also the documentation for RegCB and RegCB-Opt.
- `--cb_type {dr, ips, mtr}` should always be set to `mtr`.
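To illustrate the gamma schedule, here is a short Python sketch reusing `squarecb_probs` from above; the particular values of `gamma_scale` and `gamma_exponent` are arbitrary choices for the illustration, not VW defaults:

```python
def gamma_schedule(t, gamma_scale, gamma_exponent):
    """gamma = gamma_scale * t^{gamma_exponent}, as set by the flags above."""
    return gamma_scale * t ** gamma_exponent

predicted = [0.2, 0.3, 0.4]
for t in (10, 1_000, 100_000):
    g = gamma_schedule(t, gamma_scale=10.0, gamma_exponent=0.5)
    print(t, squarecb_probs(predicted, g))  # distribution sharpens as t grows
```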
When the number of actions is known ahead of time, and we have a file consisting of examples in the fixed-action contextual bandit format discussed here, the invocation is the same as above, but we add the flag `--cbify [n actions]`, where `[n actions]` is the number of actions. For example,
```
vw -d [data_file] --cbify [n actions] --cb_explore_adf --cb_type mtr --squarecb --gamma_scale 1000
vw -d [data_file] --cbify [n actions] --cb_explore_adf --cb_type mtr --squarecb --elim --gamma_scale 10
```