Contextual Bandit Exploration with SquareCB
SquareCB is a new algorithm for efficient contextual bandit exploration which works by reducing contextual bandits to regression. It is the first reduction of this type that provably has optimal performance for contextual bandits, and has comparable runtime to other basic exploration algorithms such as epsilon-greedy. Empirically, we have found that it has competitive performance on the large-scale evaluation setup from the bake-off paper.
SquareCB is an option for the `--cb_explore_adf` reduction in VW. It uses the base learner in VW to learn a model that predicts rewards/losses for each context/action pair via regression. With the default settings, the base learner learns a linear model with the square loss, trained online via AdaGrad (see Learning-algorithm and Loss functions).
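To make the base learner concrete, here is a minimal sketch (in Python, not VW's actual code) of an online linear regressor with squared loss and an AdaGrad-style per-coordinate step. VW's implementation additionally uses feature hashing and its own adaptive/normalized/invariant update rules, so treat this purely as an illustration of the idea.

```python
import numpy as np

class OnlineSquareLossRegressor:
    """Illustrative online linear regression with squared loss and
    AdaGrad-style per-coordinate learning rates (a sketch, not VW's code)."""

    def __init__(self, dim, eta=0.5, eps=1e-8):
        self.w = np.zeros(dim)    # linear weights
        self.g2 = np.zeros(dim)   # running sum of squared gradients per coordinate
        self.eta = eta
        self.eps = eps

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, y):
        # Gradient of 0.5 * (w.x - y)^2 with respect to w.
        g = (self.predict(x) - y) * x
        self.g2 += g * g
        self.w -= self.eta * g / np.sqrt(self.g2 + self.eps)
```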
Suppose we have already seen t-1 examples, and a new context x_t arrives. The base learner, using its regression model trained online on all the examples seen so far, produces for each action a prediction of the loss if we play that action. If we were to use the simplest exploration strategy, epsilon-greedy, we would take the action with the lowest predicted loss, play it with probability 1-epsilon, and sample the other actions uniformly with probability epsilon. SquareCB is very similar but, rather than assigning uniform probability to the higher-loss actions, it assigns probability based on how much larger their predicted loss is. Actions with higher predicted loss are played less often, and actions with lower predicted loss are played more often. This weighting allows the algorithm to explore more efficiently than epsilon-greedy, which leads to better performance in theory and in practice.
In more detail, the SquareCB strategy is as follows. Let y_hat(a) denote the predicted loss for action a, and let a_hat be the action that minimizes the predicted loss. Then for each action a other than a_hat, SquareCB assigns probability

`1 / (K + gamma * (y_hat(a) - y_hat(a_hat)))`

where K is the number of actions for the current context and gamma is a confidence parameter. The remaining probability mass is then assigned to a_hat. In particular, if all actions have the same predicted loss, the algorithm samples from them uniformly, but if the predicted loss for a_hat is smaller than for the other actions, the other actions are less likely to be played. The parameter gamma modulates how aggressively the algorithm focuses on the action that is predicted to be best: as gamma gets larger, the algorithm behaves more greedily. In VW, we automatically increase gamma as more examples are observed and the predictions become more confident.
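As a concrete illustration of this weighting rule, here is a small Python sketch of the math above (not VW's implementation):

```python
import numpy as np

def squarecb_probs(predicted_losses, gamma):
    """Sketch of the SquareCB weighting rule: each non-best action a gets
    probability 1 / (K + gamma * (loss(a) - loss(best))); the remaining
    mass goes to the action with the lowest predicted loss."""
    losses = np.asarray(predicted_losses, dtype=float)
    K = len(losses)
    best = int(np.argmin(losses))
    probs = 1.0 / (K + gamma * (losses - losses[best]))
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()   # remaining mass to the predicted-best action
    return probs

# Equal predicted losses give a uniform distribution; a larger gamma
# concentrates more mass on the predicted-best action.
print(squarecb_probs([0.5, 0.5, 0.5], gamma=10))    # ~[0.33, 0.33, 0.33]
print(squarecb_probs([0.2, 0.5, 0.9], gamma=100))   # most mass on action 0
```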
The implementation of SquareCB in VW also has a variant (the `--elim` option) which augments the approach above using action elimination. With the `--elim` option turned on, the algorithm first uses confidence intervals to eliminate actions which are obviously bad, then performs the weighting strategy above only on the surviving actions. This variant can work quite well in practice because it can quickly zoom in on the good actions.
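The details of VW's `--elim` implementation are beyond the scope of this page, but the rough idea can be sketched as follows, reusing the `squarecb_probs` sketch above. The confidence intervals here are hypothetical placeholders, not VW's actual bounds:

```python
import numpy as np

def eliminate_then_weight(predicted_losses, conf_widths, gamma):
    """Rough illustration of action elimination followed by SquareCB
    weighting. conf_widths are hypothetical confidence-interval
    half-widths around each predicted loss."""
    losses = np.asarray(predicted_losses, dtype=float)
    widths = np.asarray(conf_widths, dtype=float)
    lower, upper = losses - widths, losses + widths
    # Keep only actions whose optimistic (lower-bound) loss could still
    # beat the most pessimistic (upper-bound) loss among all actions.
    surviving = np.flatnonzero(lower <= upper.min())
    probs = np.zeros_like(losses)
    probs[surviving] = squarecb_probs(losses[surviving], gamma)
    return probs
```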
SquareCB is a general-purpose algorithm, and can be used in place of any of the other exploration algorithms (explore-first, epsilon-greedy, bagging, cover, RegCB) in VW. Computationally, the number of floating-point operations per update is comparable to epsilon-greedy, so it is one of the faster algorithms. On the datasets from the bake-off paper, we found that it has good all-around performance, especially for challenging problems with many actions.
As with all contextual bandit algorithms based on regression (e.g., LinUCB or RegCB), SquareCB will only perform well if the regression model used by the base learner is flexible enough to model the rewards/losses in the contextual bandit problem. If performance is lacking, adding more features or switching to a more powerful base learner (e.g., turning on boosting or bootstrapping) may help.
The `--elim` option generally leads to better contextual bandit performance, but, because this option eliminates actions more aggressively, the resulting logged data may be less useful for counterfactual evaluation/offline policy optimization.
For contextual bandit problems with changing action sets in the multiline format described here, basic invocation with and without the elimination option is as follows:
```
vw -d [data_file] --cb_explore_adf --cb_type mtr --squarecb --gamma_scale 1000
vw -d [data_file] --cb_explore_adf --cb_type mtr --squarecb --elim --gamma_scale 10
```
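For reference, a multiline (ADF) data file for `--cb_explore_adf` might look like the snippet below, where the first line is the shared context, each remaining line is one action, and the logged action carries a label of the form `action:cost:probability`. The feature names here are made up; see the contextual bandit input format documentation for the authoritative description.

```
shared | user_age_25 user_location_usa
0:1.0:0.5 | article_sports
| article_politics
| article_music
```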
Given the base invocation `vw -d [data_file] --cb_explore_adf --cb_type mtr --squarecb`, the additional arguments we can give to SquareCB are as follows:
- `--gamma_scale [arg]` and `--gamma_exponent [arg]`. These options set the schedule for the confidence parameter gamma described in the update rule above. For the t-th example, we set `gamma = gamma_scale * t^{gamma_exponent}`. A small sketch of this schedule appears after this list.
- `--elim`. If this is passed, we use action elimination on top of the basic SquareCB strategy, as described above.
- `--mellowness [arg]`, `--cb_min_cost [arg]`, `--cb_max_cost [arg]`. These are hyperparameters for the `--elim` option, and are only used when it is invoked. Mellowness is a confidence parameter, and decreasing it makes the algorithm more aggressive; the default is generally a good choice. `--cb_min_cost` and `--cb_max_cost` are lower and upper bounds on the cost range. These should be set so that all the costs in your dataset lie in the interval [min_cost, max_cost]. See also the documentation for RegCB and RegCB-Opt.
- `--cb_type {dr, ips, mtr}` should always be set to `mtr`.
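To illustrate the gamma schedule, here is a short Python sketch reusing `squarecb_probs` from above; the particular values of `gamma_scale` and `gamma_exponent` are arbitrary choices for the illustration, not VW defaults:

```python
def gamma_schedule(t, gamma_scale, gamma_exponent):
    """gamma = gamma_scale * t^{gamma_exponent}, as set by the flags above."""
    return gamma_scale * t ** gamma_exponent

predicted = [0.2, 0.3, 0.4]
for t in (10, 1_000, 100_000):
    g = gamma_schedule(t, gamma_scale=10.0, gamma_exponent=0.5)
    print(t, squarecb_probs(predicted, g))  # distribution sharpens as t grows
```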
When the number of actions is known ahead of time, and we have a file consisting of examples in the fixed-action contextual bandit format discussed here, the invocation is the same as above, but we add the flag `--cbify [n actions]`, where `[n actions]` is the number of actions. For example,
```
vw -d [data_file] --cbify [n actions] --cb_explore_adf --cb_type mtr --squarecb --gamma_scale 1000
vw -d [data_file] --cbify [n actions] --cb_explore_adf --cb_type mtr --squarecb --elim --gamma_scale 10
```