
Multiclass classification


This is a brief note to explain multiclass classification in VW. Advanced topics of modeling label similarity, variable number of labels per example, and zero-shot learning are discussed in the label-dependent-features section.

Basic Multiclass Classification

In standard multiclass classification, you have a fixed number of labels (K) and want to drop inputs into one of those K buckets. Here is an example basic multiclass dataset with four labels in VW format:

1 | a b c
3 | b c d
1 | a c e
4 | b d f
2 | d e f

This dataset consists of five examples; if you know the normal VW format, you should be able to figure out what it means.

You can train a one-against-all model by running:

vw --oaa 4 < mc.txt

This says to do one-against-all classification with 4 labels (you have to tell VW how many labels there are). If you specify too few, you'll get a warning. If you specify too many, nothing will really go wrong except you'll waste representation power.
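If you want to keep the trained model around for prediction later, you can ask VW to save it with -f (the model file name below is just illustrative):

vw --oaa 4 -d mc.txt -f mc.model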

If you want to make the third example ten times as important as the others, you can add weights to the examples:

1 | a b c
3 | b c d
1 10 | a c e
4 | b d f
2 | d e f

OAA classification proceeds by turning each multiclass example into K binary examples. For example, internally, the second example in the above dataset (the one with label "3") gets turned into:

-1 | 1_b 1_c 1_d
-1 | 2_b 2_c 2_d
+1 | 3_b 3_c 3_d
-1 | 4_b 4_c 4_d

Note that these examples don't explicitly get constructed: internally VW does some clever feature index trickery to get different features for the different labels (result: it's fast).

Test examples look just like normal VW test examples, for instance:

| b d e
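Assuming you saved a model at training time as sketched above, you can get predictions for a file of such test examples with -t (no learning), -i (load the saved model) and -p (write predictions); again, the file names are illustrative:

vw -t -i mc.model -d test.txt -p predictions.txt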

Cost-Sensitive Multiclass Classification

Slightly more complex than basic multiclass classification is cost-sensitive multiclass. Here, instead of having just one correct label (and all others incorrect), you can have a different cost for each of the K labels. You can also have different "allowable" labels for each example. A cost-sensitive training set equivalent to the one above is:

1:0 2:1 3:1 4:1 | a b c
1:1 2:1 3:0 4:1 | b c d
1:0 2:1 3:1 4:1 | a c e
1:1 2:1 3:1 4:0 | b d f
1:1 2:0 3:1 4:1 | d e f

The first line means that label 1 has a cost of zero and labels 2, 3 and 4 all have a cost of one. You can train this by:

vw --csoaa 4 < cs.txt

Here, csoaa means "cost-sensitive one against all". You can also train weighted all pairs by:

vw --wap 4 < cs.txt

Test examples are different from standard VW test examples because you have to tell VW which labels are allowed. The following is equivalent to the above test example:

1 2 3 4 | b d e

At training time, if there's an example with label 2 that you know (for whatever reason) will never be label 4, you could specify it as:

1:1 2:0 3:1 | example...

This means that labels 1 and 3 have a cost of 1, label 2 has a cost of zero, and no other labels are allowed. You can do the same at test time:

1 2 3 | example...

VW will then never predict anything other than the provided "possible" labels.
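As with --oaa, you can save a cost-sensitive model with -f at training time and then get predictions for such test examples with -t, -i and -p (file names illustrative):

vw --csoaa 4 -d cs.txt -f cs.model
vw -t -i cs.model -d cs_test.txt -p predictions.txt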

Internally, csoaa does something different from what you might expect. If there are K possible labels, it constructs K regression problems, where the input is a (label-based) version of the features and the output is the cost of that label. For instance, the second example in the above cs.txt gets converted to the following four regression problems:

1 | 1_b 1_c 1_d
1 | 2_b 2_c 2_d
0 | 3_b 3_c 3_d
1 | 4_b 4_c 4_d

And then at test time, the label whose regressor gives the lowest value (predicted cost) is chosen.

If you use WAP, it does what you might expect. It constructs (K choose 2) training examples, where the correct class is the one from the pair with the lower cost and the weight is (a function of) the difference of costs. In the case that it's just a standard multiclass dataset (one label with cost zero, the rest with cost one), you only get K-1 training examples, as follows (for the second example):

-1 | 1vs3_b 1vs3_c 1vs3_d
+1 | 2vs3_b 2vs3_c 2vs3_d
-1 | 4vs3_b 4vs3_c 4vs3_d

Again, this is all done with feature index manipulation to be efficient.

At test time, you don't actually have to run all (K choose 2) classifiers; you can just make K predictions (taking advantage of linearity: if you're training a non-linear model, this trick won't work).

Label-dependent-features (LDF) format

LDF addresses three scenarios:

  1. You have side information on the labels that might accelerate learning by helping predict whether the labels are "similar" (i.e., the conditional probabilities of two labels given an example are close on average w.r.t. some divergence).
  2. You have a dynamic set of labels, i.e., the set of labels is different on each example.
  3. Zero-shot learning: the (finite set of) labels for an example are drawn from a potentially infinite set of labels, some of which may never have occurred before. To get good accuracy under these conditions, it is important that you have side information on the labels.

Modeling label similarity

One thing you may have noticed is that all of the above examples assume that the way features get "exploded" from the multiclass problem to the binary problems is deterministic and simplistic. Basically, any feature "f" just gets prepended with the label id (e.g., "1_f", "2_f", etc.) for making predictions. Another way to think about this is that you're training K completely separate classifiers, each with its own feature space.

Sometimes you actually know something about your labels that makes this less ideal. For example, suppose you have four document labels (inspired by a dataset due to Andrew McCallum) that talk about automobiles or airplanes. But some are talking about real automobiles (or airplanes) and some are talking about toy automobiles (or airplanes). Thus the four labels are: RealAuto ToyAuto RealAir ToyAir (call them 1, 2, 3, 4 in that order). You might believe that it's actually useful to share parameters across the classes. For example, when you have a basic feature "f" on an example with label RealAuto, you might want to have three versions of "f" available for classification: "Real_f", "Auto_f" and "RealAuto_f". If you use the normal support for cost-sensitive classification, you'd only get the last of these.

The ldf format was invented to cope with problems like this. The easiest way to specify ldf examples is in a multiline format. This means that each label gets its own line in the input file, and blank lines separate examples. For instance, the following is a valid two-example dataset for the above problem:

1:0 | Real_f Auto_f RealAuto_f
2:1 | Toy_f  Auto_f ToyAuto_f
3:1 | Real_f Air_f  RealAir_f
4:1 | Toy_f  Air_f  ToyAir_f

1:1 | Real_g Auto_g RealAuto_g
2:1 | Toy_g  Auto_g ToyAuto_g
3:0 | Real_g Air_g  RealAir_g
4:1 | Toy_g  Air_g  ToyAir_g

(Note the trailing newline, which is necessary, and also a good idea in case you want to concatenate two such files later.)

As before, the labels don't need to be in order and you need not include options for all of them. They can also have costs other than zero/one, if you like.

If this data is in csldf.txt, you can train with:

vw --csoaa_ldf multiline < csldf.txt

Variable Number of Labels Per Example

Note that with the LDF format you don't need to specify the number of labels ahead of time, basically because playing with feature ids is now your job, not VW's. You can also say "m" instead of "multiline" for brevity. (Yes, there's also a single-line format, but it's not recommended.)
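For example, the training command above can be written more briefly as:

vw --csoaa_ldf m < csldf.txt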

This really does just turn these examples into regression problems, essentially stripping out the "label:" on the lines and training a model. However, the loss function reported during training is classification loss.

In this case, it actually might make sense to have features that appear for every class, as in:

1:0 | g Real_f Auto_f RealAuto_f
2:1 | g Toy_f  Auto_f ToyAuto_f
3:1 | g Real_f Air_f  RealAir_f
4:1 | g Toy_f  Air_f  ToyAir_f

If there are a lot of such features, you can share them across a single example (blank-line-separated block) by:

shared | g
1:0 | Real_f Auto_f RealAuto_f
2:1 | Toy_f  Auto_f ToyAuto_f
3:1 | Real_f Air_f  RealAir_f
4:1 | Toy_f  Air_f  ToyAir_f

In this case, this doesn't seem obviously helpful, but when "g" is a lot of features, this can save both disk space and prediction time (because dot products with these shared features can be cached).

You can also do WAP-ldf as:

vw --wap_ldf m < csldf.txt

This trains a classifier in the same way as WAP does above (except that the trick for going from O(K^2) to O(K) no longer applies, unfortunately).

You might find it weird that csoaa_ldf trains a regression model. If you want it to do the more obvious thing and really train K classifiers in a one-against-all framework, you can do this by running:

vw --csoaa_ldf mc < csldf.txt

Here, we've called csoaa_ldf with argument "mc" rather than just "m". "mc" means multiline-classifier, and tells it to train classifiers rather than regression models.

Test examples look the same, except they have no costs, as in:

shared | g
1 | Real_f Auto_f RealAuto_f
2 | Toy_f  Auto_f ToyAuto_f
4 | Toy_f  Air_f  ToyAir_f

(Note the trailing newline.) Here, we've said that we know a priori that label 3 is invalid, so VW should not predict that label.
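As in the other modes, you can save the ldf model with -f at training time and then use -t, -i and -p to get predictions for such test examples (file names illustrative):

vw --csoaa_ldf multiline -d csldf.txt -f csldf.model
vw --csoaa_ldf multiline -t -i csldf.model -d csldf_test.txt -p predictions.txt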

With LDF, the number of classes is never specified, so classes can be drawn from an infinite set, and classes that have never occurred before can suddenly be introduced. This allows VW to function as a zero-shot learning algorithm. To be effective, however, the features associated with labels must allow for generalization to novel labels (similar to how features associated with examples facilitate generalization to novel examples). If you know nothing about your labels other than a unique id, you will find that VW can perform poorly on examples with labels that have not been seen very many times.

Additional Notes

In ldf mode, shared features for WAP or classifier-based OAA should always be part of an interaction (-q or --cubic) because they have no linear effect: the predictions are all based on differences of feature vectors, and linear shared features cancel each other out. (You can verify this by placing them in a namespace and using --ignore_linear.) Thus shared features only have an effect if you specify interactions between them and the label features.
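For example (the namespace letters s and l below are purely illustrative), you could put the shared features in one namespace, the label features in another, and then request the quadratic interaction between the two:

shared |s g
1:0 |l Real_f Auto_f RealAuto_f
2:1 |l Toy_f  Auto_f ToyAuto_f
3:1 |l Real_f Air_f  RealAir_f
4:1 |l Toy_f  Air_f  ToyAir_f

vw --csoaa_ldf mc -q sl < csldf.txt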

Be careful when changing the loss function. If you're doing one of the regression-based models (csoaa) you need to use a regression loss and should probably stick to squared loss. If you're doing one of the classification-based models (oaa, wap, csoaa_ldf mc or wap_ldf) then you might want to use a classification loss, like logistic or hinge.
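For example, both of the following are reasonable choices (--loss_function squared is VW's default, so the first line is just being explicit):

vw --csoaa 4 --loss_function squared < cs.txt
vw --csoaa_ldf mc --loss_function logistic < csldf.txt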

Credit

Content copied from here with permission from the author, Hal Daumé III.
