Input format
The raw (plain text) input data for VW should have one example per line. Each example should be formatted as follows. Fields are space-delimited.
Label [Tag]|Namespace Features |Namespace Features ... |Namespace Features
Where:
Namespace=String[:Value]
Features=(String[:Value] )*
Label = see below
And:
- Tag is a string that serves as an identifier for the example. It is reported back when predictions are made. It doesn't have to be unique. The default value if it is not provided is the empty string. If you provide a tag without a weight you need to disambiguate: either make the tag touch the | (no trailing spaces) or mark it with a leading single-quote '. If you don't provide a tag, you need to have a space before the |.
- Namespace is an identifier of a source of information for the example, optionally followed by a float (e.g., MetricFeatures:3.28), which acts as a global scaling of all the values of the features in this namespace. If the value is omitted, the default is 1. It is important that there is no space between the separator | and the namespace, as otherwise the token is interpreted as a feature.
- Features is a sequence of whitespace-separated strings, each of which is optionally followed by a float (e.g., NumberOfLegs:4.0 HasStripes) or a string (e.g., city:paris). Each string is a feature and the value is the feature value for that example. Omitting a feature means that its value is zero. Including a feature but omitting its value means that its value is 1. When a string is supplied as the feature value, the value is 1 and the feature index is calculated as hash(feature_value, hash(feature_name, namespace_hash)), where hash's signature is hash(input, seed). This chained hashing is denoted by a ^ in the audit output.
Currently, the only characters that can't be used in feature or namespace names are vertical bar, colon, space, and newline.
Multiple example training sets can be found in the testing directory and their usage can be looked up from the testing script here.
Do keep in mind that these are used for testing but they are a good reference point.
The spacing around the | characters is significant:
- Around the 1st |: if there's no space preceding it, the string that touches the | is considered a tag (id of the example).
- After any |:
  - If there's a space, the next non-space token is considered a regular feature name.
  - If there's no space, the next non-space token is considered a namespace.
Namespaces act as feature name prefixes: they are prepended to all feature names in the namespace.
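The spacing rules above can be illustrated with a small parser. This is a minimal sketch and not VW's actual parser: the function names parse_vw_line and _parse_value and the returned dictionary layout are hypothetical, the label fields are left as raw strings, and the anonymous namespace is represented here by an empty name.

```python
def _parse_value(text):
    """Missing value -> 1.0; numeric text -> float; anything else stays a string."""
    if text == "":
        return 1.0
    try:
        return float(text)
    except ValueError:
        return text


def parse_vw_line(line):
    """Split one text example into label fields, tag, and namespaces (illustrative only)."""
    head, *sections = line.rstrip("\n").split("|")
    header = head.split()
    tag = ""
    if header and not head[-1].isspace():
        # The token touching the first '|' is the tag.
        tag = header.pop().lstrip("'")
    elif header and header[-1].startswith("'"):
        # A leading single-quote also marks a tag.
        tag = header.pop()[1:]

    namespaces = []
    for section in sections:
        tokens = section.split()
        name, scale = "", 1.0
        if tokens and not section[0].isspace():
            # No space after '|': the first token is Namespace[:Value].
            name, _, scale_text = tokens.pop(0).partition(":")
            scale = float(scale_text) if scale_text else 1.0
        features = []
        for token in tokens:
            feature, _, value = token.partition(":")
            features.append((feature, _parse_value(value)))
        namespaces.append({"name": name, "scale": scale, "features": features})
    return {"label_fields": header, "tag": tag, "namespaces": namespaces}
```

For example, parse_vw_line("1 | a a b") yields one unnamed namespace holding the features a, a, and b, while "1 zebra|f a" yields the tag zebra and a namespace named f.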
Feature repetitions: repeating a feature in the same example makes vw count it again, so the repeated values add up. In other words, the following 3 examples are equivalent:
1 | a a b
1 | a:1 a:1 b:1
1 | a:2 b:1
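The equivalence of these three examples can be checked by summing the values of repeated features. The helper below is a hypothetical sketch that assumes numeric feature values only and takes just the feature part of the line (the text after the |); it does not reproduce VW's hashing.

```python
from collections import defaultdict


def merge_repeated_features(feature_text):
    """Sum the values of repeated features; a missing value counts as 1."""
    totals = defaultdict(float)
    for token in feature_text.split():
        name, _, value = token.partition(":")
        totals[name] += float(value) if value else 1.0
    return dict(totals)
```

All three feature lists above collapse to the same mapping, with a at 2 and b at 1.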
The label format VW expects depends on which reductions you are using. The default label format is simple, and it is the most common one.
Label types:
- Simple - VW::label_type_t::simple
- Multiclass - VW::label_type_t::multiclass
- Multilabels - VW::label_type_t::multilabel
- Cost sensitive - VW::label_type_t::cs
- Contextual Bandit - VW::label_type_t::cb
- Contextual Bandit Eval - VW::label_type_t::cb_eval
- Conditional Contextual Bandit - VW::label_type_t::ccb
- Slates - VW::label_type_t::slates
- CATS and CATS-pdf for Continuous Actions - VW::label_type_t::continuous
Simple
[Label] [Importance] [Base]
- Label is the real number that we are trying to predict for this example. If the label is omitted, then no training will be performed with the corresponding example, although VW will still compute a prediction.
- Importance (importance weight) is a non-negative real number indicating the relative importance of this example over the others. Omitting this gives a default importance of 1 to the example.
- Base is used for residual regression. It is added to the prediction before computing an update. The default value is 0.
- When using logistic or hinge loss, the labels need to be from the set {+1,-1} (documented in the V6.1 tutorial slide deck, but not elsewhere).
1 1.0 |MetricFeatures:3.28 height:1.5 length:2.0 |Says black with white stripes |OtherFeatures NumberOfLegs:4.0 HasStripes
1 1.0 zebra|MetricFeatures:3.28 height:1.5 length:2.0 |Says black with white stripes |OtherFeatures NumberOfLegs:4.0 HasStripes
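Lines like the two above can also be generated programmatically. The following is a minimal sketch of a formatter for the simple label format; the function name format_simple_example and its argument layout are assumptions made for illustration, and Base is left out for brevity. It places the tag so that it touches the first |, and inserts a space before the | when there is no tag.

```python
def format_simple_example(label, namespaces, importance=None, tag=None):
    """Build one text example; `namespaces` is a list of (name, [(feature, value_or_None)])."""
    header = [str(label)]
    if importance is not None:
        header.append(str(importance))
    head = " ".join(header)
    # A tag must touch the first '|'; without a tag a space must precede it.
    head += f" {tag}" if tag else " "

    sections = []
    for name, features in namespaces:
        tokens = [name] + [f if v is None else f"{f}:{v}" for f, v in features]
        sections.append("|" + " ".join(tokens))
    return head + " ".join(sections)


namespaces = [
    ("MetricFeatures:3.28", [("height", 1.5), ("length", 2.0)]),
    ("Says", [("black", None), ("with", None), ("white", None), ("stripes", None)]),
    ("OtherFeatures", [("NumberOfLegs", 4.0), ("HasStripes", None)]),
]
tagged_line = format_simple_example(1, namespaces, importance=1.0, tag="zebra")
untagged_line = format_simple_example(1, namespaces, importance=1.0)
```

With these inputs, tagged_line and untagged_line reproduce the two example lines above. Passing an empty namespace name produces a section that starts with "| ", so its tokens are read as plain features.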
Multiclass
[label] [weight]
- label is the class that should be predicted for this example. It is an integer.
- weight (importance weight) is a non-negative real number indicating the relative importance of this example over the others. Omitting this gives a default importance of 1 to the example.
Multilabels
[[label][,]]+
- A list of comma-separated labels can be supplied. They are the classes that should be predicted for this example, and they are all integers.
Cost sensitive
There are three forms the cost sensitive label can take:
shared
label <cost> # rarely used
(<class>:[<cost>])*
- shared means that the features of this example are given to each other example when it is used in a multiline setting. Cost sensitive labels can be used in both single-line and multiline settings.
- label is rarely used these days.
- The third line is the most common form of the cost sensitive label. It is one or more classes, each followed by an optional cost. In test-only mode the cost can be omitted.
  - class can be a number or a string
  - cost must be a float
Further resources:
The label format for the weighted-all-pairs algorithm (--wap) and the cost-sensitive-one-against-all algorithm (--csoaa) is the same. This format is a sparse specification of costs per label (see csoaa.cc:parse_label() in the source tree for reference).
Here's an example:
echo "1:0 2:3 3:1.5 4:1 |f input features come here" | vw --csoaa 4
Preceding the 1st | char we have 4 classes: 1, 2, 3, 4, each of which has a cost (the number after the colon). It is important to specify the number of classes as an argument to vw (--csoaa 4) and to have class labels in the range [1,N] in the input (N=4 in this example). Since the representation is sparse, there's no need to have all labels in all lines.
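To make the sparse class:cost representation concrete, here is a hedged sketch that reads the label part of such a line into a dictionary. The function name parse_csoaa_label is hypothetical, and the sketch assumes integer class ids in [1,N] as in the --csoaa example above; it is not the parser from csoaa.cc.

```python
def parse_csoaa_label(label_text, num_classes):
    """Map each listed class to its cost; a missing cost (test mode) becomes None."""
    costs = {}
    for token in label_text.split():
        class_text, _, cost_text = token.partition(":")
        class_id = int(class_text)
        if not 1 <= class_id <= num_classes:
            raise ValueError(f"class {class_id} is outside [1, {num_classes}]")
        costs[class_id] = float(cost_text) if cost_text else None
    return costs


costs = parse_csoaa_label("1:0 2:3 3:1.5 4:1", num_classes=4)
cheapest_class = min(costs, key=costs.get)  # class with the lowest cost
```

Because the representation is sparse, a label such as "2:3 4:1" is also accepted and simply yields fewer entries.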
Contextual Bandit
(<action>:<cost>:<probability>)*
- action is the id of the action taken where we observed the cost (a positive integer in {1, ..., k})
- cost is the cost observed for this action (floating point, lower is better)
- probability is the probability (floating point, in [0..1]) with which the exploration policy chose this action when collecting the data
- There can be more than one cost logged, but this is rarely used.
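As an illustration of the action:cost:probability triplet, the sketch below parses the label part of a logged example and checks the ranges described above. The names CBLabel and parse_cb_label are hypothetical and chosen for this example only.

```python
from typing import List, NamedTuple


class CBLabel(NamedTuple):
    action: int         # id of the action taken, starting at 1
    cost: float         # observed cost, lower is better
    probability: float  # probability of choosing this action when logging


def parse_cb_label(label_text: str) -> List[CBLabel]:
    """Parse zero or more action:cost:probability triplets separated by whitespace."""
    labels = []
    for token in label_text.split():
        action_text, cost_text, prob_text = token.split(":")
        label = CBLabel(int(action_text), float(cost_text), float(prob_text))
        if label.action < 1:
            raise ValueError("action must be a positive integer")
        if not 0.0 <= label.probability <= 1.0:
            raise ValueError("probability must lie in [0, 1]")
        labels.append(label)
    return labels
```

An empty label part returns an empty list, which corresponds to an example without a logged cost.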
Further resources:
Contextual Bandit Eval
Used when evaluating a policy instead of optimizing.
action (<action>:<cost>:<probability>)*
The action:cost:probability triplet is identical to that of the logging policy. The action that is supplied first is the action to evaluate for this example.
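Conditional Contextual Bandit: see the Conditional Contextual Bandit page.
Slates: see the Slates page.
CATS and CATS-pdf for Continuous Actions: see the CATS, CATS-pdf for Continuous Actions page.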
Tags can also be used to save the model on demand. This is especially useful in daemon mode, where at any moment during training you can save the current model to an arbitrary file by sending a dummy example whose Tag starts with save and optionally specifies the filename (no label or features are needed in this dummy example), e.g.:
vw --daemon --port 9999
cat data1.vw | nc localhost 9999
echo save_/tmp/my1.model | nc localhost 9999
cat data2.vw | nc localhost 9999
echo save_/tmp/my1and2.model | nc localhost 9999
If you are using the simple label format you can check that VW is correctly parsing your input by pasting a few lines into the VW validator.
LibSVM uses a simpler format than VW, which can be easily converted to VW format just by adding a pipe symbol between the label and the features.
perl -pe 's/\s/ | /' data.libsvm | vw -f model
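The same conversion can be sketched in Python for cases where Perl is not available. The function name libsvm_to_vw is hypothetical. Unlike the one-liner, this version ignores leading whitespace and always emits the pipe, even for a line that carries only a label.

```python
def libsvm_to_vw(line):
    """Insert ' | ' between the LibSVM label and its index:value features."""
    parts = line.split(None, 1)  # split once on the first run of whitespace
    if not parts:
        return ""  # blank line stays blank
    label = parts[0]
    features = parts[1].rstrip() if len(parts) > 1 else ""
    return f"{label} | {features}".rstrip()
```

Applied to each line of data.libsvm, the output can be piped into vw in the same way as the Perl version.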
For other formats (csv) and preprocessing see e.g. Phraug.
Additional useful information (especially regarding how categorical features are represented) can be found in this Stack Overflow post.