C# Binding
This is a tutorial for the Vowpal Wabbit C# binding. Here's a list of major features:
- Very efficient serialization from managed to native space using runtime compilation
- Declarative specification of example data structure
- Thread-safety through object pooling and shared models
- Example level caching (prediction only)
- Improved memory management
The binding exposes three different options to interact with native Vowpal Wabbit, each having pros and cons:
- User defined data types: use VW.VowpalWabbit<TUserType>
- Generic data structures (e.g. records consisting of key/value/type tuples): use VW.VowpalWabbit together with VW.VowpalWabbitExampleBuilder
- String based examples: use VW.VowpalWabbit
Install the Vowpal Wabbit NuGet Package using
Install-Package VowpalWabbit
The NuGet package includes:
- C++ part of Vowpal Wabbit compiled for Windows x64 Release
- C++/CLI wrapper
- C# wrapper supporting declarative data to feature conversion
- PDB debug symbols (for Windows newbies, PDB here is a program database file, not the Python debugger)
- Source files
- IntelliSense documentation
- zlib native dll
Debug symbols (PDB) are post-processed using GitLink and included with the NuGet packages. From GitLink: When using GitLink, the user no longer has to specify symbol servers. The only requirement is to check the Enable source server support option in Visual Studio.
Throughout the examples, the following dataset from the Rcv1 example is used:
1 |f 13:3.9656971e-02 24:3.4781646e-02 69:4.6296168e-02 85:6.1853945e-02 ...
0 |f 9:8.5609287e-02 14:2.9904654e-02 19:6.1031535e-02 20:2.1757640e-02 ...
...
Pros and cons of user defined data types:
Pro | Cons |
---|---|
very performant | one-time overhead of serializer compilation |
declarative data to feature conversion | |
The following class Row is an example of a user defined type usable by the serializer.
using VW.Interfaces;
using VW.Serializer.Attributes;
using System.Collections.Generic;
public class Row : IExample
{
    [Feature(FeatureGroup = 'f', Namespace = "eatures", Name = "const", Order = 2)]
    public float Constant { get; set; }
    [Feature(FeatureGroup = 'f', Namespace = "eatures", Order = 1)]
    public IList<KeyValuePair<string, float>> Features { get; set; }
    public string Line { get; set; }
    public ILabel Label { get; set; }
}
The serializer follows an opt-in model, thus only properties annotated using [Feature] are transformed into Vowpal Wabbit features. The [Feature] attribute supports the following properties:
Property | Description | Default |
---|---|---|
FeatureGroup | it's the first character of the namespace in the string format | Space |
Namespace | concatenated with the FeatureGroup | 0 = hash(Namespace) |
Name | name of the feature (e.g. 13, 24, 69 from the example above) | property name |
Enumerize | if true, features will be converted to string and then hashed. e.g. VW line format: Age_15 (Enumerize=true), Age:15 (Enumerize=false) | false |
Order | feature serialization order. Useful for comparison with VW command line version | 0 |
StringProcessing | String features are either escaped (spaces are replaced with underscores) or split by space producing individual features. | Split |
AddAnchor | use with dense features and --interact to mark the beginning of a set of dense features | false |
Dictify | when generating Vowpal Wabbit string formatted examples, this will replace the annotated feature with a surrogate. The serialized feature will be stored and checked against a dictionary passed to VowpalWabbitSerializer.SerializeToString(). | false |
Furthermore, the serializer recursively traverses all properties of the supplied example type in search of more [Feature] attributed properties (note: recursive data structures are not supported). Feature groups, namespaces and dictify are inherited from parent properties and can be overridden for sub trees. Finally, all annotated properties are put into the corresponding namespaces.
using VW.Serializer.Attributes;
public class ParentRow
{
    [Feature(FeatureGroup = 'f')]
    public CommonFeatures UserFeatures { get; set; }
    [Feature(FeatureGroup = 'f')]
    public string Country { get; set; }
    [Feature(FeatureGroup = 'g', Enumerize = true)]
    public int Age { get; set; }
}
public class CommonFeatures
{
    [Feature]
    public int A { get; set; }
    [Feature(FeatureGroup = 'g', Name = "Beta")]
    public float B { get; set; }
}
// ...
var row = new ParentRow
{
    UserFeatures = new CommonFeatures
    {
        A = 2,
        B = 3.1f
    },
    Country = "Austria",
    Age = 25
};
The Vowpal Wabbit string equivalent of the above instance is:
|f A:2 Country:Austria |g Beta:3.1 Age_25
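The grouping into namespaces and the effect of Enumerize can be mimicked in plain C#. The following is a minimal sketch, not the binding's serializer: SketchFeature and VwLineSketch are hypothetical names introduced here for illustration, and the sketch only reproduces the textual form shown above (no hashing, no Namespace, Order or StringProcessing handling).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical stand-in for one annotated property; not a type of the VW binding.
public sealed class SketchFeature
{
    public char FeatureGroup { get; set; }
    public string Name { get; set; }
    public object Value { get; set; }
    public bool Enumerize { get; set; }
}

public static class VwLineSketch
{
    // "Name_Value" when enumerized, "Name:Value" otherwise.
    public static string Format(SketchFeature f)
    {
        // Invariant culture keeps '.' as the decimal separator.
        var value = Convert.ToString(f.Value, CultureInfo.InvariantCulture);
        return f.Enumerize ? f.Name + "_" + value : f.Name + ":" + value;
    }

    // Groups features by feature group; groups appear in first-seen order.
    public static string ToVwLine(IEnumerable<SketchFeature> features)
    {
        var groups = features
            .GroupBy(f => f.FeatureGroup)
            .Select(g => "|" + g.Key + " " + string.Join(" ", g.Select(Format)));
        return string.Join(" ", groups);
    }
}
```

Feeding it the four features of the ParentRow instance in traversal order (A, Beta, Country, Age) yields the two namespaces f and g, with Age rendered in the enumerized form.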
using (var vw = new VW.VowpalWabbit<Row>("-f rcv1.model"))
{
    var userExample = new Row { /* ... */ };
    vw.Learn(userExample, new SimpleLabel { /* ... */ });
    var prediction = vw.Predict(userExample, VowpalWabbitPredictionType.Scalar);
}
- Serializers are globally cached per type (read: static variable). I.e., there's a static dictionary from user-defined types to serializers.
- Native example memory is cached using a pool per VW.VowpalWabbit instance. Each Learn/Predict call will either get memory from the pool or allocate new memory.
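The per-type serializer cache described in the first note can be pictured as a static dictionary keyed by type. This is a minimal sketch under that assumption, not the binding's implementation: SerializerCacheSketch and GetOrCompile are hypothetical names, and a plain delegate stands in for the compiled serializer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical illustration of "one serializer per user type, cached statically".
public static class SerializerCacheSketch
{
    private static readonly ConcurrentDictionary<Type, object> Cache =
        new ConcurrentDictionary<Type, object>();

    private static int compileCount;

    // Number of times the (expensive) compile step actually ran.
    public static int CompileCount => compileCount;

    // Returns the cached serializer for T, compiling it on first use.
    // Note: under contention GetOrAdd may run the factory more than once,
    // but only one result is stored and returned to all callers.
    public static Func<T, string> GetOrCompile<T>(Func<Func<T, string>> compile)
    {
        Func<Type, object> factory = _ =>
        {
            Interlocked.Increment(ref compileCount);
            return compile();
        };
        return (Func<T, string>)Cache.GetOrAdd(typeof(T), factory);
    }
}
```

The one-time compilation overhead is paid on the first call for a given type; later calls for the same type return the cached delegate.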
The serialization infrastructure is extensible by providing type based custom featurizers (VowpalWabbitSettings.CustomFeaturizer). Consider the following example:
public class CustomClass
{
    public int X { get; set; }
}
public class MyContext
{
    [Feature]
    public CustomClass Feature { get; set; }
}
public class CustomFeaturizer
{
    public void MarshalFeature(VowpalWabbitMarshalContext context, Namespace ns, Feature feature, CustomClass value)
    {
        var featureHash = context.VW.HashFeature("prefix" + feature.Name, ns.NamespaceHash);
        context.NamespaceBuilder.AddFeature(featureHash, value.X);
        context.AppendStringExample(feature.Dictify, " prefix{0}:{1}", feature.Name, value.X);
    }
}
var context = new MyContext() { Feature = new CustomClass() { X = 5 } };
using (var vw = new VowpalWabbit<MyContext>(new VowpalWabbitSettings(customFeaturizer: new List<Type> { typeof(CustomFeaturizer) })))
{
    vw.Learn(context);
}
In the above example features of type CustomClass will be marshalled using CustomFeaturizer. The serializer infrastructure looks for methods of the form public void MarshalFeature(VowpalWabbitMarshalContext context, Namespace ns, Feature feature, CustomClass value) among others. A good reference is VowpalWabbitDefaultMarshaller which is internally added to the same featurizer list. Custom featurizers are given priority over the default marshaller. Serializing data to string format (see VowpalWabbitMarshalContext.AppendStringExample) is optional when working through the native interface only, but considered good practice to ease debugging.
Pros and cons of generic data structures:
Pro | Cons |
---|---|
very performant | results might not be reproducible using VW binary as it allows for feature representation not expressible through the string format |
provides maximum flexibility with feature representation | System.Linq.Expression based, a bit harder to use |
suited for generic data structures (e.g. records, data table, ...) | --affix is not supported, though easy to replicate in C# |
Note that using VowpalWabbitDefaultMarshaller is another option. Its biggest upside is that it optionally generates the corresponding VW string features, but it may be less flexible than the direct approach below.
using (var vw = new VW.VowpalWabbit("-f rcv1.model"))
{
    // 1 |f 13:3.9656971e-02 24:3.4781646e-02 69:4.6296168e-02
    using (var exampleBuilder = new VW.VowpalWabbitExampleBuilder(vw))
    {
        // important to dispose the namespace builder at the end, as data is only added to the example
        // if there is any feature added to the namespace
        using (var ns = exampleBuilder.AddNamespace('f'))
        {
            var namespaceHash = vw.HashSpace("f");
            var featureHash = vw.HashFeature("13", namespaceHash);
            ns.AddFeature(featureHash, 3.9656971e-02f);
            featureHash = vw.HashFeature("24", namespaceHash);
            ns.AddFeature(featureHash, 3.4781646e-02f);
            featureHash = vw.HashFeature("69", namespaceHash);
            ns.AddFeature(featureHash, 4.6296168e-02f);
        }
        exampleBuilder.ParseLabel("1");
        // hand over of memory management
        using (var example = exampleBuilder.CreateExample())
        {
            vw.Learn(example);
        }
    }
}
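The cons listed for this approach mention that --affix is not supported but easy to replicate in C#. A minimal sketch of such a replication follows, assuming the usual meaning of --affix (extra features built from fixed-length prefixes and suffixes of feature names). AffixSketch and the "+3=" / "-3=" naming scheme are hypothetical choices made here; the resulting hashes are not claimed to match what the VW binary produces with --affix.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; not part of the VW binding.
public static class AffixSketch
{
    // First `length` characters of `name` (the whole name if it is shorter).
    public static string Prefix(string name, int length) =>
        name.Substring(0, Math.Min(length, name.Length));

    // Last `length` characters of `name` (the whole name if it is shorter).
    public static string Suffix(string name, int length) =>
        name.Substring(name.Length - Math.Min(length, name.Length));

    // Yields each original name followed by its prefix and suffix features.
    // A length of 0 disables the corresponding affix.
    public static IEnumerable<string> Expand(
        IEnumerable<string> names, int prefixLength, int suffixLength)
    {
        foreach (var name in names)
        {
            yield return name;
            if (prefixLength > 0)
                yield return "+" + prefixLength + "=" + Prefix(name, prefixLength);
            if (suffixLength > 0)
                yield return "-" + suffixLength + "=" + Suffix(name, suffixLength);
        }
    }
}
```

Each expanded name could then be hashed with vw.HashFeature and added through the namespace builder in the same way as the features in the example above.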
Pros and cons of string based examples:
Pro | Cons |
---|---|
no pitfalls when it comes to reproducibility/compatibility when used together with VW binary | slowest variant due to string marshaling (and character encoding differences between the C# and C++ worlds) |
supports affixes | |
using (var vw = new VW.VowpalWabbit("-f rcv1.model"))
{
    vw.Learn("1 |f 13:3.9656971e-02 24:3.4781646e-02 69:4.6296168e-02");
    // read more data ...
    var prediction = vw.Predict<VW.VowpalWabbitScalarPrediction>("|f 9:8.5609287e-02 14:2.9904654e-02 19:6.1031535e-02 20:2.1757640e-02");
    System.Console.WriteLine("Prediction: " + prediction.Value);
}
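When string examples are assembled in C# rather than read from a file, number formatting deserves some care. This is a general .NET concern rather than something specific to the binding: float.ToString() is culture-sensitive and may emit a decimal comma, which is not valid in the VW text format. The following minimal sketch builds a line using the invariant culture; VwStringSketch and ToLine are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper; not part of the VW binding.
public static class VwStringSketch
{
    // Builds "label |ns name:value name:value ..." with culture-invariant numbers.
    // An empty or null label produces an unlabeled line suitable for prediction.
    public static string ToLine(
        string label, char ns, IEnumerable<KeyValuePair<string, float>> features)
    {
        var parts = features.Select(f =>
            f.Key + ":" + f.Value.ToString("R", CultureInfo.InvariantCulture));
        var prefix = string.IsNullOrEmpty(label) ? "" : label + " ";
        return prefix + "|" + ns + " " + string.Join(" ", parts);
    }
}
```

The resulting string can be passed to vw.Learn or vw.Predict exactly like the literal lines in the example above.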
VW.VowpalWabbit instances are not thread-safe, but object pools and shared models enable multi-threaded scenarios without multiplying the memory requirements by the number of threads.
Consider the following excerpt from the TestSharedModel unit test:
var vwModel = new VowpalWabbitModel("-t -i m1.model");
using (var pool = new VowpalWabbitThreadedPrediction<Row>(vwModel))
{
    using (var vw = pool.GetOrCreate())
    {
        vw.Value.Predict(example);
    }
    pool.UpdateModel(new VowpalWabbitModel("-t -i m2.model"));
}
vwModel is the shared model. Each call to pool.GetOrCreate() either returns a new instance spawned from the shared model or re-uses an existing one.
A very common scenario when scoring is to roll out updates of new models. The ObjectPool class allows safe updating of the factory and proper disposal. After the call to pool.UpdateModel(), pool.GetOrCreate() only returns instances spawned from the new shared model. VowpalWabbit instances that are not in use are disposed as part of the update. VowpalWabbit instances currently in use are disposed upon return to the pool (PooledObject.Dispose).
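The update-and-dispose behavior described above can be illustrated with a small self-contained pool. This is a sketch of the general technique under stated assumptions, not the binding's ObjectPool: SwappablePool, Rent and Lease are hypothetical names, and a version counter is one possible way to tell stale instances from current ones.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pool whose factory can be swapped while items are in use.
public sealed class SwappablePool<T> : IDisposable where T : class, IDisposable
{
    private readonly object gate = new object();
    private readonly Stack<T> idle = new Stack<T>();
    private Func<T> factory;
    private int version;

    public SwappablePool(Func<T> factory) { this.factory = factory; }

    // Re-uses an idle item or creates one from the current factory.
    public Lease Rent()
    {
        lock (gate)
        {
            var item = idle.Count > 0 ? idle.Pop() : factory();
            return new Lease(this, item, version);
        }
    }

    // Swaps the factory and disposes all idle items built by the old one.
    public void UpdateFactory(Func<T> newFactory)
    {
        T[] stale;
        lock (gate)
        {
            factory = newFactory;
            version++;
            stale = idle.ToArray();
            idle.Clear();
        }
        foreach (var item in stale) item.Dispose();
    }

    // Items from an older factory are disposed instead of being pooled again.
    private void Return(T item, int itemVersion)
    {
        lock (gate)
        {
            if (itemVersion == version) { idle.Push(item); return; }
        }
        item.Dispose();
    }

    // Disposes idle items; later Rent calls throw ObjectDisposedException.
    public void Dispose()
    {
        UpdateFactory(() => throw new ObjectDisposedException("SwappablePool"));
    }

    public sealed class Lease : IDisposable
    {
        private readonly SwappablePool<T> pool;
        private readonly int version;
        private bool returned;

        public T Value { get; }

        internal Lease(SwappablePool<T> pool, T value, int version)
        {
            this.pool = pool;
            Value = value;
            this.version = version;
        }

        // Hands the item back to the pool exactly once.
        public void Dispose()
        {
            if (returned) return;
            returned = true;
            pool.Return(Value, version);
        }
    }
}
```

For simplicity the sketch calls the factory while holding the lock; a production pool would likely create instances outside the lock to avoid blocking other threads during expensive model instantiation.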
To improve performance, especially in scenarios using action dependent features, examples can be cached on a per-VowpalWabbit-instance basis. To enable the example level cache, simply annotate the type using the [Cachable] attribute. This can only be used for predictions, as labels cannot be updated once an example is created. The cache size can be configured using VowpalWabbitSerializerSettings.
It's considered best practice to use the same annotated user types at training and scoring time. As example level caching is only supported for predictions, one must disable caching at training time using
new VowpalWabbit("", new VowpalWabbitSerializerSettings { EnableExampleCaching = false })