Pandas dataframe input
Pull request #2426 introduces a generic extensible framework for VW to understand structured Pandas dataframes.
The class DFtoVW in vowpalwabbit.pyvw takes as input a pandas.DataFrame and special types (SimpleLabel, Feature, Namespace) that specify the desired VW conversion.
These classes make extensive use of a class Col that refers to a given column in the user-specified dataframe.
A simpler interface, DFtoVW.from_colnames, can also be used for simple use cases. Its main benefit is that the user does not need to use the specific types.
Below are some usages of this class. They all rely on the following pandas.DataFrame called df:
  house_id  need_new_roof  price  sqft   age  year_built
0      id1              0   0.23  0.25  0.05        2006
1      id2              1   0.18  0.15  0.35        1976
2      id3              0   0.53  0.32  0.87        1924
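For readers who want to follow along, the frame above can be rebuilt as follows. This is a minimal sketch: the values are copied from the table, and the column dtypes are left for pandas to infer.

```python
import pandas as pd

# Values copied from the table above; dtypes are inferred by pandas.
df = pd.DataFrame(
    {
        "house_id": ["id1", "id2", "id3"],
        "need_new_roof": [0, 1, 0],
        "price": [0.23, 0.18, 0.53],
        "sqft": [0.25, 0.15, 0.32],
        "age": [0.05, 0.35, 0.87],
        "year_built": [2006, 1976, 1924],
    }
)
print(df)
```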
Let's say we want to build a VW dataset with the target need_new_roof and the features age and year_built:
from vowpalwabbit.pyvw import DFtoVW
conv = DFtoVW.from_colnames(y="need_new_roof", x=["age", "year_built"], df=df)
Then we can use the method process_df:
conv.process_df()
which outputs the following list:
['0 | 0.05 2006', '1 | 0.35 1976', '0 | 0.87 1924']
This list can then be consumed directly by the method pyvw.model.learn.
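As a rough sketch of that last step, the helper below feeds each converted line to a learner. The name learn_examples is hypothetical, and the helper assumes only that the model object exposes a learn method accepting one VW-formatted string. The recording stand-in exists purely so the snippet runs without VW installed.

```python
def learn_examples(model, examples, passes=1):
    """Feed VW-formatted strings to `model.learn`, one line at a time.

    `learn_examples` is a hypothetical helper, not part of pyvw. It assumes
    `model` has a `learn(str)` method. Returns the number of calls made.
    """
    seen = 0
    for _ in range(passes):
        for line in examples:
            model.learn(line)
            seen += 1
    return seen


class RecordingModel:
    """Stand-in for a VW model: it only records the lines it is given."""

    def __init__(self):
        self.lines = []

    def learn(self, line):
        self.lines.append(line)


# The list produced by process_df in the example above.
examples = ["0 | 0.05 2006", "1 | 0.35 1976", "0 | 0.87 1924"]
model = RecordingModel()
count = learn_examples(model, examples, passes=2)
```

With VW installed, a real model object would be passed in place of RecordingModel.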
The class DFtoVW also allows the following patterns in its default constructor:
- tag
- (named) namespaces, with scaling factor
- (named) features, with constant feature possible
To use these more complex patterns, we need to import the specific types:
from vowpalwabbit.pyvw import SimpleLabel, Namespace, Feature, Col
Let's create a VW dataset that includes a named namespace (with scaling) and a named feature:
conv = DFtoVW(
    df=df,
    label=SimpleLabel(Col("need_new_roof")),
    namespaces=Namespace(name="Imperial", value=0.092, features=Feature(value=Col("sqft"), name="sqm"))
)
conv.process_df()
which yields:
['0 |Imperial:0.092 sqm:0.25',
'1 |Imperial:0.092 sqm:0.15',
'0 |Imperial:0.092 sqm:0.32']
Let's create a more complex example with a tag and multiple namespaces, each with one or more features:
conv = DFtoVW(
    df=df,
    label=SimpleLabel(Col("need_new_roof")),
    tag=Col("house_id"),
    namespaces=[
        Namespace(name="Imperial", value=0.092, features=Feature(value=Col("sqft"), name="sqm")),
        Namespace(name="DoubleIt", value=2, features=[Feature(value=Col("price")), Feature(Col("age"))])
    ]
)
conv.process_df()
which yields:
['0 id1|Imperial:0.092 sqm:0.25 |DoubleIt:2 0.23 0.05',
'1 id2|Imperial:0.092 sqm:0.15 |DoubleIt:2 0.18 0.35',
'0 id3|Imperial:0.092 sqm:0.32 |DoubleIt:2 0.53 0.87']
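To make the layout of these strings explicit, here is a small pure-Python sketch that rebuilds the first line of the last output from plain values. It is not the DFtoVW implementation: format_vw_line and its argument shapes are hypothetical and only mirror the "label tag|Namespace:scale name:value" pattern visible in the outputs above.

```python
def format_vw_line(label, namespaces, tag=None):
    """Build one VW text line.

    namespaces: list of (name, scale, features) tuples, where features is a
    list of (feature_name, value) pairs and feature_name may be None for an
    unnamed feature. Hypothetical helper, for illustration only.
    """
    head = str(label) if tag is None else f"{label} {tag}"
    parts = []
    for name, scale, features in namespaces:
        feats = " ".join(
            str(value) if fname is None else f"{fname}:{value}"
            for fname, value in features
        )
        parts.append(f"|{name}:{scale} {feats}")
    # With a tag, the first bar follows the tag directly, as in the output above;
    # without one, a space separates the label from the first bar.
    sep = "" if tag is not None else " "
    return head + sep + " ".join(parts)


line = format_vw_line(
    label=0,
    tag="id1",
    namespaces=[
        ("Imperial", 0.092, [("sqm", 0.25)]),
        ("DoubleIt", 2, [(None, 0.23), (None, 0.05)]),
    ],
)
print(line)
```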
- The class DFtoVW and the specific types are located in vowpalwabbit/pyvw.py. The class only depends on the pandas module.
- The code includes docstrings.
- 8 tests are included in tests/test_pyvw.py.
- This framework does not yet handle multiline examples or more complex label types.
- To convert a very large dataset that cannot fit in RAM, one can use the pandas import option chunksize and process one chunk at a time. This functionality could be implemented directly in the class using a generator. The generator would then either be consumed by a VW learning interface or be written to an external file (for conversion purposes only).
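The chunked approach suggested in the last point could look roughly like the sketch below. It is not part of the class: iter_vw_lines is a hypothetical helper that takes any iterable of chunks plus a conversion callable, so the same generator would work with the iterator pandas returns when chunksize is set. The toy converter and the dict-based chunks stand in for DFtoVW and real dataframe chunks so that the snippet runs on its own.

```python
def iter_vw_lines(chunks, convert_chunk):
    """Lazily yield VW lines, converting one chunk at a time.

    chunks: iterable of chunks (e.g. what pandas.read_csv returns when
        chunksize is set).
    convert_chunk: callable mapping one chunk to a list of VW strings
        (e.g. lambda c: DFtoVW.from_colnames(y=..., x=..., df=c).process_df()).
    Hypothetical helper, for illustration only.
    """
    for chunk in chunks:
        # Only the current chunk and its converted lines are held in memory.
        yield from convert_chunk(chunk)


def toy_convert(rows):
    """Stand-in converter: rows are dicts with 'need_new_roof' and 'age'."""
    return [f"{r['need_new_roof']} | {r['age']}" for r in rows]


chunks = [
    [{"need_new_roof": 0, "age": 0.05}, {"need_new_roof": 1, "age": 0.35}],
    [{"need_new_roof": 0, "age": 0.87}],
]
lines = list(iter_vw_lines(chunks, toy_convert))
```

Because the helper is a generator, its output can be passed line by line to a learner or written to a file without ever materialising the full dataset.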