Rego provides a command-line batch interface to the RuleFit statistical model building program, making it possible to:
- Declaratively train and run a model using ensemble methods
- Export a model as production-ready SQL (for MySQL, SQLServer, Hive, and Netezza)
- Incorporate best practices for data pre-processing and model interpretation
- Provide a detailed, human-readable model explanation in HTML
Rego uses RuleFit, Stanford Professor Jerome Friedman's implementation of Rule Ensembles, an interpretable type of ensemble model where the base-learners consist of conjunctive rules derived from decision trees.
Rego is developed and maintained by Dr. Giovanni Seni, and was initially sponsored by the Data Engineering and Analytics group (IDEA) of Intuit, Inc.
Rego is released as open source under the Eclipse Public License - v 1.0.
Predictive learning plays an important role in many areas of science, finance and industry. Here are some examples of learning problems:
- Predict whether a customer would be attracted to a new service offering. Recognizing such customers can reduce the cost of a campaign by reducing the number of contacts.
- Predict whether a web site visitor is unlikely to become a customer. The prediction allows prioritization of customer support resources.
- Identify the risk factors for churn, based on the content of customer support messages.
Rego is a collection of R-based scripts intended to facilitate the process of building, interpreting, and deploying state-of-the-art predictive learning models. Rego can:
- Enable rapid experimentation
- Increase self-service capability
- Support easy model deployment into a production environment
Under the hood, Rego uses RuleFit, a statistical model building program created by Prof. Jerome Friedman. RuleFit is written in Fortran but has an R interface. It implements a model building methodology known as "ensembling", where multiple simple models (base learners) are combined into a single model that is usually more accurate than the best of its components. This type of model can be described as an additive expansion of the form:
F(x) = a0 + a1*b1(x) + a2*b2(x) + ... + aM*bM(x)
where the bj(x)'s are the base-learners.
In the case of RuleFit, the bj(x) terms are conjunctive rules of the form
if x1 > 22 and x2 > 27 then 1 else 0
or linear functions of a single variable -- e.g., bj(x) = xj.
Using base-learners of this type is attractive because they constitute easily interpretable statements about attributes xj. They also preserve the desirable characteristics of Decision Trees such as easy handling of categorical attributes, robustness to outliers in the distribution of x, etc.
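As a concrete illustration of the additive expansion, the sketch below evaluates a toy model with one conjunctive rule and one linear term. The coefficients (a0 = 0.5, a1 = 2.0, a2 = 0.1) and the function name score are made up for this example; they do not come from RuleFit.

```shell
# Toy rule ensemble: F(x) = a0 + a1*b1(x) + a2*b2(x)
#   b1(x) = 1 if x1 > 22 and x2 > 27, else 0   (the conjunctive rule from the text)
#   b2(x) = x1                                   (a linear term)
# The coefficients are hypothetical, chosen only for illustration.
score() {
  awk -v x1="$1" -v x2="$2" 'BEGIN {
    a0 = 0.5; a1 = 2.0; a2 = 0.1
    b1 = (x1 > 22 && x2 > 27) ? 1 : 0
    b2 = x1
    print a0 + a1 * b1 + a2 * b2
  }'
}

score 25 30   # rule fires: prints 5
score 10 30   # rule does not fire: prints 1.5
```

Each term can be read on its own: the rule adds 2.0 to the score whenever both conditions hold, and the linear term adds 0.1 per unit of x1.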
RuleFit builds model F(x) in a three-step process:
- build a tree ensemble (one where the bj(x)'s are decision trees),
- generate candidate rules from the tree ensemble, and
- fit coefficients aj via regularized regression.
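The third step can be written as a penalized fitting problem. The formulation below follows the lasso-penalized form described in Friedman and Popescu's work on rule ensembles; the loss L, the penalty weight lambda, and the sample size N are notation introduced here, and the exact loss and penalty used by a given RuleFit configuration are assumptions.

```latex
\{\hat{a}_j\}_{j=0}^{M}
  = \arg\min_{\{a_j\}}
    \sum_{i=1}^{N} L\Big(y_i,\; a_0 + \sum_{j=1}^{M} a_j\, b_j(x_i)\Big)
    + \lambda \sum_{j=1}^{M} |a_j|
```

With an absolute-value penalty of this kind, many coefficients are typically driven to exactly zero, so only a subset of the candidate rules remains in the final model.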
Rego consists of additional R code that we've written to make working with RuleFit easier, including:
- The ability to have multiple rulefit batch jobs running simultaneously
- Easily specifying a data source
- Automatically executing common preprocessing operations
- Automatically generating a model summary report with interpretation plots and quality assessment
- Exporting a model from R to SQL for deployment in a production environment
Install R and the following R packages: R2HTML, ROCR, RODBC, getopt
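One possible way to install the packages from the command line is sketched below. The helper name r_install_expr and the choice of CRAN mirror are assumptions made for this example; any mirror works, and you can equally call install.packages from an interactive R session.

```shell
# Packages listed in the text above.
REGO_R_PACKAGES="R2HTML ROCR RODBC getopt"

# Build an R install.packages() call for the given package names.
# The mirror URL is an assumption; substitute your preferred CRAN mirror.
r_install_expr() {
  quoted=""
  for pkg in "$@"; do
    quoted="${quoted:+$quoted, }\"$pkg\""
  done
  printf 'install.packages(c(%s), repos = "https://cloud.r-project.org")\n' "$quoted"
}

# To actually install (requires R and network access):
# Rscript -e "$(r_install_expr $REGO_R_PACKAGES)"
```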
REGO_HOME
: environment variable pointing to where you have checked out the Rego code
RF_HOME
: environment variable pointing to the appropriate RuleFit executable -- e.g., export RF_HOME=$REGO_HOME/lib/mac
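A minimal sketch of setting both variables in a shell profile follows. The checkout location $HOME/rego and the helper check_rego_env are hypothetical; the RF_HOME value reuses the example from the text.

```shell
# Hypothetical checkout location; adjust to where you cloned Rego.
export REGO_HOME="$HOME/rego"
# Example value from the text; pick the directory for your platform.
export RF_HOME="$REGO_HOME/lib/mac"

# Fail early if either variable is unset or empty.
check_rego_env() {
  for name in REGO_HOME RF_HOME; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set" >&2
      return 1
    fi
  done
}

check_rego_env && echo "Rego environment looks configured"
```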
Before using Rego, you must download the RuleFit binary appropriate to your platform:
Place the following files in $REGO_HOME/lib/RuleFit/windows/
Place the following file in $REGO_HOME/lib/RuleFit/linux/
Run chmod u+x ${REGO_HOME}/lib/RuleFit/linux/rf_go.exe to make the file executable.
Place the following file in $REGO_HOME/lib/RuleFit/mac/
Run chmod u+x $REGO_HOME/lib/RuleFit/mac/rf_go.exe to make the file executable.
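To confirm that the binary is in place and executable before starting a training job, a small check like the following can help. The function name rulefit_ready is hypothetical and not part of Rego.

```shell
# Returns 0 if rf_go.exe exists in the given directory and is executable.
rulefit_ready() {
  [ -x "$1/rf_go.exe" ]
}

# Example, using the Linux path from the text above:
# rulefit_ready "$REGO_HOME/lib/RuleFit/linux" || echo "rf_go.exe missing or not executable" >&2
```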
$REGO_HOME/bin/trainModel.sh --d=DATA.conf --m=MODEL.conf [--l LOG.txt]
- Input:
DATA.conf
: data configuration file specifying options such as where the data is coming from, what column corresponds to the target, etc.
MODEL.conf
: model configuration file specifying options such as the type of model being fit, the criteria being optimized, etc.
LOG.txt
: optional file name where to write logging messages
- Output:
model_summary.html
: model summary and assessment
model_singleplot.html
: interpretation plots
<Model definition files>
: for later export or prediction
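When training is scripted, it can be useful to verify that both configuration files exist before launching the job. The wrapper below is a sketch under that assumption; train_model and the DRY_RUN variable are hypothetical names, and only the trainModel.sh options shown above are used.

```shell
# Hypothetical wrapper around trainModel.sh.
# Usage: train_model DATA.conf MODEL.conf [LOG.txt]
# With DRY_RUN=1 the command line is printed instead of executed.
train_model() {
  data_conf=$1
  model_conf=$2
  log_file=${3:-}
  for f in "$data_conf" "$model_conf"; do
    if [ ! -f "$f" ]; then
      echo "missing config file: $f" >&2
      return 1
    fi
  done
  set -- "$REGO_HOME/bin/trainModel.sh" "--d=$data_conf" "--m=$model_conf"
  if [ -n "$log_file" ]; then
    set -- "$@" --l "$log_file"
  fi
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, DRY_RUN=1 train_model data_csv.conf model.conf train.log prints the command that would be run, provided both configuration files exist.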
$REGO_HOME/bin/exportModel.sh --m=MODEL.dir [--c=EXPORT.conf]
- Input:
MODEL_DIR
: path to model definition files
EXPORT.conf
: the export configuration file specifying options such as desired SQL dialect, type of scoring clause, etc.
- Output:
SQL_FILE
: output file name containing model as a SQL expression
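To give a sense of how a rule ensemble maps onto a SQL expression, the sketch below renders a single rule term as a coefficient multiplied by a CASE indicator. This only illustrates the idea; the helper rule_to_sql, the coefficient, and the resulting layout are assumptions and do not reproduce the actual output of exportModel.sh.

```shell
# Render one hypothetical rule term as SQL: coefficient * indicator.
# Usage: rule_to_sql COEF CONDITION [CONDITION...]
rule_to_sql() {
  coef=$1; shift
  cond=$1; shift
  for c in "$@"; do
    cond="$cond AND $c"   # a rule is a conjunction of conditions
  done
  printf '%s * (CASE WHEN %s THEN 1 ELSE 0 END)\n' "$coef" "$cond"
}

rule_to_sql 2.0 "x1 > 22" "x2 > 27"
# prints: 2.0 * (CASE WHEN x1 > 22 AND x2 > 27 THEN 1 ELSE 0 END)
```

Summing such terms with an intercept gives one scalar SQL expression per row, which is why a model of this kind can be scored directly inside the database.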
$REGO_HOME/bin/runModel.sh --m=MODEL.dir --d=DATA.conf
- Input:
MODEL_DIR
: path to model definition files
DATA.conf
: data configuration file specifying test data location
- Output:
- Text file with <id, y, yHat> tuples
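Once predictions are written, a quick error summary can be computed from the tuples. The sketch below assumes a whitespace-separated file with one id, y, yHat triple per line and no header row; the actual delimiter and layout of the runModel.sh output may differ, and mean_abs_error is a hypothetical helper.

```shell
# Mean absolute error over lines of the form: id y yHat
# Assumes whitespace-separated numeric columns and no header row.
mean_abs_error() {
  awk 'NF >= 3 {
         d = $2 - $3
         if (d < 0) d = -d
         sum += d
         n++
       }
       END { if (n > 0) printf "%.4f\n", sum / n }'
}

# Example with three made-up tuples:
printf '1 10 12\n2 20 19\n3 30 33\n' | mean_abs_error   # prints 2.0000
```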
$REGO_HOME/bin/runModelSQL.sh --host --dbIn --tblIn=<Feature table> --pk=<Primary Key> --model=<Model Definition SQL File> --dbOut --tblOut=<Score table>
- Input:
dbIn.tblIn
: new data to be scored
model
: previously built (and exported) model
- Output:
dbOut.tblOut
: Computed scores
$REGO_HOME/rfPardep_main.R -c PARDEP.conf
- Input:
PARDEP.conf
: data configuration file specifying variable to be plotted and partial dependence options
- Output:
- PNG file with partial dependence graph
These examples show how to use a CSV or RData file as a data source, using the R diamonds dataset.
The CSV and RData files were created in R as follows:
X <- ggplot2::diamonds
write.csv(X, file = "diamonds.csv", na = "", row.names = FALSE)
save(X, file = "diamonds.RData")
CSV:
# training
$REGO_HOME/bin/trainModel.sh --d=data_csv.conf --m=model.conf
# prediction on csv data file
$REGO_HOME/bin/runModel.sh --m=/tmp/REgo/Diamonds_wd/export --d=predict_csv.conf
# export model to SQL
$REGO_HOME/bin/exportModel.sh --m=/tmp/REgo/Diamonds_wd/export --c=$REGO_HOME/conf/EXPORT.conf
# prediction on db table
$REGO_HOME/bin/runModelSQL.sh --host= --dbIn= --tblIn=diamond_test --pk=id --model=rules_forSQL.txt --sql=HiveQL --typeOut=1
RData:
# training
$REGO_HOME/bin/trainModel.sh --d=data_rdata.conf --m=model.conf
# prediction on RData data file
$REGO_HOME/bin/runModel.sh --m=/tmp/REgo/Diamonds_wd/export --d=predict_rdata.conf
If using RData as the data source type for either training or prediction, then it must be used for both. This is because the order of factor levels may be different for the two source types.
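The factor-level issue can be seen with the cut column of the diamonds data, which ggplot2 stores as an ordered factor. R builds factor levels from plain strings in sorted order by default, so the same label can end up with a different integer code. The sketch below mimics both orderings; whether Rego's CSV reader produces factors in exactly this way is an assumption, and level_code is a hypothetical helper.

```shell
# Level order of "cut" as stored in ggplot2::diamonds (an ordered factor).
rdata_levels="Fair
Good
Very Good
Premium
Ideal"

# Alphabetical order, as obtained when levels are rebuilt from plain strings.
csv_levels=$(printf '%s\n' "$rdata_levels" | LC_ALL=C sort)

# 1-based position of level $2 in the newline-separated list $1.
level_code() {
  printf '%s\n' "$1" | awk -v lvl="$2" '$0 == lvl { print NR }'
}

level_code "$rdata_levels" "Ideal"   # prints 5
level_code "$csv_levels" "Ideal"     # prints 3
```

A model trained against one encoding and scored against the other would therefore interpret the same category differently.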