How to import the training data schema in libsvm format #116
Comments
Have you verified that other parts of the PMML conversion workflow work as expected? For example, if the training data contains vector columns, then the JPMML-SparkML library will probably raise an error about it.
Convert the training data to proper Alternatively, you may construct the schema descriptor (a |
Anyhow, if you want me to look deeper into this, then please provide a small self-contained & fully reproducible example. You could convert the Iris dataset to LibSVM data format, and then report everything that's going wrong with it. Otherwise, I simply won't have the time.
OK, thanks for your reply!
2.1 labelIndexer 2.2 featureIndexer
3.2 Convert indexed labels back to original labels.
3.3 Chain indexers and GBT in a Pipeline.
3.4 Train model. This also runs the indexers.
4.1 Using the schema of the LibSVM data (data#schema) raises the error "Expected string, integral, double or boolean type, got vector type".
4.2 Constructing the schema manually raises the error "Field "features" does not exist."
So I want to know how to save the model to PMML when I use training data in LibSVM format like this. Thanks!
I told you something like this will happen. See #26 and friends.
You're mis-representing the data here, isn't it so? Your data frame contains a single n-element vector column, but the |
The fix would be to add support for the I've refused to do it in earlier years, but maybe I'll do it this year. |
@sunxiaolongsf To answer your original question ("how to handle LibSVM data"): you'd still need to manually unpack all array/vector columns to scalar columns before performing the conversion to PMML. Rough outline:
The resulting PMML will contain information about step 2 onward. It knows nothing about the vector columns of step 1.
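To make the "unpack vector columns to scalar columns" idea concrete, here is a minimal plain-Java sketch of what the unpacking does to a single LibSVM row. It is only an illustration and does not use Spark or JPMML-SparkML: the class `LibSvmRowExpander`, the method `expand`, and the column names `label` and `f_1`, `f_2`, ... are hypothetical names chosen for this example. It assumes the usual LibSVM line layout, `label index:value index:value ...` with 1-based indices, and that the number of features is known in advance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Spark or JPMML-SparkML): expands one sparse
// LibSVM row into named scalar values, one entry per feature position.
final class LibSvmRowExpander {

    // Parses "label idx:val idx:val ..." (1-based indices) into an ordered map
    // with keys "label", "f_1", ..., "f_<numFeatures>". Indices that do not
    // appear in the line are filled with 0.0.
    static Map<String, Double> expand(String line, int numFeatures) {
        String[] tokens = line.trim().split("\\s+");
        Map<String, Double> row = new LinkedHashMap<>();
        row.put("label", Double.parseDouble(tokens[0]));
        for (int i = 1; i <= numFeatures; i++) {
            row.put("f_" + i, 0.0);
        }
        for (int t = 1; t < tokens.length; t++) {
            String[] pair = tokens[t].split(":");
            int index = Integer.parseInt(pair[0]);
            if (index < 1 || index > numFeatures) {
                throw new IllegalArgumentException("Feature index out of range: " + index);
            }
            row.put("f_" + index, Double.parseDouble(pair[1]));
        }
        return row;
    }

    public static void main(String[] args) {
        // prints {label=1.0, f_1=5.1, f_2=0.0, f_3=1.4, f_4=0.0}
        System.out.println(expand("1 1:5.1 3:1.4", 4));
    }
}
```

In a real workflow the same expansion would be applied to the Spark data frame itself, so that the pipeline is trained on named scalar columns rather than on a single vector column.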
Hi @vruusmann, does jpmml-sparkml support the LibSVM format or the vector data type now?
@githubthunder The issue is still in "open" state, meaning that there hasn't been any major work done towards addressing it. Anyway, what's wrong with the workflow suggested in my earlier comment (#116 (comment))? It lets you use a LibSVM dataset, if you're willing to throw in a couple of lines of data manipulation code. The main issue with the LibSVM data format (and the vector data/column type) is that it is effectively schema-less. The PMML standard is about structured ML applications (think: statistics). And structured ML applications require basic information/understanding about the underlying data, such as column names, column types, etc.
Hi @vruusmann, thanks for your replies. If n is very large (meaning there are many features), the data frame will have numerous columns, which could lead to excessive use of space. Can jpmml-sparkml provide an interface that accepts a schema of 'label: DOUBLE, features: vector', where the vector may be in sparse format, and automatically generates feature names based on the vector order, similar to 'f_1, f_2, ...'?
Older discussion(s) regarding vector columns: #21
Something like that. When the converter comes across a vector column, then it would automatically expand it into a list of The biggest obstacle in implementing vector column support is that the Without this information, the converter does not know how many |
The workaround would be to require that the pipeline must contain a |
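On the user side, the missing "how many elements" information can often be recovered from the LibSVM file itself. The sketch below is a plain-Java illustration under stated assumptions, not anything provided by JPMML-SparkML: the class `LibSvmFeatureCounter` and the method `countFeatures` are hypothetical names, and the input is assumed to follow the `label index:value ...` layout with 1-based indices. It returns the largest index seen, which may underestimate the true dimensionality if the trailing features never occur in the data.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Spark or JPMML-SparkML): estimates the
// vector size of a LibSVM dataset as the largest feature index that occurs.
final class LibSvmFeatureCounter {

    static int countFeatures(List<String> lines) {
        int maxIndex = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] tokens = trimmed.split("\\s+");
            // tokens[0] is the label; the remaining tokens are "index:value" pairs
            for (int t = 1; t < tokens.length; t++) {
                int colon = tokens[t].indexOf(':');
                if (colon < 0) {
                    throw new IllegalArgumentException("Malformed entry: " + tokens[t]);
                }
                int index = Integer.parseInt(tokens[t].substring(0, colon));
                maxIndex = Math.max(maxIndex, index);
            }
        }
        return maxIndex;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1 1:5.1 3:1.4", "0 2:3.0 4:0.2");
        System.out.println(countFeatures(lines)); // prints 4
    }
}
```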
Hi @vruusmann, thanks again for your replies and work. If the VectorUDT type does not carry any information about the "vector size", maybe the interface could add an input parameter "numFeatures", provided by the user. If numFeatures > 0, the interface would use this value for its calculations; otherwise it would follow the current processing logic. The code may look as follows:
|
It's not permitted to change the But since we're dealing with a builder pattern here, it's possible to add more "configuration" methods. For example, the
// One continuous scalar feature per vector element, named f_1 .. f_<numFeatures>
List<Feature> listOfScalarFeatures = new ArrayList<>();
for(int i = 0; i < numFeatures; i++){
    listOfScalarFeatures.add(new ContinuousFeature(encoder, "f_" + String.valueOf(i + 1), DataType.DOUBLE));
}
PMMLBuilder pmmlBuilder = new PMMLBuilder(training.schema(), pipelineModel)
    // THIS!
    .defineColumn("features", listOfScalarFeatures);
PMML pmml = pmmlBuilder.build();
Hi @vruusmann, thanks again. Maybe the "defineColumn" method is a simple and effective solution for supporting the sparse format. I am really looking forward to your work. Also, could you add the "defineColumn" feature to the older versions as well?
I want to export a model with jpmml-sparkml. If the training data is in another format, I know to use dataframe.schema, but when the training data is in LibSVM format, what should I pass to this function: