Unsupported vector type on datasource that provides it #21

Closed
obones opened this issue Jun 30, 2017 · 3 comments

@obones
obones commented Jun 30, 2017

Hello,

We are using Spark with a custom datasource that directly provides a dataframe with a label column and a vector (features) column, which saves us from using a VectorAssembler in the pipeline.
While this works just fine for training ML models, we can't export them to PMML using jpmml-sparkml because we receive this error:
java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type

Looking around on various sites, I see that it comes from the fact that jpmml-sparkml does not know how to handle our dataframe. What metadata are we missing so that our models can be exported to PMML?

As a workaround, we can have "split" data and use a VectorAssembler but it uses some computation time that we feel is a bit wasted.

@vruusmann
Member

Duplicate of #18 and #2 (and probably some others)

> I see that it comes from the fact that jpmml-sparkml does not know how to handle our dataframe. What metadata are we missing so that our models can be exported to PMML?

The VectorUDT data type does not provide an adequate description of your dataframe. At minimum, it would be necessary to know the number of feature columns packed into the vector, but there is no method VectorUDT#numDimensions (or similar).

Perhaps it will be possible to create a subclass of VectorUDT that does so.

> As a workaround, we can have "split" data and use a VectorAssembler but it uses some computation time that we feel is a bit wasted.

You can waste computation time, or you can waste your own time.

If you think that your time is more abundant than computer time, then you can try creating a synthetic dataframe schema definition, as explained here: #18 (comment)

@obones
Author

obones commented Jun 30, 2017

Thanks, I'll see what I can do with the "synthetic definition", as using a VectorAssembler adds a time penalty of anywhere from 1 to 10%.

@vruusmann
Member

You don't need to embed and execute the VectorAssembler transformation in your actual data pipeline.

The idea is to create a pair of "synthetic" StructType and PipelineModel objects based on actual schema and fitted pipeline model objects. This synthetic PipelineModel object contains a synthetic VectorAssembler stage in the first position, which references columns in your synthetic StructType object. The important point is that VectorAssembler makes the number of columns in your dataframe known to JPMML-SparkML via the VectorAssembler#inputCols() parameter.
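
A minimal PySpark sketch of that idea, under stated assumptions: the fitted object is a `PipelineModel`, the label is a double column, and the vector width `num_features` is known from the datasource. The helper names (`synthetic_column_names`, `build_synthetic_pair`) and the `x1, x2, ...` column names are hypothetical, chosen only for illustration. The Spark imports are done lazily inside the second function, so it needs pyspark and an active SparkSession only when it is actually called.

```python
from typing import List


def synthetic_column_names(num_features: int, prefix: str = "x") -> List[str]:
    """Return one placeholder column name per element of the features vector."""
    if num_features < 1:
        raise ValueError("num_features must be positive")
    return [f"{prefix}{i + 1}" for i in range(num_features)]


def build_synthetic_pair(fitted_model, label_col: str, features_col: str,
                         num_features: int):
    """Build a (StructType, PipelineModel) pair intended for export only.

    Assumes `fitted_model` is a pyspark.ml.PipelineModel whose first stage
    reads the vector column `features_col`, and that the label is a double.
    """
    # Lazy imports: these need pyspark installed and a running SparkSession.
    from pyspark.ml import PipelineModel
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql.types import DoubleType, StructField, StructType

    names = synthetic_column_names(num_features)

    # Synthetic schema: the label plus one scalar column per vector element.
    schema = StructType(
        [StructField(label_col, DoubleType(), False)]
        + [StructField(name, DoubleType(), False) for name in names]
    )

    # Synthetic first stage: its inputCols expose the number of columns.
    assembler = VectorAssembler(inputCols=names, outputCol=features_col)

    # Prepend the assembler to the stages of the already fitted pipeline.
    synthetic_model = PipelineModel(stages=[assembler] + list(fitted_model.stages))
    return schema, synthetic_model
```

The returned pair would then be handed to the converter in place of the real schema and pipeline model. The training pipeline itself is left untouched, so no VectorAssembler runs over the actual data.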

Anyway, if a 10% time penalty is such a huge deal for your use case, then you should probably avoid the PMML approach.
