Unsupported vector type on datasource that provides it #21

obones · 2017-06-30T07:46:00Z

Hello,

We are using Spark with a custom datasource that directly provides a (label, features) dataframe in which features is already a vector column, which saves us from using a VectorAssembler in the pipeline.
While this works just fine to train ML models, we can't export them to PMML using jpmml-sparkml because we receive this error:
java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type

Looking around on various sites, I see that it comes from the fact that jpmml-sparkml does not know how to handle our dataframe. What metadata are we missing so that our models can be exported to PMML?

As a workaround, we can have "split" data and use a VectorAssembler but it uses some computation time that we feel is a bit wasted.


vruusmann · 2017-06-30T08:19:28Z

Duplicate of #18 and #2 (and probably some others)

> I see that it comes from the fact that jpmml-sparkml does not know how to handle our dataframe. What metadata are we missing so that our models can be exported to PMML?

The VectorUDT data type does not provide an adequate description of your dataframe. At minimum, it would be necessary to know the number of columns packed into the vector, but there is no method VectorUDT#numDimensions (or similar).

Perhaps it will be possible to create a subclass of VectorUDT that does so.

> As a workaround, we can have "split" data and use a VectorAssembler but it uses some computation time that we feel is a bit wasted.

You can waste computation time, or you can waste your own time.

If you think that your time is more abundant than computer time, then you can try creating a synthetic dataframe schema definition, as explained here: #18 (comment)

obones · 2017-06-30T13:31:00Z

Thanks, I'll see what I can do with the "synthetic definition", as using a VectorAssembler adds a time penalty of anywhere from 1 to 10%.

vruusmann · 2017-06-30T15:18:41Z

You don't need to embed and execute the VectorAssembler transformation in your actual data pipeline.

The idea is to create a pair of "synthetic" StructType and PipelineModel objects based on actual schema and fitted pipeline model objects. This synthetic PipelineModel object contains a synthetic VectorAssembler stage in the first position, which references columns in your synthetic StructType object. The important point is that VectorAssembler makes the number of columns in your dataframe known to JPMML-SparkML via the VectorAssembler#inputCols() parameter.
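A minimal PySpark sketch of that idea, not a verified recipe for any particular JPMML-SparkML version. The helper names (`synthetic_feature_names`, `build_synthetic_pair`), the `x1, x2, ...` column naming, the default `label`/`features` column names, and the assumption that every feature is a non-nullable double are all hypothetical. The number of features has to be supplied by the caller, since the vector column itself does not expose it.

```python
from typing import List


def synthetic_feature_names(num_features: int, prefix: str = "x") -> List[str]:
    """Return one placeholder column name per vector element (x1, x2, ...)."""
    if num_features < 1:
        raise ValueError("num_features must be positive")
    return [f"{prefix}{i}" for i in range(1, num_features + 1)]


def build_synthetic_pair(actual_schema, fitted_model, num_features,
                         label_col="label", features_col="features"):
    """Return a (synthetic StructType, synthetic PipelineModel) pair for export only.

    Assumes `fitted_model` is a fitted PipelineModel whose first stage reads the
    vector column `features_col`, and that a Spark session is already active.
    """
    # Imported lazily so the name helper above can be used without Spark installed.
    from pyspark.ml import PipelineModel
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql.types import DoubleType, StructField, StructType

    names = synthetic_feature_names(num_features)

    # Keep the real label field; replace the vector column with scalar columns.
    fields = [actual_schema[label_col]]
    fields += [StructField(name, DoubleType(), False) for name in names]
    synthetic_schema = StructType(fields)

    # This assembler is never run on real data. Its inputCols only declare
    # which scalar columns make up the vector that the fitted stages consume.
    assembler = VectorAssembler(inputCols=names, outputCol=features_col)
    synthetic_model = PipelineModel(stages=[assembler] + list(fitted_model.stages))
    return synthetic_schema, synthetic_model
```

The returned pair would stand in for the real schema and pipeline model at export time only, so training and scoring keep reading the vector column directly from the datasource.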

Anyway, if a 10% time penalty is such a huge deal for your use case, then you should probably avoid the PMML approach.

vruusmann closed this as completed Jun 30, 2017

vruusmann mentioned this issue Jul 18, 2017

java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type #26

Closed

vruusmann mentioned this issue Jan 9, 2020

feature datatype not support array? #88

Closed

vruusmann mentioned this issue Aug 6, 2024

How to import the training data schema in libsvm format #116

Open
