
How to import the training data schema in libsvm format #116

sunxiaolongsf opened this issue Jun 17, 2021 · 15 comments

@sunxiaolongsf commented Jun 17, 2021

I want to export a model with jpmml-sparkml. If the training data is in another format, I know I can use dataframe.schema, but when the training data is in libsvm format, what should I pass to this function:

val pmml = new PMMLBuilder(schema, pipelineModel).build()
@vruusmann (Member)

> If the training data is in another format

Have you verified that other parts of the PMML conversion workflow work as expected?

For example, if the training data contains vector columns, then the JPMML-SparkML library will probably raise an error about it.

> but when the training data is in libsvm format, what should I pass to this function

Convert the training data to a proper Dataset<Row> representation, and then proceed as usual?

Alternatively, you may construct the schema descriptor (a StructType object) yourself. It's easiest to obtain via Dataset#schema(), but if that's not an option, you can always build it by hand.
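
For instance, a minimal sketch of building such a StructType by hand (the column names and types below are illustrative, not prescribed by the library):

    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
    import org.jpmml.sparkml.PMMLBuilder

    // Hand-built schema descriptor; the field names and types must match
    // the columns that the pipeline was fitted on
    val schema = StructType(Array(
      StructField("sepal_length", DoubleType, nullable = false),
      StructField("sepal_width", DoubleType, nullable = false),
      StructField("petal_length", DoubleType, nullable = false),
      StructField("petal_width", DoubleType, nullable = false),
      StructField("species", StringType, nullable = false)
    ))

    val pmml = new PMMLBuilder(schema, pipelineModel).build()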

@vruusmann (Member)

Anyhow, if you want me to look deeper into this, then please provide a small self-contained & fully reproducible example.

You could convert the Iris dataset to LibSVM data format, and then report everything that's going wrong with it.

Otherwise, I simply won't have the time.
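
For reference, such a conversion might look roughly like this (a sketch; the Iris CSV path and column names are assumptions):

    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    val iris = spark.read.option("header", "true").option("inferSchema", "true").csv("Iris.csv")

    // The "libsvm" writer expects a numeric "label" column and a vector "features" column
    val indexed = new StringIndexer()
      .setInputCol("species")
      .setOutputCol("label")
      .fit(iris)
      .transform(iris)

    val assembled = new VectorAssembler()
      .setInputCols(Array("sepal_length", "sepal_width", "petal_length", "petal_width"))
      .setOutputCol("features")
      .transform(indexed)
      .select("label", "features")

    assembled.write.format("libsvm").save("iris_libsvm")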

@sunxiaolongsf (Author)

OK. Thanks for your reply!

1. Load and parse the libsvm data file, converting it to a DataFrame.

    val data = spark.read.format("libsvm").load("./src/main/resources/data/sample_libsvm_data.txt")
    println("read libsvm first: " + data.first())
    data.show()
    // first is (1.0,(438,[7,53,101,166,250,312,412],[4.0,2156.0,1927.0,73.0,804.0,477.0,415.0]))

    +-----+--------------------+
    |label|            features|
    +-----+--------------------+
    |  1.0|(438,[7,53,101,16...|
    |  0.0|(438,[59,124,191,...|
    |  0.0|(438,[5,17,91,192...|
    +-----+--------------------+

2.1 labelIndexer

    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)

2.2 featureIndexer

    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
3. Train a GBT model.

3.1 set model

    val gbt = new GBTClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setMaxIter(10)

3.2 Convert indexed labels back to original labels.

    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

3.3 Chain indexers and GBT in a Pipeline.

    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))

3.4 Train model. This also runs the indexers.

    val model = pipeline.fit(trainingData)
    println(model.getClass)
4. Save model to PMML

4.1 Using the libsvm data#schema raises the error "Expected string, integral, double or boolean type, got vector type":

    println("save model to pmml")
    val pmmlPath = "./src/main/resources/data/spark2pmml.pmml"
    val pmml = new PMMLBuilder(data.schema, model).build()
    JAXBUtil.marshalPMML(pmml, new StreamResult(new FileOutputStream(pmmlPath)))

4.2 Constructing the schema manually raises the error "Field "features" does not exist.":

    val newSchema = getLibsvmSchema(8)
    // StructType(StructField(label,DoubleType,true), StructField(col1,DoubleType,true),
    // StructField(col2,DoubleType,true), StructField(col3,DoubleType,true), StructField(col4,DoubleType,true),
    // StructField(col5,DoubleType,true), StructField(col6,DoubleType,true), StructField(col7,DoubleType,true))
    savePmml(newSchema, model, "./src/main/resources/data/spark2pmml.pmml")

So, I want to know how to save the model to PMML when the training data is in libsvm format like this. Thanks!

@vruusmann (Member)

> 4.1 Using the libsvm data#schema raises the error "Expected string, integral, double or boolean type, got vector type"

I told you something like this will happen.

See #26 and friends.

> val newSchema = getLibsvmSchema(8)

You're misrepresenting the data here, aren't you?

Your data frame contains a single n-element vector column, but the newSchema claims that there are n separate scalar (double) columns.

@vruusmann (Member)

The fix would be to add support for the ArrayType column type.

I've refused to do it in earlier years, but maybe I'll do it this year.

@vruusmann (Member)

@sunxiaolongsf To answer your original question ("how to handle LibSVM data"): you'd still need to manually unpack all array/vector columns into scalar columns before performing the conversion to PMML.

Rough outline:

  1. Load dataset in LibSVM format. It gives you vector columns.
  2. Unpack each and every n-element vector column to n scalar columns (typically double columns).
  3. Fit the Apache Spark ML pipeline using step 2 data frame.
  4. Get the schema of the step 2 data frame, and perform the conversion to PMML.

The resulting PMML will contain information about step 2 onward; it knows nothing about the vector columns of step 1.
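
A minimal sketch of steps 1 and 2, assuming Spark 3.0+ (where org.apache.spark.ml.functions.vector_to_array is available); the f_1 .. f_n column names are arbitrary:

    import org.apache.spark.ml.functions.vector_to_array
    import org.apache.spark.sql.functions.col

    // Step 1: loading LibSVM data yields a "label" column and a "features" vector column
    val raw = spark.read.format("libsvm").load("sample_libsvm_data.txt")

    // The loader produces fixed-size vectors, so the size can be read off the first row
    val numFeatures = raw.first().getAs[org.apache.spark.ml.linalg.Vector]("features").size

    // Step 2: unpack the n-element vector column into n scalar double columns
    val unpacked = raw
      .withColumn("f", vector_to_array(col("features")))
      .select(col("label") +: (0 until numFeatures).map(i => col("f")(i).alias(s"f_${i + 1}")): _*)

    // Steps 3 and 4: fit the pipeline on `unpacked`, then convert using unpacked.schema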

@githubthunder

Hi, @vruusmann

Does jpmml-sparkml support the libsvm format or the vector datatype now?

@vruusmann (Member)

@githubthunder The issue is still in "open" state, meaning that there hasn't been any major work done towards addressing it.

Anyway, what's wrong with the workflow suggested in my earlier comment (#116 (comment))? It lets you use a LibSVM dataset, if you're willing to throw in a couple of lines of data manipulation code.

The main issue with the LibSVM data format (and the vector data/column type) is that it is effectively schema-less.

The PMML standard is about structured ML applications (think: statistics). And structured ML applications require basic information about the underlying data, such as column names, column types, etc.

@githubthunder

Hi, @vruusmann, thanks for your replies.

If n is very large (meaning there are many features), the dataframe will have numerous columns, which could lead to excessive use of space.

Can jpmml-sparkml provide an interface that accepts a schema of 'label: DOUBLE, features: vector', where the vector may be in sparse format, and automatically generates feature names based on the vector order, similar to 'f_1, f_2, ...'?

@vruusmann (Member)

Older discussion(s) regarding vector columns: #21

@vruusmann (Member)

> Can jpmml-sparkml automatically generate feature names based on the vector order, similar to 'f_1, f_2, ...'?

Something like that. When the converter comes across a vector column, it would automatically expand it into a list of org.jpmml.converter.ContinuousFeature objects, one for each vector element. The name would be synthetic (x_{n} or f_{n}, or whatever), and the data type would be inherited from the vector's element type (can you have float vectors in Apache Spark these days, or are they all double vectors?).

The biggest obstacle in implementing vector column support is that the VectorUDT type does not carry any information about the "vector size". That is, there is no VectorUDT#getSize() method.

Without this information, the converter does not know how many ContinuousFeature objects to create.

@vruusmann (Member)

> The biggest obstacle in implementing vector column support is that the VectorUDT type does not carry any information about the "vector size". That is, there is no VectorUDT#getSize() method.

The workaround would be to require that the pipeline contain a VectorSizeHint transformer.
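
On the user's side that might look something like this (a sketch; VectorSizeHint has been part of org.apache.spark.ml.feature since Spark 2.3):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.VectorSizeHint

    // Attaches size metadata to the "features" column, so that the vector size
    // becomes statically known to downstream consumers
    val sizeHint = new VectorSizeHint()
      .setInputCol("features")
      .setSize(438)
      .setHandleInvalid("error")

    // Prepended to the pipeline stages from the earlier example
    val pipeline = new Pipeline()
      .setStages(Array(sizeHint, labelIndexer, featureIndexer, gbt, labelConverter))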

@githubthunder commented Aug 6, 2024

Hi, @vruusmann, thanks again for your replies and work.

If the VectorUDT type does not carry any information about the "vector size", maybe the interface could add an input parameter "numFeatures". The numFeatures value can be provided by the user. If numFeatures > 0, the interface will use this value for its calculations; otherwise it will follow the current processing logic.

The code might look as follows:

    // get the number of features
    val data = spark.read.format("libsvm").load("sample_libsvm_data.txt")
    val vector = data.first().getAs[org.apache.spark.ml.linalg.Vector]("features")
    val numFeatures = vector.size

    // train the machine learning model
    ......

    // export the model in PMML format
    val pmml = new PMMLBuilder(training.schema, pipelineModel, numFeatures).build()
    // or
    val pmml = new PMMLBuilder(training.schema, pipelineModel).build(numFeatures)

    JAXBUtil.marshalPMML(pmml, new StreamResult(System.out))

@vruusmann (Member) commented Aug 6, 2024

It's not permitted to change the PMMLBuilder constructor, or the build method (because that's the part that is guaranteed to stay stable for 5+ years).

But since we're dealing with a builder pattern here, it's possible to add more "configuration" methods.

For example, the PMMLBuilder could allow you to manually specify the Apache-Spark-to-(J)PMML mapping for individual columns:

List<Feature> listOfScalarFeatures = new ArrayList<>();
for(int i = 0; i < numFeatures; i++){
  listOfScalarFeatures.add(new ContinuousFeature(encoder, "f_" + String.valueOf(i + 1), DataType.DOUBLE));
}

PMMLBuilder pmmlBuilder = new PMMLBuilder(training.schema, pipelineModel)
  // THIS!
  .defineColumn("features", listOfScalarFeatures);

PMML pmml = pmmlBuilder.build();

@githubthunder

Hi, @vruusmann, thanks again.

Maybe the "defineColumn" method is a simple and effective solution to support the sparse format. I'm really looking forward to your work. Also, could you add the "defineColumn" feature to older versions as well?
