Handling columns with null values #44
Comments
Apply What is your Apache Spark version? How does Apache Spark handle columns with missing values - AFAIK it should also crash sooner or later.
Hi, thanks for the quick reply. AFAIK the org.apache.spark.ml.feature.Imputer class can be used only on float or double data types. The column that gives me the error is of String type. I am using Apache Spark 2.2.0.
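Since Imputer only covers numeric columns, one possible way to deal with nulls in a String column before it reaches the pipeline is to fill them with a placeholder category via `DataFrame.na.fill`. This is only a sketch and is not something confirmed in this thread: the object name `FillStringNulls`, the helper `fillMissing`, the placeholder `"__missing__"` and the toy data are all made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FillStringNulls {
  // Replace nulls in the given String columns with a placeholder category.
  // The placeholder "__missing__" is a hypothetical choice, not from the thread.
  def fillMissing(df: DataFrame, cols: Seq[String], placeholder: String = "__missing__"): DataFrame =
    df.na.fill(placeholder, cols)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-string-nulls")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for the real data: "a1" is a String column with one null.
    val data = Seq[(String, Double)](("x", 1.0), (null, 2.0), ("y", 3.0)).toDF("a1", "a2")

    // After filling, "a1" no longer contains nulls.
    fillMissing(data, Seq("a1")).show()

    spark.stop()
  }
}
```

The placeholder then behaves like any other category, so whether this is acceptable depends on whether "missing" should be treated as a regular value of that feature.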
In Apache Spark, null values are handled with the StringIndexer setHandleInvalid method, with the value set to "keep". Let me put together simplified code that reproduces the issue and share it.
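To make the "keep" behaviour concrete, here is a minimal sketch, assuming Spark 2.2 or later and a local session. The object name `KeepInvalidSketch`, the helper `indexWithKeep` and the toy data are hypothetical. Only the `a1` / `a1Indexed` column names are taken from the snippet in this issue.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SparkSession}

object KeepInvalidSketch {
  // Index the String column "a1"; with handleInvalid = "keep", null (and unseen)
  // values are put into an extra bucket instead of raising an error.
  def indexWithKeep(df: DataFrame): DataFrame =
    new StringIndexer()
      .setInputCol("a1")
      .setOutputCol("a1Indexed")
      .setHandleInvalid("keep")
      .fit(df)
      .transform(df)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("keep-invalid-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data: two distinct labels ("x", "y") plus one null in "a1".
    val data = Seq[(String, Double)](("x", 1.0), ("y", 2.0), (null, 3.0), ("x", 4.0))
      .toDF("a1", "a2")

    indexWithKeep(data).show()

    spark.stop()
  }
}
```

As far as I know, the null row ends up with an index equal to the number of fitted labels, so the indexer itself does not fail on nulls. That would suggest the exception in this issue comes from a later step, not from Spark's indexing.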
@malathit90 Sorry, I don't have time to debug images.
Here is the snippet giving the error, @vruusmann:
```scala
import java.io.FileOutputStream
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}
import org.jpmml.model.MetroJAXBUtil
import org.jpmml.sparkml.ConverterUtil

val a1Idx = new StringIndexer().setInputCol("a1").setOutputCol("a1Indexed").setHandleInvalid("keep") // nulls in "a1" get an extra index
val featureAssembler = new VectorAssembler().setInputCols(Array("a1Indexed", "a2")).setOutputCol("features")
val labelIndexer = new StringIndexer().setInputCol("a16").setOutputCol("labelIndexed").fit(zeroFilledData)
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("featuresIndexed").setMaxCategories(15)
val classifier = new RandomForestClassifier().setLabelCol("labelIndexed").setFeaturesCol("featuresIndexed").setImpurity("gini").setPredictionCol("predictionIndexed")
val labelConverter = new IndexToString().setInputCol("predictionIndexed").setOutputCol("prediction").setLabels(labelIndexer.labels)
val pipeline = new Pipeline().setStages(Array(a1Idx, labelIndexer, featureAssembler, featureIndexer, classifier, labelConverter))
val model = pipeline.fit(zeroFilledData)
MetroJAXBUtil.marshalPMML(ConverterUtil.toPMML(df.schema, model), new FileOutputStream("/tmp/out.pmml")) // df and zeroFilledData are defined elsewhere (not shown)
```
I get the above exception when the column has null values. Any ideas on how to resolve this? Please comment if further details are needed.