Questionmark provides information on sustainability and health of (food) products in the supermarket. Right now, we are able to analyze about 40k products (fully or partially).
One important step in automatically rating products on health and sustainability, is to put it into a category for scoring (and other uses). This is an attempt to use machine learning to do so.
>> read the blog article <<
Orange was used for a initial concept prototype, with the help of the orange3-text add-on.
You can open the orange workflow with a data sample
to experiment yourself. You may need to select usage_name
as target variable in Select Columns.
For the SVM feature plot, you might want to use usage_id
as target variable and usage_name
as
LIBSVM was used for a more refined prototype, the result is classify.rb. It can train and predict, as well as generate files that LIBSVM tools work with.
First train:
$ gem install tokkens roo rb-libsvm
$ ./classify.rb data-shorter.xlsx
$ wc -l test.out.*
39 test.out.labels
409 test.out.words
914 test.out.model
Then classify a new product based on name, brand and first ingredient:
$ ./classify.rb predict 'Volle koffiemelk' 'Campina' ''
Koffiemelk/room/poeder
Or return multiple categories with probabilities:
$ ./classify.rb prob 'Volle koffiemelk' 'Campina' '' | head -n 4
0.194 Koffiemelk/room/poeder
0.083 Bonbon & Praline's
0.064 Bier (pilsener)
0.059 Chocoladerepen
Alternatively, we can use svm-train
to build the model
$ ./classify.rb traindata data-shorter.xlsx
$ svm-train -c 128 -g 0.125 test.out.train test.out.model
Total nSV = 929
Note that the number of features (929) is about the number of training items (1049), which indicates that the algorithm isn't very well tuned. Which makes sense, because it is tuned to the large dataset.
Or check the accuracy of the model with (10-fold) cross-validation:
$ svm-train -c 128 -g 0.125 -v 10 test.out.train
Cross Validation Accuracy = 78.0744%
The generated model can be used for classification:
$ ./classify.rb predict 'Melk choco halfvol' '' 'melk'
Chocolademelk
Or we can do that manually using svm-predict
:
$ grep ' \(melk\|choco\|halfvol\|ING:melk\)$' test.out.words
3 melk
6 ING:melk
42 halfvol
146 choco
$ echo '1 3:1 6:1 42:1 146:1' >test.out.test
$ svm-predict test.out.test test.out.result
$ cat test.out.result
79
$ grep '^79 ' test.out.labels
79 31 Chocolademelk
For comparison, we can also check LIBLINEAR (which is actually more applicable to this problem than non-linear SVM):
$ liblinear-train -v 10 test.out.train
Cross Validation Accuracy = 80.3622%