GP Learners

FlexGP Blog

In this blog we provide examples of how to use the learners developed within the FlexGP project, and we analyze the performance of the released learners on different datasets.

  1. Symbolic Regression (SR) learner
  2. Rule Tree classification
  3. GP function classification
  4. Multiple Regression Genetic Programming (MRGP)

Symbolic Regression Learner: Predicting the quality of wine

The Wine Quality dataset is available at the UCI Machine Learning repository website. This problem consists of modeling the quality (a grade from 1 to 10) of a given red or white wine from 11 features such as acidity and alcohol content. Note that the first line of both datasets contains the labels of the different features and needs to be deleted. Additionally, the separators employed in the original dataset are semicolons and must be replaced with commas. We show the steps necessary to preprocess the Red Wine dataset (the same steps apply to the White Wine dataset):

$ cd path_to_redWine_folder
$ sed 1d original_data > new_data
$ sed -i 's/;/,/g' new_data
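The two steps can also be combined into a single sed invocation; winequality-red.csv is the file name used on the UCI site and may differ locally:
$ sed '1d; s/;/,/g' winequality-red.csv > new_data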
Once the data is properly formatted, we run the SR learner:

$ cd path_to_redWine_folder
$ java -jar sr.jar -train path_to_redWine_data -minutes 60
$ cd path_to_whiteWine_folder
$ java -jar sr.jar -train path_to_whiteWine_data -minutes 60 

At the end of both runs we measure the accuracy of the most accurate, least complex, and knee models, as well as the fused Pareto Front model:
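For reference, MSE is the mean squared prediction error and MAE the mean absolute error. The learner reports both metrics itself, but given a two-column comma-separated file of predictions and targets (predAndTarget.csv is a hypothetical name) they can be recomputed with awk:
$ awk -F, '{d=$1-$2; se+=d*d; ae+=(d<0?-d:d)} END {print "MSE:", se/NR; print "MAE:", ae/NR}' predAndTarget.csv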

Red wine dataset

$ cd path_to_redWine_folder
$ java -jar sr.jar -test path_to_redWine_data -integer true -scaled knee.txt 
(0.2091043529058974 .* (- (+ (- (- (- (- (- (- X11 (mydivide (+ X3 X9) X10)) X2) (exp X10)) X2) (log X5)) (mydivide (cos (exp X3)) (- X7 X1))) (- X11 (exp X2))) (mydivide (- (+ (- (- (- (mydivide (- (- X11 (exp (cos (+ X11 X9)))) X9) X10) (mydivide (mysqrt X1) X10)) (exp (sin X1))) (mydivide X11 (mydivide X9 (sin X4)))) X11) (mydivide (- (- (- X6 (mydivide (- (- (- X7 X1) X1) X6) (mydivide (square X2) X10))) X9) (mydivide X11 (sin X4))) X7)) X1))) + 3.2310360388920216
MSE: 0.4803001876172608
MAE: 0.4252657911194497
$ java -jar sr.jar -test path_to_redWine_data -integer true -scaled leastComplex.txt 
(-0.3019831257873804 .* X9) + 6.6359228267589190
MSE: 0.7873671044402751
MAE: 0.6610381488430269
$ java -jar sr.jar -test path_to_redWine_data -integer true -scaled mostAccurate.txt 
(0.2146726201717735 .* (- (- (+ (- (- (- (- (- (- (- (- (- X11 (mydivide X9 X10)) X2) (exp X10)) (exp X2)) (+ X5 X9)) X3) X5) X2) (mydivide (cos (+ X3 X8)) (- X7 X1))) X11) (mydivide (- (+ (- (- (- (- (- (- (- (mydivide (- (- X11 (exp (cos (- (- (- (- (- (mydivide (- (- X11 (exp (cos (+ X11 X11)))) X9) X10) (mydivide X9 X10)) X2) (mydivide (- (- (- (- (- (- (sin X4) (mydivide (- (- (- (mydivide (- (- (mydivide (- (- X11 (exp (cos (+ X11 X11)))) X9) X10) (exp X2)) (mydivide (square X2) X10)) X10) X1) X10) (sin X1)) X2)) (log (mydivide (- X2 (mydivide (- (- (- (- (- (mydivide (- (- X11 (exp (cos (+ X9 X9)))) X9) X10) (mydivide X9 X10)) (sin X4)) X5) (mydivide (quart X2) X10)) (sin X1)) X2)) X1))) (exp (mysqrt (- X11 (mydivide X4 (- X2 (mydivide X11 X3))))))) (- X4 X1)) X6) (- X8 X11)) X6)) (mydivide X9 X10)) (sin X1))))) X9) X10) (mydivide X9 X10)) (exp (cos (- (- (- (- (- (mydivide (- (- X11 (exp (cos (+ X11 X11)))) X9) X10) (mydivide X9 X10)) X2) (mydivide (- (- (- (- (- (- X2 (mydivide (- (- (- (mydivide (- (- (mydivide (- (- X11 (exp (cos (+ X11 X11)))) X9) X10) (exp X2)) (mydivide (square X2) X10)) X10) X1) X10) (sin X1)) X2)) (log (quart X2))) (exp (mysqrt (- X11 (mydivide X4 (- X2 (mydivide (square X2) X3))))))) (- X4 X1)) X6) (- (sin X5) X11)) X6)) (mydivide X9 X10)) (sin X1))))) (mydivide (- (- (- (- (- (- (mydivide X8 (- (- X11 X5) X1)) (mydivide (- (- (- (mydivide (- (- (mydivide (- (- X11 (exp (cos (+ X11 X11)))) X9) X10) (exp (sin X1))) (cos (+ X11 X11))) X10) X1) X10) X2) X2)) (log (mydivide (- (- (- (mydivide (- (- (mydivide (- (- (mydivide (- (mydivide (- (mydivide X9 (log X10)) (cos (+ X11 X10))) X2) X1) X2) (exp (mydivide (- X1 (mydivide (- X11 (mydivide X7 (sin (mydivide (- (- X11 (log (mydivide (- X2 (- X11 (mydivide X5 (exp (- X7 X10))))) X1))) (exp (mysqrt (- X11 (mydivide X4 (- (+ X11 (- (+ X11 X9) X6)) (mydivide X11 X3))))))) X6)))) (exp X2))) X1))) (log (- (+ X11 X11) (mydivide (- X11 (quart X2)) X3)))) X11) X9) X2) X10) X1) X3) (mydivide (- (- (cos (+ X11 X6)) (mydivide X7 (mydivide X3 (exp (- X7 X1))))) X11) X10)) X3))) (exp (mysqrt (- X11 (mydivide (mydivide X9 (cos (log (mydivide X3 (exp (- X7 X1)))))) (- (+ X11 (- X5 X6)) (mydivide X11 X3))))))) (- (- (- X10 (mydivide X3 X10)) X10) X1)) X6) (- X5 X11)) X6)) (mydivide (+ (sin X1) X9) X10)) (sin (- (- (mydivide (- (mydivide X9 X10) (square X2)) X2) X1) X3))) (log (- (+ X11 X11) (mydivide (- X11 (quart X2)) X3)))) (mydivide (cos (mydivide (cos (+ X3 X8)) (- X7 X1))) (- X7 X1))) X11) (mydivide (- (- (- (- (- (- (mydivide X7 (- (- (- X10 X1) (mydivide (- (- X5 (- X8 X11)) X6) (mydivide (square X2) X10))) X4)) (mydivide (- (- X3 (- (- X11 (mydivide X9 X2)) (- (- X5 X11) X6))) X1) X2)) (mydivide (- (- (- (- X7 X11) X1) X6) X2) (mydivide (square X2) X10))) X2) X11) (mydivide X3 (exp (- X7 X1)))) (mydivide (mydivide (- (- (- (mydivide (- (mydivide X9 X10) (square X2)) X2) X1) (+ X3 X2)) (- (- (- (- (- (- (- (- (- X11 (mydivide (+ X5 X9) X10)) (exp X2)) (exp X2)) X2) (+ X5 X9)) (quart X3)) X3) X2) (mydivide (- (- (+ (- (- (- (- (- (- (- (- (- X11 X10) X2) (exp X10)) (exp X2)) (+ X5 X9)) X3) X5) X2) (mydivide (cos (exp X3)) (- X7 X1))) X11) (mydivide (- (+ (- (- (- (- (- (- (- (- (mydivide (- (- X11 (exp (cos (+ X11 X9)))) (exp (sin X1))) X10) (sin X2)) (mydivide X9 X10)) X2) (mydivide (- (- (- (- (mydivide X8 (- (- X11 X5) X1)) (mydivide (- (- (- (mydivide X11 X10) X1) X9) X6) X2)) X2) X2) X11) X6)) (mydivide X9 X10)) (sin (- X4 X1))) (log (- (+ X11 X11) (mydivide X11 X3)))) (mydivide (cos (mydivide (cos (+ X3 X8)) (- X7 
X1))) (- X11 X1))) X11) (mydivide (- (- (- (- (- X1 (mydivide (- (- (- (- X7 X1) X11) X6) X2) (mydivide (square X2) (cos (log X10))))) X11) (mydivide (+ X3 X9) X10)) (mydivide X3 (exp (- X7 X1)))) (mydivide (mydivide (- (mydivide (- (mydivide X9 X10) (cos X1)) X2) X1) X2) (sin X4))) X7)) X1)) X2) (- X7 (- X7 X1))))) X2) (sin X4))) X7)) X1)) X2)) + 4.3779292710659720
MSE: 0.44903064415259536
MAE: 0.40275171982489055
$ java -jar sr.jar -test path_to_redWine_data -integer true -fused fusedModel.txt
MSE fused Model: 0.4459036898061288
MAE fused Model: 0.40212632895559725

White wine dataset

$ cd path_to_whiteWine_folder
$ java -jar sr.jar -test path_to_whiteWine_data -integer true -scaled knee.txt 
(1.7144450089364918 .* (mydivide (* (cos X11) X2) (mysqrt (sin (* X9 X10))))) + 6.0085077197027010
MSE: 0.7133523887300939
MAE: 0.5618619844834626
$ java -jar sr.jar -test path_to_whiteWine_data -integer true -scaled leastComplex.txt 
(0.5831540037037669 .* X9) + 4.0186588950880970
MSE: 0.7991016741527154
MAE: 0.6304614128215599
$ java -jar sr.jar -test path_to_whiteWine_data -integer true -scaled mostAccurate.txt 
(1.4859603473683902 .* (mydivide (* (cos X11) (sin X2)) (sin (mysqrt (sin (sin (mydivide (* (cos (sin (sin (sin (sin (sin (sin (sin (sin (sin (sin (sin (sin (sin (sin (mysqrt (sin (sin (sin (sin (sin (mysqrt (mysqrt (mysqrt (mysqrt (sin (log (sin (sin (sin (sin (sin (sin (sin (mysqrt X3))))))))))))))))))))))))))))))))))) (sin (sin (sin (sin (exp (log (sin (sin (sin (sin (sin (sin (sin (sin (mysqrt X3)))))))))))))))) (mysqrt X5)))))))) + 6.0163580514028405
MSE: 0.7013066557778685
MAE: 0.5596161698652511
$ java -jar sr.jar -test path_to_whiteWine_data -integer true -fused fusedModel.txt
MSE fused Model: 0.7013066557778685
MAE fused Model: 0.5596161698652511

Symbolic Regression Learner: NOx Emissions dataset

The data is split into a training set and a test set. We first retrieve models from the training set:
$ java -jar sr.jar -train path_to_NOx_train_data -minutes 60 
Then, the obtained models are tested on the test set:
$ java -jar sr.jar -test path_to_NOx_test_data -integer false -scaled knee.txt 
(0.3503123449969823 .* (* (- X6 (* X11 (* X11 (- (- (* X18 X11) (- X13 X9)) X14)))) (- (mysqrt (- (cube X6) X12)) (+ (- (- (quart X9) (* X14 X13)) (mysqrt (mysqrt X3))) (square (- (cube X15) X3)))))) + 0.0666665555347619
MSE: 0.03866067788176831
MAE: 0.15697528915065745
$ java -jar sr.jar -test path_to_NOx_test_data -integer false -scaled leastComplex.txt 
(0.3588259382986629 .* X13) + 0.0677718532771849
MSE: 0.05669892879042211
MAE: 0.18320088883003482
$ java -jar sr.jar -test path_to_NOx_test_data -integer false -scaled mostAccurate.txt 
(0.3293701574352099 .* (* (- X6 (* X11 (* X11 (* X11 (sin (sin (- (- X18 (- X13 X9)) X14))))))) (- (mysqrt (- (cube X6) X12)) (+ (- (- (quart X9) (- (mysqrt (mysqrt (- (* (- X13 (* X12 (mysqrt (- (cube (* (exp (quart (square X6))) (sin (- X4 (+ (- (- (- X14 (- (- (* (- X18 (- X13 (mysqrt (mysqrt (- (cube X6) X12))))) (- (* X13 (cube X6)) (+ (- X12 X11) (+ (- X12 X13) (square X9))))) (* (mysqrt (- (cube X15) (- (square X9) (* X14 X13)))) X18)) (* X14 X13))) (cube (- X6 (sin (- X18 X13))))) X6) X12))))) X12)))) (- (* X14 (sin (- (mysqrt X6) (+ (- (- (quart X14) (cube (- X6 (sin (- X18 (square X9)))))) X9) (square X13))))) (+ (- X12 (mysqrt X13)) (square X9)))) (* X11 (* X11 (* X6 (sin (sin (sin X18))))))))) (- (square X9) (* X14 X13)))) (mysqrt (- X6 X3))) (square (- (cube X15) X3)))))) + 0.0512241480213157
MSE: 0.036183527341113
MAE: 0.14965293512345454
$ java -jar sr.jar -test path_to_NOx_test_data -integer false -fused fusedModel.txt
MSE fused Model: 0.03950888612009337
MAE fused Model: 0.15696745165002576

Symbolic Regression Learner: Kaggle bond price dataset

This dataset can be downloaded from the Kaggle bond trade price challenge website. Some preprocessing steps are required in this case to adapt the data to a format compatible with the SR learner. As in the previous example, we first delete the first line, which contains the labels of the explanatory variables. We then reduce the dataset by taking the first 200K exemplars and removing all lines containing NaN values:
$ cd path_to_file
$ sed 1d original_data > kaggle.data
$ head -n 200000 kaggle.data > reducedKaggle.data 
$ grep -v NaN reducedKaggle.data > cleanKaggle.data
The proposed challenge consists of predicting the bond price given in the third column. Note that columns 1 and 2 contain nominal values and thus should be ignored. Also, the SR learner expects the targets in the last column, so we extract them and save them in a separate file:
$ cut -d, -f3 cleanKaggle.data  > targets.data
We then remove the first, second, and third columns of the dataset and paste the targets as the last column. The result is a dataset of 195458 exemplars, each with 58 explanatory variables plus a target value.
$ cut --complement -d, -f1,2,3 cleanKaggle.data  > cleanColsKaggle.data
$ paste -d, cleanColsKaggle.data  targets.data > finalKaggle.data
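Before training, it is worth checking that the preprocessing produced the expected shape, i.e. 195458 rows of 58 features plus one target (59 comma-separated fields):
$ wc -l finalKaggle.data
195458 finalKaggle.data
$ awk -F, '{print NF; exit}' finalKaggle.data
59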
Finally, we employ the SR learner to model the data. In this case, the -cpp flag is employed to enable the optimized C++ evaluation of candidate solutions. In the example below, 4 threads are used to speed up the process. Note that it is necessary to create an auxiliary folder in which temporary C++ files will be generated:
$ mkdir tempFiles
$ java -jar sr.jar -train path_to_kaggle_data -minutes 60 -cpp 4

At the end of the run we measure the accuracy of the knee, least complex and most accurate models and the fused Pareto Front Model:

$ java -jar sr.jar -test path_to_kaggle_data -integer false -scaled knee.txt 
(0.0986682027578354 .* (+ (+ (+ (+ X20 X10) X50) (+ (mysqrt X30) (mysqrt X30))) (+ (+ X8 (+ X15 X10)) (+ (+ (+ X8 X10) X7) (+ X35 X10))))) + -0.9139559864997864
MSE: 0.9568307415218184
MAE: 0.5808919648383791
$ java -jar sr.jar -test path_to_kaggle_data -integer false -scaled leastComplex.txt 
(0.9925869703292847 .* X10) + 0.7772449851036072
MSE: 1.307177978046785
MAE: 0.6098333240942719
$ java -jar sr.jar -test path_to_kaggle_data -integer false -scaled mostAccurate.txt 
(0.0249582007527351 .* (+ (+ (+ X25 (+ (+ (+ X8 (+ X55 (+ X8 (+ (+ X12 X30) (+ (+ X10 (+ (mysqrt X10) X10)) (+ (mysqrt (+ X35 (square X12))) (+ (mysqrt X35) X10))))))) (+ (mysqrt (+ X45 X50)) (+ X7 (+ (mysqrt X8) (sin (mysqrt X20)))))) X10)) X50) (+ (+ (+ X35 (+ (+ X15 (+ X7 (mysqrt (+ X15 X3)))) (+ (+ (+ (mysqrt (+ X10 X10)) (+ (+ (+ X8 (+ X55 (+ X8 (+ (+ X12 X30) X10)))) (+ (mysqrt (+ X3 X10)) X10)) X10)) X50) (+ (+ (+ (sin X7) (+ X15 (+ X7 (mysqrt X20)))) X40) (+ (+ X4 (+ (+ X20 (+ (mysqrt X10) (mysqrt X10))) (+ (mysqrt X10) X10))) (+ (+ (+ (+ X8 (+ X10 (mysqrt (+ X15 (+ (mysqrt X10) X50))))) X15) (+ (mysqrt X10) (+ X7 X10))) (+ (mysqrt X7) (+ X7 X10)))))))) X40) (+ (+ X20 (+ (+ X20 (mysqrt (+ (+ (+ X55 (+ X15 X3)) X15) X8))) X8)) (+ (+ (+ (+ X8 (+ X10 (mysqrt (+ X15 X3)))) (+ X10 X10)) (+ (mysqrt (+ X8 (+ X15 X10))) (+ X7 X10))) X10))))) + -2.9934198856353760
MSE: 0.9459161705227037
MAE: 0.5811238430405654
$ java -jar sr.jar -test path_to_kaggle_data -integer false -fused fusedModel.txt
MSE fused Model: 0.9488283608868329
MAE fused Model: 0.5816370761201752

Symbolic Regression Learner: Million Song Dataset year prediction challenge

The Million Song dataset is available at the UCI Machine Learning repository website. The Million Song Dataset year prediction challenge is a regression problem in which the release year of the songs has to be predicted. The dataset is composed of more than 500K songs, each described with a set of 90 features. As before, the data is split into a training set and a test set. Note that the so-called producer effect has been taken into account to perform the data split. In this example, we normalize both the train and test data before starting the training process:
$ java -jar sr.jar -normalizeData path_to_msd_train_data -newPath path_to_msd_norm_train -pathToBounds path_to_train_bounds
$ java -jar sr.jar -normalizeData path_to_msd_test_data -newPath path_to_msd_norm_test -pathToBounds path_to_test_bounds
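For intuition, min-max normalization rescales each column x to (x - min) / (max - min) and keeps the per-column bounds so predictions can later be mapped back. The learner's exact convention may differ; the two-pass awk sketch below, over a placeholder file data.csv, only illustrates the idea:
$ awk -F, -v OFS=, '
    NR==FNR { for (i=1; i<=NF; i++) { if (FNR==1 || $i<lo[i]) lo[i]=$i; if (FNR==1 || $i>hi[i]) hi[i]=$i }; next }
    { for (i=1; i<=NF; i++) $i = (hi[i]==lo[i]) ? 0 : ($i-lo[i]) / (hi[i]-lo[i]); print }
' data.csv data.csv > norm.csv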
Once the data is normalized, we retrieve models from the training set. In the following example, we create an auxiliary folder where the temporary C++ files will be generated and specify a 4-threaded, C++-optimized evaluation:
$ mkdir tempFiles
$ java -jar sr.jar -train path_to_msd_norm_train -minutes 60 -cpp 4

The obtained models are then tested on normalized unseen data. Note that since both explanatory variables and targets have been previously normalized, we specify the paths to the stored min and max bounds, which are needed to map predictions back to the original scale:

$ java -jar sr.jar -normTest path_to_msd_norm_test -pathToTrainBounds path_to_tr_bounds -pathToTestBounds path_to_test_bounds

TESTING KNEE MODEL:
(0.2497919946908951 .* (+ (* (* (log X88) X2) X6) (+ (+ X1 (* (log (quart X48)) X3)) X1))) + 0.8184109926223755
MSE: 95.09644014969908
MAE: 6.866209679535639

TESTING MOST ACCURATE MODEL: 
(0.5029709935188293 .* (+ (* (* (* (+ (* (log X28) X3) (* (+ (* (+ (* (+ (+ (* X1 (log X1)) (* (log X6) X3)) (+ (+ (+ X13 (* (log (quart X32)) X3)) (+ X36 (log (* (+ (+ X1 (+ (quart X32) (* (log (quart X32)) X3))) X1) X1)))) (log (* (+ (+ X1 (* (log (quart X32)) X3)) X1) X73)))) X6) (* (+ (+ (* X1 X35) (+ X28 (* (log X13) X3))) X1) X1)) X3) (+ (* X1 X1) (* (+ (+ X1 (* (log X76) X2)) X1) X1))) X3)) X12) X78) X35) (+ (* (* (+ X25 (* (log X61) X2)) X78) X35) (* (log X1) X1)))) + 1.1432499885559082
MSE: 90.53336481890285
MAE: 6.63900995325159

TESTING SIMPLEST MODEL: 
(-0.0306775998324156 .* X78) + 0.8838499784469604
MSE: 112.46824021812294
MAE: 7.771649342527974

TESTING FUSED MODEL:
MSE fused Model: 90.53336481890285
MAE fused Model: 6.63900995325159

Classification by Rule Trees: Banknote Authentication

The Banknote Authentication dataset is available at the UCI Machine Learning repository website. It is a binary classification problem composed of 1372 exemplars, each described with a set of 4 features. The dataset is shuffled and split to obtain the training set (66%) and test set (33%):

$ sort --random-sort data_banknote_authentication.csv > banknoteShuffled.csv
$ head -n905 banknoteShuffled.csv > banknoteTrain.csv
$ tail -n466 banknoteShuffled.csv > banknoteTest.csv
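Two caveats about this split: GNU sort --random-sort orders lines by a random hash of their contents, so duplicate exemplars end up adjacent, and 905 + 466 = 1371, so one of the 1372 exemplars is silently dropped. If shuf from GNU coreutils is available, the following variant avoids both issues:
$ shuf data_banknote_authentication.csv > banknoteShuffled.csv
$ head -n905 banknoteShuffled.csv > banknoteTrain.csv
$ tail -n467 banknoteShuffled.csv > banknoteTest.csv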
We now run the Rule Tree learner with the following command:
$ java -jar ruletree.jar -train path_to_banknoteTrain -minutes 1
As a first step, the Rule Tree learner divides the range of observed values of each variable into intervals. These intervals are reported in the file conditions.txt and will be used by the learner to construct Boolean expressions. In this case, 2 conditions are constructed for each of the first, second, and third explanatory variables, while the range of values of X4 is split into four intervals.
$ less conditions.txt
C1 : X1 in [ -7.0421 ; 0.5527080000000055 ]
C2 : X1 in [ 0.5527080000000055 ; 7.563300000000012 ]
C3 : X2 in [ -13.6779 ; 2.8997999999999906 ]
C4 : X2 in [ 2.8997999999999906 ; 13.951599999999978 ]
C5 : X3 in [ -5.2613 ; 6.707950000000006 ]
C6 : X3 in [ 6.707950000000006 ; 18.677200000000006 ]
C7 : X4 in [ -8.5482 ; -7.228452999999995 ]
C8 : X4 in [ -7.228452999999995 ; -3.869096999999983 ]
C9 : X4 in [ -3.869096999999983 ; 0.9299830000000169 ]
C10 : X4 in [ 0.9299830000000169 ; 3.449500000000017 ]
The obtained models are then tested on unseen data:
$ java -jar ruletree.jar -test path_to_banknoteTest -pathToConditions conditions.txt
RULE TREE: C1
ACCURACY: 0.8261802575107297
PRECISION: 0.7212389380530974
RECALL: 0.9005524861878453
F-SCORE: 0.800982800982801
FALSE POSITIVE RATE: 0.22105263157894736
FALSE NEGATIVE RATE: 0.09944751381215469
The Rule Tree learner reports only a model of minimal complexity, consisting of a single condition on variable X1. In fact, C1 represents the following rule:
C1 : X1 in [ -7.0421 ; 0.5527 ]
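As a sanity check, the rule can be applied to the raw test data directly. The awk sketch below recomputes accuracy under two assumptions: the class label is the last comma-separated field, and the tree predicts class 1 when C1 is satisfied (swap the 1 and 0 if the label convention is the opposite):
$ awk -F, '{p = ($1 >= -7.0421 && $1 <= 0.5527) ? 1 : 0; if (p == $NF) hit++}
    END {printf "ACCURACY: %f\n", hit/NR}' banknoteTest.csv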

Classification by Rule Trees: Skin Segmentation

The Skin Segmentation dataset is available at the UCI Machine Learning repository website. It is a binary classification problem composed of roughly 250K exemplars, each described with a set of 3 features.

We now report the steps necessary to use the Rule Tree learner with this dataset. First, the original class labels (1 or 2) must be replaced with 0 or 1 values:
$ sed 's/1\r/0/g' skin.data > skin02.data
$ sed 's/2\r/1/g' skin02.data > skin01.data
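The \r in both patterns accounts for the DOS line endings of the original file. Assuming GNU sed, the two substitutions can also be applied in a single pass, anchored to the end of the line so that only the trailing label is touched:
$ sed 's/1\r$/0/; s/2\r$/1/' skin.data > skin01.data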
Once the labels have been replaced, we shuffle the resulting dataset and perform a 66/33 split to obtain the training and test sets:
$ sort --random-sort skin01.data > skinShuffle.data
$ head -n 161737 skinShuffle.data > skinTrain.csv
$ tail -n 83320 skinShuffle.data > skinTest.csv
Note that by default the Rule Tree learner assigns equal weights, or costs, to false positive and false negative errors. However, in real-world problems these errors usually carry different costs. We now perform a second run in another folder, this time indicating different weights for the two errors in the properties file; with the settings below, false negatives are penalized nine times more heavily than false positives:
$ less skin.properties
pop_size = 2000
false_positive_weight = 0.1
false_negative_weight = 0.9
$ java -jar ruletree.jar -train path_to_skinTrain -minutes 60 -properties skin.properties
The obtained models are then tested on unseen data. We report the knee and most accurate models obtained with the modified weights:
$ java -jar ruletree.jar -test path_to_skinTest -pathToConditions path_to_conditions 
TESTING KNEE MODEL:
RULE TREE: (or C7 C3)
ACCURACY: 0.9134181469035045
PRECISION: 0.9550560061799923
RECALL: 0.9348920917711468
F-SCORE: 0.9448664842639438
FALSE POSITIVE RATE: 0.169137740566312
FALSE NEGATIVE RATE: 0.06510790822885316

TESTING MOST ACCURATE MODEL: 
RULE TREE: (or C7 (not (and C5 C2)))
ACCURACY: 0.9417426788286126
PRECISION: 0.9566371021837967
RECALL: 0.9705842319384159
F-SCORE: 0.9635601999909915
FALSE POSITIVE RATE: 0.169137740566312
FALSE NEGATIVE RATE: 0.029415768061584066

Classification by GP function: Banknote Authentication

We reuse the Banknote Authentication dataset from above to demonstrate the GP function classifier, repeating the same preprocessing steps to format the data:

$ sort --random-sort data_banknote_authentication.csv > banknoteShuffled.csv
$ head -n905 banknoteShuffled.csv > banknoteTrain.csv
$ tail -n466 banknoteShuffled.csv > banknoteTest.csv

In this example, we employ the same data to train the classifiers and to set the decision thresholds. We run the GP function learner with the following command:

$ java -jar gpfunction.jar -train path_to_banknoteTrain -cv path_to_banknoteTrain -minutes 20
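Here the same file serves both roles, so the reported validation AUC is optimistic. For a less optimistic threshold estimate, a held-out validation split can be passed to -cv instead; the file names below are hypothetical:
$ head -n605 banknoteTrain.csv > banknoteSubTrain.csv
$ tail -n300 banknoteTrain.csv > banknoteVal.csv
$ java -jar gpfunction.jar -train banknoteSubTrain.csv -cv banknoteVal.csv -minutes 20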
We now show the solutions stored in the pareto.txt file. The file contains the non-dominated models; each line reports a model, its AUC when tested on the training set, its AUC when tested on the validation set, and its decision threshold. Note that in this case the same data is used for training and validation, and thus the same AUC values are reported:
$ less pareto.txt
X3,0.5377,0.5377,0.6
(square X3),0.5893,0.5893,0.1
(- X4 X1),0.8946,0.8946,0.6
(- (- X4 X1) X1),0.9307,0.9307,0.6
(- (sqrt (exp X4)) X1),0.9322,0.9322,0.5
(- (- X4 X1) (+ X3 X2)),0.9373,0.9373,0.7
(- (- (cos X3) X1) (+ X3 X2)),0.9985,0.9985,0.6
(- (- (- (cube (cos (exp X3))) X1) X1) (+ X3 X2)),0.9985,0.9985,0.6
(- (- (- (exp (exp (- (sin X1) (exp (- X3 X4))))) X1) (+ X3 X2)) X1),0.9996,0.9996,0.6
(- (- (- (exp (- (sin X1) (exp X3))) X1) X1) (+ X3 X2)),0.9994,0.9994,0.6
(- (- (- (exp (- (sin X1) (exp (- X3 X2)))) X1) X1) (+ X3 X2)),0.9995,0.9995,0.6
(- (sin X4) X1),0.9179,0.9179,0.4
(- (- (- (exp (exp (- (sin X1) (exp (- X3 (- (sqrt (exp X4)) X1)))))) X1) (+ X3 X2)) X1),0.9998,0.9998,0.5
(- (- (- (exp (exp (- (sin X1) (exp (- X3 (- (- (- (exp X4) X1) X1) (- (- (exp (- (sin X1) (exp X3))) X1) X1))))))) X1) (+ X3 X2)) X1),1.0,1.0,0.6
(- (- (- (exp (- (- (sin X1) (exp (- X3 (mydivide (sin X1) (square X3))))) (+ X3 X2))) X1) X1) (+ X3 X2)),0.9999,0.9999,0.6
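Since the expressions contain no commas, each line has four comma-separated fields, and the model with the best validation AUC can be pulled out directly:
$ sort -t, -k3,3 -rn pareto.txt | head -n1
(- (- (- (exp (exp (- (sin X1) (exp (- X3 (- (- (- (exp X4) X1) X1) (- (- (exp (- (sin X1) (exp X3))) X1) X1))))))) X1) (+ X3 X2)) X1),1.0,1.0,0.6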
The obtained models are then tested on unseen data:
$ java -jar gpfunction.jar -test path_to_banknoteTest
We report the results for the model presenting the highest Area Under the ROC Curve on the validation set:

GP FUNCTION: (- (- (- (exp (exp (- (sin X1) (exp (- X3 (- (- (- (exp X4) X1) X1) (- (- (exp (- (sin X1) (exp X3))) X1) X1))))))) X1) (+ X3 X2)) X1)
ACCURACY: 0.9892703862660944
PRECISION: 0.9731182795698925
RECALL: 1.0
F-SCORE: 0.9863760217983651
FALSE POSITIVE RATE: 0.017543859649122806
FALSE NEGATIVE RATE: 0.0

Classification by GP function: Skin Segmentation

The Skin Segmentation dataset is available at the UCI Machine Learning repository website. We repeat the steps explained in detail in the Rule Tree example:

$ sed 's/1\r/0/g' skin.data > skin02.data
$ sed 's/2\r/1/g' skin02.data > skin01.data
$ sort --random-sort skin01.data > skinShuffle.data
$ head -n 161737 skinShuffle.data > skinTrain.csv
$ tail -n 83320 skinShuffle.data > skinTest.csv
As in the Rule Tree classifier example, we modify the population size and the costs of the false positive and false negative errors. Additionally, 4 CPU threads are used to speed up the process:
$ less skin.properties
pop_size = 2000
false_positive_weight = 0.1
false_negative_weight = 0.9
$ java -jar gpfunction.jar -train path_to_skinTrain -minutes 60 -cpp 4 -properties skin.properties
The obtained models are then tested on unseen data. We report the knee and most accurate models obtained with the modified weights:
$ java -jar gpfunction.jar -test path_to_skinTest
TESTING KNEE MODEL:
GP FUNCTION: (- X1 X3)
ACCURACY: 0.9381180988958233
PRECISION: 0.9853514847543985
RECALL: 0.9359356331573933
F-SCORE: 0.9600080666428804
FALSE POSITIVE RATE: 0.053491482062910635
FALSE NEGATIVE RATE: 0.06406436684260673

TESTING MOST ACCURATE MODEL: 
GP FUNCTION: (* (- (* (- X1 X3) X2) (- (- (mydivide X1 (- X2 X3)) X3) X1)) X1)
ACCURACY: 0.9115458473355736
PRECISION: 0.9116694928318175
RECALL: 0.9838629179836965
F-SCORE: 0.9463914226276204
FALSE POSITIVE RATE: 0.3664747950462236
FALSE NEGATIVE RATE: 0.016137082016303445
The Pareto Front contains another model with higher accuracy:
GP FUNCTION: (* (- X1 X3) (sqrt X1))
ACCURACY: 0.9503600576092175
PRECISION: 0.9907370754492915
RECALL: 0.9462954280788252
F-SCORE: 0.9680064358426932
FALSE POSITIVE RATE: 0.034013605442176874
FALSE NEGATIVE RATE: 0.05370457192117482

Multiple Regression Genetic Programming Learner: Predicting the quality of wine

The Wine Quality dataset is available at the UCI Machine Learning repository website. This problem consists of modeling the quality (a grade from 1 to 10) of a given red or white wine from 11 features such as acidity and alcohol content. As for the SR learner, we first format the data:

$ cd path_to_redWine_folder
$ sed 1d original_data > new_data
$ sed -i 's/;/,/g' new_data

After the data is properly formatted, we run the MRGP learner:

$ cd path_to_redWine_folder
$ java -jar mrgp.jar -train path_to_redWine_data -minutes 5
$ cd path_to_whiteWine_folder
$ java -jar mrgp.jar -train path_to_whiteWine_data -minutes 5 

We first take a look at the retrieved models. The knee model retrieved for the red wine dataset is shown below:

$ cd path_to_redWine_folder
$ less knee.txt
3.0,9.0,-0.018684985640165904 0.7856738123404483 0.0 0.404478545644783 -0.2563039645099639 -4.280925552801179 -0.7488644748880113 2.148886734420814 -1.520366967573291 0.0 0.005112032510469252 0.0 0.0 -1.1578273566729859E-4 -0.05863301877797394 1.1171743703547572 -0.04053217931007253 5.2359729196524355E-5 -0.1601368617645158 0.25813312172125863 -0.44588645508144964 -0.026404070416446022 6.1059026099287436E-6 2.3404728567991634E-6 -4.857116759242476E-8,-1.7954608089032638,(- (square (mylog X6)) (* (+ X10 (mylog (mydivide (sin X2) X1))) (mydivide (cos (sqrt X11)) (mydivide (sqrt (square (* (sin X2) X11))) (exp X4)))))

The model files contain:

  1. the minimum target value seen in the training set,
  2. the maximum target value seen in the training set,
  3. the weights associated with the MRGP model,
  4. the intercept of the model,
  5. the prefix representation of the expression tree.
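Since the weights are space-separated within a single comma field and the expression itself contains no commas, the prefix tree is simply the fifth comma-separated field and can be extracted with cut:
$ cut -d, -f5 knee.txt
(- (square (mylog X6)) (* (+ X10 (mylog (mydivide (sin X2) X1))) (mydivide (cos (sqrt X11)) (mydivide (sqrt (square (* (sin X2) X11))) (exp X4)))))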

We now test the most accurate and knee models obtained for the two datasets:

$ cd path_to_redWine_folder
$ java -jar mrgp.jar -test path_to_redWine_data -integer true -scaled mostAccurate.txt
(cos (cube (+ (exp (cube (* (square (cos (sqrt X11))) (quart (sin (* (+ (exp (cube (* (square (cos (sqrt X11))) (quart (sin (+ X11 (- (sin (+ X2 (mylog X11))) (square (exp X10))))))))) (cos (+ (mydivide (mydivide (cos (square (* (exp X10) (sin (* X10 X2))))) (sin X1)) (+ (- (+ (cos (cos (sqrt X10))) (+ X2 (+ (sqrt (mylog (mydivide X2 X9))) X10))) (mydivide X7 X2)) (sqrt (mylog (mydivide X10 X9))))) (cos (sqrt (* X11 (square X8))))))) X2)))))) (cos (+ (mydivide (mydivide (cos (square (* (exp X10) (sin (* X10 X2))))) (sin X1)) (+ (- (+ (cos (cos (sqrt X10))) (+ (+ (exp (+ (exp (cube (* (square (cos (sqrt X11))) (quart (sin (* X10 X2)))))) (cos (+ (mydivide X10 (+ (- (+ (cos (cos (sqrt X10))) X3) (mydivide X7 X2)) X9)) (cos (sqrt (* (* (square X11) (exp (* (mylog (sin X10)) X1))) X7))))))) (cos (+ (mydivide (mydivide (cos (exp X6)) (sin (square X11))) (+ (- (mylog X7) (mydivide X7 X2)) (sqrt (mylog (mydivide X2 X9))))) (cos (sqrt (* (sqrt X10) (square X8))))))) (+ (sqrt (mylog (mydivide X2 X11))) X10))) (mydivide X7 X2)) (sqrt (mylog (mydivide (quart (cos X11)) X9))))) (cos (sqrt (* X11 (square X8)))))))))
MSE: 0.42951389259412937
MAE: 0.49619361751886215
$ java -jar mrgp.jar -test path_to_redWine_data -integer true -scaled knee.txt 
(+ (exp (cube (* (square (cos (sqrt X11))) (quart (sin (* X10 X2)))))) (cos (+ (mydivide X10 (+ (- (+ (cos (cos (sqrt X10))) X3) (mydivide X7 X2)) X9)) (cos (sqrt (* (* (square X11) (exp (* (mylog (sin X10)) X1))) X7))))))
MSE: 0.37600134359393533
MAE: 0.4748088882237069
$ cd path_to_whiteWine_folder
$ java -jar mrgp.jar -test path_to_whiteWine_data -integer true -scaled mostAccurate.txt
(- (square (mylog (mydivide (+ (sin (+ (+ (quart (mydivide (cos (sqrt (+ (- (square X11) X1) (- X5 X1)))) (mydivide (sqrt (sqrt X6)) (exp X4)))) (cube X3)) (mydivide (sin (sqrt X6)) (quart X8)))) (cube (quart X4))) (mylog (sqrt (+ (+ (quart X5) (cube X3)) (mydivide (sin (* (sin (mydivide (* X3 (* (cos X11) (mydivide (mylog (exp X3)) X7))) X11)) X11)) (quart X8)))))))) (* (+ X9 (mylog (mydivide (* (+ X8 (+ (+ (quart X2) (cube X3)) (mydivide (sin (* (sin X2) X11)) (cos (sqrt (+ (- (square X11) X1) X2)))))) (mydivide (cos X11) (mydivide (+ X10 (mylog (mydivide X6 X1))) (exp X4)))) X1))) (mydivide (cos (sqrt X11)) (mydivide (sqrt (square (* (sin X2) X11))) (exp X4)))))
MSE: 0.5154293445397177
MAE: 0.5673233538456594
$ java -jar mrgp.jar -test path_to_whiteWine_data -integer true -scaled knee.txt 
(- (square (mylog X6)) (* (+ X10 (mylog (mydivide (sin X2) X1))) (mydivide (cos (sqrt X11)) (mydivide (sqrt (square (* (sin X2) X11))) (exp X4)))))
MSE: 0.5380295518792321
MAE: 0.5774873048842515

Authors and Contributors

FlexGP is a project of the Any-Scale Learning For All (ALFA) group at MIT.
