GP Function classifier

This novel classification approach evolves genetic programming (GP) functions to tackle binary classification problems. Given a set of explanatory variables $X$, we first search for a nonlinear function $y=f(X)$ such that the distributions $p(f(X) \mid H_0)$ and $p(f(X) \mid H_1)$ are best separated.
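As an illustration of what "best separated" can mean, the sketch below computes the area under the ROC curve (AUC) from the outputs of a candidate function on each class. This text does not state which separation measure drives the search, so treating AUC as that measure is an assumption; AUC is, however, the criterion reported for the bestCrossValidation.txt model. The function name auc and the sample values are hypothetical.

```python
from typing import Sequence


def auc(outputs_h0: Sequence[float], outputs_h1: Sequence[float]) -> float:
    """Probability that a class 1 output exceeds a class 0 output (ties count half).

    This is the rank-based (Mann-Whitney) form of the area under the ROC curve.
    A value of 1.0 means the two output distributions do not overlap.
    """
    if not outputs_h0 or not outputs_h1:
        raise ValueError("both classes need at least one output")
    wins = 0.0
    for pos in outputs_h1:
        for neg in outputs_h0:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(outputs_h0) * len(outputs_h1))


if __name__ == "__main__":
    # Hypothetical outputs f(X) of one candidate function on each class.
    print(auc([0.1, 0.4, 0.35], [0.8, 0.4, 0.9]))
```

The pairwise loop is quadratic in the number of exemplars, which is fine for a small illustration but not for large datasets.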

After learning the function, a threshold $\lambda$ is used to build the decision rule $L_i = \begin{cases} 1, & \text{if } y_i \geq \lambda \\ 0, & \text{if } y_i < \lambda \end{cases}$ that determines whether a given output represents a class 0 or class 1 prediction.
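The decision rule can be sketched in a few lines of Python. The function name apply_threshold and the sample outputs are hypothetical; only the rule itself (label 1 when the output is greater than or equal to the threshold, 0 otherwise) comes from the description above.

```python
from typing import List, Sequence


def apply_threshold(outputs: Sequence[float], lam: float) -> List[int]:
    """Map each model output y_i to label 1 if y_i >= lam, else 0."""
    return [1 if y >= lam else 0 for y in outputs]


if __name__ == "__main__":
    # Hypothetical outputs y_i = f(x_i) of a learned function.
    y = [-0.7, 0.2, 1.5, 0.2]
    print(apply_threshold(y, lam=0.2))  # prints [0, 1, 1, 1]
```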

It is possible to use disjoint sets of data (training and validation sets) for the first and second steps. For further details of the GP Function classifier, the reader is referred to this paper:

Ignacio Arnaldo, Kalyan Veeramachaneni, Una-May O'Reilly: Building Multiclass Nonlinear Classifiers with GPUs. Big Learning Workshop at NIPS: Advances in Algorithms and Data Management, 2013.

and:

Ignacio Arnaldo, Kalyan Veeramachaneni, Andrew Song, Una-May O’Reilly: Bring Your Own Learner! A cloud-based, data-parallel commons for Machine Learning. IEEE Computational Intelligence Magazine, vol. 10, no. 1, pp. 20-32, Feb. 2015.

Tutorial

The current release provides functionality both for performing binary classification on numerical datasets and for testing the retrieved classifiers. On this page we provide a quick tutorial on how to get started with the GP Function classifier.

Note: this release is only supported on Debian Linux platforms.

Step 1: Data format

Data must be provided in CSV format, where each line corresponds to an exemplar and the target value is placed in the last column. Any additional line or column containing labels or nominal values must be removed.
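A minimal sketch of one way to bring a CSV file into this shape is shown below. It is not part of the release: the function names are hypothetical, and the heuristics (a first line with no numeric cell is a header; a column is dropped if any of its values is not a number) are assumptions that may not fit every dataset.

```python
import csv
from typing import List


def is_number(text: str) -> bool:
    """Return True if the cell parses as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def clean_rows(rows: List[List[str]]) -> List[List[str]]:
    """Drop a label header line and every column holding non-numeric values."""
    rows = [r for r in rows if r]  # skip empty lines
    # Assumption: a first line without any numeric cell is a header of labels.
    if rows and not any(is_number(c) for c in rows[0]):
        rows = rows[1:]
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all lines must have the same number of columns")
    keep = [j for j in range(width) if all(is_number(r[j]) for r in rows)]
    # The target lives in the last column, so it must survive the filtering.
    if width - 1 not in keep:
        raise ValueError("the target column (last column) must be numeric")
    return [[r[j] for j in keep] for r in rows]


def clean_csv(src_path: str, dst_path: str) -> int:
    """Write the cleaned copy of src_path to dst_path; return the rows kept."""
    with open(src_path, newline="") as src:
        rows = clean_rows(list(csv.reader(src)))
    with open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(rows)
    return len(rows)
```

Note that numeric identifier columns are kept by this sketch, so they still need to be removed by hand if they should not act as features.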

Step 2: Download the gpfunction.jar file from here

Step 3: Configuring the C++ environment

This learner requires the installation of the gcc and g++ compilers and the configuration of the Linux kernel parameter governing the maximum size of shared memory segments:

$ sudo apt-get install gcc
$ sudo apt-get install g++

Modify the Linux kernel parameter governing the maximum shared memory segment size so that it is at least as large as the data being analyzed. In the next example we set it to 2 GB:

$ echo 2147483648 | sudo tee /proc/sys/kernel/shmmax

or modify it manually:

$ sudo nano /proc/sys/kernel/shmmax
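Before launching a run, the current limit can be compared against the dataset. The sketch below is a hypothetical helper, not part of the release; it uses the size of the CSV file on disk as a rough proxy for "the data being analyzed", which is an assumption, since the in-memory representation may be larger or smaller.

```python
import os

# Kernel parameter file referenced in the commands above.
SHMMAX_PATH = "/proc/sys/kernel/shmmax"


def shmmax_is_sufficient(data_path: str, shmmax_path: str = SHMMAX_PATH) -> bool:
    """Return True if the shared memory limit is at least the file size in bytes."""
    with open(shmmax_path) as f:
        shmmax = int(f.read().strip())
    return shmmax >= os.path.getsize(data_path)


if __name__ == "__main__":
    import sys

    # Usage: python check_shmmax.py path_to_train_data
    ok = shmmax_is_sufficient(sys.argv[1])
    print("shmmax is large enough" if ok else "increase shmmax before training")
```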

Step 4: Training GP Function classifiers

In the current release, it is only possible to train the GP Function classifier directly from your terminal.

Running GP Function classifier from the terminal

Model the data

First, it is necessary to create an auxiliary folder in which temporary C++ files will be generated. The parameters required to train the classifier are the path to your dataset, the path to the validation set, and the optimization time:

$ mkdir tempFiles
$ java -jar gpfunction.jar -train path_to_train_data -cv path_to_validation_data -minutes 10

At the end of the run, a set of files is generated:

bestCrossValidation.txt: model with the highest area under the ROC curve when evaluated on the validation set.
pareto.txt: models forming the Pareto Front (accuracy vs model complexity).
leastComplex.txt: least complex model of the Pareto Front.
mostAccurate.txt: most accurate model of the Pareto Front.
knee.txt: model at the knee of the Pareto Front.
bestModelGeneration.txt: most accurate model per generation.
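To make the relation between these files concrete, the sketch below filters a list of candidate models down to its Pareto front (higher accuracy is better, lower complexity is better) and picks the two extremes. The tuples, model names, and numbers are hypothetical; the format of the generated files is not described here, and the criterion used to pick the knee model is not stated, so the knee is left out.

```python
from typing import List, Tuple

# (name, accuracy, complexity): a hypothetical in-memory representation.
Model = Tuple[str, float, int]


def pareto_front(models: List[Model]) -> List[Model]:
    """Keep the models that no other model beats on both objectives.

    A model is dominated if another one is at least as accurate and at
    least as simple, and strictly better on one of the two objectives.
    """
    front = []
    for name, acc, cx in models:
        dominated = any(
            a >= acc and c <= cx and (a > acc or c < cx)
            for _, a, c in models
        )
        if not dominated:
            front.append((name, acc, cx))
    return front


if __name__ == "__main__":
    candidates = [("m1", 0.80, 3), ("m2", 0.85, 7), ("m3", 0.82, 9), ("m4", 0.90, 15)]
    front = pareto_front(candidates)
    least_complex = min(front, key=lambda m: m[2])
    most_accurate = max(front, key=lambda m: m[1])
    print(front, least_complex, most_accurate)
```

In this example m3 is dropped because m2 is both more accurate and less complex.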

Test the models

The GP Function learner provides functionality to obtain the accuracy, precision, recall, F-score, false positive rate, and false negative rate of the retrieved classifiers once the training is finished. To automatically test all the generated classifiers, type:

$ cd run_folder
$ java -jar gpfunction.jar -test path_to_test_data
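For reference, the reported measures follow from the four confusion-matrix counts. The sketch below uses the standard definitions, with class 1 as the positive class and the F-score taken as the F1 score; both choices are assumptions about the tool, and the function name is hypothetical.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute the standard binary metrics, treating label 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Report 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f_score": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }


if __name__ == "__main__":
    # Hypothetical true labels and predictions for eight test exemplars.
    print(classification_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 1, 0, 0, 1]))
```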

Bells and whistles

1) Specify the number of CPU threads

To speed up the training process, append the -cpp flag followed by the number of CPU threads that will be employed to evaluate the candidate solutions (4 in the example below):

$ java -jar gpfunction.jar -train path_to_train_data -cv path_to_cv_data -minutes 10 -cpp 4

2) Speeding up your runs with CUDA

This option requires the installation of the gcc, g++, and nvcc compilers and the configuration of the Linux kernel parameter governing the maximum size of shared memory segments. To benefit from the CUDA execution, append the -cuda flag. It is also necessary to create an auxiliary default folder in which temporary CUDA (.cu) files will be generated:

$ mkdir tempFiles
$ java -jar gpfunction.jar -train path_to_your_data -minutes 10 -cuda

3) Change the default parameters

To modify the default parameters of the GP Function learner, it is necessary to append the flag -properties followed by the path of the properties file containing the desired parameters:

$ java -jar gpfunction.jar -train path_to_train_data -cv path_to_cv_data -minutes 10 -cpp 4 -properties path_to_props_file

The following properties file example specifies the population size, the features that will be considered during the learning process, the functions employed to generate GP trees, the tournament selection size, and the mutation rate.

pop_size = 1000
terminal_set = X1 X3 X7 X8 X9 X12 X13 X17
function_set = + - * mydivide exp mylog sqrt square cube quart sin cos
tourney_size = 10
mutation_rate = 0.1
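When many runs with different settings are needed, a properties file in this "key = value" layout can be generated by a script. The sketch below is a hypothetical helper: the keys are the ones shown in the example above, the X1, X2, ... feature naming is inferred from that example, and whether the learner accepts other keys or formatting variations is not covered here.

```python
from typing import Dict, Iterable, List, Union

Value = Union[int, float, str, List[str]]


def feature_names(indices: Iterable[int]) -> List[str]:
    """Build terminal names such as X1, X3 from feature indices."""
    return [f"X{i}" for i in indices]


def write_properties(path: str, params: Dict[str, Value]) -> None:
    """Write one 'key = value' line per parameter; lists are space-separated."""
    with open(path, "w") as f:
        for key, value in params.items():
            if isinstance(value, (list, tuple)):
                value = " ".join(str(v) for v in value)
            f.write(f"{key} = {value}\n")


if __name__ == "__main__":
    write_properties(
        "gp.properties",  # hypothetical file name
        {
            "pop_size": 1000,
            "terminal_set": feature_names([1, 3, 7, 8, 9, 12, 13, 17]),
            "function_set": "+ - * mydivide exp mylog sqrt square cube quart sin cos",
            "tourney_size": 10,
            "mutation_rate": 0.1,
        },
    )
```

The resulting file can then be passed to the learner with the -properties flag described above.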

Examples

To check reports, visit our blog: FlexGP Blog

Authors and Contributors

FlexGP is a project of the Any-Scale Learning For All (ALFA) group at MIT.