GP Learners

SR learner

The current release provides functionality both for performing Symbolic Regression on numerical datasets and for testing the retrieved models. This page provides a quick tutorial on how to get started with the SR learner.

Tutorial

Note: this release is only supported on Debian-based Linux platforms.

Step 1: Data format

Data must be provided in CSV format, where each line corresponds to an exemplar and the target values are placed in the last column. Note that any additional lines or columns containing labels or nominal values must be removed.
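
For instance, a dataset with three explanatory variables and one target in the last column could look as follows (all values here are hypothetical):

0.12,3.40,1.87,5.21
0.98,2.11,0.45,3.77
0.33,4.02,2.60,6.09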

Step 2: Download the sr.jar file from here

Step 3 (optional): Normalize the data

The SR learner provides functionality to normalize the data. The paths to the normalized data and to the bounds of the explanatory variables and targets need to be specified:

$ java -jar sr.jar -normalizeData path_to_data -newPath path_to_normalized_data -pathToBounds path_to_variable_bounds
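
For example, assuming the original data is stored in data/train.csv (all paths here are hypothetical), the call could look like:

$ java -jar sr.jar -normalizeData data/train.csv -newPath data/train_norm.csv -pathToBounds data/train_bounds.txt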

Step 4: Running SR learner

In the current release, the SR learner can only be run directly from the terminal (a Matlab wrapper will be included soon).

Running SR learner from the terminal

Model the data

All you need to provide is the path to your dataset (normalized or original) and the optimization time in minutes:

$ java -jar sr.jar -train path_to_your_data -minutes 10 
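
Since the test step described below is run from the folder in which training took place, it can be convenient to launch each run from a dedicated run folder (folder and file names here are hypothetical):

$ mkdir run_folder && cd run_folder
$ java -jar ../sr.jar -train ../data/train_norm.csv -minutes 10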

At the end of the run, a set of files is generated:

  1. pareto.txt: models forming the Pareto Front (accuracy vs model complexity).

  2. leastComplex.txt: least complex model of the Pareto Front.

  3. mostAccurate.txt: most accurate model of the Pareto Front.

  4. knee.txt: model at the knee of the Pareto Front.

  5. bestModelGeneration.txt: most accurate model per generation.

  6. fusedModel.txt: fused model of the Pareto Front obtained with Adaptive Regression by Mixing (see Yuhong Yang, "Adaptive Regression by Mixing," Journal of the American Statistical Association, Vol. 96, No. 454, Jun. 2001, pp. 574-588).

Test the models

The SR learner provides functionality to compute the Mean Squared Error (MSE) and Mean Absolute Error (MAE) of the retrieved models once training is finished. To automatically test all the generated models, type:

$ cd run_folder
$ java -jar sr.jar -test path_to_your_data

If the data has been normalized, the paths to the stored min and max bounds of both the training and test sets must be specified in order to obtain predictions in the original scale:

$ cd run_folder
$ java -jar sr.jar -normTest path_to_norm_test_data -pathToTrainBounds path_to_tr_bounds -pathToTestBounds path_to_test_bounds
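
Putting the steps together, a complete run on normalized data could look as follows; all paths are hypothetical, and we assume the test set is normalized with the same -normalizeData command used for the training set:

$ java -jar sr.jar -normalizeData data/train.csv -newPath data/train_norm.csv -pathToBounds data/train_bounds.txt
$ java -jar sr.jar -normalizeData data/test.csv -newPath data/test_norm.csv -pathToBounds data/test_bounds.txt
$ mkdir run_folder && cd run_folder
$ java -jar ../sr.jar -train ../data/train_norm.csv -minutes 10
$ java -jar ../sr.jar -normTest ../data/test_norm.csv -pathToTrainBounds ../data/train_bounds.txt -pathToTestBounds ../data/test_bounds.txt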

Running SR learner from Matlab

To be done

Bells and whistles

1) Speeding up your runs with optimized C++ execution

This option requires the installation of the gcc and g++ compilers and the configuration of the Linux kernel parameter governing the maximum size of shared memory segments:

$ sudo apt-get install gcc
$ sudo apt-get install g++

Modify the Linux kernel parameter governing the maximum shared memory segment size so that it is at least as large as the data being analyzed. In the next example we set it to 2 GB; the value is piped through sudo tee because a plain sudo echo would perform the redirection without root privileges:

$ echo 2147483648 | sudo tee /proc/sys/kernel/shmmax
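
Note that this setting does not survive a reboot. To make it persistent, the value can also be registered in /etc/sysctl.conf, a standard Linux mechanism that is independent of the SR learner:

$ echo "kernel.shmmax = 2147483648" | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p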

To benefit from the optimized C++ execution, append the -cpp flag followed by the number of CPU threads that will be employed to speed up FlexGP (4 in the example below). Additionally, it is necessary to create an auxiliary default folder in which temporary C++ files will be generated:

$ mkdir tempFiles
$ java -jar sr.jar -train path_to_your_data -minutes 10 -cpp 4

2) Speeding up your runs with CUDA

This option requires the installation of the gcc, g++, and nvcc compilers and the configuration of the Linux kernel parameter governing the maximum size of shared memory segments (see the previous section). To benefit from the CUDA execution, append the -cuda flag. Additionally, it is necessary to create an auxiliary default folder in which temporary CUDA (.cu) files will be generated:

$ mkdir tempFiles
$ java -jar sr.jar -train path_to_your_data -minutes 10 -cuda
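
The nvcc compiler ships with the NVIDIA CUDA Toolkit. On Debian-based systems it can typically be installed and verified as follows (the package name may vary by release):

$ sudo apt-get install nvidia-cuda-toolkit
$ nvcc --version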

3) Change the default parameters

To modify the default parameters of the SR learner, append the -properties flag followed by the path to a properties file containing the desired parameters:

$ java -jar sr.jar -train path_to_your_data -minutes 10 -cpp 4 -properties path_to_props_file

The following properties file example specifies the population size, the features that will be considered during the learning process, the functions employed to generate GP trees, the tournament selection size, and the mutation rate:

pop_size = 1000
terminal_set = X1 X3 X7 X8 X9 X12 X13 X17
function_set = + - * mydivide exp mylog sqrt square cube quart sin cos
tourney_size = 10
mutation_rate = 0.1

Examples

For example runs and reports, visit our blog: FlexGP Blog

Authors and Contributors

FlexGP is a project of the Any-Scale Learning For All (ALFA) group at MIT.
