API reference

The package currently offers one regression model, SymbolicRegressor.

class sblearn.models.SymbolicRegressor(population_size=10000, n_iter=20, sampling_rate=0.15, mutation_chance=0.3, elitism=True, adjust_elites=True, adjust_range=0.15, replacement_rate=0, max_depth='auto', simpler_is_better=False, constants_range='auto', verbose=0, random_state=None, n_jobs=-1)

Bases: BaseEstimator, RegressorMixin

A symbolic regression estimator built on sklearn’s BaseEstimator and RegressorMixin classes. As such, it supports everything any other sklearn regression estimator does (grid search with GridSearchCV, pipelines, and so on).

Symbolic regression is a type of regression model that combines elementary mathematical building blocks to find the function that best fits the data. Here, each candidate function is represented as a binary tree.
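As an illustration of the binary-tree representation, here is a minimal sketch of an expression tree and its evaluation. This is not sblearn's internal representation; the `Node` class and operator set are assumptions chosen for clarity.

```python
import operator

# Supported binary operators for this sketch (sblearn's actual set may differ).
OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

class Node:
    """A node holding an operator symbol, a feature name, or a constant."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

    def evaluate(self, features):
        if self.value in OPS:  # internal node: apply the operator to both children
            return OPS[self.value](self.left.evaluate(features),
                                   self.right.evaluate(features))
        if isinstance(self.value, str):  # leaf: feature reference such as 'x0'
            return features[self.value]
        return self.value                # leaf: numeric constant

# The function (x0 * 2.0) + 3.0 as a binary tree:
tree = Node('+', Node('*', Node('x0'), Node(2.0)), Node(3.0))
print(tree.evaluate({'x0': 5.0}))  # 13.0
```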

Parameters
population_size (int, default=10000):

The size of the function population used in each generation of the evolutionary algorithm. Increasing it often yields better results, but also significantly increases training time.

n_iter (int, default=20):

The number of successive generations to run. Increasing it sometimes improves performance, but also increases training time.

sampling_rate (float, default=0.15, value in (0, 1]):

In order to improve the model’s generalization capabilities, each generation is trained on a random subset of the training dataset only. The sampling rate defines how much of the whole set is used for training at each generation. While it is generally not advised to change this value, decreasing it might be a good idea if your model is overfitting.
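The per-generation subsampling described above can be sketched as follows. `draw_generation_sample` is a hypothetical helper written for illustration, not part of sblearn's API.

```python
import random

def draw_generation_sample(n_samples, sampling_rate, rng):
    """Return the row indices of one generation's random training subset."""
    k = max(1, int(round(n_samples * sampling_rate)))
    return rng.sample(range(n_samples), k)  # k distinct indices, no repeats

rng = random.Random(0)
idx = draw_generation_sample(1000, 0.15, rng)
print(len(idx))  # 150: with sampling_rate=0.15, each generation sees 15% of the rows
```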

mutation_chance (float, default=0.3, value in (0, 1]):

The proportion of function trees that mutate at each generation. It is advised to keep the default value.

elitism (bool, default=True):

If set to True, the best functions in a generation are always carried over as-is to the next generation. The functions defined as “elites” are the best 2%.

Note: if set to False, the adjust_elites and adjust_range parameters are disabled.
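The elitism step can be sketched as below. `select_elites` and the lower-is-better fitness convention are assumptions for illustration, not sblearn's actual implementation.

```python
def select_elites(population, fitnesses, elite_fraction=0.02):
    """Return the top elite_fraction of the population (lower fitness is better)."""
    n_elites = max(1, int(len(population) * elite_fraction))
    ranked = sorted(zip(fitnesses, population), key=lambda pair: pair[0])
    return [individual for _, individual in ranked[:n_elites]]

# Toy population of 100 named functions; f50 has the best (lowest) fitness.
population = [f"f{i}" for i in range(100)]
fitnesses = [abs(50 - i) for i in range(100)]
print(select_elites(population, fitnesses))  # the best 2% (2 functions) survive as-is
```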

adjust_elites (bool, default=True):

If set to True, new functions are created from the elite functions at each generation. Each new function is an elite multiplied by 1 + a random float drawn from (-adjust_range, adjust_range). This helps avoid getting stuck in a local optimum by applying small changes to the most promising functions.

adjust_range (float, default=0.15, value in (0, 1]):

Sets the range from which the random adjustment coefficients can be drawn.
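The adjustment step can be sketched as follows: each adjusted copy of an elite is scaled by 1 + u, with u drawn uniformly from (-adjust_range, adjust_range). `adjust_elite` is a hypothetical helper, not sblearn's actual code.

```python
import random

def adjust_elite(elite_fn, adjust_range, rng):
    """Return a new function: the elite scaled by a small random factor."""
    factor = 1.0 + rng.uniform(-adjust_range, adjust_range)
    return lambda x: factor * elite_fn(x)

rng = random.Random(42)
elite = lambda x: 2.0 * x + 1.0
adjusted = adjust_elite(elite, 0.15, rng)

# With adjust_range=0.15, the adjusted output stays within ±15% of the elite's.
print(abs(adjusted(10.0) / elite(10.0) - 1.0) <= 0.15)  # True
```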

replacement_rate (float, default=0, value in [0, 1]):

Sets the proportion of functions that are replaced by new, randomly generated functions at each generation. Elite functions cannot be replaced.

max_depth (int/str, default=’auto’):

Sets the maximum depth function trees can have. A higher max depth means longer training and can lead to overfitting, so it is better to keep this parameter low. The default value ‘auto’ sets the max depth to the number of features + 2.

simpler_is_better (bool, default=False):

If set to True, the fitness function used during training takes the function’s complexity into account: more complex functions are penalized in proportion to their complexity via a parsimony coefficient. This keeps the fitted function easily readable and can sometimes prevent overfitting by avoiding bloat, but it reduces performance in most situations.
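A parsimony-penalized fitness of the kind simpler_is_better enables can be sketched as below. The node-count complexity measure and the coefficient value are assumptions, not sblearn's documented internals.

```python
def penalized_fitness(mse, n_nodes, parsimony_coef=0.01):
    """Lower is better: prediction error plus a complexity penalty."""
    return mse + parsimony_coef * n_nodes

# A slightly less accurate but much simpler function wins under the penalty:
simple = penalized_fitness(mse=1.00, n_nodes=5)     # 1.00 + 0.01*5  = 1.05
complex_ = penalized_fitness(mse=0.98, n_nodes=40)  # 0.98 + 0.01*40 = 1.38
print(simple < complex_)  # True
```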

constants_range (str/tuple/list, default=’auto’):

Defines the range from which the random constants used in function trees are drawn. It is best to generate values of the same order of magnitude as the data in the dataset. The ‘auto’ value uses the range [p1, p2], where p1 and p2 are the 5th and 95th percentiles of all feature values taken together.
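The ‘auto’ computation described above can be sketched as follows; `auto_constants_range` is a hypothetical helper, not part of sblearn's API.

```python
import numpy as np

def auto_constants_range(X):
    """Return (p5, p95) over all feature values pooled together."""
    flat = np.asarray(X, dtype=float).ravel()  # pool every feature column
    return float(np.percentile(flat, 5)), float(np.percentile(flat, 95))

X = np.arange(101).reshape(-1, 1)  # toy feature matrix with values 0..100
print(auto_constants_range(X))  # (5.0, 95.0)
```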

verbose (int, default=0):

Defines how much information is displayed during training.

  • 0: nothing is displayed

  • 1: the fitted functions’ simplified expressions are displayed at the end of training

  • 2: average fitness is displayed for each generation

random_state (int/None, default=None):

Sets the random seed to use for reproducibility.

n_jobs (int, default=-1):

The number of cores to use in parallel during training. If set to -1, all available cores are used.

Attributes:
formulas (list):

A list of the simplified math expressions estimated for each target value stored as strings.

trees (list):

A list of tree representations stored as strings for each target value.

Example:

>>> from sblearn.models import SymbolicRegressor
>>> model = SymbolicRegressor()
>>> model.fit(X_train, y_train)
>>> print(model.formulas)
['y0 = 21.291046142578125*x0 + 47.842154502868652']
fit(X, y)

Trains the model on the data provided and returns the fitted SymbolicRegressor instance.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data.

  • y (array-like of shape (n_samples,) or (n_samples, n_targets)) – Target values.

Returns:

self

Return type:

SymbolicRegressor

predict(X)

Predicts target values using the fitted model.

Parameters:

X (array-like of shape (n_samples, n_features)) – Samples to use to make predictions.

Returns:

y_pred – Predicted values, one per sample (one row per sample when there are multiple targets).

score(X, y)

Returns the coefficient of determination (R²) of the prediction.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_targets)) – True values for X.