User guide: Using PyDL8.5

Optimal Decision Trees

This library implements the DL8.5 algorithm for learning optimal binary decision trees and provides a Python interface for this algorithm.

One example of decision trees are classification trees. Classification trees are predictors in which the predictions correspond to class labels. A tree is considered optimal on training data if no other tree can be found that scores better on the given training data.

An explicit aim of this library is to make it easy to specify and solve many different types of decision tree learning problems, including, but not limited to, classification trees.

Decision trees are traditionally learned using heuristic algorithms, such as CART and C4.5. However, due to the heuristic nature of these algorithms, the trees learned using them can be larger than necessary; this may make the resulting trees less interpretable. Trees found by DL8.5 are optimal under constraints that aim to make the resulting trees both interpretable and accurate.

Moreover, given that in DL8.5 it is not necessary to specify a heuristic, solving learning problems other than classification potentially becomes simpler.

Please note that trees that are accurate on training data may not always perform well on test data. To avoid problems with overfitting, it is recommended to run DL8.5 using carefully chosen constraints, as specified below. Moreover, finding an optimal decision tree is a hard search problem, and requires more computational resources than heuristic approaches. However, a recent study has shown that DL8.5’s performance on this search problem is much better than that of competing methods.

Classifiers

Decision tree classifiers are learned using the class DL85Classifier. DL85Classifier is compatible with scikit-learn and can be used like any other scikit-learn classifier. It inherits from sklearn.base.BaseEstimator and reimplements the methods fit and predict.

The following code illustrates how to use DL8.5 in its most basic setting:

import numpy as np
from sklearn.model_selection import train_test_split
from dl85 import DL85Classifier

# Load a space-separated dataset; the first column holds the class labels.
dataset = np.genfromtxt("anneal.txt", delimiter=" ")
X = dataset[:, 1:]
y = dataset[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = DL85Classifier(max_depth=3, min_sup=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In this example, we use numpy to read a dataset and scikit-learn to split this dataset into training and test data. Subsequently, the DL85Classifier is initialized with the recommended parameters; a decision tree is learned from the training data and then applied to the test data.

  • when the fit(X,y) method is executed, an optimal decision tree classifier is learned from X and y, where X is a set of Boolean training samples and y is the vector of target values; the resulting tree is stored in the DL85Classifier object. For more information on how the results of the learning algorithm are stored, please check the API documentation.

  • when the predict(X) method is executed, predictions will be computed for the Boolean test samples X using the tree learned during the execution of fit. The output corresponds to a list of predicted classes for all the samples.

Parameters of the learning process need to be specified during the construction of the DL85Classifier object. The complete list of parameters can be found in the API documentation. We highly recommend specifying the parameters of the following constraints, as their default values are not useful in many cases (a sketch showing how these parameters can be tuned follows the list):

  • max_depth, which specifies the maximum depth of the tree to be learned; low values will ensure that more interpretable trees are found; high values may lead to better training set accuracy, but can also lead to overfitting;

  • min_sup, which specifies the minimum number of examples in the training data that every leaf should cover; low values may lead to predictions that are based on too small amounts of data.
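
Because DL85Classifier follows the scikit-learn estimator conventions, these constraints can, for instance, be tuned with standard model selection tools. The following is a minimal sketch using GridSearchCV, assuming X_train and y_train from the earlier example; the grid values and number of folds are illustrative, not prescriptive:

from sklearn.model_selection import GridSearchCV
from dl85 import DL85Classifier

# Illustrative grid; suitable values depend on the dataset.
param_grid = {"max_depth": [2, 3, 4], "min_sup": [1, 5, 10]}
search = GridSearchCV(DL85Classifier(), param_grid=param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)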

Other parameters that may be useful to tune are:

  • time_limit, which indicates the maximum amount of time the algorithm is allowed to run; the algorithm will be interrupted when the runtime is exceeded, and the best tree found within the allocated time will be returned. The default value is 0, in which case no limit on runtime is imposed.

  • max_error, which will direct the search algorithm to only find trees with an error lower than max_error. For instance, if a decision tree has already been found using another algorithm (such as a heuristic algorithm), specifying this parameter directs DL8.5 to only find trees that are better than the tree found by this other algorithm; a sketch of this pattern follows the list.
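
For example, one might first learn a tree with a heuristic learner and use its training error as an upper bound for DL8.5. The following is a minimal sketch, assuming X_train and y_train as in the earlier example; the use of scikit-learn's DecisionTreeClassifier and the 60-second time limit are illustrative:

from sklearn.tree import DecisionTreeClassifier
from dl85 import DL85Classifier

# Train a heuristic tree and count its training errors.
heuristic = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
heuristic_error = int((heuristic.predict(X_train) != y_train).sum())

# Ask DL8.5 for a strictly better tree, with a cap on the runtime.
clf = DL85Classifier(max_depth=3, min_sup=5, max_error=heuristic_error, time_limit=60)
clf.fit(X_train, y_train)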

Note that DL8.5 currently only works on Boolean data; if the input data is not Boolean, it has to be made Boolean first.
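
One possible approach, sketched below, is to discretize numeric features and one-hot encode the resulting bins using scikit-learn's KBinsDiscretizer; the data and the number of bins here are illustrative:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Illustrative numeric data: three samples with two numeric features.
X_numeric = np.array([[1.2, 7.0], [3.4, 2.0], [0.5, 9.9]])

# Discretize each feature into bins and one-hot encode the bins,
# which yields a 0/1 matrix that DL8.5 can consume.
discretizer = KBinsDiscretizer(n_bins=2, encode="onehot-dense", strategy="quantile")
X_bool = discretizer.fit_transform(X_numeric)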

The following example illustrates how to use our classifier within a scikit-learn pipeline to learn an optimal decision tree classifier:

>>> import numpy as np
>>> from sklearn.pipeline import make_pipeline
>>> from dl85 import DL85Classifier
>>> dataset = np.genfromtxt("binary_dataset.txt", delimiter=' ')
>>> X = dataset[:, 1:]
>>> y = dataset[:, 0]
>>> pipe = make_pipeline(DL85Classifier())
>>> pipe.fit(X, y)  
Pipeline(...)

Then, you can call predict to classify these examples:

>>> pipe.predict(X)  
array([...])

DL85Classifier also inherits from sklearn.base.ClassifierMixin. This allows the use of the score method to calculate the accuracy of the classifier:

>>> pipe.score(X, y)  
0...

Other predictors

Classification trees are one example of decision trees. In their more general form, decision trees may also predict other structures in their leaves. To support such other learning tasks, the DL85Predictor class is provided. In contrast to the DL85Classifier class, the DL85Predictor class does not require the specification of a vector y of class labels in the fit function, and allows for the specification of optimisation criteria other than error.

An example of another type of decision tree is the Predictive Clustering tree. In a Predictive Clustering tree, the leaves of the tree correspond to clusters in the unlabeled training data. The quality of the tree is determined by the quality of the clusters in its leaves. Standard measures can be used to evaluate the quality of the clusters, such as the within-cluster sum of squares. The predictions in the leaves of the tree then correspond to the centroids of the clusters.

Using DL8.5’s DL85Predictor class, this clustering task can be solved by specifying an error function that evaluates the quality of the clusters in the leaves. The full code is given below:

import numpy as np
from sklearn.metrics import DistanceMetric  # sklearn.neighbors.DistanceMetric in older scikit-learn versions
from dl85 import DL85Predictor

dataset = np.genfromtxt("../datasets/anneal.txt", delimiter=' ')
X = dataset[:, 1:]
X = X.astype('int32')

eucl_dist = DistanceMetric.get_metric('euclidean')

def error(tids):
    # Examples that would fall into this candidate leaf.
    X_subset = X[list(tids), :]
    # Within-cluster sum of Euclidean distances to the centroid.
    centroid = np.mean(X_subset, axis=0)
    distances = eucl_dist.pairwise(X_subset, [centroid])
    return float(distances.sum())

def leaf_value(tids):
    # The prediction of a leaf is the centroid of its examples.
    return np.mean(X[list(tids), :], axis=0)

clf = DL85Predictor(max_depth=3, min_sup=5, error_function=error, leaf_value_function=leaf_value, time_limit=600)

clf.fit(X)
predicted = clf.predict(X)

The error function in this example has one argument, tids. The DL85Predictor class will call this function for each candidate leaf, where tids lists the identifiers of the training examples that would be part of that leaf. The error function in this example calculates the centroid (mean) of the training examples in this list, and then calculates the Euclidean distance of each example in the list to this centroid. The sum of these distances is returned as the score for the candidate leaf.

The DL85Predictor class is initialized with the function that needs to be called to evaluate the quality of the leaves.

Other tree learning tasks can be specified by providing an alternative implementation of the error function. Note that in this example, the fit function is called on the matrix X, and the error function also operates on the matrix X. This is not necessary; the only requirement on the error function is that, for a given list of row identifiers (referring to rows of the matrix X), it returns a quality score.

In this example, we call the predict function. For each example passed to the predict function, DL85Predictor traverses the tree to determine the prediction specified in the corresponding leaf of the tree. This prediction is provided by the leaf_value function, which is called at the end of the training process to fill in the predictions in the leaves. This function also receives a list of identifiers of examples in the training data X from which to calculate the prediction. In this example, the prediction corresponds to the centroid of the examples in the leaf.

In principle, classification trees can also be learned using the DL85Predictor class. The following error function can be used:

def error(tids):
    # Class distribution of the examples that would fall into this leaf.
    classes, supports = np.unique(y.take(list(tids)), return_counts=True)
    maxindex = np.argmax(supports)
    # Misclassification error: all examples outside the majority class.
    return sum(supports) - supports[maxindex]

Here y consists of the labels of the examples in X. We use standard NumPy functions to count the number of examples in each class, determine the majority class and finally calculate the error based on this.
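
To complete the picture, this error function can be combined with a leaf_value function that returns the majority class. The following is a minimal sketch, assuming X and y have been loaded as before:

def leaf_value(tids):
    # Predict the majority class among the examples in the leaf.
    classes, supports = np.unique(y.take(list(tids)), return_counts=True)
    return classes[np.argmax(supports)]

clf = DL85Predictor(max_depth=3, min_sup=5, error_function=error,
                    leaf_value_function=leaf_value)
clf.fit(X)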

However, learning classification trees in this manner is in practice slower than using the DL85Classifier class, as the DL85Classifier class calculates the error using optimized code written in C++ instead of Python.

For supervised data with class labels, a supplementary interface is provided for writing error functions, illustrated in this example:

def error(sup_iter):
    # sup_iter iterates over the number of examples of each class in the leaf.
    supports = list(sup_iter)
    maxindex = np.argmax(supports)
    # Return the misclassification error and the index of the predicted class.
    return sum(supports) - supports[maxindex], maxindex

clf = DL85Classifier(max_depth=2, fast_error_function=error, time_limit=600)

In this example, a fast_error_function is specified. If this function is specified, DL85Classifier will call the user-specified function with an iterator over the number of examples of each class as its argument.

The advantage of this variation is that the calculation of the class distribution is done using optimized C++ code; the Python code does not have to traverse the data. Only the final calculation of the score is done in Python. This functionality is useful, for instance, if a different weight should be given to each class (see the sketch below).
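
As an illustration, the following is a minimal sketch of a class-weighted error function; the weights array is hypothetical and should contain one entry per class, in class-index order:

import numpy as np
from dl85 import DL85Classifier

# Hypothetical per-class weights (one entry per class, in class-index order).
weights = np.array([1.0, 5.0])

def weighted_error(sup_iter):
    supports = np.array(list(sup_iter))
    weighted = supports * weights        # weigh the examples of each class
    maxindex = int(np.argmax(weighted))  # predict the class with the largest weighted support
    # Weighted number of examples that fall outside the predicted class.
    return float(weighted.sum() - weighted[maxindex]), maxindex

clf = DL85Classifier(max_depth=2, fast_error_function=weighted_error)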

Finally, we provide a built-in implementation of predictive clustering in the DL85Cluster class. Using this class, the user does not have to write the example code given above.
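
A minimal usage sketch follows, assuming that DL85Cluster accepts the same constraint parameters as DL85Predictor; please check the API documentation for the exact signature:

from dl85 import DL85Cluster

# Hedged sketch: constraint parameters assumed to mirror DL85Predictor.
clf = DL85Cluster(max_depth=3, min_sup=5)
clf.fit(X)
predicted = clf.predict(X)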