User guide: Using PyDL8.5
Optimal Decision Trees
This library implements the DL8.5 algorithm for learning optimal binary decision trees and provides a Python interface for this algorithm.
One example of decision trees is the classification tree. Classification trees are predictors in which the predictions correspond to class labels. A tree is considered optimal on training data if no other tree can be found that scores better on the given training data.
An explicit aim of this library is to make it easy to specify and solve many different types of decision tree learning problems, including, but not limited to, classification trees.
Decision trees are traditionally learned using heuristic algorithms, such as CART and C4.5. However, due to the heuristic nature of these algorithms, the trees learned using them can be larger than necessary; this may make the resulting trees less interpretable. Trees found by DL8.5 are optimal under constraints that aim to make the resulting trees both interpretable and accurate.
Moreover, given that DL8.5 does not require the specification of a heuristic, solving learning problems other than classification potentially becomes simpler.
Please note that trees that are accurate on training data may not always perform well on test data. To avoid overfitting, it is recommended to run DL8.5 with carefully chosen constraints, as specified below. Moreover, finding an optimal decision tree is a hard search problem and requires more computational resources than a heuristic approach. However, a recent study has shown that DL8.5’s performance on this search problem is much better than that of competing methods.
Classifiers
Decision tree classifiers are learned using the class DL85Classifier. DL85Classifier is a scikit-learn compatible classifier and can be used as a scikit-learn classifier. It inherits from sklearn.base.BaseEstimator and reimplements the methods fit and predict.
The following code illustrates how to use DL8.5 in its most basic setting:
import numpy as np
from sklearn.model_selection import train_test_split
from dl85 import DL85Classifier
dataset = np.genfromtxt("anneal.txt", delimiter=" ")  # first column: class label
X = dataset[:, 1:]
y = dataset[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DL85Classifier(max_depth=3, min_sup=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
In this example, we use NumPy to read a dataset and scikit-learn to split this dataset into training and test data. Subsequently, the DL85Classifier is initialized with the recommended parameters; a decision tree is learned from the training data and applied to the test data.
- When the fit(X, y) method is executed, an optimal decision tree classifier is learned from X and y, where X is a set of Boolean training samples and y is the vector of target values; the resulting tree is stored in the DL85Classifier object. For more information on how the results of the learning algorithm are stored, please check the API documentation.
- When the predict(X) method is executed, predictions will be computed for the Boolean test samples X using the tree learned during the execution of fit. The output corresponds to a list of predicted classes for all the samples.
Parameters of the learning process need to be specified during the construction of the DL85Classifier object. The complete list of parameters can be found in the API documentation. We highly recommend specifying the parameters of the following constraints, as their default values are not useful in many cases:

- max_depth, which specifies the maximum depth of the tree to be learned; low values will ensure that more interpretable trees are found; high values may lead to better training set accuracy, but can also lead to overfitting.
- min_sup, which specifies the minimum number of examples in the training data that every leaf should cover; low values may lead to predictions that are based on too small amounts of data.
Other parameters that may be useful to tune are:

- time_limit, which indicates the maximum amount of time the algorithm is allowed to run; the algorithm will be interrupted when the runtime is exceeded, and the best tree found within the allocated time will be returned. The default value is 0, in which case no limit on runtime is imposed.
- max_error, which directs the search algorithm to only find trees with an error lower than max_error. For instance, if a decision tree has already been found using another algorithm (such as a heuristic algorithm), specifying this parameter can direct DL8.5 to only find trees that are better than the tree found using this other algorithm; a sketch of this is given after this list.
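As an illustration of the max_error constraint, the following sketch (not taken from the original examples; the dataset file and parameter values are assumptions) first learns a heuristic tree with scikit-learn and then asks DL8.5 for a strictly better tree:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from dl85 import DL85Classifier

dataset = np.genfromtxt("anneal.txt", delimiter=" ")  # hypothetical dataset file
X = dataset[:, 1:]
y = dataset[:, 0]

# Learn a heuristic tree first and count its errors on the training data.
heuristic = DecisionTreeClassifier(max_depth=3).fit(X, y)
heuristic_error = int(np.sum(heuristic.predict(X) != y))

# Direct DL8.5 to only search for trees that beat the heuristic tree,
# within a time budget of 600 seconds.
clf = DL85Classifier(max_depth=3, min_sup=5, max_error=heuristic_error, time_limit=600)
clf.fit(X, y)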
Note that DL8.5 currently only works on Boolean data; if the input data is not Boolean, it has to be made Boolean first.
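For instance, numerical features can be binarized with standard scikit-learn preprocessing. The following sketch shows one possible approach (the data and bin count are assumptions for illustration): each feature is discretized into bins that are then one-hot encoded, yielding a 0/1 matrix suitable for DL85Classifier.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X_num = np.array([[1.2, 7.0], [3.4, 2.5], [0.5, 9.1], [2.2, 4.3]])  # hypothetical numerical data

# Discretize each feature into bins and one-hot encode the bin membership;
# the result is a Boolean (0/1) matrix that DL8.5 can work with.
binarizer = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="quantile")
X_bool = binarizer.fit_transform(X_num).astype(int)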
The following example illustrates how to use our classifier within a scikit-learn pipeline to learn an optimal decision tree classifier:
>>> import numpy as np
>>> from sklearn.pipeline import make_pipeline
>>> from dl85 import DL85Classifier
>>> dataset = np.genfromtxt("binary_dataset.txt", delimiter=' ')
>>> X = dataset[:, 1:]
>>> y = dataset[:, 0]
>>> pipe = make_pipeline(DL85Classifier())
>>> pipe.fit(X, y)
Pipeline(...)
Then, you can call predict to classify these examples:
>>> pipe.predict(X)
array([...])
DL85Classifier also inherits from sklearn.base.ClassifierMixin. This allows the use of the score method to calculate the accuracy of the classifier:
>>> pipe.score(X, y)
0...
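Since DL85Classifier follows the scikit-learn estimator interface, it can also be combined with scikit-learn’s model selection utilities. The following sketch (an illustration under assumed parameter values, not part of the original examples) evaluates the classifier with 5-fold cross-validation:

>>> from sklearn.model_selection import cross_val_score
>>> scores = cross_val_score(DL85Classifier(max_depth=3, min_sup=5), X, y, cv=5)
>>> scores.shape
(5,)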
Other predictors
Classification trees are one example of decision trees. In their more general form, decision trees may also predict other structures in their leaves. To support such other learning tasks, the DL85Predictor class is provided. In contrast to the DL85Classifier class, the DL85Predictor class does not require the specification of a vector y consisting of class labels in the fit function, and allows for the specification of optimisation criteria other than error.
An example of another type of decision tree is the predictive clustering tree. In a predictive clustering tree, the leaves of the tree correspond to clusters in the unlabeled training data. The quality of the tree is determined by the quality of the clusters in its leaves. Standard measures can be used to evaluate the quality of the clusters, such as the within-cluster sum of squares. The predictions in the leaves of the tree then correspond to the centroids of the clusters.
Using DL8.5’s DL85Predictor class, this clustering task can be solved by specifying an error function that evaluates the quality of the clusters in the leaves. The full code is given below:
import numpy as np
from sklearn.neighbors import DistanceMetric
from dl85 import DL85Predictor
dataset = np.genfromtxt("../datasets/anneal.txt", delimiter=' ')
X = dataset[:, 1:]
X = X.astype('int32')
eucl_dist = DistanceMetric.get_metric('euclidean')
def error(tids):
    # Score a candidate leaf: sum of Euclidean distances between the
    # examples in the leaf and their centroid.
    X_subset = X[list(tids), :]
    centroid = np.mean(X_subset, axis=0)
    distances = eucl_dist.pairwise(X_subset, [centroid])
    return float(sum(distances))

def leaf_value(tids):
    # Predict the centroid (feature-wise mean) of the examples in the leaf.
    return np.mean(X.take(list(tids), axis=0), axis=0)
clf = DL85Predictor(max_depth=3, min_sup=5, error_function=error, leaf_value_function=leaf_value, time_limit=600)
clf.fit(X)
predicted = clf.predict(X)
The error function in this example has one argument, tids. The DL85Predictor class will call this function for each candidate leaf, where tids lists the identifiers of the training examples that would be part of that leaf. The error function in this example calculates the mean of the training examples in this list, and then calculates the Euclidean distance of each example in the list towards this mean. The sum of these distances is returned as the score for the candidate leaf.
The DL85Predictor class is initialized with the function that needs to be called to evaluate the quality of the leaves. Other tree learning tasks can be specified by providing an alternative implementation of the error function.
Note that in this example, the fit function is called on the matrix X, and the error function also operates on the matrix X. This is not necessary; the only requirement on the error function is that, for a given list of row identifiers (coming from the matrix X), it returns a quality score. For instance, the quality score can also be computed from a separate target vector, as sketched below.
In the clustering example above, we also call the predict function. For each example given in the parameter of the predict function, DL85Predictor will traverse the tree to determine the prediction specified in the corresponding leaf of the tree. This prediction is provided by the leaf_value function. The leaf_value function is called at the end of the training process to fill in the predictions in the leaves. This function also receives a list of identifiers in the training data X in order to calculate the prediction. In this example, the prediction corresponds to the mean of the examples in the leaf.
In principle, classification trees can also be learned using the DL85Predictor class. The following error function can be used:
def error(tids):
    classes, supports = np.unique(y.take(list(tids)), return_counts=True)
    maxindex = np.argmax(supports)
    return sum(supports) - supports[maxindex]
Here y consists of the labels of the examples in X. We use standard NumPy functions to count the number of examples in each class, determine the majority class, and finally calculate the error based on this. However, learning classification trees in this manner is in practice slower than using the DL85Classifier class: the DL85Classifier class calculates the error using optimized code written in C++, instead of using Python.
For supervised data with class labels, a supplementary interface is provided for writing error functions, as illustrated in the following example:
def error(sup_iter):
    # supports[i] is the number of training examples of class i in the candidate leaf.
    supports = list(sup_iter)
    maxindex = np.argmax(supports)
    # Return the number of misclassified examples and the majority class index.
    return sum(supports) - supports[maxindex], maxindex

clf = DL85Classifier(max_depth=2, fast_error_function=error, time_limit=600)
In this example, a fast_error_function is specified. If this function is specified, DL85Classifier will call the user-specified function with, as its argument, an iterator over the numbers of examples in each class; the function returns both the error of the leaf and the index of the majority class.
The advantage of this variation is that the calculation of the class distribution is done using optimized C++ code; the Python code does not have to traverse the data. Only the final calculation of the score is done in Python. This functionality is useful, for instance, if a different weight should be given to each class.
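For example, a class-weighted error could be written as follows; the weights and the binary class setup are illustrative assumptions:

import numpy as np
from dl85 import DL85Classifier

# Hypothetical misclassification weights for a binary problem: errors on
# class 1 are penalized five times more heavily than errors on class 0.
weights = np.array([1.0, 5.0])

def weighted_error(sup_iter):
    supports = np.array(list(sup_iter), dtype=float)  # examples per class in the leaf
    weighted = weights * supports
    # Predicting class c leaves the weighted counts of all other classes as errors.
    errors = weighted.sum() - weighted
    best = int(np.argmin(errors))
    return float(errors[best]), best  # (leaf error, predicted class)

clf = DL85Classifier(max_depth=3, min_sup=5, fast_error_function=weighted_error)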
Finally, we provide a built-in implementation of predictive clustering in the DL85Cluster class. Using this class, the user does not have to write the example code given above.
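A minimal usage sketch, assuming DL85Cluster is imported from dl85 and accepts the same max_depth and min_sup constraints as the other classes:

from dl85 import DL85Cluster

clust = DL85Cluster(max_depth=3, min_sup=5)
clust.fit(X)                 # unlabeled Boolean training data
predicted = clust.predict(X)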