11.15. Decision Trees

Windows 10
Python 3.7.3 @ MSC v.1915 64 bit (AMD64)
Latest build date 2020.11.07
sklearn version:  0.23.1

from sklearn.tree import (DecisionTreeClassifier, DecisionTreeRegressor,
                          ExtraTreeClassifier, ExtraTreeRegressor)
from sklearn import tree

[i for i in dir(tree) if not i.startswith("_")]

['BaseDecisionTree',
 'DecisionTreeClassifier',
 'DecisionTreeRegressor',
 'ExtraTreeClassifier',
 'ExtraTreeRegressor',
 'export_graphviz',
 'export_text',
 'plot_tree']

Parameters

DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
                       min_samples_split=2, min_samples_leaf=1,
                       min_weight_fraction_leaf=0.0, max_features=None,
                       random_state=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       class_weight=None, presort='deprecated', ccp_alpha=0.0)

DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None,
                      min_samples_split=2, min_samples_leaf=1,
                      min_weight_fraction_leaf=0.0, max_features=None,
                      random_state=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      presort='deprecated', ccp_alpha=0.0)

Split strategy parameters

criterion: the split criterion. For classification trees: {'gini', 'entropy'}, i.e. Gini impurity or information gain (entropy). For regression trees: {"mse", "friedman_mse", "mae"}.
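
To make the two classification criteria concrete, here is a minimal sketch that computes Gini impurity and entropy for the labels in a single node. The helper names `gini` and `entropy` are illustrative and are not part of sklearn; the entropy is measured in bits.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of a node in bits: sum_k p_k * log2(1 / p_k)."""
    n = len(labels)
    return sum((count / n) * math.log2(n / count)
               for count in Counter(labels).values())


pure_node = ["a", "a", "a", "a"]    # one class only
mixed_node = ["a", "a", "b", "b"]   # two classes, evenly mixed

print("pure :", gini(pure_node), entropy(pure_node))
print("mixed:", gini(mixed_node), entropy(mixed_node))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why a split that lowers them is preferred.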

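splitter: {'best', 'random'}. The split strategy. 'best' chooses the best split; 'random' chooses the best among randomly drawn candidate splits. 'best' suits datasets with a modest number of samples; for very large datasets, 'random' is recommended.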

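max_features: the number of features to consider when looking for the best split. The search does not stop until at least one valid split is found, even if that means inspecting more than max_features features.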

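If an integer, max_features features are considered at each split.
If a float, int(max_features * n_features) features are considered at each split.
If the string 'auto', max_features=sqrt(n_features) for classification trees and max_features=n_features for regression trees.
If the string 'sqrt', max_features=sqrt(n_features).
If the string 'log2', max_features=log2(n_features).
If None, max_features=n_features.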
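
The rules above can be sketched as a small function. `resolve_max_features` is a hypothetical helper written for illustration, not sklearn's implementation; clamping the result to at least 1 and truncating with `int` are assumptions, and 'auto' is left out because its meaning depends on the estimator.

```python
import math


def resolve_max_features(max_features, n_features):
    """Hypothetical helper: map a max_features setting to a feature count."""
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        # A float is read as a fraction of the available features.
        return max(1, int(max_features * n_features))
    raise ValueError(f"unsupported max_features: {max_features!r}")


n_features = 16
for setting in (None, "sqrt", "log2", 3, 0.5):
    print(repr(setting), "->", resolve_max_features(setting, n_features))
```

After fitting a real estimator, the resolved value can be read from its `max_features_` attribute.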

Pre-pruning parameters

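max_depth: an integer or None. The maximum depth of the tree, used to limit overfitting. If None, nodes are expanded until every leaf is pure (all of its samples belong to one class) or contains fewer than min_samples_split samples. It can be combined with max_leaf_nodes, in which case both limits apply.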

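min_samples_split: the minimum number of samples an internal (non-leaf) node must contain before it can be split.

If an integer, that integer is the minimum number of samples.
If a float, ceil(min_samples_split * n_samples) is the minimum number of samples.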

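min_samples_leaf: the minimum number of samples each leaf must contain. A node is split only if both the left and the right branch would keep at least min_samples_leaf samples.

If an integer, that integer is the minimum number of samples.
If a float, ceil(min_samples_leaf * n_samples) is the minimum number of samples.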
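
As a worked example of the int/float rule shared by min_samples_split and min_samples_leaf, the sketch below converts a setting into an absolute sample count. `resolve_min_samples` is a hypothetical helper, not an sklearn function, and it shows only the ceil rule described here.

```python
import math


def resolve_min_samples(value, n_samples):
    """Hypothetical helper: an int is used as-is, a float is a fraction of n_samples."""
    if isinstance(value, int):
        return value
    return math.ceil(value * n_samples)


n_samples = 150
as_fraction = resolve_min_samples(0.03, n_samples)  # 0.03 * 150 = 4.5, rounded up
as_count = resolve_min_samples(5, n_samples)        # already an absolute count

print(as_fraction, as_count)
```

With 150 samples, a fraction of 0.03 and an integer setting of 5 therefore lead to the same threshold.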

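min_weight_fraction_leaf: float, default 0.0. The minimum fraction of the total sample weight (summed over all input samples) that a leaf must hold. All samples have equal weight when sample_weight is not provided.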

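max_leaf_nodes: an integer or None. The maximum number of leaves; the tree is then grown in best-first fashion. If None, the number of leaves is unlimited. A max_depth limit, if set, still applies.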

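min_impurity_decrease: default 0.0. A node is split if the split reduces the impurity by at least this value. The weighted impurity decrease is computed as follows, where $N$ is the total number of samples, $N_t$ the number of samples at the current node, and $N_{t_L}$ and $N_{t_R}$ the number of samples in the left and right child: $$ \frac{N_t}{N} \times \left(\text{impurity} - \frac{N_{t_R}}{N_t} \times \text{impurity}_R - \frac{N_{t_L}}{N_t} \times \text{impurity}_L\right) $$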
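
The weighted impurity decrease can be checked by hand. The sketch below evaluates the formula for one made-up split; the function name and all numbers are illustrative assumptions, not values taken from a fitted tree.

```python
def weighted_impurity_decrease(n_total, n_node, impurity,
                               n_left, impurity_left,
                               n_right, impurity_right):
    """N_t/N * (impurity - N_tR/N_t * impurity_R - N_tL/N_t * impurity_L)."""
    return (n_node / n_total) * (
        impurity
        - (n_right / n_node) * impurity_right
        - (n_left / n_node) * impurity_left
    )


# A node holding 50 of 100 samples, split into a pure left child (20 samples)
# and a right child (30 samples) with impurity 0.4.
decrease = weighted_impurity_decrease(
    n_total=100, n_node=50, impurity=0.5,
    n_left=20, impurity_left=0.0,
    n_right=30, impurity_right=0.4,
)
print(round(decrease, 4))
```

Here the decrease is 0.5 * (0.5 - 0.6 * 0.4 - 0.4 * 0.0) = 0.13, so the split would be made for any min_impurity_decrease up to 0.13.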

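min_impurity_split: deprecated. An impurity threshold for early stopping: a node is split if its impurity is above the threshold, otherwise it becomes a leaf. The parameter is scheduled for removal in version 0.25; use min_impurity_decrease instead.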

Post-pruning parameters

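ccp_alpha: the complexity parameter for minimal cost-complexity pruning. The default is 0.0, which means no post-pruning is performed; larger values prune more of the tree.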
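
A minimal sketch of post-pruning, assuming the bundled iris dataset as example data: `cost_complexity_pruning_path` lists the effective alphas at which subtrees get pruned, and one of them is then passed as `ccp_alpha`. Picking the second-largest alpha is an arbitrary choice made for illustration; in practice the value would usually be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fully grown tree (ccp_alpha=0.0, no post-pruning).
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Effective alphas of the pruning path, in increasing order.
path = full_tree.cost_complexity_pruning_path(X, y)

# The largest alpha prunes the tree down to its root, so take the one before it.
alpha = path.ccp_alphas[-2]
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)

print("leaves without pruning:", full_tree.get_n_leaves())
print("leaves with ccp_alpha :", pruned_tree.get_n_leaves())
```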

Other parameters

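class_weight: dict, list of dict, "balanced", or None. The weights associated with the classes.

For a single-output multi-class problem, specify the weights in the form {class_label: weight}.
For a multi-output (multi-label) problem, give one dict per column of y. For example, a four-class multi-label problem should specify the weights as: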

[{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}]

Each dict in the list corresponds to one column of y. The weights should not be specified as:

[{1:1}, {2:5}, {3:1}, {4:1}]
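If None, every class has weight 1.
If 'balanced', the class weights are computed automatically as n_samples / (n_classes * np.bincount(y)). For multi-output problems, the weights from each column of y are multiplied.
If sample_weight is passed to fit, it is multiplied with the corresponding class_weight.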
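
The 'balanced' formula is easy to work through on a small label vector. The sketch below uses `collections.Counter` in place of `np.bincount` so that it needs only the standard library; `balanced_class_weights` is a hypothetical helper and the labels are made up.

```python
from collections import Counter


def balanced_class_weights(y):
    """n_samples / (n_classes * count_of_class) for every class in y."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in sorted(counts.items())}


# Imbalanced labels: six samples of class 0, three of class 1, one of class 2.
y = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(y)

for label, weight in weights.items():
    print(label, round(weight, 3))
```

The rarest class receives the largest weight, so every class contributes the same total weight (here 10 / 3 each).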

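presort: deprecated. It controlled whether the data is presorted to speed up the search for the best split, and will be removed in version 0.24. When set to True, it slowed down training on large datasets but could speed it up on small datasets or when the maximum depth was limited.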

Attributes

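classes_: the class labels.

feature_importances_: the feature importances. The higher the value, the more important the feature (also known as Gini importance).

max_features_: the inferred value of the max_features parameter.

n_classes_: the number of classes.

n_features_: the number of features when fit was performed.

n_outputs_: the number of outputs when fit was performed.

tree_: the underlying Tree object.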

Methods


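`apply`(X[, check_input])	Return the index of the leaf that each sample ends up in.
`cost_complexity_pruning_path`(X, y[, …])	Compute the pruning path during Minimal Cost-Complexity Pruning.
`decision_path`(X[, check_input])	Return the decision path in the tree.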
`fit`(X, y[, sample_weight, check_input, …])	Build a decision tree classifier from the training set (X, y).
`get_depth`()	Return the depth of the decision tree.
`get_n_leaves`()	Return the number of leaves of the decision tree.
`get_params`([deep])	Get parameters for this estimator.
`predict`(X[, check_input])	Predict class or regression value for X.
`predict_log_proba`(X)	Predict class log-probabilities of the input samples X.
`predict_proba`(X[, check_input])	Predict class probabilities of the input samples X.
`score`(X, y[, sample_weight])	Return the mean accuracy on the given test data and labels.
`set_params`(**params)	Set the parameters of this estimator.
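
A short sketch of a few of these methods in use, assuming the bundled iris dataset as example data. The tree is fully grown and is inspected on its own training data, so this shows the API rather than a proper evaluation.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

leaf_ids = clf.apply(X)                # leaf index reached by each sample
node_indicator = clf.decision_path(X)  # sparse (n_samples, n_nodes) indicator matrix
proba = clf.predict_proba(X[:2])       # class probabilities for the first two samples

print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
print("leaf of first five samples:", leaf_ids[:5])
print("decision_path shape:", node_indicator.shape)
print("predict_proba shape:", proba.shape)
```

A nonzero entry in row i, column j of `node_indicator` means that sample i passes through node j on its way to a leaf.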

For classification trees, score() returns:

accuracy_score(y, self.predict(X), sample_weight=sample_weight)

For regression trees, score() returns:

r2_score(y, y_pred, sample_weight=sample_weight)

$$ R^2 = 1 - \frac{\sum(y - \hat{y})^2}{\sum(y - \bar{y})^2} $$

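$R^2$ can be negative, because a model can fit arbitrarily worse than one that always predicts the mean $\bar{y}$.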
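
To see how $R^2$ turns negative, the sketch below evaluates the formula on three made-up targets. The function `r2` is a plain re-implementation for illustration, without sample weights.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - sum((y - y_hat)^2) / sum((y - y_mean)^2)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]

perfect = r2(y_true, [1.0, 2.0, 3.0])       # predictions equal the targets
mean_only = r2(y_true, [2.0, 2.0, 2.0])     # always predict the mean
reversed_fit = r2(y_true, [3.0, 2.0, 1.0])  # worse than predicting the mean

print(perfect, mean_only, reversed_fit)
```

A perfect fit scores 1, predicting the mean scores 0, and the reversed predictions score 1 - 8/2 = -3.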

Extremely randomized trees

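Extremely randomized trees are built differently from classic decision trees. When looking for the best split, Extra-trees draw a random split for each of max_features randomly selected features and choose the best among those splits. With max_features set to 1, this amounts to building a totally random decision tree.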

In fact, the ExtraTree classes inherit from the DecisionTree classes; they merely change the defaults of the splitter and max_features parameters to "random" and "auto" respectively.

ExtraTreeClassifier(criterion='gini', splitter='random', max_depth=None,
                    min_samples_split=2, min_samples_leaf=1,
                    min_weight_fraction_leaf=0.0, max_features='auto',
                    random_state=None, max_leaf_nodes=None,
                    min_impurity_decrease=0.0, min_impurity_split=None,
                    class_weight=None, ccp_alpha=0.0)

ExtraTreeRegressor(criterion='mse', splitter='random', max_depth=None,
                   min_samples_split=2, min_samples_leaf=1,
                   min_weight_fraction_leaf=0.0, max_features='auto',
                   random_state=None, min_impurity_decrease=0.0,
                   min_impurity_split=None, max_leaf_nodes=None, ccp_alpha=0.0)
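
The relationship between the two families can be checked directly. This is a small sketch; the printed max_features default depends on the installed sklearn version, so no particular value is assumed for it.

```python
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

extra_tree = ExtraTreeClassifier(random_state=0)

# ExtraTreeClassifier subclasses DecisionTreeClassifier and changes defaults.
is_subclass = issubclass(ExtraTreeClassifier, DecisionTreeClassifier)
default_splitter = extra_tree.get_params()["splitter"]
default_max_features = extra_tree.get_params()["max_features"]

print("subclass of DecisionTreeClassifier:", is_subclass)
print("default splitter:", default_splitter)
print("default max_features:", default_max_features)
```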