11.7. Hyperparameter Optimization

Windows 10
Python 3.7.3 @ MSC v.1915 64 bit (AMD64)
Latest build date 2020.05.10
sklearn version:  0.22.1

GridSearchCV: Grid Search

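GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None,
             iid='deprecated', refit=True, cv=None, verbose=0,
             pre_dispatch='2*n_jobs', error_score=nan,
             return_train_score=False)

Parameters:

estimator: a scikit-learn estimator object. It must either provide a score method, or the scoring parameter must be passed.
param_grid: dict or list of dictionaries. Each key is a parameter name and each value is the list of settings to try for that parameter.
fit_params: parameters passed through to the estimator's fit method. In sklearn 0.22 they are no longer a constructor argument; pass them as keyword arguments to GridSearchCV.fit instead.
refit: whether to refit the estimator on the whole dataset using the best parameters found. For multi-metric evaluation, this must be a string naming the scorer that is used to select the best parameters for the final refit. refit can also be a function that, given cv_results_, returns the selected best_index_.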
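
The multi-metric case of refit can be sketched as follows. This is a minimal illustration, not part of the original notes: the scorer labels "acc" and "f1" and the small C grid are arbitrary choices made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two metrics are evaluated for every candidate; the dict keys are
# arbitrary labels chosen for this sketch.
scoring = {"acc": "accuracy", "f1": "f1_macro"}

multi_search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring=scoring,
    refit="acc",  # metric used to select best_params_ and to refit
    cv=5,
)
multi_search.fit(X, y)

# Results are stored per metric: mean_test_acc, mean_test_f1, ...
metric_keys = sorted(k for k in multi_search.cv_results_ if k.startswith("mean_test_"))
print(metric_keys)
print(multi_search.best_params_)
```

With refit="acc", best_params_, best_score_ and best_estimator_ all refer to the candidate that ranks first on the "acc" metric.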

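Example

from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
SVM = SVC(C=1.0, kernel='rbf')

# Candidate values for each hyperparameter (4 x 3 = 12 combinations)
params = {"C": [0.1, 1.0, 5.0, 10.0],
          "kernel": ['linear', 'poly', 'rbf']
    }

# 5-fold cross-validation for every combination
grid_search = GridSearchCV(SVM, params, cv=5)
grid_search.fit(X, y)

# Score of the refitted best model on the given data (here, the training data).
# The best mean cross-validation score is grid_search.best_score_.
grid_search.score(X, y)
# Best parameters:
grid_search.best_params_

{'C': 0.1, 'kernel': 'poly'}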

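Attributes

cv_results_: a dictionary that stores the detailed results of the grid search:

mean_fit_time: mean time (in seconds) spent fitting the model for each parameter combination.
std_fit_time: standard deviation of the fit time for each parameter combination.
mean_test_score: mean cross-validation score of each parameter combination.
std_test_score: standard deviation of the score of each parameter combination.
mean_score_time: mean time (in seconds) spent scoring each parameter combination.
std_score_time: standard deviation of the scoring time for each parameter combination.
rank_test_score: rank of each parameter combination by score (1 is best).
params: the list of all parameter combinations.
param_xxx: the value of hyperparameter xxx in each parameter combination.
splitx_test_score: score of each parameter combination on the x-th cross-validation split (x = 0, 1, ...).

best_index_: the index (into the cv_results_ arrays) of the best candidate parameter combination.

best_estimator_: the estimator with the highest score, refitted on the whole dataset. Not available if refit=False.

best_score_: mean cross-validation score of the best estimator.

best_params_: the parameter combination with the highest score. For multi-metric evaluation, it is available only if refit is specified.

n_splits_: the number of cross-validation splits.

refit_time_: seconds spent refitting the best model on the whole dataset.

multimetric_: boolean, whether multiple metrics are used for evaluation.

classes_: the class labels.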
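
As a rough illustration of how the cv_results_ keys line up, the sketch below runs a small search and prints one row per candidate, best rank first. The grid here is an assumption chosen to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

results = search.cv_results_

# Every value in cv_results_ has one entry per candidate (3 x 2 = 6 here).
n_candidates = len(results["params"])

# Print the candidates ordered by rank (1 is best).
order = sorted(range(n_candidates), key=lambda i: results["rank_test_score"][i])
for i in order:
    print(
        results["rank_test_score"][i],
        results["params"][i],
        round(results["mean_test_score"][i], 4),
        round(results["std_test_score"][i], 4),
    )

# Per-split scores are stored under split0_test_score ... split4_test_score.
split_keys = sorted(k for k in results if k.startswith("split"))
print(split_keys)
```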
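
The attributes above are related to one another; a small sketch (with an assumed one-parameter grid) makes the relationships explicit:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=5).fit(X, y)

i = search.best_index_

# best_params_ and best_score_ are the cv_results_ entries at best_index_.
params_match = search.best_params_ == search.cv_results_["params"][i]
score_match = search.best_score_ == search.cv_results_["mean_test_score"][i]
print(params_match, score_match)

# best_estimator_ carries the winning hyperparameters.
print(search.best_estimator_.get_params()["C"], search.best_params_["C"])

print(search.n_splits_)      # number of folds requested via cv=5
print(search.multimetric_)   # False: a single (default) metric was used
print(search.refit_time_)    # seconds spent on the final refit
print(search.classes_)       # class labels seen during fit
```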

Methods

Method	Description
`decision_function(X)`	Call decision_function on the estimator with the best found parameters.
`fit(X[, y, groups])`	Run fit with all sets of parameters.
`get_params()`	Get parameters for this estimator.
`inverse_transform(Xt)`	Call inverse_transform on the estimator with the best found params.
`predict(X)`	Call predict on the estimator with the best found parameters.
`predict_log_proba(X)`	Call predict_log_proba on the estimator with the best found parameters.
`predict_proba(X)`	Call predict_proba on the estimator with the best found parameters.
`score(X[, y])`	Returns the score on the given data, if the estimator has been refit.
`set_params(**params)`	Set the parameters of this estimator.
`transform(X)`	Call transform on the estimator with the best found parameters.
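
The table says that prediction-style methods are forwarded to the estimator with the best found parameters. The sketch below checks this on iris, and also shows that no refitted estimator exists when refit=False. The grid is an assumption made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = {"C": [0.1, 1.0, 10.0]}

search = GridSearchCV(SVC(), grid, cv=5).fit(X, y)

# predict and score on the search object delegate to best_estimator_.
same_predictions = np.array_equal(search.predict(X), search.best_estimator_.predict(X))
same_score = search.score(X, y) == search.best_estimator_.score(X, y)
print(same_predictions, same_score)

# Without a refit there is no best_estimator_ to delegate to.
no_refit = GridSearchCV(SVC(), grid, cv=5, refit=False).fit(X, y)
has_best_estimator = hasattr(no_refit, "best_estimator_")
print(has_best_estimator)
print(no_refit.best_params_)  # still available for single-metric searches
```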

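RandomizedSearchCV: Randomized Search

Unlike GridSearchCV, a randomized search does not (necessarily) try every parameter combination. Instead, it samples a fixed number of parameter combinations from the specified distributions. The number of combinations tried is set by the n_iter parameter.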

RandomizedSearchCV(estimator, param_distributions, 
                   n_iter=10, scoring=None, n_jobs=None, 
                   iid='deprecated', refit=True, cv=None, 
                   verbose=0, pre_dispatch='2*n_jobs', 
                   random_state=None, error_score=nan, 
                   return_train_score=False)

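Apart from param_distributions (which replaces param_grid), n_iter, and random_state, RandomizedSearchCV takes the same parameters as GridSearchCV. Its attributes and methods are also the same.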
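
To make the role of n_iter and random_state concrete, the following sketch samples five candidates and repeats the search with the same seed. The distribution bounds and the seed value are assumptions made for this illustration.

```python
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C is drawn uniformly from [0.1, 4.1]; kernel is drawn from a list.
param_distributions = {
    "C": uniform(loc=0.1, scale=4),
    "kernel": ["linear", "rbf"],
}

def run_search(seed):
    """Run a randomized search with n_iter=5 and return the fitted object."""
    search = RandomizedSearchCV(
        SVC(), param_distributions, n_iter=5, cv=5, random_state=seed
    )
    return search.fit(X, y)

first = run_search(seed=0)
second = run_search(seed=0)

# Exactly n_iter candidates are evaluated.
n_tried = len(first.cv_results_["params"])
print(n_tried)

# The same random_state reproduces the same sampled candidates.
reproducible = first.cv_results_["params"] == second.cv_results_["params"]
print(reproducible)
```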

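Example

from sklearn.svm import SVC
from sklearn.datasets import load_iris
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)
SVM = SVC(C=1.0, kernel='rbf')

# The grid used for GridSearchCV, kept for comparison:
# params = {"C":[0.1, 1.0, 5.0, 10.0],
#           "kernel":['linear', 'poly', 'rbf']
#     }

# C is sampled from a uniform distribution on [0, 4]; kernel is sampled from a list
params = {"C": uniform(loc=0, scale=4),
          "kernel": ['linear', 'poly', 'rbf']
    }

# n_iter defaults to 10, so 10 combinations are sampled
random_search = RandomizedSearchCV(SVM, params, cv=5)
random_search.fit(X, y)

# Score of the refitted best model on the given data (here, the training data).
# The best mean cross-validation score is random_search.best_score_.
random_search.score(X, y)
# Best parameters:
random_search.best_params_

{'C': 1.2248642986661507, 'kernel': 'rbf'}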

ParameterGrid: Parameter Combinations

The ParameterGrid class builds every combination of the given parameter values.

ParameterGrid(param_grid)

Example

from sklearn.model_selection import ParameterGrid
param_grid = {'a': [1, 2], 'b': [True, False]}

list(ParameterGrid(param_grid))

[{'a': 1, 'b': True},
 {'a': 1, 'b': False},
 {'a': 2, 'b': True},
 {'a': 2, 'b': False}]
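
ParameterGrid also accepts a list of dictionaries, in which case the combinations of each dictionary are generated separately and concatenated. A short sketch, with SVM-style parameter names used purely as an illustration:

```python
from sklearn.model_selection import ParameterGrid

# Two sub-grids: gamma is only varied together with the rbf kernel.
grid = ParameterGrid([
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
])

# 2 combinations from the first dict plus 2 x 2 from the second.
n_combinations = len(grid)
print(n_combinations)

combinations = list(grid)
for combo in combinations:
    print(combo)
```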

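ParameterSampler: Random Sampling of Parameter Combinations

The ParameterSampler class generates parameter combinations by sampling each parameter from a given list or distribution.

ParameterSampler(param_distributions, n_iter, random_state=None)

Example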

from sklearn.model_selection import ParameterSampler

param_grid = {'a': [1, 2], 'b': [True, False]}
list(ParameterSampler(param_grid, n_iter=3))

[{'b': True, 'a': 1}, {'b': False, 'a': 2}, {'b': False, 'a': 1}]

from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

param_grid = {'a': uniform(loc=0, scale=4), 'b': [True, False]}
list(ParameterSampler(param_grid, n_iter=4))

[{'a': 0.716982761404827, 'b': True},
 {'a': 2.6603500613693383, 'b': False},
 {'a': 1.7170489847529042, 'b': False},
 {'a': 2.266735701973464, 'b': True}]
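
The outputs above change from run to run because no random_state is given. A minimal sketch of a reproducible sampler, where the seed 42 is an arbitrary choice:

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

distributions = {"a": uniform(loc=0, scale=4), "b": [True, False]}

# Fixing random_state makes the sampled combinations repeatable.
first = list(ParameterSampler(distributions, n_iter=4, random_state=42))
second = list(ParameterSampler(distributions, n_iter=4, random_state=42))

print(first == second)
for combo in first:
    print(combo)
```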

fit_grid_point

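Fits the estimator with a single set of parameters and returns the score, the parameters, and the number of test samples. The function is available in sklearn 0.22.1; later releases deprecated it.

fit_grid_point(X, y, estimator, parameters,
               train, test, scorer, verbose,
               error_score=nan, **fit_params)

Example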

from sklearn.svm import SVC 
from sklearn.datasets import load_iris
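from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import fit_grid_point

X, y = load_iris(return_X_y=True)
SVM = SVC(C=1.0, kernel='rbf')

# A single parameter setting (plain values, not lists of candidates)
params = {"C": 1.0,
          "kernel": 'linear'
    }

# The iris samples are ordered by class: rows 0-99 hold classes 0 and 1,
# rows 100-149 hold only class 2. The model never sees class 2 during
# training, which is why the score below is 0.0.
fit_grid_point(X, y, SVM, params, train=slice(0,100,1), 
               test=slice(100,150,1), 
               scorer=make_scorer(f1_score), verbose=1)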

(0.0, {'C': 1.0, 'kernel': 'linear'}, 50)
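
Because the example above splits the ordered iris rows by position, its score is not informative. The sketch below evaluates the same single parameter setting on a stratified split instead. It does not call fit_grid_point: evaluate_point is a hypothetical helper written for this illustration, and the macro-averaged F1 scorer and the seed are assumptions. It mirrors the return shape (score, parameters, number of test samples) without claiming to reproduce the library's implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Stratified split of the row indices so every class appears on both sides.
indices = np.arange(len(y))
train_idx, test_idx = train_test_split(
    indices, test_size=50, stratify=y, random_state=0
)

def evaluate_point(estimator, parameters, scorer):
    """Hypothetical helper: fit one parameter setting and score it."""
    model = clone(estimator).set_params(**parameters)
    model.fit(X[train_idx], y[train_idx])
    score = scorer(model, X[test_idx], y[test_idx])
    return score, parameters, len(test_idx)

# Macro-averaged F1 handles the three iris classes.
macro_f1 = make_scorer(f1_score, average="macro")

score, used_params, n_test = evaluate_point(
    SVC(), {"C": 1.0, "kernel": "linear"}, macro_f1
)
print(score, used_params, n_test)
```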