Learn parameters and test the model on the same dataset
Solution: train-test split.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.4, random_state=0)
```
Tune the hyperparameters and test the model on the same dataset
When evaluating different settings (“hyperparameters”) for estimators, such as the `C` parameter that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model, and evaluation metrics no longer report on generalization performance.
Solution: train-validation-test split, or cross validation.

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, iris.data, iris.target, cv=5, scoring='f1_macro')
```
Note: from my point of view, cross validation is not a completely clean solution for hyperparameter tuning; a small test set is still needed to estimate the generalization error. But it is a good way to check whether the model is stable: if the validation error varies a lot across the different left-out folds, there may be a problem.
KFold divides all the samples into k groups of samples, called folds (if k=n, this is equivalent to the Leave One Out strategy), of equal sizes (if possible). The prediction function is learned using k-1 folds, and the fold left out is used for testing.
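A minimal sketch of KFold on a small hypothetical array; with 4 samples and k=2, each fold holds 2 samples and is used once as the test set:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(8).reshape(4, 2)  # 4 toy samples, 2 features each
kf = KFold(n_splits=2)
splits = list(kf.split(X))
for train_idx, test_idx in splits:
    # each iteration: train on k-1 folds, hold out the remaining fold
    print("train:", train_idx, "test:", test_idx)
```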
Use stratified K-Fold when the class is unbalanced.
StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
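A minimal sketch with hypothetical imbalanced labels (8 samples of class 0, 2 of class 1); each stratified test fold preserves the class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((10, 2))                 # toy features
y = np.array([0] * 8 + [1] * 2)       # imbalanced toy labels
skf = StratifiedKFold(n_splits=2)
splits = list(skf.split(X, y))
for train_idx, test_idx in splits:
    # each test fold gets 4 class-0 samples and 1 class-1 sample
    print(y[test_idx])
```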
An example would be medical data collected from multiple patients, with multiple samples taken from each patient. Such data is likely to be dependent on the individual group; in this example, the patient id for each sample is its group identifier.
In this case we would like to know if a model trained on a particular set of groups generalizes well to unseen groups. To measure this, we need to ensure that all the samples in the validation fold come from groups that are not represented at all in the paired training fold.
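Scikit-learn's GroupKFold implements this guarantee. A minimal sketch with hypothetical patient ids as the groups:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)        # 6 toy samples
y = np.array([0, 1, 0, 1, 0, 1])       # toy labels
groups = np.array([1, 1, 2, 2, 3, 3])  # hypothetical patient id per sample
gkf = GroupKFold(n_splits=3)
splits = list(gkf.split(X, y, groups))
for train_idx, test_idx in splits:
    # no patient ever appears in both the training and the test fold
    print("test groups:", sorted(set(groups[test_idx])))
```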
Time series data is characterised by the correlation between observations that are near in time (autocorrelation). However, classical cross-validation techniques assume the samples are independent and identically distributed, and would result in unreasonable correlation between training and testing instances (yielding poor estimates of generalisation error) on time series data. Therefore, it is very important to evaluate our model for time series data on the “future” observations least like those that are used to train the model.
Note that, unlike standard cross-validation methods, successive training sets are supersets of those that come before them. TimeSeriesSplit also adds all surplus data to the first training partition, which is always used to train the model.
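A minimal sketch of TimeSeriesSplit on toy data: the training set grows with each split, and the test indices always come after the training indices:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(6, 1)  # 6 toy observations in time order
tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for train_idx, test_idx in splits:
    # train on the past, test on the "future" observations
    print("train:", train_idx, "test:", test_idx)
```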
Scikit-learn provides tools to tune hyperparameters. That is to say, we don't have to start with a train-validation-test split, feed in different hyperparameters by hand, and print out validation errors ourselves. We can pass in the desired model and a list of hyperparameter values to choose from, and scikit-learn will iterate over them and return the best combination.
Model selection by evaluating various parameter settings can be seen as a way to use the labeled data to “train” the parameters of the grid. When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid search process: it is recommended to split the data into a development set (to be fed to the GridSearchCV instance) and an evaluation set to compute performance metrics.
There are two ways to tune hyper-parameters.
The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the `param_grid` parameter.
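A minimal sketch of an exhaustive grid search over SVM hyperparameters on the iris dataset; the grid values here are illustrative, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
# every combination of C and kernel is tried with 5-fold cross validation
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)
print(search.best_params_, search.best_score_)
```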
RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:
- A budget can be chosen independent of the number of parameters and possible values.
- Adding parameters that do not influence the performance does not decrease efficiency.
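A minimal sketch of a randomized search on the iris dataset; here `C` is sampled from a continuous log-uniform distribution (an illustrative choice), and `n_iter` fixes the budget regardless of how large the search space is:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

iris = load_iris()
# continuous distributions can be mixed with discrete lists of values
param_dist = {'C': loguniform(1e-2, 1e2), 'kernel': ['linear', 'rbf']}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=10, cv=5, random_state=0)
search.fit(iris.data, iris.target)
print(search.best_params_)
```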