Overfitting in Two Ways
Learn the parameters and test the model on the same dataset
Solution: train-test split
```python
from sklearn.model_selection import train_test_split

# Hold out 40% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.4, random_state=0)
```
Tune the hyperparameters and test the model on the same dataset
When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance.
Solution: train-validation-test split, cross-validation
```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation, scored with macro-averaged F1
scores = cross_val_score(model, iris.data, iris.target, cv=5, scoring='f1_macro')
```
Note: from my point of view, cross validation is not a very clean solution for hyperparameter tuning. A small test set is still needed to estimate the generalization error. But it is a good way to check whether the model is stable: if the validation error varies a lot across the different held-out folds, then there might be some problems.
Visualize Overfitting & Underfitting
Effect of a hyper-parameter
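A minimal sketch of how such a plot can be produced; the digits dataset and an SVM's gamma parameter are illustrative choices, not part of the original. Scikit-learn's validation_curve returns training and validation scores over a range of parameter values:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
param_range = np.logspace(-6, -1, 5)  # illustrative range of gamma values

# Training/validation scores for each gamma value, 5-fold CV
train_scores, valid_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=5
)

plt.semilogx(param_range, train_scores.mean(axis=1), label="training score")
plt.semilogx(param_range, valid_scores.mean(axis=1), label="validation score")
plt.xlabel("gamma")
plt.ylabel("score")
plt.legend()
plt.show()
```

A large gap between the two curves signals overfitting; two low curves close together signal underfitting.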
Effect of the number of training samples
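Along the same lines, a sketch using learning_curve (again assuming digits and an SVC purely for illustration) computes scores for increasing training-set sizes:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Training/validation scores for growing training-set sizes, 5-fold CV
train_sizes, train_scores, valid_scores = learning_curve(
    SVC(gamma=0.001), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5
)

plt.plot(train_sizes, train_scores.mean(axis=1), label="training score")
plt.plot(train_sizes, valid_scores.mean(axis=1), label="validation score")
plt.xlabel("number of training samples")
plt.ylabel("score")
plt.legend()
plt.show()
```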
K Folds
K-Fold
KFold divides all the samples into k groups of samples, called folds (if k = n, this is equivalent to the Leave One Out strategy), of equal sizes (if possible). The prediction function is learned using k - 1 folds, and the fold left out is used for test.
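A small sketch of KFold in use (the toy data is my own illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array(["a", "b", "c", "d"])
kf = KFold(n_splits=2)

# Each iteration yields the indices of the training and test folds
for train_idx, test_idx in kf.split(X):
    print("train:", train_idx, "test:", test_idx)
```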
Stratified K-Fold
Use stratified k-fold when the classes are imbalanced.
StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
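A small sketch with a made-up imbalanced label vector, showing that each test fold keeps roughly the same class ratio as the full set:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 7 negative samples and 3 positive samples
X = np.ones(10)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

skf = StratifiedKFold(n_splits=3)
# Each test fold contains roughly the same proportion of positives
for train_idx, test_idx in skf.split(X, y):
    print("train:", train_idx, "test:", test_idx)
```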
Group K-Fold
An example would be medical data collected from multiple patients, with multiple samples taken from each patient. Such data is likely to be dependent on the individual group. In our example, the patient id for each sample will be its group identifier.
In this case we would like to know if a model trained on a particular set of groups generalizes well to the unseen groups. To measure this, we need to ensure that all the samples in the validation fold come from groups that are not represented at all in the paired training fold.
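A small sketch with made-up group labels standing in for patient ids; GroupKFold guarantees that no group appears in both the training and test fold of the same split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10])
y = np.array(["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"])
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])  # e.g. patient ids

gkf = GroupKFold(n_splits=3)
# Samples from the same group never end up on both sides of a split
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    print("train:", train_idx, "test:", test_idx)
```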
Time Series Split
Time series data is characterised by the correlation between observations that are near in time (autocorrelation). However, classical cross-validation techniques assume the samples are independent and identically distributed, and would result in unreasonable correlation between training and testing instances (yielding poor estimates of generalisation error) on time series data. Therefore, it is very important to evaluate our model for time series data on the “future” observations least like those that are used to train the model.
Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them. Also, it adds all surplus data to the first training partition, which is always used to train the model.
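A small sketch with made-up sequential data; note how every test fold lies strictly after its training fold in time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])

tscv = TimeSeriesSplit(n_splits=3)
# Training sets grow over successive splits; test indices always come later
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)
```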
Tuning Hyper-parameters
Scikit-learn provides tools to tune hyper-parameters. That is to say, we don't have to start with a train-validation-test split, feed in different hyper-parameters by hand, and print out the validation error ourselves. We can pass in the desired model and a list of hyper-parameter values to choose from, and scikit-learn will iterate over them and return the best combination.
Model selection by evaluating various parameter settings can be seen as a way to use the labeled data to “train” the parameters of the grid. When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid search process: it is recommended to split the data into a development set (to be fed to the GridSearchCV instance) and an evaluation set to compute performance metrics.
There are two ways to tune hyper-parameters.
Grid Search
The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter.
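A minimal sketch, assuming an SVC on the iris dataset and an illustrative param_grid, that follows the development/evaluation split described above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Development set for the grid search, evaluation set for the final score
X_dev, X_eval, y_dev, y_eval = train_test_split(X, y, test_size=0.25, random_state=0)

# Every combination of these values is tried with 5-fold CV
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10, 100]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_dev, y_dev)

print(search.best_params_)
print(search.score(X_eval, y_eval))  # generalization estimate on held-out data
```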
Randomized Search
RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:
- A budget can be chosen independent of the number of parameters and possible values.
- Adding parameters that do not influence the performance does not decrease efficiency.
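A minimal sketch, again with an SVC on iris; the loguniform distribution from scipy.stats and the n_iter budget of 20 are illustrative choices:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C is sampled from a log-uniform distribution; n_iter fixes the budget
param_distributions = {"C": loguniform(1e-2, 1e2), "kernel": ["linear", "rbf"]}
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=20, cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_)
```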