Preprocessing
Encoding Categorical Features
An integer representation cannot be used directly with scikit-learn estimators, as these expect continuous input and would interpret the categories as being ordered, which is often not desired (e.g. a set of browsers encoded as integers would be ordered arbitrarily).
One possibility to convert categorical features to features that can be used with scikit-learn estimators is a one-of-K or one-hot encoding, which is implemented in OneHotEncoder. This estimator transforms each categorical feature with m possible values into m binary features, with only one active.
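A minimal sketch of what this can look like (the browser and OS values below are made-up examples):

    from sklearn.preprocessing import OneHotEncoder

    # Toy data: two categorical features (browser, operating system)
    X = [["Firefox", "Linux"],
         ["Chrome", "Windows"],
         ["Safari", "macOS"]]

    enc = OneHotEncoder()
    enc.fit(X)

    # Each feature with m possible values becomes m binary columns,
    # with exactly one column active per sample.
    print(enc.transform([["Chrome", "Linux"]]).toarray())
    # [[1. 0. 0. 1. 0. 0.]]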
Imputation of Missing Values
A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. However, this comes at the price of losing data which may be valuable (even though incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known part of the data.
Basic strategies for imputing missing values use the mean, the median, or the most frequent value of the row or column in which the missing values are located.
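A minimal sketch of mean imputation per column with SimpleImputer (the numbers are arbitrary toy data):

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = [[1, 2],
         [np.nan, 3],
         [7, 6]]

    # Replace each missing entry with the mean of its column
    imp = SimpleImputer(missing_values=np.nan, strategy="mean")
    print(imp.fit_transform(X))
    # [[1. 2.]
    #  [4. 3.]
    #  [7. 6.]]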
Standardization
Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn.
If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
It is sometimes not enough to center and scale the features independently, since a downstream model can further make some assumption on the linear independence of the features. To address this issue you can use PCA with whiten=True to further remove the linear correlation across features.
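A small sketch of what standardization does to a toy matrix whose second feature has a much larger scale than the first (the values are invented):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Second feature has a much larger scale than the first
    X = np.array([[1.0, 10000.0],
                  [2.0, 20000.0],
                  [3.0, 30000.0]])

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Both columns now have zero mean and unit variance
    print(X_scaled.mean(axis=0))  # ~[0. 0.]
    print(X_scaled.std(axis=0))   # [1. 1.]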
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) should be learnt from a training set and applied to held-out data for prediction:
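One common way to express this is a Pipeline, sketched here with an arbitrary choice of dataset and estimator (iris and logistic regression are only illustrative):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The scaler is fit on the training split only; the same learnt
    # mean/variance are then applied to the held-out test split.
    pipe = make_pipeline(StandardScaler(), LogisticRegression())
    pipe.fit(X_train, y_train)
    print(pipe.score(X_test, y_test))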
Scaling sparse data
Summary: standardize without centering.
Centering sparse data would destroy the sparseness structure in the data, and is thus rarely a sensible thing to do. However, it can make sense to scale sparse inputs without centering, especially if features are on different scales.
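A sketch of scaling sparse input without centering, using StandardScaler with with_mean=False on invented data:

    from scipy.sparse import csr_matrix
    from sklearn.preprocessing import StandardScaler

    # Sparse toy data: several entries are zero
    X = csr_matrix([[0.0, 100.0],
                    [2.0, 0.0],
                    [4.0, -50.0]])

    # with_mean=False skips centering, so zero entries stay zero and the
    # sparse structure is preserved; each column is divided by its std.
    scaler = StandardScaler(with_mean=False)
    X_scaled = scaler.fit_transform(X)
    print(X_scaled.toarray())
    # approximately:
    # [[ 0.     1.604]
    #  [ 1.225  0.   ]
    #  [ 2.449 -0.802]]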
Scaling data with outliers
Summary: use the median instead of the mean, and the IQR instead of the standard deviation.
Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean and variance in a negative way. In such cases, the median and the interquartile range often give better results.
RobustScaler removes the median and scales the data according to the quantile range (defaulting to the IQR: interquartile range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
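A sketch with RobustScaler on invented data containing one gross outlier:

    import numpy as np
    from sklearn.preprocessing import RobustScaler

    # One gross outlier in an otherwise small-valued feature
    X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

    # RobustScaler subtracts the median and divides by the IQR
    # (25th to 75th percentile range), so the outlier has little
    # influence on how the other samples are scaled.
    scaler = RobustScaler()
    print(scaler.fit_transform(X).ravel())
    # [ -1.   -0.5   0.    0.5 498.5]
    print(scaler.center_, scaler.scale_)  # median and IQR: [3.] [2.]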
Normalization
Normalization is the process of scaling individual samples to have unit norm.
This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples. This assumption is the basis of the Vector Space Model often used in text classification and clustering contexts.
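A minimal sketch of per-sample normalization (the rows are toy vectors):

    import numpy as np
    from sklearn.preprocessing import Normalizer

    X = np.array([[3.0, 4.0],
                  [1.0, 0.0]])

    # Each *sample* (row) is scaled to unit L2 norm, so dot products
    # between rows become cosine similarities.
    normalizer = Normalizer(norm="l2")
    print(normalizer.fit_transform(X))
    # [[0.6 0.8]
    #  [1.  0. ]]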
Binarization
Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multivariate Bernoulli distribution.
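A minimal sketch of feature binarization with an arbitrary threshold of 0.5:

    import numpy as np
    from sklearn.preprocessing import Binarizer

    X = np.array([[0.4, 1.2, -0.3],
                  [2.0, 0.0, 0.7]])

    # Values above the threshold become 1, everything else becomes 0
    binarizer = Binarizer(threshold=0.5)
    print(binarizer.fit_transform(X))
    # [[0. 1. 0.]
    #  [1. 0. 1.]]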