Data Pre-processing in Python using Scikit-learn
Dataset Used: Heart Disease Prediction
The dataset is publicly available on the Kaggle website, and it comes from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients' information and includes over 4,000 records and 15 attributes.
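As a minimal sketch, the data can be loaded with pandas. The file name framingham.csv is an assumption; use whatever name the downloaded Kaggle file has.

```python
import pandas as pd

# Load the heart-disease CSV downloaded from Kaggle
# (the file name "framingham.csv" is an assumption).
df = pd.read_csv("framingham.csv")

print(df.shape)   # expect roughly 4,000+ rows and the dataset's attributes
print(df.head())  # first few patient records
```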
Standardizing Data
Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
Through standardization, we can take attributes with a Gaussian distribution and differing means and standard deviations and transform them into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1. For this, we use the StandardScaler class.
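A minimal sketch of StandardScaler on a small hypothetical array (the values are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two attributes (e.g. age and systolic blood pressure)
# with very different means and spreads.
X = np.array([[39, 106.0],
              [46, 121.0],
              [48, 127.5],
              [61, 150.0],
              [43, 180.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```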
Imputation of missing values
For various reasons, many real-world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets, however, are incompatible with scikit-learn estimators, which assume that all values in an array are numerical and that all of them hold meaning.
A better strategy is to impute the missing values, i.e., to infer them from the known part of the data.
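One way to do this in scikit-learn is the SimpleImputer class, which replaces each missing value with a column statistic such as the mean. The matrix below is illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with missing entries encoded as NaN (as might occur in the
# glucose or cholesterol columns of the heart-disease data).
X = np.array([[39.0, 195.0],
              [46.0, np.nan],
              [48.0, 245.0],
              [np.nan, 225.0]])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```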
Normalizing Data
Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.
In this task, we rescale each observation to a length of 1 (a unit norm). For this, we use the Normalizer class.
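A short sketch with the Normalizer class on hypothetical rows:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Each row (observation) is rescaled independently to unit L2 norm.
X = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 9.0],
              [5.0, 7.0, 5.0]])

normalizer = Normalizer(norm="l2")
X_normalized = normalizer.fit_transform(X)

# Every row now has Euclidean length 1.
print(np.linalg.norm(X_normalized, axis=1))
```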
Discretization
Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes.
One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. For instance, pre-processing with a discretizer can introduce nonlinearity to linear models.
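One way to do this in scikit-learn is the KBinsDiscretizer class. The sketch below bins a hypothetical age column into three one-hot encoded intervals; the bin count and the uniform strategy are illustrative choices.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Continuous ages binned into 3 equal-width, one-hot encoded intervals.
ages = np.array([[23.0], [35.0], [47.0], [59.0], [64.0], [71.0]])

discretizer = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform")
age_bins = discretizer.fit_transform(ages)

print(age_bins)                 # one column per bin, a single 1 per row
print(discretizer.bin_edges_)   # the learned bin boundaries
```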
Encoding
One-hot encoding (also known as one-of-K or dummy encoding) converts categorical features into features that can be used with scikit-learn estimators. For a feature with k distinct values, we transform it into a k-dimensional binary vector in which exactly one entry is 1 and all others are 0. This type of encoding can be obtained with the OneHotEncoder class, which transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1 and all others 0.
Since this dataset has no categorical column to encode, the example below uses hypothetical data:
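A possible sketch with OneHotEncoder on made-up gender labels:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature (not present in the heart-disease data).
genders = np.array([["Male"], ["Female"], ["Female"], ["Male"]])

encoder = OneHotEncoder(sparse_output=False)  # on older scikit-learn, use sparse=False
gender_encoded = encoder.fit_transform(genders)

print(encoder.categories_)   # the categories found for the feature
print(gender_encoded)        # one binary column per category
```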
Variance Thresholds
Variance thresholds remove features whose values don’t change much from observation to observation (i.e. their variance falls below a threshold). These features provide little value.
For example, if you had a public health dataset where 96% of observations were for 35-year-old men, then the ‘Age’ and ‘Gender’ features can be eliminated without a major loss in information.
Because variance is dependent on scale, you should always normalize your features first.
- Strengths: Applying variance thresholds is based on solid intuition: features that don’t change much also don’t add much information. This is an easy and relatively safe way to reduce dimensionality at the start of your modeling process.
- Weaknesses: If your problem does require dimensionality reduction, applying variance thresholds is rarely sufficient. Furthermore, you must manually set or tune a variance threshold, which could be tricky. We recommend starting with a conservative (i.e. lower) threshold.
- Implementations: Python (scikit-learn's VarianceThreshold class); see the sketch below.
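A minimal sketch with VarianceThreshold on a hypothetical feature matrix; the threshold value is an illustrative choice:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Three features: the first barely varies, the other two do.
X = np.array([[0, 2.0, 0.5],
              [0, 1.0, 2.3],
              [0, 4.0, 1.1],
              [1, 3.0, 4.2]])

# Drop features whose variance falls below the (manually chosen) threshold.
selector = VarianceThreshold(threshold=0.2)
X_reduced = selector.fit_transform(X)

print(selector.variances_)   # per-feature variances
print(X_reduced.shape)       # the low-variance first column is removed
```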