Data Pre-processing Tasks Using Python
Dataset Used: Iris Dataset
The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems". It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphological variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and of the petals, in centimeters.
This dataset became a typical test case for many statistical classification techniques in machine learning, such as support vector machines.
Content
The dataset contains 150 records with 5 attributes: Petal Length, Petal Width, Sepal Length, Sepal Width, and Class (Species).
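As a minimal sketch, the dataset can be loaded from the copy bundled with scikit-learn (an assumption; the text does not say where the data was loaded from). The variable names `X` and `y` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the copy of Iris that ships with scikit-learn.
iris = load_iris()
X, y = iris.data, iris.target  # X: measurements, y: integer class labels

print(X.shape)             # number of records and number of measured features
print(iris.feature_names)  # names of the four measurements
# Number of samples per species
print(dict(zip(iris.target_names, np.bincount(y).tolist())))
```

Here the class (species) attribute is kept separately in `y`, so `X` holds only the four numeric measurements.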
Variance Threshold
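Variance Threshold is a feature selector that removes all low-variance features. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning. You choose a threshold, and every feature whose training-set variance does not exceed it is removed. In scikit-learn the default threshold is 0, which removes only features that have the same value in every sample. Because variance depends on the scale of a feature, the threshold is only meaningful when the features are measured in comparable units.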
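The idea can be sketched with scikit-learn's `VarianceThreshold`. The threshold of 0.2 below is an arbitrary value chosen for illustration, not one taken from this article.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

iris = load_iris()
X = iris.data  # only X is needed: the selector never looks at the labels

# threshold=0.2 is an illustrative assumption; tune it for your own data.
selector = VarianceThreshold(threshold=0.2)
X_reduced = selector.fit_transform(X)

# Show each feature's variance and whether it survived the filter.
for name, var, keep in zip(iris.feature_names, selector.variances_, selector.get_support()):
    print(f"{name:20s} variance={var:.3f} kept={keep}")

print(X.shape, "->", X_reduced.shape)
```

On Iris, sepal width has the smallest variance (roughly 0.19), so this particular threshold drops that one column and keeps the other three.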
Univariate Feature Selection
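Univariate feature selection works by selecting the best features based on univariate statistical tests. We compare each feature to the target variable, one feature at a time, to see whether there is any statistically significant relationship between them. It is called ‘univariate’ because each feature is scored on its own, without considering how it interacts with the other features. The analysis of variance (ANOVA) F-test is one such test. Common scoring functions for classification are:
- f_classif: the ANOVA F-value between each feature and the class label
- chi2: the chi-squared statistic, which requires non-negative feature values
- mutual_info_classif: an estimate of the mutual information between each feature and the class label, which can also capture non-linear dependence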
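A hedged sketch of how the three scoring functions might be compared, using scikit-learn's `SelectKBest`. The helper `top_k_features`, the choice of `k=2`, and the fixed `random_state` are assumptions made for illustration.

```python
from functools import partial

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

iris = load_iris()
X, y = iris.data, iris.target


def top_k_features(score_func, k=2):
    """Hypothetical helper: names of the k features ranked highest by score_func."""
    selector = SelectKBest(score_func=score_func, k=k).fit(X, y)
    return [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]


selected = {
    "f_classif": top_k_features(f_classif),
    "chi2": top_k_features(chi2),  # valid here because all measurements are non-negative
    # mutual_info_classif is a randomized estimator; fix the seed for repeatable scores.
    "mutual_info_classif": top_k_features(partial(mutual_info_classif, random_state=0)),
}

for scorer, names in selected.items():
    print(f"{scorer:20s} -> {names}")
```

Each scorer is applied to the same data, so any disagreement in the selected features reflects the test itself rather than the setup.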
Recursive Feature Elimination
Recursive feature elimination (RFE) is a feature selection method that fits a model, removes the weakest feature (or features), and repeats on the remaining features until the specified number of features is reached. RFE requires you to specify how many features to keep. However, it is often not known in advance how many features are useful; in that case the number can be chosen by cross-validation, which scikit-learn provides as RFECV.
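The following sketch shows both variants with scikit-learn. The logistic regression estimator, `n_features_to_select=2`, and 5-fold cross-validation are assumptions for illustration; any estimator that exposes coefficients or feature importances could be used instead.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# RFE: the number of features to keep is fixed in advance (2 is an arbitrary choice).
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2, step=1).fit(X, y)
rfe_selected = [name for name, keep in zip(iris.feature_names, rfe.support_) if keep]
print("RFE kept:", rfe_selected)
print("RFE ranking (1 = kept):", rfe.ranking_.tolist())

# RFECV: the number of features is chosen by cross-validated accuracy.
rfecv = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5, scoring="accuracy").fit(X, y)
print("RFECV chose", rfecv.n_features_, "features")
```

With `step=1`, one feature is dropped per iteration, which is the slowest but most fine-grained setting.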
Differences Between Before and After Using Feature Selection
- Before using feature selection:
- After using feature selection:
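One possible way to produce such a comparison is sketched below. It does not reproduce the results of this article: the classifier, the selector (`SelectKBest` with `f_classif`, `k=2`), and the 5-fold cross-validation are all assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Before: the classifier sees all four features.
baseline = LogisticRegression(max_iter=1000)

# After: selection runs inside the pipeline, so each cross-validation fold
# selects features from its own training split only.
with_selection = make_pipeline(SelectKBest(f_classif, k=2), LogisticRegression(max_iter=1000))

acc_before = cross_val_score(baseline, X, y, cv=5).mean()
acc_after = cross_val_score(with_selection, X, y, cv=5).mean()

print(f"Mean accuracy with all features:      {acc_before:.3f}")
print(f"Mean accuracy with 2 selected features: {acc_after:.3f}")
```

Putting the selector inside the pipeline avoids leaking information from the validation folds into the selection step.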
Why is feature selection important? Advantages and disadvantages
Feature selection is important in machine learning because it restricts a model to the variables that are most useful for the task at hand, which keeps the system efficient and effective.
Feature selection and feature extraction are both used to mitigate the curse of dimensionality and to reduce overfitting. Both are ways of dealing with excessively complex models.
Advantages of using feature selection
By selecting only a few features (the most significant ones) and removing the remaining ones from consideration, it is often the case that a better model can be learned, especially in high-dimensional settings. This is counterintuitive: an ideal learning algorithm should perform at least as well without feature selection as with it, because the information in the selected features is already contained in the full data. Indeed, asymptotically (i.e., as the sample size tends to infinity) and given a perfect learning algorithm (in practice, there is no such algorithm), there is no reason to perform feature selection for a predictive task.
Disadvantages of feature selection
Finding the optimal feature subset is NP-hard. Exact methods for this problem, also known as best subset selection, exist only for linear models. Although their results are promising, they can handle at most a few hundred or a few thousand variables, so they are not applicable to high-dimensional data.
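To make the cost concrete, here is a minimal brute-force sketch that scores every non-empty feature subset of Iris. With p features there are 2^p - 1 such subsets, which is manageable for p = 4 but grows exponentially. This is not one of the exact methods mentioned above; the classifier and the 5-fold cross-validation are assumptions for illustration.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target
n_features = X.shape[1]

# Score every non-empty subset of feature indices by cross-validated accuracy.
results = {}
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        model = LogisticRegression(max_iter=1000)
        results[subset] = cross_val_score(model, X[:, list(subset)], y, cv=5).mean()

best_subset = max(results, key=results.get)
print("Subsets evaluated:", len(results))  # 2**4 - 1 = 15
print("Best subset:", [iris.feature_names[i] for i in best_subset])
```

Each additional feature doubles the number of subsets, so the same loop over 30 features would already need more than a billion model fits.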