Feature Engineering

What are Features?

In machine learning, the characteristics of the data (in other words, the columns that influence the model's output) that are selected for training the model are termed features.

Choosing good features is crucial for the performance of models.

What is Feature Engineering?

In programming, we focus mainly on the code. In machine learning, however, the primary focus shifts from writing code to representing the data: we concentrate on how the data is represented and how its features can be improved so that the model produces the desired results.

In the real world, data is more often than not messy and needs cleaning. ML developers spend most of their time (roughly 75%) cleaning the data and choosing the best features for their model, rather than actually training it.

This process of selecting, manipulating, and transforming the raw data into usable features is called Feature Engineering.

Feature Engineering involves several processes that are discussed below:

  • Feature selection: choosing the right set of characteristics as features for the model. For example, the square footage of a house is useful for predicting its price and could therefore be chosen as a feature.
  • Feature transformation: transforming or manipulating existing data (using mathematical operations or otherwise) to create new characteristics that can serve as features. For example, instead of using a date of birth (DOB) directly, one can compute the person's age and use that as a feature.
  • Feature construction: creating new, relevant characteristics/columns that are not derived from the existing data.
  • Feature extraction: combining two or more characteristics from the dataset into a single one, which reduces the dimensionality of the dataset. For example, the features population_of_region and size_of_region can be merged into one feature, density_of_region, by taking their ratio (see the sketch after this list).
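As a rough illustration, here is a minimal pandas sketch of feature transformation (date of birth to age) and feature extraction (population and size to density). The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical data purely for illustration.
df = pd.DataFrame({
    "date_of_birth": ["1990-05-17", "1985-11-02", "2001-03-23"],
    "population_of_region": [1_200_000, 850_000, 430_000],
    "size_of_region": [300.0, 410.5, 95.2],  # square kilometres
})

# Feature transformation: derive age (in years) from the raw date of birth.
today = pd.Timestamp.today()
df["age"] = (today - pd.to_datetime(df["date_of_birth"])).dt.days // 365

# Feature extraction: merge two columns into one lower-dimensional feature.
df["density_of_region"] = df["population_of_region"] / df["size_of_region"]

print(df[["age", "density_of_region"]])
```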

Importance of Feature Engineering

Consider a housing dataset: it is easy to turn a numeric column such as the number of rooms (num_room) into a feature vector, but it is not as simple for street_name. Machine learning algorithms can deal with numerical data but not with strings, so we must find a way to use that column appropriately to get the best results out of the model.

Since models cannot multiply strings by the learned weights, we use feature engineering to convert strings to numeric values.

One-Hot Encoding

If we have 10 different categories for street_name, we cannot simply assign the numbers 0 to 9 to the streets, because this would be problematic. Say the model learns a weight of 6: the first street (value 0) would contribute 0, the second (value 1) would contribute 6, the third (value 2) would contribute 12, and so on. Numbering the streets this way imposes a linear relationship between them that does not exist in reality, so we cannot use street_name as a feature in this form.

Here’s where One-Hot Encoding (OHE) comes into play. In OHE, each category gets its own column: in every row, only the column for that row's category is set to 1 and all the others are set to 0. For 10 street names, we would add 10 columns.
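As a quick sketch (assuming pandas is available), one-hot encoding can be done with get_dummies; the street names below are invented for illustration.

```python
import pandas as pd

# Hypothetical street names purely for illustration.
df = pd.DataFrame({"street_name": ["Shorebird Way", "Main St", "Elm Ave", "Main St"]})

# Each distinct street becomes its own 0/1 column; exactly one column is 1 per row.
one_hot = pd.get_dummies(df["street_name"], prefix="street")
print(one_hot)
```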

Sparse Representations

What if you have a million categories? You would have to add a million columns to your dataset, and in each row at most one or two of them would be 1 while the rest would be 0. This wastes a great deal of computation time and space. In such a situation, a Sparse Representation is used.

Note: OHE is a dense representation method.

In a sparse representation, instead of storing a million columns of zeros and a single 1, you store only the columns that actually contain a 1. This is far more efficient.
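As a rough sketch of the idea, SciPy's compressed sparse row format stores only the non-zero entries (their positions and values) rather than the full matrix of mostly zeros; the small one-hot matrix below is invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small dense one-hot matrix: 4 rows, 6 street columns, mostly zeros.
dense = np.array([
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
])

sparse = csr_matrix(dense)
print(sparse)      # lists only (row, column) -> value for the non-zero entries
print(sparse.nnz)  # number of stored non-zero values (4, not 24)
```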

For more info: https://developers.google.com/machine-learning/glossary#sparse_representation

Imputation

One of the most common problems in machine learning is missing values in datasets. Missing values can stem from numerous issues, such as human error or data lost through a broken data pipeline. Whatever the cause, missing values hurt the performance of machine learning algorithms.

Data scientists sometimes simply drop the rows that contain missing values, but this shrinks the dataset and can degrade the algorithm's performance.

What often works better is filling the missing entries with the median or mean of their column. This method is known as Imputation. It is a common feature engineering technique and lets you keep working with the data instead of discarding rows whose other values may be valuable for predictions.
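A minimal sketch of median imputation with pandas; the housing-style columns and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries (NaN).
df = pd.DataFrame({
    "square_footage": [1200.0, np.nan, 980.0, 1500.0],
    "num_rooms": [3.0, 4.0, np.nan, 5.0],
})

# Fill each missing value with the median of its column instead of dropping the row.
df_imputed = df.fillna(df.median())
print(df_imputed)
```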

Sources: https://acuvate.com/blog/the-what-why-and-how-of-feature-engineering/, https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10, https://developers.google.com/machine-learning/crash-course/representation/feature-engineering
