A Beginner's Guide to Machine Learning with Python, Part I: Feature Engineering

Govind
9 min read · Aug 30, 2023
Image by the author

This guide aims to cover the basic concepts of feature engineering from top to bottom. We'll start by understanding what feature engineering is, look at the different types of features and how to deal with them, and finally walk through some feature engineering techniques in Python with a practical example that you can work on in your free time.

Introduction to feature engineering

Feature engineering is the process of identifying, transforming, and creating features to help improve the performance of a machine learning model. All data scientists, experienced or not, will at some point need to transform raw data into valuable representations that modeling algorithms can process. These representations are features, and feature engineering is the process of building them with custom logic so that downstream ML algorithms can use them. This article introduces feature engineering as a set of techniques for preparing features from raw inputs. We'll cover its importance, examples, usefulness, and some potential pitfalls to avoid.

Why is Feature Engineering Important?

Machine learning algorithms are excellent at identifying patterns in data. They are, however, not good at extracting signal from data that lacks obvious structure. When an algorithm takes in a large amount of raw data and has difficulty finding useful features, the result is often inaccurate predictions and poor performance. Feature engineering helps you develop features that are useful to your algorithms, resulting in greater accuracy and better results.

What Does Feature Engineering Involve?

Feature engineering is the process of creating new features out of raw data. These features can be numbers, categorical values, images, or other variables within your data. For example, each column in a dataset is a feature of whatever the dataset is trying to explain. Feature engineering looks at the raw data and attempts to identify and create useful features from it. It is often the hardest part of data science, and it's something that most data science students will struggle with at some point.

The next part of this guide focuses on the techniques that will help identify new features, process existing ones, and create brand-new features.

Types Of Features

We can think of the following different types of data: Numerical Data, Categorical Data, Date & Time, Text Data, and Media (Images/ Video/ Audio).

Most features in a dataset belong to one of these categories. Categorical features are typically non-numeric and discrete. For example, a box might have a particular shape and color, both of which are categorical.

At the same time, the continuous features of the same box are its dimensions, for example, height, width, and breadth.

Feature Engineering Techniques In Python

As discussed above, we know that there are different types of features. This means that we have to deal with each type of feature in a different way.

Engineering features for Numerical data

Numerical data usually comes in the form of numbers and is used to measure something, like temperature, population, or expense.

For example: 24 °C, 2,000,000 people, $100,000.

We’ll cover some of the more commonly used types of feature engineering techniques for numerical data.

Rescaling Numeric features

Rescaling is a common preprocessing task in machine learning. There are several rescaling techniques, but one of the simplest is called min-max scaling. Min-max scaling uses the minimum and maximum values of a feature to rescale each value x to (x − min) / (max − min), which maps the feature into a fixed range, [0, 1] by default.

Let’s look at an example:

Let’s start by creating an array, let’s say sales:
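The array-creation code isn't shown here, so below is a minimal sketch that reproduces the values in the output that follows, assuming NumPy and scikit-learn's preprocessing module (used throughout this section):

# Load libraries
import numpy as np
from sklearn import preprocessing

# Create the sales feature as a column vector
sales = np.array([[-200], [-10], [50], [1000], [15], [20], [30],
[50], [100], [200], [10000], [-12000], [150000], [160000]])

print(sales)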

OUTPUT:
[[ -200]
[ -10]
[ 50]
[ 1000]
[ 15]
[ 20]
[ 30]
[ 50]
[ 100]
[ 200]
[ 10000]
[-12000]
[150000]
[160000]]

Now let’s use the MinMaxScaler to scale this data

# Create a scaler
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0,1))

#Scale feature
scaled_sales = minmax_scale.fit_transform(sales)

#Show feature
scaled_sales

output:

array([[0.06860465],
[0.0697093 ],
[0.07005814],
[0.0755814 ],
[0.06985465],
[0.06988372],
[0.06994186],
[0.07005814],
[0.07034884],
[0.07093023],
[0.12790698],
[0. ],
[0.94186047],
[1. ]])

Scikit-learn's MinMaxScaler offers two ways to rescale a feature. One option is to use fit to calculate the minimum and maximum values of the feature and then use transform to rescale it. The second option is to use fit_transform() to do both operations at once. There is no mathematical difference between the two, but it is often useful to perform the steps separately, for example fitting the scaler on training data and then applying the same transformation to new data.
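As a small, hypothetical illustration of that separate-fit workflow (the train and test arrays below are made up for the example and reuse the minmax_scale object from above):

# Fit the scaler on a training split only
train = np.array([[-200], [50], [1000]])
test = np.array([[150], [5000]])

minmax_scale.fit(train) # learns the min and max from the training data
scaled_test = minmax_scale.transform(test) # reuses those values on new data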

Standardizing features

Scaling features so that they are approximately standard normally distributed is a common alternative to min-max scaling.

To accomplish this, we standardize the data so that it has a mean of 0 and a standard deviation of 1; each value x is replaced by z = (x − mean) / (standard deviation). Let's do this by creating a standard scaler object and then running the data through it:

#Create a scaler
std_scaler = preprocessing.StandardScaler()
std_sales = std_scaler.fit_transform(sales)

# Show feature standardized
std_sales

output:

array([[-0.40932764],
[-0.40583847],
[-0.40473663],
[-0.38729081],
[-0.40537937],
[-0.40528755],
[-0.40510391],
[-0.40473663],
[-0.40381843]])

The transformed feature shows how many standard deviations the original value lies from the feature's mean value (this is also called a z-score in statistics).

Standardization is frequently chosen over min-max scaling as the preferred scaling technique for machine learning preprocessing, in my experience.

But the effects might vary based on the learning algorithm. For instance, standardization frequently improves the performance of principal component analysis, and min-max scaling is typically advised for neural networks.

Normalizing:

Normalization is another method of feature scaling. We use it most often when the data is not skewed along either axis or when it does not follow a Gaussian distribution.


By converting data features with different scales to a single scale, normalization further simplifies the data for modeling. As a result, each feature (variable) tends to have a similar impact on the final model.

For this example, let’s try to work with some different data.

# Load libraries 
import numpy as np
from sklearn.preprocessing import Normalizer

# Create feature matrix
x = np.array([[2.5, 1.5],[2.1, 3.4], [1.5, 10.2], [4.63, 34.4], [10.9, 3.3], [17.5,0.8], [15.4, 0.7]])

# Create normalizer
normalizer = Normalizer(norm="l2")

# Transform the feature matrix
normalizer.transform(x)

Output:

array([[0.85749293, 0.51449576],
[0.52549288, 0.850798 ],
[0.14549399, 0.98935914],
[0.13339025, 0.99106359],
[0.95709822, 0.28976368],
[0.99895674, 0.04566659],
[0.99896854, 0.04540766]])

Here we can see that all the values are between 0 and 1. This is because the L2 Normalizer rescales each row (observation) to have unit Euclidean norm, and all the inputs here are non-negative.
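As a quick sanity check (this snippet is not in the original), we can reproduce the first row by hand by dividing it by its Euclidean norm:

# Manual L2 normalization of the first observation
row = np.array([2.5, 1.5])
print(row / np.linalg.norm(row)) # -> [0.85749293 0.51449576], matching the first row above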

There are many more techniques and transformations for engineering numerical data that you can perform. We’ll take a look at a few of these in the last part of this guide.

Engineering Features for Categorical Data.

Categorical data is data that measures something qualitatively or classifies things into groups. Categorical data can be of two types:

Ordinal data, i.e., data that follows some natural order. For example, temperature can be Cold, Average, or Hot.

Nominal data, which classifies something into groups or categories with no inherent order. For example, Male or Female.

In this section, we'll see how to deal with both of these.

Encoding Ordinal Data

Encoding is the process of converting ordinal data into a numeric format so that a machine learning algorithm can make sense of it. To transform ordinal data into numeric data, we typically convert each class into a number. For example, cold, average, and hot might be mapped to 1, 2, and 3 respectively. Let's see how we can do this easily.

Let’s start by importing pandas and creating the dataset.

#Importing libraries
import pandas as pd

#Creating the data
data = pd.DataFrame({"Temperature": ["Very Cold", "Cold", "Warm", "Hot", "Very Hot"]})

print(data)

Now let’s map the data to numerical values.

#Mapping to numerical data
scale_map = {"Very Cold": -3,
"Cold": -1,
"Warm": 0,
"Hot" : 1,
"Very Hot" : 3}

#Replacing with mapped values
data_mapped = data["Temperature"].replace(scale_map)
data["encoded_temp"] = data_mapped
data

Output:

  Temperature  encoded_temp
0   Very Cold            -3
1        Cold            -1
2        Warm             0
3         Hot             1
4    Very Hot             3

As you can see, I've mapped a numerical value to each of the observations. Notice that I've used -3 for Very Cold and -1 for Cold. Mapping this way makes the feature more effective because the assigned numerical values reflect the real-world ordering and spacing of the categories.
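As an alternative (not part of the original example), scikit-learn's OrdinalEncoder can assign ordered integer codes automatically if you pass the categories in order, though it won't reproduce the custom -3 to 3 spacing used above:

# Ordinal encoding with scikit-learn (assigns 0..4 in the given category order)
from sklearn.preprocessing import OrdinalEncoder

ordinal_enc = OrdinalEncoder(categories=[["Very Cold", "Cold", "Warm", "Hot", "Very Hot"]])
data["ordinal_temp"] = ordinal_enc.fit_transform(data[["Temperature"]]).ravel()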

One Hot Encoding Nominal Data

In one-hot encoding, we convert each class of nominal data into its own feature and assign a binary value of 1 or 0 to indicate whether that class applies to an observation. Let's see how this can be done using both the LabelBinarizer in scikit-learn and pandas.


Importing Libraries and Creating the Data

#Import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
# Create the dataset
color_data = pd.DataFrame({"itemid": ["A1", "B1", "C2", "D4", "E9"],
"color": ["red", "blue", "green", "yellow", "pink"]})

Encoding this data with LabelBinarizer()

# Creating one-hot encoder
one_hot = LabelBinarizer()

# One-hot encode the data and assign it to a var
color_encoding = one_hot.fit_transform(color_data.color)

# feature classes
color_new = one_hot.classes_

#creating new Data Frame with encoded values
encoded = pd.DataFrame(color_encoding)
encoded.columns = color_new

#Merging the original data with the encoded values (the color column is kept for comparison)
color_data_new = pd.concat([color_data, encoded], axis = 1)

#Viewing new data
print(color_data_new)

Output:

  itemid   color  blue  green  pink  red  yellow
0     A1     red     0      0     0    1       0
1     B1    blue     1      0     0    0       0
2     C2   green     0      1     0    0       0
3     D4  yellow     0      0     0    0       1
4     E9    pink     0      0     1    0       0

We can also do it with pandas, which is much quicker, though it is less flexible.

#Creating encoded df
encoded_pd = pd.get_dummies(color_data.color)

#Merging the original data with the encoded values
color_data_pd = pd.concat([color_data, encoded_pd], axis = 1)

#Viewing new data
print(color_data_pd)
Output:

  itemid   color  blue  green  pink  red  yellow
0     A1     red     0      0     0    1       0
1     B1    blue     1      0     0    0       0
2     C2   green     0      1     0    0       0
3     D4  yellow     0      0     0    0       1
4     E9    pink     0      0     1    0       0

It is good practice to drop one of the features after one hot encoding to reduce linear dependency.

#Dropping the final column
color_data_pd.drop("yellow",axis =1, inplace = True)
color_data_pd
  itemid   color  blue  green  pink  red
0     A1     red     0      0     0    1
1     B1    blue     1      0     0    0
2     C2   green     0      1     0    0
3     D4  yellow     0      0     0    0
4     E9    pink     0      0     1    0
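If you're using pandas, get_dummies can also drop one level automatically while encoding, which avoids the manual drop step above. Note that this is an alternative to the approach shown here, and it drops the first category alphabetically (blue in this case) rather than yellow:

#One-hot encoding while dropping the first category
encoded_drop_first = pd.get_dummies(color_data["color"], drop_first = True)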

Feature engineering workflow in Python


Now let’s test our skills with some practical examples with the help of this GitHub repo.

Our task is to identify and convert the features, and I’m assuming a basic understanding of pandas and the other basic data science packages.

Dataset and Problem Statement

Before continuing to the next part, take a look at the example case we are going to work on and download the dataset.

Our dataset contains fake data collected from a fake survey. It includes survey responses for 3,000 people, with 18 columns and 3,000 rows.

Again, the data is fake and much of the information might not make sense. But I randomly generated this dataset specifically for this guide, and I feel it's good enough for practicing what we've learned.

Solution and My Approach

Basic transformations:

Once I've imported the necessary libraries and the dataset, I start by checking for any missing values and perform the necessary imputations.

I also like to summarize the data with descriptive statistics and create some visualizations to identify outliers and patterns before performing any analysis.

Another thing I like to do is separate the numerical and categorical features and assign them to two separate variables. My workflow often consists of dealing with each type of data separately, one at a time.

Here's a quick example of how I would explore the features; a short code sketch follows the list.

  • Looking at the age group: we can group people by age and immediately encode it. To do this, we can create a new column, "age_group".
  • Or we can discretize the age into bins, i.e., (0–5, 5–10, 10–15, …).
  • With weight and height, we can calculate BMI, which is weight in kilograms divided by height in metres squared (kg/m²). I'll apply this formula and create a new BMI feature.
  • We can also group people into different categories based on BMI and encode these values.
  • We can group people by income level as well. To do this, we can create a new column, "income_levels".
  • We can follow similar steps as before to create income bins as well.
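Here's a minimal sketch of those steps. The DataFrame name (df) and the column names (age, weight_kg, height_m, income) are assumptions, since they depend on the actual survey dataset:

# Discretize age into labeled bins (hypothetical column names)
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 50, 65, 120],
labels=["child", "young_adult", "adult", "middle_aged", "senior"])

# Compute BMI from weight (kg) and height (m), then bin it into standard categories
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
df["bmi_category"] = pd.cut(df["bmi"], bins=[0, 18.5, 25, 30, 100],
labels=["underweight", "normal", "overweight", "obese"])

# Group people into income levels using quantile-based bins
df["income_levels"] = pd.qcut(df["income"], q=4,
labels=["low", "lower_middle", "upper_middle", "high"])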

Learning More

Feature engineering is an important and ongoing part of any machine-learning project. Even when you are choosing ready-made features, you need to manipulate them to create the best possible features for your algorithms. Typically, you will start feature engineering with a raw dataset and end up with features that are ready for use. You may have to recreate features for different algorithms.

You can check out the notebook for this guide, the dataset, and the examples I used over at GitHub.

This guide is based on "Machine Learning with Python Cookbook" by Chris Albon. It is a great resource that I use from time to time as a reference for machine learning projects.

Here’s a list of free Resources to help you in your machine-learning journey.

Thank you! and have a Great day!😀
