
Python Data Normalization: Complete Guide


The world runs on data. It has contributed to some of the most significant innovations of the last decade, with AI and machine learning at the forefront. From fears of users developing feelings for ChatGPT to luxury retailers like Michael Kors deploying AI virtual assistants, data is the lifeblood of the digital, AI-driven world we live in.

If you're looking to harness the potential of AI and machine learning in your business, you must first ensure that your data is in the correct format. This is where data normalization comes in. Loosely defined, it is the process of transforming numerical and text data into a standard format that can help improve the accuracy of a machine learning model.

With this in mind, today's article will explore:

  • Data normalization and why it's important
  • Common data normalization techniques
  • How to apply these techniques in Python

Keep reading below for more insights.

What Is Data Normalization?

Data normalization is one of the most important steps in data preprocessing. Without it, getting insights from any information becomes difficult, especially when working with varied datasets.

For example, if you're comparing the performance of students in a class across different subjects and these subjects have varying grading scales, you may use data normalization techniques to convert the grades into a similar scale for easier comparison.

Take an example of math, science, English, and history with the following grading scales:

  • Math: 0 to 100
  • English: 0 to 50
  • Science: 0 to 80
  • History: 0 to 30

A student scores the following:

  • Math: 80
  • English: 35
  • Science: 50
  • History: 20

At first glance, the math score looks far more impressive than the history score, but only because the subjects use different grading scales. Comparing the student's performance across subjects in this raw format is complicated. However, data normalization makes the comparison easier by rescaling these scores to a standard range, typically 0 to 1 or -1 to 1.

First, let's prepare this data for normalization by getting the percentage score for each subject:

  • Math: 80 / 100 = 0.80
  • English: 35 / 50 = 0.70
  • Science: 50 / 80 = 0.625
  • History: 20 / 30 ≈ 0.667

The maximum percentage is 0.80 (math), and the minimum is 0.625 (science).

One of the simplest ways to normalize data is Min-Max Scaling. This technique rescales each value using the minimum and maximum values of the feature so that the results fall between 0 and 1. We'll use it to manually normalize the performance data above. Here is the formula:

    Normalized Value = (X - X_min) / (X_max - X_min)

Where:

  • X is the original value.
  • X_min is the minimum value of the feature in the dataset.
  • X_max is the maximum value of the feature in the dataset.

We can normalize this data as follows:

  • Math:

    Normalized Value = (0.80 - 0.625) / (0.80 - 0.625) = 1.000

  • English:

    Normalized Value = (0.70 - 0.625) / (0.80 - 0.625) = 0.429

  • Science:

    Normalized Value = (0.625 - 0.625) / (0.80 - 0.625) = 0.000

  • History:

    Normalized Value = (0.667 - 0.625) / (0.80 - 0.625) = 0.238

Now you have:

  • Math: 1.00
  • English: 0.429
  • Science: 0.000
  • History: 0.238

The performance is now on a standard scale. Drawing insights from this data, such as which subject the student performed best and worst in, is much easier. And this is what data normalization seeks to achieve.
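If you'd rather not work through the arithmetic by hand, here is a minimal sketch of the same calculation using NumPy (the percentage scores are built directly from the grades above):

    import numpy as np

    # Percentage scores: each score divided by the maximum possible score
    # (Math, English, Science, History)
    pct = np.array([80/100, 35/50, 50/80, 20/30])

    # Min-Max Scaling applied to the whole array at once
    normalized = (pct - pct.min()) / (pct.max() - pct.min())

    print(normalized.round(3))  # [1.    0.429 0.    0.238]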

Why Data Normalization Is Important

Data normalization is important when preprocessing data for algorithms and machine learning models. To understand why, let's look at our previous school performance example. Bringing up the values:

  • Math: 80
  • English: 35
  • Science: 50
  • History: 20

Suppose we want to use this data to train a machine learning algorithm. Feeding these data points into the system in their current format can cause various problems down the road. To explain why, we will offer a broad, general explanation without delving too much into the intricate details.

Think of the machine learning process as a system of different-sized hills; the lowest point possible represents the lowest error, while the highest peak represents the highest error. Naturally, the algorithm aims to get to the lowest point possible.

Most modern AI models, especially those that utilize neural networks, function by iteratively adjusting the model parameters. At each iteration, the algorithm adjusts the parameters in a direction that takes it to the lowest point possible.

Features with large values, like the math score of 80, create big hills in this landscape, while features with small values, like the history score of 20, create small ones. All features are equally important, but the algorithm doesn't know this since it processes data in its raw form. As a result, it may focus more on the large-range features and neglect the small ones, which can lead to biased and inaccurate predictions.

This is why data normalization is important. It levels the hills, ensuring none is seemingly more important than the other. In this way, the model can take balanced and accurate steps toward the lowest point possible, otherwise known as convergence.

In short, data normalization ensures AI models don't favor any one feature just because it has a bigger range. By putting your data on a standard scale, the algorithm considers each feature equally, helping it reach convergence faster and more accurately.
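To make the effect of feature ranges concrete, here is a small illustrative sketch, using NumPy and a made-up two-feature dataset, showing how a feature with a large range dominates a simple distance calculation until the data is normalized:

    import numpy as np

    # Hypothetical dataset: column 0 ranges roughly 0-100, column 1 roughly 0-1
    X = np.array([[80.0, 0.2],
                  [20.0, 0.9],
                  [55.0, 0.5]])

    # Distance between the first two samples on the raw data
    raw_dist = np.linalg.norm(X[0] - X[1])

    # Min-Max scale each column, then measure the same distance
    X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])

    print(raw_dist)     # ~60.0, driven almost entirely by the 0-100 column
    print(scaled_dist)  # ~1.41, both columns now contribute comparably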

Understanding Data Normalization in Python

We're now at the crux of our discussion, where we'll show you how to normalize data in Python. But before we get to that, why the Python programming language? There are many languages; what makes this one suitable for normalizing data? Keep reading below to find out.

Robust Data Normalization Libraries

One of the primary reasons to use Python to normalize your data is the rich ecosystem of libraries the language offers, including:

  • Pandas: Use this library for its wide array of data analysis and structuring tools.
  • Scikit-learn: A comprehensive library that includes prebuilt normalization functions such as MinMaxScaler and StandardScaler.
  • NumPy: This library supports working with arrays and matrices, making it useful for custom normalization tasks.

Easy-to-Use Syntax

Python, as a programming language, has a fluid syntax that makes it easy to read and understand code. With strong community support, code maintenance, and debugging are also relatively straightforward, with many free resources available online.

Integration With Machine Learning Frameworks

Python seamlessly integrates with popular machine learning frameworks such as TensorFlow and PyTorch. This naturally makes it the language of choice for normalizing input data for machine learning models.

Now that you understand the versatility of this programming language, you may be wondering how to normalize data in Python. Don’t worry—we’ve got you covered. Keep reading to find out more:

How to Normalize Data in Python

Below is a rundown of the different techniques for normalizing data in Python:

Min-Max Scaling

Remember the formula we used to normalize our student score results? It's known as Min-Max Scaling because it takes the minimum and maximum values of a given dataset and uses them to compute the normalized value. Going back to our previous example:

  • Math: 80 out of 100
  • English: 35 out of 50
  • Science: 50 out of 80
  • History: 20 out of 30

Let us normalize this data using Min-Max Scaling in Python. Run the following code in your editor of choice with all the prerequisites installed:

    from sklearn.preprocessing import MinMaxScaler
    import pandas as pd

    # Original data
    data = {'Subject': ['Math', 'English', 'Science', 'History'],
            'Score': [80, 35, 50, 20],
            'Max_Score': [100, 50, 80, 30]}

    # Convert to DataFrame
    df = pd.DataFrame(data)

    # Calculate the percentage score
    df['Percentage'] = df['Score'] / df['Max_Score']

    # Initialize the MinMaxScaler
    scaler = MinMaxScaler()

    # Fit and transform the percentage scores
    df['Normalized'] = scaler.fit_transform(df[['Percentage']])

    # Display the normalized data
    print(df[['Subject', 'Normalized']])

This is the response you'll get:

       Subject  Normalized
    0     Math    1.000000
    1  English    0.428571
    2  Science    0.000000
    3  History    0.238095

Notice that the results are similar to the normalized values we calculated manually. This process is much faster and can save you a lot of time when working with large datasets.
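One practical note: in a real project you would typically fit the scaler on your training data only and then reuse it on new data. Here is a minimal sketch, continuing from the scaler object fitted above (the new 60% score is a hypothetical example):

    # A hypothetical new percentage score, scaled with the original min and max
    new_scores = pd.DataFrame({'Percentage': [0.60]})
    print(scaler.transform(new_scores))       # negative, since 0.60 is below the training minimum

    # inverse_transform maps a normalized value back to a percentage score
    print(scaler.inverse_transform([[0.5]]))  # 0.7125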

Z-Score Scaling

While the min-max method uses the minimum and maximum values in a given dataset for normalization, Z-Score scaling centers the data around the mean and scales it by the standard deviation, ensuring that each feature contributes equally to the model. Here is the formula:

    Z = (X - μ) / σ

Where:

  • Z is the Z-score (standardized value).
  • X is the original value.
  • μ (mu) is the mean of the feature.
  • σ (sigma) is the standard deviation of the feature.

Instead of spending time calculating manually, especially for large datasets, you can use Python to normalize data using Z-score scaling. Applying our previous example, here is the code you should use:

    from sklearn.preprocessing import StandardScaler
    import pandas as pd

    # Original data
    data = {'Subject': ['Math', 'English', 'Science', 'History'],
            'Score': [80, 35, 50, 20]}

    # Convert to DataFrame
    df = pd.DataFrame(data)

    # Initialize the StandardScaler
    scaler = StandardScaler()

    # Fit and transform the data
    df['Z-Score'] = scaler.fit_transform(df[['Score']])

    # Display the standardized data
    print(df[['Subject', 'Z-Score']])

This is the response you should get:

       Subject   Z-Score
    0     Math  1.521278
    1  English -0.507093
    2  Science  0.169031
    3  History -1.183216

Now, if you're wondering why these numbers look different from the min-max results, it's because we used a different approach (and, in this example, the raw scores rather than the percentage scores). The Z-score measures how many standard deviations a value is from the mean.

For example, Math, the highest raw score, is about 1.52 standard deviations above the mean, while History, the lowest raw score, is about 1.18 standard deviations below it. The relative ordering of the scores is preserved; only the scale changes.
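A quick sanity check: after standardization, the Z-Score column should have (approximately) a mean of 0 and a population standard deviation of 1, which is exactly what StandardScaler is designed to produce:

    # Verify the standardized column: mean ~0, population std ~1
    print(df['Z-Score'].mean())        # ~0.0
    print(df['Z-Score'].std(ddof=0))   # ~1.0 (ddof=0 matches StandardScaler)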

MaxAbs Scaling

This data normalization technique is useful for datasets with positive and negative values. It helps normalize the data within a specific range, typically from -1 to 1, by dividing each feature by its maximum absolute value using this formula:

    X_scaled = X / max(|X|)

Where:

  • X_scaled is the scaled value of the feature.
  • X is the original value of the feature.
  • max(|X|) is the maximum absolute value of the feature.

We're going to use a different example to show you how this method works. Consider a dataset with the following values:

  • Feature A: 10, -20, 15, -5

We'll use the following Python code to get the normalized values using the MaxAbsScaler for feature A:

    from sklearn.preprocessing import MaxAbsScaler
    import pandas as pd

    # Original data
    data = {'Feature': [10, -20, 15, -5]}
    df = pd.DataFrame(data)

    # Initialize the MaxAbsScaler
    scaler = MaxAbsScaler()

    # Fit and transform the data
    df['Scaled'] = scaler.fit_transform(df[['Feature']])

    # Display the scaled data
    print(df[['Feature', 'Scaled']])

This is the result you should get:

       Feature  Scaled
    0       10    0.50
    1      -20   -1.00
    2       15    0.75
    3       -5   -0.25

We now have values on a standard scale from -1 to 1, which is much easier for an algorithm to interpret.

Decimal Scaling

This technique is suitable for datasets with varied decimal points. It normalizes data by moving the decimal point of each value, depending on the maximum absolute value of the feature. Here is the formula:

X_scaled = X / 10^j

Where:

  • X_scaled is the scaled value of the feature.
  • X is the original value of the feature.
  • 10^j is 10 raised to the power of j, where j is the smallest integer such that the maximum absolute scaled value is less than 1.

Let's normalize the following dataset using decimal scaling in Python:

  • Values: 0.345, -1.789, 2.456, -0.678

Here is the code:

    import math
    import pandas as pd

    # Original data with decimal points
    data = {'Feature': [0.345, -1.789, 2.456, -0.678]}
    df = pd.DataFrame(data)

    # Find the maximum absolute value in the dataset
    max_abs_value = df['Feature'].abs().max()

    # Determine the scaling factor:
    # the smallest power of 10 greater than the maximum absolute value
    scaling_factor = 10 ** math.ceil(math.log10(max_abs_value))

    # Apply Decimal Scaling
    df['Scaled'] = df['Feature'] / scaling_factor

    # Display the original and scaled data
    print(df)

This is the result:

       Feature  Scaled
    0    0.345  0.0345
    1   -1.789 -0.1789
    2    2.456  0.2456
    3   -0.678 -0.0678

How to Normalize Text Data in Python

As you perform data analysis and preparation for your machine learning model, you may also have to normalize text data. Unlike numerical data, text normalization typically involves lowercasing, removing punctuation, and tokenization, which splits text into units called tokens (words, phrases, or symbols). Say we wanted to tokenize this sentence:

    "Tokenization splits text into words."

Use the following code to normalize this text data:

    import nltk
    from nltk.tokenize import word_tokenize

    # Download the necessary NLTK resource
    nltk.download('punkt')

    # Sample text
    text = "Tokenization splits text into words."

    # Tokenize the text
    tokens = word_tokenize(text)

    # Display the tokens
    print(tokens)

The tokenized text should be as follows:

    ['Tokenization', 'splits', 'text', 'into', 'words', '.']
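Tokenization is only one of the steps mentioned above. Here is a minimal sketch of the other two, lowercasing and punctuation removal, using only Python's standard library and applied before tokenization:

    import string

    text = "Tokenization splits text into words."

    # Lowercase the text and strip punctuation characters
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))

    print(cleaned)  # tokenization splits text into words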
  	

Conclusion

We have come to the end of our discussion on data normalization methods. This process is essential for preparing input data for machine learning models. While you can manually normalize data, it's impractical for large datasets.

Python is preferred because it has comprehensive libraries, a clean and fluid syntax, and is open source. The methods we've discussed include Min-Max, Z-Score, MaxAbs, and Decimal Scaling. Now you can train your machine learning models with greater accuracy!

FAQ

What is the difference between scaling and normalization?

While both are common data preprocessing techniques, there's a subtle difference between them. Normalization attempts to adjust the range of values by fitting them into a specific range, whereas scaling attempts to adjust a range of feature values without necessarily fitting them into a specific range.

What is the difference between standardization and normalization?

Standardization transforms data to have a mean of zero and a standard deviation of one without binding it to a specific range. Normalization rescales data to fit within a specific range, say [0,1] or [-1,1].

What is the difference between normalized data and denormalized data?

Normalized data has been transformed to fit within a specific range as part of machine learning data preprocessing. Denormalized data, on the other hand, has been reverted to its original form.
