Python has become the de facto language for data scientists, largely because of its machine learning (ML) libraries such as scikit-learn. When training ML models, sometimes the data must be normalized first, and other times the training process normalizes the data for you. Normalizing the data is considered best practice, but simple helpers for doing it yourself aren't always exposed out of the box.
There are two main types of normalization. Z-score normalization works well for data that may contain outliers and can take negative or positive values. Min-max normalization (called binary normalization here, to match the function name below) rescales all values into the range 0 to 1.
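As a quick illustration of the two formulas, the short standalone sketch below applies each to a small made-up series (the values are arbitrary and only for demonstration):

import pandas

# Arbitrary example values, purely illustrative.
values = pandas.Series([10.0, 20.0, 30.0, 40.0])

# Z-score: subtract the mean, then divide by the standard deviation.
z_scores = (values - values.mean()) / values.std(ddof=0)

# Min-max: shift so the minimum becomes 0, then divide by the range, landing in [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

print(z_scores.round(2).tolist())  # [-1.34, -0.45, 0.45, 1.34]
print(min_max.round(2).tolist())   # [0.0, 0.33, 0.67, 1.0]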
The only import required is pandas. The training and test datasets are DataFrames, and the columns to normalize are passed as a list of column names.
import pandas

def z_score_normalize(training_dataset, test_dataset, columns_to_normalize):
    # Compute the statistics over both datasets so train and test share one scale.
    combined = pandas.concat([training_dataset, test_dataset])
    for column in [x for x in training_dataset.columns if x in columns_to_normalize]:
        mean = combined[column].mean()
        std = combined[column].std(ddof=0)  # population standard deviation
        test_dataset[column] = (test_dataset[column] - mean) / std
        training_dataset[column] = (training_dataset[column] - mean) / std
    return training_dataset, test_dataset
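A minimal usage sketch, assuming the function and import above; the column names and values are made up for illustration:

train = pandas.DataFrame({"age": [20, 30, 40], "label": [0, 1, 0]})
test = pandas.DataFrame({"age": [25, 35], "label": [1, 0]})

# Only "age" is normalized; "label" is left alone because it is not listed.
train, test = z_score_normalize(train, test, ["age"])
print(train["age"].round(2).tolist())  # [-1.41, 0.0, 1.41]
print(test["age"].round(2).tolist())   # [-0.71, 0.71]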
def binary_normalize(training_dataset, test_dataset, columns_to_normalize):
    # Compute the range over both datasets so train and test share one scale.
    combined = pandas.concat([training_dataset, test_dataset])
    for column in [x for x in training_dataset.columns if x in columns_to_normalize]:
        col_min = combined[column].min()  # avoid shadowing the built-in min/max
        col_max = combined[column].max()
        test_dataset[column] = (test_dataset[column] - col_min) / (col_max - col_min)
        training_dataset[column] = (training_dataset[column] - col_min) / (col_max - col_min)
    return training_dataset, test_dataset
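The min-max version is called the same way; again, the data below is a made-up example:

train = pandas.DataFrame({"score": [1.0, 5.0, 9.0], "label": [0, 1, 0]})
test = pandas.DataFrame({"score": [3.0, 7.0], "label": [1, 0]})

# Every normalized value falls between 0 and 1, scaled against the combined min and max.
train, test = binary_normalize(train, test, ["score"])
print(train["score"].tolist())  # [0.0, 0.5, 1.0]
print(test["score"].tolist())   # [0.25, 0.75]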