Python has become the de facto language for data scientists, largely because of its machine learning (ML) libraries such as scikit-learn. When training ML models, sometimes the data must be normalized first, and other times the training process normalizes the data for you. Normalizing the data is considered best practice, but simple helpers for doing it yourself aren't always exposed out of the box.
There are two main types of normalization. Z-score normalization works well for data that may contain outliers and can take negative or positive values. Min-max normalization (called binary normalization here, to match the function name below) rescales all values into the range 0 to 1.
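As a quick illustration of the two formulas, the short standalone sketch below applies each to a small made-up series (the values are arbitrary and only for demonstration):

import pandas

# Arbitrary example values, purely illustrative.
values = pandas.Series([10.0, 20.0, 30.0, 40.0])

# Z-score: subtract the mean, then divide by the standard deviation.
z_scores = (values - values.mean()) / values.std(ddof=0)

# Min-max: shift so the minimum becomes 0, then divide by the range, landing in [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

print(z_scores.round(2).tolist())  # [-1.34, -0.45, 0.45, 1.34]
print(min_max.round(2).tolist())   # [0.0, 0.33, 0.67, 1.0]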
The only import required is pandas. The training and test datasets are DataFrames, and the columns to normalize are passed as a list of column names.
import pandas

def z_score_normalize(training_dataset, test_dataset, columns_to_normalize):
    # Compute the statistics over both datasets so train and test share one scale.
    combined = pandas.concat([training_dataset, test_dataset])
    for column in [x for x in training_dataset.columns if x in columns_to_normalize]:
        mean = combined[column].mean()
        std = combined[column].std(ddof=0)  # population standard deviation
        test_dataset[column] = (test_dataset[column] - mean) / std
        training_dataset[column] = (training_dataset[column] - mean) / std
    return training_dataset, test_dataset
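A minimal usage sketch, assuming the function and import above; the column names and values are made up for illustration:

train = pandas.DataFrame({"age": [20, 30, 40], "label": [0, 1, 0]})
test = pandas.DataFrame({"age": [25, 35], "label": [1, 0]})

# Only "age" is normalized; "label" is left alone because it is not listed.
train, test = z_score_normalize(train, test, ["age"])
print(train["age"].round(2).tolist())  # [-1.41, 0.0, 1.41]
print(test["age"].round(2).tolist())   # [-0.71, 0.71]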
def binary_normalize(training_dataset, test_dataset, columns_to_normalize):
    # Compute the range over both datasets so train and test share one scale.
    combined = pandas.concat([training_dataset, test_dataset])
    for column in [x for x in training_dataset.columns if x in columns_to_normalize]:
        col_min = combined[column].min()  # avoid shadowing the built-in min/max
        col_max = combined[column].max()
        test_dataset[column] = (test_dataset[column] - col_min) / (col_max - col_min)
        training_dataset[column] = (training_dataset[column] - col_min) / (col_max - col_min)
    return training_dataset, test_dataset
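The min-max version is called the same way; again, the data below is a made-up example:

train = pandas.DataFrame({"score": [1.0, 5.0, 9.0], "label": [0, 1, 0]})
test = pandas.DataFrame({"score": [3.0, 7.0], "label": [1, 0]})

# Every normalized value falls between 0 and 1, scaled against the combined min and max.
train, test = binary_normalize(train, test, ["score"])
print(train["score"].tolist())  # [0.0, 0.5, 1.0]
print(test["score"].tolist())   # [0.25, 0.75]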