23 December 2024
Data cleaning is an essential part of any data processing pipeline. Whether you're building reports, training machine learning models, or performing analyses, messy data can lead to incorrect conclusions or even break your workflows. While libraries like pandas often take center stage for these tasks, NumPy offers a leaner, more lightweight alternative--and that's exactly what we'll explore in this post.
Here, we'll walk through cleaning a small dataset of transactions using only NumPy, tackling problems like duplicate entries and missing values. By the end, you'll see how far you can get with just NumPy--and when it might make sense to switch to more specialized tools.
Our dataset is a CSV file containing transaction data with the following columns: Transaction ID, Date, Customer Name, and Amount.
Here's an example of the raw data:
Transaction ID,Date,Customer Name,Amount
101,"2024-12-20","Alice",100.5,
102,"","Charlie",200.0,
101,"2024-12-20","Alice",100.5,
,"","Eve",,
Our goals for cleaning this dataset are:
- Fill in missing Transaction IDs with unique placeholder values.
- Remove duplicate transactions.
- Replace missing dates with a placeholder.
- Replace missing amounts with the mean amount.
Even if the transaction ID is missing, we still want to put a number in its place, so the transactions remain easy to work with. The first thought was to use a single value as a placeholder (such as 0), but that would mark all such rows as duplicates of each other. A more robust approach is to generate unique IDs that fall outside the range of normal IDs.
# Find missing Transaction IDs
transaction_ids = raw_data["Transaction_ID"]
missing_indices = np.where(np.isnan(transaction_ids))[0]
# Generate unique placeholders for missing IDs
unique_placeholders = np.arange(-1, -1 - len(missing_indices), -1)
transaction_ids[missing_indices] = unique_placeholders
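As a quick illustration, np.arange(-1, -1 - n, -1) simply counts downward from -1, one value per missing ID, so the placeholders can never collide with the real, positive IDs:
# With two missing IDs:
print(np.arange(-1, -1 - 2, -1))  # [-1 -2]
# With three missing IDs:
print(np.arange(-1, -1 - 3, -1))  # [-1 -2 -3]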
Duplicate rows can lead to inaccurate analyses or calculations. In our dataset, the transaction with Transaction_ID = 101 appears twice. To address this, we use NumPy's np.unique function, which efficiently identifies unique entries based on one or more columns.
Here's the code snippet:
# Remove duplicates by Transaction_ID
unique_indices = np.unique(transaction_ids, return_index=True)[1]
This code finds the indices of the first occurrence of each unique Transaction_ID and returns an array with these indices.
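To see what this looks like on a tiny example:
ids = np.array([101.0, 102.0, 101.0, 103.0])
values, first_idx = np.unique(ids, return_index=True)
print(values)     # [101. 102. 103.]
print(first_idx)  # [0 1 3]
One thing to keep in mind is that np.unique sorts by value, so the de-duplicated rows end up ordered by Transaction_ID (with the negative placeholders first); if the original row order matters, unique_indices can be sorted before indexing.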
The third issue in our dataset is missing dates, represented as empty strings ("").
For simplicity, we'll replace missing dates with a placeholder value ("Unknown"). This ensures downstream processes don't break due to unexpected blanks. While this solution works for our example, in real-world scenarios other options--such as inferring the value or dropping the row--might be more appropriate.
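For instance, dropping the affected rows instead of filling them is a one-liner with boolean indexing on the structured array (shown here only as an alternative sketch, not as part of the final script):
# Alternative: keep only the rows that actually have a date
has_date = raw_data["Date"] != ""
raw_data_with_dates = raw_data[has_date]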
Here's how we fill in the missing dates:
# Replace missing dates (empty strings) with "Unknown"
dates = raw_data["Date"]
dates = np.array(["Unknown" if not date else date for date in dates], dtype=str)
This snippet:
- Extracts the Date column.
- Identifies missing values (empty strings) using if not date.
- Replaces them with "Unknown".

Unlike the other missing values, for the amounts we use a different strategy. Replacing them with 0 or some other arbitrary number would skew the dataset, and replacing them with something like "Placeholder" would make calculations harder. The solution is to replace them with the mean of the existing amounts, which keeps the overall average unchanged and avoids obviously skewing the data.
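np.nanmean is the key here: it computes the mean while ignoring NaN entries (a plain np.mean would just return NaN for a column containing NaN). For example:
vals = np.array([100.5, 200.0, np.nan, 150.0])
print(np.nanmean(vals))  # 150.166..., the NaN is ignored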
The code:
# Replace missing amounts with the mean of the valid amounts
amounts = raw_data["Amount"]
missing_amounts = np.isnan(amounts)
mean_amount = np.nanmean(amounts)
amounts[missing_amounts] = mean_amount
"Amount"
column.NaN
values.NaN
values with the mean.NumPy's simplicity and speed make it an excellent choice for small to medium-sized datasets or when you don't need the full functionality of pandas. However, for more complex datasets or tasks (e.g., handling multiple types of missing data, advanced indexing), pandas or other libraries may be more efficient.
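For comparison, a rough pandas sketch of the same steps might look like this. The column names match the CSV header above, the file names are assumptions, and this is meant as an illustration rather than a drop-in replacement for the NumPy script below:
import pandas as pd

df = pd.read_csv("transactions.csv")
df = df.drop_duplicates(subset="Transaction ID")         # de-duplicate by ID
df["Date"] = df["Date"].fillna("Unknown")                # fill missing dates
df["Amount"] = df["Amount"].fillna(df["Amount"].mean())  # mean-impute amounts
df.to_csv("cleaned_sales_data.csv", index=False)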
Here's the complete script for cleaning the dataset:
import numpy as np
# Simulated raw data
raw_data = np.array(
    [
        (101, "2024-12-20", "Alice", 100.5),
        (102, "", "Charlie", 200.0),
        (101, "2024-12-20", "Alice", 100.5),
        (103, "2024-12-21", "Bob", 150.0),
        (np.nan, "", "Eve", np.nan),  # use np.nan (not None) so the float fields build correctly
        (np.nan, "2024-12-23", "Lucian", 75.0),
    ],
    dtype=[
        ("Transaction_ID", "float"),
        ("Date", "U10"),
        ("Customer_Name", "U50"),
        ("Amount", "float"),
    ],
)
# Handle missing Transaction IDs
transaction_ids = raw_data["Transaction_ID"]
missing_indices = np.where(np.isnan(transaction_ids))[0]
unique_placeholders = np.arange(-1, -1 - len(missing_indices), -1)
transaction_ids[missing_indices] = unique_placeholders
# Remove duplicates by Transaction_ID
unique_indices = np.unique(transaction_ids, return_index=True)[1]
# Replace missing dates with "Unknown"
dates = raw_data["Date"]
dates = np.array(["Unknown" if not date else date for date in dates], dtype=str)
# Replace missing amounts with the mean
amounts = raw_data["Amount"]
missing_amounts = np.isnan(amounts)
mean_amount = np.nanmean(amounts)
amounts[missing_amounts] = mean_amount
# Combine cleaned columns into a final array
cleaned_data = np.array(
    list(zip(
        transaction_ids[unique_indices],
        dates[unique_indices],
        raw_data["Customer_Name"][unique_indices],
        amounts[unique_indices],
    )),
    dtype=[
        ("Transaction_ID", "int"),
        ("Date", "U10"),
        ("Customer_Name", "U50"),
        ("Amount", "float"),
    ],
)
# Save cleaned data back to a CSV
np.savetxt(
    "test_cleaned_sales_data.csv",
    cleaned_data,
    delimiter=",",
    header="Transaction_ID,Date,Customer_Name,Amount",
    fmt=["%d", "%s", "%s", "%.2f"],
    comments=""
)
print("Data cleaned and saved to 'test_cleaned_sales_data.csv'.")
I wrote this post with the help of ChatGPT. You can find the full conversation here.