Cleaning Data with Numpy: A Simple Yet Powerful Approach

23 Decembrie 2024

Data cleaning is an essential part of any data processing pipeline. Whether you're building reports, training machine learning models, or performing analyses, messy data can lead to incorrect conclusions or even break your workflows. While libraries like pandas often take center stage for these tasks, NumPy offers a leaner, more lightweight alternative for data cleaning tasks--and that's exactly what we'll explore in this post.

Here, we'll walk through cleaning a small dataset of transactions using only NumPy and tackling problems like duplicate entries and missing values. By the end, you'll see how far you can get with just NumPy--and when it might make sense to switch to more specialized tools.

The Dataset

Our dataset is a CSV file containing transaction data with the following columns:

Here's an example of the raw data:

Transaction ID,Date,Customer Name,Amount
101,"2024-12-20","Alice",100.5,
102,"","Charlie",200.0,
101,"2024-12-20","Alice",100.5,
,"","Eve",,

The Problem

Our goals for cleaning this dataset are:

  1. Handle missing transaction IDs: Replace them with suitable values.
  2. Remove duplicate rows: Transactions with the same Transaction_ID should appear only once.
  3. Handle missing dates: Replace missing values with a placeholder ("Unknown").
  4. Handle missing amounts: Replace missing values with the mean of the dataset, as it doesn't introduce any bias.

Step 1: Handling Missing Transaction IDs

Even if the transaction ID is missing, we still want to put a number in place, so as to make it easier to work with the transactions. The first thought was to use just one value as a placeholder(0), but that would mark all rows as duplicate. A more robust approach is to generate unique IDs that are outside of the range of normal IDs.

# Find missing Transaction IDs
transaction_ids = raw_data["Transaction_ID"]
missing_indices = np.where(np.isnan(transaction_ids))[0]

# Generate unique placeholders for missing IDs
unique_placeholders = np.arange(-1, -1 - len(missing_indices), -1)
transaction_ids[missing_indices] = unique_placeholders

Step 2: Removing Duplicate Rows

Duplicate rows can lead to inaccurate analyses or calculations. In our dataset, we notice that the transaction with Transaction_ID = 103 is duplicated. To address this, we use NumPy's np.unique function, which efficiently identifies unique entries based on one or more columns.

Here's the code snippet:

# Remove duplicates by Transaction_ID
unique_indices = np.unique(transaction_ids, return_index=True)[1]

This code finds the indices of the first occurence of each unique Transaction_ID and returns an array with these indices.

Step 3: Handling Missing Dates

The third issue in our dataset is missing dates, represented as empty strings ("").

For simplicity, we'll replace missing dates with a placeholder value ("Unknown"). This ensures downstream processes don't break due to unexpected blanks. While this solution works for our example, in real-world scenarios other options--such as inferring the value or dropping the row--might be more appropriate.

Here's how we fill in the missing dates:

# Replace missing dates (empty strings) with "Unknown"
dates = raw_data["Date"]
dates = np.array(["Unknown" if not date else date for date in dates], dtype=str)

This snippet:

  1. Iterates over the Date column.
  2. Checks for missing values (empty strings or None) using if not date.
  3. Replaces missing values with "Unknown".

Step 4: Handling Missing Amounts

Unlike the other missing values, with amounts we use a different strategy. Replacing them with 0 or some other numerical value will skew the dataset. Replacing them with something like "Placeholder" would make calculations harder. The solution is to replace them with the mean of the entire dataset. This ensures no bias is introduced.

The code:

amounts = raw_data["Amount"]
missing_amounts = np.isnan(amounts)
mean_amount = np.nanmean(amounts)
amounts[missing_amounts] = mean_amount
  1. Gets values in the "Amount" column.
  2. Calculates the mean while ignoring NaN values.
  3. Replaces the NaN values with the mean.

Reflections and Takeaways

Why NumPy?

NumPy's simplicity and speed make it an excellent choice for small to medium-sized datasets or when you don't need the full functionality of pandas. However, for more complex datasets or tasks (e.g., handling multiple types of missing data, advanced indexing), pandas or other libraries may be more efficient.

Lessons Learned

  1. Always consider the context of your data when cleaning. The choice of how to handle missing values, for example, depends on how the data will be used.
  2. Even for small tasks, modularizing your code (e.g., writing functions for each column's cleaning logic) can make it easier to extend and debug later.

Final Script

Here's the complete script for cleaning the dataset:

import numpy as np

# Simulated raw data
raw_data = np.array(
    [
        (101, "2024-12-20", "Alice", 100.5),
        (102, "", "Charlie", 200.0),
        (101, "2024-12-20", "Alice", 100.5),
        (103, "2024-12-21", "Bob", 150.0),
        (None, "", "Eve", None),
        (None, "2024-12-23", "Lucian", 75.0)
    ],
    dtype=[
        ("Transaction_ID", "float"),
        ("Date", "U10"),
        ("Customer_Name", "U50"),
        ("Amount", "float"),
    ],
)

# Handle missing Transaction IDs
transaction_ids = raw_data["Transaction_ID"]
missing_indices = np.where(np.isnan(transaction_ids))[0]
unique_placeholders = np.arange(-1, -1 - len(missing_indices), -1)
transaction_ids[missing_indices] = unique_placeholders

# Remove duplicates by Transaction_ID
unique_indices = np.unique(transaction_ids, return_index=True)[1]

# Replace missing dates with "Unknown"
dates = raw_data["Date"]
dates = np.array(["Unknown" if not date else date for date in dates], dtype=str)

# Replace missing amounts with the mean
amounts = raw_data["Amount"]
missing_amounts = np.isnan(amounts)
mean_amount = np.nanmean(amounts)
amounts[missing_amounts] = mean_amount

# Combine cleaned columns into a final array
cleaned_data = np.array(
    list(zip(
        transaction_ids[unique_indices],
        dates[unique_indices],
        raw_data["Customer_Name"][unique_indices],
        amounts[unique_indices]
    dtype=[
        ("Transaction_ID", "int"),
        ("Date", "U10"),
        ("Customer_Name", "U50"),
        ("Amount", "float"),
    ],
)

# Save cleaned data back to a CSV
np.savetxt(
    "test_cleaned_sales_data.csv",
    cleaned_data,
    delimiter=",",
    header="Transaction_ID,Date,Customer_Name,Amount",
    fmt=["%d", "%s", "%s", "%.2f"],
    comments=""
)

print("Data cleaned and saved to 'test_cleaned_sales_data.csv'.")

Disclaimer

I wrote this post with the help of ChatGPT. You can find the full conversation here.