Mastering Data Cleaning: Techniques and Real-World Use Case

Data cleaning, also called data cleansing or data scrubbing, is an essential part of data analysis. It involves finding and fixing errors or inconsistencies in raw data to ensure it is accurate, complete, consistent, and usable. Clean data results in better analysis, improved model performance, cost efficiency, and compliance. In this blog post, we'll explore different data cleaning techniques and apply them to a synthetic East African e-commerce dataset using Python.
Introduction to Data Cleaning
Imagine you're a detective trying to solve a mystery. Some of your clues are misleading, while others are scattered or missing. Your job is to piece together the truth from these imperfect clues. That detective work is a lot like data cleaning: it ensures your data is accurate and dependable, helping you reach well-informed conclusions and decisions.
What Makes Data Cleaning Crucial?
Better Analysis: Clean data ensures that your analyses are accurate and reliable.
Improved Model Performance: Machine learning models perform better with clean data.
Cost Efficiency: Poor data quality can lead to costly mistakes and inefficiencies.
Compliance: Many industries have regulations requiring accurate and complete data.
Our Dataset
The following columns are included in the synthetic East African e-commerce dataset: CustomerID, FirstName, LastName, Email, RegistrationDate, Country, City, TotalSpent, Currency, Age, PhoneNumber, and LastLoginDate.
Here we load the dataset:
import pandas as pd
file_path = "eastafrican-e-commerce-dataset.csv"
df = pd.read_csv(file_path, on_bad_lines="skip", engine="python")
df.head()
Dataset Preview:
| CustomerID | FirstName | LastName | Email | RegistrationDate | Country | City | TotalSpent | Currency | Age | PhoneNumber | LastLoginDate |
|------------|-----------|----------|------------------------|------------------|---------|--------------|------------|----------|-----|---------------|---------------|
| CUST001 | David | Okello | david.okello@email.com | 15/01/2022 | Uganda | Kampala | 150000 | UGX | 34 | 256772123456 | 20/08/2025 |
| CUST002 | Esther | Wanjiku | esther.w@email.com | 2022-02-20 | Kenya | Nairobi | 5500 | KES | 28 | 712345678 | 21/08/2025 |
| CUST004 | Fatuma | Juma | fatuma.juma@work.net | 10/04/2022 | Tanzania| Dar es Salaam| 80000 | TZS | 31 | 255755123456 | 15/07/2025 |
| CUST005 | DAVID | OKELLO | david.okello@email.com | 12/05/2022 | UG | Kla | 75000 | Shs | 34 | 7721235678 | NaN |
| CUST001 | David | Okello | david.okello@email.com | 15/01/2022 | Uganda | Kampala | 150000 | UGX | 34 | 256772123456 | 20/08/2025 |
Data Cleaning Techniques
1. Standardize Dates
Having records stored in multiple date formats, such as 08/12/2023 (month/day/year) and 12/08/2023 (day/month/year), can cause inconsistencies. Standardizing dates ensures uniformity across the dataset and simplifies time-based analysis. The code below also automatically converts the field into a datetime data type.
df['RegistrationDate'] = pd.to_datetime(df['RegistrationDate'], dayfirst=True, errors='coerce')
df['LastLoginDate'] = pd.to_datetime(df['LastLoginDate'], dayfirst=True, errors='coerce')
df.head()[['RegistrationDate','LastLoginDate']]
Before:
| RegistrationDate | LastLoginDate |
|------------------|---------------|
| 15/01/2022 | 20/08/2025 |
| 2022-02-20 | 21/08/2025 |
| 12/05/2022 | NaN |
After:
| RegistrationDate | LastLoginDate |
|------------------|---------------|
| 2022-01-15 | 2025-08-20 |
| NaT | 2025-08-21 |
| 2022-05-12 | NaT |
Note that with genuinely mixed formats, some valid values (such as the ISO-style 2022-02-20 above) can be coerced to NaT; in recent pandas versions you can pass format='mixed' or parse each format separately and combine the results. Once parsed, dates can be compared, sorted, and used in time-based calculations.
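As a quick illustration (a minimal sketch using the parsed columns above; DaysActive is an illustrative new column, not part of the original dataset), we can now do arithmetic on the dates, for example measuring how long each customer stayed active after registering:
# Days between registration and last login (only meaningful where both dates are present)
df['DaysActive'] = (df['LastLoginDate'] - df['RegistrationDate']).dt.days
# Sort customers by registration date, oldest first
df.sort_values('RegistrationDate')[['CustomerID', 'RegistrationDate', 'DaysActive']].head()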
2. Standardize Phone Numbers
Phone numbers in a dataset may appear with spaces, dashes, or even in scientific notation. Standardizing them ensures a consistent format, making it easier to validate and contact customers.
df['PhoneNumber'] = df['PhoneNumber'].astype(str).str.replace(r'\D', '', regex=True)
# Prepend '0' to local numbers that start with '7'; leave others unchanged
df['PhoneNumber'] = df['PhoneNumber'].apply(lambda x: '0'+x if x.startswith('7') else x)
df[['PhoneNumber']].head()
Result
2567721234560
07123456780
2557551234560
07721234560
2567721234560
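The validation mentioned above can build directly on the cleaned numbers. Here is a minimal sketch, assuming the East African country codes 256, 254, 255, and 250; the looks_valid helper and PhoneLooksValid column are illustrative additions rather than part of the original post:
# Flag numbers that neither start with a known country code nor look like a local '0...' number
known_codes = ('256', '254', '255', '250')
def looks_valid(number: str) -> bool:
    # A real validator would also check the digit count expected for each country
    return number.startswith(known_codes) or number.startswith('0')
df['PhoneLooksValid'] = df['PhoneNumber'].apply(looks_valid)
print(df.loc[~df['PhoneLooksValid'], ['CustomerID', 'PhoneNumber']])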
3. Standardize Text Fields (Country, City, Names)
Inconsistent text (e.g., "ug" vs. "Uganda" or "kla" vs. "Kampala") can lead to errors in grouping and analysis. We can standardize these fields using manual mapping and fuzzy matching.
Step 1: Discover Unique Values
print("Unique values in 'Country' column:")
print(df['Country'].unique())
print("\nUnique values in 'City' column:")
print(df['City'].unique())
Output:
Unique values in 'Country' column:
['uganda' 'kenya' 'tanzania' 'ug' 'ke' 'rwanda']
Unique values in 'City' column:
['kampala' 'nairobi' 'dar es salaam' 'kla' 'entebbe' 'niarobi' 'kmapala' 'nrb' 'mbarara' 'kigali' 'gulu' 'kisumu' 'arusha' 'eldoret' 'mombasa' 'jinja']
Step 2: Manual Mapping
country_mapping = {'ug': 'Uganda', 'ke': 'Kenya', 'UGANDA': 'Uganda'}
city_mapping = {'kla': 'Kampala', 'nrb': 'Nairobi', 'niarobi': 'Nairobi', 'kmapala': 'Kampala'}
df['Country'] = df['Country'].replace(country_mapping)
df['City'] = df['City'].replace(city_mapping)
Step 3: Standardize Casing
df['Country'] = df['Country'].str.title()
df['City'] = df['City'].str.title()
df['FirstName'] = df['FirstName'].str.title()
df['LastName'] = df['LastName'].str.title()
Result
Unique values in 'Country' column:
['Uganda' 'Kenya' 'Tanzania' 'Rwanda']
Unique values in 'City' column:
['Kampala' 'Nairobi' 'Dar Es Salaam' 'Entebbe' 'Mbarara' 'Kigali' 'Gulu' 'Kisumu' 'Arusha' 'Eldoret' 'Mombasa' 'Jinja']
This ensures consistency in text fields, making data easier to analyze and segment.
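A quick check (using only the cleaned columns above) shows the benefit: grouping no longer splits 'ug' and 'uganda' into separate buckets:
# Customers per country after standardization
print(df['Country'].value_counts())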
4. Fuzzy Matching (Fuzzy Finder)
Fuzzy matching helps correct misspellings and abbreviations in larger datasets with unanticipated variations. Much like autocorrect suggesting the intended word, it finds the closest match in a list of valid values, even when the input is slightly misspelled or shortened.
How it works:
Canonical list: Create a list of correct names.
Similarity scoring: Compare each entry with the canonical list using measures like Levenshtein distance.
Pick the best match: Choose the one with the highest score.
Threshold decision: Auto-correct if confidence is high, flag for review if medium, leave as-is if low.
Example:
import pandas as pd
from fuzzywuzzy import process
# Work on a copy so the original DataFrame stays untouched
df_fuzzy = df.copy()
# Standardize casing first to improve matching accuracy
df_fuzzy['City'] = df_fuzzy['City'].str.title().str.strip()
# A list of our known, correct city names for matching
correct_cities = ['Kampala', 'Nairobi', 'Kigali', 'Dar Es Salaam', 'Entebbe']
print("--- Step 1: Unique Cities BEFORE Fuzzy Matching ---")
print(df_fuzzy['City'].unique())
def fuzzy_match_and_correct(city_name, correct_list):
    """
    Finds the best fuzzy match for a city name from a list of correct names.
    Returns the corrected city name if the match score is high (>= 80).
    """
    if pd.isna(city_name):
        return city_name
    best_match = process.extractOne(city_name, correct_list)
    # Print the matching process to demonstrate the logic
    print(f"Processing '{city_name}': Best match is '{best_match[0]}' with score {best_match[1]}")
    if best_match[1] >= 80:  # Check if the score is 80 or higher
        return best_match[0]  # Return the best match name
    return city_name  # Return original name if no good match is found
# Step 2: Apply the correction to every city value
df_fuzzy['City'] = df_fuzzy['City'].apply(lambda c: fuzzy_match_and_correct(c, correct_cities))
print("--- Step 3: Unique Cities AFTER Fuzzy Matching ---")
print(df_fuzzy['City'].unique())
Result
--- Step 1: Unique Cities BEFORE Fuzzy Matching ---
['Kampala' 'Nairobi' 'Dar Es Salaam' 'Kla' 'Entebbe' 'Niarobi' 'Kmapala'
'Nrb' 'Mbarara' 'Kigali' 'Gulu' 'Kisumu' 'Arusha' 'Eldoret' 'Mombasa'
'Jinja']
Processing 'Kampala': Best match is 'Kampala' with score 100
Processing 'Nairobi': Best match is 'Nairobi' with score 100
Processing 'Dar Es Salaam': Best match is 'Dar Es Salaam' with score 100
Processing 'Kla': Best match is 'Kampala' with score 60
Processing 'Niarobi': Best match is 'Nairobi' with score 86
Processing 'Kmapala': Best match is 'Kampala' with score 86
--- Step 3: Unique Cities AFTER Fuzzy Matching ---
array(['Kampala', 'Nairobi', 'Dar Es Salaam', 'Kla', 'Entebbe', 'Nrb',
'Mbarara', 'Kigali', 'Gulu', 'Kisumu', 'Arusha', 'Eldoret',
'Mombasa', 'Jinja'], dtype=object)
This automatically corrects many typos and abbreviations safely.
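The "threshold decision" step described earlier also suggests flagging medium-confidence matches for manual review rather than changing them silently. Below is a minimal sketch of that idea, building on the df_fuzzy and correct_cities objects from the example above; the match_with_review helper and the lower cut-off of 60 are illustrative additions:
import pandas as pd
from fuzzywuzzy import process
needs_review = []
def match_with_review(city_name, correct_list, low=60, high=80):
    """Auto-correct confident matches and flag medium-confidence ones for review."""
    if pd.isna(city_name):
        return city_name
    match, score = process.extractOne(city_name, correct_list)
    if score >= high:
        return match                                    # confident: auto-correct
    if score >= low:
        needs_review.append((city_name, match, score))  # medium: flag for review
    return city_name                                    # low: leave as-is
df_fuzzy['City'] = df_fuzzy['City'].apply(lambda c: match_with_review(c, correct_cities))
print(needs_review)  # e.g. ('Kla', 'Kampala', 60) would appear here for manual review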
5. Deduplication
Duplicate records have the potential to skew analysis and produce incorrect conclusions. By ensuring that each record is only represented once, deduplication produces a dataset that is more accurate and trustworthy.
df = df.drop_duplicates()
Alternatively, aggregate records that share the same CustomerID:
agg = df.sort_values('LastLoginDate').groupby('CustomerID').agg({
'FirstName':'first', 'LastName':'first', 'Email':'first', 'RegistrationDate':'first',
'Country':'first', 'City':'first', 'TotalSpent':'sum', 'Currency':'first',
'Age':'first', 'PhoneNumber':'first', 'LastLoginDate':'max'
}).reset_index()
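Another common middle ground (a sketch, not from the original code) keeps only the most recent record per customer instead of aggregating:
# Keep the most recently seen row for each CustomerID
df = (df.sort_values('LastLoginDate')
        .drop_duplicates(subset='CustomerID', keep='last')
        .reset_index(drop=True))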
6. Handling Missing Values: Choosing the Right Strategy
Missing data is like having a puzzle with missing pieces. The way we handle missing information is critical, as it can significantly impact the reliability of our analysis and the performance of our models. Before we fill in or remove data, it's crucial to understand why it might be missing.
There are three main types of missing data:
Missing Completely at Random (MCAR): The missingness is independent of the data, both observed and unobserved.
Example: A temporary server glitch during a software update caused the TotalSpent field to be randomly lost for a handful of customers.
Recommended Strategy: Since the remaining data is still a random sample, you can safely drop the rows with missing values if the number of missing records is small; this avoids introducing bias. You could also use simple imputation (e.g., filling with the mean or median) without a high risk of skewing the data.
Missing at Random (MAR): The missingness is systematically related to the observed data but not to the unobserved data.
Example: In our data, Email addresses are sometimes missing. This could be because customers with a high TotalSpent are more likely to have provided a valid email during checkout, while low-spending customers might not.
Recommended Strategy: Dropping rows for MAR data can introduce bias. A better approach is imputation with a method that leverages the variables we do have. For example, you could group customers by TotalSpent and then impute the missing Email values within each group based on the most common email domain, or use a more advanced imputation model for a missing numeric field (see the group-wise sketch after the strategy list below).
Missing Not at Random (MNAR): The most problematic type, where the missingness is systematically related to the unobserved data itself.
Example: We see many missing values in the LastLoginDate column. This could be because less engaged customers (the unobserved behavior) are the ones with no recent login to record.
Recommended Strategy: This is the most difficult type to handle. Simple imputation or dropping rows will likely introduce significant bias. A simple and effective strategy is to create a new column that flags the missing values, telling your model that the missingness itself holds valuable information. For example, a model predicting customer churn might learn that a missing LastLoginDate is a strong signal of a disengaged customer.
Common Strategies with Code Examples:
- Drop Rows with Critical Missing Fields: This strategy is most appropriate for MCAR or when the missing data percentage is very low.
# Drop rows where the 'Email' is missing
df.dropna(subset=['Email'], inplace=True)
- Impute (Fill in) Missing Values: Best used for MCAR data or MAR data where you have a clear imputation strategy based on observed variables.
# Fill missing 'Age' values with the median age of the dataset
df['Age'] = df['Age'].fillna(df['Age'].median())
- Flag Missing Data: A robust and simple strategy, particularly useful for MNAR data where the fact that data is missing is informative.
# Create a new column to flag missing 'Age' values
df['Age_missing'] = df['Age'].isna().astype(int)
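As promised above, here is a minimal sketch of group-wise imputation for MAR data. Grouping Age by Country is an illustrative assumption; the same pattern applies to grouping by TotalSpent as described earlier:
# Fill missing 'Age' with the median age of customers from the same country
df['Age'] = df['Age'].fillna(df.groupby('Country')['Age'].transform('median'))
# Fall back to the overall median for any group that is entirely missing
df['Age'] = df['Age'].fillna(df['Age'].median())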
7. Addressing Outliers
Outliers are data points that deviate significantly from the rest of the data. In our dataset, a customer with an age of 150 years is a perfect example. These extreme values can skew our analysis, distort averages, and negatively impact the performance of predictive models.
Before we decide how to handle an outlier, we must first understand its nature:
Data Entry Error: A typo, like typing 150 instead of 50.
Measurement Error: A faulty sensor or system glitch records an incorrect value.
True Anomaly: A legitimate but extremely rare event, such as a customer with exceptionally high spending.
Identifying Outliers
We can identify outliers using visual methods like box plots or scatter plots, or statistical methods like the Interquartile Range (IQR).
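For the visual route, a quick box plot (a minimal sketch using pandas' built-in matplotlib plotting) is often enough to make extreme values stand out:
import matplotlib.pyplot as plt
# A box plot makes extreme TotalSpent values visible at a glance
df['TotalSpent'].plot(kind='box')
plt.title('Distribution of TotalSpent')
plt.show()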
Manual Inspection
One way to identify outliers is by manually checking for values beyond a given range. For example, we can check for ages beyond 120 years:
result = df[df['Age'] >= 120][['FirstName', 'LastName', 'Age']].copy()
print(result)
Output:
FirstName LastName Age
7 bob kato 150
12 grace mutesi 34333031216
40 eric sentamu 250
Statistical Methods: Interquartile Range (IQR)
The IQR method calculates the range between the first quartile (Q1) and the third quartile (Q3) and flags any values that fall outside the range Q1 - 1.5 × IQR to Q3 + 1.5 × IQR.
Here is code that finds outliers in the TotalSpent column using the IQR method:
Q1 = df['TotalSpent'].quantile(0.25)
Q3 = df['TotalSpent'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['TotalSpent'] < Q1 - 1.5 * IQR) | (df['TotalSpent'] > Q3 + 1.5 * IQR)]
print(outliers)
Explanation:
Q1: The first quartile (25th percentile).
Q3: The third quartile (75th percentile).
IQR: The interquartile range (Q3 - Q1).
Outliers: Values that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
Decision and Strategy for Handling Outliers
Once outliers are identified, we must decide how to handle them based on their context and relevance.
Strategy 1: Adjust or Correct Outliers
If we are confident that an outlier is due to a data entry error, we can correct it. For example, a negative TotalSpent value (-5000) is impossible and can be corrected by taking the absolute value. Similarly, impossible age values can be set to missing and handled later with the missing-value strategies above.
Code Example:
# Correct negative TotalSpent outliers by taking the absolute value
df['TotalSpent'] = df['TotalSpent'].abs()
# Handle impossible age outliers by setting them to missing
df.loc[df['Age'] > 100, 'Age'] = None
Strategy 2: Remove Outliers
If an outlier is a clear error and cannot be corrected, or if its presence would severely skew the analysis, it might be best to remove the entire row containing the outlier.
Code Example:
# Remove rows with outliers in TotalSpent
df = df[~((df['TotalSpent'] < (Q1 - 1.5 * IQR)) | (df['TotalSpent'] > (Q3 + 1.5 * IQR)))]
Strategy 3: Keep Outliers
If the outlier represents a genuine and important anomaly, it should be kept. For example, in a fraud detection model, an outlier transaction might be a key indicator of fraudulent activity. Removing it would be a mistake. In such cases, it's often best to handle outliers using robust statistical methods (like median instead of mean) that are less sensitive to extreme values.
Code Example:
# Use median instead of mean to reduce the impact of outliers
median_spent = df['TotalSpent'].median()
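To see why the median is the more robust summary here (a quick illustration, not from the original post), compare it with the mean, which a single extreme value can drag upward:
# The mean is pulled toward extreme TotalSpent values; the median barely moves
print("Mean spend:  ", df['TotalSpent'].mean())
print("Median spend:", df['TotalSpent'].median())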
Link to the GitHub Repo
https://github.com/katendejericho5/Documents/tree/main/datacleaning
Conclusion:
Data cleaning isn't just a technical task; it's the foundation of every successful data project. By standardizing formats, correcting inconsistencies, and handling missing values, you transform messy, unreliable data into a trustworthy asset. This critical step ensures your analyses are accurate, your machine learning models perform better, and your business decisions are based on the truth. Don't see data cleaning as a chore, but as an investment. The time you spend on it now will pay off with more reliable insights and smarter outcomes. A clean dataset isn't just a goal—it's a requirement for unlocking the true potential of your data.
References:
pandas Development Team. (n.d.). pandas Documentation. Retrieved from https://pandas.pydata.org/docs/
Wikipedia. (n.d.). Levenshtein Distance. Retrieved from https://en.wikipedia.org/wiki/Levenshtein_distance
IBM. (n.d.). Data Preparation and Cleaning. Retrieved from https://www.ibm.com/cloud/learn/data-preparation
Data Science Central. (n.d.). Handling Outliers in Data. Retrieved from https://www.datasciencecentral.com/profiles/blogs/handling-outliers-in-data
Found this useful? Don’t forget to like and share!


