Cleaning Data

A post for students in a database course

Ever heard the saying, “Garbage in, garbage out”? It’s a fundamental truth in the world of data analytics. Imagine you’re trying to bake a perfect cake, but half your ingredients are spoiled, or you’ve accidentally doubled the sugar in one bowl and forgotten it entirely in another. Your cake won’t turn out right, no matter how skilled a baker you are. The same goes for data. Even the most brilliant analysis tools and techniques won’t give you accurate or useful results if the data you feed them is messy, incomplete, or just plain wrong.

This is where data cleaning comes in. It’s the process of identifying and correcting errors, inconsistencies, and issues in your datasets, making them usable and reliable for analysis. While it might not sound as glamorous as building fancy models or creating stunning visualizations, data cleaning is arguably the most crucial step for any aspiring data analyst, and often takes up the majority of their time. It’s rare to receive a dataset that’s “clean, neat, and ready to use” right off the bat. In this post, we’ll dive into the most common data cleaning challenges you’ll face and explore practical approaches to tackle them, ensuring your analyses are built on a solid foundation.

Too Much of a Good Thing? Managing Duplicate and Redundant Data

Sometimes, having too much data can be just as problematic as having too little. Specifically, we’re talking about duplicate data and redundant data. While they sound similar, they’re different issues that require different solutions.

Duplicate Data: The Unwanted Copies

Duplicate data occurs when an entire data point, typically a row in a spreadsheet, appears multiple times with completely identical values. Imagine a list of employees where one employee, Jill Chane, is listed three times with the exact same ID, department, and years with the company. This might seem harmless, but it can severely skew your analysis.

For instance, suppose the company has four employees and Jill is the only one in IT. With her row appearing three times, IT falsely looks like 50% of the company (3 of 6 rows), when without the duplicates it’s only 25% (1 of 4). Similarly, if you calculate the average “Years With Company,” the duplicated entry will pull the average up or down, depending on how its value compares to the rest of the data. In one example, the average with duplicates was 8.5 years, but without them it was 12.25 years, a difference of more than 30%. This kind of distortion can lead to seriously inaccurate conclusions. Sometimes you might even encounter “partial duplicates,” where only some fields of a row are repeated.

The most straightforward solution for duplicate data is to delete the extra copies, keeping one row for each unique record. It’s a common and effective approach. A crucial note here: ALWAYS work on a copy of your data! Never make changes directly to your original raw data. If you accidentally delete something important from the original, it could be gone forever, and that’s a mistake you definitely want to avoid.
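Here’s a minimal sketch of how that looks in pandas (the DataFrame, column names, and values are hypothetical, chosen to mirror the Jill Chane example):

```python
import pandas as pd

# Hypothetical employee data in which Jill Chane appears three times
employees = pd.DataFrame({
    "employee_id": [101, 101, 101, 102, 103, 104],
    "name": ["Jill Chane", "Jill Chane", "Jill Chane", "Raj Patel", "Ana Silva", "Tom Lee"],
    "department": ["IT", "IT", "IT", "Sales", "HR", "Finance"],
    "years_with_company": [1, 1, 1, 16, 18, 14],
})

# Always work on a copy; never change the original raw data
clean = employees.copy()

print(clean.duplicated().sum())            # count fully identical rows (here: 2 extras)
clean = clean.drop_duplicates()            # keep one copy of each identical row

# For partial duplicates, deduplicate on the column(s) that should be unique
clean = clean.drop_duplicates(subset=["employee_id"])

print(clean["years_with_company"].mean())  # 12.25 instead of the skewed 8.5
```

By default, drop_duplicates() keeps the first occurrence of each repeated row; the subset argument lets you target partial duplicates on a key column such as an employee ID.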

Redundant Data: Different Columns, Same Information

Redundant data is different from duplicate data; it’s when different columns in your dataset convey the exact same information. For example, if you have columns for “Sex” (M/F), “Male” (TRUE/FALSE), and “Female” (TRUE/FALSE), these columns are redundant. Knowing the value in “Sex” (e.g., ‘M’) automatically tells you the values in “Male” (TRUE) and “Female” (FALSE).

While redundant data doesn’t immediately invalidate your data for simple viewing, it becomes a major problem when building statistical models, causing something called multicollinearity. Multicollinearity means that independent variables in your model are highly correlated and can predict each other. This makes your model’s results harder to interpret because you can’t clearly see which variable is truly having an impact. It also means your model might not perform well when applied to new, unseen data. Beyond that, redundant columns simply take up extra space, which can be an issue with very large datasets containing millions or billions of rows.

The safest way to deal with redundant data is to delete the redundant columns. If deleting isn’t an option (perhaps due to permissions or other uses for the columns), you can create a subset of your data that excludes the redundant columns specifically for your statistical modeling. Again, always work on a copy of your data to avoid irreversible mistakes.
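As a rough pandas sketch of both options, assuming hypothetical columns named sex, male, and female:

```python
import pandas as pd

people = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],
    "male": [True, False, False, True],    # redundant: implied by "sex"
    "female": [False, True, True, False],  # redundant: implied by "sex"
    "years_with_company": [1, 16, 18, 14],
})

# Work on a copy so the original stays untouched
clean = people.copy()

# Option 1: drop the redundant columns outright
clean = clean.drop(columns=["male", "female"])

# Option 2: if you can't delete columns, build a non-redundant subset just for modeling
model_subset = people[["sex", "years_with_company"]]
```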

The Holes in Your Data: Dealing with Missing Values

Missing data is a ubiquitous challenge in data analytics. Whether it’s a skipped survey question, a forgotten entry, or a measurement that couldn’t be taken, holes in your dataset happen all the time. The biggest problem? Most analyses simply won’t run if your data contains null values, the empty spaces where data should be but isn’t. From a statistical perspective, missing data reduces analytical power and can even introduce bias, especially if the missingness isn’t random.

There’s no one perfect approach to handling missing data; it’s a debated topic, and companies often have their own policies. Understanding why data is missing can help you choose the best strategy.

Here are the main types of missing data:

  • Missing completely at random (MCAR)—This is the ideal scenario. There’s no detectable connection between the missing values and any other values in your dataset; the data is missing purely by chance, like a random system glitch. MCAR introduces minimal to no inherent bias.
  • Missing at random (MAR)—The missingness has a pattern, but the pattern is linked to other variables that you have recorded in your dataset. For example, if older employees are less likely to fill out a digital survey, the missingness is related to age, which is a recorded variable. MAR is usually manageable: because the pattern is observable, it can be addressed.
  • Missing not at random (MNAR)—This is the most problematic type. The missingness is linked to a variable or factor that was not recorded in your dataset; for example, people with very low or very high salaries may decline to disclose them, and if salary isn’t captured elsewhere, nothing in the data explains the gaps. MNAR is the type most likely to bias your results and requires more investigation.
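Before picking a strategy, it often helps to see how much data is missing and whether the gaps line up with a recorded variable (a hint of MAR rather than MCAR). Here is a small pandas sketch with hypothetical columns:

```python
import pandas as pd
import numpy as np

survey = pd.DataFrame({
    "age_group": ["under 40", "under 40", "40 plus", "40 plus", "40 plus"],
    "satisfaction": [4.0, 5.0, np.nan, 3.0, np.nan],
})

# How many values are missing in each column?
print(survey.isna().sum())

# Does the missingness track a recorded variable? A much higher rate in one
# group suggests MAR rather than MCAR.
print(survey["satisfaction"].isna().groupby(survey["age_group"]).mean())
```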

Now, let’s explore the common methods for dealing with these gaps (a short pandas sketch illustrating several of them follows the list):

  • Deletion—The “Remove It” Approach

Pros: It’s the easiest and most commonly used approach, and it leaves you with a clean dataset.

Cons: Reduces the amount of data available, which can be a problem for small datasets, and can introduce bias if the data isn’t MCAR. Remember the important note: ALWAYS work on a copy of your data.

    • Listwise deletion (or casewise deletion)—If a single value is missing from a row, the entire row is deleted. This is clean and efficient, especially for large MCAR datasets.
    • Pairwise deletion—This is more surgical. A record is left out only of the specific calculations that need its missing value; the rest of its data is still used elsewhere. This makes it popular when you can’t afford to lose much data.
    • Variable deletion—If more than half the values for a variable are missing, it can make sense to delete the entire column, which avoids introducing large amounts of bias when the reasons for the missingness are unknown.
    • Filtering—Similar to deletion, you can use filters to create a subset of your data that excludes missing values, while the original data remains intact.

  • Imputation—The “Fill It In” Approach

Instead of removing data, imputation fills the holes with estimated values.

    • Mean, median, or mode imputation—You replace missing values with the mean, median, or mode of that variable. This is simple and works for MCAR and MAR data, but it does introduce bias: it pulls values toward the middle and artificially normalizes your data.
    • Hot deck imputation—You fill the holes with values drawn at random from elsewhere in your dataset. Because the fill values are real observations rather than averages, this can introduce less bias, though the method has become less popular than it once was.
    • Interpolation—A more advanced way of filling holes: you estimate a specific value for each gap, using the other values in that record (or neighboring records) as a reference. This often involves modeling approaches such as regression, where you find an equation that fits the existing data and then use it to predict the missing values. It can be more accurate than simple imputation, but it can still introduce bias, its accuracy depends on the quality of the data you already have, and the methods can get technical. With spreadsheet or statistical software, though, it can be done quickly.

  • Dealing with MNAR data—None of the above methods can safely account for MNAR data without introducing bias. If you encounter MNAR data, it usually requires more investigation to understand why it’s missing. The most conservative approach is often not to use that variable at all if you can’t figure out the reason for its absence.
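To make these options concrete, here’s the pandas sketch promised above, covering the deletion and imputation approaches (the DataFrame and column names are hypothetical, and none of these steps is safe for MNAR data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "department": ["IT", "Sales", None, "IT", "Finance"],
    "years_with_company": [1.0, 16.0, np.nan, 18.0, 14.0],
    "salary": [52000, np.nan, 61000, np.nan, 70000],
})

work = df.copy()  # never modify the raw data

# Listwise deletion: drop any row that has at least one missing value
listwise = work.dropna()

# Pairwise deletion in practice: corr() uses only rows where both columns are present
pairwise_corr = work[["years_with_company", "salary"]].corr()

# Variable deletion: keep only columns where at most half the values are missing
trimmed = work.loc[:, work.isna().mean() <= 0.5]

# Filtering: analyze a subset without touching the stored data
has_salary = work[work["salary"].notna()]

# Mean / mode imputation
work["years_with_company"] = work["years_with_company"].fillna(work["years_with_company"].mean())
work["department"] = work["department"].fillna(work["department"].mode()[0])

# Interpolation: estimate each gap from surrounding values (best suited to ordered data)
work["salary"] = work["salary"].interpolate(method="linear")
```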

Incorrect and Misformatted Data

Beyond missing or duplicated data, sometimes the data itself is just plain wrong due to typos, formatting issues, or simply being in the wrong place.

Invalid Data: Just Plain Wrong Values

Invalid data happens when values don’t match expected formats or ranges. This includes simple typos, inconsistent capitalization (e.g., “los angeles” vs. “Los Angeles”), abbreviations, or even invisible spaces. To software, “los angeles” and “los angeles ” (with an extra space) are entirely different values.

Problem: The same city can end up counted as six different cities, leading to weird results whenever you group, count, or filter on that column.

Solution: The most important skill here is to identify invalid data. You can do this by generating a list of unique values or a frequency table for a variable. If you find more unique values than expected, you likely have invalid data. Then, you replace the incorrect values with the correct ones. Using drop-down menus for data entry can help reduce these issues from the start.
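In pandas, the identify-then-fix workflow might look like this (the city values are hypothetical):

```python
import pandas as pd

visits = pd.DataFrame({
    "city": ["Los Angeles", "los angeles", "los angeles ", "LOS ANGELES", "LA", "Chicago"],
})

# Identify: a frequency table reveals more "cities" than actually exist
print(visits["city"].value_counts())

# Fix: trim invisible spaces and standardize capitalization
visits["city"] = visits["city"].str.strip().str.title()

# Fix: map abbreviations (and any leftover typos) to the correct value
visits["city"] = visits["city"].replace({"La": "Los Angeles"})

print(visits["city"].unique())  # now only the cities you expect
```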

Specification Mismatch: Data in the Wrong Lane

A specification mismatch occurs when a value has a different data type than the other values in that variable. For example, a column meant for “Cost Per Click” (numeric/currency) might contain the word “Toast.”

Problem: Even a single value with the wrong data type in a variable can cause issues and will typically break any analysis you try to run on that column. A mismatch can also occur when an entire variable is the wrong data type for a given analysis (e.g., trying to average car manufacturers’ names).

Solution: If it’s a single incorrect value, fix or remove it. If the entire variable is the wrong type, you might need to change its data type or choose a different analysis.
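One way to hunt down a stray value of the wrong type in pandas is to coerce the column and see what fails to convert. A small sketch with a hypothetical cost_per_click column:

```python
import pandas as pd

ads = pd.DataFrame({"cost_per_click": ["0.52", "1.10", "Toast", "0.87"]})

# Coerce to numeric; anything that can't be converted becomes NaN
as_numbers = pd.to_numeric(ads["cost_per_click"], errors="coerce")

# Values that were present but failed to convert are the specification mismatches
mismatches = ads[as_numbers.isna() & ads["cost_per_click"].notna()]
print(mismatches)  # the row containing "Toast"

# After fixing or removing the bad value, store the column with the correct type
ads["cost_per_click"] = as_numbers
```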

Data Type Validation: Preventing Mismatches at the Source

Data type validation is the process of checking a variable’s data type, either manually or automatically, to prevent specification mismatches.

The best solution is to require the correct data format during data entry. Think of online forms that tell you your phone number entry is invalid if you type letters. When this isn’t possible, manually checking each variable’s data type is necessary.
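When validation at entry isn’t available, a quick scripted check of each column’s data type can stand in for the manual review. Here is a small pandas sketch, where the expected types are assumptions made for illustration:

```python
import pandas as pd

ads = pd.DataFrame({
    "campaign": ["spring", "summer"],
    "cost_per_click": [0.52, 1.10],
    "clicks": [1200, 950],
})

# Manual check: inspect the type pandas inferred for every column
print(ads.dtypes)

# Lightweight automatic check against the types we expect
expected = {"campaign": "object", "cost_per_click": "float64", "clicks": "int64"}
for column, dtype in expected.items():
    if str(ads[column].dtype) != dtype:
        print(f"Type mismatch in {column}: expected {dtype}, found {ads[column].dtype}")
```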

Bringing It All Together

Data cleaning is the unsung hero of database management. While it can be time-consuming and sometimes frustrating, it’s absolutely essential for producing accurate, reliable, and trustworthy insights. From dealing with unwanted duplicates and redundancies to meticulously handling missing values and correcting incorrect entries, each step ensures your data is primed for meaningful analysis.

Remember, the goal is not to force the data to say what you want, but to shape it so it can accurately tell its story. Mastering these data cleaning techniques will empower you to transform chaotic raw data into a clean, precise foundation, leading to better decisions and more impactful results in your data analytics journey.