Data Cleaning

Difficulty: Beginner · Reading Time: 10 minutes

Why Data Is Never Perfect

Every dataset you encounter in the real world will have problems. Survey respondents skip questions. Sensors malfunction and record impossible values. Spreadsheets get copy-pasted with duplicate rows. Someone types "N/A" in a numeric field. A date appears as "03/04/2025" and you cannot tell whether it means March 4th or April 3rd.

Data cleaning is the process of finding and fixing these problems before you analyze the data. It is not glamorous work, but it is arguably the most important step in any analysis. Experienced data scientists estimate that 60-80% of their time goes into cleaning and preparing data. Skip this step, and your results could be meaningless -- or worse, confidently wrong.

[Chart: time spent in a typical data project: Cleaning 60%, Analysis 20%, Reporting 10%, Collection 10%]

The chart above shows a typical breakdown of time in a data project. Cleaning dominates the workload, which is why learning to do it well pays enormous dividends.

Handling Missing Values

Missing values are the most common data quality issue. A customer leaves the income field blank on a form. A weather station goes offline for a day. A patient misses a follow-up appointment. The question is: what do you do about it?

You have several options, each with trade-offs. You can delete rows with missing values, which is simple but shrinks your dataset and can introduce bias if the missingness is not random (for example, high-income people might skip income questions more often). You can fill in (impute) missing values using the column average, median, or a more sophisticated prediction model. Or you can flag the missing values and include missingness as a separate variable in your analysis.

The right approach depends on how much data is missing and why. If only 2% of rows have missing values and they appear random, deleting those rows is usually fine. If 30% of a column is missing, deletion would throw away too much data and imputation or a different analytical approach is needed.
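The three options above can be sketched with pandas (an assumption: the article names no specific tool, and the small survey table here is invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical survey data: two respondents left "income" blank.
df = pd.DataFrame({
    "age":    [34, 29, 41, 52, 38],
    "income": [52000, np.nan, 61000, np.nan, 58000],
})

# Option 1: delete rows with any missing value (simple, shrinks the data).
dropped = df.dropna()

# Option 2: impute with the column median (robust to extreme incomes).
imputed = df.copy()
imputed["income"] = imputed["income"].fillna(imputed["income"].median())

# Option 3: flag missingness as its own variable before any imputation.
flagged = df.copy()
flagged["income_missing"] = flagged["income"].isna()
```

With 2 of 5 rows missing, option 1 keeps only three rows; option 2 keeps all five but invents two values; option 3 preserves the information that those values were ever missing.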

Detecting and Removing Duplicates

Duplicate records can inflate your results and distort every statistic you compute. If a customer's purchase appears twice in your database, you will overcount revenue and overestimate the number of transactions. Duplicates sneak in through data merges, form re-submissions, system glitches, and manual data entry errors.

Finding duplicates is not always straightforward. Exact duplicates (identical rows) are easy to spot. But what about "John Smith" at "123 Main St" and "Jon Smith" at "123 Main Street"? These are likely the same person with slight variations. Fuzzy matching techniques can help identify near-duplicates, but they require judgment calls about how similar two records need to be before merging them.
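A minimal sketch of both cases, assuming pandas for exact duplicates and the standard library's difflib as one crude fuzzy-matching signal (the names and records are invented):

```python
import pandas as pd
from difflib import SequenceMatcher

# Hypothetical purchase records: the third row is an exact duplicate.
df = pd.DataFrame({
    "customer": ["John Smith", "Ana Lopez", "John Smith"],
    "amount":   [40.0, 25.0, 40.0],
})

# Exact duplicates: count them, then keep only the first occurrence.
n_dupes = df.duplicated().sum()
deduped = df.drop_duplicates()

# Near-duplicates need fuzzy matching; a string-similarity ratio is
# one rough signal, but the merge threshold is a judgment call.
ratio = SequenceMatcher(None, "John Smith", "Jon Smith").ratio()
```

A high ratio (here above 0.9) suggests the two spellings refer to the same person, but production entity resolution usually combines several fields (name, address, email) before merging records.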

[Chart: 10,500 rows before cleaning, 420 duplicates removed, 10,080 rows after cleaning]

Dealing with Outliers

An outlier is a value that is dramatically different from the rest of the data. A salary of $5,000,000 in a dataset of mid-level employees. A temperature reading of -40 degrees in Miami in July. A website session lasting 72 hours.

The critical question with outliers is whether they are legitimate or errors. A CEO's salary of $5 million is real -- it just represents a different population. A temperature of -40 in Miami is almost certainly a sensor error. A 72-hour web session might be someone who left a tab open.

For errors, correction or removal is appropriate. For legitimate outliers, you have options: keep them and use robust statistics (like the median instead of the mean), cap them at a reasonable threshold (called winsorizing), or analyze them separately. Never automatically delete outliers just because they are unusual -- always investigate why they exist first.
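The ideas above, sketched with pandas (an assumption; the salary figures are invented): the common 1.5 × IQR rule flags outliers, the median stays robust where the mean does not, and clipping at the upper fence is a simple form of winsorizing.

```python
import pandas as pd

# Hypothetical salaries: five mid-level employees plus one CEO.
s = pd.Series([55000, 60000, 58000, 62000, 57000, 5000000])

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Robust statistic: one extreme value drags the mean far above the median.
mean, median = s.mean(), s.median()

# Winsorize: cap at the upper fence instead of deleting the row.
capped = s.clip(upper=q3 + 1.5 * iqr)
```

Note that the code only flags and caps; whether the $5,000,000 value is an error or a legitimate member of a different population is still a human decision.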

Data Type Issues

Computers treat numbers, text, dates, and categories differently. If a numeric column accidentally contains text values (like "N/A" or "TBD"), calculations will fail or produce errors. If dates are stored inconsistently (some as "2025-03-15" and others as "15/03/2025"), sorting and filtering will break. If a category is spelled differently in different rows ("USA", "U.S.A.", "United States"), the software will treat them as three separate groups.

Data type cleaning involves standardizing formats, converting strings to numbers where appropriate, parsing dates into a consistent format, and harmonizing category labels. It is tedious but essential. A single rogue text value in a numeric column can cause an entire analysis pipeline to crash.
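Each of these fixes is a one-liner in pandas (assumed tooling; the column values are invented to mirror the problems described above):

```python
import pandas as pd

# Hypothetical messy columns: text in a numeric field, and one
# category spelled three different ways.
df = pd.DataFrame({
    "revenue": ["1200", "N/A", "950", "TBD"],
    "country": ["USA", "U.S.A.", "United States", "USA"],
    "date":    ["2025-03-15", "2025-03-16", "2025-03-17", "2025-03-18"],
})

# Coerce non-numeric text to NaN instead of crashing the pipeline.
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Parse dates into one consistent datetime type.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

# Harmonize category spellings with an explicit mapping.
df["country"] = df["country"].replace({"U.S.A.": "USA",
                                       "United States": "USA"})
```

The `errors="coerce"` choice turns rogue text like "N/A" into missing values, which then flow into the missing-value handling discussed earlier rather than silently breaking arithmetic.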

Common Pitfalls

Even experienced analysts make mistakes during data cleaning. One common pitfall is cleaning the original data without saving a backup. Always work on a copy. Another is over-cleaning -- removing so many rows and values that the remaining data is no longer representative. A third is cleaning inconsistently -- applying different rules to different parts of the dataset without documenting why.

The best defense is to keep a cleaning log: a record of every change you made, why you made it, and how many records were affected. This makes your work reproducible and auditable. If someone questions your results, you can point to the log and show exactly what happened to the raw data before analysis began.
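A cleaning log can be as simple as a list of dictionaries, one entry per step. The helper below is a hypothetical sketch (the function name, fields, and sample data are invented), showing how each step records what changed and how many rows were affected:

```python
import pandas as pd

# Minimal cleaning log: one audit entry per cleaning step.
log = []

def log_step(df_before, df_after, action, reason):
    # Hypothetical helper: record the step, the reason, and row counts.
    log.append({
        "action": action,
        "reason": reason,
        "rows_before": len(df_before),
        "rows_after": len(df_after),
    })
    return df_after

raw = pd.DataFrame({"id": [1, 2, 2, 3], "value": [10, 20, 20, None]})
df = raw.copy()  # always work on a copy, never the original
df = log_step(df, df.drop_duplicates(), "drop_duplicates",
              "exact duplicate rows from a double form submission")
df = log_step(df, df.dropna(), "dropna",
              "missing 'value' with no basis for imputation")
```

After the run, `raw` is untouched, `df` holds the cleaned data, and `log` is an auditable record you can print, save, or hand to a reviewer.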

Key Takeaway

Data cleaning is the unglamorous foundation that every reliable analysis is built on. Handle missing values thoughtfully, remove duplicates carefully, investigate outliers before deleting them, and standardize data types consistently. Always work on a copy, document every change, and remember: the time you invest in cleaning your data will save you from drawing conclusions that the data does not actually support.