Skip to Main Content
UNLV Logo
questions, ask us

Getting Started with Data Cleaning and OpenRefine

This guide is meant to introduce readers to the importance of data cleaning through a useful tool for working with "messy" data, OpenRefine.

Common Transformations

By simply viewing your data, you might be able to identify some elements of your data that should be transformed. However, not all necessary transformations are obvious from the human eye. In this section, we will discuss some common data transformations (both visible and not) that are important to consider when cleaning and formatting your data.

Trimming Leading and Trailing Whitespace

While not always visible to the human eye, trimming leading and trailing whitespace is incredibly important when it comes to the eventual analysis of your data. Oftentimes, especially with human-input data, there are invisible blank spaces existing within your data. Frequently, this occurs when someone pressed a space bar accidentally or when copying and pasting information from one place to another. These blank spaces are called whitespace and can drastically impact your analysis. Whitespace can occur either before (leading) or after (trailing) a statement.

For instance, let's take the word "Makerspace." With whitespace, this word could appear as
" Makerspace"
" Makerspace " or
"Makerspace "

These differences can be subtle and hard to spot when looking at large amounts of data. If the whitespace is left in, when you go to analyze your data, the word will now code in 4 different ways, each for the different locations of whitespace. 

To counteract this, OpenRefine possesses a transformation process called "trimming leading and trailing whitespace." To access this, you must go column by column and select the dropdown arrow next to the heading (pictured below).

Once the transformation is complete, OpenRefine will display a message at the top of your screen detailing how many rows in each column were affected by this change.

Changing Case

Another important transformation is altering the case of your textual information (i.e. to uppercase, lowercase, or titlecase). Similar to whitespace, when analyzing data, different cases will cause the same word to code separately. To use the example before,

"Makerspace"
"makerspace" and
"MAKERSPACE"

will all code as separate entities despite potentially describing the same thing. As you would deal with whitespace, you would deal with transforming data cases. By selecting the dropdown arrow next to the column heading, navigating to "edit cells," and then to "common transformations," you are able to select the transformation best for your purposes.

Changing Data Type

Another common transformation is changing the data type of the data represented in each column. In OpenRefine, this allows you to perform different operations on different columns. For instance, by transforming a column containing numbers to read as numbers, you would be able to perform mathematical operations and create graphs.

To change the data type of a column, you would navigate to the dropdown arrow next to the column header. From there, you would select "edit cells," "common transformations," and then select one of several options: to number, to date, to text, to null, or to empty string. In addition to a popup detailing how many cells were altered, changing a data type to number or date will cause the data in the selected column to become GREEN (below).

© University of Nevada Las Vegas