Skip to Main Content
UNLV Logo
questions, ask us

Getting Started with Data Cleaning and OpenRefine

This guide is meant to introduce readers to the importance of data cleaning through a useful tool for working with "messy" data, OpenRefine.

Text Facets

Faceting is a good way to get an overview of a specific column of your data. Text faceting will organize unique items in the selected column by name and will give a count for how many rows or records possess that item name.

For example, let's text facet the column FacilityName.

You will navigate to the drop-down arrow beside the column title, hover over Facet, and then select Text facet. This will create a box on the left-hand side of the screen, under the Facet/Filter sidebar.

 

undefined

This box will allow you to browse through the names of each unique facility and see how many times each facility name appears in the column. By default, this box is sorted by name (alphabetical order). Keeping the box in alphabetical order allows you to catch any initial data-entry mistakes. For example, pictured in the box, there are 2 facility names that could potentially be referencing the same location: Charleston Heights AC and Charleston Heights Arts Center.

 

It is possible (though not always) that these 2 entries are actually referring to the same facility. In cases such as these, you may need to consult an outside source, such as Google or a government website, to confirm or deny your suspicions.

 

Editing Text Facets

Let's say you have confirmed these 2 entries are meant to be the same facility. It is at this time where you can choose which version you would like to be reflected in your data. When hovering your mouse over one of the entries, you will be shown the options to edit or include.

 

undefined

Selecting include will cause the main viewer to display only the entries you have included. For instance, if I wanted to include both Charleston Heights entries, this is what would appear (below). The main viewer would now display only the included entries, telling you that 45 rows matched this request. In the facet box, the entries included will now be colored red and will be bolded. Occasionally, by utilizing include you are able to tell whether the entries in question really do refer to the same facility. To resume viewing all data, select exclude, now visible in the facet box beside the selected entities.

 

undefined

 

Selecting edit will allow you to bulk edit the entry of your choice. Let's say you have decided both Charleston Heights entries refer to the same facility and you would like your final dataset to read "Charleston Heights Arts Center" rather than "AC." By selecting edit, a pop-up will appear, allowing you to change the selected entry name.

 

undefined

 

Once you have changed the entry to your desired name, select Apply. This will apply your changes to the list of facility names, changing all entries that read "AC" to "Arts Center." The changes will be confirmed by a pop-up in the main viewer.

 

undefined

 

In addition, the entries in the facet box will update, now reading "Charleston Heights Arts Center" and showing the new total times referenced in the column.

 

undefined

Numeric Facets

Numeric facets allow you to view and customize the range of numbers represented in your data. This could help you identify potential outliers or numerical entry errors.

To create a numeric facet, choose a column that contains only cells with numbers (it is important that the data type of this cell has been converted to number). From the drop-down arrow beside the column header, select Facet > Numeric Facet. This will create a box on the left-hand side of the interface in the Facet/Filter Box.

By doing this with the Attendance column, you are now able to see the range of attendance across the entire dataset. To restrict the range, simply move the left and right tabs (shown below) to the desired numerical value. The data represented in the center of the screen will condense to the restrictions applied.

 

undefined

 

For example, if you wanted to look at the data where the attendance is above 70,000 people, you would slide to the left bar until the range below read 70,000.00 - 250,000.00. You will see that 8 rows are reflected within this new range (below).

 

undefined

 

The row with no attendance, "Historic Fifth Street School," is included in these results because the "Blank" box remains checked in the Numeric Facet box. To remove this result, deselect the "Blank" box.

Other Facets

Other common facets include:

  • Timeline Facet - Timeline facets will work on columns where the data type is a Date. This will create a timeline box on the left-hand side of the screen where you can see most events in this dataset have occurred after May 2016.
  • Scatterplot Facet - Scatterplot facets will create a graph of the data represented in specific columns. This facet is only possible where the column's data type is a number.
© University of Nevada Las Vegas