Getting Started with Data Cleaning and OpenRefine

This guide is meant to introduce readers to the importance of data cleaning through a useful tool for working with "messy" data, OpenRefine.

Clustering

Working with the same FacilityName text facet created in Faceting, you can also find discrepancies in entries using the Cluster option.

 

Selecting Cluster opens a large pop-up box on your screen. This box offers two main options: Method and Keying Function. To learn the exact definitions and capabilities of each method and keying function, check out Clustering in Depth, a guide produced by the makers of OpenRefine.

 

When you choose a keying function, OpenRefine automatically groups entries that the function treats as equivalent, for example entries that are spelled or pronounced alike. In larger datasets, switching between keying functions combines the column's entries in different ways, making typos and inconsistencies easier to spot. Several keying functions identify potential inconsistencies in this dataset; here we focus on Daitch-Mokotoff, a phonetic keying function. To navigate to Daitch-Mokotoff, open the drop-down box beside the Keying Function option.

 

Using this function, OpenRefine flags two potential inconsistencies: "Special Event" versus "Special Events," and "West Las Vegas Arts Center" versus "West Las Vegas."
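To make the idea of a keying function more concrete, here is a minimal Python sketch of key-collision clustering: each entry is reduced to a key, and entries that share a key form a cluster. The two functions below are simplified approximations of OpenRefine's fingerprint and n-gram fingerprint keying functions, not OpenRefine's own code, and they are not Daitch-Mokotoff (a phonetic algorithm too long to reproduce here). The function names and the `names` list are assumptions made for illustration; the four values come from this guide's example.

```python
import re
import string
import unicodedata
from collections import defaultdict


def _to_ascii_lower(value: str) -> str:
    """Lowercase the text and drop accents so that 'É' and 'e' compare equal."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    return text.encode("ascii", "ignore").decode("ascii")


def fingerprint(value: str) -> str:
    """Approximate fingerprint key: strip punctuation, then sort unique words."""
    text = _to_ascii_lower(value)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def ngram_fingerprint(value: str, n: int = 1) -> str:
    """Approximate n-gram fingerprint key: sort unique character n-grams."""
    text = re.sub(r"[^a-z0-9]", "", _to_ascii_lower(value))
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))


def cluster(values, key_func):
    """Group values by key and keep only groups with more than one member."""
    buckets = defaultdict(set)
    for value in values:
        buckets[key_func(value)].add(value)
    return [sorted(members) for members in buckets.values() if len(members) > 1]


# Facility names taken from this guide's example.
names = [
    "Special Event",
    "Special Events",
    "West Las Vegas Arts Center",
    "West Las Vegas",
]

print(cluster(names, fingerprint))        # word-based key: no clusters here
print(cluster(names, ngram_fingerprint))  # 1-gram key: groups the Special Event pair
```

In this sketch the word-based key finds nothing, while the 1-gram key groups "Special Event" with "Special Events" because both contain the same set of letters. Neither approximation flags the "West Las Vegas" pair, which is one reason trying several keying functions on the same column is worthwhile.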

 

Editing, Merging, and Re-clustering

While it is currently unclear whether "West Las Vegas Arts Center" and "West Las Vegas" are meant to refer to the same location, it is reasonable to conclude that "Special Event" and "Special Events" refer to the same thing. As with Edit in the Faceting section, here you can choose the spelling or version of the text that should appear in your final dataset.

Checking the Merge? box tells OpenRefine to combine those two names into the New Cell Value (the editable text box). All "Special Event" and "Special Events" cells will then be treated as the same Facility Name and renamed to whatever value you enter.

Let's say you would like to merge both "Special Event(s)" values and rename those cells "Misc Special Event." You would select the "Merge?" box and write the new name in the "New Cell Value" text box.

 

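To apply these changes, select "Merge Selected and Re-Cluster" or "Merge Selected and Close." The first option applies the merge and keeps the dialog open; the second applies the merge and closes the dialog. If you choose to re-cluster, the merged "Special Event(s)" values no longer appear in the "Cluster and Edit" interface, leaving only the "West Las Vegas" cluster.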
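The merge-and-re-cluster step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_key` is a hypothetical stand-in keying function (the first two words, lowercased, with trailing "s" removed) chosen so that both pairs from this guide cluster, and it is not Daitch-Mokotoff or any OpenRefine function. The `column` list, including how often each value repeats, is invented sample data.

```python
from collections import defaultdict


def toy_key(value: str) -> str:
    """Hypothetical key: first two words, lowercased, trailing 's' stripped."""
    words = value.lower().split()[:2]
    return " ".join(word.rstrip("s") for word in words)


def find_clusters(values):
    """Return groups of distinct values that share the same key."""
    buckets = defaultdict(set)
    for value in values:
        buckets[toy_key(value)].add(value)
    return [sorted(members) for members in buckets.values() if len(members) > 1]


def merge(values, members, new_value):
    """Replace every cell whose value is in `members` with `new_value`."""
    members = set(members)
    return [new_value if value in members else value for value in values]


# Invented sample column using the facility names from this guide.
column = [
    "Special Event",
    "Special Events",
    "Special Event",
    "West Las Vegas Arts Center",
    "West Las Vegas",
]

before = find_clusters(column)

# Check "Merge?" for the Special Event(s) cluster and type a New Cell Value.
merged = merge(column, ["Special Event", "Special Events"], "Misc Special Event")

# Re-cluster: the merged values now share one spelling, so they drop out.
after = find_clusters(merged)

print(before)
print(merged)
print(after)
```

After the merge, every former "Special Event(s)" cell holds the single value "Misc Special Event," so there is nothing left to reconcile in that group and only the "West Las Vegas" pair is reported on the second pass.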

© University of Nevada Las Vegas