Skip to Main Content
UNLV Logo
questions, ask us

Getting Started with Data Cleaning and OpenRefine

This guide is meant to introduce readers to the importance of data cleaning through a useful tool for working with "messy" data, OpenRefine.

What is it?

Previously known as Google Refine, OpenRefine is a robust tool useful for working with messy data. 

Why use it?

With a simple interface, OpenRefine is a powerful but user-friendly program for exploring and cleaning messy data. With its ability to incorporate textual cleaning techniques (such as clustering and faceting), OpenRefine provides an advanced alternative to Excel without needing to understand computer programming.

System Specifications

Running OpenRefine requires:

  1. Java JRE/JDK installed (If you are running a 64 bit operating system, then it's recommended that you install 64 bit Java)
  2. A Supported operating system: Windows, Linux, macOS


NOTE: On Windows, OpenRefine does NOT support Cygwin, MSYS2, or Git Bash for running OpenRefine, instead just use Windows Terminal

Demonstration Dataset

The dataset used in this tutorial was downloaded from the Las Vegas Open Data Portal on March 5, 2020. There is no license specified for reuse.

Definitions

Cluster - To group similar items together. For example, in this guide, we will be demonstrating clustering data from a dataset to clean up any text inconsistencies. To cluster this data, the machine will search for cells with similar text to determine whether they should be combined.

Facility column data:

Charleston Heights AC
West Las Vegas Arts Center
Charleston Heights Arts Center
5th Street School
Special Events

Clustered Facility data:

Charleston Heights AC 3
Charleston Heights Arts Center 4

 

Facet - A method of summarizing data in cells for an overview of the data within them. For example, in this guide, we will be demonstrating facets that can be used to clean and revise data. To facet data, the machine will combine data for a summary view of what is contained in a selected column.

Program column data:

Camps and Classes
Other
Visual and Public Arts
Performing Arts
Performing Arts
Camps and Classes

Program column facet:

Camps and Classes 48
Visual and Public Arts 33
Performing Arts 50
Other 54

 

JSON - A format for structured data.


Parse
- To divide something into its smallest parts. For example, in this guide, we will be demonstrating creating a project using a CSV (comma-separated value) dataset. This means that each unit is separated by commas. To parse this data, the machine will separate each unit based on where a comma falls.

Original data:
Facility, Program, Attendance, Date, Quarter, Period Type, ObjectId

Parsed data:

Facility Program Attendance Date Quarter Period Type ObjectId

 

Whitespace - also known as blank space, or space, this is a term used to describe the "empty" space on a webpage or in a field that isn't visible but can still be seen and read by the machine. This can negatively affect formulas or computer programming by producing errors.

© University of Nevada Las Vegas