Previously known as Google Refine, OpenRefine is a robust tool useful for working with messy data.
With a simple interface, OpenRefine is a powerful but user-friendly program for exploring and cleaning messy data. With its ability to incorporate textual cleaning techniques (such as clustering and faceting), OpenRefine provides an advanced alternative to Excel without needing to understand computer programming.
Running OpenRefine requires:
NOTE: On Windows, OpenRefine does NOT support Cygwin, MSYS2, or Git Bash for running OpenRefine, instead just use Windows Terminal
The dataset used in this tutorial was downloaded from the Las Vegas Open Data Portal on March 5, 2020. There is no license specified for reuse.
Cluster - To group similar items together. For example, in this guide, we will be demonstrating clustering data from a dataset to clean up any text inconsistencies. To cluster this data, the machine will search for cells with similar text to determine whether they should be combined.
Facility column data:
Charleston Heights AC |
West Las Vegas Arts Center |
Charleston Heights Arts Center |
5th Street School |
Special Events |
Clustered Facility data:
Charleston Heights AC | 3 |
Charleston Heights Arts Center | 4 |
Facet - A method of summarizing data in cells for an overview of the data within them. For example, in this guide, we will be demonstrating facets that can be used to clean and revise data. To facet data, the machine will combine data for a summary view of what is contained in a selected column.
Program column data:
Camps and Classes |
Other |
Visual and Public Arts |
Performing Arts |
Performing Arts |
Camps and Classes |
Program column facet:
Camps and Classes | 48 |
Visual and Public Arts | 33 |
Performing Arts | 50 |
Other | 54 |
JSON - A format for structured data.
Parse - To divide something into its smallest parts. For example, in this guide, we will be demonstrating creating a project using a CSV (comma-separated value) dataset. This means that each unit is separated by commas. To parse this data, the machine will separate each unit based on where a comma falls.
Original data:
Facility, Program, Attendance, Date, Quarter, Period Type, ObjectId
Parsed data:
Facility | Program | Attendance | Date | Quarter | Period Type | ObjectId |
Whitespace - also known as blank space, or space, this is a term used to describe the "empty" space on a webpage or in a field that isn't visible but can still be seen and read by the machine. This can negatively affect formulas or computer programming by producing errors.