Skip to Main Content
questions, ask us

Getting Started with Data Cleaning and OpenRefine

This guide is meant to introduce readers to the importance of data cleaning through a useful tool for working with "messy" data, OpenRefine.

Add Column Based on This Column

One way to split data in cells is the Add Column Based on This Column. This feature uses GREL or Python to parse data from one or more cells into new columns while leaving the already existing data alone. The expression indicates what data should be pulled from the original column(s), and then separates it into new columns that you can name beforehand. This feature can also be used to merge data in separate columns into one using a regular expression.

Let's say we want to multiply the attendance column by the value of 2. To add a column based on a column, you will select the drop-down arrow beside the column you would like to change. From there, you would select "Edit column" and then select "Add column based on column."



Selecting this will then produce a popup window (below):



In this window, you are given the opportunity to name your new column, which we will call "Multiple2." Some transformations can result in an error. The box below renaming the column allows you to choose whether any errors will be returned as a null, stored as an error, or simply copies over the original column information. We will be leaving this as "set to blank." Below that is a language option, where you can swap between GREL, Python, and Clojure. We will be leaving this as GREL. Underneath the "Expression" label is where we will code our transformation. In that box, we will leave the word "value" and then type "*2", the computational way of indicating multiplying by 2. You will notice that the column view beneath, on the righthand side, will update, multiplying the left column by 2.



To save these changes and create a new column, select "OK" in the bottom lefthand corner. This will take you back to your dataset and you will now see a new column, "Multiple2."

Adding columns based on another column can be useful when performing calculations or even string operations in Python.

Split into Several Columns

Similar to the Excel feature Text to Columns for separating data from one cell into multiple cells, OpenRefine uses the Split into Several Columns feature to divide columns using a delimiter, regular expression, or field length. You can also limit the maximum number of columns the data is split into, which can be useful for larger datasets.

To split a column, select the drop-down arrow next to the column of your choice. In this case, we will be splitting the "Date" column. From the drop-down, you will select "Edit column" and then "Split into several columns."



This will produce a popup window (below):



Here, we can specify how we want to split a column as well as how many times. We are going to be splitting the "Date" column by the character "T" (NOTE: capitalization does matter). In the "Separator" field, you will delete the present comma and type "T." Then you will select "OK" in the bottom lefthand corner.

As you can see, your original "Date" column has now been replaced, and instead, you have two columns: Date 1 and Date 2, which you can then rename, should you choose to do so.


© University of Nevada Las Vegas