Basic OpenRefine Functions I: Working with columns, sorting, faceting, filtering and clustering

Overview

Teaching: 20 min
Exercises: 40 min
Questions
  • How do I move, rename or remove columns in OpenRefine?

  • How do I sort data in OpenRefine?

  • How can I work with a subset of my full data set in OpenRefine?

  • How can I easily correct common data issues in my data with OpenRefine?

Objectives
  • Explain how to reorder, rename and remove columns

  • Explain how to sort data in columns

  • Explain what Facets and Filters are

  • Use facets and filters to work with a subset of data

Basic Operations


Layout

Another thing you can do is …

Sorting data

  1. Choose the drop down of the column you want to sort and choose Sort...`
  2. Once you have sorted the data a new Sort drop down menu will be displayed.
  3. Unlike Excel, sorts are temporary, you can remove sort and the data will go back to original unordered state
  4. Sort drop down menu lets you change the sort (reverse sort), remove sorts and make sorts permanent 4.Sort on multiple columns at same time by added another sorted column

Facets

Let’s create a text facet

  1. Click on the drop down menu at the top of the column and choose Facet > Text Facet. The facet will then appear in the left hand panel
  2. Filter by facet by clicking or include
  3. Select multiple values to include
  4. You can ‘invert’ your selections to exclude
  5. Include a value and then look at top to invert inclusion.

Exercise 2: Licences used for articles in this file

Use a text facet for the license column and answer these questions:

  1. What is the most common Licence in the file?
  2. How many articles in the file don’t have a licence assigned?

Solution

  • Create a facet for the ‘Licence’ column
  • What is the most common Licence in the file? Answer: CC BY
  • How many articles in the file don’t have a licence assigned? Answer: 6

Filters

More on Facets

As well as ‘Text facets’ Refine also supports a range of other types of facet. These include:

Numeric and Timeline facets display graphs instead of lists of values. The graph includes ‘drag and drop’ controls you can use to set a start and end range to filter the data displayed.

Scatterplot facets are less commonly used - for further information on these see the tutorial at http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial#Exploring_the_data_with_scatter_plots

Custom facets are a range of different types of facets, and also allow you write your own custom facets. Some of the default custom facets are:

Exercise 3: Find all publications without a DOI

  • Use the Facet by blank function to find all publications in this data set without a DOI

Solution

  1. On the DOI column drop down and select Customized facets > Facets by blank
  2. True means that it is blank, so you can:
    • Select include for true to winnow down the publications

Changing Data through Facets

??? This approach is useful in relatively small facets where you might have small variations through punctuation or typing errors etc. For example, a column that should contain only terms from a small restricted list such as days of the week or months of the year.

Exercise 4: Correct the Language values via a facet

  • Text facet on the language column and correct the variation in the EN and English values.

Solution

  1. Create a Text facet on the Language column
  2. Notice that there is both EN and English
  3. Put the mouse over the English value
  4. Click Edit
  5. Type EN and click Apply
  6. See how the Language facet updates

Rows and Records

Exercise 5: Split author names into separate cells

If you look at the Author column you should be able to see that there are multiple names in each cell separated by the pipe or vertical bar symbol |. To work with the author names effectively we need to split them into separate cells:

Use the Edit cells > Split multi-valued cells drop down option on the Authors column to separate the authors by the | separator. Switch between rows and records. What do you notice?

Solution

  1. Click the dropdown menu at the top of the Author column
  2. Choose Edit cells->Split multi-valued cells
  3. In the prompt type the | symbol and click ‘OK’
    • Note that the rows are still numbered sequentially
  4. Click the ‘Records’ option to change to Records mode
    • Note how the numbering has changed - indicating that several rows are related to the same record

records rows

Clustering

Exercise 6: Use Clustering to clean up author data

  • Cluster on the Author column using: Edit cells->Cluster and edit …
  • Choose Edit cells->Cluster and edit from the author column (which should be split into individual values from the last exercise)
  • Using the ‘key collision’ method with the ‘fingerprint’ Keying Function work through the clusters of values, merging them to a single value where appropriate
  • Try changing the clustering method being used - which ones work well?

Break


Key Points

  • You can reorder, rename and remove columns in OpenRefine

  • You can use facets and filters to explore your data

  • You can use facets and filters work with a subset of data in OpenRefine

  • You can easily correct common data issues using Facets and Clustering