So let's start the
OpenRefine desktop
application, which opens up
a new browser window and
runs on a default port on
the local machine.
We can create a new
project, open or import an
existing project,
or just change the language
settings if needed.
Let's download the stocks start ZIP file and browse to it from within the OpenRefine Create Project menu, then click Next.
This imports the contents of
the file into application
memory. Note that the tool
can work both with compressed
and uncompressed files.
Since the stock data is in the semi-structured JSON format, we will select the root node of the JSON tree, which transforms the data into a tabular format with rows and columns.
We can choose to preserve empty strings and perform basic cleanup, such as trimming whitespace and parsing numeric values.
These operations can also be
performed at a later stage.
We will provide a project
name and proceed with
project creation.
Once the project has
been created,
we can click All → Edit columns
and remove some of the unwanted
columns from the data set.
This helps reduce the memory
usage, especially for large
data sets. We'll also go
ahead and reorder some of
the columns in the table.
Now the data set is lighter, with only those columns that will potentially be used for further analysis. Just like in a spreadsheet, we can sort on different columns. Here we'll sort the volume column in descending order.
Financial data sets, such as
stock prices, are constantly
changing. What if our data is
outdated, and we wanted to fetch
the latest data from the stock
market? Using OpenRefine, one
could augment the data using
online web services.
However, calling some of these
web services could be very
time-consuming. So to
demonstrate this feature,
let's work with a smaller
subset of the stocks data and
load it into a new project.
Look for the file stocks_demo.csv in the
session folder,
and load it in OpenRefine.
We will also create a new
OpenRefine project
and call it StocksDemo.
On the ticker column menu, click
Edit column → Add columns by
fetching URLs. This opens up an
expression box where we could
enter the URL of an online web
service which could be invoked
to augment the data.
This online service provides an
API that returns slightly
delayed stock prices.
A good practice is to test
the web service from a
browser client with a known
ticker symbol and validate
the service response.
We will add this URL and append the ticker symbol as a variable parameter for each row. To do this, copy the URL into the expression box and concatenate the ticker value as shown here.
Now the URL is different for
each row with the ticker symbol
as the symbol parameter.
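In GREL, the expression for the Add columns by fetching URLs box simply concatenates a fixed base URL with each row's cell value. A minimal sketch, assuming a hypothetical quote endpoint that takes the ticker in a symbol query parameter:

```grel
"https://quotes.example.com/quote?symbol=" + value.escape('url')
```

Here value is the ticker cell in the current row, and escaping with 'url' guards against characters that are not URL-safe.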
We'll give a new name to the
column and click OK.
This could take a few minutes
to run depending on the speed
of the web service.
We now have a new column with
the latest stock data.
Notice that the results are in XML format and need to be parsed. Please refer to the OpenRefine tutorials for help with parsing XML and JSON results.
For now, here is the
syntax for extracting the stock
price from the XML document.
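As a sketch, GREL can parse the response with parseHtml(), which also handles simple XML, and select an element by tag name. The tag name last here is an assumption; replace it with whatever element the service actually uses for the price:

```grel
value.parseHtml().select("last")[0].htmlText()
```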
After a quick review of the
results, we are ready to export
the data to another data
processing tool in our pipeline.
In the next example we will use
OpenRefine for cleaning up a
sample messy data file.
So let's download the
Customers-messy.zip file
from the session folder and store it on
the local computer. The first
step is to visually inspect
the raw data file using a text
editor or Excel and get
familiar with the attributes
and contents of the data.
As before, we will load this
data file into OpenRefine and
create a new project.
Once the data has been
loaded, we'll do a cursory
check on the columns,
and change the character encoding to UTF-8.
If needed, we could also change
options such as column
separators or add a missing
header row. In our case, we'll
keep the default settings.
So let's give the clean-up
project a name and click on
the Create Project button.
This creates a new OpenRefine
project with a copy of the
original data set. The next step
in the cleanup process is to
visually inspect the data and
check each column for errors,
formatting issues, or missing values.
We can perform common operations like removing whitespace from text. For this, let's click on a column header, say contactFirstName, then Edit cells → Trim leading and trailing whitespace. We can perform the same operation on multiple columns.
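The same cleanup can also be expressed as a one-line GREL transform (Edit cells → Transform):

```grel
value.trim()
```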
We could also do basic
concatenation operations like
in a spreadsheet. Let's say we
need to combine the first and
last names of the contact into
one column. For this we'll
click on Edit column →
Add column based on this column.
In the expression box, we will take the current value of contactFirstName, add a space, and append the value of contactLastName. Let's check the results, assign a name to the new column, such as contactFullName, and hit OK.
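The concatenation expression reads the neighboring cell through the cells variable. A sketch, assuming the two columns are named contactFirstName and contactLastName:

```grel
cells["contactFirstName"].value + " " + cells["contactLastName"].value
```

Since the new column is based on contactFirstName, the first term could equivalently be written as just value.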
The newly-created column has
been added to our data set.
A nice thing about OpenRefine is
that you can undo your past
operations, so if you made a
mistake you can go back to a
previous step and correct it.
This makes it really easy to
make incremental changes without
losing time over
saving multiple copies of the data.
Let's go ahead and remove the
column that we just created.
Just like with spreadsheet applications, we could also perform other simple, column-based operations, such as changing the case of the text to title case, uppercase, or lowercase.
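Each case change is also available as a one-line GREL transform; for example, title case is:

```grel
value.toTitlecase()
```

with value.toUppercase() and value.toLowercase() as the corresponding functions for the other two cases.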
Faceting is a powerful feature in many data cleaning tools. A facet summarizes the cells in a column to give you a bigger picture of that column. It also allows filtering down to just the subset of rows that you want to change in bulk.
Let's see how this works.
Let's create a text facet
on the country column
by clicking on the column's
dropdown menu and
invoking Facet → Text facet. We
can see that
the country names
are not consistent.
For instance, United States
is spelled as US, USA, or
United States, with some
extra characters.
Also, Netherlands is spelled both with and without the determiner "the," and France is spelled using uppercase as well as lowercase characters.
Clustering allows us to
automatically group and edit
different but similar values.
Let's click on the Cluster
command on the left frame.
We now have the opportunity to
see all the similar words
grouped together and perform a
merge operation. Let's merge all
text referring to the United
States as USA, United Kingdom as
UK, and so on.
The Merge and Re-Cluster button confirms the operation. There are no more obvious clusters left. However, we still haven't resolved the issue with Netherlands. For this, we will choose another clustering method, nearest neighbor, which uses a distance metric.
The country column is now fixed, and there are no more clusters to be found. However, we still have one error: the country Norway contains an extra space character. See if you can find a way to fix this error.
Let's now go ahead and use the
tool to clean up a numerical
column like the credit limit.
We'll first apply a numeric transformation on the column so that all the numbers are recognized. Some values in this column cannot be parsed into numbers because of the presence of alphanumeric characters, such as USD or the dollar symbol. We can isolate these by applying a Numeric facet.
On the left we see a chart with
the distribution of the data.
Let's filter down to the non-numeric
values only and change
the values of the fields from
the expression box.
This screen allows us to
view the original values as
well as the outcomes when
the expression is applied.
We will invoke the replace function to substitute the unwanted characters with blanks. So let's remove the USD, dollar, and comma symbols. As you can see, we can make multiple function calls in a single line; this is also known as operator chaining. Now all the alphabetic characters have been removed.
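The chained expression, assuming the unwanted tokens are literally the strings USD, the dollar sign, and a thousands-separator comma:

```grel
value.replace("USD", "").replace("$", "").replace(",", "")
```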
When we click OK, the tool still
does not see them as numbers.
For this we will need to invoke
the toNumber function to
convert the values within the
facet into numerical values.
So far we have tested these operations on a numeric facet. To apply these changes to the data set in the project, we will copy this expression and apply it as a column transform. This ensures that all the values in the credit limit column have been converted into numbers.
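The full column transform appends toNumber() to the replace chain, again assuming that USD, the dollar sign, and commas are the only non-numeric characters present:

```grel
value.replace("USD", "").replace("$", "").replace(",", "").toNumber()
```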
Once we're done with cleaning up
the rest of the data, we are
ready to export the file into a
format of our choosing. For this
example, let's export it as a
CSV or a comma-separated file
and open it to view the results.
All the columns now have clean
data. The data set is now
ready for storage in a
database or for further analysis.