-
So let's start the
OpenRefine desktop
-
application, which opens up
a new browser window and
-
runs on a default port on
the local machine.
-
We can create a new
project, open or import an
-
existing project,
-
or just change the language
settings if needed.
-
Let's download the stock start zip
file and browse through the file
-
from within the OpenRefine
Create Project menu.
-
And click next.
-
This imports the contents of
the file into application
-
memory. Note that the tool
can work both with compressed
-
and uncompressed files.
-
Since the stock data is in the
semi-structured json format, we
-
will select the root node of the
json tree, which transforms the
-
data into a tabular format with
rows and columns.
-
We can choose to preserve empty strings
-
and perform basic cleanup such as
-
trimming of whitespaces and
parsing numerical values.
-
These operations can also be
performed at a later stage.
-
We will provide a project
name and proceed with
-
project creation.
-
Once the project has
been created,
-
we can click All → Edit columns
and remove some of the unwanted
-
columns from the data set.
-
This helps reduce the memory
usage, especially for large
-
data sets. We'll also go
ahead and reorder some of
-
the columns in the table.
-
Now the data set is lighter
with only those columns that
-
will be used potentially for
further analysis. Just like
-
in a spreadsheet, we could
choose to sort different
-
columns. Here we'll go ahead
and sort the volume column
-
in the descending order.
-
Financial data sets, such as
stock prices, are constantly
-
changing. What if our data is
outdated, and we wanted to fetch
-
the latest data from the stock
market? Using OpenRefine, one
-
could augment the data using
online web services.
-
However, calling some of these
web services could be very
-
time-consuming. So to
demonstrate this feature,
-
let's work with a smaller
subset of the stocks data and
-
load it into a new project.
-
Look for the file stocks_demo.csv in the
session folder,
-
and load it in OpenRefine.
-
We will also create a new
OpenRefine project
-
and call it StocksDemo.
-
On the ticker column menu, click
Edit column → Add columns by
-
fetching URLs. This opens up an
expression box where we could
-
enter the URL of an online web
service which could be invoked
-
to augment the data.
-
This online service provides an
API that returns slightly
-
delayed stock prices.
-
A good practice is to test
the web service from a
-
browser client with a known
ticker symbol and validate
-
the service response.
-
We will add this URL and append
the ticker symbol as a variable
-
parameter for each row.
-
To do this, copy the URL into
-
the expression box, and insert
codes as shown here.
-
Now the URL is different for
each row with the ticker symbol
-
as the symbol parameter.
-
We'll give a new name to the
column and click OK.
-
This could take a few minutes
to run depending on the speed
-
of the web service.
-
We now have a new column with
the latest stock data.
-
Notice that the results are in
the XML format and need to be
-
parsed. Please refer to the
OpenRefine tutorials for some
-
help with parsing XML and json
results.
-
For now, here is the
syntax for extracting the stock
-
price from the XML document.
-
After a quick review of the
results, we are ready to export
-
the data to another data
processing tool in our pipeline.
-
In the next example we will use
OpenRefine for cleaning up a
-
sample messy data file.
-
So let's download the
Customers-messy.zip file
-
from the session folder and store it on
-
the local computer. The first
step is to visually inspect
-
the raw data file using a text
editor or Excel and get
-
familiar with the attributes
and contents of the data.
-
As before, we will load this
data file into OpenRefine and
-
create a new project.
-
Once the data has been
loaded, we'll do a cursory
-
check on the columns,
-
and change the character
encoding to UTF 8.
-
If needed, we could also change
options such as column
-
separators or add a missing
-
header row. In our case, we'll
keep the default settings.
-
So let's give the clean-up
project a name and click on
-
the Create Project button.
-
This creates a new OpenRefine
project with a copy of the
-
original data set. The next step
in the cleanup process is to
-
visually inspect the data and
check each column for errors,
-
formatting issues, or missing values.
-
We can perform common
operations like remove
-
whitespace from text. For this,
let's click on a column header.
-
Let's say contact first name,
-
Edit cells → Trim leading and
-
trailing whitespace. We can
perform the same operation
-
on multiple columns.
-
We could also do basic
concatenation operations like
-
in a spreadsheet. Let's say we
need to combine the first and
-
last names of the contact into
one column. For this we'll
-
click on Edit column →
-
Add column based on this column.
-
On the expression box, we
will take the current value
-
of the contactFirstName.
-
Add a space followed by the
value of the contact last name.
-
Let's check to see the results
-
and assign a name to the new
column, such as
-
contactFullName. And hit OK.
-
The newly-created column has
been added to our data set.
-
A nice thing about OpenRefine is
that you can undo your past
-
operations, so if you made a
mistake you can go back to a
-
previous step and correct it.
This makes it really easy to
-
make incremental changes without
losing time over
-
saving multiple copies of the data.
-
Let's go ahead and remove the
column that we just created.
-
Just like with spreadsheet
applications, we could also
-
perform other simple, column-based
operations such as
-
changing the case of the text to
title, upper, or lowercase.
-
Faceting is a powerful feature
in many data cleaning tools. The
-
facet summarizes the cells in
that column to give you a bigger
-
picture on the column. It also
allows for filtering down to
-
just a subset of the rows that
you want to change in bulk.
-
Let's see how this works.
-
Let's create a text facet
on the country column
-
by clicking on the column's
dropdown menu and
-
invoking Facet → Text facet. We
can see that
-
the country names
are not consistent.
-
For instance, United States
is spelled as US, USA, or
-
United States, with some
extra characters.
-
Also, Netherlands is spelled
both with and without the
-
determiner "the." France is
spelled using the uppercase as
-
well as lowercase characters.
-
Clustering allows us to
automatically group and edit
-
different but similar values.
Let's click on the Cluster
-
command on the left frame.
-
We now have the opportunity to
see all the similar words
-
grouped together and perform a
-
merge operation. Let's merge all
text referring to the United
-
States as USA, United Kingdom as
UK, and so on.
-
The Merge and Re-Cluster
button confirms the
-
operation. There are no more
obvious clusters that are
-
left. However, we still
haven't resolved the issue
-
with Netherlands. For this
we will choose another
-
clustering operation which
uses a distance metric.
-
The country column is now fixed,
and there are no more
-
clusters to be found. However,
we still have one error.
-
The country Norway has a space
character. See if we can find
-
a way to fix this error.
-
Let's now go ahead and use the
tool to clean up a numerical
-
column like the credit limit.
We'll first apply a numerical
-
transformation on the column so
that all the numbers are
-
recognized. Some values in this
column cannot be parsed into
-
numbers. This is because of the
presence of alphanumeric values,
-
such as the USD or the dollar
-
symbol. We can correct this by
applying the Numeric facet.
-
On the left we see a chart with
the distribution of the data.
-
Let's filter down to the non-numeric
values only and change
-
the values of the fields from
the expression box.
-
This screen allows us to
view the original values as
-
well as the outcomes when
the expression is applied.
-
We will involve the replace
function to replace or
-
substitute the unwanted
characters with blanks. So let's
-
remove the USD, the dollar, and
the comma symbols.
-
As you can see, we can make
multiple function calls in a
-
single line. This is also known
-
as operator chaining. Now
all the Alpha characters
-
have been removed.
-
When we click OK, the tool still
does not see them as numbers.
-
For this we will need to invoke
the toNumber function to
-
convert the values within the
facet into numerical values.
-
So far we have tested these
operations on a numeric facet.
-
To apply these changes to the
data set in the project we will
-
copy this expression and apply
to a column transform. This
-
ensures that all the values in
the credit limit column have
-
been converted into numbers.
-
Once we're done with cleaning up
the rest of the data, we are
-
ready to export the file into a
format of our choosing. For this
-
example, let's export it as a
CSV or a comma-separated file
-
and open it to view the results.
-
All the columns now have clean
data. The data set is now
-
ready for storage in a
database or for further analysis.