So let's start the OpenRefine desktop application, which opens a new browser window and runs on a default port on the local machine. We can create a new project, open or import an existing project, or change the language settings if needed. Let's download the stocks data zip file, browse to it from within the OpenRefine Create Project menu, and click Next. This imports the contents of the file into application memory. Note that the tool works with both compressed and uncompressed files. Since the stock data is in the semi-structured JSON format, we will select the root node of the JSON tree, which transforms the data into a tabular format with rows and columns. We can choose to preserve empty strings and perform basic cleanup such as trimming whitespace and parsing numerical values. These operations can also be performed at a later stage. We will provide a project name and proceed with project creation. Once the project has been created, we can click All → Edit columns and remove some of the unwanted columns from the data set. This helps reduce memory usage, especially for large data sets. We'll also reorder some of the columns in the table. Now the data set is lighter, with only those columns that will potentially be used for further analysis. Just like in a spreadsheet, we can sort on different columns. Here we'll sort the volume column in descending order. Financial data sets, such as stock prices, are constantly changing. What if our data is outdated and we want to fetch the latest data from the stock market? Using OpenRefine, we can augment the data using online web services. However, calling some of these web services can be very time-consuming, so to demonstrate this feature, let's work with a smaller subset of the stocks data and load it into a new project. Look for the file stocks_demo.csv in the session folder, and load it in OpenRefine.
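If you are curious what this import step does conceptually, here is a minimal Python sketch of it. The JSON payload and field names below are invented for illustration; the real stocks file will have its own schema. The sketch mirrors selecting the root node, trimming whitespace, and parsing numeric values at import time.

```python
import json

# Hypothetical sample mirroring the stocks JSON: the children of the
# selected root node become rows; their fields become columns.
raw = '''
{"stocks": [
    {"ticker": "IBM",  "close": "142.5 ", "volume": "3120000"},
    {"ticker": "MSFT", "close": " 310.2", "volume": "2250000"}
]}
'''

records = json.loads(raw)["stocks"]  # select the root node of the JSON tree

# Basic cleanup comparable to OpenRefine's import options:
# trim whitespace and parse numerical values.
rows = [
    {
        "ticker": r["ticker"].strip(),
        "close": float(r["close"].strip()),
        "volume": int(r["volume"].strip()),
    }
    for r in records
]

# Sort by volume in descending order, like the column sort in the tool.
rows.sort(key=lambda r: r["volume"], reverse=True)
```

As in OpenRefine, the cleanup steps here could just as well be deferred and applied after the raw rows are loaded.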
We will create a new OpenRefine project and call it StocksDemo. On the ticker column menu, click Edit column → Add column by fetching URLs. This opens an expression box where we can enter the URL of an online web service to invoke to augment the data. This online service provides an API that returns slightly delayed stock prices. A good practice is to test the web service from a browser with a known ticker symbol and validate the service response. We will add this URL and append the ticker symbol as a variable parameter for each row. To do this, copy the URL into the expression box and insert the code as shown here. Now the URL is different for each row, with the ticker symbol as the symbol parameter. We'll give the new column a name and click OK. This could take a few minutes to run, depending on the speed of the web service. We now have a new column with the latest stock data. Notice that the results are in XML format and need to be parsed. Please refer to the OpenRefine tutorials for help with parsing XML and JSON results. For now, here is the syntax for extracting the stock price from the XML document. After a quick review of the results, we are ready to export the data to another data processing tool in our pipeline.

In the next example we will use OpenRefine to clean up a sample messy data file. So let's download the Customers-messy.zip file from the session folder and store it on the local computer. The first step is to visually inspect the raw data file using a text editor or Excel and get familiar with the attributes and contents of the data. As before, we will load this data file into OpenRefine and create a new project. Once the data has been loaded, we'll do a cursory check on the columns and change the character encoding to UTF-8. If needed, we could also change options such as column separators or add a missing header row. In our case, we'll keep the default settings.
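Before moving on, the stocks step above, building a per-row URL and extracting the price from the XML response, can be sketched in Python. The base URL, tag names, and payload below are placeholders, not the actual service used in the session, and a real service will define its own schema.

```python
import xml.etree.ElementTree as ET

# Per-row URL construction, analogous to a GREL expression like
# "https://example.com/quote?symbol=" + value
# (the base URL here is a placeholder for the real service endpoint).
def quote_url(ticker: str) -> str:
    return "https://example.com/quote?symbol=" + ticker

# Hypothetical XML payload, standing in for the fetched column value.
response = "<quote><symbol>IBM</symbol><price>142.50</price></quote>"

# Equivalent of parsing the stored XML response and selecting the
# price element, then converting it to a number.
price = float(ET.fromstring(response).findtext("price"))
```

In OpenRefine itself the same extraction is done with a GREL expression on the fetched column rather than in Python, but the shape of the operation is the same: parse the XML, select the element, convert to a number.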
So let's give the clean-up project a name and click the Create Project button. This creates a new OpenRefine project with a copy of the original data set. The next step in the cleanup process is to visually inspect the data and check each column for errors, formatting issues, or missing values. We can perform common operations such as removing whitespace from text. For this, let's click on a column header, say contactFirstName, and choose Edit cells → Common transforms → Trim leading and trailing whitespace. We can perform the same operation on multiple columns. We can also do basic concatenation, as in a spreadsheet. Let's say we need to combine the first and last names of the contact into one column. For this we'll click Edit column → Add column based on this column. In the expression box, we will take the current value of contactFirstName and add a space followed by the value of contactLastName. Let's check the results, assign a name to the new column, such as contactFullName, and hit OK. The newly created column has been added to our data set. A nice thing about OpenRefine is that you can undo your past operations, so if you make a mistake you can go back to a previous step and correct it. This makes it easy to make incremental changes without spending time saving multiple copies of the data. Let's go ahead and remove the column that we just created. Just like with spreadsheet applications, we can also perform other simple column-based operations, such as changing the case of the text to title, upper, or lowercase. Faceting is a powerful feature in many data cleaning tools. A facet summarizes the cells in a column to give you a bigger picture of that column. It also allows filtering down to just the subset of rows that you want to change in bulk. Let's see how this works. Let's create a text facet on the country column by clicking on the column's dropdown menu and invoking Facet → Text facet.
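The trim-and-concatenate steps above can be sketched in Python as follows. The sample row is made up for illustration; the column names match the ones used in the walkthrough.

```python
# One hypothetical row from the customers data set.
row = {"contactFirstName": "  Ada ", "contactLastName": "Lovelace"}

# Edit cells -> Common transforms -> Trim leading and trailing whitespace
first = row["contactFirstName"].strip()
last = row["contactLastName"].strip()

# Edit column -> Add column based on this column, with an expression
# equivalent to: value + " " + cells["contactLastName"].value
row["contactFullName"] = first + " " + last
```

In OpenRefine the concatenation is written as a GREL expression and applied to every row at once; the undo history then lets you remove the derived column with a single step if you change your mind.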
We can see that the country names are not consistent. For instance, United States is spelled as US, USA, or United States, with some extra characters. Also, Netherlands is spelled both with and without the determiner "the," and France is spelled with uppercase as well as lowercase characters. Clustering allows us to automatically group and edit different but similar values. Let's click the Cluster button on the left. We now see all the similar values grouped together and can perform a merge operation. Let's merge all text referring to the United States as USA, United Kingdom as UK, and so on. The Merge Selected & Re-Cluster button confirms the operation. There are no more obvious clusters left. However, we still haven't resolved the issue with Netherlands. For this we will choose another clustering method, nearest neighbor, which uses a distance metric. The country column is now fixed, and there are no more clusters to be found. However, we still have one error: the country Norway has a stray space character. See if you can find a way to fix this error. Let's now use the tool to clean up a numerical column, the credit limit. We'll first apply a numeric transformation to the column so that all the numbers are recognized. Some values in this column cannot be parsed into numbers because of the presence of alphanumeric values, such as "USD" or the dollar symbol. We can correct this by applying a numeric facet. On the left we see a chart with the distribution of the data. Let's filter down to the non-numeric values only and change those values from the expression box. This screen lets us view the original values as well as the outcomes when the expression is applied. We will invoke the replace function to substitute the unwanted characters with blanks. So let's remove the "USD", dollar, and comma symbols. As you can see, we can make multiple function calls in a single line.
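The chained replace calls work the same way in Python as in GREL, so here is a small sketch of that expression. The sample value is invented, but the substitutions are exactly the ones described above.

```python
# A hypothetical unparseable credit-limit value from the messy data.
value = "USD$1,250.00"

# Mirrors the GREL expression
#   value.replace("USD", "").replace("$", "").replace(",", "")
# Each call returns a new string, so the calls chain left to right.
cleaned = value.replace("USD", "").replace("$", "").replace(",", "")
```

Note that `cleaned` is still a string at this point; as the next step shows, it must be explicitly converted before the tool treats it as a number.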
This is also known as function chaining. Now all the alphabetic characters have been removed, but when we click OK, the tool still does not see the values as numbers. For this we need to invoke the toNumber function to convert the values within the facet into numerical values. So far we have tested these operations on a numeric facet. To apply these changes to the data set in the project, we will copy this expression and apply it as a column transform (Edit cells → Transform). This ensures that all the values in the credit limit column have been converted into numbers. Once we're done cleaning up the rest of the data, we are ready to export the file in a format of our choosing. For this example, let's export it as a CSV, or comma-separated values, file and open it to view the results. All the columns now have clean data. The data set is now ready for storage in a database or for further analysis.
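To round out the example, the final conversion and export can be sketched in Python as well. The rows and column names here are illustrative stand-ins for the cleaned customers data.

```python
import csv
import io

# Hypothetical cleaned rows; creditLimit values are still strings,
# just as they are after the replace() chain in the facet.
rows = [
    {"customerName": "Acme Corp", "creditLimit": "1250.00"},
    {"customerName": "Globex",    "creditLimit": "98000"},
]

# Equivalent of applying toNumber() as a column transform.
for r in rows:
    r["creditLimit"] = float(r["creditLimit"])

# Export the cleaned table as CSV (written to a string buffer here;
# a real pipeline would write to a file instead).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["customerName", "creditLimit"])
writer.writeheader()
writer.writerows(rows)
```

Once exported, the CSV can be loaded into a database or handed to the next tool in the pipeline, which is exactly the hand-off described above.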