So let's start the OpenRefine desktop application, which opens a new browser window and runs on a default port on the local machine.

We can create a new project, open or import an existing project, or just change the language settings if needed. Let's download the stock start zip file and browse to the file from within the OpenRefine Create Project menu, then click Next. This imports the contents of the file into application memory. Note that the tool can work with both compressed and uncompressed files.

Since the stock data is in the semi-structured JSON format, we will select the root node of the JSON tree, which transforms the data into a tabular format with rows and columns. We can choose to preserve empty strings and perform basic cleanup, such as trimming whitespace and parsing numerical values. These operations can also be performed at a later stage.

We will provide a project name and proceed with project creation. Once the project has been created, we can click All → Edit columns and remove some of the unwanted columns from the data set. This helps reduce memory usage, especially for large data sets.
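Outside OpenRefine, the import step above (flatten each JSON record into a row, trim whitespace, parse numeric-looking strings) can be sketched in a few lines of Python. The record fields below are invented for illustration; the actual stock file's schema isn't shown on screen.

```python
import json

def records_to_rows(raw_json, columns):
    """Flatten a JSON array of records into rows, trimming whitespace
    and parsing numeric-looking strings, much like OpenRefine's
    import options do."""
    def clean(value):
        if isinstance(value, str):
            value = value.strip()  # trim leading/trailing whitespace
            try:
                return float(value) if "." in value else int(value)
            except ValueError:
                return value  # leave non-numeric strings as text
        return value
    return [[clean(record.get(col)) for col in columns]
            for record in json.loads(raw_json)]

# Hypothetical stock records -- field names are made up for this sketch.
raw = '[{"ticker": " IBM ", "close": "143.50", "volume": "4100000"}]'
rows = records_to_rows(raw, ["ticker", "close", "volume"])
```

Keeping only the columns you pass in also mirrors the column-removal step: fewer columns means less data held in memory.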
We'll also go ahead and reorder some of the columns in the table. Now the data set is lighter, with only those columns that will potentially be used for further analysis. Just as in a spreadsheet, we can sort individual columns; here we'll sort the volume column in descending order.

Financial data sets, such as stock prices, are constantly changing. What if our data is outdated and we want to fetch the latest data from the stock market? Using OpenRefine, one could augment the data using online web services. However, calling some of these web services can be very time-consuming, so to demonstrate this feature let's work with a smaller subset of the stocks data and load it into a new project. Look for the file stocks_demo.csv in the session folder and load it into OpenRefine. We will also create a new OpenRefine project and call it StocksDemo.

On the ticker column menu, click Edit column → Add column by fetching URLs. This opens an expression box where we can enter the URL of an online web service to be invoked to augment the data. This online service provides an API that returns slightly delayed stock prices.
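In the expression box, the fetch URL is built per row by appending the cell value to a base URL (in GREL this is typically something like `"https://…?symbol=" + value`). Here is the same idea in plain Python; the endpoint is made up, since the actual service URL isn't shown on screen.

```python
# Hypothetical quote endpoint -- the real service URL is not shown in the video.
BASE_URL = "https://quotes.example.com/api"

def quote_url(ticker):
    """Build the per-row fetch URL, as OpenRefine's
    'Add column by fetching URLs' expression does."""
    return BASE_URL + "?symbol=" + ticker.strip()

urls = [quote_url(t) for t in ["IBM", "AAPL"]]
```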
A good practice is to test the web service from a browser client with a known ticker symbol and validate the service response.

We will add this URL and append the ticker symbol as a variable parameter for each row. To do this, copy the URL into the expression box and edit the expression as shown here. Now the URL is different for each row, with the ticker symbol as the symbol parameter. We'll give the new column a name and click OK. This could take a few minutes to run, depending on the speed of the web service.

We now have a new column with the latest stock data. Notice that the results are in XML format and need to be parsed. Please refer to the OpenRefine tutorials for help with parsing XML and JSON results. For now, here is the syntax for extracting the stock price from the XML document.

After a quick review of the results, we are ready to export the data to another data processing tool in our pipeline.

In the next example, we will use OpenRefine to clean up a sample messy data file. So let's download the Customers-messy.zip file from the session folder and store it on the local computer.
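For the stock-price step above, pulling the price out of each XML response looks roughly like the following Python sketch (in OpenRefine itself you would use a GREL expression built on `value.parseXml()`). The response shape here is hypothetical, since the service's actual XML isn't reproduced in the video.

```python
import xml.etree.ElementTree as ET

def extract_price(xml_text):
    """Pull the stock price out of an XML quote response."""
    return float(ET.fromstring(xml_text).findtext("price"))

# Hypothetical response shape -- the real service's XML is not shown.
response = "<quote><symbol>IBM</symbol><price>143.50</price></quote>"
price = extract_price(response)
```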
The first step is to visually inspect the raw data file using a text editor or Excel and get familiar with the attributes and contents of the data. As before, we will load this data file into OpenRefine and create a new project. Once the data has been loaded, we'll do a cursory check on the columns and change the character encoding to UTF-8. If needed, we could also change options such as column separators, or add a missing header row; in our case, we'll keep the default settings. So let's give the clean-up project a name and click the Create Project button. This creates a new OpenRefine project with a copy of the original data set.

The next step in the cleanup process is to visually inspect the data and check each column for errors, formatting issues, or missing values. We can perform common operations such as removing whitespace from text. For this, let's click on a column header, say contactFirstName, and choose Edit cells → Common transforms → Trim leading and trailing whitespace. We can perform the same operation on multiple columns.

We could also do basic concatenation operations, as in a spreadsheet. Let's say we need to combine the first and last names of the contact into one column.
For this we'll click Edit column → Add column based on this column. In the expression box, we will take the current value of contactFirstName and add a space followed by the value of contactLastName. Let's check the results, assign a name to the new column, such as contactFullName, and hit OK. The newly created column has been added to our data set.

A nice thing about OpenRefine is that you can undo your past operations, so if you made a mistake you can go back to a previous step and correct it. This makes it really easy to make incremental changes without losing time saving multiple copies of the data. Let's go ahead and remove the column that we just created.

Just as with spreadsheet applications, we could also perform other simple, column-based operations, such as changing the case of the text to title case, uppercase, or lowercase.

Faceting is a powerful feature in many data cleaning tools. A facet summarizes the cells in a column to give you a bigger picture of that column. It also allows for filtering down to just the subset of rows that you want to change in bulk. Let's see how this works.
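The concatenation step above (in GREL, roughly `cells["contactFirstName"].value + " " + cells["contactLastName"].value`) mirrors this simple Python version; the sample names are invented for illustration.

```python
def full_name(row):
    """Combine first and last name into one field, trimming stray
    whitespace first -- the same idea as the 'Add column based on
    this column' expression."""
    return row["contactFirstName"].strip() + " " + row["contactLastName"].strip()

# Hypothetical row from the customers data set.
row = {"contactFirstName": " Diane ", "contactLastName": "Murphy"}
name = full_name(row)
```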
Let's create a text facet on the country column by clicking the column's dropdown menu and invoking Facet → Text facet. We can see that the country names are not consistent. For instance, United States is spelled as US, USA, or United States, sometimes with extra characters. Also, Netherlands is spelled both with and without the determiner "the," and France is spelled using both uppercase and lowercase characters.

Clustering allows us to automatically group and edit different but similar values. Let's click the Cluster command on the left frame. We now have the opportunity to see all the similar values grouped together and perform a merge operation. Let's merge all text referring to the United States as USA, United Kingdom as UK, and so on. The Merge Selected & Re-Cluster button confirms the operation.

There are no more obvious clusters left. However, we still haven't resolved the issue with Netherlands. For this we will choose another clustering method, one that uses a distance metric. The country column is now fixed, and there are no more clusters to be found. However, we still have one error: the country Norway has an extra space character. See if we can find a way to fix this error.
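To see what the clustering step is doing under the hood, here is a simplified key-collision sketch in Python: values that reduce to the same "fingerprint" (lowercased, punctuation and extra whitespace removed) are grouped as one cluster. This is only an approximation of OpenRefine's actual fingerprint method, and the country values are sample data; note that it also catches the stray space in "Norway ".

```python
from collections import defaultdict

def fingerprint(value):
    """Simplified key-collision fingerprint: lowercase, drop punctuation,
    and sort the remaining tokens."""
    tokens = "".join(c for c in value.lower()
                     if c.isalnum() or c.isspace()).split()
    return " ".join(sorted(set(tokens)))

def cluster(values):
    """Group values whose fingerprints collide; keep only groups that
    contain more than one distinct spelling."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [vs for vs in groups.values() if len(set(vs)) > 1]

clusters = cluster(["USA", "usa", "U.S.A.", "Norway", "Norway ", "France"])
```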
Let's now go ahead and use the tool to clean up a numerical column, such as the credit limit. We'll first apply a numerical transformation on the column so that all the numbers are recognized. Some values in this column cannot be parsed into numbers because of the presence of alphanumeric values, such as "USD" or the dollar symbol. We can address this by applying the Numeric facet. On the left we see a chart with the distribution of the data. Let's filter down to the non-numeric values only and change the values of the fields from the expression box. This screen allows us to view the original values as well as the outcomes when the expression is applied.

We will invoke the replace function to substitute the unwanted characters with blanks. So let's remove the "USD", dollar, and comma symbols. As you can see, we can make multiple function calls in a single line; this is also known as function chaining. Now all the alphabetic characters have been removed. When we click OK, however, the tool still does not see the values as numbers. For this we will need to invoke the toNumber function to convert the values within the facet into numerical values.

So far we have tested these operations on a numeric facet.
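The chained cleanup just described (in GREL, along the lines of `value.replace("USD", "").replace("$", "").replace(",", "").toNumber()`) can be mirrored in Python like this; the credit-limit values are made up for the example.

```python
def to_credit_limit(value):
    """Strip 'USD', '$', and ',' from a cell, then convert it to a
    number -- mirroring the chained replace calls plus toNumber."""
    cleaned = value.replace("USD", "").replace("$", "").replace(",", "").strip()
    return float(cleaned)

# Sample messy values, invented for illustration.
limits = [to_credit_limit(v) for v in ["USD 117,300.00", "$96,800", "21000"]]
```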
To apply these changes to the whole data set in the project, we will copy this expression and apply it as a column transform. This ensures that all the values in the credit limit column have been converted into numbers.

Once we're done cleaning up the rest of the data, we are ready to export the file in a format of our choosing. For this example, let's export it as a CSV (comma-separated values) file and open it to view the results. All the columns now have clean data, and the data set is now ready for storage in a database or for further analysis.
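For reference, the export step produces an ordinary CSV that any downstream tool can read; a minimal Python equivalent, with sample cleaned data, looks like this.

```python
import csv
import io

def export_csv(columns, rows):
    """Write the cleaned rows out as CSV text, like OpenRefine's
    CSV export."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(columns)   # header row first
    writer.writerows(rows)
    return buffer.getvalue()

text = export_csv(["country", "creditLimit"], [["USA", 117300.0]])
```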