
vimeo.com/.../508797388

  • 0:06 - 0:08
    So let's start the
    OpenRefine desktop
  • 0:08 - 0:11
    application, which opens up
    a new browser window and
  • 0:11 - 0:15
    runs on a default port on
    the local machine.
  • 0:16 - 0:20
    We can create a new
    project, open or import an
  • 0:20 - 0:22
    existing project,
  • 0:22 - 0:25
    or just change the language
    settings if needed.
  • 0:25 - 0:29
    Let's download the stocks data zip
    file and browse to the file
  • 0:29 - 0:31
    from within the OpenRefine
    Create Project menu.
  • 0:32 - 0:34
    And click next.
  • 0:35 - 0:38
    This imports the contents of
    the file into application
  • 0:38 - 0:41
    memory. Note that the tool
    can work both with compressed
  • 0:41 - 0:43
    and uncompressed files.
  • 0:44 - 0:48
    Since the stock data is in the
    semi-structured JSON format, we
  • 0:48 - 0:52
    will select the root node of the
    JSON tree, which transforms the
  • 0:52 - 0:55
    data into a tabular format with
    rows and columns.
  • 0:56 - 0:58
    We can choose to preserve empty strings
  • 0:58 - 1:00
    and perform basic cleanup such as
  • 1:00 - 1:03
    trimming whitespace and
    parsing numeric values.
  • 1:08 - 1:11
    These operations can also be
    performed at a later stage.
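The import steps above (flattening the JSON tree into rows, trimming whitespace, parsing numbers) can be sketched in plain Python; the records and field names below are made up for illustration:

```python
import json

# Hypothetical stock records, mimicking the semi-structured JSON
# that OpenRefine flattens when you pick the root node of the tree.
raw = '''[{"ticker": " AAPL ", "close": "150.25", "volume": "1000"},
          {"ticker": "MSFT",  "close": "310.10", "volume": "2000"}]'''

def clean_record(rec):
    """Trim whitespace and parse numeric strings, like the
    import-time options shown in the video."""
    out = {}
    for key, value in rec.items():
        if isinstance(value, str):
            value = value.strip()
            try:
                value = float(value) if "." in value else int(value)
            except ValueError:
                pass  # keep non-numeric strings as-is
        out[key] = value
    return out

rows = [clean_record(r) for r in json.loads(raw)]
print(rows[0])  # {'ticker': 'AAPL', 'close': 150.25, 'volume': 1000}
```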
  • 1:13 - 1:16
    We will provide a project
    name and proceed with
  • 1:16 - 1:17
    project creation.
  • 1:23 - 1:25
    Once the project has
    been created,
  • 1:26 - 1:30
    we can click All → Edit columns
    and remove some of the unwanted
  • 1:30 - 1:32
    columns from the data set.
  • 1:33 - 1:36
    This helps reduce the memory
    usage, especially for large
  • 1:36 - 1:40
    data sets. We'll also go
    ahead and reorder some of
  • 1:40 - 1:41
    the columns in the table.
  • 1:50 - 1:53
    Now the data set is lighter
    with only those columns that
  • 1:53 - 1:56
    will be used potentially for
    further analysis. Just like
  • 1:56 - 1:59
    in a spreadsheet, we could
    choose to sort different
  • 1:59 - 2:03
    columns. Here we'll go ahead
    and sort the volume column
  • 2:03 - 2:04
    in descending order.
  • 2:16 - 2:19
    Financial data sets, such as
    stock prices, are constantly
  • 2:19 - 2:23
    changing. What if our data is
    outdated, and we wanted to fetch
  • 2:23 - 2:26
    the latest data from the stock
    market? Using OpenRefine, one
  • 2:26 - 2:29
    could augment the data using
    online web services.
  • 2:30 - 2:33
    However, calling some of these
    web services could be very
  • 2:33 - 2:35
    time-consuming. So to
    demonstrate this feature,
  • 2:35 - 2:39
    let's work with a smaller
    subset of the stocks data and
  • 2:39 - 2:41
    load it into a new project.
  • 2:41 - 2:45
    Look for the file stocks_demo.csv in the
    session folder,
  • 2:45 - 2:47
    and load it in OpenRefine.
  • 2:50 - 2:53
    We will also create a new
    OpenRefine project
  • 2:53 - 2:55
    and call it StocksDemo.
  • 3:00 - 3:05
    On the ticker column menu, click
    Edit column → Add columns by
  • 3:05 - 3:09
    fetching URLs. This opens up an
    expression box where we could
  • 3:09 - 3:13
    enter the URL of an online web
    service which could be invoked
  • 3:13 - 3:15
    to augment the data.
  • 3:16 - 3:20
    This online service provides an
    API that returns slightly
  • 3:20 - 3:21
    delayed stock prices.
  • 3:22 - 3:25
    A good practice is to test
    the web service from a
  • 3:25 - 3:29
    browser client with a known
    ticker symbol and validate
  • 3:29 - 3:30
    the service response.
  • 3:35 - 3:39
    We will add this URL and append
    the ticker symbol as a variable
  • 3:39 - 3:40
    parameter for each row.
  • 3:41 - 3:44
    To do this, copy the URL into
  • 3:44 - 3:48
    the expression box, and insert
    quotes as shown here.
  • 3:57 - 4:00
    Now the URL is different for
    each row with the ticker symbol
  • 4:00 - 4:01
    as the symbol parameter.
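Building a per-row URL with the ticker as a query parameter can be sketched as follows; the endpoint is a placeholder, not the actual service used in the video:

```python
from urllib.parse import urlencode

# Hypothetical quote endpoint -- the real service URL from the
# video is not reproduced here.
BASE_URL = "https://example.com/quote"

def build_url(ticker):
    """Append the ticker symbol as a query parameter, mirroring
    the expression used in Add columns by fetching URLs."""
    return BASE_URL + "?" + urlencode({"symbol": ticker})

for ticker in ["AAPL", "MSFT"]:
    print(build_url(ticker))
# https://example.com/quote?symbol=AAPL
# https://example.com/quote?symbol=MSFT
```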
  • 4:04 - 4:06
    We'll give a new name to the
    column and click OK.
  • 4:08 - 4:11
    This could take a few minutes
    to run depending on the speed
  • 4:11 - 4:13
    of the web service.
  • 4:17 - 4:20
    We now have a new column with
    the latest stock data.
  • 4:21 - 4:24
    Notice that the results are in
    the XML format and need to be
  • 4:24 - 4:29
    parsed. Please refer to the
    OpenRefine tutorials for some
  • 4:29 - 4:31
    help with parsing XML and JSON
    results.
  • 4:31 - 4:35
    For now, here is the
    syntax for extracting the stock
  • 4:35 - 4:37
    price from the XML document.
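For illustration, here is how such a price could be pulled out of an XML response in Python; the element names are hypothetical and the real service's schema may differ:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML response -- element names are made up for
# illustration only.
response = """<quote>
  <symbol>AAPL</symbol>
  <price>150.25</price>
</quote>"""

root = ET.fromstring(response)
price = float(root.findtext("price"))
print(price)  # 150.25
```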
  • 4:57 - 5:01
    After a quick review of the
    results, we are ready to export
  • 5:01 - 5:04
    the data to another data
    processing tool in our pipeline.
  • 5:08 - 5:12
    In the next example we will use
    OpenRefine for cleaning up a
  • 5:12 - 5:13
    sample messy data file.
  • 5:16 - 5:20
    So let's download the
    Customers-messy.zip file
  • 5:20 - 5:22
    from the session folder and store it on
  • 5:22 - 5:26
    the local computer. The first
    step is to visually inspect
  • 5:26 - 5:30
    the raw data file using a text
    editor or Excel and get
  • 5:30 - 5:33
    familiar with the attributes
    and contents of the data.
  • 5:35 - 5:38
    As before, we will load this
    data file into OpenRefine and
  • 5:38 - 5:40
    create a new project.
  • 5:42 - 5:45
    Once the data has been
    loaded, we'll do a cursory
  • 5:45 - 5:47
    check on the columns,
  • 5:49 - 5:52
    and change the character
    encoding to UTF-8.
  • 5:53 - 5:56
    If needed, we could also change
    options such as column
  • 5:56 - 5:58
    separators or add a missing
  • 5:58 - 6:02
    header row. In our case, we'll
    keep the default settings.
  • 6:03 - 6:06
    So let's give the clean-up
    project a name and click on
  • 6:06 - 6:08
    the Create Project button.
  • 6:10 - 6:13
    This creates a new OpenRefine
    project with a copy of the
  • 6:13 - 6:18
    original data set. The next step
    in the cleanup process is to
  • 6:18 - 6:21
    visually inspect the data and
    check each column for errors,
  • 6:21 - 6:24
    formatting issues, or missing values.
  • 6:24 - 6:27
    We can perform common
    operations like removing
  • 6:27 - 6:31
    whitespace from text. For this,
    let's click on a column header.
  • 6:31 - 6:33
    Let's say contactFirstName, then
  • 6:33 - 6:37
    Edit cells → Trim leading and
  • 6:37 - 6:40
    trailing whitespace. We can
    perform the same operation
  • 6:40 - 6:42
    on multiple columns.
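The trim operation applied to several columns is equivalent to the following sketch (column names follow the data set, row values are made up):

```python
# Trimming leading and trailing whitespace across several columns,
# like Edit cells -> Trim leading and trailing whitespace applied
# to each column in turn.
rows = [{"contactFirstName": "  Jean ", "contactLastName": " King  "}]
columns = ["contactFirstName", "contactLastName"]

for row in rows:
    for col in columns:
        row[col] = row[col].strip()

print(rows[0])  # {'contactFirstName': 'Jean', 'contactLastName': 'King'}
```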
  • 6:43 - 6:46
    We could also do basic
    concatenation operations like
  • 6:46 - 6:49
    in a spreadsheet. Let's say we
    need to combine the first and
  • 6:49 - 6:53
    last names of the contact into
    one column. For this we'll
  • 6:53 - 6:55
    click on Edit column →
  • 6:55 - 6:58
    Add column based on this column.
  • 6:59 - 7:02
    In the expression box, we
    will take the current value
  • 7:02 - 7:04
    of the contactFirstName.
  • 7:04 - 7:08
    Add a space, followed by the
    value of contactLastName.
  • 7:16 - 7:18
    Let's check to see the results
  • 7:19 - 7:22
    and assign a name to the new
    column, such as
  • 7:22 - 7:24
    contactFullName. And hit OK.
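The concatenation corresponds to a GREL expression along the lines of `value + " " + cells["contactLastName"].value`; the same idea in Python, with a made-up row:

```python
# Building contactFullName by concatenating the first and last
# name columns with a space in between.
rows = [{"contactFirstName": "Jean", "contactLastName": "King"}]

for row in rows:
    row["contactFullName"] = (
        row["contactFirstName"] + " " + row["contactLastName"]
    )

print(rows[0]["contactFullName"])  # Jean King
```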
  • 7:30 - 7:33
    The newly-created column has
    been added to our data set.
  • 7:34 - 7:37
    A nice thing about OpenRefine is
    that you can undo your past
  • 7:37 - 7:41
    operations, so if you made a
    mistake you can go back to a
  • 7:41 - 7:44
    previous step and correct it.
    This makes it really easy to
  • 7:44 - 7:47
    make incremental changes without
    losing time over
  • 7:47 - 7:50
    saving multiple copies of the data.
  • 7:53 - 7:56
    Let's go ahead and remove the
    column that we just created.
  • 7:57 - 8:00
    Just like with spreadsheet
    applications, we could also
  • 8:00 - 8:03
    perform other simple, column-based
    operations such as
  • 8:03 - 8:07
    changing the case of the text to
    title, upper, or lowercase.
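These case transforms map directly onto standard string operations, e.g. in Python:

```python
# Title, upper, and lower case transforms, like the common
# transforms in OpenRefine's Edit cells menu.
name = "jean king"
print(name.title())  # Jean King
print(name.upper())  # JEAN KING
print(name.lower())  # jean king
```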
  • 8:18 - 8:21
    Faceting is a powerful feature
    in many data cleaning tools. The
  • 8:21 - 8:26
    facet summarizes the cells in
    that column to give you a bigger
  • 8:26 - 8:29
    picture of its contents. It also
    allows for filtering down to
  • 8:29 - 8:33
    just a subset of the rows that
    you want to change in bulk.
  • 8:33 - 8:35
    Let's see how this works.
  • 8:35 - 8:38
    Let's create a text facet
    on the country column
  • 8:38 - 8:40
    by clicking on the column's
    dropdown menu and
  • 8:40 - 8:44
    invoking Facet → Text facet. We
    can see that
  • 8:44 - 8:46
    the country names
    are not consistent.
  • 8:47 - 8:51
    For instance, United States
    is spelled as US, USA, or
  • 8:51 - 8:54
    United States, with some
    extra characters.
  • 8:56 - 9:00
    Also, Netherlands is spelled
    both with and without the
  • 9:00 - 9:04
    determiner "the." France is
    spelled in uppercase as
  • 9:04 - 9:05
    well as lowercase characters.
  • 9:07 - 9:10
    Clustering allows us to
    automatically group and edit
  • 9:10 - 9:13
    different but similar values.
    Let's click on the Cluster
  • 9:13 - 9:15
    command on the left frame.
  • 9:16 - 9:19
    We now have the opportunity to
    see all the similar words
  • 9:19 - 9:21
    grouped together and perform a
  • 9:21 - 9:25
    merge operation. Let's merge all
    text referring to the United
  • 9:25 - 9:29
    States as USA, United Kingdom as
    UK, and so on.
  • 9:33 - 9:36
    The Merge and Re-Cluster
    button confirms the
  • 9:36 - 9:39
    operation. There are no more
    obvious clusters that are
  • 9:39 - 9:42
    left. However, we still
    haven't resolved the issue
  • 9:42 - 9:45
    with Netherlands. For this
    we will choose another
  • 9:45 - 9:47
    clustering operation which
    uses a distance metric.
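As a toy illustration of distance-based clustering (OpenRefine's own methods, such as key collision and nearest neighbour, are more sophisticated), similar values can be grouped with a string-similarity threshold:

```python
from difflib import SequenceMatcher

# Group values whose similarity exceeds a threshold, in the spirit
# of OpenRefine's distance-based clustering. Values are made up.
values = ["Netherlands", "The Netherlands", "Norway", "France", "france"]

def similar(a, b, threshold=0.75):
    """Case-insensitive similarity ratio between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

clusters = []
for v in values:
    for cluster in clusters:
        if similar(v, cluster[0]):
            cluster.append(v)
            break
    else:
        clusters.append([v])

print(clusters)
```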
  • 9:56 - 9:59
    The country column is now fixed,
    and there are no more
  • 9:59 - 10:02
    clusters to be found. However,
    we still have one error.
  • 10:02 - 10:06
    The value for Norway contains a
    stray space character. See if we can find
  • 10:06 - 10:08
    a way to fix this error.
  • 10:10 - 10:14
    Let's now go ahead and use the
    tool to clean up a numerical
  • 10:14 - 10:17
    column like the credit limit.
    We'll first apply a numerical
  • 10:17 - 10:21
    transformation on the column so
    that all the numbers are
  • 10:21 - 10:25
    recognized. Some values in this
    column cannot be parsed into
  • 10:25 - 10:28
    numbers. This is because of the
    presence of alphanumeric values,
  • 10:28 - 10:30
    such as the USD or the dollar
  • 10:30 - 10:34
    symbol. We can correct this by
    applying the Numeric facet.
  • 10:35 - 10:38
    On the left we see a chart with
    the distribution of the data.
  • 10:39 - 10:43
    Let's filter down to the non-numeric
    values only and change
  • 10:43 - 10:46
    the values of the fields from
    the expression box.
  • 10:53 - 10:56
    This screen allows us to
    view the original values as
  • 10:56 - 10:58
    well as the outcomes when
    the expression is applied.
  • 11:03 - 11:06
    We will invoke the replace
    function to replace or
  • 11:06 - 11:09
    substitute the unwanted
    characters with blanks. So let's
  • 11:09 - 11:13
    remove the USD, the dollar, and
    the comma symbols.
  • 11:14 - 11:17
    As you can see, we can make
    multiple function calls in a
  • 11:17 - 11:19
    single line. This is also known
  • 11:19 - 11:23
    as function chaining. Now
    all the alphabetic characters
  • 11:23 - 11:24
    have been removed.
  • 11:27 - 11:31
    When we click OK, the tool still
    does not see them as numbers.
  • 11:31 - 11:35
    For this we will need to invoke
    the toNumber function to
  • 11:35 - 11:38
    convert the values within the
    facet into numerical values.
  • 11:48 - 11:52
    So far we have tested these
    operations on a numeric facet.
  • 11:52 - 11:56
    To apply these changes to the
    data set in the project we will
  • 11:56 - 11:59
    copy this expression and apply it
    as a column transform. This
  • 11:59 - 12:02
    ensures that all the values in
    the credit limit column have
  • 12:02 - 12:04
    been converted into numbers.
  • 12:13 - 12:17
    Once we're done with cleaning up
    the rest of the data, we are
  • 12:17 - 12:20
    ready to export the file into a
    format of our choosing. For this
  • 12:20 - 12:24
    example, let's export it as a
    CSV or a comma-separated file
  • 12:24 - 12:26
    and open it to view the results.
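The CSV export step can be sketched with the standard library; the rows are illustrative, and an in-memory buffer stands in for the exported file:

```python
import csv
import io

# Write cleaned rows out as CSV, the export format chosen here.
rows = [{"customerName": "Atelier graphique", "country": "France"},
        {"customerName": "Signal Gift Stores", "country": "USA"}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["customerName", "country"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```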
  • 12:30 - 12:34
    All the columns now have clean
    data. The data set is now
  • 12:34 - 12:38
    ready for storage in a
    database or for further analysis.
Title:
vimeo.com/.../508797388
Video Language:
English
Duration:
12:41
