So let's start the OpenRefine desktop application, which opens up a new browser window and runs on a default port on the local machine.

We can create a new project, open or import an existing project, or just change the language settings if needed. Let's download the stock start zip file and browse to the file from within the OpenRefine Create Project menu. And click Next.

This imports the contents of the file into application memory. Note that the tool can work with both compressed and uncompressed files.

Since the stock data is in the semi-structured JSON format, we will select the root node of the JSON tree, which transforms the data into a tabular format with rows and columns. We can choose to preserve empty strings and perform basic cleanup, such as trimming whitespace and parsing numerical values. These operations can also be performed at a later stage.

We will provide a project name and proceed with project creation.

Once the project has been created, we can click All → Edit columns and remove some of the unwanted columns from the data set. This helps reduce memory usage, especially for large data sets. We'll also go ahead and reorder some of the columns in the table.

Now the data set is lighter, with only those columns that will potentially be used for further analysis. Just like in a spreadsheet, we could choose to sort different columns. Here we'll go ahead and sort the volume column in descending order.

Financial data sets, such as stock prices, are constantly changing. What if our data is outdated and we want to fetch the latest data from the stock market? Using OpenRefine, we can augment the data using online web services. However, calling some of these web services can be very time-consuming, so to demonstrate this feature, let's work with a smaller subset of the stocks data and load it into a new project. Look for the file stocks_demo.csv in the session folder and load it in OpenRefine.

We will also create a new OpenRefine project and call it StocksDemo.
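Before we build URLs from this data, it helps to picture its shape. A minimal, purely hypothetical stocks_demo.csv might look like this (the actual session file may contain different columns; what matters for the next step is that there is a ticker column):

    ticker,company
    AAPL,Apple Inc.
    MSFT,Microsoft Corp.

Each row's ticker value is what we will splice into the web service URL.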
On the ticker column menu, click Edit column → Add column by fetching URLs. This opens up an expression box where we can enter the URL of an online web service that can be invoked to augment the data. This online service provides an API that returns slightly delayed stock prices.

A good practice is to test the web service from a browser client with a known ticker symbol and validate the service response.

We will add this URL and append the ticker symbol as a variable parameter for each row. To do this, copy the URL into the expression box and insert the code as shown here.

Now the URL is different for each row, with the ticker symbol as the symbol parameter. We'll give a new name to the column and click OK. This could take a few minutes to run, depending on the speed of the web service.

We now have a new column with the latest stock data. Notice that the results are in the XML format and need to be parsed. Please refer to the OpenRefine tutorials for some help with parsing XML and JSON results. For now, here is the syntax for extracting the stock price from the XML document; both expressions from this example are recapped below.

After a quick review of the results, we are ready to export the data to another data processing tool in our pipeline.

In the next example, we will use OpenRefine to clean up a sample messy data file. So let's download the Customers-messy.zip file from the session folder and store it on the local computer. The first step is to visually inspect the raw data file using a text editor or Excel and get familiar with the attributes and contents of the data.

As before, we will load this data file into OpenRefine and create a new project. Once the data has been loaded, we'll do a cursory check on the columns and change the character encoding to UTF-8. If needed, we could also change options such as column separators or add a missing header row. In our case, we'll keep the default settings.

So let's give the clean-up project a name and click on the Create Project button. This creates a new OpenRefine project with a copy of the original data set.
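Before moving on to the cleanup steps, here is the promised recap of the two expressions from the stocks example. Both are sketches: the endpoint URL is hypothetical, and the exact parsing functions vary by OpenRefine version, so check the GREL reference rather than treating these as exact syntax. Building the per-row request URL from the ticker column:

    "https://example-quotes.com/api?symbol=" + value

Extracting the price from the fetched result, assuming the XML response contains a price element:

    value.parseXml().select("price")[0].ownText()

Here, value is the cell the expression runs on, and parseXml, select, and ownText are GREL functions.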
The next step in the cleanup process is to visually inspect the data and check each column for errors, formatting issues, or missing values.

We can perform common operations like removing whitespace from text. For this, let's click on a column header, say contactFirstName, and choose Edit cells → Trim leading and trailing whitespace. We can perform the same operation on multiple columns.

We could also do basic concatenation operations, like in a spreadsheet. Let's say we need to combine the first and last names of the contact into one column. For this, we'll click on Edit column → Add column based on this column. In the expression box, we will take the current value of contactFirstName and add a space followed by the value of the contact's last name.

Let's check the results, assign a name to the new column, such as contactFullName, and hit OK. The newly created column has been added to our data set.

A nice thing about OpenRefine is that you can undo your past operations, so if you made a mistake you can go back to a previous step and correct it. This makes it really easy to make incremental changes without losing time over saving multiple copies of the data.

Let's go ahead and remove the column that we just created.

Just like with spreadsheet applications, we could also perform other simple column-based operations, such as changing the case of the text to title, upper, or lowercase.

Faceting is a powerful feature in many data cleaning tools. A facet summarizes the cells in a column to give you a bigger picture of its contents. It also allows for filtering down to just the subset of rows that you want to change in bulk. Let's see how this works. Let's create a text facet on the country column by clicking on the column's dropdown menu and invoking Facet → Text facet. We can see that the country names are not consistent.

For instance, United States is spelled as US, USA, or United States, with some extra characters. Also, Netherlands is spelled both with and without the determiner "the." France is spelled using uppercase as well as lowercase characters.
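For reference, the menu operations we just used have GREL equivalents that can be typed into the expression box. A minimal sketch, assuming the last-name column is named contactLastName (check the actual header in your project):

    value.trim()
    value + " " + cells["contactLastName"].value
    value.toTitlecase()

The first expression trims leading and trailing whitespace, the second concatenates the first and last names with a space, and the third normalizes casing, which would also reconcile "France" and "france" in the country column.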
Clustering allows us to automatically group and edit different but similar values. Let's click on the Cluster button on the left frame. We now have the opportunity to see all the similar words grouped together and perform a merge operation. Let's merge all text referring to the United States as USA, United Kingdom as UK, and so on.

The Merge and Re-Cluster button confirms the operation. There are no more obvious clusters left. However, we still haven't resolved the issue with Netherlands. For this, we will choose another clustering method, which uses a distance metric.

The country column is now fixed, and there are no more clusters to be found. However, we still have one error: the entry for Norway has a stray space character. See if you can find a way to fix this error.

Let's now go ahead and use the tool to clean up a numerical column, like the credit limit. We'll first apply a numerical transformation on the column so that all the numbers are recognized. Some values in this column cannot be parsed into numbers because of the presence of alphanumeric values, such as USD or the dollar symbol. We can track down these values by applying a numeric facet.

On the left, we see a chart with the distribution of the data. Let's filter down to the non-numeric values only and edit the values of those fields from the expression box. This screen allows us to view the original values as well as the outcomes when the expression is applied.

We will invoke the replace function to substitute the unwanted characters with blanks. So let's remove the USD, dollar, and comma symbols. As you can see, we can make multiple function calls in a single line. This is also known as function chaining. Now all the alphabetic characters have been removed.

When we click OK, the tool still does not see them as numbers. For this, we will need to invoke the toNumber function to convert the values within the facet into numerical values.

So far we have tested these operations on a numeric facet. To apply these changes to the data set in the project, we will copy this expression and apply it as a column transform.
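Combined into a single expression, the transform looks roughly like this (a sketch; the exact tokens to strip depend on the data):

    value.replace("USD", "").replace("$", "").replace(",", "").toNumber()

Each replace call substitutes one unwanted token with a blank, and toNumber converts the cleaned string into a numeric value.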
This ensures that all the values in the credit limit column have been converted into numbers.

Once we're done with cleaning up the rest of the data, we are ready to export the file into a format of our choosing. For this example, let's export it as a CSV, or comma-separated, file and open it to view the results.

All the columns now have clean data. The data set is now ready for storage in a database or for further analysis.