0:00:06.250,0:00:08.338 So let's start the[br]OpenRefine desktop 0:00:08.338,0:00:11.470 application, which opens up[br]a new browser window and 0:00:11.470,0:00:14.602 runs on a default port on[br]the local machine. 0:00:16.350,0:00:20.350 We can create a new[br]project, open or import an 0:00:20.350,0:00:21.930 existing project, 0:00:22.340,0:00:25.025 or just change the language[br]settings if needed. 0:00:25.025,0:00:29.040 Let's download the stocks start zip[br]file and browse to the file 0:00:29.040,0:00:31.385 from within the OpenRefine[br]Create Project menu. 0:00:32.450,0:00:33.679 And click Next. 0:00:34.510,0:00:37.552 This imports the contents of[br]the file into application 0:00:37.552,0:00:40.742 memory. Note that the tool[br]can work with both compressed 0:00:40.742,0:00:42.566 and uncompressed files. 0:00:44.440,0:00:48.455 Since the stock data is in the[br]semi-structured JSON format, we 0:00:48.455,0:00:52.305 will select the root node of the[br]JSON tree, which transforms the 0:00:52.305,0:00:55.195 data into a tabular format with[br]rows and columns. 0:00:55.725,0:00:58.295 We can choose to preserve empty strings 0:00:58.295,0:01:00.385 and perform basic cleanup such as 0:01:00.385,0:01:03.420 trimming whitespace and[br]parsing numerical values. 0:01:07.860,0:01:10.960 These operations can also be[br]performed at a later stage. 0:01:13.430,0:01:15.740 We will provide a project[br]name and proceed with 0:01:15.740,0:01:17.240 project creation. 0:01:22.990,0:01:24.730 Once the project has[br]been created, 0:01:25.740,0:01:29.988 we can click All → Edit columns[br]and remove some of the unwanted 0:01:29.988,0:01:31.758 columns from the data set. 0:01:33.010,0:01:36.381 This helps reduce memory[br]usage, especially for large 0:01:36.381,0:01:39.804 data sets. We'll also go[br]ahead and reorder some of 0:01:39.804,0:01:41.454 the columns in the table.
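The JSON-to-table import step above can be sketched in Python. The record fields (ticker, price, volume) are hypothetical stand-ins for the actual attributes of the stocks data set:

```python
import json

# Hypothetical stock records standing in for the downloaded JSON file
# (the real field names in the stocks data set may differ).
raw = '''[
  {"ticker": "AAPL", "price": " 190.5 ", "volume": "1200000"},
  {"ticker": "MSFT", "price": " 410.2 ", "volume": "800000"}
]'''

records = json.loads(raw)  # parse the JSON tree from its root node

# Flatten each record into a row, trimming whitespace and parsing
# numeric strings -- roughly what OpenRefine's import options do.
columns = ["ticker", "price", "volume"]
rows = [
    [r[c].strip() if c == "ticker" else float(r[c].strip()) for c in columns]
    for r in records
]
```

Each entry in `rows` is now one tabular row with typed values, which is the shape OpenRefine presents after selecting the root node.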
0:01:49.750,0:01:53.235 Now the data set is lighter,[br]with only those columns that 0:01:53.235,0:01:56.450 will potentially be used for[br]further analysis. Just like 0:01:56.450,0:01:59.465 in a spreadsheet, we could[br]choose to sort different 0:01:59.465,0:02:02.815 columns. Here we'll go ahead[br]and sort the volume column 0:02:02.815,0:02:04.155 in descending order. 0:02:15.510,0:02:18.534 Financial data sets, such as[br]stock prices, are constantly 0:02:18.534,0:02:22.566 changing. What if our data is[br]outdated, and we wanted to fetch 0:02:22.566,0:02:25.926 the latest data from the stock[br]market? Using OpenRefine, one 0:02:25.926,0:02:28.614 could augment the data using[br]online web services. 0:02:29.650,0:02:32.530 However, calling some of these[br]web services could be very 0:02:32.530,0:02:35.073 time-consuming. So to[br]demonstrate this feature, 0:02:35.073,0:02:38.582 let's work with a smaller[br]subset of the stocks data and 0:02:38.582,0:02:40.641 load it into a new project. 0:02:40.641,0:02:44.789 Look for the file stocks_demo.csv in the[br]session folder, 0:02:44.789,0:02:46.938 and load it in OpenRefine. 0:02:50.450,0:02:53.022 We will also create a new[br]OpenRefine project 0:02:53.022,0:02:54.862 and call it StocksDemo. 0:02:59.870,0:03:04.501 On the ticker column menu, click[br]Edit column → Add columns by 0:03:04.501,0:03:09.399 fetching URLs. This opens up an[br]expression box where we could 0:03:09.399,0:03:13.491 enter the URL of an online web[br]service which could be invoked 0:03:13.491,0:03:15.035 to augment the data. 0:03:15.620,0:03:19.733 This online service provides an[br]API that returns slightly 0:03:19.733,0:03:21.104 delayed stock prices. 0:03:22.250,0:03:25.273 A good practice is to test[br]the web service from a 0:03:25.273,0:03:28.910 browser client with a known[br]ticker symbol and validate 0:03:28.910,0:03:30.183 the service response.
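The descending sort on the volume column mentioned above behaves like any in-memory sort. A minimal sketch with made-up rows:

```python
# Hypothetical rows standing in for the stocks table.
rows = [
    {"ticker": "AAPL", "volume": 1_200_000},
    {"ticker": "MSFT", "volume": 800_000},
    {"ticker": "GOOG", "volume": 1_500_000},
]

# Sort the volume column in descending order, as done in OpenRefine.
rows.sort(key=lambda r: r["volume"], reverse=True)
```

After sorting, the highest-volume ticker appears first, just as the largest values rise to the top of the sorted OpenRefine table.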
0:03:34.630,0:03:38.650 We will add this URL and append[br]the ticker symbol as a variable 0:03:38.650,0:03:40.140 parameter for each row. 0:03:41.230,0:03:43.519 To do this, copy the URL into 0:03:43.519,0:03:47.886 the expression box, and insert[br]the code as shown here. 0:03:56.720,0:04:00.284 Now the URL is different for[br]each row, with the ticker symbol 0:04:00.284,0:04:01.472 as the symbol parameter. 0:04:03.610,0:04:06.426 We'll give a new name to the[br]column and click OK. 0:04:08.440,0:04:11.296 This could take a few minutes[br]to run, depending on the speed 0:04:11.296,0:04:12.871 of the web service. 0:04:17.010,0:04:19.815 We now have a new column with[br]the latest stock data. 0:04:20.680,0:04:24.346 Notice that the results are in[br]XML format and need to be 0:04:24.346,0:04:28.606 parsed. Please refer to the[br]OpenRefine tutorials for some 0:04:28.606,0:04:30.988 help with parsing XML and JSON[br]results. 0:04:30.988,0:04:35.400 For now, here is the[br]syntax for extracting the stock 0:04:35.400,0:04:37.065 price from the XML document. 0:04:57.440,0:05:00.908 After a quick review of the[br]results, we are ready to export 0:05:00.908,0:05:03.798 the data to another data[br]processing tool in our pipeline. 0:05:07.950,0:05:12.066 In the next example, we will use[br]OpenRefine to clean up a 0:05:12.066,0:05:13.438 sample messy data file. 0:05:15.820,0:05:19.541 So let's download the[br]Customers-messy.zip file 0:05:19.541,0:05:21.527 from the session folder and store it on 0:05:21.527,0:05:25.739 the local computer. The first[br]step is to visually inspect 0:05:25.739,0:05:29.663 the raw data file using a text[br]editor or Excel and get 0:05:29.663,0:05:32.606 familiar with the attributes[br]and contents of the data. 0:05:35.030,0:05:38.308 As before, we will load this[br]data file into OpenRefine and 0:05:38.308,0:05:39.500 create a new project.
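The per-row URL construction and the XML price extraction can be sketched together in Python. The endpoint and the XML element names below are assumptions for illustration, not the real service's API:

```python
import xml.etree.ElementTree as ET

# Placeholder endpoint; the real delayed-quote service URL differs.
BASE_URL = "https://quotes.example.com/delayed?symbol="

tickers = ["AAPL", "MSFT"]
# One URL per row, like the GREL concatenation in the expression box.
urls = [BASE_URL + t for t in tickers]

# A response shaped like the service might return (element names assumed).
response = "<quote><symbol>AAPL</symbol><price>190.50</price></quote>"

# Extract the stock price from the XML document.
price = float(ET.fromstring(response).findtext("price"))
```

In OpenRefine, the same two steps happen as the fetch expression (per-row URL) and a follow-up parsing expression on the fetched column.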
0:05:41.990,0:05:45.390 Once the data has been[br]loaded, we'll do a cursory 0:05:45.390,0:05:46.750 check on the columns, 0:05:49.260,0:05:51.996 and change the character[br]encoding to UTF-8. 0:05:52.690,0:05:55.910 If needed, we could also change[br]options such as column 0:05:55.910,0:05:57.520 separators or add a missing 0:05:57.520,0:06:01.588 header row. In our case, we'll[br]keep the default settings. 0:06:02.920,0:06:06.132 So let's give the clean-up[br]project a name and click on 0:06:06.132,0:06:07.990 the Create Project button. 0:06:09.550,0:06:12.982 This creates a new OpenRefine[br]project with a copy of the 0:06:12.982,0:06:17.660 original data set. The next step[br]in the cleanup process is to 0:06:17.660,0:06:21.080 visually inspect the data and[br]check each column for errors, 0:06:21.080,0:06:23.768 formatting issues, or missing values. 0:06:24.018,0:06:26.851 We can perform common[br]operations like removing 0:06:26.851,0:06:30.954 whitespace from text. For this,[br]let's click on a column header, 0:06:30.954,0:06:33.258 say contactFirstName, 0:06:33.470,0:06:36.528 and choose Edit cells → Trim leading and 0:06:36.528,0:06:40.108 trailing whitespace. We can[br]perform the same operation 0:06:40.108,0:06:42.082 on multiple columns. 0:06:42.930,0:06:45.530 We could also do basic[br]concatenation operations like 0:06:45.530,0:06:49.230 in a spreadsheet. Let's say we[br]need to combine the first and 0:06:49.230,0:06:53.005 last names of the contact into[br]one column. For this we'll 0:06:53.005,0:06:54.625 click on Edit column → 0:06:55.380,0:06:57.528 Add column based on this column. 0:06:59.000,0:07:01.790 In the expression box, we[br]will take the current value 0:07:01.790,0:07:03.575 of the contactFirstName column, 0:07:04.300,0:07:08.476 add a space, followed by the[br]value of the contactLastName column.
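The trim and concatenation steps above map directly onto plain string operations. A minimal Python sketch, with invented cell values:

```python
# Invented cells with stray whitespace, like the messy customer data.
first_name = "  Jane "
last_name = " Doe  "

# Trim leading and trailing whitespace, as in Edit cells → Trim.
first_name = first_name.strip()
last_name = last_name.strip()

# Concatenate with a space to build the new contactFullName column,
# mirroring a GREL expression of the form
# value + " " + cells["contactLastName"].value
contact_full_name = first_name + " " + last_name
```

The trim step matters: without it the concatenation would carry the stray spaces into the new column.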
0:07:15.850,0:07:18.050 Let's check the results 0:07:18.520,0:07:21.772 and assign a name to the new[br]column, such as 0:07:21.772,0:07:23.869 contactFullName. And hit OK. 0:07:29.870,0:07:32.829 The newly created column has[br]been added to our data set. 0:07:33.990,0:07:37.350 A nice thing about OpenRefine is[br]that you can undo your past 0:07:37.350,0:07:41.200 operations, so if you made a[br]mistake you can go back to a 0:07:41.200,0:07:44.450 previous step and correct it.[br]This makes it really easy to 0:07:44.450,0:07:47.050 make incremental changes without[br]losing time over 0:07:47.050,0:07:49.730 saving multiple copies of the data. 0:07:52.850,0:07:55.705 Let's go ahead and remove the[br]column that we just created. 0:07:57.300,0:08:00.020 Just like with spreadsheet[br]applications, we could also 0:08:00.020,0:08:02.980 perform other simple, column-based[br]operations such as 0:08:02.980,0:08:06.820 changing the case of the text to[br]title, upper, or lowercase. 0:08:17.750,0:08:21.457 Faceting is a powerful feature[br]in many data cleaning tools. A 0:08:21.457,0:08:25.501 facet summarizes the cells in[br]a column to give you a bigger 0:08:25.501,0:08:29.208 picture of that column. It also[br]allows for filtering down to 0:08:29.208,0:08:32.769 just a subset of the rows that[br]you want to change in bulk. 0:08:32.769,0:08:35.103 Let's see how this works. 0:08:35.103,0:08:37.633 Let's create a text facet[br]on the country column 0:08:37.633,0:08:40.183 by clicking on the column's[br]dropdown menu and 0:08:40.183,0:08:43.760 invoking Facet → Text facet. We[br]can see that 0:08:43.760,0:08:45.901 the country names[br]are not consistent. 0:08:47.220,0:08:51.315 For instance, United States[br]is spelled as US, USA, or 0:08:51.315,0:08:54.045 United States, with some[br]extra characters. 0:08:55.690,0:08:59.641 Also, Netherlands is spelled[br]both with and without the 0:08:59.641,0:09:03.592 determiner "the."
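A text facet is essentially a frequency table of the distinct values in a column; that is what makes variants like US/USA/United States visible at a glance. A sketch with invented country cells:

```python
from collections import Counter

# Invented country cells illustrating the inconsistencies in the video.
countries = ["USA", "US", "United States", "Netherlands",
             "The Netherlands", "France", "FRANCE", "USA"]

# A text facet groups identical values and counts their occurrences.
facet = Counter(countries)
distinct_values = len(facet)
```

Every spelling variant shows up as its own facet entry, so eight cells here produce seven distinct values, flagging the column for cleanup.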
France is[br]spelled using uppercase as 0:09:03.592,0:09:05.348 well as lowercase characters. 0:09:07.130,0:09:09.970 Clustering allows us to[br]automatically group and edit 0:09:09.970,0:09:13.165 different but similar values.[br]Let's click on the Cluster 0:09:13.165,0:09:14.940 command on the left frame. 0:09:15.890,0:09:19.282 We now have the opportunity to[br]see all the similar words 0:09:19.282,0:09:21.202 grouped together and perform a 0:09:21.202,0:09:25.476 merge operation. Let's merge all[br]text referring to the United 0:09:25.476,0:09:29.496 States as USA, United Kingdom as[br]UK, and so on. 0:09:33.230,0:09:36.030 The Merge and Re-Cluster[br]button confirms the 0:09:36.030,0:09:39.180 operation. There are no more[br]obvious clusters that are 0:09:39.180,0:09:41.980 left. However, we still[br]haven't resolved the issue 0:09:41.980,0:09:44.780 with Netherlands. For this[br]we will choose another 0:09:44.780,0:09:47.230 clustering operation, which[br]uses a distance metric. 0:09:56.410,0:09:59.122 The country column is now fixed,[br]and there are no more 0:09:59.122,0:10:02.394 clusters to be found. However,[br]we still have one error: 0:10:02.394,0:10:05.716 the country Norway contains an extra[br]space character. See if we can find 0:10:05.716,0:10:07.878 a way to fix this error. 0:10:09.690,0:10:13.729 Let's now go ahead and use the[br]tool to clean up a numerical 0:10:13.729,0:10:17.349 column like the credit limit.[br]We'll first apply a numerical 0:10:17.349,0:10:20.509 transformation on the column so[br]that all the numbers are 0:10:20.509,0:10:24.624 recognized. Some values in this[br]column cannot be parsed into 0:10:24.624,0:10:27.784 numbers. This is because of the[br]presence of alphanumeric values, 0:10:27.784,0:10:29.996 such as the USD or the dollar 0:10:29.996,0:10:34.130 symbol. We can correct this by[br]applying the Numeric facet. 0:10:34.750,0:10:38.026 On the left we see a chart with[br]the distribution of the data.
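Distance-based clustering groups values whose string distance is small even when exact grouping misses them. A toy version of the idea using Python's difflib similarity ratio (the 0.8 threshold is an arbitrary choice, not OpenRefine's actual metric):

```python
from difflib import SequenceMatcher

def same_cluster(a: str, b: str, threshold: float = 0.8) -> bool:
    # Compare values case-insensitively; pairs whose similarity ratio
    # exceeds the threshold would be offered for merging, loosely
    # analogous to OpenRefine's distance-based clustering methods.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
```

With this sketch, "Netherlands" and "The Netherlands" land in the same cluster, while unrelated names such as "Norway" and "USA" fall below the threshold and stay apart.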
0:10:38.900,0:10:42.871 Let's filter down to the non-numeric[br]values only and change 0:10:42.871,0:10:46.120 the values of the fields from[br]the expression box. 0:10:52.930,0:10:55.720 This screen allows us to[br]view the original values as 0:10:55.720,0:10:58.231 well as the outcomes when[br]the expression is applied. 0:11:02.670,0:11:06.099 We will invoke the replace[br]function to replace or 0:11:06.099,0:11:09.147 substitute the unwanted[br]characters with blanks. So let's 0:11:09.147,0:11:12.576 remove the USD, the dollar, and[br]the comma symbols. 0:11:14.010,0:11:17.202 As you can see, we can make[br]multiple function calls in a 0:11:17.202,0:11:19.248 single line. This is also known 0:11:19.248,0:11:22.710 as function chaining. Now[br]all the alpha characters 0:11:22.710,0:11:24.378 have been removed. 0:11:26.970,0:11:30.623 When we click OK, the tool still[br]does not see them as numbers. 0:11:31.280,0:11:35.300 For this we will need to invoke[br]the toNumber function to 0:11:35.300,0:11:38.315 convert the values within the[br]facet into numerical values. 0:11:48.170,0:11:51.800 So far we have tested these[br]operations on a numeric facet. 0:11:52.420,0:11:55.844 To apply these changes to the[br]data set in the project, we will 0:11:55.844,0:11:59.064 copy this expression and apply[br]it as a column transform. This 0:11:59.064,0:12:02.266 ensures that all the values in[br]the credit limit column have 0:12:02.266,0:12:04.314 been converted into numbers. 0:12:13.100,0:12:16.541 Once we're done with cleaning up[br]the rest of the data, we are 0:12:16.541,0:12:19.862 ready to export the file into a[br]format of our choosing. For this 0:12:19.862,0:12:23.626 example, let's export it as a[br]CSV, or comma-separated values, file 0:12:23.626,0:12:25.565 and open it to view the results. 0:12:30.430,0:12:34.234 All the columns now have clean[br]data. The data set is now 0:12:34.234,0:12:37.677 ready for storage in a[br]database or for further analysis.
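The chained replace calls followed by the number conversion translate almost one-to-one into Python; GREL's toNumber becomes an ordinary float parse:

```python
def to_number(value: str) -> float:
    # Chain replace calls to strip the USD, dollar, and comma symbols
    # with blanks, then parse what remains -- mirroring the chained
    # GREL expression replace(...).replace(...).toNumber().
    cleaned = (value.replace("USD", "")
                    .replace("$", "")
                    .replace(",", "")
                    .strip())
    return float(cleaned)
```

For example, to_number("USD 1,250.50") and to_number("$90,700") both yield plain numeric values, ready for the credit limit column transform.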