0:00:06.250,0:00:08.338 So let's start the[br]OpenRefine desktop 0:00:08.338,0:00:11.470 application, which opens up[br]a new browser window and 0:00:11.470,0:00:14.602 runs on a default port on[br]the local machine. 0:00:16.350,0:00:20.350 We can create a new[br]project, open or import an 0:00:20.350,0:00:21.930 existing project, 0:00:22.340,0:00:25.025 or just change the language[br]settings if needed. 0:00:25.025,0:00:29.040 Let's download the stocks start zip[br]file and browse to the file 0:00:29.040,0:00:31.385 from within the OpenRefine[br]Create Project menu. 0:00:32.450,0:00:33.679 And click Next. 0:00:34.510,0:00:37.552 This imports the contents of[br]the file into application 0:00:37.552,0:00:40.742 memory. Note that the tool[br]can work with both compressed 0:00:40.742,0:00:42.566 and uncompressed files. 0:00:44.440,0:00:48.455 Since the stock data is in the[br]semi-structured JSON format, we 0:00:48.455,0:00:52.305 will select the root node of the[br]JSON tree, which transforms the 0:00:52.305,0:00:55.195 data into a tabular format with[br]rows and columns. 0:00:55.725,0:00:58.295 We can choose to preserve empty strings 0:00:58.295,0:01:00.385 and perform basic cleanup such as 0:01:00.385,0:01:03.420 trimming whitespace and[br]parsing numerical values. 0:01:07.860,0:01:10.960 These operations can also be[br]performed at a later stage. 0:01:13.430,0:01:15.740 We will provide a project[br]name and proceed with 0:01:15.740,0:01:17.240 project creation. 0:01:22.990,0:01:24.730 Once the project has[br]been created, 0:01:25.740,0:01:29.988 we can click All → Edit columns[br]and remove some of the unwanted 0:01:29.988,0:01:31.758 columns from the data set. 0:01:33.010,0:01:36.381 This helps reduce memory[br]usage, especially for large 0:01:36.381,0:01:39.804 data sets. We'll also go[br]ahead and reorder some of 0:01:39.804,0:01:41.454 the columns in the table.
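The JSON-to-table import step above can be sketched in Python. The record fields (ticker, price, volume) are hypothetical stand-ins for the actual attributes of the stocks data set:

```python
import json

# Hypothetical stock records standing in for the downloaded JSON file
# (the real field names in the stocks data set may differ).
raw = '''[
  {"ticker": "AAPL", "price": " 190.5 ", "volume": "1200000"},
  {"ticker": "MSFT", "price": " 410.2 ", "volume": "800000"}
]'''

records = json.loads(raw)  # parse the JSON tree from its root node

# Flatten each record into a row, trimming whitespace and parsing
# numeric strings -- roughly what OpenRefine's import options do.
columns = ["ticker", "price", "volume"]
rows = [
    [r[c].strip() if c == "ticker" else float(r[c].strip()) for c in columns]
    for r in records
]
```

Each entry in `rows` is now one tabular row with typed values, which is the shape OpenRefine presents after selecting the root node.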
0:01:49.750,0:01:53.235 Now the data set is lighter,[br]with only those columns that 0:01:53.235,0:01:56.450 will potentially be used for[br]further analysis. Just like 0:01:56.450,0:01:59.465 in a spreadsheet, we could[br]choose to sort different 0:01:59.465,0:02:02.815 columns. Here we'll go ahead[br]and sort the volume column 0:02:02.815,0:02:04.155 in descending order. 0:02:15.510,0:02:18.534 Financial data sets, such as[br]stock prices, are constantly 0:02:18.534,0:02:22.566 changing. What if our data is[br]outdated, and we wanted to fetch 0:02:22.566,0:02:25.926 the latest data from the stock[br]market? Using OpenRefine, one 0:02:25.926,0:02:28.614 could augment the data using[br]online web services. 0:02:29.650,0:02:32.530 However, calling some of these[br]web services could be very 0:02:32.530,0:02:35.073 time-consuming. So to[br]demonstrate this feature, 0:02:35.073,0:02:38.582 let's work with a smaller[br]subset of the stocks data and 0:02:38.582,0:02:40.641 load it into a new project. 0:02:40.641,0:02:44.789 Look for the file stocks_demo.csv in the[br]session folder, 0:02:44.789,0:02:46.938 and load it in OpenRefine. 0:02:50.450,0:02:53.022 We will also create a new[br]OpenRefine project 0:02:53.022,0:02:54.862 and call it StocksDemo. 0:02:59.870,0:03:04.501 On the ticker column menu, click[br]Edit column → Add columns by 0:03:04.501,0:03:09.399 fetching URLs. This opens up an[br]expression box where we could 0:03:09.399,0:03:13.491 enter the URL of an online web[br]service which could be invoked 0:03:13.491,0:03:15.035 to augment the data. 0:03:15.620,0:03:19.733 This online service provides an[br]API that returns slightly 0:03:19.733,0:03:21.104 delayed stock prices. 0:03:22.250,0:03:25.273 A good practice is to test[br]the web service from a 0:03:25.273,0:03:28.910 browser client with a known[br]ticker symbol and validate 0:03:28.910,0:03:30.183 the service response.
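The descending sort on the volume column mentioned above behaves like any in-memory sort. A minimal sketch with made-up rows:

```python
# Hypothetical rows standing in for the stocks table.
rows = [
    {"ticker": "AAPL", "volume": 1_200_000},
    {"ticker": "MSFT", "volume": 800_000},
    {"ticker": "GOOG", "volume": 1_500_000},
]

# Sort the volume column in descending order, as done in OpenRefine.
rows.sort(key=lambda r: r["volume"], reverse=True)
```

After sorting, the highest-volume ticker appears first, just as the largest values rise to the top of the sorted OpenRefine table.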
0:03:34.630,0:03:38.650 We will add this URL and append[br]the ticker symbol as a variable 0:03:38.650,0:03:40.140 parameter for each row. 0:03:41.230,0:03:43.519 To do this, copy the URL into 0:03:43.519,0:03:47.886 the expression box, and insert[br]the code as shown here. 0:03:56.720,0:04:00.284 Now the URL is different for[br]each row, with the ticker symbol 0:04:00.284,0:04:01.472 as the symbol parameter. 0:04:03.610,0:04:06.426 We'll give a new name to the[br]column and click OK. 0:04:08.440,0:04:11.296 This could take a few minutes[br]to run, depending on the speed 0:04:11.296,0:04:12.871 of the web service. 0:04:17.010,0:04:19.815 We now have a new column with[br]the latest stock data. 0:04:20.680,0:04:24.346 Notice that the results are in[br]XML format and need to be 0:04:24.346,0:04:28.606 parsed. Please refer to the[br]OpenRefine tutorials for some 0:04:28.606,0:04:30.988 help with parsing XML and JSON[br]results. 0:04:30.988,0:04:35.400 For now, here is the[br]syntax for extracting the stock 0:04:35.400,0:04:37.065 price from the XML document. 0:04:57.440,0:05:00.908 After a quick review of the[br]results, we are ready to export 0:05:00.908,0:05:03.798 the data to another data[br]processing tool in our pipeline. 0:05:07.950,0:05:12.066 In the next example, we will use[br]OpenRefine to clean up a 0:05:12.066,0:05:13.438 sample messy data file. 0:05:15.820,0:05:19.541 So let's download the[br]Customers-messy.zip file 0:05:19.541,0:05:21.527 from the session folder and store it on 0:05:21.527,0:05:25.739 the local computer. The first[br]step is to visually inspect 0:05:25.739,0:05:29.663 the raw data file using a text[br]editor or Excel and get 0:05:29.663,0:05:32.606 familiar with the attributes[br]and contents of the data. 0:05:35.030,0:05:38.308 As before, we will load this[br]data file into OpenRefine and 0:05:38.308,0:05:39.500 create a new project.
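The per-row URL construction and the XML price extraction can be sketched together in Python. The endpoint and the XML element names below are assumptions for illustration, not the real service's API:

```python
import xml.etree.ElementTree as ET

# Placeholder endpoint; the real delayed-quote service URL differs.
BASE_URL = "https://quotes.example.com/delayed?symbol="

tickers = ["AAPL", "MSFT"]
# One URL per row, like the GREL concatenation in the expression box.
urls = [BASE_URL + t for t in tickers]

# A response shaped like the service might return (element names assumed).
response = "<quote><symbol>AAPL</symbol><price>190.50</price></quote>"

# Extract the stock price from the XML document.
price = float(ET.fromstring(response).findtext("price"))
```

In OpenRefine, the same two steps happen as the fetch expression (per-row URL) and a follow-up parsing expression on the fetched column.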
0:05:41.990,0:05:45.390 Once the data has been[br]loaded, we'll do a cursory 0:05:45.390,0:05:46.750 check on the columns, 0:05:49.260,0:05:51.996 and change the character[br]encoding to UTF-8. 0:05:52.690,0:05:55.910 If needed, we could also change[br]options such as column 0:05:55.910,0:05:57.520 separators or add a missing 0:05:57.520,0:06:01.588 header row. In our case, we'll[br]keep the default settings. 0:06:02.920,0:06:06.132 So let's give the clean-up[br]project a name and click on 0:06:06.132,0:06:07.990 the Create Project button. 0:06:09.550,0:06:12.982 This creates a new OpenRefine[br]project with a copy of the 0:06:12.982,0:06:17.660 original data set. The next step[br]in the cleanup process is to 0:06:17.660,0:06:21.080 visually inspect the data and[br]check each column for errors, 0:06:21.080,0:06:23.768 formatting issues, or missing values. 0:06:24.018,0:06:26.851 We can perform common[br]operations like removing 0:06:26.851,0:06:30.954 whitespace from text. For this,[br]let's click on a column header, 0:06:30.954,0:06:33.258 say contactFirstName, 0:06:33.470,0:06:36.528 and choose Edit cells → Trim leading and 0:06:36.528,0:06:40.108 trailing whitespace. We can[br]perform the same operation 0:06:40.108,0:06:42.082 on multiple columns. 0:06:42.930,0:06:45.530 We could also do basic[br]concatenation operations like 0:06:45.530,0:06:49.230 in a spreadsheet. Let's say we[br]need to combine the first and 0:06:49.230,0:06:53.005 last names of the contact into[br]one column. For this we'll 0:06:53.005,0:06:54.625 click on Edit column → 0:06:55.380,0:06:57.528 Add column based on this column. 0:06:59.000,0:07:01.790 In the expression box, we[br]will take the current value 0:07:01.790,0:07:03.575 of the contactFirstName column, 0:07:04.300,0:07:08.476 add a space, followed by the[br]value of the contactLastName column.
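The trim and concatenation steps above map directly onto plain string operations. A minimal Python sketch, with invented cell values:

```python
# Invented cells with stray whitespace, like the messy customer data.
first_name = "  Jane "
last_name = " Doe  "

# Trim leading and trailing whitespace, as in Edit cells → Trim.
first_name = first_name.strip()
last_name = last_name.strip()

# Concatenate with a space to build the new contactFullName column,
# mirroring a GREL expression of the form
# value + " " + cells["contactLastName"].value
contact_full_name = first_name + " " + last_name
```

The trim step matters: without it the concatenation would carry the stray spaces into the new column.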
0:07:15.850,0:07:18.050 Let's check the results 0:07:18.520,0:07:21.772 and assign a name to the new[br]column, such as 0:07:21.772,0:07:23.869 contactFullName. And hit OK. 0:07:29.870,0:07:32.829 The newly created column has[br]been added to our data set. 0:07:33.990,0:07:37.350 A nice thing about OpenRefine is[br]that you can undo your past 0:07:37.350,0:07:41.200 operations, so if you made a[br]mistake you can go back to a 0:07:41.200,0:07:44.450 previous step and correct it.[br]This makes it really easy to 0:07:44.450,0:07:47.050 make incremental changes without[br]losing time over 0:07:47.050,0:07:49.730 saving multiple copies of the data. 0:07:52.850,0:07:55.705 Let's go ahead and remove the[br]column that we just created. 0:07:57.300,0:08:00.020 Just like with spreadsheet[br]applications, we could also 0:08:00.020,0:08:02.980 perform other simple, column-based[br]operations such as 0:08:02.980,0:08:06.820 changing the case of the text to[br]title, upper, or lowercase. 0:08:17.750,0:08:21.457 Faceting is a powerful feature[br]in many data cleaning tools. A 0:08:21.457,0:08:25.501 facet summarizes the cells in[br]a column to give you a bigger 0:08:25.501,0:08:29.208 picture of that column. It also[br]allows for filtering down to 0:08:29.208,0:08:32.769 just a subset of the rows that[br]you want to change in bulk. 0:08:32.769,0:08:35.103 Let's see how this works. 0:08:35.103,0:08:37.633 Let's create a text facet[br]on the country column 0:08:37.633,0:08:40.183 by clicking on the column's[br]dropdown menu and 0:08:40.183,0:08:43.760 invoking Facet → Text facet. We[br]can see that 0:08:43.760,0:08:45.901 the country names[br]are not consistent. 0:08:47.220,0:08:51.315 For instance, United States[br]is spelled as US, USA, or 0:08:51.315,0:08:54.045 United States, with some[br]extra characters. 0:08:55.690,0:08:59.641 Also, Netherlands is spelled[br]both with and without the 0:08:59.641,0:09:03.592 determiner "the."
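A text facet is essentially a frequency table of the distinct values in a column; that is what makes variants like US/USA/United States visible at a glance. A sketch with invented country cells:

```python
from collections import Counter

# Invented country cells illustrating the inconsistencies in the video.
countries = ["USA", "US", "United States", "Netherlands",
             "The Netherlands", "France", "FRANCE", "USA"]

# A text facet groups identical values and counts their occurrences.
facet = Counter(countries)
distinct_values = len(facet)
```

Every spelling variant shows up as its own facet entry, so eight cells here produce seven distinct values, flagging the column for cleanup.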
France is[br]spelled using uppercase as 0:09:03.592,0:09:05.348 well as lowercase characters. 0:09:07.130,0:09:09.970 Clustering allows us to[br]automatically group and edit 0:09:09.970,0:09:13.165 different but similar values.[br]Let's click on the Cluster 0:09:13.165,0:09:14.940 command on the left frame. 0:09:15.890,0:09:19.282 We now have the opportunity to[br]see all the similar words 0:09:19.282,0:09:21.202 grouped together and perform a 0:09:21.202,0:09:25.476 merge operation. Let's merge all[br]text referring to the United 0:09:25.476,0:09:29.496 States as USA, United Kingdom as[br]UK, and so on. 0:09:33.230,0:09:36.030 The Merge and Re-Cluster[br]button confirms the 0:09:36.030,0:09:39.180 operation. There are no more[br]obvious clusters that are 0:09:39.180,0:09:41.980 left. However, we still[br]haven't resolved the issue 0:09:41.980,0:09:44.780 with Netherlands. For this[br]we will choose another 0:09:44.780,0:09:47.230 clustering operation, which[br]uses a distance metric. 0:09:56.410,0:09:59.122 The country column is now fixed,[br]and there are no more 0:09:59.122,0:10:02.394 clusters to be found. However,[br]we still have one error: 0:10:02.394,0:10:05.716 the country Norway contains an extra[br]space character. See if we can find 0:10:05.716,0:10:07.878 a way to fix this error. 0:10:09.690,0:10:13.729 Let's now go ahead and use the[br]tool to clean up a numerical 0:10:13.729,0:10:17.349 column like the credit limit.[br]We'll first apply a numerical 0:10:17.349,0:10:20.509 transformation on the column so[br]that all the numbers are 0:10:20.509,0:10:24.624 recognized. Some values in this[br]column cannot be parsed into 0:10:24.624,0:10:27.784 numbers. This is because of the[br]presence of alphanumeric values, 0:10:27.784,0:10:29.996 such as the USD or the dollar 0:10:29.996,0:10:34.130 symbol. We can correct this by[br]applying the Numeric facet. 0:10:34.750,0:10:38.026 On the left we see a chart with[br]the distribution of the data.
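Distance-based clustering groups values whose string distance is small even when exact grouping misses them. A toy version of the idea using Python's difflib similarity ratio (the 0.8 threshold is an arbitrary choice, not OpenRefine's actual metric):

```python
from difflib import SequenceMatcher

def same_cluster(a: str, b: str, threshold: float = 0.8) -> bool:
    # Compare values case-insensitively; pairs whose similarity ratio
    # exceeds the threshold would be offered for merging, loosely
    # analogous to OpenRefine's distance-based clustering methods.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
```

With this sketch, "Netherlands" and "The Netherlands" land in the same cluster, while unrelated names such as "Norway" and "USA" fall below the threshold and stay apart.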
0:10:38.900,0:10:42.871 Let's filter down to the non-numeric[br]values only and change 0:10:42.871,0:10:46.120 the values of the fields from[br]the expression box. 0:10:52.930,0:10:55.720 This screen allows us to[br]view the original values as 0:10:55.720,0:10:58.231 well as the outcomes when[br]the expression is applied. 0:11:02.670,0:11:06.099 We will invoke the replace[br]function to replace or 0:11:06.099,0:11:09.147 substitute the unwanted[br]characters with blanks. So let's 0:11:09.147,0:11:12.576 remove the USD, the dollar, and[br]the comma symbols. 0:11:14.010,0:11:17.202 As you can see, we can make[br]multiple function calls in a 0:11:17.202,0:11:19.248 single line. This is also known 0:11:19.248,0:11:22.710 as function chaining. Now[br]all the alpha characters 0:11:22.710,0:11:24.378 have been removed. 0:11:26.970,0:11:30.623 When we click OK, the tool still[br]does not see them as numbers. 0:11:31.280,0:11:35.300 For this we will need to invoke[br]the toNumber function to 0:11:35.300,0:11:38.315 convert the values within the[br]facet into numerical values. 0:11:48.170,0:11:51.800 So far we have tested these[br]operations on a numeric facet. 0:11:52.420,0:11:55.844 To apply these changes to the[br]data set in the project, we will 0:11:55.844,0:11:59.064 copy this expression and apply[br]it as a column transform. This 0:11:59.064,0:12:02.266 ensures that all the values in[br]the credit limit column have 0:12:02.266,0:12:04.314 been converted into numbers. 0:12:13.100,0:12:16.541 Once we're done with cleaning up[br]the rest of the data, we are 0:12:16.541,0:12:19.862 ready to export the file into a[br]format of our choosing. For this 0:12:19.862,0:12:23.626 example, let's export it as a[br]CSV, or comma-separated values, file 0:12:23.626,0:12:25.565 and open it to view the results. 0:12:30.430,0:12:34.234 All the columns now have clean[br]data. The data set is now 0:12:34.234,0:12:37.677 ready for storage in a[br]database or for further analysis.
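The chained replace calls followed by the number conversion translate almost one-to-one into Python; GREL's toNumber becomes an ordinary float parse:

```python
def to_number(value: str) -> float:
    # Chain replace calls to strip the USD, dollar, and comma symbols
    # with blanks, then parse what remains -- mirroring the chained
    # GREL expression replace(...).replace(...).toNumber().
    cleaned = (value.replace("USD", "")
                    .replace("$", "")
                    .replace(",", "")
                    .strip())
    return float(cleaned)
```

For example, to_number("USD 1,250.50") and to_number("$90,700") both yield plain numeric values, ready for the credit limit column transform.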