So let's start the OpenRefine desktop application, which opens a new browser window and runs on a default port on the local machine.

We can create a new project, open or import an existing project, or just change the language settings if needed. Let's download the stock start zip file and browse to the file from within the OpenRefine Create Project menu, then click Next. This imports the contents of the file into application memory. Note that the tool can work with both compressed and uncompressed files.

Since the stock data is in the semi-structured JSON format, we will select the root node of the JSON tree, which transforms the data into a tabular format with rows and columns. We can choose to preserve empty strings and perform basic cleanup, such as trimming whitespace and parsing numerical values. These operations can also be performed at a later stage.

We will provide a project name and proceed with project creation. Once the project has been created, we can click All → Edit columns and remove some of the unwanted columns from the data set. This helps reduce memory usage, especially for large data sets.
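Outside OpenRefine, the import step above (flatten each JSON record into a row, trim whitespace, parse numeric-looking strings) can be sketched in a few lines of Python. The record fields below are invented for illustration; the actual stock file's schema isn't shown on screen.

```python
import json

def records_to_rows(raw_json, columns):
    """Flatten a JSON array of records into rows, trimming whitespace
    and parsing numeric-looking strings, much like OpenRefine's
    import options do."""
    def clean(value):
        if isinstance(value, str):
            value = value.strip()  # trim leading/trailing whitespace
            try:
                return float(value) if "." in value else int(value)
            except ValueError:
                return value  # leave non-numeric strings as text
        return value
    return [[clean(record.get(col)) for col in columns]
            for record in json.loads(raw_json)]

# Hypothetical stock records -- field names are made up for this sketch.
raw = '[{"ticker": " IBM ", "close": "143.50", "volume": "4100000"}]'
rows = records_to_rows(raw, ["ticker", "close", "volume"])
```

Keeping only the columns you pass in also mirrors the column-removal step: fewer columns means less data held in memory.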
We'll also go ahead and reorder some of the columns in the table. Now the data set is lighter, with only those columns that will potentially be used for further analysis. Just as in a spreadsheet, we can sort individual columns; here we'll sort the volume column in descending order.

Financial data sets, such as stock prices, are constantly changing. What if our data is outdated and we want to fetch the latest data from the stock market? Using OpenRefine, one could augment the data using online web services. However, calling some of these web services can be very time-consuming, so to demonstrate this feature let's work with a smaller subset of the stocks data and load it into a new project. Look for the file stocks_demo.csv in the session folder and load it into OpenRefine. We will also create a new OpenRefine project and call it StocksDemo.

On the ticker column menu, click Edit column → Add column by fetching URLs. This opens an expression box where we can enter the URL of an online web service to be invoked to augment the data. This online service provides an API that returns slightly delayed stock prices.
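In the expression box, the fetch URL is built per row by appending the cell value to a base URL (in GREL this is typically something like `"https://…?symbol=" + value`). Here is the same idea in plain Python; the endpoint is made up, since the actual service URL isn't shown on screen.

```python
# Hypothetical quote endpoint -- the real service URL is not shown in the video.
BASE_URL = "https://quotes.example.com/api"

def quote_url(ticker):
    """Build the per-row fetch URL, as OpenRefine's
    'Add column by fetching URLs' expression does."""
    return BASE_URL + "?symbol=" + ticker.strip()

urls = [quote_url(t) for t in ["IBM", "AAPL"]]
```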
A good practice is to test the web service from a browser client with a known ticker symbol and validate the service response.

We will add this URL and append the ticker symbol as a variable parameter for each row. To do this, copy the URL into the expression box and edit the expression as shown here. Now the URL is different for each row, with the ticker symbol as the symbol parameter. We'll give the new column a name and click OK. This could take a few minutes to run, depending on the speed of the web service.

We now have a new column with the latest stock data. Notice that the results are in XML format and need to be parsed. Please refer to the OpenRefine tutorials for help with parsing XML and JSON results. For now, here is the syntax for extracting the stock price from the XML document.

After a quick review of the results, we are ready to export the data to another data processing tool in our pipeline.

In the next example, we will use OpenRefine to clean up a sample messy data file. So let's download the Customers-messy.zip file from the session folder and store it on the local computer.
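For the stock-price step above, pulling the price out of each XML response looks roughly like the following Python sketch (in OpenRefine itself you would use a GREL expression built on `value.parseXml()`). The response shape here is hypothetical, since the service's actual XML isn't reproduced in the video.

```python
import xml.etree.ElementTree as ET

def extract_price(xml_text):
    """Pull the stock price out of an XML quote response."""
    return float(ET.fromstring(xml_text).findtext("price"))

# Hypothetical response shape -- the real service's XML is not shown.
response = "<quote><symbol>IBM</symbol><price>143.50</price></quote>"
price = extract_price(response)
```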
The first step is to visually inspect the raw data file using a text editor or Excel and get familiar with the attributes and contents of the data. As before, we will load this data file into OpenRefine and create a new project. Once the data has been loaded, we'll do a cursory check on the columns and change the character encoding to UTF-8. If needed, we could also change options such as column separators, or add a missing header row; in our case, we'll keep the default settings. So let's give the clean-up project a name and click the Create Project button. This creates a new OpenRefine project with a copy of the original data set.

The next step in the cleanup process is to visually inspect the data and check each column for errors, formatting issues, or missing values. We can perform common operations such as removing whitespace from text. For this, let's click on a column header, say contactFirstName, and choose Edit cells → Common transforms → Trim leading and trailing whitespace. We can perform the same operation on multiple columns.

We could also do basic concatenation operations, as in a spreadsheet. Let's say we need to combine the first and last names of the contact into one column.
For this we'll click Edit column → Add column based on this column. In the expression box, we will take the current value of contactFirstName and add a space followed by the value of contactLastName. Let's check the results, assign a name to the new column, such as contactFullName, and hit OK. The newly created column has been added to our data set.

A nice thing about OpenRefine is that you can undo your past operations, so if you made a mistake you can go back to a previous step and correct it. This makes it really easy to make incremental changes without losing time saving multiple copies of the data. Let's go ahead and remove the column that we just created.

Just as with spreadsheet applications, we could also perform other simple, column-based operations, such as changing the case of the text to title case, uppercase, or lowercase.

Faceting is a powerful feature in many data cleaning tools. A facet summarizes the cells in a column to give you a bigger picture of that column. It also allows for filtering down to just the subset of rows that you want to change in bulk. Let's see how this works.
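The concatenation step above (in GREL, roughly `cells["contactFirstName"].value + " " + cells["contactLastName"].value`) mirrors this simple Python version; the sample names are invented for illustration.

```python
def full_name(row):
    """Combine first and last name into one field, trimming stray
    whitespace first -- the same idea as the 'Add column based on
    this column' expression."""
    return row["contactFirstName"].strip() + " " + row["contactLastName"].strip()

# Hypothetical row from the customers data set.
row = {"contactFirstName": " Diane ", "contactLastName": "Murphy"}
name = full_name(row)
```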
Let's create a text facet on the country column by clicking the column's dropdown menu and invoking Facet → Text facet. We can see that the country names are not consistent. For instance, United States is spelled as US, USA, or United States, sometimes with extra characters. Also, Netherlands is spelled both with and without the determiner "the," and France is spelled using both uppercase and lowercase characters.

Clustering allows us to automatically group and edit different but similar values. Let's click the Cluster command on the left frame. We now have the opportunity to see all the similar values grouped together and perform a merge operation. Let's merge all text referring to the United States as USA, United Kingdom as UK, and so on. The Merge Selected & Re-Cluster button confirms the operation.

There are no more obvious clusters left. However, we still haven't resolved the issue with Netherlands. For this we will choose another clustering method, one that uses a distance metric. The country column is now fixed, and there are no more clusters to be found. However, we still have one error: the country Norway has an extra space character. See if we can find a way to fix this error.
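To see what the clustering step is doing under the hood, here is a simplified key-collision sketch in Python: values that reduce to the same "fingerprint" (lowercased, punctuation and extra whitespace removed) are grouped as one cluster. This is only an approximation of OpenRefine's actual fingerprint method, and the country values are sample data; note that it also catches the stray space in "Norway ".

```python
from collections import defaultdict

def fingerprint(value):
    """Simplified key-collision fingerprint: lowercase, drop punctuation,
    and sort the remaining tokens."""
    tokens = "".join(c for c in value.lower()
                     if c.isalnum() or c.isspace()).split()
    return " ".join(sorted(set(tokens)))

def cluster(values):
    """Group values whose fingerprints collide; keep only groups that
    contain more than one distinct spelling."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [vs for vs in groups.values() if len(set(vs)) > 1]

clusters = cluster(["USA", "usa", "U.S.A.", "Norway", "Norway ", "France"])
```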
Let's now go ahead and use the tool to clean up a numerical column, such as the credit limit. We'll first apply a numerical transformation on the column so that all the numbers are recognized. Some values in this column cannot be parsed into numbers because of the presence of alphanumeric values, such as "USD" or the dollar symbol. We can address this by applying the Numeric facet. On the left we see a chart with the distribution of the data. Let's filter down to the non-numeric values only and change the values of the fields from the expression box. This screen allows us to view the original values as well as the outcomes when the expression is applied.

We will invoke the replace function to substitute the unwanted characters with blanks. So let's remove the "USD", dollar, and comma symbols. As you can see, we can make multiple function calls in a single line; this is also known as function chaining. Now all the alphabetic characters have been removed. When we click OK, however, the tool still does not see the values as numbers. For this we will need to invoke the toNumber function to convert the values within the facet into numerical values.

So far we have tested these operations on a numeric facet.
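The chained cleanup just described (in GREL, along the lines of `value.replace("USD", "").replace("$", "").replace(",", "").toNumber()`) can be mirrored in Python like this; the credit-limit values are made up for the example.

```python
def to_credit_limit(value):
    """Strip 'USD', '$', and ',' from a cell, then convert it to a
    number -- mirroring the chained replace calls plus toNumber."""
    cleaned = value.replace("USD", "").replace("$", "").replace(",", "").strip()
    return float(cleaned)

# Sample messy values, invented for illustration.
limits = [to_credit_limit(v) for v in ["USD 117,300.00", "$96,800", "21000"]]
```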
To apply these changes to the whole data set in the project, we will copy this expression and apply it as a column transform. This ensures that all the values in the credit limit column have been converted into numbers.

Once we're done cleaning up the rest of the data, we are ready to export the file in a format of our choosing. For this example, let's export it as a CSV (comma-separated values) file and open it to view the results. All the columns now have clean data, and the data set is now ready for storage in a database or for further analysis.
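For reference, the export step produces an ordinary CSV that any downstream tool can read; a minimal Python equivalent, with sample cleaned data, looks like this.

```python
import csv
import io

def export_csv(columns, rows):
    """Write the cleaned rows out as CSV text, like OpenRefine's
    CSV export."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(columns)   # header row first
    writer.writerows(rows)
    return buffer.getvalue()

text = export_csv(["country", "creditLimit"], [["USA", 117300.0]])
```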