So let's start the OpenRefine desktop application, which opens up a new browser window and runs on a default port on the local machine.

We can create a new project, open or import an existing project, or just change the language settings if needed. Let's download the stock start zip file and browse to the file from within the OpenRefine Create Project menu. And click Next.

This imports the contents of the file into application memory. Note that the tool can work with both compressed and uncompressed files.

Since the stock data is in the semi-structured JSON format, we will select the root node of the JSON tree, which transforms the data into a tabular format with rows and columns. We can choose to preserve empty strings and perform basic cleanup, such as trimming whitespace and parsing numerical values. These operations can also be performed at a later stage.

We will provide a project name and proceed with project creation.

Once the project has been created, we can click All → Edit columns and remove some of the unwanted columns from the data set. This helps reduce memory usage, especially for large data sets. We'll also go ahead and reorder some of the columns in the table.

Now the data set is lighter, with only those columns that will potentially be used for further analysis. Just like in a spreadsheet, we could choose to sort different columns. Here we'll go ahead and sort the volume column in descending order.

Financial data sets, such as stock prices, are constantly changing. What if our data is outdated and we want to fetch the latest data from the stock market? Using OpenRefine, we can augment the data using online web services. However, calling some of these web services can be very time-consuming, so to demonstrate this feature, let's work with a smaller subset of the stocks data and load it into a new project. Look for the file stocks_demo.csv in the session folder and load it in OpenRefine.

We will also create a new OpenRefine project and call it StocksDemo.
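Before we build URLs from this data, it helps to picture its shape. A minimal, purely hypothetical stocks_demo.csv might look like this (the actual session file may contain different columns; what matters for the next step is that there is a ticker column):

    ticker,company
    AAPL,Apple Inc.
    MSFT,Microsoft Corp.

Each row's ticker value is what we will splice into the web service URL.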
On the ticker column menu, click Edit column → Add column by fetching URLs. This opens up an expression box where we can enter the URL of an online web service that can be invoked to augment the data. This online service provides an API that returns slightly delayed stock prices.

A good practice is to test the web service from a browser client with a known ticker symbol and validate the service response.

We will add this URL and append the ticker symbol as a variable parameter for each row. To do this, copy the URL into the expression box and insert the code as shown here.

Now the URL is different for each row, with the ticker symbol as the symbol parameter. We'll give a new name to the column and click OK. This could take a few minutes to run, depending on the speed of the web service.

We now have a new column with the latest stock data. Notice that the results are in the XML format and need to be parsed. Please refer to the OpenRefine tutorials for some help with parsing XML and JSON results. For now, here is the syntax for extracting the stock price from the XML document; both expressions from this example are recapped below.

After a quick review of the results, we are ready to export the data to another data processing tool in our pipeline.

In the next example, we will use OpenRefine to clean up a sample messy data file. So let's download the Customers-messy.zip file from the session folder and store it on the local computer. The first step is to visually inspect the raw data file using a text editor or Excel and get familiar with the attributes and contents of the data.

As before, we will load this data file into OpenRefine and create a new project. Once the data has been loaded, we'll do a cursory check on the columns and change the character encoding to UTF-8. If needed, we could also change options such as column separators or add a missing header row. In our case, we'll keep the default settings.

So let's give the clean-up project a name and click on the Create Project button. This creates a new OpenRefine project with a copy of the original data set.
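Before moving on to the cleanup steps, here is the promised recap of the two expressions from the stocks example. Both are sketches: the endpoint URL is hypothetical, and the exact parsing functions vary by OpenRefine version, so check the GREL reference rather than treating these as exact syntax. Building the per-row request URL from the ticker column:

    "https://example-quotes.com/api?symbol=" + value

Extracting the price from the fetched result, assuming the XML response contains a price element:

    value.parseXml().select("price")[0].ownText()

Here, value is the cell the expression runs on, and parseXml, select, and ownText are GREL functions.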
The next step in the cleanup process is to visually inspect the data and check each column for errors, formatting issues, or missing values.

We can perform common operations like removing whitespace from text. For this, let's click on a column header, say contactFirstName, and choose Edit cells → Trim leading and trailing whitespace. We can perform the same operation on multiple columns.

We could also do basic concatenation operations, like in a spreadsheet. Let's say we need to combine the first and last names of the contact into one column. For this, we'll click on Edit column → Add column based on this column. In the expression box, we will take the current value of contactFirstName and add a space followed by the value of the contact's last name.

Let's check the results, assign a name to the new column, such as contactFullName, and hit OK. The newly created column has been added to our data set.

A nice thing about OpenRefine is that you can undo your past operations, so if you made a mistake you can go back to a previous step and correct it. This makes it really easy to make incremental changes without losing time over saving multiple copies of the data.

Let's go ahead and remove the column that we just created.

Just like with spreadsheet applications, we could also perform other simple column-based operations, such as changing the case of the text to title, upper, or lowercase.

Faceting is a powerful feature in many data cleaning tools. A facet summarizes the cells in a column to give you a bigger picture of its contents. It also allows for filtering down to just the subset of rows that you want to change in bulk. Let's see how this works. Let's create a text facet on the country column by clicking on the column's dropdown menu and invoking Facet → Text facet. We can see that the country names are not consistent.

For instance, United States is spelled as US, USA, or United States, with some extra characters. Also, Netherlands is spelled both with and without the determiner "the." France is spelled using uppercase as well as lowercase characters.
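For reference, the menu operations we just used have GREL equivalents that can be typed into the expression box. A minimal sketch, assuming the last-name column is named contactLastName (check the actual header in your project):

    value.trim()
    value + " " + cells["contactLastName"].value
    value.toTitlecase()

The first expression trims leading and trailing whitespace, the second concatenates the first and last names with a space, and the third normalizes casing, which would also reconcile "France" and "france" in the country column.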
Clustering allows us to automatically group and edit different but similar values. Let's click on the Cluster button on the left frame. We now have the opportunity to see all the similar words grouped together and perform a merge operation. Let's merge all text referring to the United States as USA, United Kingdom as UK, and so on.

The Merge and Re-Cluster button confirms the operation. There are no more obvious clusters left. However, we still haven't resolved the issue with Netherlands. For this, we will choose another clustering method, which uses a distance metric.

The country column is now fixed, and there are no more clusters to be found. However, we still have one error: the entry for Norway has a stray space character. See if you can find a way to fix this error.

Let's now go ahead and use the tool to clean up a numerical column, like the credit limit. We'll first apply a numerical transformation on the column so that all the numbers are recognized. Some values in this column cannot be parsed into numbers because of the presence of alphanumeric values, such as USD or the dollar symbol. We can track down these values by applying a numeric facet.

On the left, we see a chart with the distribution of the data. Let's filter down to the non-numeric values only and edit the values of those fields from the expression box. This screen allows us to view the original values as well as the outcomes when the expression is applied.

We will invoke the replace function to substitute the unwanted characters with blanks. So let's remove the USD, dollar, and comma symbols. As you can see, we can make multiple function calls in a single line. This is also known as function chaining. Now all the alphabetic characters have been removed.

When we click OK, the tool still does not see them as numbers. For this, we will need to invoke the toNumber function to convert the values within the facet into numerical values.

So far we have tested these operations on a numeric facet. To apply these changes to the data set in the project, we will copy this expression and apply it as a column transform.
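Combined into a single expression, the transform looks roughly like this (a sketch; the exact tokens to strip depend on the data):

    value.replace("USD", "").replace("$", "").replace(",", "").toNumber()

Each replace call substitutes one unwanted token with a blank, and toNumber converts the cleaned string into a numeric value.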
This ensures that all the values in the credit limit column have been converted into numbers.

Once we're done with cleaning up the rest of the data, we are ready to export the file into a format of our choosing. For this example, let's export it as a CSV, or comma-separated, file and open it to view the results.

All the columns now have clean data. The data set is now ready for storage in a database or for further analysis.