[Script Info] Title: [Events] Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text Dialogue: 0,0:00:06.25,0:00:08.34,Default,,0000,0000,0000,,So let's start the\NOpenRefine desktop Dialogue: 0,0:00:08.34,0:00:11.47,Default,,0000,0000,0000,,application, which opens up\Na new browser window and Dialogue: 0,0:00:11.47,0:00:14.60,Default,,0000,0000,0000,,runs on a default port on\Nthe local machine. Dialogue: 0,0:00:16.35,0:00:20.35,Default,,0000,0000,0000,,We can create a new\Nproject, open or import an Dialogue: 0,0:00:20.35,0:00:21.93,Default,,0000,0000,0000,,existing project, Dialogue: 0,0:00:22.34,0:00:25.02,Default,,0000,0000,0000,,or just change the language\Nsettings if needed. Dialogue: 0,0:00:25.02,0:00:29.04,Default,,0000,0000,0000,,Let's download the stock start zip\Nfile and browse through the file Dialogue: 0,0:00:29.04,0:00:31.38,Default,,0000,0000,0000,,from within the OpenRefine\NCreate Project menu. Dialogue: 0,0:00:32.45,0:00:33.68,Default,,0000,0000,0000,,And click next. Dialogue: 0,0:00:34.51,0:00:37.55,Default,,0000,0000,0000,,This imports the contents of\Nthe file into application Dialogue: 0,0:00:37.55,0:00:40.74,Default,,0000,0000,0000,,memory. Note that the tool\Ncan work both with compressed Dialogue: 0,0:00:40.74,0:00:42.57,Default,,0000,0000,0000,,and uncompressed files. Dialogue: 0,0:00:44.44,0:00:48.46,Default,,0000,0000,0000,,Since the stock data is in the\Nsemi-structured json format, we Dialogue: 0,0:00:48.46,0:00:52.30,Default,,0000,0000,0000,,will select the root node of the\Njson tree, which transforms the Dialogue: 0,0:00:52.30,0:00:55.20,Default,,0000,0000,0000,,data into a tabular format with\Nrows and columns. Dialogue: 0,0:00:55.72,0:00:58.30,Default,,0000,0000,0000,,We can choose to preserve empty strings Dialogue: 0,0:00:58.30,0:01:00.38,Default,,0000,0000,0000,,and perform basic cleanup such as Dialogue: 0,0:01:00.38,0:01:03.42,Default,,0000,0000,0000,,trimming of whitespaces and\Nparsing numerical values. Dialogue: 0,0:01:07.86,0:01:10.96,Default,,0000,0000,0000,,These operations can also be\Nperformed at a later stage. Dialogue: 0,0:01:13.43,0:01:15.74,Default,,0000,0000,0000,,We will provide a project\Nname and proceed with Dialogue: 0,0:01:15.74,0:01:17.24,Default,,0000,0000,0000,,project creation. Dialogue: 0,0:01:22.99,0:01:24.73,Default,,0000,0000,0000,,Once the project has\Nbeen created, Dialogue: 0,0:01:25.74,0:01:29.99,Default,,0000,0000,0000,,we can click All → Edit columns\Nand remove some of the unwanted Dialogue: 0,0:01:29.99,0:01:31.76,Default,,0000,0000,0000,,columns from the data set. Dialogue: 0,0:01:33.01,0:01:36.38,Default,,0000,0000,0000,,This helps reduce the memory\Nusage, especially for large Dialogue: 0,0:01:36.38,0:01:39.80,Default,,0000,0000,0000,,data sets. We'll also go\Nahead and reorder some of Dialogue: 0,0:01:39.80,0:01:41.45,Default,,0000,0000,0000,,the columns in the table. Dialogue: 0,0:01:49.75,0:01:53.24,Default,,0000,0000,0000,,Now the data set is lighter\Nwith only those columns that Dialogue: 0,0:01:53.24,0:01:56.45,Default,,0000,0000,0000,,will be used potentially for\Nfurther analysis. Just like Dialogue: 0,0:01:56.45,0:01:59.46,Default,,0000,0000,0000,,in a spreadsheet, we could\Nchoose to sort different Dialogue: 0,0:01:59.46,0:02:02.82,Default,,0000,0000,0000,,columns. Here we'll go ahead\Nand sort the volume column Dialogue: 0,0:02:02.82,0:02:04.16,Default,,0000,0000,0000,,in the descending order. Dialogue: 0,0:02:15.51,0:02:18.53,Default,,0000,0000,0000,,Financial data sets, such as\Nstock prices, are constantly Dialogue: 0,0:02:18.53,0:02:22.57,Default,,0000,0000,0000,,changing. What if our data is\Noutdated, and we wanted to fetch Dialogue: 0,0:02:22.57,0:02:25.93,Default,,0000,0000,0000,,the latest data from the stock\Nmarket? Using OpenRefine, one Dialogue: 0,0:02:25.93,0:02:28.61,Default,,0000,0000,0000,,could augment the data using\Nonline web services. Dialogue: 0,0:02:29.65,0:02:32.53,Default,,0000,0000,0000,,However, calling some of these\Nweb services could be very Dialogue: 0,0:02:32.53,0:02:35.07,Default,,0000,0000,0000,,time-consuming. So to\Ndemonstrate this feature, Dialogue: 0,0:02:35.07,0:02:38.58,Default,,0000,0000,0000,,let's work with a smaller\Nsubset of the stocks data and Dialogue: 0,0:02:38.58,0:02:40.64,Default,,0000,0000,0000,,load it into a new project. Dialogue: 0,0:02:40.64,0:02:44.79,Default,,0000,0000,0000,,Look for the file stocks_demo.csv in the\Nsession folder, Dialogue: 0,0:02:44.79,0:02:46.94,Default,,0000,0000,0000,,and load it in OpenRefine. Dialogue: 0,0:02:50.45,0:02:53.02,Default,,0000,0000,0000,,We will also create a new\NOpenRefine project Dialogue: 0,0:02:53.02,0:02:54.86,Default,,0000,0000,0000,,and call it StocksDemo. Dialogue: 0,0:02:59.87,0:03:04.50,Default,,0000,0000,0000,,On the ticker column menu, click\NEdit column → Add columns by Dialogue: 0,0:03:04.50,0:03:09.40,Default,,0000,0000,0000,,fetching URLs. This opens up an\Nexpression box where we could Dialogue: 0,0:03:09.40,0:03:13.49,Default,,0000,0000,0000,,enter the URL of an online web\Nservice which could be invoked Dialogue: 0,0:03:13.49,0:03:15.04,Default,,0000,0000,0000,,to augment the data. Dialogue: 0,0:03:15.62,0:03:19.73,Default,,0000,0000,0000,,This online service provides an\NAPI that returns slightly Dialogue: 0,0:03:19.73,0:03:21.10,Default,,0000,0000,0000,,delayed stock prices. Dialogue: 0,0:03:22.25,0:03:25.27,Default,,0000,0000,0000,,A good practice is to test\Nthe web service from a Dialogue: 0,0:03:25.27,0:03:28.91,Default,,0000,0000,0000,,browser client with a known\Nticker symbol and validate Dialogue: 0,0:03:28.91,0:03:30.18,Default,,0000,0000,0000,,the service response. Dialogue: 0,0:03:34.63,0:03:38.65,Default,,0000,0000,0000,,We will add this URL and append\Nthe ticker symbol as a variable Dialogue: 0,0:03:38.65,0:03:40.14,Default,,0000,0000,0000,,parameter for each row. Dialogue: 0,0:03:41.23,0:03:43.52,Default,,0000,0000,0000,,To do this, copy the URL into Dialogue: 0,0:03:43.52,0:03:47.89,Default,,0000,0000,0000,,the expression box, and insert\Ncodes as shown here. Dialogue: 0,0:03:56.72,0:04:00.28,Default,,0000,0000,0000,,Now the URL is different for\Neach row with the ticker symbol Dialogue: 0,0:04:00.28,0:04:01.47,Default,,0000,0000,0000,,as the symbol parameter. Dialogue: 0,0:04:03.61,0:04:06.43,Default,,0000,0000,0000,,We'll give a new name to the\Ncolumn and click OK. Dialogue: 0,0:04:08.44,0:04:11.30,Default,,0000,0000,0000,,This could take a few minutes\Nto run depending on the speed Dialogue: 0,0:04:11.30,0:04:12.87,Default,,0000,0000,0000,,of the web service. Dialogue: 0,0:04:17.01,0:04:19.82,Default,,0000,0000,0000,,We now have a new column with\Nthe latest stock data. Dialogue: 0,0:04:20.68,0:04:24.35,Default,,0000,0000,0000,,Notice that the results are in\Nthe XML format and need to be Dialogue: 0,0:04:24.35,0:04:28.61,Default,,0000,0000,0000,,parsed. Please refer to the\NOpenRefine tutorials for some Dialogue: 0,0:04:28.61,0:04:30.99,Default,,0000,0000,0000,,help with parsing XML and json\Nresults. Dialogue: 0,0:04:30.99,0:04:35.40,Default,,0000,0000,0000,,For now, here is the\Nsyntax for extracting the stock Dialogue: 0,0:04:35.40,0:04:37.06,Default,,0000,0000,0000,,price from the XML document. Dialogue: 0,0:04:57.44,0:05:00.91,Default,,0000,0000,0000,,After a quick review of the\Nresults, we are ready to export Dialogue: 0,0:05:00.91,0:05:03.80,Default,,0000,0000,0000,,the data to another data\Nprocessing tool in our pipeline. Dialogue: 0,0:05:07.95,0:05:12.07,Default,,0000,0000,0000,,In the next example we will use\NOpenRefine for cleaning up a Dialogue: 0,0:05:12.07,0:05:13.44,Default,,0000,0000,0000,,sample messy data file. Dialogue: 0,0:05:15.82,0:05:19.54,Default,,0000,0000,0000,,So let's download the \NCustomers-messy.zip file Dialogue: 0,0:05:19.54,0:05:21.53,Default,,0000,0000,0000,,from the session folder and store it on Dialogue: 0,0:05:21.53,0:05:25.74,Default,,0000,0000,0000,,the local computer. The first\Nstep is to visually inspect Dialogue: 0,0:05:25.74,0:05:29.66,Default,,0000,0000,0000,,the raw data file using a text\Neditor or Excel and get Dialogue: 0,0:05:29.66,0:05:32.61,Default,,0000,0000,0000,,familiar with the attributes\Nand contents of the data. Dialogue: 0,0:05:35.03,0:05:38.31,Default,,0000,0000,0000,,As before, we will load this\Ndata file into OpenRefine and Dialogue: 0,0:05:38.31,0:05:39.50,Default,,0000,0000,0000,,create a new project. Dialogue: 0,0:05:41.99,0:05:45.39,Default,,0000,0000,0000,,Once the data has been\Nloaded, we'll do a cursory Dialogue: 0,0:05:45.39,0:05:46.75,Default,,0000,0000,0000,,check on the columns, Dialogue: 0,0:05:49.26,0:05:51.100,Default,,0000,0000,0000,,and change the character\Nencoding to UTF 8. Dialogue: 0,0:05:52.69,0:05:55.91,Default,,0000,0000,0000,,If needed, we could also change\Noptions such as column Dialogue: 0,0:05:55.91,0:05:57.52,Default,,0000,0000,0000,,separators or add a missing Dialogue: 0,0:05:57.52,0:06:01.59,Default,,0000,0000,0000,,header row. In our case, we'll\Nkeep the default settings. Dialogue: 0,0:06:02.92,0:06:06.13,Default,,0000,0000,0000,,So let's give the clean-up\Nproject a name and click on Dialogue: 0,0:06:06.13,0:06:07.99,Default,,0000,0000,0000,,the Create Project button. Dialogue: 0,0:06:09.55,0:06:12.98,Default,,0000,0000,0000,,This creates a new OpenRefine\Nproject with a copy of the Dialogue: 0,0:06:12.98,0:06:17.66,Default,,0000,0000,0000,,original data set. The next step\Nin the cleanup process is to Dialogue: 0,0:06:17.66,0:06:21.08,Default,,0000,0000,0000,,visually inspect the data and\Ncheck each column for errors, Dialogue: 0,0:06:21.08,0:06:23.77,Default,,0000,0000,0000,,formatting issues, or missing values. Dialogue: 0,0:06:24.02,0:06:26.85,Default,,0000,0000,0000,,We can perform common\Noperations like remove Dialogue: 0,0:06:26.85,0:06:30.95,Default,,0000,0000,0000,,whitespace from text. For this,\Nlet's click on a column header. Dialogue: 0,0:06:30.95,0:06:33.26,Default,,0000,0000,0000,,Let's say contact first name, Dialogue: 0,0:06:33.47,0:06:36.53,Default,,0000,0000,0000,,Edit cells → Trim leading and Dialogue: 0,0:06:36.53,0:06:40.11,Default,,0000,0000,0000,,trailing whitespace. We can\Nperform the same operation Dialogue: 0,0:06:40.11,0:06:42.08,Default,,0000,0000,0000,,on multiple columns. Dialogue: 0,0:06:42.93,0:06:45.53,Default,,0000,0000,0000,,We could also do basic\Nconcatenation operations like Dialogue: 0,0:06:45.53,0:06:49.23,Default,,0000,0000,0000,,in a spreadsheet. Let's say we\Nneed to combine the first and Dialogue: 0,0:06:49.23,0:06:53.00,Default,,0000,0000,0000,,last names of the contact into\None column. For this we'll Dialogue: 0,0:06:53.00,0:06:54.62,Default,,0000,0000,0000,,click on Edit column → Dialogue: 0,0:06:55.38,0:06:57.53,Default,,0000,0000,0000,,Add column based on this column. Dialogue: 0,0:06:59.00,0:07:01.79,Default,,0000,0000,0000,,On the expression box, we\Nwill take the current value Dialogue: 0,0:07:01.79,0:07:03.58,Default,,0000,0000,0000,,of the contactFirstName. Dialogue: 0,0:07:04.30,0:07:08.48,Default,,0000,0000,0000,,Add a space followed by the\Nvalue of the contact last name. Dialogue: 0,0:07:15.85,0:07:18.05,Default,,0000,0000,0000,,Let's check to see the results Dialogue: 0,0:07:18.52,0:07:21.77,Default,,0000,0000,0000,,and assign a name to the new\Ncolumn, such as Dialogue: 0,0:07:21.77,0:07:23.87,Default,,0000,0000,0000,,contactFullName. And hit OK. Dialogue: 0,0:07:29.87,0:07:32.83,Default,,0000,0000,0000,,The newly-created column has\Nbeen added to our data set. Dialogue: 0,0:07:33.99,0:07:37.35,Default,,0000,0000,0000,,A nice thing about OpenRefine is\Nthat you can undo your past Dialogue: 0,0:07:37.35,0:07:41.20,Default,,0000,0000,0000,,operations, so if you made a\Nmistake you can go back to a Dialogue: 0,0:07:41.20,0:07:44.45,Default,,0000,0000,0000,,previous step and correct it.\NThis makes it really easy to Dialogue: 0,0:07:44.45,0:07:47.05,Default,,0000,0000,0000,,make incremental changes without\Nlosing time over Dialogue: 0,0:07:47.05,0:07:49.73,Default,,0000,0000,0000,,saving multiple copies of the data. Dialogue: 0,0:07:52.85,0:07:55.70,Default,,0000,0000,0000,,Let's go ahead and remove the\Ncolumn that we just created. Dialogue: 0,0:07:57.30,0:08:00.02,Default,,0000,0000,0000,,Just like with spreadsheet\Napplications, we could also Dialogue: 0,0:08:00.02,0:08:02.98,Default,,0000,0000,0000,,perform other simple, column-based \Noperations such as Dialogue: 0,0:08:02.98,0:08:06.82,Default,,0000,0000,0000,,changing the case of the text to\Ntitle, upper, or lowercase. Dialogue: 0,0:08:17.75,0:08:21.46,Default,,0000,0000,0000,,Faceting is a powerful feature\Nin many data cleaning tools. The Dialogue: 0,0:08:21.46,0:08:25.50,Default,,0000,0000,0000,,facet summarizes the cells in\Nthat column to give you a bigger Dialogue: 0,0:08:25.50,0:08:29.21,Default,,0000,0000,0000,,picture on the column. It also\Nallows for filtering down to Dialogue: 0,0:08:29.21,0:08:32.77,Default,,0000,0000,0000,,just a subset of the rows that\Nyou want to change in bulk. Dialogue: 0,0:08:32.77,0:08:35.10,Default,,0000,0000,0000,,Let's see how this works. Dialogue: 0,0:08:35.10,0:08:37.63,Default,,0000,0000,0000,,Let's create a text facet \Non the country column Dialogue: 0,0:08:37.63,0:08:40.18,Default,,0000,0000,0000,,by clicking on the column's \Ndropdown menu and Dialogue: 0,0:08:40.18,0:08:43.76,Default,,0000,0000,0000,,invoking Facet → Text facet. We\Ncan see that Dialogue: 0,0:08:43.76,0:08:45.90,Default,,0000,0000,0000,,the country names\Nare not consistent. Dialogue: 0,0:08:47.22,0:08:51.32,Default,,0000,0000,0000,,For instance, United States\Nis spelled as US, USA, or Dialogue: 0,0:08:51.32,0:08:54.04,Default,,0000,0000,0000,,United States, with some\Nextra characters. Dialogue: 0,0:08:55.69,0:08:59.64,Default,,0000,0000,0000,,Also, Netherlands is spelled\Nboth with and without the Dialogue: 0,0:08:59.64,0:09:03.59,Default,,0000,0000,0000,,determiner "the." France is\Nspelled using the uppercase as Dialogue: 0,0:09:03.59,0:09:05.35,Default,,0000,0000,0000,,well as lowercase characters. Dialogue: 0,0:09:07.13,0:09:09.97,Default,,0000,0000,0000,,Clustering allows us to\Nautomatically group and edit Dialogue: 0,0:09:09.97,0:09:13.16,Default,,0000,0000,0000,,different but similar values.\NLet's click on the Cluster Dialogue: 0,0:09:13.16,0:09:14.94,Default,,0000,0000,0000,,command on the left frame. Dialogue: 0,0:09:15.89,0:09:19.28,Default,,0000,0000,0000,,We now have the opportunity to\Nsee all the similar words Dialogue: 0,0:09:19.28,0:09:21.20,Default,,0000,0000,0000,,grouped together and perform a Dialogue: 0,0:09:21.20,0:09:25.48,Default,,0000,0000,0000,,merge operation. Let's merge all\Ntext referring to the United Dialogue: 0,0:09:25.48,0:09:29.50,Default,,0000,0000,0000,,States as USA, United Kingdom as\NUK, and so on. Dialogue: 0,0:09:33.23,0:09:36.03,Default,,0000,0000,0000,,The Merge and Re-Cluster\Nbutton confirms the Dialogue: 0,0:09:36.03,0:09:39.18,Default,,0000,0000,0000,,operation. There are no more\Nobvious clusters that are Dialogue: 0,0:09:39.18,0:09:41.98,Default,,0000,0000,0000,,left. However, we still\Nhaven't resolved the issue Dialogue: 0,0:09:41.98,0:09:44.78,Default,,0000,0000,0000,,with Netherlands. For this\Nwe will choose another Dialogue: 0,0:09:44.78,0:09:47.23,Default,,0000,0000,0000,,clustering operation which\Nuses a distance metric. Dialogue: 0,0:09:56.41,0:09:59.12,Default,,0000,0000,0000,,The country column is now fixed,\Nand there are no more Dialogue: 0,0:09:59.12,0:10:02.39,Default,,0000,0000,0000,,clusters to be found. However, \Nwe still have one error. Dialogue: 0,0:10:02.39,0:10:05.72,Default,,0000,0000,0000,,The country Norway has a space\Ncharacter. See if we can find Dialogue: 0,0:10:05.72,0:10:07.88,Default,,0000,0000,0000,,a way to fix this error. Dialogue: 0,0:10:09.69,0:10:13.73,Default,,0000,0000,0000,,Let's now go ahead and use the\Ntool to clean up a numerical Dialogue: 0,0:10:13.73,0:10:17.35,Default,,0000,0000,0000,,column like the credit limit.\NWe'll first apply a numerical Dialogue: 0,0:10:17.35,0:10:20.51,Default,,0000,0000,0000,,transformation on the column so\Nthat all the numbers are Dialogue: 0,0:10:20.51,0:10:24.62,Default,,0000,0000,0000,,recognized. Some values in this\Ncolumn cannot be parsed into Dialogue: 0,0:10:24.62,0:10:27.78,Default,,0000,0000,0000,,numbers. This is because of the\Npresence of alphanumeric values, Dialogue: 0,0:10:27.78,0:10:29.100,Default,,0000,0000,0000,,such as the USD or the dollar Dialogue: 0,0:10:29.100,0:10:34.13,Default,,0000,0000,0000,,symbol. We can correct this by\Napplying the Numeric facet. Dialogue: 0,0:10:34.75,0:10:38.03,Default,,0000,0000,0000,,On the left we see a chart with\Nthe distribution of the data. Dialogue: 0,0:10:38.90,0:10:42.87,Default,,0000,0000,0000,,Let's filter down to the non-numeric \Nvalues only and change Dialogue: 0,0:10:42.87,0:10:46.12,Default,,0000,0000,0000,,the values of the fields from\Nthe expression box. Dialogue: 0,0:10:52.93,0:10:55.72,Default,,0000,0000,0000,,This screen allows us to\Nview the original values as Dialogue: 0,0:10:55.72,0:10:58.23,Default,,0000,0000,0000,,well as the outcomes when\Nthe expression is applied. Dialogue: 0,0:11:02.67,0:11:06.10,Default,,0000,0000,0000,,We will involve the replace\Nfunction to replace or Dialogue: 0,0:11:06.10,0:11:09.15,Default,,0000,0000,0000,,substitute the unwanted\Ncharacters with blanks. So let's Dialogue: 0,0:11:09.15,0:11:12.58,Default,,0000,0000,0000,,remove the USD, the dollar, and\Nthe comma symbols. Dialogue: 0,0:11:14.01,0:11:17.20,Default,,0000,0000,0000,,As you can see, we can make\Nmultiple function calls in a Dialogue: 0,0:11:17.20,0:11:19.25,Default,,0000,0000,0000,,single line. This is also known Dialogue: 0,0:11:19.25,0:11:22.71,Default,,0000,0000,0000,,as operator chaining. Now\Nall the Alpha characters Dialogue: 0,0:11:22.71,0:11:24.38,Default,,0000,0000,0000,,have been removed. Dialogue: 0,0:11:26.97,0:11:30.62,Default,,0000,0000,0000,,When we click OK, the tool still\Ndoes not see them as numbers. Dialogue: 0,0:11:31.28,0:11:35.30,Default,,0000,0000,0000,,For this we will need to invoke\Nthe toNumber function to Dialogue: 0,0:11:35.30,0:11:38.32,Default,,0000,0000,0000,,convert the values within the\Nfacet into numerical values. Dialogue: 0,0:11:48.17,0:11:51.80,Default,,0000,0000,0000,,So far we have tested these\Noperations on a numeric facet. Dialogue: 0,0:11:52.42,0:11:55.84,Default,,0000,0000,0000,,To apply these changes to the\Ndata set in the project we will Dialogue: 0,0:11:55.84,0:11:59.06,Default,,0000,0000,0000,,copy this expression and apply\Nto a column transform. This Dialogue: 0,0:11:59.06,0:12:02.27,Default,,0000,0000,0000,,ensures that all the values in\Nthe credit limit column have Dialogue: 0,0:12:02.27,0:12:04.31,Default,,0000,0000,0000,,been converted into numbers. Dialogue: 0,0:12:13.10,0:12:16.54,Default,,0000,0000,0000,,Once we're done with cleaning up\Nthe rest of the data, we are Dialogue: 0,0:12:16.54,0:12:19.86,Default,,0000,0000,0000,,ready to export the file into a\Nformat of our choosing. For this Dialogue: 0,0:12:19.86,0:12:23.63,Default,,0000,0000,0000,,example, let's export it as a\NCSV or a comma-separated file Dialogue: 0,0:12:23.63,0:12:25.56,Default,,0000,0000,0000,,and open it to view the results. Dialogue: 0,0:12:30.43,0:12:34.23,Default,,0000,0000,0000,,All the columns now have clean\Ndata. The data set is now Dialogue: 0,0:12:34.23,0:12:37.68,Default,,0000,0000,0000,,ready for storage in a\Ndatabase or for further analysis.