So let's start the OpenRefine desktop application, which opens a new browser window and runs on a default port on the local machine. We can create a new project, open or import an existing project, or change the language settings if needed. Let's download the stocks data zip file, browse to it from within the OpenRefine Create Project menu, and click Next. This imports the contents of the file into application memory. Note that the tool works with both compressed and uncompressed files. Since the stock data is in the semi-structured JSON format, we will select the root node of the JSON tree, which transforms the data into a tabular format with rows and columns. We can choose to preserve empty strings and perform basic cleanup such as trimming whitespace and parsing numerical values. These operations can also be performed at a later stage. We will provide a project name and proceed with project creation. Once the project has been created, we can click All → Edit columns and remove some of the unwanted columns from the data set. This helps reduce memory usage, especially for large data sets. We'll also reorder some of the columns in the table. Now the data set is lighter, with only those columns that will potentially be used for further analysis. Just like in a spreadsheet, we can sort on different columns. Here we'll sort the volume column in descending order. Financial data sets, such as stock prices, are constantly changing. What if our data is outdated and we want to fetch the latest data from the stock market? Using OpenRefine, we can augment the data using online web services. However, calling some of these web services can be very time-consuming, so to demonstrate this feature, let's work with a smaller subset of the stocks data and load it into a new project. Look for the file stocks_demo.csv in the session folder, and load it in OpenRefine.
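If you are curious what this import step does conceptually, here is a minimal Python sketch of it. The JSON payload and field names below are invented for illustration; the real stocks file will have its own schema. The sketch mirrors selecting the root node, trimming whitespace, and parsing numeric values at import time.

```python
import json

# Hypothetical sample mirroring the stocks JSON: the children of the
# selected root node become rows; their fields become columns.
raw = '''
{"stocks": [
    {"ticker": "IBM",  "close": "142.5 ", "volume": "3120000"},
    {"ticker": "MSFT", "close": " 310.2", "volume": "2250000"}
]}
'''

records = json.loads(raw)["stocks"]  # select the root node of the JSON tree

# Basic cleanup comparable to OpenRefine's import options:
# trim whitespace and parse numerical values.
rows = [
    {
        "ticker": r["ticker"].strip(),
        "close": float(r["close"].strip()),
        "volume": int(r["volume"].strip()),
    }
    for r in records
]

# Sort by volume in descending order, like the column sort in the tool.
rows.sort(key=lambda r: r["volume"], reverse=True)
```

As in OpenRefine, the cleanup steps here could just as well be deferred and applied after the raw rows are loaded.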
We will create a new OpenRefine project and call it StocksDemo. On the ticker column menu, click Edit column → Add column by fetching URLs. This opens an expression box where we can enter the URL of an online web service to invoke to augment the data. This online service provides an API that returns slightly delayed stock prices. A good practice is to test the web service from a browser with a known ticker symbol and validate the service response. We will add this URL and append the ticker symbol as a variable parameter for each row. To do this, copy the URL into the expression box and insert the code as shown here. Now the URL is different for each row, with the ticker symbol as the symbol parameter. We'll give the new column a name and click OK. This could take a few minutes to run, depending on the speed of the web service. We now have a new column with the latest stock data. Notice that the results are in XML format and need to be parsed. Please refer to the OpenRefine tutorials for help with parsing XML and JSON results. For now, here is the syntax for extracting the stock price from the XML document. After a quick review of the results, we are ready to export the data to another data processing tool in our pipeline.

In the next example we will use OpenRefine to clean up a sample messy data file. So let's download the Customers-messy.zip file from the session folder and store it on the local computer. The first step is to visually inspect the raw data file using a text editor or Excel and get familiar with the attributes and contents of the data. As before, we will load this data file into OpenRefine and create a new project. Once the data has been loaded, we'll do a cursory check on the columns and change the character encoding to UTF-8. If needed, we could also change options such as column separators or add a missing header row. In our case, we'll keep the default settings.
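Before moving on, the stocks step above, building a per-row URL and extracting the price from the XML response, can be sketched in Python. The base URL, tag names, and payload below are placeholders, not the actual service used in the session, and a real service will define its own schema.

```python
import xml.etree.ElementTree as ET

# Per-row URL construction, analogous to a GREL expression like
# "https://example.com/quote?symbol=" + value
# (the base URL here is a placeholder for the real service endpoint).
def quote_url(ticker: str) -> str:
    return "https://example.com/quote?symbol=" + ticker

# Hypothetical XML payload, standing in for the fetched column value.
response = "<quote><symbol>IBM</symbol><price>142.50</price></quote>"

# Equivalent of parsing the stored XML response and selecting the
# price element, then converting it to a number.
price = float(ET.fromstring(response).findtext("price"))
```

In OpenRefine itself the same extraction is done with a GREL expression on the fetched column rather than in Python, but the shape of the operation is the same: parse the XML, select the element, convert to a number.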
So let's give the clean-up project a name and click the Create Project button. This creates a new OpenRefine project with a copy of the original data set. The next step in the cleanup process is to visually inspect the data and check each column for errors, formatting issues, or missing values. We can perform common operations such as removing whitespace from text. For this, let's click on a column header, say contactFirstName, and choose Edit cells → Common transforms → Trim leading and trailing whitespace. We can perform the same operation on multiple columns. We can also do basic concatenation, as in a spreadsheet. Let's say we need to combine the first and last names of the contact into one column. For this we'll click Edit column → Add column based on this column. In the expression box, we will take the current value of contactFirstName and add a space followed by the value of contactLastName. Let's check the results, assign a name to the new column, such as contactFullName, and hit OK. The newly created column has been added to our data set. A nice thing about OpenRefine is that you can undo your past operations, so if you make a mistake you can go back to a previous step and correct it. This makes it easy to make incremental changes without spending time saving multiple copies of the data. Let's go ahead and remove the column that we just created. Just like with spreadsheet applications, we can also perform other simple column-based operations, such as changing the case of the text to title, upper, or lowercase. Faceting is a powerful feature in many data cleaning tools. A facet summarizes the cells in a column to give you a bigger picture of that column. It also allows filtering down to just the subset of rows that you want to change in bulk. Let's see how this works. Let's create a text facet on the country column by clicking on the column's dropdown menu and invoking Facet → Text facet.
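The trim-and-concatenate steps above can be sketched in Python as follows. The sample row is made up for illustration; the column names match the ones used in the walkthrough.

```python
# One hypothetical row from the customers data set.
row = {"contactFirstName": "  Ada ", "contactLastName": "Lovelace"}

# Edit cells -> Common transforms -> Trim leading and trailing whitespace
first = row["contactFirstName"].strip()
last = row["contactLastName"].strip()

# Edit column -> Add column based on this column, with an expression
# equivalent to: value + " " + cells["contactLastName"].value
row["contactFullName"] = first + " " + last
```

In OpenRefine the concatenation is written as a GREL expression and applied to every row at once; the undo history then lets you remove the derived column with a single step if you change your mind.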
We can see that the country names are not consistent. For instance, United States is spelled as US, USA, or United States, with some extra characters. Also, Netherlands is spelled both with and without the determiner "the," and France is spelled with uppercase as well as lowercase characters. Clustering allows us to automatically group and edit different but similar values. Let's click the Cluster button on the left. We now see all the similar values grouped together and can perform a merge operation. Let's merge all text referring to the United States as USA, United Kingdom as UK, and so on. The Merge Selected & Re-Cluster button confirms the operation. There are no more obvious clusters left. However, we still haven't resolved the issue with Netherlands. For this we will choose another clustering method, nearest neighbor, which uses a distance metric. The country column is now fixed, and there are no more clusters to be found. However, we still have one error: the country Norway has a stray space character. See if you can find a way to fix this error. Let's now use the tool to clean up a numerical column, the credit limit. We'll first apply a numeric transformation to the column so that all the numbers are recognized. Some values in this column cannot be parsed into numbers because of the presence of alphanumeric values, such as "USD" or the dollar symbol. We can correct this by applying a numeric facet. On the left we see a chart with the distribution of the data. Let's filter down to the non-numeric values only and change those values from the expression box. This screen lets us view the original values as well as the outcomes when the expression is applied. We will invoke the replace function to substitute the unwanted characters with blanks. So let's remove the "USD", dollar, and comma symbols. As you can see, we can make multiple function calls in a single line.
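The chained replace calls work the same way in Python as in GREL, so here is a small sketch of that expression. The sample value is invented, but the substitutions are exactly the ones described above.

```python
# A hypothetical unparseable credit-limit value from the messy data.
value = "USD$1,250.00"

# Mirrors the GREL expression
#   value.replace("USD", "").replace("$", "").replace(",", "")
# Each call returns a new string, so the calls chain left to right.
cleaned = value.replace("USD", "").replace("$", "").replace(",", "")
```

Note that `cleaned` is still a string at this point; as the next step shows, it must be explicitly converted before the tool treats it as a number.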
This is also known as function chaining. Now all the alphabetic characters have been removed, but when we click OK, the tool still does not see the values as numbers. For this we need to invoke the toNumber function to convert the values within the facet into numerical values. So far we have tested these operations on a numeric facet. To apply these changes to the data set in the project, we will copy this expression and apply it as a column transform (Edit cells → Transform). This ensures that all the values in the credit limit column have been converted into numbers. Once we're done cleaning up the rest of the data, we are ready to export the file in a format of our choosing. For this example, let's export it as a CSV, or comma-separated values, file and open it to view the results. All the columns now have clean data. The data set is now ready for storage in a database or for further analysis.
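To round out the example, the final conversion and export can be sketched in Python as well. The rows and column names here are illustrative stand-ins for the cleaned customers data.

```python
import csv
import io

# Hypothetical cleaned rows; creditLimit values are still strings,
# just as they are after the replace() chain in the facet.
rows = [
    {"customerName": "Acme Corp", "creditLimit": "1250.00"},
    {"customerName": "Globex",    "creditLimit": "98000"},
]

# Equivalent of applying toNumber() as a column transform.
for r in rows:
    r["creditLimit"] = float(r["creditLimit"])

# Export the cleaned table as CSV (written to a string buffer here;
# a real pipeline would write to a file instead).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["customerName", "creditLimit"])
writer.writeheader()
writer.writerows(rows)
```

Once exported, the CSV can be loaded into a database or handed to the next tool in the pipeline, which is exactly the hand-off described above.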