
vimeo.com/.../508797388

  • 0:06 - 0:08
    So let's start the
    OpenRefine desktop
  • 0:08 - 0:11
    application, which opens up
    a new browser window and
  • 0:11 - 0:15
    runs on a default port on
    the local machine.
  • 0:16 - 0:20
    We can create a new
    project, open or import an
  • 0:20 - 0:22
    existing project,
  • 0:22 - 0:25
    or just change the language
    settings if needed.
  • 0:25 - 0:29
    Let's download the stocks data zip
    file and browse to the file
  • 0:29 - 0:31
    from within the OpenRefine
    Create Project menu.
  • 0:32 - 0:34
    And click next.
  • 0:35 - 0:38
    This imports the contents of
    the file into application
  • 0:38 - 0:41
    memory. Note that the tool
    can work both with compressed
  • 0:41 - 0:43
    and uncompressed files.
  • 0:44 - 0:48
    Since the stock data is in the
    semi-structured JSON format, we
  • 0:48 - 0:52
    will select the root node of the
    JSON tree, which transforms the
  • 0:52 - 0:55
    data into a tabular format with
    rows and columns.
  • 0:56 - 0:58
    We can choose to preserve empty strings
  • 0:58 - 1:00
    and perform basic cleanup such as
  • 1:00 - 1:03
    trimming whitespace and
    parsing numeric values.
  • 1:08 - 1:11
    These operations can also be
    performed at a later stage.
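The import steps above (flattening the JSON tree into rows, trimming whitespace, parsing numbers) can be sketched in plain Python; the records and field names below are made up for illustration:

```python
import json

# Hypothetical stock records, mimicking the semi-structured JSON
# that OpenRefine flattens when you pick the root node of the tree.
raw = '''[{"ticker": " AAPL ", "close": "150.25", "volume": "1000"},
          {"ticker": "MSFT",  "close": "310.10", "volume": "2000"}]'''

def clean_record(rec):
    """Trim whitespace and parse numeric strings, like the
    import-time options shown in the video."""
    out = {}
    for key, value in rec.items():
        if isinstance(value, str):
            value = value.strip()
            try:
                value = float(value) if "." in value else int(value)
            except ValueError:
                pass  # keep non-numeric strings as-is
        out[key] = value
    return out

rows = [clean_record(r) for r in json.loads(raw)]
print(rows[0])  # {'ticker': 'AAPL', 'close': 150.25, 'volume': 1000}
```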
  • 1:13 - 1:16
    We will provide a project
    name and proceed with
  • 1:16 - 1:17
    project creation.
  • 1:23 - 1:25
    Once the project has
    been created,
  • 1:26 - 1:30
    we can click All → Edit columns
    and remove some of the unwanted
  • 1:30 - 1:32
    columns from the data set.
  • 1:33 - 1:36
    This helps reduce the memory
    usage, especially for large
  • 1:36 - 1:40
    data sets. We'll also go
    ahead and reorder some of
  • 1:40 - 1:41
    the columns in the table.
  • 1:50 - 1:53
    Now the data set is lighter
    with only those columns that
  • 1:53 - 1:56
    will be used potentially for
    further analysis. Just like
  • 1:56 - 1:59
    in a spreadsheet, we could
    choose to sort different
  • 1:59 - 2:03
    columns. Here we'll go ahead
    and sort the volume column
  • 2:03 - 2:04
    in descending order.
  • 2:16 - 2:19
    Financial data sets, such as
    stock prices, are constantly
  • 2:19 - 2:23
    changing. What if our data is
    outdated, and we wanted to fetch
  • 2:23 - 2:26
    the latest data from the stock
    market? Using OpenRefine, one
  • 2:26 - 2:29
    could augment the data using
    online web services.
  • 2:30 - 2:33
    However, calling some of these
    web services could be very
  • 2:33 - 2:35
    time-consuming. So to
    demonstrate this feature,
  • 2:35 - 2:39
    let's work with a smaller
    subset of the stocks data and
  • 2:39 - 2:41
    load it into a new project.
  • 2:41 - 2:45
    Look for the file stocks_demo.csv in the
    session folder,
  • 2:45 - 2:47
    and load it in OpenRefine.
  • 2:50 - 2:53
    We will also create a new
    OpenRefine project
  • 2:53 - 2:55
    and call it StocksDemo.
  • 3:00 - 3:05
    On the ticker column menu, click
    Edit column → Add columns by
  • 3:05 - 3:09
    fetching URLs. This opens up an
    expression box where we could
  • 3:09 - 3:13
    enter the URL of an online web
    service which could be invoked
  • 3:13 - 3:15
    to augment the data.
  • 3:16 - 3:20
    This online service provides an
    API that returns slightly
  • 3:20 - 3:21
    delayed stock prices.
  • 3:22 - 3:25
    A good practice is to test
    the web service from a
  • 3:25 - 3:29
    browser client with a known
    ticker symbol and validate
  • 3:29 - 3:30
    the service response.
  • 3:35 - 3:39
    We will add this URL and append
    the ticker symbol as a variable
  • 3:39 - 3:40
    parameter for each row.
  • 3:41 - 3:44
    To do this, copy the URL into
  • 3:44 - 3:48
    the expression box, and insert
    quotes as shown here.
  • 3:57 - 4:00
    Now the URL is different for
    each row with the ticker symbol
  • 4:00 - 4:01
    as the symbol parameter.
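Building a per-row URL with the ticker as a query parameter can be sketched as follows; the endpoint is a placeholder, not the actual service used in the video:

```python
from urllib.parse import urlencode

# Hypothetical quote endpoint -- the real service URL from the
# video is not reproduced here.
BASE_URL = "https://example.com/quote"

def build_url(ticker):
    """Append the ticker symbol as a query parameter, mirroring
    the expression used in Add columns by fetching URLs."""
    return BASE_URL + "?" + urlencode({"symbol": ticker})

for ticker in ["AAPL", "MSFT"]:
    print(build_url(ticker))
# https://example.com/quote?symbol=AAPL
# https://example.com/quote?symbol=MSFT
```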
  • 4:04 - 4:06
    We'll give a new name to the
    column and click OK.
  • 4:08 - 4:11
    This could take a few minutes
    to run depending on the speed
  • 4:11 - 4:13
    of the web service.
  • 4:17 - 4:20
    We now have a new column with
    the latest stock data.
  • 4:21 - 4:24
    Notice that the results are in
    the XML format and need to be
  • 4:24 - 4:29
    parsed. Please refer to the
    OpenRefine tutorials for some
  • 4:29 - 4:31
    help with parsing XML and JSON
    results.
  • 4:31 - 4:35
    For now, here is the
    syntax for extracting the stock
  • 4:35 - 4:37
    price from the XML document.
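For illustration, here is how such a price could be pulled out of an XML response in Python; the element names are hypothetical and the real service's schema may differ:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML response -- element names are made up for
# illustration only.
response = """<quote>
  <symbol>AAPL</symbol>
  <price>150.25</price>
</quote>"""

root = ET.fromstring(response)
price = float(root.findtext("price"))
print(price)  # 150.25
```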
  • 4:57 - 5:01
    After a quick review of the
    results, we are ready to export
  • 5:01 - 5:04
    the data to another data
    processing tool in our pipeline.
  • 5:08 - 5:12
    In the next example we will use
    OpenRefine for cleaning up a
  • 5:12 - 5:13
    sample messy data file.
  • 5:16 - 5:20
    So let's download the
    Customers-messy.zip file
  • 5:20 - 5:22
    from the session folder and store it on
  • 5:22 - 5:26
    the local computer. The first
    step is to visually inspect
  • 5:26 - 5:30
    the raw data file using a text
    editor or Excel and get
  • 5:30 - 5:33
    familiar with the attributes
    and contents of the data.
  • 5:35 - 5:38
    As before, we will load this
    data file into OpenRefine and
  • 5:38 - 5:40
    create a new project.
  • 5:42 - 5:45
    Once the data has been
    loaded, we'll do a cursory
  • 5:45 - 5:47
    check on the columns,
  • 5:49 - 5:52
    and change the character
    encoding to UTF-8.
  • 5:53 - 5:56
    If needed, we could also change
    options such as column
  • 5:56 - 5:58
    separators or add a missing
  • 5:58 - 6:02
    header row. In our case, we'll
    keep the default settings.
  • 6:03 - 6:06
    So let's give the clean-up
    project a name and click on
  • 6:06 - 6:08
    the Create Project button.
  • 6:10 - 6:13
    This creates a new OpenRefine
    project with a copy of the
  • 6:13 - 6:18
    original data set. The next step
    in the cleanup process is to
  • 6:18 - 6:21
    visually inspect the data and
    check each column for errors,
  • 6:21 - 6:24
    formatting issues, or missing values.
  • 6:24 - 6:27
    We can perform common
    operations like removing
  • 6:27 - 6:31
    whitespace from text. For this,
    let's click on a column header.
  • 6:31 - 6:33
    Let's say contactFirstName, then
  • 6:33 - 6:37
    Edit cells → Trim leading and
  • 6:37 - 6:40
    trailing whitespace. We can
    perform the same operation
  • 6:40 - 6:42
    on multiple columns.
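The trim operation applied to several columns is equivalent to the following sketch (column names follow the data set, row values are made up):

```python
# Trimming leading and trailing whitespace across several columns,
# like Edit cells -> Trim leading and trailing whitespace applied
# to each column in turn.
rows = [{"contactFirstName": "  Jean ", "contactLastName": " King  "}]
columns = ["contactFirstName", "contactLastName"]

for row in rows:
    for col in columns:
        row[col] = row[col].strip()

print(rows[0])  # {'contactFirstName': 'Jean', 'contactLastName': 'King'}
```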
  • 6:43 - 6:46
    We could also do basic
    concatenation operations like
  • 6:46 - 6:49
    in a spreadsheet. Let's say we
    need to combine the first and
  • 6:49 - 6:53
    last names of the contact into
    one column. For this we'll
  • 6:53 - 6:55
    click on Edit column →
  • 6:55 - 6:58
    Add column based on this column.
  • 6:59 - 7:02
    In the expression box, we
    will take the current value
  • 7:02 - 7:04
    of the contactFirstName.
  • 7:04 - 7:08
    Add a space, followed by the
    value of contactLastName.
  • 7:16 - 7:18
    Let's check to see the results
  • 7:19 - 7:22
    and assign a name to the new
    column, such as
  • 7:22 - 7:24
    contactFullName. And hit OK.
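The concatenation corresponds to a GREL expression along the lines of `value + " " + cells["contactLastName"].value`; the same idea in Python, with a made-up row:

```python
# Building contactFullName by concatenating the first and last
# name columns with a space in between.
rows = [{"contactFirstName": "Jean", "contactLastName": "King"}]

for row in rows:
    row["contactFullName"] = (
        row["contactFirstName"] + " " + row["contactLastName"]
    )

print(rows[0]["contactFullName"])  # Jean King
```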
  • 7:30 - 7:33
    The newly-created column has
    been added to our data set.
  • 7:34 - 7:37
    A nice thing about OpenRefine is
    that you can undo your past
  • 7:37 - 7:41
    operations, so if you made a
    mistake you can go back to a
  • 7:41 - 7:44
    previous step and correct it.
    This makes it really easy to
  • 7:44 - 7:47
    make incremental changes without
    losing time over
  • 7:47 - 7:50
    saving multiple copies of the data.
  • 7:53 - 7:56
    Let's go ahead and remove the
    column that we just created.
  • 7:57 - 8:00
    Just like with spreadsheet
    applications, we could also
  • 8:00 - 8:03
    perform other simple, column-based
    operations such as
  • 8:03 - 8:07
    changing the case of the text to
    title, upper, or lowercase.
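These case transforms map directly onto standard string operations, e.g. in Python:

```python
# Title, upper, and lower case transforms, like the common
# transforms in OpenRefine's Edit cells menu.
name = "jean king"
print(name.title())  # Jean King
print(name.upper())  # JEAN KING
print(name.lower())  # jean king
```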
  • 8:18 - 8:21
    Faceting is a powerful feature
    in many data cleaning tools. The
  • 8:21 - 8:26
    facet summarizes the cells in
    that column to give you a bigger
  • 8:26 - 8:29
    picture of its contents. It also
    allows for filtering down to
  • 8:29 - 8:33
    just a subset of the rows that
    you want to change in bulk.
  • 8:33 - 8:35
    Let's see how this works.
  • 8:35 - 8:38
    Let's create a text facet
    on the country column
  • 8:38 - 8:40
    by clicking on the column's
    dropdown menu and
  • 8:40 - 8:44
    invoking Facet → Text facet. We
    can see that
  • 8:44 - 8:46
    the country names
    are not consistent.
  • 8:47 - 8:51
    For instance, United States
    is spelled as US, USA, or
  • 8:51 - 8:54
    United States, with some
    extra characters.
  • 8:56 - 9:00
    Also, Netherlands is spelled
    both with and without the
  • 9:00 - 9:04
    determiner "the." France is
    spelled in uppercase as
  • 9:04 - 9:05
    well as lowercase characters.
  • 9:07 - 9:10
    Clustering allows us to
    automatically group and edit
  • 9:10 - 9:13
    different but similar values.
    Let's click on the Cluster
  • 9:13 - 9:15
    command on the left frame.
  • 9:16 - 9:19
    We now have the opportunity to
    see all the similar words
  • 9:19 - 9:21
    grouped together and perform a
  • 9:21 - 9:25
    merge operation. Let's merge all
    text referring to the United
  • 9:25 - 9:29
    States as USA, United Kingdom as
    UK, and so on.
  • 9:33 - 9:36
    The Merge and Re-Cluster
    button confirms the
  • 9:36 - 9:39
    operation. There are no more
    obvious clusters that are
  • 9:39 - 9:42
    left. However, we still
    haven't resolved the issue
  • 9:42 - 9:45
    with Netherlands. For this
    we will choose another
  • 9:45 - 9:47
    clustering operation which
    uses a distance metric.
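As a toy illustration of distance-based clustering (OpenRefine's own methods, such as key collision and nearest neighbour, are more sophisticated), similar values can be grouped with a string-similarity threshold:

```python
from difflib import SequenceMatcher

# Group values whose similarity exceeds a threshold, in the spirit
# of OpenRefine's distance-based clustering. Values are made up.
values = ["Netherlands", "The Netherlands", "Norway", "France", "france"]

def similar(a, b, threshold=0.75):
    """Case-insensitive similarity ratio between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

clusters = []
for v in values:
    for cluster in clusters:
        if similar(v, cluster[0]):
            cluster.append(v)
            break
    else:
        clusters.append([v])

print(clusters)
```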
  • 9:56 - 9:59
    The country column is now fixed,
    and there are no more
  • 9:59 - 10:02
    clusters to be found. However,
    we still have one error.
  • 10:02 - 10:06
    The value for Norway contains a
    stray space character. See if we can find
  • 10:06 - 10:08
    a way to fix this error.
  • 10:10 - 10:14
    Let's now go ahead and use the
    tool to clean up a numerical
  • 10:14 - 10:17
    column like the credit limit.
    We'll first apply a numerical
  • 10:17 - 10:21
    transformation on the column so
    that all the numbers are
  • 10:21 - 10:25
    recognized. Some values in this
    column cannot be parsed into
  • 10:25 - 10:28
    numbers. This is because of the
    presence of alphanumeric values,
  • 10:28 - 10:30
    such as the USD or the dollar
  • 10:30 - 10:34
    symbol. We can correct this by
    applying the Numeric facet.
  • 10:35 - 10:38
    On the left we see a chart with
    the distribution of the data.
  • 10:39 - 10:43
    Let's filter down to the non-numeric
    values only and change
  • 10:43 - 10:46
    the values of the fields from
    the expression box.
  • 10:53 - 10:56
    This screen allows us to
    view the original values as
  • 10:56 - 10:58
    well as the outcomes when
    the expression is applied.
  • 11:03 - 11:06
    We will invoke the replace
    function to replace or
  • 11:06 - 11:09
    substitute the unwanted
    characters with blanks. So let's
  • 11:09 - 11:13
    remove the USD, the dollar, and
    the comma symbols.
  • 11:14 - 11:17
    As you can see, we can make
    multiple function calls in a
  • 11:17 - 11:19
    single line. This is also known
  • 11:19 - 11:23
    as function chaining. Now
    all the alphabetic characters
  • 11:23 - 11:24
    have been removed.
  • 11:27 - 11:31
    When we click OK, the tool still
    does not see them as numbers.
  • 11:31 - 11:35
    For this we will need to invoke
    the toNumber function to
  • 11:35 - 11:38
    convert the values within the
    facet into numerical values.
  • 11:48 - 11:52
    So far we have tested these
    operations on a numeric facet.
  • 11:52 - 11:56
    To apply these changes to the
    data set in the project we will
  • 11:56 - 11:59
    copy this expression and apply it
    as a column transform. This
  • 11:59 - 12:02
    ensures that all the values in
    the credit limit column have
  • 12:02 - 12:04
    been converted into numbers.
  • 12:13 - 12:17
    Once we're done with cleaning up
    the rest of the data, we are
  • 12:17 - 12:20
    ready to export the file into a
    format of our choosing. For this
  • 12:20 - 12:24
    example, let's export it as a
    CSV or a comma-separated file
  • 12:24 - 12:26
    and open it to view the results.
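The CSV export step can be sketched with the standard library; the rows are illustrative, and an in-memory buffer stands in for the exported file:

```python
import csv
import io

# Write cleaned rows out as CSV, the export format chosen here.
rows = [{"customerName": "Atelier graphique", "country": "France"},
        {"customerName": "Signal Gift Stores", "country": "USA"}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["customerName", "country"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```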
  • 12:30 - 12:34
    All the columns now have clean
    data. The data set is now
  • 12:34 - 12:38
    ready for storage in a
    database or for further analysis.
Title:
vimeo.com/.../508797388
Video Language:
English
Duration:
12:41
