
Auditing Completeness - Data Wrangling with MongoDB

  • 0:00 - 0:03
    Now completeness can be difficult to assess. As my
  • 0:03 - 0:06
    friend David Silverman is fond of saying, you don't know
  • 0:06 - 0:08
    what you don't know. Now in this discussion we're
  • 0:08 - 0:11
    not talking about individual fields missing from a record. Instead,
  • 0:11 - 0:15
    we're actually talking about missing records. That is to
  • 0:15 - 0:18
    say, trying to figure out when it's an entire record
  • 0:18 - 0:21
    that's gone missing. So the solution here is very similar
  • 0:21 - 0:26
    to the solution for accuracy. Essentially, we need reference data.
  • 0:26 - 0:28
    So, let me give you an example from something
  • 0:28 - 0:31
    I work with on a regular basis. I'm the director
  • 0:31 - 0:34
    of education at MongoDB. And as part of my role, I'm
  • 0:34 - 0:38
    responsible for our certification program. Now, we do things a
  • 0:38 - 0:41
    little bit differently than some other technology companies, in that
  • 0:41 - 0:45
    our certification exams are delivered entirely online. So, in addition
  • 0:45 - 0:49
    to a completed exam record for every test taker, in
  • 0:49 - 0:51
    delivering our exams online, we have to have a way
  • 0:51 - 0:54
    of proctoring them to ensure the security of the
  • 0:54 - 0:57
    exam. The way we do that is through a web
  • 0:57 - 1:00
    proctoring solution where we capture video of the test taker
  • 1:00 - 1:04
    session. We capture both video of the test taker themselves,
  • 1:04 - 1:08
    using a web cam, as well as screen capture
  • 1:08 - 1:11
    for everything that's going on on the test taker's computer
  • 1:11 - 1:13
    screen as they're taking our exam. What this means,
  • 1:13 - 1:16
    then, is that we have three separate data stores that
  • 1:16 - 1:18
    need to agree with one another on a couple of
  • 1:18 - 1:21
    different things. And here is where we get to the completeness
  • 1:21 - 1:25
    example. They must agree on the list of test takers, that
  • 1:25 - 1:27
    is to say, if there is a record for a test
  • 1:27 - 1:31
    taker in any one of these databases, there must be a
  • 1:31 - 1:34
    record for that same test taker in the other two. They
  • 1:34 - 1:37
    also need to agree on the duration of the exam session.
  • 1:37 - 1:42
    So the video, here and here, should be approximately the same
  • 1:42 - 1:45
    length and it should agree with the elapsed time that
  • 1:45 - 1:49
    we recorded for a test taker. And, of course, these
  • 1:49 - 1:53
    have to agree approximately, within some epsilon. Now, you're probably
  • 1:53 - 1:56
    thinking, what if somebody took an exam and the record doesn't
  • 1:56 - 1:59
    exist in any one of these three databases? Well, that's
  • 1:59 - 2:02
    exactly right. As I mentioned at the beginning of this, this
  • 2:02 - 2:04
    is a difficult problem to address because we don't know
  • 2:04 - 2:07
    what we don't know. In that case using the solution that
  • 2:07 - 2:10
    we just described we would not detect the missing exam
  • 2:10 - 2:12
    record. So, in point of fact, we actually do a couple
  • 2:12 - 2:15
    of things in addition to this to ensure completeness for our
  • 2:15 - 2:18
    exam records. And those are essentially preventative measures to make sure
  • 2:18 - 2:21
    that we can't get into a situation where we haven't captured
  • 2:21 - 2:25
    the appropriate exam data for a given test taker. So, to
  • 2:25 - 2:28
    wrap this up, as with much of data cleaning, the means
  • 2:28 - 2:32
    of auditing completeness is situation-specific. It really depends on what
  • 2:32 - 2:36
    the data is that you are auditing and what reference sources you have access to.
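
The cross-check described in the transcript (three data stores that must agree on the list of test takers, and on session duration within some epsilon) can be sketched roughly as follows. This is only an illustration under stated assumptions: the store names, the test taker IDs, the durations, and the 30-second epsilon are all hypothetical, and each store is modeled as a plain dictionary mapping a test taker ID to a duration in seconds rather than as a real database.

```python
from typing import Dict, List, Set

# Hypothetical data: test taker ID -> session duration in seconds.
# The store names and all values are made up for illustration.
stores: Dict[str, Dict[str, float]] = {
    "exam_records": {"alice": 3600.0, "bob": 3550.0, "carol": 3400.0},
    "webcam_videos": {"alice": 3605.0, "bob": 3548.0},
    "screen_captures": {"alice": 3598.0, "bob": 3100.0, "carol": 3402.0},
}


def missing_records(stores: Dict[str, Dict[str, float]]) -> Dict[str, Set[str]]:
    """For each store, list IDs that appear in some other store but not in it."""
    all_ids: Set[str] = set().union(*(set(s) for s in stores.values()))
    return {name: all_ids - set(records) for name, records in stores.items()}


def durations_agree(durations: List[float], epsilon: float) -> bool:
    """True if all durations lie within epsilon seconds of one another."""
    return max(durations) - min(durations) <= epsilon


def mismatched_durations(
    stores: Dict[str, Dict[str, float]], epsilon: float
) -> List[str]:
    """IDs present in every store whose durations differ by more than epsilon."""
    common = set.intersection(*(set(s) for s in stores.values()))
    return sorted(
        tid
        for tid in common
        if not durations_agree([s[tid] for s in stores.values()], epsilon)
    )


EPSILON_SECONDS = 30.0  # assumed tolerance, not a value from the talk

missing = missing_records(stores)
mismatched = mismatched_durations(stores, EPSILON_SECONDS)

for name, ids in missing.items():
    print(name, "is missing:", sorted(ids))
print("duration mismatches:", mismatched)
```

Note that this check uses each store as reference data for the others, so a test taker who is absent from all three stores would go undetected, which is exactly the limitation raised in the transcript.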
Title:
Auditing Completeness - Data Wrangling with MongoDB
Video Language:
English
Team:
Udacity
Project:
UD032: Data Wrangling with MongoDB
Duration:
02:36

English subtitles
