
https:/.../2020-06-29_gov355m_14.e_analyzing-twitter-data.mp4

  • 0:04 - 0:06
    Alright, welcome back.
  • 0:06 - 0:10
    We're going to keep marching through
    a few other examples here so you have
  • 0:10 - 0:13
    a bunch of examples on how
    to implement text analysis.
  • 0:13 - 0:17
    Again, we started with
    more classic literature.
  • 0:17 - 0:25
    We looked at the pdf with the political
    and economic literary analysis of
  • 0:25 - 0:28
    Harry Potter, Donald Trump, and Voldemort.
  • 0:28 - 0:35
    We looked briefly at the illicit financial
    flow data that I'm working on right now
  • 0:36 - 0:40
    and at least looked at some
    basic applications there.
  • 0:40 - 0:46
    I want to show you now a little bit about
    how you can use text analysis to
  • 0:46 - 0:49
    do something like analyze
    social media data.
  • 0:50 - 0:52
    In this case, we'll use Twitter data.
  • 0:53 - 0:56
    I'm guessing at least a number
    of you are on Twitter.
  • 0:56 - 0:59
    I assume something similar could be done
    for some of the other social media
  • 0:59 - 1:05
    platforms; the key, though, is you just have
    to be able to download the data
  • 1:05 - 1:08
    or get the data somehow
    in its raw form.
  • 1:08 - 1:12
    Twitter makes that very easy,
    so I'm going to show you
  • 1:12 - 1:15
    at least how you would
    do this for Twitter data.
  • 1:17 - 1:22
    Again, I just don't know about Facebook,
    Instagram, Snapchat, all that
  • 1:22 - 1:23
    but if you can get it--
  • 1:23 - 1:26
    if you can download it in
    csv somehow, you can
  • 1:26 - 1:30
    analyze patterns in
    your social media activity.
  • 1:33 - 1:34
    All right.
  • 1:35 - 1:39
    To do this, you know, first off you're
    going to need to go
  • 1:39 - 1:41
    and get the Twitter data.
  • 1:41 - 1:45
    So, if that's what you're going to do
    like if you have Twitter
  • 1:45 - 1:47
    and you want to go analyze some of this,
  • 1:47 - 1:54
    go to your Twitter account and there's
    a set of steps you've got to go through
  • 1:54 - 1:59
    to get there and
    in the notes here, in the pdf notes,
  • 2:01 - 2:06
    I've got some links here that will
    walk you through
  • 2:06 - 2:09
    how to download your Twitter data.
  • 2:09 - 2:12
    One brief note-- so, I'm not going to go
    through all that, just follow those steps
  • 2:12 - 2:20
    and understand that the core thing you need is
    a csv file, a csv file with
  • 2:20 - 2:24
    all your Twitter information in
    one spot, all in a single sheet.
  • 2:25 - 2:31
    When I did this, just recently, Twitter
    downloaded everything as .js
  • 2:31 - 2:34
    and so I had to use a converter to get it
    from .js to .csv,
  • 2:34 - 2:36
    that was the only thing
    that wasn't otherwise
  • 2:36 - 2:39
    covered in the instructions.
    But otherwise,
  • 2:39 - 2:45
    follow the instructions here; it may
    download as .js,
  • 2:45 - 2:48
    you can just use a simple converter
    online and turn it to .csv.
  • 2:49 - 2:52
    If you don't have Twitter, in the next
    segment,
  • 2:52 - 2:56
    I'm going to show you how to do this
    with all of Trump's Twitter data.
  • 2:56 - 3:01
    I found a site that had all of
    his Twitter data so, I went
  • 3:01 - 3:07
    and downloaded all that, put it in a csv
    and I should say that my csv of
  • 3:07 - 3:12
    my Twitter data and the csv with Trump's
    Twitter data are on Canvas
  • 3:12 - 3:16
    so you'll be able to just get the csv's
    for at least my stuff and Trump's stuff
  • 3:17 - 3:19
    but if you have interest in
    doing it for your own,
  • 3:19 - 3:23
    if you have Twitter, go for it, otherwise
    work with at least these ones
  • 3:23 - 3:25
    or maybe ask a friend who's
    on Twitter or something
  • 3:25 - 3:30
    It can be kind of fun to look at the Twitter
    data and examine some of the patterns
  • 3:30 - 3:34
    in what you're saying, what you're
    doing, what you're thinking out loud
  • 3:34 - 3:35
    and so forth.
  • 3:38 - 3:42
    To do this let's bring in
    a few of the libraries.
  • 3:42 - 3:45
    So, most of these should
    be read in already.
  • 3:45 - 3:48
    lubridate, to work with dates (you've
    been doing that before)
  • 3:48 - 3:56
    and then readr, so those should be the
    ones that we're going to need and
  • 3:57 - 4:01
    I need to remember to actually execute
    the command to call those libraries,
  • 4:01 - 4:03
    so let me do that real quick.
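
The library calls themselves aren't shown in the captions; a minimal sketch of what this segment appears to rely on:

```r
library(lubridate)  # work with dates (ymd(), etc.)
library(readr)      # read_csv()
library(dplyr)      # filter(), count(), bind_cols()
library(stringr)    # str_detect(), str_split_fixed()
library(ggplot2)    # plots
library(tidytext)   # unnest_tokens(), stop_words, get_sentiments()
```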
  • 4:05 - 4:12
    We've done that, loaded those, and
    now we're going to read in the csv
  • 4:12 - 4:16
    with my tweets.
  • 4:18 - 4:20
    Hopefully that works.
  • 4:21 - 4:25
    Note here again: in an earlier
    segment, I set the working directory.
  • 4:25 - 4:28
    You know, when you download the csv from
    Canvas,
  • 4:28 - 4:32
    you'll need to put it in a working
    directory and set that appropriately
  • 4:32 - 4:35
    but I should have this in here now.
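
A minimal sketch of the read-in step; the directory and file name are placeholders for wherever you saved the Canvas csv:

```r
# Point R at the folder holding the csv, then read it in
setwd("~/gov355m/data")              # placeholder path
tweets <- read_csv("my_tweets.csv")  # placeholder file name
```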
  • 4:35 - 4:38
    Let's double check, yeah, okay.
  • 4:38 - 4:42
    So, looking in the global environment,
    I can see it; in fact, I'll just view it here.
  • 4:42 - 4:50
    Now I've got my Twitter data in here
    as time stamp, tweet ID, text,
  • 4:50 - 4:53
    retweet count, and favorite count.
  • 4:53 - 4:57
    As you might imagine, this is
    the time stamp of when it was tweeted.
  • 4:57 - 5:00
    That's an ID that Twitter attaches to it.
  • 5:00 - 5:01
    That's the actual text.
  • 5:02 - 5:08
    This is-- I can't remember whether this
    is when I retweeted it or when
  • 5:08 - 5:09
    others retweeted it.
  • 5:09 - 5:15
    I think this might be where the others
    retweeted it and then a favorite count
  • 5:15 - 5:22
    in terms of how often something was liked
    or disliked. So, you're going to--
  • 5:22 - 5:24
    if you want to work with my Twitter data,
  • 5:24 - 5:27
    you can go ahead and judge me in
    all sorts of ways (laughter)
  • 5:27 - 5:30
    but not in ways that couldn't
    have otherwise
  • 5:30 - 5:32
    because you can find me on Twitter
    but at any rate,
  • 5:32 - 5:35
    that's my Twitter data in a nutshell.
  • 5:35 - 5:39
    I'm not actually on Twitter all that much,
    as you can see; in my case, I've got,
  • 5:41 - 5:45
    just scrolling through here, I mean,
    only on the order of, let's see,
  • 5:45 - 5:52
    475 total and that actually, I believe,
    counts retweets
  • 5:52 - 5:53
    or it'll include when I retweet others,
  • 5:53 - 5:57
    which I think actually
    is this text 'RT' you can see here
  • 5:57 - 5:59
    is when I retweet someone else.
  • 5:59 - 6:04
    I believe retweet count is how often
    someone else retweets what I put out there
  • 6:04 - 6:07
    Again, I don't do a lot of it here
  • 6:07 - 6:11
    but that gives you a sense of
    what this looks like.
  • 6:11 - 6:16
    One thing I want to draw your attention
    to here is that Twitter's time stamp is
  • 6:16 - 6:19
    a complete nightmare when it comes to R.
  • 6:19 - 6:24
    So, you know, I don't know if this makes
    you guys feel any better or worse
  • 6:24 - 6:28
    but just know that I sat there and banged
    my head against the wall for you know
  • 6:28 - 6:34
    10, 15 minutes just trying to figure out
    the best way to parse this time stamp data
  • 6:34 - 6:42
    so that I could then use the lubridate
    command to be able to then declare
  • 6:42 - 6:48
    the date and time information so then
    I could use it for some analysis.
  • 6:50 - 6:53
    I was really trying to wrap my head
    around the best way;
  • 6:53 - 6:56
    I think there's probably much
    better ways out there
  • 6:56 - 7:01
    but just to give you a sense that this can
    be a pain for everyone.
  • 7:01 - 7:02
    I mean, back on this --
  • 7:02 - 7:05
    sorry, I was going to make this point --
    I mean, if you look at this, they put
  • 7:05 - 7:12
    day of the week, month,
    date, and then hour, minute, second
  • 7:12 - 7:15
    and then time zone and then year, which
    is like --
  • 7:15 - 7:17
    lubridate has no way of dealing
    with this
  • 7:17 - 7:21
    and it essentially --
    I basically thought,
  • 7:21 - 7:25
    "well, I'm just going to parse this into a
    bunch of different variables,
  • 7:25 - 7:26
    "rename them,
  • 7:26 - 7:30
    "bind them back together, in an order
    that I want and then declare it in
  • 7:30 - 7:32
    "the proper date and time format."
  • 7:32 - 7:41
    So, the code I used for all that is the
    following-- I have to pull this over...
  • 7:46 - 7:52
    Fix the time stamp so it's usable in R.
  • 7:53 - 8:02
    Alright, this is -- again, something
    fairly complicated here but essentially
  • 8:02 - 8:09
    trying to fix this so it's usable in some
    way and this required me to essentially
  • 8:09 - 8:13
    pull apart -- which is what I'm doing
    here, I'm pulling apart
  • 8:13 - 8:15
    all the different elements of
  • 8:15 - 8:22
    that long and complicated
    time stamp there, all separated by spaces,
  • 8:22 - 8:28
    creating something of a matrix here
    and then -- in fact, I can show you
  • 8:28 - 8:33
    what I did there. I turned it into its own
    little matrix like this.
  • 8:33 - 8:38
    This is a separate data type here,
  • 8:38 - 8:41
    this is not in the central data frame,
    I've created this as a matrix.
  • 8:41 - 8:48
    That is a matrix command, and it now
    just puts everything into a matrix of sorts
  • 8:48 - 8:52
    and then renamed everything in
    the matrix.
  • 8:55 - 8:59
    So we've just done that there and then
    bind it back-- oh shoot.
  • 8:59 - 9:01
    I just... okay.
  • 9:03 - 9:04
    Bind it back together.
  • 9:04 - 9:08
    Now, you can see I've got the time
    stamp but I've got all these separate
  • 9:08 - 9:14
    variables now: day of week, month, day,
    hms, time zone, and year.
  • 9:14 - 9:20
    And then, I can use -- going back to
    the lubridate command,
  • 9:20 - 9:22
    this is key to what we taught you earlier
  • 9:22 - 9:26
    in terms of declaring the date and time
    format, the 'ymd.'
  • 9:26 - 9:31
    I'm putting together year, month, day
    calling it 'ymd'
  • 9:31 - 9:34
    and then that's going
    to give me a final data frame
  • 9:34 - 9:38
    that then becomes usable and
    you can see up here
  • 9:38 - 9:45
    in tweets, now I've got a time stamp that
    is year, month, day
  • 9:45 - 9:47
    and it's officially declared as such.
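
The on-screen code isn't reproduced in the captions, but from the steps described (pull the stamp apart on spaces, rename the pieces, bind them back together, declare with lubridate), a minimal sketch might look like this; the column and object names are assumptions:

```r
# Raw Twitter stamp looks roughly like: "Wed Oct 10 20:19:24 +0000 2018"
# (day of week, month, day, hms, time zone, year)

# Pull the stamp apart on spaces into a six-column character matrix
parts <- str_split_fixed(tweets$timestamp, " ", 6)

# Rename everything in the matrix
colnames(parts) <- c("dow", "month", "day", "hms", "tz", "year")

# Bind the pieces back onto the data frame
tweets <- bind_cols(tweets, as_tibble(parts))

# Put year, month, day back together and declare it as a proper date
tweets$timestamp <- ymd(paste(tweets$year, tweets$month, tweets$day))
```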
  • 9:47 - 9:52
    Remember back in module seven, I believe:
    if we can declare this 'ymd'
  • 9:52 - 9:57
    or declare it in a format that is usable
    for R, with that lubridate package,
  • 9:57 - 10:03
    then that timestamp variable becomes
    something that R could then use to create
  • 10:03 - 10:09
    plots, figures, graphs, tables, whatever,
    rather than being a complete mess.
  • 10:09 - 10:11
    And if you don't believe me on this,
  • 10:11 - 10:17
    just go and try to take that original date
    and time stamp information from Twitter
  • 10:17 - 10:19
    and try to do anything with it and
  • 10:19 - 10:22
    I guarantee that R is going to choke
    over and over and over and over again
  • 10:22 - 10:25
    just because it's in a completely
    non-usable format.
  • 10:25 - 10:30
    So, you want to get it back into some
    format that's like this time stamp.
  • 10:30 - 10:32
    I could've added like
    hour, minute, second stuff
  • 10:32 - 10:37
    but it was sufficient to just
    do a time stamp like this.
  • 10:37 - 10:41
    Get it to that format and declare it
    as an official time unit and then,
  • 10:41 - 10:44
    as we'll see -- when we do --
    in fact, we're going to do this now.
  • 10:44 - 10:51
    Let's go ahead and plot my Twitter
    activity and we can see in the code here
  • 10:51 - 10:52
    what we're talking about.
  • 10:52 - 11:00
    We want to plot the Twitter
    activity over time.
  • 11:00 - 11:04
    So, see here, now what I've done is
    in this histogram,
  • 11:05 - 11:08
    I've declared the time stamp
    as the x variable.
  • 11:08 - 11:12
    I can only do that because I declared
    the time information using that lubridate
  • 11:12 - 11:17
    command, which required me to get
    the date and time in the proper format.
  • 11:19 - 11:24
    If I do that, you can see that this
    is my basic Twitter activity over time.
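
A sketch of the over-time plot described here, assuming the cleaned-up timestamp column from above; the bin width is an assumption:

```r
# Histogram of tweet counts over time; ~monthly bins (30 days)
ggplot(tweets, aes(x = timestamp)) +
  geom_histogram(binwidth = 30) +
  labs(x = "Date", y = "Number of tweets")
```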
  • 11:25 - 11:29
    It probably comes as no surprise,
    given looking over that very briefly,
  • 11:29 - 11:33
    but I'm a pretty
    unfaithful Twitter user.
  • 11:33 - 11:37
    I'm pretty unfaithful at most
    social media use, in general,
  • 11:37 - 11:42
    going from these giant bursts such as
    in January 2018
  • 11:42 - 11:45
    all the way down to like long
    periods of no activity here.
  • 11:46 - 11:52
    But that gives you a general sense of the
    over-time type behavior
  • 11:52 - 11:55
    that may or may not be useful,
  • 11:55 - 11:57
    we may want to do other things
    like look at, again,
  • 11:57 - 12:00
    word frequency or
    sentiment analysis, which we can
  • 12:00 - 12:04
    do here in a second but
    that gives you a basic look.
  • 12:04 - 12:09
    One other thing that's useful with
    Twitter data is to get rid of retweets.
  • 12:10 - 12:16
    If you're not -- if you don't use Twitter:
    you can compose an original tweet
  • 12:16 - 12:21
    and oftentimes people do that, but if
    someone else has an original tweet,
  • 12:21 - 12:24
    you can just choose to retweet that and then
    it just broadcasts that
  • 12:24 - 12:26
    or whatever back to the
    rest of your Twitter followers.
  • 12:26 - 12:29
    But it's like broadcasting someone else's
    message, so --
  • 12:29 - 12:31
    most social media has the same thing,
  • 12:31 - 12:35
    sharing or otherwise but those are often
    not as useful
  • 12:35 - 12:37
    because it's maybe not
    your original thought.
  • 12:37 - 12:41
    So if we wanted to get to like
    my original thought here,
  • 12:41 - 12:46
    which admittedly on Twitter is not much,
    we would want to get rid of retweets
  • 12:46 - 12:51
    and so, let's do that
    and then we can look at some
  • 12:51 - 12:55
    word counts and otherwise
    so just my original tweets here.
  • 12:56 - 13:01
    So, this code here will allow us
    to get rid of some retweets and
  • 13:01 - 13:08
    the key thing here is remember
    our filter command gets rid of
  • 13:08 - 13:14
    observations, so we won't go through
    everything here, but just know
  • 13:14 - 13:18
    we're trying to find instances where
    that 'RT' is in the text,
  • 13:18 - 13:20
    and then get rid of that.
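
A sketch of that filter step; the exact on-screen pattern may differ:

```r
# Keep only original tweets: drop rows whose text starts with "RT"
no_retweets <- tweets %>%
  filter(!str_detect(text, "^RT"))
```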
  • 13:20 - 13:25
    And if we can do that, then
    it'll essentially get rid of all the
  • 13:25 - 13:33
    non-original information here
    and that should give us,
  • 13:33 - 13:37
    let's see what we call
    this: 'tidy tweets'.
  • 13:41 - 13:43
    Let's see so we now have that
  • 13:43 - 13:49
    and then we've got our word counts,
    there we go, okay, great.
  • 13:49 - 13:55
    So with that then, we tidied it up--
    I forgot to explain this line which was
  • 13:55 - 14:00
    remember, unnesting tokens --
    so we've unnested the tokens here as well,
  • 14:00 - 14:06
    that way it's tokenized
    so every row now is a single word here
  • 14:06 - 14:12
    so that we can analyze the
    words across all the tweets.
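
A sketch of the tokenizing step just described, following the tidytext conventions from earlier segments:

```r
# One word per row, then drop common stop words
tidy_tweets <- no_retweets %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
```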
  • 14:12 - 14:20
    So with that then, we can look
    at our frequencies, which we did earlier,
  • 14:21 - 14:30
    but it's just the standard,
    just copy and paste this in
  • 14:30 - 14:36
    but... "(word, sort = TRUE)"
    I believe.
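
That standard frequency count, as a sketch:

```r
# Most frequently used words, sorted
tidy_tweets %>%
  count(word, sort = TRUE)
```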
  • 14:47 - 14:48
    (inaudible)
  • 14:56 - 14:59
    Here we've got @AP, so it looks like
    my original tweets are actually
  • 14:59 - 15:01
    not even that original.
  • 15:01 - 15:06
    I'm probably mostly posting
    news articles, we've got news,
  • 15:06 - 15:12
    Trump, kids, @NateMJensen,
    one of my colleagues here
  • 15:12 - 15:18
    who incessantly makes fun of me
    and I him, and science.
  • 15:19 - 15:25
    So those are some of the basics,
    the most frequently used words.
  • 15:25 - 15:31
    We can also do this as a figure;
    recall our code from earlier to
  • 15:33 - 15:40
    visualize word counts. That should
    give us a few more words (a sketch follows).
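
A sketch of the bar-chart code recalled from earlier; the frequency cutoff of 3 is an assumption:

```r
# Bar chart of the more frequent words
tidy_tweets %>%
  count(word, sort = TRUE) %>%
  filter(n >= 3) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Count")
```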
  • 15:40 - 15:45
    In that brief one, we have the higher ones:
    @AP, news, Trump, kids, science,
  • 15:45 - 15:51
    Nate Jensen... Now we've got a few more
    here: war, people, Sudan, peace, papers,
  • 15:51 - 15:59
    family, BBC, and so forth and some others,
    Dada Kim, at Riverside, Claire Adida
  • 15:59 - 16:05
    at UC San Diego, so others who I engage with
    some on Twitter... anyway,
  • 16:05 - 16:10
    that gives you some -- one view of the
    word count. We could do the word cloud,
  • 16:10 - 16:14
    which again, it's going to tell us
    something fairly similar
  • 16:15 - 16:18
    or just allow us to visualize it
    differently.
  • 16:19 - 16:25
    That's that same text there, you can see
    a lot of those same words showing up.
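
A sketch of the word cloud view, assuming the wordcloud package is installed:

```r
library(wordcloud)

# Same counts, drawn as a word cloud
tidy_tweets %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
```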
  • 16:25 - 16:29
    Anyway, we could do more; I didn't do
    a sentiment analysis or anything.
  • 16:29 - 16:33
    We could do some of that stuff as well;
    it'd be fairly straightforward: just call the
  • 16:33 - 16:38
    bing lexicon and then it would take all
    the words in the tweets and,
  • 16:38 - 16:42
    essentially, query against that, and compute
    how often they're positive or
  • 16:42 - 16:44
    negative, and then you can compute that
    sentiment score,
  • 16:44 - 16:46
    which is fairly straightforward.
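
A sketch of what that bing-lexicon pass could look like; it isn't actually run in the video:

```r
# Query each word against the bing lexicon, then tally
# how many words are positive vs. negative
tidy_tweets %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment)
```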
  • 16:47 - 16:51
    That's the basics of doing -- analyzing
    your Twitter feed.
  • 16:51 - 16:56
    It's kind of fun to do. I'll admit,
    between
  • 16:56 - 17:00
    the timestamp information and
    the downloaded csv,
  • 17:00 - 17:06
    I was somewhat frustrated trying to get this
    in shape, but it ultimately worked.
  • 17:06 - 17:10
    You've got some code here you can try to
    adapt. But again, the key here:
  • 17:10 - 17:16
    get it in csv, download it from
    Twitter, get the data, and then
  • 17:18 - 17:21
    try to fix that timestamp if you can.
  • 17:21 - 17:26
    With that said, you don't actually have to
    fix the timestamp to do word clouds
  • 17:26 - 17:29
    or word counts or sentiment analysis.
  • 17:29 - 17:31
    You can skip all of that timestamp stuff
  • 17:31 - 17:34
    if you want; I wanted to work through
    it for you so that you could do
  • 17:34 - 17:38
    the over-time plots if you wanted. But if
    all you want to do is
  • 17:38 - 17:48
    word clouds, sentiment analysis... word
    frequencies, that type of thing,
  • 17:48 - 17:52
    and topic models or whatever, you could do
    all of that without the date information.
  • 17:52 - 17:55
    Anyway, that gives you a basic sense.
  • 17:55 - 17:58
    With that we'll end this segment, when we
    come back in the next segment,
  • 17:58 - 18:03
    we'll, very quickly, look at Trump's
    Twitter activity and then
  • 18:03 - 18:06
    move to a conclusion at that point.
    So we'll end that segment here.