-
Alright, welcome back.
-
We're going to keep marching through
a few other examples here so you have
-
a bunch of examples on how
to implement text analysis.
-
Again, we started with
more classic literature.
-
We looked at the pdf with the political
and economic literary analysis of
-
Harry Potter, Donald Trump, and Voldemort.
-
We looked briefly at the illicit financial
flow data that I'm working on right now
-
and at least looked at some
basic applications there.
-
I want to show you now a little bit about
how you can use text analysis to
-
do something like analyze
social media data.
-
In this case, we'll use Twitter data.
-
I'm guessing at least a number
of you are on Twitter.
-
I assume something similar could be done
for some of the other social media
-
platforms, the key though is you just have
to be able to download the data
-
or get the data somehow
in its raw form.
-
Twitter makes that very easy,
so I'm going to show you
-
at least how you would
do this for Twitter data.
-
Again, I just don't know on Facebook,
Instagram, Snapchat, all that
-
but if you can get it--
-
if you can download it in
csv somehow, you can
-
analyze patterns in
your social media activity.
-
All right.
-
To do this, you know, first off you're
going to need to go
-
and get the Twitter data.
-
So, if that's what you're going to do
like if you have Twitter
-
and you want to go analyze some of this,
-
go to your Twitter and there's
a set of steps you've got to go through
-
to get there and
in the notes here, in the pdf notes,
-
I've got some links here that will
walk you through
-
how to download your Twitter data.
-
One brief-- so, I'm not going to go
through all that, just follow some of this
-
understand that the core thing you need is
a csv file and you need a csv file with
-
all your Twitter information all in
one spot like all in a single sheet.
-
When I did this, just recently, Twitter
downloaded everything as .js
-
and so I had to use a converter to get it
from .js to .csv,
-
that was the only thing
I couldn't find
-
that wasn't otherwise
in the instructions but otherwise,
-
follow the instructions here, it may
download as .js,
-
you can just use a simple converter
online and turn it to .csv.
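As an aside, that conversion can also be done locally: the .js file in the archive is usually just JSON with a short JavaScript assignment bolted onto the front. Here's a hedged sketch in R using the jsonlite package -- the `window.YTD...` prefix and the helper name are my assumptions, not something from the lecture:

```r
# A hedged sketch: convert Twitter's tweet.js archive file to csv in R.
# Assumes the file starts with a "window.YTD... = " JavaScript assignment,
# which is what recent Twitter archives have looked like.
library(jsonlite)

js_to_csv <- function(js_path, csv_path) {
  raw <- paste(readLines(js_path, warn = FALSE), collapse = "\n")
  # Strip everything up to and including the first "=" so pure JSON remains
  json <- sub("^[^=]*=\\s*", "", raw)
  tweets <- fromJSON(json, flatten = TRUE)
  write.csv(tweets, csv_path, row.names = FALSE)
}
```

An online converter does the same job; this just keeps the step reproducible in your script.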
-
If you don't have Twitter, in the next
segment,
-
I'm going to show you how to do this
with all of Trump's Twitter data.
-
I found a site that had all of
his Twitter data so, I went
-
and downloaded all that, put it in a csv
and I should say that my csv of
-
my Twitter data and the csv with Trump's
Twitter data are on Canvas
-
so you'll be able to just get the csv's
for at least my stuff and Trump's stuff
-
but if you have interest on
doing it for your own,
-
if you have Twitter, go for it, otherwise
work with at least these ones
-
or maybe ask a friend who's
on Twitter or something
-
can be kind of fun to look at the Twitter
data and examine some of the patterns
-
and what you're saying, what you're
doing, what you're thinking out loud
-
and so forth.
-
To do this let's bring in
a few of the libraries.
-
So, most of these should
be read in already.
-
lubridate, to work with dates -- you've
been using that before --
-
and then readr, so those should be the
ones that we're going to need, and
-
I need to make sure I execute
the command to load those libraries,
-
so let me do that real quick.
-
We've done that, loaded those, and
then going to read in now the csv
-
with my tweets.
-
Hopefully that works.
-
Note here again, earlier -- in an earlier
segment, I set the working directory
-
you know when you download the csv from
Canvas,
-
you'll need to put it in a working
directory and set that appropriately
-
but I should have this in here now.
-
Let's double check, yeah, okay.
-
So, looking in the global environment,
I can see in fact I'll just view it here.
-
Now I've got my Twitter data in here
as time stamp, tweet ID, text,
-
retweet count, and favorite count.
-
As you might imagine, this is
the time stamp of when it was tweeted.
-
That's an ID that Twitter attaches to it.
-
That's the actual text.
-
This is-- I can't remember whether this is
how often I retweeted it or whether
-
others retweeted it.
-
I think this might be how often others
retweeted it, and then a favorite count
-
in terms of how often something was liked
or disliked so, you're going to--
-
if you want to work with my Twitter data,
-
you can go ahead and judge me in
all sorts of ways (laughter)
-
but not in ways that couldn't
have otherwise
-
because you can find me on Twitter
but at any rate,
-
that's my Twitter data in a nutshell.
-
I'm not actually on Twitter all that much,
as you can see: in my case, I've got
-
just scrolling through here, I mean
only in the order of, let's see,
-
475 total and that actually, I believe,
counts retweets
-
or it'll keep when I retweet others,
-
which I think actually
is this text 'RT' you can see here
-
is when I retweet someone else.
-
I believe retweet count is how often
someone else retweets what I put out there.
-
Again, I don't do a lot of it here
-
but that gives you a sense of
what this looks like.
-
One thing I want to draw your attention
to here is that Twitter's time stamp is
-
a complete nightmare when it comes to R.
-
So, you know, I don't know if this makes
you guys feel any better or worse
-
but just know that I sat there and banged
my head against the wall for you know
-
10, 15 minutes just trying to figure out
the best way to parse this time stamp data
-
so that I could then use the lubridate
command to be able to then declare
-
the date and time information so then
I could use it for some analysis.
-
I was really trying to wrap my head
around the best way,
-
I think there's probably much
better ways out there
-
but just to give you a sense that this can
be a pain for everyone.
-
I mean, back on this --
-
sorry, I was going to make this point --
I mean, if you look at this, they put
-
day of the week, month,
date, and then hour, minute, second
-
and then time zone and then year, which
is like --
-
lubridate has no way of dealing
with this
-
and it essentially --
I basically thought,
-
"well, I'm just going to parse this into a
bunch of different variables,
-
"rename them,
-
"bind them back together, in an order
that I want and then declare it in
-
"the proper date and time format."
-
So, the code I used for all that is the
following-- I have to pull this over...
-
Fix the time stamp so it's usable in R.
-
Alright, this is -- again, something
fairly complicated here but essentially
-
trying to fix this so it's usable in some
way and this required me to essentially
-
pull apart -- which is what I'm doing
here, I'm pulling apart
-
all the different elements of
-
that long and complicated
time stamp there, all separated by spaces,
-
creating something of a matrix here
and then -- in fact, I can show you
-
what I did there. I turned it into its own
little matrix like this.
-
This is a separate data type here,
-
this is not in the central data frame,
I've created this as a matrix.
-
That is a matrix command and it now
just puts everything into a matrix of sorts
-
and then renamed everything in
the matrix.
-
So we've just done that there and then
bind it back-- oh shoot.
-
I just... okay.
-
Bind it back together.
-
Now, you can see I've got the time
stamp but I've got all these separate
-
variables now: day of week, month, day,
hms, time zone, and year.
-
And then, I can use -- going back to
the lubridate command,
-
this is key to what we taught you earlier
-
in terms of declaring the date and time
format, the 'ymd.'
-
I'm putting together year, month, day
calling it 'ymd'
-
and then that's going
to give me a final data frame
-
that then becomes usable and
you can see up here
-
in tweets, now I've got a time stamp that
is year, month, day
-
and it's officially declared as such.
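To make that parse-and-rebind approach concrete, here's a minimal sketch with a made-up timestamp in Twitter's format. The column labels are my own, and, as a hedge, note that base R's strptime can actually parse this format in one call if you'd rather skip the splitting:

```r
library(lubridate)

# A made-up timestamp in the format Twitter's archive uses:
# day of week, month, day, hh:mm:ss, time zone offset, year
ts <- "Wed Oct 10 20:19:24 +0000 2018"

# Pull the pieces apart on spaces into a little matrix, then label them
parts <- do.call(rbind, strsplit(ts, " ", fixed = TRUE))
colnames(parts) <- c("dow", "month", "day", "hms", "tz", "year")

# Rebuild a proper date: convert the month abbreviation to a number,
# then let lubridate declare it as a real Date
stamp <- make_date(
  year  = as.integer(parts[, "year"]),
  month = match(parts[, "month"], month.abb),
  day   = as.integer(parts[, "day"])
)
stamp  # a genuine Date that R can plot and group by

# For reference, base R can parse the whole thing in one shot:
# strptime(ts, "%a %b %d %H:%M:%S %z %Y", tz = "UTC")
```

The split-and-rebind route mirrors what the lecture does; the strptime one-liner is an alternative, not the method shown on screen.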
-
Remember back in module seven, I believe:
if we can declare this 'ymd'
-
or declare it in a format that is usable
for R, with that lubridate package,
-
then that timestamp variable becomes
something that R could then use to create
-
plots, figures, graphs, tables, whatever,
rather than being a complete mess.
-
And if you don't believe me on this,
-
just go and try to take that original date
and time stamp information from Twitter
-
and try to do anything with it and
-
I guarantee that R is going to choke
over and over and over and over again
-
just because it's in a completely
non-usable format.
-
So, you want to get it back into some
format that's like this time stamp.
-
I could've added, like, the
hour, minute, second stuff
-
but it was sufficient to just
do a time stamp like this.
-
Get it to that format and declare it
as an official time unit and then,
-
as we'll see -- when we do --
in fact, we're going to do this now.
-
Let's go ahead and plot my Twitter
activity and we can see in the code here
-
what we're talking about.
-
We want to plot the Twitter
activity over time.
-
So, see here, now what I've done is
in this histogram,
-
I've declared the time stamp
as the x variable.
-
I can only do that because I declared
the time information using that lubridate
-
command which required me to get
the date and time and the proper format.
-
If I do that, you can see that this
is my basic Twitter activity over time.
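The plot itself boils down to a ggplot histogram with the declared timestamp on the x-axis. A minimal sketch with toy data -- the variable and column names here are stand-ins, not necessarily what's in the csv:

```r
library(ggplot2)
library(lubridate)

# Toy stand-in for the real data: a few tweet dates
tweets <- data.frame(
  timestamp = ymd(c("2018-01-05", "2018-01-06", "2018-01-06", "2019-03-01"))
)

# Histogram of tweet activity over time; a 30-day binwidth is a choice,
# not something dictated by the data
p <- ggplot(tweets, aes(x = timestamp)) +
  geom_histogram(binwidth = 30) +
  labs(x = "Date", y = "Tweets")
```

The only prerequisite is that timestamp has been declared as a real date, which is exactly why the parsing above mattered.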
-
It probably comes as no surprise,
given looking over that very briefly,
-
but I'm a pretty
unfaithful Twitter user.
-
I'm pretty unfaithful at most
social media use, in general,
-
going from these giant bursts such as
in January 2018
-
all the way down to like long
periods of no activity here.
-
But that gives you a general sense of the
over-time type behavior
-
that may or may not be useful,
-
we may want to do other things
like look at, again,
-
word frequency or
sentiment analysis, which we can
-
do here in a second but
that gives you a basic look.
-
One other thing that's useful with
Twitter data is to get rid of retweets.
-
If you're not -- if you don't use Twitter:
you can compose an original tweet,
-
and oftentimes people do that, but if
someone else has an original tweet,
-
you can just choose to retweet that and then
it just broadcasts that
-
or whatever back to the
rest of your Twitter followers.
-
But it's like broadcasting someone else's
message, so --
-
most social media has the same thing,
-
sharing or otherwise but those are often
not as useful
-
because it's maybe not
your original thought.
-
So if we wanted to get to like
my original thought here,
-
which admittedly in Twitter is not much,
we would want to get rid of retweets
-
and so, let's do that
and then we can look at some
-
word counts and otherwise
so just my original tweets here.
-
So, this code here will allow us
to get rid of some retweets and
-
the key thing here is remember
our filter command gets rid of
-
observations so we won't go through
everything here but just know
-
we're trying to find instances where
that 'RT' is in the text,
-
and then get rid of that.
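In code, that filter amounts to dropping any row whose text starts with the 'RT' marker. A hedged sketch with toy tweets -- the column name `text` matches the csv described above, but the rows are made up:

```r
library(dplyr)
library(stringr)

# Toy data: two originals and one retweet
tweets <- tibble(
  text = c("RT @someone: their thought", "my own thought", "another one")
)

# Keep only rows that do NOT start with the retweet marker
my_tweets <- tweets %>%
  filter(!str_detect(text, "^RT"))
```

filter keeps rows where the condition is TRUE, so negating str_detect drops the retweets and leaves the original tweets.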
-
And if we can do that, then
it'll essentially get rid of all the
-
non-original information here
and that should give us,
-
let's see what we call
this: 'tidy tweets'.
-
Let's see, so we now have that
-
and then we've got our word counts,
there we go, okay, great.
-
So with that then, we tidied it up--
I forgot to explain this line, which was,
-
remember, unnesting tokens,
so we've unnested the tokens here as well,
-
that way it's tokenized
so every row now is a single word here
-
so that we can analyze the
words across all the tweets.
-
So with that then, we can look at
our frequencies, which we did earlier,
-
but it's just the standard,
just copy and paste this in
-
but... "(word, sort = TRUE)"
I believe,
-
(inaudible)
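Putting the tokenizing and counting together, here's a minimal sketch with toy tweets; the stop-word removal via tidytext's stop_words is the same move we've used in earlier segments:

```r
library(dplyr)
library(tidytext)

# Toy tweets standing in for the filtered, original-only data
tweets <- tibble(text = c("science news today", "more science news"))

# One row per word, drop common stop words, then count with the
# most frequent words first
word_counts <- tweets %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```

unnest_tokens does the tokenizing (one word per row, lowercased), and count(word, sort = TRUE) is the frequency line quoted above.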
-
Here we've got @AP, so it looks like
my original tweets are actually
-
not even that original.
-
I'm probably mostly posting
news articles, we've got news,
-
Trump, kids, @NateMJensen,
one of my colleagues here
-
who incessantly makes fun of me
and I him, and science.
-
So those are some of the basics,
the most frequently used words.
-
We can also do this as a figure,
recall our code from earlier,
-
visualize word counts; that should
give us a few more words, and in that
-
brief one, we have the higher ones:
@AP, news, Trump, kids, science,
-
Nate Jensen... Now we've got a few more
here: war, people, Sudan, peace, papers,
-
family, BBC, and so forth and some others,
Dada Kim, at Riverside, Claire Adida
-
at UC San Diego, so, others I engage with
some on Twitter... anyway,
-
that gives you some -- one view of the
word counts; we could do the word cloud,
-
which again, it's going to tell us
something fairly similar
-
or just allow us to visualize it
differently.
-
That's that same text there, you can see
a lot of those same words showing up.
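The word cloud call is the same one from earlier: hand the words and their counts to the wordcloud package. A toy sketch -- the counts here are invented, not my real ones:

```r
library(wordcloud)

# Toy word counts standing in for the real frequency table
words <- c("news", "science", "trump", "kids")
freqs <- c(10, 8, 6, 4)

# min.freq = 1 keeps everything in this tiny example;
# with real data you'd raise it to trim rare words
wordcloud(words = words, freq = freqs, min.freq = 1)
```

With the word_counts frame from above, you'd pass word_counts$word and word_counts$n instead.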
-
Anyway, we could do more, I didn't do
a sentiment analysis or anything.
-
We could do some of that stuff as well,
it'd be fairly straightforward: just call
-
the bing lexicon and then it would take all
the words in the tweets and,
-
essentially, query against that, compute
how often they're positive or
-
negative, and then you can compute that
sentiment score,
-
which is fairly straightforward.
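Sketched out, that bing approach is an inner join of the tokenized words against the lexicon, then a tally of positives minus negatives. Toy data again; get_sentiments("bing") ships with tidytext:

```r
library(dplyr)
library(tidytext)

# Toy tokenized tweets: one word per row, as unnest_tokens would give us
words <- tibble(word = c("great", "terrible", "wonderful", "news"))

# Join against the bing lexicon (a positive/negative label per word);
# words the lexicon doesn't know, like "news", simply drop out
scored <- words %>%
  inner_join(get_sentiments("bing"), by = "word")

# Net sentiment: positives minus negatives
sentiment_score <- sum(scored$sentiment == "positive") -
  sum(scored$sentiment == "negative")
```

On real data you'd group by a time unit (say, month of the declared timestamp) before summing, to see how sentiment moves over time.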
-
That's the basics of doing -- analyzing
your Twitter feed.
-
It's kind of fun to do. I'll admit,
I was --
-
between the timestamp information and
the downloaded csv --
-
somewhat frustrated trying to get this
in shape, but it ultimately worked.
-
You got some code here you can try to
adapt. But again, the key here:
-
get it in csv, download it from
Twitter, get the data, and then
-
try to fix that timestamp if you can.
-
With that said, you don't actually have to
fix the timestamp to do word clouds
-
or word counts or sentiment analysis.
-
You can skip all of that timestamp stuff
-
if you want and I wanted to work through
it for you so that you could do
-
the over-time plots if you wanted. But if
all you want to do is
-
word clouds, sentiment analysis... word
frequencies, that type of thing,
-
and topic models or whatever, you could do
all of that without the date information.
-
Anyway, that gives you a basic sense.
-
With that we'll end this segment, when we
come back in the next segment,
-
we'll, very quickly, look at Trump's
Twitter activity and then
-
move to a conclusion at that point.
So we'll end that segment here.