-
Alright, welcome back.
-
We're going to keep marching through
a few other examples here so you have
-
a bunch of examples on how
to implement text analysis.
-
Again, we started with
more classic literature.
-
We looked at the pdf with the political
and economic literary analysis of
-
Harry Potter, Donald Trump, and Voldemort.
-
We looked briefly at the illicit financial
flow data that I'm working on right now
-
and at least looked at some
basic applications there.
-
I want to show you now a little bit about
how you can use text analysis to
-
do something like analyze
social media data.
-
In this case, we'll use Twitter data.
-
I'm guessing at least a number
of you are on Twitter.
-
I assume something similar could be done
for some of the other social media
-
platforms, the key though is you just have
to be able to download the data
-
or get the data somehow
in its raw form.
-
Twitter makes that very easy,
so I'm going to show you
-
at least how you would
do this for Twitter data.
-
Again, I just don't know on Facebook,
Instagram, Snapchat, all that
-
but if you can get it--
-
if you can download it in
csv somehow, you can
-
analyze patterns in
your social media activity.
-
All right.
-
To do this, you know, first off you're
going to need to go
-
and get the Twitter data.
-
So, if that's what you're going to do
like if you have Twitter
-
and you want to go analyze some of this,
-
go to your Twitter and there's
a set of steps you've got to go through
-
to get there and
in the notes here, in the pdf notes,
-
I've got some links here that will
walk you through
-
how to download your Twitter data.
-
One brief-- so, I'm not going to go
through all that, just follow some of this
-
understand that the core thing you need is
a csv file and you need a csv file with
-
all your Twitter information all in
one spot like all in a single sheet.
-
When I did this, just recently, Twitter
downloaded everything as .js
-
and so I had to use a converter to get it
from .js to .csv,
-
that was the only thing
I couldn't find
-
that wasn't otherwise
in the instructions but otherwise,
-
follow the instructions here, it may
download as .js,
-
you can just use a simple converter
online and turn it to .csv.
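As an aside, that conversion can also be done locally: the .js file in the archive is usually just JSON with a short JavaScript assignment bolted onto the front. Here's a hedged sketch in R using the jsonlite package -- the `window.YTD...` prefix and the helper name are my assumptions, not something from the lecture:

```r
# A hedged sketch: convert Twitter's tweet.js archive file to csv in R.
# Assumes the file starts with a "window.YTD... = " JavaScript assignment,
# which is what recent Twitter archives have looked like.
library(jsonlite)

js_to_csv <- function(js_path, csv_path) {
  raw <- paste(readLines(js_path, warn = FALSE), collapse = "\n")
  # Strip everything up to and including the first "=" so pure JSON remains
  json <- sub("^[^=]*=\\s*", "", raw)
  tweets <- fromJSON(json, flatten = TRUE)
  write.csv(tweets, csv_path, row.names = FALSE)
}
```

An online converter does the same job; this just keeps the step reproducible in your script.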
-
If you don't have Twitter, in the next
segment,
-
I'm going to show you how to do this
with all of Trump's Twitter data.
-
I found a site that had all of
his Twitter data so, I went
-
and downloaded all that, put it in a csv
and I should say that my csv of
-
my Twitter data and the csv with Trump's
Twitter data are on Canvas
-
so you'll be able to just get the csv's
for at least my stuff and Trump's stuff
-
but if you have interest on
doing it for your own,
-
if you have Twitter, go for it, otherwise
work with at least these ones
-
or maybe ask a friend who's
on Twitter or something
-
can be kind of fun to look at the Twitter
data and examine some of the patterns
-
and what you're saying, what you're
doing, what you're thinking out loud
-
and so forth.
-
To do this let's bring in
a few of the libraries.
-
So, most of these should
be read in already.
-
lubridate, to work with dates -- you've
been using that before --
-
and then readr, so those should be the
ones that we're going to need, and
-
I need to make sure I execute
the command to load those libraries,
-
so let me do that real quick.
-
We've done that, loaded those, and
then going to read in now the csv
-
with my tweets.
-
Hopefully that works.
-
Note here again, earlier -- in an earlier
segment, I set the working directory
-
you know when you download the csv from
Canvas,
-
you'll need to put it in a working
directory and set that appropriately
-
but I should have this in here now.
-
Let's double check, yeah, okay.
-
So, looking in the global environment,
I can see in fact I'll just view it here.
-
Now I've got my Twitter data in here
as time stamp, tweet ID, text,
-
retweet count, and favorite count.
-
As you might imagine, this is
the time stamp of when it was tweeted.
-
That's an ID that Twitter attaches to it.
-
That's the actual text.
-
This is-- I can't remember whether this is
how often I retweeted it or whether
-
others retweeted it.
-
I think this might be how often others
retweeted it, and then a favorite count
-
in terms of how often something was liked
or disliked so, you're going to--
-
if you want to work with my Twitter data,
-
you can go ahead and judge me in
all sorts of ways (laughter)
-
but not in ways that couldn't
have otherwise
-
because you can find me on Twitter
but at any rate,
-
that's my Twitter data in a nutshell.
-
I'm not actually on Twitter all that much,
as you can see: in my case, I've got
-
just scrolling through here, I mean
only in the order of, let's see,
-
475 total and that actually, I believe,
counts retweets
-
or it'll keep when I retweet others,
-
which I think actually
is this text 'RT' you can see here
-
is when I retweet someone else.
-
I believe retweet count is how often
someone else retweets what I put out there.
-
Again, I don't do a lot of it here
-
but that gives you a sense of
what this looks like.
-
One thing I want to draw your attention
to here is that Twitter's time stamp is
-
a complete nightmare when it comes to R.
-
So, you know, I don't know if this makes
you guys feel any better or worse
-
but just know that I sat there and banged
my head against the wall for you know
-
10, 15 minutes just trying to figure out
the best way to parse this time stamp data
-
so that I could then use the lubridate
command to be able to then declare
-
the date and time information so then
I could use it for some analysis.
-
I was really trying to wrap my head
around the best way,
-
I think there's probably much
better ways out there
-
but just to give you a sense that this can
be a pain for everyone.
-
I mean, back on this --
-
sorry, I was going to make this point --
I mean, if you look at this, they put
-
day of the week, month,
date, and then hour, minute, second
-
and then time zone and then year, which
is like --
-
lubridate has no way of dealing
with this
-
and it essentially --
I basically thought,
-
"well, I'm just going to parse this into a
bunch of different variables,
-
"rename them,
-
"bind them back together, in an order
that I want and then declare it in
-
"the proper date and time format."
-
So, the code I used for all that is the
following-- I have to pull this over...
-
Fix the time stamp so it's usable in R.
-
Alright, this is -- again, something
fairly complicated here but essentially
-
trying to fix this so it's usable in some
way and this required me to essentially
-
pull apart -- which is what I'm doing
here, I'm pulling apart
-
all the different elements of
-
that long and complicated
time stamp there, all separated by spaces,
-
creating something of a matrix here
and then -- in fact, I can show you
-
what I did there. I turned it into its own
little matrix like this.
-
This is a separate data type here,
-
this is not in the central data frame,
I've created this as a matrix.
-
That is a matrix command and it now
just puts everything into a matrix of sorts
-
and then renamed everything in
the matrix.
-
So we've just done that there and then
bind it back-- oh shoot.
-
I just... okay.
-
Bind it back together.
-
Now, you can see I've got the time
stamp but I've got all these separate
-
variables now: day of week, month, day,
hms, time zone, and year.
-
And then, I can use -- going back to
the lubridate command,
-
this is key to what we taught you earlier
-
in terms of declaring the date and time
format, the 'ymd.'
-
I'm putting together year, month, day
calling it 'ymd'
-
and then that's going
to give me a final data frame
-
that then becomes usable and
you can see up here
-
in tweets, now I've got a time stamp that
is year, month, day
-
and it's officially declared as such.
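To make that parse-and-rebind approach concrete, here's a minimal sketch with a made-up timestamp in Twitter's format. The column labels are my own, and, as a hedge, note that base R's strptime can actually parse this format in one call if you'd rather skip the splitting:

```r
library(lubridate)

# A made-up timestamp in the format Twitter's archive uses:
# day of week, month, day, hh:mm:ss, time zone offset, year
ts <- "Wed Oct 10 20:19:24 +0000 2018"

# Pull the pieces apart on spaces into a little matrix, then label them
parts <- do.call(rbind, strsplit(ts, " ", fixed = TRUE))
colnames(parts) <- c("dow", "month", "day", "hms", "tz", "year")

# Rebuild a proper date: convert the month abbreviation to a number,
# then let lubridate declare it as a real Date
stamp <- make_date(
  year  = as.integer(parts[, "year"]),
  month = match(parts[, "month"], month.abb),
  day   = as.integer(parts[, "day"])
)
stamp  # a genuine Date that R can plot and group by

# For reference, base R can parse the whole thing in one shot:
# strptime(ts, "%a %b %d %H:%M:%S %z %Y", tz = "UTC")
```

The split-and-rebind route mirrors what the lecture does; the strptime one-liner is an alternative, not the method shown on screen.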
-
Remember back in module seven, I believe:
if we can declare this 'ymd'
-
or declare it in a format that is usable
for R, with that lubridate package,
-
then that timestamp variable becomes
something that R could then use to create
-
plots, figures, graphs, tables, whatever,
rather than being a complete mess.
-
And if you don't believe me on this,
-
just go and try to take that original date
and time stamp information from Twitter
-
and try to do anything with it and
-
I guarantee that R is going to choke
over and over and over and over again
-
just because it's in a completely
non-usable format.
-
So, you want to get it back into some
format that's like this time stamp.
-
I could've added, like, the
hour, minute, second stuff
-
but it was sufficient to just
do a time stamp like this.
-
Get it to that format and declare it
as an official time unit and then,
-
as we'll see -- when we do --
in fact, we're going to do this now.
-
Let's go ahead and plot my Twitter
activity and we can see in the code here
-
what we're talking about.
-
We want to plot the Twitter
activity over time.
-
So, see here, now what I've done is
in this histogram,
-
I've declared the time stamp
as the x variable.
-
I can only do that because I declared
the time information using that lubridate
-
command which required me to get
the date and time and the proper format.
-
If I do that, you can see that this
is my basic Twitter activity over time.
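The plot itself boils down to a ggplot histogram with the declared timestamp on the x-axis. A minimal sketch with toy data -- the variable and column names here are stand-ins, not necessarily what's in the csv:

```r
library(ggplot2)
library(lubridate)

# Toy stand-in for the real data: a few tweet dates
tweets <- data.frame(
  timestamp = ymd(c("2018-01-05", "2018-01-06", "2018-01-06", "2019-03-01"))
)

# Histogram of tweet activity over time; a 30-day binwidth is a choice,
# not something dictated by the data
p <- ggplot(tweets, aes(x = timestamp)) +
  geom_histogram(binwidth = 30) +
  labs(x = "Date", y = "Tweets")
```

The only prerequisite is that timestamp has been declared as a real date, which is exactly why the parsing above mattered.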
-
It probably comes as no surprise,
given looking over that very briefly,
-
but I'm a pretty
unfaithful Twitter user.
-
I'm pretty unfaithful at most
social media use, in general,
-
going from these giant bursts such as
in January 2018
-
all the way down to like long
periods of no activity here.
-
But that gives you a general sense of the
over-time type behavior
-
that may or may not be useful,
-
we may want to do other things
like look at, again,
-
word frequency or
sentiment analysis, which we can
-
do here in a second but
that gives you a basic look.
-
One other thing that's useful with
Twitter data is to get rid of retweets.
-
If you're not -- if you don't use Twitter:
you can compose an original tweet,
-
and oftentimes people do that, but if
someone else has an original tweet,
-
you can just choose to retweet that and then
it just broadcasts that
-
or whatever back to the
rest of your Twitter followers.
-
But it's like broadcasting someone else's
message, so --
-
most social media has the same thing,
-
sharing or otherwise but those are often
not as useful
-
because it's maybe not
your original thought.
-
So if we wanted to get to like
my original thought here,
-
which admittedly in Twitter is not much,
we would want to get rid of retweets
-
and so, let's do that
and then we can look at some
-
word counts and otherwise
so just my original tweets here.
-
So, this code here will allow us
to get rid of some retweets and
-
the key thing here is remember
our filter command gets rid of
-
observations so we won't go through
everything here but just know
-
we're trying to find instances where
that 'RT' is in the text,
-
and then get rid of that.
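In code, that filter amounts to dropping any row whose text starts with the 'RT' marker. A hedged sketch with toy tweets -- the column name `text` matches the csv described above, but the rows are made up:

```r
library(dplyr)
library(stringr)

# Toy data: two originals and one retweet
tweets <- tibble(
  text = c("RT @someone: their thought", "my own thought", "another one")
)

# Keep only rows that do NOT start with the retweet marker
my_tweets <- tweets %>%
  filter(!str_detect(text, "^RT"))
```

filter keeps rows where the condition is TRUE, so negating str_detect drops the retweets and leaves the original tweets.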
-
And if we can do that, then
it'll essentially get rid of all the
-
non-original information here
and that should give us,
-
let's see what we call
this: 'tidy tweets'.
-
Let's see, so we now have that
-
and then we've got our word counts,
there we go, okay, great.
-
So with that then, we tidied it up--
I forgot to explain this line, which was,
-
remember, unnesting tokens,
so we've unnested the tokens here as well,
-
that way it's tokenized
so every row now is a single word here
-
so that we can analyze the
words across all the tweets.
-
So with that then, we can look at
our frequencies, which we did earlier,
-
but it's just the standard,
just copy and paste this in
-
but... "(word, sort = TRUE)"
I believe,
-
(inaudible)
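Putting the tokenizing and counting together, here's a minimal sketch with toy tweets; the stop-word removal via tidytext's stop_words is the same move we've used in earlier segments:

```r
library(dplyr)
library(tidytext)

# Toy tweets standing in for the filtered, original-only data
tweets <- tibble(text = c("science news today", "more science news"))

# One row per word, drop common stop words, then count with the
# most frequent words first
word_counts <- tweets %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```

unnest_tokens does the tokenizing (one word per row, lowercased), and count(word, sort = TRUE) is the frequency line quoted above.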
-
Here we've got @AP, so it looks like
my original tweets are actually
-
not even that original.
-
I'm probably mostly posting
news articles, we've got news,
-
Trump, kids, @NateMJensen,
one of my colleagues here
-
who incessantly makes fun of me
and I him, and science.
-
So those are some of the basics,
the most frequently used words.
-
We can also do this as a figure,
recall our code from earlier,
-
visualize word counts; that should
give us a few more words, and in that
-
brief one, we have the higher ones:
@AP, news, Trump, kids, science,
-
Nate Jensen... Now we've got a few more
here: war, people, Sudan, peace, papers,
-
family, BBC, and so forth and some others,
Dada Kim, at Riverside, Claire Adida
-
at UC San Diego, so, others I engage with
some on Twitter... anyway,
-
that gives you some -- one view of the
word counts; we could do the word cloud,
-
which again, it's going to tell us
something fairly similar
-
or just allow us to visualize it
differently.
-
That's that same text there, you can see
a lot of those same words showing up.
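The word cloud call is the same one from earlier: hand the words and their counts to the wordcloud package. A toy sketch -- the counts here are invented, not my real ones:

```r
library(wordcloud)

# Toy word counts standing in for the real frequency table
words <- c("news", "science", "trump", "kids")
freqs <- c(10, 8, 6, 4)

# min.freq = 1 keeps everything in this tiny example;
# with real data you'd raise it to trim rare words
wordcloud(words = words, freq = freqs, min.freq = 1)
```

With the word_counts frame from above, you'd pass word_counts$word and word_counts$n instead.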
-
Anyway, we could do more, I didn't do
a sentiment analysis or anything.
-
We could do some of that stuff as well,
it'd be fairly straightforward: just call
-
the bing lexicon and then it would take all
the words in the tweets and,
-
essentially, query against that, compute
how often they're positive or
-
negative, and then you can compute that
sentiment score,
-
which is fairly straightforward.
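Sketched out, that bing approach is an inner join of the tokenized words against the lexicon, then a tally of positives minus negatives. Toy data again; get_sentiments("bing") ships with tidytext:

```r
library(dplyr)
library(tidytext)

# Toy tokenized tweets: one word per row, as unnest_tokens would give us
words <- tibble(word = c("great", "terrible", "wonderful", "news"))

# Join against the bing lexicon (a positive/negative label per word);
# words the lexicon doesn't know, like "news", simply drop out
scored <- words %>%
  inner_join(get_sentiments("bing"), by = "word")

# Net sentiment: positives minus negatives
sentiment_score <- sum(scored$sentiment == "positive") -
  sum(scored$sentiment == "negative")
```

On real data you'd group by a time unit (say, month of the declared timestamp) before summing, to see how sentiment moves over time.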
-
That's the basics of doing -- analyzing
your Twitter feed.
-
It's kind of fun to do. I'll admit,
I was --
-
between the timestamp information and
the downloaded csv --
-
somewhat frustrated trying to get this
in shape, but it ultimately worked.
-
You got some code here you can try to
adapt. But again, the key here:
-
get it in csv, download it from
Twitter, get the data, and then
-
try to fix that timestamp if you can.
-
With that said, you don't actually have to
fix the timestamp to do word clouds
-
or word counts or sentiment analysis.
-
You can skip all of that timestamp stuff
-
if you want and I wanted to work through
it for you so that you could do
-
the over-time plots if you wanted. But if
all you want to do is
-
word clouds, sentiment analysis... word
frequencies, that type of thing,
-
and topic models or whatever, you could do
all of that without the date information.
-
Anyway, that gives you a basic sense.
-
With that we'll end this segment, when we
come back in the next segment,
-
we'll, very quickly, look at Trump's
Twitter activity and then
-
move to a conclusion at that point.
So we'll end that segment here.