-
In this lecture, we’re going to talk about trying out your interface with people
-
and doing so in a way that you can improve your designs based on what you learned.
-
One of the most common things that people ask when running studies is: “Do you like my interface?”
-
and it’s a really natural thing to ask, because on some level it’s what we all want to know.
-
But this is really problematic on a whole lot of levels.
-
For one it’s not very specific, and so sometimes people are trying to make this better
-
and so they’ll improve it by doing something like: “How much do you like my interface on a one-to-five scale?”
-
Or: “‘This is a useful interface’ — Agree or disagree on a one-to-five scale.”
-
And this adds some kind of a patina of scientificness to it
-
but really it’s just the same thing — you’re asking somebody “Do you like my interface?”
-
And people are nice, so they’re going to say “Sure I like your interface.”
-
This is the “please the experimenter” bias.
-
And this can be especially strong when there are social or cultural or power differences
-
between the experimenter and the people that you’re trying out your interface with:
-
For example, [inaudible] and colleagues showed this effect in India
-
where this effect was exacerbated when the experimenter was white.
-
Now, you should not take this to mean that you shouldn’t have your developers try out stuff with users —
-
Being the person who is both the developer and the person who is trying stuff out is incredibly valuable.
-
And one example I like a lot of this is Mike Krieger,
-
one of the Instagram founders — [he] is also a former master’s student and TA of mine.
-
And Mike, when he left Stanford and joined Silicon Valley,
-
every Friday afternoon he would bring people into his office
-
and have them try out whatever they were working on that week.
-
And so that way they were able to get this regular feedback each week
-
and the people who were building those systems got to see real people trying them out.
-
This can be nails-on-a-chalkboard painful, but you’ll also learn a ton.
-
So how do we get beyond “Do you like my interface?”
-
The basic strategy that we’re going to talk about today is being able
-
to use specific measures and concrete questions to be able to deliver meaningful results.
-
One of the problems of “Do you like my interface?” is “Compared to what?”
-
And I think one of the reasons people say “Yeah sure” is that there’s no comparison point
-
and so one thing that’s really important is when you’re measuring the effectiveness of your interface,
-
even informally, it’s really nice to have some kind of comparison.
-
It’s also important to think about, well, what’s the yardstick?
-
What constitutes “good” in this arena?
-
What are the measures that you’re going to use?
-
So how can we get beyond “Do you like my interface?”
-
One of the ways that we can start out is by asking a base rate question,
-
like “What fraction of people click on the first link in a search results page?”
-
Or “What fraction of students come to class?”
-
Once we start to measure correlations things get even more interesting,
-
like, “Is there a relationship between the time of day a class is offered and how many students attend it?”
-
Or “Is there a relationship between the order of a search result and the clickthrough rate?”
-
For both students and clickthrough, there can be multiple explanations.
-
For example, if there are fewer students that attend early morning classes,
-
is that a function of when students want to show up,
-
or is that a function of when good professors want to teach?
-
With the clickthrough example, there are also two kinds of explanations.
-
If lower-placed links yield fewer clicks, is that because the links are of intrinsically poorer quality,
-
or is it because people just click on the first link —
-
[that] they don’t bother getting to the second one even if it might be better?
-
To isolate the effect of placement and identify it as playing a causal role,
-
you’d need to isolate that as a variable by, say, randomizing the order of search results.
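-
A minimal sketch of that randomization, assuming a hypothetical `serve_results` function in a search frontend:

```python
import random

def serve_results(ranked_links):
    """Serve the same links in a random order, so that placement
    is statistically independent of intrinsic link quality."""
    shuffled = list(ranked_links)   # leave the original ranking untouched
    random.shuffle(shuffled)        # placement no longer tracks quality
    return shuffled

ranked = ["link_a", "link_b", "link_c", "link_d"]
served = serve_results(ranked)
# Same links, possibly different order; any remaining effect of
# position on clicks can now be attributed to position itself.
```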
-
As we start to talk about these experiments, let’s introduce a few terms that are going to help us.
-
The multiple different conditions that we try, that’s the thing we are manipulating —
-
for example, the time of a class, or the location of a particular link on a search results page.
-
These manipulations are independent variables because they are independent of what the user does.
-
They are in the control of the experimenter.
-
Then we are going to measure what the user does
-
and those measures are called dependent variables because they depend on what the user does.
-
Common measures in HCI include things like task completion time —
-
How long does it take somebody to complete a task
-
(for example, find something I want to buy, create a new account, order an item)?
-
Accuracy — How many mistakes did people make,
-
and were those fatal errors or were those things that they were able to quickly recover from?
-
Recall — How much does a person remember afterward, or after periods of non-use?
-
And emotional response — How does the person feel about the tasks being completed?
-
Were they confident, were they stressed?
-
Would the user recommend this system to a friend?
-
So, your independent variables are the things that you manipulate,
-
your dependent variables are the things that you measure.
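-
As a sketch (the field names here are hypothetical, not from the lecture), one trial of such an experiment might be recorded like this:

```python
from dataclasses import dataclass

@dataclass
class Trial:
    # Independent variable: set by the experimenter.
    input_style: str           # e.g. "iphone", "qwerty", "nine_key"
    # Dependent variables: measured from what the user does.
    completion_time_s: float   # task completion time
    errors: int                # accuracy
    would_recommend: bool      # a proxy for emotional response

t = Trial(input_style="qwerty", completion_time_s=41.2,
          errors=2, would_recommend=True)
```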
-
How reliable is your experiment?
-
If you ran this again, would you see the same results?
-
That’s the internal validity of an experiment.
-
So, to have a precise experiment, you need to remove the confounding factors.
-
Also, it’s important to study enough people so that the result is unlikely to have been by chance.
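-
One common way to check that, sketched here with a simple permutation test (the task times are invented for illustration):

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_iter=10_000, seed=0):
    """Estimate how often a difference in group means at least as large
    as the observed one would arise by chance alone (two-sided)."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(mean(a) - mean(b)) >= observed:
            extreme += 1
    return extreme / n_iter  # a small value means "unlikely to be chance"

# Hypothetical task-completion times (seconds) for two interface conditions:
p = permutation_test([41, 44, 39, 43], [61, 65, 58, 63])
```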
-
You may be able to run the same study over and over and get the same results
-
but it may not matter in some real-world sense. The external validity of an experiment is the generalizability of your results.
-
Does this apply only to eighteen-year-olds in a college classroom?
-
Or does this apply to everybody in the world?
-
Let’s bring this back to HCI and talk about one of the problems you’re likely to face as a designer.
-
I think one of the things that we commonly want to be able to do
-
is to be able to ask something like “Is my cool new approach better than the industry standard?”
-
Because after all, that’s why you’re making the new thing.
-
Now, one of the challenges with this, especially early on in the design process
-
is that you may have something which is very much in its prototype stages
-
and something that is the industry standard is likely to benefit from years and years of refinement.
-
And at the same time, it may be stuck with years and years of cruft
-
which may or may not be intrinsic to its approach.
-
So if you compare your cool new tool to some industry standard, there are two things varying here.
-
One is the fidelity of the implementation and the other one of course is the approach.
-
Consequently, when you get the results,
-
you can’t know whether to attribute the results to fidelity or approach or some combination of the two.
-
So we’re going to talk about ways of teasing apart those different causal factors.
-
Now, one thing I should say right off the bat is there are some times where it may be more
-
or less relevant whether you have a good handle on what the causal factors are.
-
So for example, if you’re trying to decide between two different digital cameras,
-
at the end of the day, maybe all you care about is image quality or usability or some other factor
-
and exactly what makes that image quality better or worse
-
or any other element along the way may be less relevant to you.
-
If you don’t have control over the variables, then identifying cause may not always be what you want.
-
But when you are a designer, you do have control over the variables,
-
and that’s when it is really important to ascertain cause.
-
Here’s an example of a study that came out right when the iPhone was released,
-
done by a research firm User Centric, and I’m going to read from this news article here.
-
Research firm User Centric has released a study
-
that tries to gauge how effective the iPhone’s unusual onscreen keyboard is.
-
The goal is certainly a noble one
-
but I cannot say the survey’s approach results in data that makes much sense.
-
User Centric brought in twenty owners of other phones.
-
Half had qwerty keyboards; half had ordinary phones with numeric keypads.
-
None were familiar with the iPhone.
-
The research involved having the test subjects enter six sample text messages with the phones
-
that they already had, and six with the iPhone.
-
The end result was that the iPhone newbies took twice as long
-
to enter text with an iPhone as they did with their own phones and made lots more typos.
-
So let’s critique this study and talk about its benefits and drawbacks.
-
Here’s the webpage directly from User Centric.
-
What’s our manipulation in this study?
-
Well the manipulation is going to be the input style.
-
How about the measure in the study?
-
It’s going to be the words per minute.
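-
Words per minute, the measure used here, is conventionally computed by treating five characters as one word:

```python
def words_per_minute(chars_typed, seconds):
    """Standard typing-speed convention: one 'word' = five characters."""
    return (chars_typed / 5) / (seconds / 60)

# 150 characters in one minute is 30 words per minute.
speed = words_per_minute(150, 60)
```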
-
And there’s absolutely value in being able to measure the initial usability of the iPhone,
-
for several reasons. One is that if you’re introducing new technology,
-
it’s beneficial if people are able to get up to speed pretty quickly.
-
However it’s important to realize that this comparison is intrinsically unfair
-
because the users of the previous cell phones were experts at that input modality
-
and the people who are using the iPhone are novices in that modality.
-
And so it seems quite likely that the iPhone users, once they become actual users,
-
are going to get better over time and so if you’re not used to something the first time you try it,
-
that may not be a deal killer, and it’s certainly not an apples-to-apples comparison.
-
Another thing that we don’t get out of this article is “Is this difference significant?”
-
So we read that each person typed six messages in each of two conditions
-
and so they did their own device and the iPhone, or vice versa.
-
Six messages each, and the people typing with the iPhone
-
were half as fast as when they got to type with the mini qwerty
-
device that they were accustomed to.
-
So while this may tell us something about the initial usability of the iPhone,
-
in terms of the long-term usability, you know, I don’t think we get so much out of this here.
-
If you weren’t satisfied by that initial data, you’re in good company: neither were the authors of that study.
-
So they went back a month later and they ran another study where they brought in 40 new people to the lab
-
who were either iPhone users, qwerty users, or nine-key users.
-
And now it’s more of an apples-to-apples comparison
-
in that they are going to test people that are relatively experts in these three different modalities —
-
after about a month on the iPhone you’re probably starting to asymptote in terms of your performance.
-
Definitely it gets better over time, even past a month; but, you know, a month starts to get more reasonable.
-
And what they found was that iPhone users and qwerty users were about the same in terms of speed,
-
and that the numeric keypad users were much slower.
-
So once again our manipulation is going to be input style and we’re going to measure speed.
-
This time we’re also going to measure error rate.
-
And what we see is that iPhone users and qwerty users are essentially the same speed.
-
However, the iPhone users make many more errors.
-
Now, one thing I should point out about the study is
-
that each of the different devices was used by a different group of people.
-
And it was done this way so that each device was used by somebody
-
who is comfortable and had experience with working with that device.
-
And so, we removed the worry that you had newbies working on these devices.
-
However, especially in 2007, there may have been significant differences
-
in who these user populations were: early adopters may have been drawn to the iPhone,
-
business users may have been particularly drawn to the qwerty devices, and people who had better things
-
to do with their time than send e-mail on their telephone may have stuck with the nine-key devices.
-
And so, while this comparison is better than the previous one,
-
the potential for variation between the user populations is still problematic.
-
If what you’d like to be able to claim is something about the intrinsic properties of the device,
-
it may at least in part have to do with the users.
-
So, what are some strategies for fairer comparison?
-
To brainstorm a couple of options: one thing that you can do is insert your approach into a production setting
-
and this may seem like a lot of work —
-
sometimes it is but in the age of the web this is a lot easier than it used to be.
-
And it’s possible even if you don’t have access to the server of the service that you’re comparing against.
-
You can use things like a proxy server or client-side scripting
-
to be able to put your own technique in and have an apples-to-apples comparison.
-
A second strategy for neutralizing the environment difference between a production version
-
and your new approach is to make a version of the production thing in the same style as your new approach.
-
That also makes them equivalent in terms of their implementation fidelity.
-
A third strategy and one that’s used commonly in research,
-
is to scale things down so you’re looking at just a piece of the system at a particular point in time.
-
That way you don’t have to worry about implementing a whole big, giant thing.
-
You can just focus on one small piece and have that comparison be fair.
-
And the fourth strategy is that when expertise is relevant,
-
train people up — give them the practice that they need —
-
so that they can start at least hitting that asymptote in terms of performance
-
and you can get a better read than what they would be as newbies.
-
So now to close out this lecture, if somebody asks you the question “Is interface x better than interface y?”
-
you know that we’re off to a good start because we have a comparison.
-
However, you also know to be worried: What does “better” mean?
-
And often, in a complex system, you’re going to have several measures. That’s totally cool.
-
There’s a lot of value in being explicit though about what it is you mean by better —
-
What are you trying to accomplish? What are you trying to [im]prove?
-
And if anybody ever tells you that their interface is always better,
-
don’t believe them because nearly all of the time the answer is going to be “it depends.”
-
And the interesting question is “What does it depend on?”
-
Most interfaces are good for some things and not for others.
-
For example if you have a tablet computer where all of the screen is devoted to display,
-
that is going to be great for reading, for web browsing, for that kind of activity, looking at pictures.
-
Not so good if you want to type a novel.
-
So here, we’ve introduced controlled comparison
-
as a way of finding the smoking gun, as a way of inferring cause.
-
And often, when you have only two conditions,
-
we’re going to talk about that as being a minimal pairs design.
-
As a practicing designer, the reason to care about what’s causal
-
is that it gives you the material to make a better decision going forward.
-
A lot of studies violate this constraint.
-
And that gets dangerous because it prevents you from being able to make sound decisions.
-
I hope that the tools that we’ve talked about today and in the next several lectures
-
will help you become a wise skeptic like our friend in this XKCD comic.
-
I’ll see you next time.