
Lecture 9.1: Designing Studies you can learn from (15:52)

  • 0:02 - 0:05
    In this lecture, we’re going to talk about trying out your interface with people
  • 0:05 - 0:12
    and doing so in a way that you can improve your designs based on what you learned.
  • 0:12 - 0:17
    One of the most common things that people ask when running studies is: “Do you like my interface?”
  • 0:17 - 0:21
    and it’s a really natural thing to ask, because on some level it’s what we all want to know.
  • 0:21 - 0:24
    But this is really problematic on a whole lot of levels.
  • 0:24 - 0:28
    For one it’s not very specific, and so sometimes people are trying to make this better
  • 0:28 - 0:34
    and so they’ll improve it by doing something like: “How much do you like my interface on a one-to-five scale?”
  • 0:34 - 0:39
    Or: “‘This is a useful interface’ — agree or disagree on a one-to-five scale.”
  • 0:39 - 0:43
    And this adds some kind of a patina of scientificness to it
  • 0:43 - 0:47
    but really it’s just the same thing — you’re asking somebody “Do you like my interface?”
  • 0:47 - 0:50
    And people are nice, so they’re going to say “Sure I like your interface.”
  • 0:50 - 0:52
    This is the “please the experimenter” bias.
  • 0:52 - 0:57
    And this can be especially strong when there are social or cultural or power differences
  • 0:57 - 1:01
    between the experimenter and the people that you’re trying out your interface with:
  • 1:01 - 1:05
    For example, [inaudible] and colleagues showed this effect in India,
  • 1:05 - 1:09
    where it was exacerbated when the experimenter was white.
  • 1:09 - 1:16
    Now, you should not take this to mean that you shouldn’t have your developers try out stuff with users —
  • 1:16 - 1:22
    Being both the developer and the person who is trying stuff out with users is incredibly valuable.
  • 1:22 - 1:25
    And one example I like a lot of this is Mike Krieger,
  • 1:25 - 1:30
    one of the Instagram founders — [he] is also a former master’s student and TA of mine.
  • 1:30 - 1:32
    And Mike, when he left Stanford and joined Silicon Valley,
  • 1:32 - 1:36
    every Friday afternoon he would bring people into his office
  • 1:36 - 1:40
    and have them try out whatever they were working on that week.
  • 1:40 - 1:43
    And so that way they were able to get this regular feedback each week
  • 1:43 - 1:48
    and the people who were building those systems got to see real people trying them out.
  • 1:48 - 1:52
    This can be nails-on-a-chalkboard painful, but you’ll also learn a ton.
  • 1:52 - 1:55
    So how do we get beyond “Do you like my interface?”
  • 1:55 - 1:59
    The basic strategy that we’re going to talk about today is being able
  • 1:59 - 2:05
    to use specific measures and concrete questions to be able to deliver meaningful results.
  • 2:05 - 2:10
    One of the problems of “Do you like my interface?” is “Compared to what?”
  • 2:10 - 2:16
    And I think one of the reasons people say “Yeah sure” is that there’s no comparison point
  • 2:16 - 2:22
    and so one thing that’s really important is when you’re measuring the effectiveness of your interface,
  • 2:22 - 2:26
    even informally, it’s really nice to have some kind of comparison.
  • 2:26 - 2:29
    It’s also important to think about, well, what’s the yardstick?
  • 2:29 - 2:31
    What constitutes “good” in this arena?
  • 2:31 - 2:34
    What are the measures that you’re going to use?
  • 2:34 - 2:37
    So how can we get beyond “Do you like my interface?”
  • 2:37 - 2:41
    One of the ways that we can start out is by asking a base rate question,
  • 2:41 - 2:47
    like “What fraction of people click on the first link in a search results page?”
  • 2:47 - 2:50
    Or “What fraction of students come to class?”
  • 2:50 - 2:55
    Once we start to measure correlations things get even more interesting,
  • 2:55 - 3:00
    like, “Is there a relationship between the time of day a class is offered and how many students attend it?”
  • 3:00 - 3:08
    Or “Is there a relationship between the order of a search result and the clickthrough rate?”
  • 3:08 - 3:11
    For both the class-attendance and the clickthrough examples, there can be multiple explanations.
  • 3:11 - 3:16
    For example, if there are fewer students that attend early morning classes,
  • 3:16 - 3:19
    is that a function of when students want to show up,
  • 3:19 - 3:23
    or is that a function of when good professors want to teach?
  • 3:23 - 3:26
    With the clickthrough example, there are also two kinds of explanations.
  • 3:26 - 3:38
    If lower-placed links yield fewer clicks, is that because the links are of intrinsically poorer quality,
  • 3:38 - 3:41
    or is it because people just click on the first link —
  • 3:41 - 3:45
    [that] they don’t bother getting to the second one even if it might be better?
  • 3:45 - 3:49
    To isolate the effect of placement and identify it as playing a causal role,
  • 3:49 - 3:54
    you’d need to isolate that as a variable by, say, randomizing the order of search results.
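
(A rough sketch, not from the lecture: one way to randomize placement for a small slice of traffic so that a link’s position is independent of its quality. The function name and the five-percent fraction are made up for illustration.)

```python
import random

def maybe_randomize(results, randomize_fraction=0.05, rng=random):
    """For a small fraction of queries, shuffle the result order so that
    placement is independent of result quality; return the list to show
    and whether it was randomized."""
    if rng.random() < randomize_fraction:
        shuffled = list(results)   # copy; keep the original ranking intact
        rng.shuffle(shuffled)      # position no longer tracks quality
        return shuffled, True
    return results, False

# Comparing clickthrough rate by displayed position within the randomized
# bucket isolates the causal effect of placement.
```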
  • 3:54 - 4:00
    As we start to talk about these experiments, let’s introduce a few terms that are going to help us.
  • 4:00 - 4:05
    The multiple different conditions that we try, that’s the thing we are manipulating —
  • 4:05 - 4:12
    for example, the time of a class, or the location of a particular link on a search results page.
  • 4:12 - 4:18
    These manipulations are independent variables because they are independent of what the user does.
  • 4:18 - 4:22
    They are in the control of the experimenter.
  • 4:22 - 4:27
    Then we are going to measure what the user does
  • 4:27 - 4:31
    and those measures are called dependent variables because they depend on what the user does.
  • 4:31 - 4:36
    Common measures in HCI include things like task completion time —
  • 4:36 - 4:39
    How long does it take somebody to complete a task
  • 4:39 - 4:43
    (for example, find something I want to buy, create a new account, order an item)?
  • 4:43 - 4:47
    Accuracy — How many mistakes did people make,
  • 4:47 - 4:51
    and were those fatal errors or were those things that they were able to quickly recover from?
  • 4:51 - 4:55
    Recall — How much does a person remember afterward, or after periods of non-use?
  • 4:55 - 4:59
    And emotional response — How does the person feel about the tasks being completed?
  • 4:59 - 5:01
    Were they confident, were they stressed?
  • 5:01 - 5:04
    Would the user recommend this system to a friend?
  • 5:04 - 5:09
    So, your independent variables are the things that you manipulate,
  • 5:09 - 5:12
    your dependent variables are the things that you measure.
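
(An illustrative sketch, not part of the lecture: recording each trial with the condition you manipulated and the measures you took makes the independent/dependent distinction concrete. All names and values here are hypothetical.)

```python
from dataclasses import dataclass

@dataclass
class Trial:
    # Independent variable: the condition the experimenter controls.
    condition: str            # e.g. "new_design" vs. "industry_standard"
    # Dependent variables: what we measure about the user's behavior.
    completion_time_s: float  # task completion time, in seconds
    errors: int               # number of mistakes made
    would_recommend: bool     # emotional response / recommendation

trials = [
    Trial("new_design", 41.2, 1, True),
    Trial("industry_standard", 55.0, 3, False),
]
```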
  • 5:12 - 5:14
    How reliable is your experiment?
  • 5:14 - 5:18
    If you ran this again, would you see the same results?
  • 5:18 - 5:21
    That’s the internal validity of an experiment.
  • 5:21 - 5:25
    So, to have a precise experiment, you need to remove confounding factors.
  • 5:25 - 5:30
    Also, it’s important to study enough people so that the result is unlikely to have been by chance.
  • 5:30 - 5:34
    You may be able to run the same study over and over and get the same results
  • 5:34 - 5:42
    but it may not matter in some real-world sense; external validity is the generalizability of your results.
  • 5:42 - 5:45
    Does this apply only to eighteen-year-olds in a college classroom?
  • 5:45 - 5:48
    Or does this apply to everybody in the world?
  • 5:48 - 5:52
    Let’s bring this back to HCI and talk about one of the problems you’re likely to face as a designer.
  • 5:52 - 5:55
    I think one of the things that we commonly want to be able to do
  • 5:55 - 6:00
    is to be able to ask something like “Is my cool new approach better than the industry standard?”
  • 6:00 - 6:03
    Because after all, that’s why you’re making the new thing.
  • 6:03 - 6:07
    Now, one of the challenges with this, especially early on in the design process
  • 6:07 - 6:11
    is that you may have something which is very much in its prototype stages
  • 6:11 - 6:17
    and something that is the industry standard is likely to benefit from years and years of refinement.
  • 6:17 - 6:22
    And at the same time, it may be stuck with years and years of cruft
  • 6:22 - 6:25
    which may or may not be intrinsic to its approach.
  • 6:25 - 6:31
    So if you compare your cool new tool to some industry standard, there are two things varying here.
  • 6:31 - 6:36
    One is the fidelity of the implementation and the other one of course is the approach.
  • 6:36 - 6:38
    Consequently, when you get the results,
  • 6:38 - 6:44
    you can’t know whether to attribute the results to fidelity or approach or some combination of the two.
  • 6:44 - 6:48
    So we’re going to talk about ways of teasing apart those different causal factors.
  • 6:48 - 6:54
    Now, one thing I should say right off the bat is there are some times where it may be more
  • 6:54 - 6:57
    or less relevant whether you have a good handle on what the causal factors are.
  • 6:57 - 7:01
    So for example, if you’re trying to decide between two different digital cameras,
  • 7:01 - 7:08
    at the end of the day, maybe all you care about is image quality or usability or some other factor
  • 7:08 - 7:13
    and exactly what makes that image quality better or worse
  • 7:13 - 7:18
    or any other element along the way may be less relevant to you.
  • 7:18 - 7:24
    If you don’t have control over the variables, then identifying cause may not always be what you want.
  • 7:24 - 7:28
    But when you are a designer, you do have control over the variables,
  • 7:28 - 7:31
    and that’s when it is really important to ascertain cause.
  • 7:31 - 7:36
    Here’s an example of a study that came out right when the iPhone was released,
  • 7:36 - 7:41
    done by a research firm User Centric, and I’m going to read from this news article here.
  • 7:41 - 7:43
    Research firm User Centric has released a study
  • 7:43 - 7:49
    that tries to gauge how effective the iPhone’s unusual onscreen keyboard is.
  • 7:49 - 7:51
    The goal is certainly a noble one
  • 7:51 - 7:56
    but I cannot say the survey’s approach results in data that makes much sense.
  • 7:56 - 8:00
    User Centric brought in twenty owners of other phones.
  • 8:00 - 8:05
    Half had QWERTY keyboards; half had ordinary phones with numeric keypads.
  • 8:05 - 8:08
    None were familiar with the iPhone.
  • 8:08 - 8:14
    The research involved having the test subjects enter six sample text messages with the phones
  • 8:14 - 8:17
    that they already had, and six with the iPhone.
  • 8:17 - 8:21
    The end result was that the iPhone newbies took twice as long
  • 8:21 - 8:27
    to enter text with an iPhone as they did with their own phones and made lots more typos.
  • 8:27 - 8:32
    So let’s critique this study and talk about its benefits and drawbacks.
  • 8:32 - 8:34
    Here’s the webpage directly from User Centric.
  • 8:34 - 8:38
    What’s our manipulation in this study?
  • 8:38 - 8:42
    Well the manipulation is going to be the input style.
  • 8:42 - 8:45
    How about the measure in the study?
  • 8:45 - 8:49
    It’s going to be the words per minute.
  • 8:49 - 8:56
    And there’s absolutely value in being able to measure the initial usability of the iPhone.
  • 8:56 - 9:00
    For several reasons, one is if you’re introducing new technology,
  • 9:00 - 9:04
    it’s beneficial if people are able to get up to speed pretty quickly.
  • 9:04 - 9:09
    However it’s important to realize that this comparison is intrinsically unfair
  • 9:09 - 9:15
    because the users of the previous cell phones were experts at that input modality
  • 9:15 - 9:19
    and the people who were using the iPhone were novices in that modality.
  • 9:19 - 9:24
    And so it seems quite likely that the iPhone users, once they become actual users,
  • 9:24 - 9:29
    are going to get better over time and so if you’re not used to something the first time you try it,
  • 9:29 - 9:35
    that may not be a deal killer, and it’s certainly not an apples-to-apples comparison.
  • 9:35 - 9:40
    Another thing that we don’t get out of this article is “Is this difference significant?”
  • 9:40 - 9:47
    So we read that each person typed six messages in each of two conditions,
  • 9:47 - 9:52
    that is, their own device and then the iPhone, or vice versa.
  • 9:52 - 10:00
    Six messages each, and the iPhone users were half the speed of the…
  • 10:00 - 10:09
    or rather, the people typing with the iPhone were half as fast as when they got to type with the mini QWERTY or keypad
  • 10:09 - 10:13
    of the device that they were accustomed to.
  • 10:13 - 10:17
    So while this may tell us something about the initial usability of the iPhone,
  • 10:17 - 10:23
    in terms of the long-term usability, you know, I don’t think we get so much out of this here.
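
(A hedged sketch, not from the study: since each participant typed in both conditions, a paired test is one common way to ask whether a speed difference like this is unlikely to be due to chance. The numbers below are invented purely for illustration.)

```python
from scipy import stats

# Invented words-per-minute figures for the same ten people in two conditions.
own_phone_wpm = [22, 18, 25, 30, 17, 21, 28, 19, 24, 26]
iphone_wpm    = [11, 10, 13, 14,  9, 12, 15, 10, 11, 13]

t_stat, p_value = stats.ttest_rel(own_phone_wpm, iphone_wpm)  # paired t-test
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value says the difference is unlikely to be chance alone; it says
# nothing about long-term usability once the novices become experts.
```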
  • 10:23 - 10:30
    If you weren’t satisfied by that initial data, you’re in good company: neither were the authors of that study.
  • 10:30 - 10:35
    So they went back a month later and they ran another study where they brought in 40 new people to the lab
  • 10:35 - 10:40
    who were either iPhone users, QWERTY users, or nine-key users.
  • 10:40 - 10:43
    And now it’s more of an apples-to-apples comparison
  • 10:43 - 10:49
    in that they are going to test people who are relatively expert in these three different modalities —
  • 10:49 - 10:55
    after about a month on the iPhone you’re probably starting to asymptote in terms of your performance.
  • 10:55 - 11:03
    Definitely it gets better over time, even past a month; but, you know, a month starts to get more reasonable.
  • 11:03 - 11:12
    And what they found was that iPhone users and QWERTY users were about the same in terms of speed,
  • 11:12 - 11:17
    and that the numeric keypad users were much slower.
  • 11:17 - 11:22
    So once again our manipulation is going to be input style and we’re going to measure speed.
  • 11:22 - 11:25
    This time we’re also going to measure error rate.
  • 11:25 - 11:30
    And what we see is that iPhone users and QWERTY users are essentially the same speed.
  • 11:30 - 11:37
    However, the iPhone users make many more errors.
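
(Another hedged sketch, not from the study: with three separate groups of users, a between-subjects comparison such as a one-way ANOVA is one way to check whether the error-rate differences are larger than chance. The error counts are invented for illustration.)

```python
from scipy import stats

# Invented per-participant error counts, one list per (separate) group.
iphone_errors  = [7, 9, 6, 8, 10, 7, 9, 8]
qwerty_errors  = [3, 2, 4, 3,  5, 2, 3, 4]
ninekey_errors = [4, 5, 3, 6,  4, 5, 4, 3]

f_stat, p_value = stats.f_oneway(iphone_errors, qwerty_errors, ninekey_errors)
print(f"one-way ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")
# Even a significant result can't rule out differences between the user
# populations themselves (e.g., who chose to be an early iPhone adopter).
```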
  • 11:37 - 11:40
    Now, one thing I should point out about the study is
  • 11:40 - 11:47
    that each of the different devices was used by a different group of people.
  • 11:47 - 11:52
    And it was done this way so that each device was used by somebody
  • 11:52 - 11:56
    who was comfortable and had experience working with that device.
  • 11:56 - 12:01
    And so, we removed the worry that you had newbies working on these devices.
  • 12:01 - 12:05
    However, especially in 2007, there may have been significant differences
  • 12:05 - 12:11
    in who those people were: the early adopters of the 2007 iPhone,
  • 12:11 - 12:17
    the business users who were particularly drawn to the QWERTY devices, or the people who had better things
  • 12:17 - 12:22
    to do with their time than send e-mail on their telephone and so used the nine-key devices.
  • 12:22 - 12:27
    And so, while this comparison is better than the previous one,
  • 12:27 - 12:32
    the potential for variation between the user populations is still problematic.
  • 12:32 - 12:37
    If what you’d like to be able to claim is something about the intrinsic properties of the device,
  • 12:37 - 12:42
    the differences you measure may at least in part have to do with the users.
  • 12:42 - 12:45
    So, what are some strategies for fairer comparison?
  • 12:45 - 12:50
    To brainstorm a couple of options: one thing that you can do is insert your approach into a production setting,
  • 12:50 - 12:53
    and this may seem like a lot of work —
  • 12:53 - 12:57
    sometimes it is, but in the age of the web this is a lot easier than it used to be.
  • 12:57 - 13:03
    And it’s possible even if you don’t have access to the server of the service that you’re comparing against.
  • 13:03 - 13:07
    You can use things like a proxy server or client-side scripting
  • 13:07 - 13:12
    to be able to put your own technique in and have an apples-to-apples comparison.
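
(A minimal sketch under assumptions not stated in the lecture: if you can inject your technique into a live site via a proxy or client-side script, assigning each user to a condition deterministically keeps the comparison apples-to-apples across visits. The function and bucket names are hypothetical.)

```python
import hashlib

def assign_condition(user_id: str, conditions=("production", "new_approach")):
    """Deterministically bucket a user so they always see the same variant."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return conditions[int(digest, 16) % len(conditions)]

print(assign_condition("user-1234"))  # stable across visits for this user
```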
  • 13:12 - 13:17
    A second strategy for neutralizing the environment difference between a production version
  • 13:17 - 13:26
    and your new approach is to make a version of the production thing in the same style as your new approach.
  • 13:26 - 13:31
    That also makes them equivalent in terms of their implementation fidelity.
  • 13:31 - 13:34
    A third strategy and one that’s used commonly in research,
  • 13:34 - 13:39
    is to scale things down so you’re looking at just a piece of the system at a particular point in time.
  • 13:39 - 13:43
    That way you don’t have to worry about implementing a whole big, giant thing.
  • 13:43 - 13:48
    You can just focus on one small piece and have that comparison be fair.
  • 13:48 - 13:53
    And the fourth strategy is that when expertise is relevant,
  • 13:53 - 13:56
    train people up — give them the practice that they need —
  • 13:56 - 14:01
    so that they can start at least hitting that asymptote in terms of performance
  • 14:01 - 14:05
    and you can get a better read than you would if they were newbies.
  • 14:05 - 14:12
    So now to close out this lecture, if somebody asks you the question “Is interface x better than interface y?”
  • 14:12 - 14:15
    you know that we’re off to a good start because we have a comparison.
  • 14:15 - 14:19
    However, you also know to be worried: What does “better” mean?
  • 14:19 - 14:26
    And often, in a complex system, you’re going to have several measures. That’s totally cool.
  • 14:26 - 14:31
    There’s a lot of value in being explicit though about what it is you mean by better —
  • 14:31 - 14:34
    What are you trying to accomplish? What are you trying to [im]prove?
  • 14:34 - 14:38
    And if anybody ever tells you that their interface is always better,
  • 14:38 - 14:44
    don’t believe them because nearly all of the time the answer is going to be “it depends.”
  • 14:44 - 14:48
    And the interesting question is “What does it depend on?”
  • 14:48 - 14:53
    Most interfaces are good for some things and not for others.
  • 14:53 - 14:58
    For example if you have a tablet computer where all of the screen is devoted to display,
  • 14:58 - 15:04
    that is going to be great for reading, for web browsing, for that kind of activity, looking at pictures.
  • 15:04 - 15:06
    Not so good if you want to type a novel.
  • 15:06 - 15:09
    So here, we’ve introduced controlled comparison
  • 15:09 - 15:14
    as a way of finding the smoking gun, as a way of inferring cause.
  • 15:14 - 15:17
    And often, when you have only two conditions,
  • 15:17 - 15:21
    we’re going to talk about that as being a minimal pairs design.
  • 15:21 - 15:25
    As a practicing designer, the reason to care about what’s causal
  • 15:25 - 15:30
    is that it gives you the material to make a better decision going forward.
  • 15:30 - 15:32
    A lot of studies violate this constraint.
  • 15:32 - 15:40
    And that gets dangerous because it prevents you from being able to make sound decisions.
  • 15:40 - 15:44
    I hope that the tools that we’ve talked about today and in the next several lectures
  • 15:44 - 15:49
    will help you become a wise skeptic like our friend in this XKCD comic.
  • 15:49 - 15:53
    I’ll see you next time.