In this lecture, we’re going to talk about trying out your interface with people, and doing so in a way that lets you improve your designs based on what you learn. One of the most common things people ask when running studies is: “Do you like my interface?” It’s a really natural thing to ask, because on some level it’s what we all want to know. But it’s problematic on a whole lot of levels. For one, it’s not very specific, and so sometimes people try to make it better by doing something like: “How much do you like my interface on a one-to-five scale?” Or: “‘This is a useful interface’ — agree or disagree on a one-to-five scale.” This adds a kind of patina of scientificness, but really it’s just the same thing — you’re asking somebody “Do you like my interface?” And people are nice, so they’re going to say “Sure, I like your interface.” This is the “please the experimenter” bias, and it can be especially strong when there are social or cultural or power differences between the experimenter and the people trying out your interface: for example, [inaudible] and colleagues showed this effect in India, where it was exacerbated when the experimenter was white.

Now, you should not take this to mean that you shouldn’t have your developers try out stuff with users — being the person who is both the developer and the person trying stuff out is incredibly valuable. One example I like a lot of this is Mike Krieger, one of the Instagram founders — [he] is also a former master’s student and TA of mine. When Mike left Stanford and joined Silicon Valley, every Friday afternoon he would bring people into his office and have them try out whatever they were working on that week. That way the team got regular feedback each week, and the people who were building those systems got to see real people trying them out. This can be nails-on-a-chalkboard painful, but you’ll also learn a ton.

So how do we get beyond “Do you like my interface?” The basic strategy that we’re going to talk about today is using specific measures and concrete questions to deliver meaningful results. One of the problems with “Do you like my interface?” is: compared to what? I think one of the reasons people say “Yeah, sure” is that there’s no comparison point, and so one thing that’s really important when you’re measuring the effectiveness of your interface, even informally, is to have some kind of comparison. It’s also important to think about: what’s the yardstick? What constitutes “good” in this arena? What are the measures that you’re going to use?

One way to start is by asking a base rate question, like “What fraction of people click on the first link in a search results page?” Or “What fraction of students come to class?” Once we start to measure correlations, things get even more interesting, like “Is there a relationship between the time of day a class is offered and how many students attend it?” Or “Is there a relationship between the order of a search result and the clickthrough rate?” For both attendance and clickthrough, there can be multiple explanations. For example, if fewer students attend early morning classes, is that a function of when students want to show up, or is that a function of when good professors want to teach?
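As an aside, here is a minimal Python sketch of how you might compute a base rate and a position-versus-clickthrough correlation from click logs. The column names and all the numbers are invented for illustration; they are not data from any real study.

```python
# Hypothetical click-log data: each row is one search-results impression,
# recording which result position (if any) was clicked.
import pandas as pd

logs = pd.DataFrame({
    "query_id":         [1, 2, 3, 4, 5, 6],
    "clicked_position": [1, 1, 2, 1, None, 3],   # None means no click
})

# Base rate: what fraction of result pages got a click on the first link?
first_link_rate = (logs["clicked_position"] == 1).mean()
print(f"Clicked the first link on {first_link_rate:.0%} of result pages")

# Correlation: is there a relationship between position and clickthrough rate?
clicks_by_position = pd.DataFrame({
    "position":          [1, 2, 3, 4, 5],
    "clickthrough_rate": [0.40, 0.15, 0.08, 0.05, 0.03],   # made-up rates
})
print(clicks_by_position["position"].corr(clicks_by_position["clickthrough_rate"]))
```

Of course, a strong negative correlation here still would not tell you why lower positions get fewer clicks, which is exactly the question taken up next.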
With the clickthrough example, there are also two kinds of explanations. If lower-placed links yield fewer clicks, is that because the links are of intrinsically poorer quality, or is it because people just click on the first link — [that] they don’t bother getting to the second one even if it might be better? To isolate the effect of placement and identify it as playing a causal role, you’d need to isolate it as a variable by, say, randomizing the order of search results.

As we start to talk about these experiments, let’s introduce a few terms that are going to help us. The multiple different conditions that we try are the thing we are manipulating — for example, the time of a class, or the location of a particular link on a search results page. These manipulations are independent variables because they are independent of what the user does; they are in the control of the experimenter. Then we measure what the user does, and those measures are called dependent variables because they depend on what the user does. Common measures in HCI include things like task completion time — how long does it take somebody to complete a task (for example, find something I want to buy, create a new account, order an item)? Accuracy — how many mistakes did people make, and were those fatal errors or things they were able to quickly recover from? Recall — how much does a person remember afterward, or after periods of non-use? And emotional response — how does the person feel about the tasks being completed? Were they confident? Were they stressed? Would the user recommend this system to a friend? So, your independent variables are the things that you manipulate, and your dependent variables are the things that you measure.

How reliable is your experiment? If you ran it again, would you see the same results? That’s the internal validity of an experiment. To have a precise experiment, you need to remove confounding factors. It’s also important to study enough people that the result is unlikely to have arisen by chance. You may be able to run the same study over and over and get the same result, but it may not matter in any real-world sense; external validity is the generalizability of your results. Does this apply only to eighteen-year-olds in a college classroom, or does it apply to everybody in the world?

Let’s bring this back to HCI and talk about one of the problems you’re likely to face as a designer. One of the things that we commonly want to ask is something like “Is my cool new approach better than the industry standard?” Because, after all, that’s why you’re making the new thing. One of the challenges with this, especially early on in the design process, is that your new thing may be very much in its prototype stages, while the industry standard is likely to benefit from years and years of refinement. At the same time, it may be stuck with years and years of cruft, which may or may not be intrinsic to its approach. So if you compare your cool new tool to some industry standard, there are two things varying here. One is the fidelity of the implementation, and the other, of course, is the approach. Consequently, when you get the results, you can’t know whether to attribute them to fidelity, to approach, or to some combination of the two. So we’re going to talk about ways of teasing apart those different causal factors.
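Before moving on, here is a small, hypothetical sketch of that vocabulary in code: the condition a participant sees is the independent variable, assigned by the experimenter, and task time and errors are dependent variables we measure. Everything here, including the `perform_task` stand-in, is invented for illustration.

```python
# Minimal study harness: manipulate an independent variable (which interface
# a participant sees) and measure dependent variables (task time, errors).
import random
import time

CONDITIONS = ["new_design", "industry_standard"]   # the independent variable


def assign_condition(participant_id: int) -> str:
    """Randomly assign a condition, deterministically per participant."""
    rng = random.Random(participant_id)
    return rng.choice(CONDITIONS)


def perform_task(condition: str) -> int:
    """Stand-in for the real task; returns the number of errors observed."""
    return 0


def run_session(participant_id: int) -> dict:
    condition = assign_condition(participant_id)
    start = time.monotonic()
    errors = perform_task(condition)
    elapsed = time.monotonic() - start
    # Dependent variables: what the participant did under that condition.
    return {"participant": participant_id,
            "condition": condition,
            "task_time_s": elapsed,
            "errors": errors}


print(run_session(42))
```

Random assignment is what lets you read the dependent measures as effects of the manipulation rather than of who happened to be in each group.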
Now, one thing I should say right off the bat is that there are times when it may be more or less relevant whether you have a good handle on what the causal factors are. For example, if you’re trying to decide between two different digital cameras, at the end of the day maybe all you care about is image quality or usability or some other factor, and exactly what makes that image quality better or worse may be less relevant to you. If you don’t have control over the variables, then identifying cause may not always be what you want. But when you are a designer, you do have control over the variables, and that’s when it is really important to ascertain cause.

Here’s an example of a study that came out right when the iPhone was released, done by the research firm User Centric, and I’m going to read from this news article here: “Research firm User Centric has released a study that tries to gauge how effective the iPhone’s unusual onscreen keyboard is. The goal is certainly a noble one, but I cannot say the survey’s approach results in data that makes much sense. User Centric brought in twenty owners of other phones. Half had QWERTY keyboards, half had ordinary numeric phones with keypads. None were familiar with the iPhone. The research involved having the test subjects enter six sample text messages with the phones that they already had, and six with the iPhone. The end result was that the iPhone newbies took twice as long to enter text with an iPhone as they did with their own phones, and made lots more typos.”

So let’s critique this study and talk about its benefits and drawbacks. Here’s the webpage directly from User Centric. What’s our manipulation in this study? The manipulation is the input style. How about the measure? It’s words per minute. And there’s absolutely value in being able to measure the initial usability of the iPhone, for several reasons: one is that if you’re introducing new technology, it’s beneficial if people are able to get up to speed pretty quickly. However, it’s important to realize that this comparison is intrinsically unfair, because the users of the previous cell phones were experts in that input modality and the people using the iPhone were novices in it. It seems quite likely that the iPhone users, once they become actual users, are going to get better over time; so if you’re not used to something the first time you try it, that may not be a deal killer, and this is certainly not an apples-to-apples comparison.

Another thing that we don’t get out of this article is: is this difference significant? We read that each person typed six messages in each of two conditions: their own device and the iPhone, or vice versa. Six messages each, and the people typing with the iPhone were half as fast as when they got to type on the mini-QWERTY device they were accustomed to. So while this may tell us something about the initial usability of the iPhone, in terms of long-term usability I don’t think we get much out of it.

If you weren’t satisfied by that initial data, you’re in good company: neither were the authors of that study. They went back a month later and ran another study, bringing 40 new people into the lab who were either iPhone users, QWERTY users, or nine-key users.
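Before turning to that follow-up study, here is a minimal sketch of what a significance check on the first study could look like. Because each participant typed on both their own phone and the iPhone, it is a within-subjects design, so a paired test is appropriate. The words-per-minute numbers below are invented for illustration; they are not User Centric’s data.

```python
# Paired comparison for a within-subjects study: each participant contributes
# a speed measurement in both conditions. All numbers are made up.
from scipy import stats

own_phone_wpm = [32, 28, 35, 30, 27, 33, 29, 31, 34, 26]   # hypothetical
iphone_wpm    = [15, 14, 18, 16, 12, 17, 15, 14, 19, 13]   # hypothetical

t_stat, p_value = stats.ttest_rel(own_phone_wpm, iphone_wpm)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the speed difference is unlikely to be chance alone,
# but as discussed above it says nothing about long-term usability.
```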
And now it’s more of an apples-to-apples comparison, in that they are testing people who are relative experts in each of the three modalities — after about a month on the iPhone you’re probably starting to asymptote in terms of your performance. Performance definitely keeps improving over time, even past a month, but a month starts to be reasonable. What they found was that iPhone users and QWERTY users were about the same in terms of speed, and that the numeric keypad users were much slower. So once again our manipulation is the input style, and we’re going to measure speed; this time we’re also going to measure error rate. What we see is that iPhone users and QWERTY users are essentially the same speed. However, the iPhone users make many more errors.

Now, one thing I should point out about this study is that each of the different devices was used by a different group of people. It was done this way so that each device was used by somebody who was comfortable with and experienced at that device, and so we removed the worry about newbies working on these devices. However, especially in 2007, there may have been significant differences in who those people were: maybe early adopters were drawn to the iPhone, business users were particularly drawn to the QWERTY devices, and people who had better things to do with their time than send e-mail on their telephone were using the nine-key devices. So while this comparison is better than the previous one, the potential for variation between the user populations is still problematic. If what you’d like to claim is something about the intrinsic properties of the device, your results may at least in part have to do with the users.

So, what are some strategies for fairer comparison? To brainstorm a couple of options: one thing you can do is insert your approach into the production setting. This may seem like a lot of work — sometimes it is, but in the age of the web it’s a lot easier than it used to be, and it’s possible even if you don’t have access to the server of the service you’re comparing against. You can use things like a proxy server or client-side scripting to put your own technique in and have an apples-to-apples comparison. A second strategy for neutralizing the environment difference between a production version and your new approach is to make a version of the production thing in the same style as your new approach; that also makes them equivalent in terms of implementation fidelity. A third strategy, one that’s used commonly in research, is to scale things down so you’re looking at just a piece of the system at a particular point in time. That way you don’t have to worry about implementing a whole big, giant thing; you can focus on one small piece and make that comparison fair. And the fourth strategy is that when expertise is relevant, train people up — give them the practice that they need — so that they can start hitting that asymptote in performance and you get a better read than you would from newbies.

So now, to close out this lecture: if somebody asks you the question “Is interface x better than interface y?”, you know that we’re off to a good start, because we have a comparison. However, you also know to be worried: what does “better” mean? And often, in a complex system, you’re going to have several measures. That’s totally cool.
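Returning for a moment to that second, between-subjects study: since each device was used by a different group of people, the analysis compares groups rather than pairs. Here is a minimal sketch using a one-way ANOVA across the three groups, with words-per-minute values invented for illustration (again, not User Centric’s data).

```python
# Between-subjects comparison: three separate groups of participants, one per
# device, so we compare group means. All values are made up.
from scipy import stats

iphone_wpm  = [30, 28, 33, 31, 27, 32, 29, 34, 30, 31]
qwerty_wpm  = [31, 29, 32, 30, 28, 33, 30, 32, 29, 31]
ninekey_wpm = [14, 12, 15, 13, 11, 14, 13, 15, 12, 13]

f_stat, p_value = stats.f_oneway(iphone_wpm, qwerty_wpm, ninekey_wpm)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# Even with experienced users in every group, the groups are different people,
# so population differences (not just the devices) can still drive the result.
```

Sketches like these only quantify the measures you chose; deciding which measures actually matter is the design question we close with.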
There’s a lot of value in being explicit about what it is you mean by better — what are you trying to accomplish? What are you trying to [im]prove? And if anybody ever tells you that their interface is always better, don’t believe them, because nearly all of the time the answer is going to be “it depends.” The interesting question is “What does it depend on?” Most interfaces are good for some things and not for others. For example, if you have a tablet computer where all of the screen is devoted to display, that’s going to be great for reading, for web browsing, for looking at pictures, for that kind of activity. Not so good if you want to type a novel.

So here we’ve introduced controlled comparison as a way of finding the smoking gun, as a way of inferring cause. And when you have only two conditions, we’re going to talk about that as a minimal pairs design. As a practicing designer, the reason to care about what’s causal is that it gives you the material to make a better decision going forward. A lot of studies violate this constraint, and that gets dangerous because it prevents you from being able to make sound decisions. I hope that the tools that we’ve talked about today and in the next several lectures will help you become a wise skeptic like our friend in this XKCD comic. I’ll see you next time.