1 00:00:00,000 --> 00:00:12,260 rc3 preroll music 2 00:00:12,260 --> 00:00:17,930 Herald: All right, so again, let's introduce the next talk, accessible inputs 3 00:00:17,930 --> 00:00:25,320 for readers, coders and hackers, the talk by David Williams-King about custom off-, 4 00:00:25,320 --> 00:00:30,230 well, not off the shelf, but custom accessibility solutions. He will give you 5 00:00:30,230 --> 00:00:35,420 some demonstrations and that includes his own custom made voice input and a blink 6 00:00:35,420 --> 00:00:38,110 system. Here is David Williams-King. 7 00:00:40,020 --> 00:00:46,440 David: Thank you for the introduction. Let's go ahead and get started. So, yeah, 8 00:00:46,440 --> 00:00:50,650 I'm talking about accessibility, particularly accessible input for readers, 9 00:00:50,650 --> 00:00:57,840 coders and hackers. So what do I mean by accessibility? I mean people that have 10 00:00:57,840 --> 00:01:02,780 physical or motor impairments. This could be due to repetitive strain injury, carpal 11 00:01:02,780 --> 00:01:08,030 tunnel, all kinds of medical conditions. If you have this type of thing, you 12 00:01:08,030 --> 00:01:11,860 probably can't use a normal computer keyboard, computer mouse or even a phone 13 00:01:11,860 --> 00:01:18,720 touch screen. However, technology does allow users to interact with these devices 14 00:01:18,720 --> 00:01:23,780 just using different forms of input. And it's really valuable to these people 15 00:01:23,780 --> 00:01:28,909 because, you know, being able to interact with the device provides some agency, they 16 00:01:28,909 --> 00:01:32,781 can do things on their own and it provides a means of communication with the 17 00:01:32,781 --> 00:01:38,439 outside world. So it's an important problem to look at. And it's what I care 18 00:01:38,439 --> 00:01:45,200 about a lot. Let's talk a bit about me for a moment. I'm a systems security person.
I 19 00:01:45,200 --> 00:01:49,920 did a PhD in cybersecurity at Columbia. If you're interested in low level software 20 00:01:49,920 --> 00:01:54,509 defenses, you can look that up. And I'm currently the CTO at a startup called 21 00:01:54,509 --> 00:02:03,360 Elpha Secure. I started developing medical issues around 2014. And as a result of 22 00:02:03,360 --> 00:02:07,770 that, in an ongoing fashion, I can only type a few thousand keystrokes per day. 23 00:02:07,770 --> 00:02:12,090 Roughly fifteen thousand is my maximum. That sounds like a lot, but imagine you're 24 00:02:12,090 --> 00:02:17,069 typing at a hundred words per minute. That's five hundred characters per minute, 25 00:02:17,069 --> 00:02:23,349 which means it takes you 30 minutes to hit fifteen thousand characters. So 26 00:02:23,349 --> 00:02:29,519 essentially I can work like the equivalent of a fast programmer for 27 00:02:29,519 --> 00:02:33,700 half an hour. And then after that I would be unable to use my hands for anything, 28 00:02:33,700 --> 00:02:38,420 including like preparing food for myself or opening, closing doors and so on. So I 29 00:02:38,420 --> 00:02:42,189 have to be very careful about my hand use and actually have a little program that 30 00:02:42,189 --> 00:02:46,690 you can see on the slide there that measures the keystrokes for me so I can 31 00:02:46,690 --> 00:02:51,809 tell when I'm going over. So what do I do? Well, I do a lot of pair programming, 32 00:02:51,809 --> 00:02:56,650 for sure. I log into the same machine as other people and we work together. I'm 33 00:02:56,650 --> 00:03:00,430 also a very heavy user of speech recognition and I gave a talk 34 00:03:00,430 --> 00:03:06,900 about voice coding with speech recognition at the Hope 11 conference. So you can go 35 00:03:06,900 --> 00:03:15,419 check that out if you're interested.
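Editor's sketch: the keystroke budget arithmetic from this part of the talk can be reproduced in a few lines of Python. This is only an illustration under stated assumptions: the five-characters-per-word convention is the one implied by "a hundred words per minute is five hundred characters per minute", the function name is hypothetical, and this is not the keystroke counter shown on the slide.

```python
# Assumption: one "word" counts as 5 characters, as implied by the talk
# (100 words per minute -> 500 characters per minute).
CHARS_PER_WORD = 5


def minutes_until_budget_spent(daily_budget: int, words_per_minute: float) -> float:
    """Return how many minutes of continuous typing use up a keystroke budget."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    chars_per_minute = words_per_minute * CHARS_PER_WORD
    return daily_budget / chars_per_minute


if __name__ == "__main__":
    # The talk's numbers: 15,000 keystrokes per day at 100 words per minute.
    print(minutes_until_budget_spent(15_000, 100))
```

With the talk's figures this gives half an hour of fast typing per day, which is the point being made.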
So when I talk about accessible input, I mean 36 00:03:15,419 --> 00:03:18,779 different ways that a human can provide input to a computer. So ergonomic 37 00:03:18,779 --> 00:03:23,019 keyboards are a simple one. Speech recognition, eye tracking or gaze tracking 38 00:03:23,019 --> 00:03:26,699 so you can see where you're looking or where you're pointing your head and 39 00:03:26,699 --> 00:03:32,229 maybe use that to replace a mouse, that's head gestures, I suppose. And there's 40 00:03:32,229 --> 00:03:38,650 always this distinction between bespoke, like custom input mechanisms and somewhat 41 00:03:38,650 --> 00:03:44,499 mainstream ones. So I'll give you some examples. You've probably heard of Stephen 42 00:03:44,499 --> 00:03:50,230 Hawking. He's a very famous professor, and he was actually a bit of an extreme case. 43 00:03:50,230 --> 00:03:56,142 He had, he was diagnosed with ALS when he was 21. So his his physical 44 00:03:56,142 --> 00:04:00,669 ability, abilities degraded over the years because he lived for many decades after 45 00:04:00,669 --> 00:04:05,059 that and he went through many communication mechanisms. Initially his 46 00:04:05,059 --> 00:04:08,309 speech changed so that it was only intelligible to his family and close 47 00:04:08,309 --> 00:04:14,239 friends, but he was still able to speak. And then after that he would work with the 48 00:04:14,239 --> 00:04:19,440 human interpreter and raise his eyebrows to pick various letters. And then and keep 49 00:04:19,440 --> 00:04:24,690 in mind, this is like the 60s or 70s, right? So computers were not really where 50 00:04:24,690 --> 00:04:29,840 they are today. Later he would operate a switch with one hand, just like on off on 51 00:04:29,840 --> 00:04:35,009 off, kind of morse code and select from a bank of words. And that was around 15 52 00:04:35,009 --> 00:04:41,080 words per minute. 
Eventually, he was unable to move his hand, so a team of 53 00:04:41,080 --> 00:04:44,490 engineers from Intel worked with him and they figured out, they were trying to do 54 00:04:44,490 --> 00:04:48,229 like brain scans and all kinds of stuff. But again, this was like in the eighties, 55 00:04:48,229 --> 00:04:54,599 so there was not too much they could do. So they basically just created some 56 00:04:54,599 --> 00:04:59,120 custom software to detect muscle movements in his cheek. And he used that with 57 00:04:59,120 --> 00:05:03,550 predictive words, the same way that a smartphone keyboard will 58 00:05:03,550 --> 00:05:07,180 predict which word you want to say next. Stephen Hawking used something similar to 59 00:05:07,180 --> 00:05:12,689 that, except instead of swiping on a phone, he was moving his cheek muscles, so 60 00:05:12,689 --> 00:05:17,810 that's obviously a sequence of like highly customized input mechanisms, 61 00:05:17,810 --> 00:05:23,979 very, very specialized for that person. I also want to talk about someone 62 00:05:23,979 --> 00:05:29,592 else named Professor Sang-Mook Lee, whom I've met. That was me when I had more of a 63 00:05:29,592 --> 00:05:36,180 beard than I do now. He's a professor at Seoul National University in South 64 00:05:36,180 --> 00:05:42,969 Korea. And he's sometimes called like the Korean Stephen Hawking, because he's a big 65 00:05:42,969 --> 00:05:47,990 advocate for people with disabilities. Anyway, what he uses is, you can 66 00:05:47,990 --> 00:05:52,360 see a little orange device near his mouth there. It's called a sip and puff mouse 67 00:05:52,360 --> 00:05:56,930 so he can blow into it and suck air through it and also move it around. And 68 00:05:56,930 --> 00:06:02,280 that acts as a mouse cursor on the Android device in front of him. It will move the 69 00:06:02,280 --> 00:06:08,229 cursor around and click when he blows air and so on.
So that combined 70 00:06:08,229 --> 00:06:13,909 with speech recognition, lets him use mainstream Android hardware. He still has 71 00:06:13,909 --> 00:06:21,249 access to, you know, email apps and like Web Browsers and like Maps and everything 72 00:06:21,249 --> 00:06:26,159 that comes on a normal Android device. So he's way more capable than Stephen 73 00:06:26,159 --> 00:06:29,949 Hawking, as who could, Stephen Hawking could communicate, but just to a person at 74 00:06:29,949 --> 00:06:35,830 a very slow rate. Right. Part of it's due to the nature of his injury. But it's also 75 00:06:35,830 --> 00:06:43,939 a testament to how far the technology has improved. So let's talk a little bit about 76 00:06:43,939 --> 00:06:49,480 what makes good accessibility. I think performance is very important, right? You 77 00:06:49,480 --> 00:06:53,889 want high accuracy. You don't want typos, low latency. I don't want to speak and 78 00:06:53,889 --> 00:06:58,389 then five seconds later have words appear. It's too long, especially if I have to 79 00:06:58,389 --> 00:07:02,509 make corrections. Right. And you want high throughput, which we already talked about. 80 00:07:02,509 --> 00:07:06,240 Oh, I forgot to mention Stephen Hawking had like 15 words per minute. A normal 81 00:07:06,240 --> 00:07:12,349 person speaking is 150. So that's a big difference. (laughs) The higher 82 00:07:12,349 --> 00:07:16,479 throughput you can get, the better. And for input accessibility, I think and this 83 00:07:16,479 --> 00:07:20,879 is not scientific. This is just what I've learned from using myself and observing 84 00:07:20,879 --> 00:07:25,330 many of these systems. I think it's important to get completeness, consistency 85 00:07:25,330 --> 00:07:31,479 and customization. For completeness I mean, can I do any action? 
So 86 00:07:31,479 --> 00:07:40,590 Professor Sang-Mook Lee, his orange mouth input device, the sip and puff, is 87 00:07:40,590 --> 00:07:44,199 quite powerful, but it doesn't let him do every action. For example, for some reason 88 00:07:44,199 --> 00:07:48,379 when he gets an incoming call, the input doesn't work. So he has to call over 89 00:07:48,379 --> 00:07:52,430 a person physically to tap the accept call button or the reject call button, which is 90 00:07:52,430 --> 00:07:55,729 really annoying. Right. If you don't have completeness, you can't be fully 91 00:07:55,729 --> 00:08:01,729 independent. Consistency, very important as well. The same way we develop motor 92 00:08:01,729 --> 00:08:07,580 memory, or muscle memory, for a keyboard, you develop memory for any types of 93 00:08:07,580 --> 00:08:11,690 patterns that you do. But if the thing you say or the thing you do keeps changing in 94 00:08:11,690 --> 00:08:18,220 order to do the same action, that's not good. And finally, customization. So the 95 00:08:18,220 --> 00:08:22,809 learning curve for beginners is important for any accessibility device, but 96 00:08:22,809 --> 00:08:27,150 designing for expert use is almost more important because anyone who uses an 97 00:08:27,150 --> 00:08:31,229 accessibility interface becomes an expert at it. The example I like to give is 98 00:08:31,229 --> 00:08:35,440 screen readers, like a blind person using a screen reader on a phone. They will crank 99 00:08:35,440 --> 00:08:41,880 up the speed at which the speech is being produced. And I actually met someone who 100 00:08:41,880 --> 00:08:46,321 made his speech 16 times faster than normal human speech. I could not 101 00:08:46,341 --> 00:08:51,020 understand it at all, it sounded like brbrbrbr, but he could understand it perfectly. And that's just 102 00:08:51,020 --> 00:08:56,190 because he used it so much that he's become an expert at its use.
Let's analyze 103 00:08:56,190 --> 00:09:01,040 ergonomic keyboards just for a moment, because it's fun. You know, they are kind 104 00:09:01,040 --> 00:09:04,260 of like a normal keyboard. You'll have a slow pace when you're 105 00:09:04,260 --> 00:09:07,630 starting to learn them. But once you're good at it, you have very good accuracy, 106 00:09:07,630 --> 00:09:11,709 like instantaneous low latency. Right. You press the key, the computer receives it 107 00:09:11,709 --> 00:09:17,510 immediately and very high throughput. As high as you are on a regular keyboard. 108 00:09:17,510 --> 00:09:20,329 So they're actually fantastic accessibility devices, right. They're 109 00:09:20,329 --> 00:09:23,950 completely compatible with original keyboards. And if all you need is an 110 00:09:23,950 --> 00:09:28,600 ergonomic keyboard, then you're in luck because it's a very good accessibility 111 00:09:28,600 --> 00:09:34,480 device. I'm going to talk about two things, computers, but also Android 112 00:09:34,480 --> 00:09:39,750 devices, so let's start with Android devices. Yes, the built in voice 113 00:09:39,750 --> 00:09:43,340 recognition in Android is really incredible. So even though the microphones 114 00:09:43,340 --> 00:09:47,000 on the devices aren't great, Google has just collected so much data from so many 115 00:09:47,000 --> 00:09:51,590 different sources that they've built like better than human accuracy for their 116 00:09:51,590 --> 00:09:56,570 voice recognition. The voice accessibility interface is kind of so-so, we'll talk 117 00:09:56,570 --> 00:09:59,649 about that in a bit. That's the interface where you can control the Android device 118 00:09:59,649 --> 00:10:04,230 entirely by voice. For other input mechanisms, you could use like a sip and 119 00:10:04,230 --> 00:10:09,010 puff device or you could use physical styluses.
That's something that I do a 120 00:10:09,010 --> 00:10:13,320 lot, actually, because for me, my fingers get sore. And if I can hold a stylus in my 121 00:10:13,320 --> 00:10:19,220 hand and kind of not use my fingers, then that's very effective. And the Elecom 122 00:10:19,220 --> 00:10:23,750 styluses from a Japanese company are the lightest I've found and they don't require 123 00:10:23,750 --> 00:10:30,131 a lot of force. So the ones at the top there are like 12 grams and the 124 00:10:30,131 --> 00:10:34,160 one on the bottom is 4.7 grams. And you need almost no force to use them. So very 125 00:10:34,160 --> 00:10:38,040 nice. On the left there you can see the Android speech recognition is built into 126 00:10:38,040 --> 00:10:41,860 the keyboard now. Right. You can just press that and start speaking. It 127 00:10:41,860 --> 00:10:46,160 supports different languages, and it's very accurate, it's very nice. And 128 00:10:46,160 --> 00:10:51,470 actually, when I was working at Google for a bit, I talked to the speech recognition 129 00:10:51,470 --> 00:10:54,470 team and was like: Why are you doing on-server speech recognition? You should do 130 00:10:54,470 --> 00:10:58,029 it on the devices. But of course, Android devices, they're all very different 131 00:10:58,029 --> 00:11:02,529 and many of them are not very powerful. So they were having trouble getting 132 00:11:02,529 --> 00:11:06,450 satisfactory speech recognition on the device. So for a long time, there was some 133 00:11:06,450 --> 00:11:10,630 server latency, server lag: you do speech recognition and you wait a bit. And 134 00:11:10,630 --> 00:11:14,190 then sometime this year, I just was using speech recognition and it became so much 135 00:11:14,190 --> 00:11:18,360 faster. I was extremely excited and I looked into it and yeah, they just 136 00:11:18,360 --> 00:11:22,000 switched on my device.
At least they switched on the On device speech recognition 137 00:11:22,000 --> 00:11:25,710 model. And so now it's incredibly fast and also incredibly accurate. I'm a huge fan 138 00:11:25,710 --> 00:11:30,949 of it. On the right hand side. We can actually see the voice access interface. 139 00:11:30,949 --> 00:11:34,899 So this is meant to allow you to use a phone entirely by voice. Again, while I 140 00:11:34,899 --> 00:11:37,940 was at Google, I tried the the beta version before it was publicly released 141 00:11:37,940 --> 00:11:43,529 and I was like, this is pretty bad, mostly because it did, it lacked completeness. 142 00:11:43,529 --> 00:11:47,209 There would be things on the screen that would not be selected. So here we see show 143 00:11:47,209 --> 00:11:52,510 labels. And then I can I can say like four, five, six, whatever, to tap on that 144 00:11:52,510 --> 00:11:57,070 thing. But as you can see at the bottom, there was like a Twitter Web app link and 145 00:11:57,070 --> 00:12:00,140 there's no number on it. So if I want to click on that, I'm out of luck. And this 146 00:12:00,140 --> 00:12:06,500 is actually a problem in the design of the accessibility interface that it only, it 147 00:12:06,500 --> 00:12:11,519 doesn't expose the full DOM. It exposes only a subset of it. And so an 148 00:12:11,519 --> 00:12:18,959 accessibility mechanism can't ever see those other things. And furthermore, the 149 00:12:18,959 --> 00:12:22,279 way the Google speech recognition works, they have to reestablish a new connection 150 00:12:22,279 --> 00:12:26,480 every 30 seconds. And if you're in the middle of speaking, it will just throw 151 00:12:26,480 --> 00:12:29,959 away whatever you were saying because it just decided it had to reconnect, which is 152 00:12:29,959 --> 00:12:34,610 really unfortunate. They later released that publicly and then sometime this year 153 00:12:34,610 --> 00:12:39,860 they did the update, which is pretty nice. 
It now has like a mouse grid, which lets, 154 00:12:39,860 --> 00:12:44,050 which solves a lot of the completeness problems. Like you can, you can use a grid 155 00:12:44,050 --> 00:12:50,040 to narrow down somewhere on the screen and then tap there. But the server issues and 156 00:12:50,040 --> 00:12:54,870 the expert use is still not good, like, if I want to turn it, if I want to do 157 00:12:54,870 --> 00:12:59,540 something with the mouse grid, I have to say "mouse grid on. 6. 5. mouse grid off". 158 00:12:59,540 --> 00:13:02,899 And I can't combine those together. So there's a lot of latency and it's not 159 00:13:02,899 --> 00:13:09,611 really that fun to use, but better than nothing? Absolutely! I just want to really 160 00:13:09,611 --> 00:13:13,149 briefly show you as well that this same feature of like being able to select links 161 00:13:13,149 --> 00:13:17,209 on a screen is available on desktops. This is a plug in for Chrome called Vimium. And 162 00:13:17,209 --> 00:13:22,670 it's very powerful because you can then combine this with keyboards or other input 163 00:13:22,670 --> 00:13:26,650 mechanisms. And this one is complete. It uses the entire DOM and anything you can 164 00:13:26,650 --> 00:13:31,130 click on will be highlighted. So very nice. I just want to give a quick example 165 00:13:31,130 --> 00:13:35,380 of me using some of these systems. So I've been trying to learn Japanese and there's 166 00:13:35,380 --> 00:13:39,130 a couple of highly regarded websites for this, but they're not consistent. When I 167 00:13:39,130 --> 00:13:43,829 use the browser show labels like, you know, the thing to press next page or 168 00:13:43,829 --> 00:13:47,970 something like that or like, you know, I give up or whatever it is, it keeps 169 00:13:47,970 --> 00:13:51,980 changing. So the letters that are being used keep changing. And that's because of 170 00:13:51,980 --> 00:13:56,500 the dynamic way that they're generating the HTML. 
So not really very useful. What 171 00:13:56,500 --> 00:14:01,160 I do instead is I use a program called Anki and that has very simple shortcuts in 172 00:14:01,160 --> 00:14:06,410 its desktop app. One, two, three, four. So it's nice to use and consistent and it 173 00:14:06,410 --> 00:14:11,530 syncs with an Android app and then I can use my stylus on the Android device. So it 174 00:14:11,530 --> 00:14:16,450 works pretty well. But even so, as you can see from the chart in the bottom there, 175 00:14:16,450 --> 00:14:20,220 there are many days when I can't use this, even though I would like to, because I've 176 00:14:20,220 --> 00:14:25,770 overused my hands or overused my voice. When I'm using voice recognition all day, 177 00:14:25,770 --> 00:14:28,649 every day, I do tend to lose my voice. And as you can see from the graph, sometimes I 178 00:14:28,649 --> 00:14:33,700 lose it for a week or two at a time. So same thing with any accessibility 179 00:14:33,700 --> 00:14:38,410 interface, you know, you've got to use many different techniques and it's 180 00:14:38,410 --> 00:14:44,259 never perfect, it's just the best you can do at that moment. Something else I 181 00:14:44,259 --> 00:14:49,770 like to do is read books. I read a lot of books and I love e-book readers, the 182 00:14:49,770 --> 00:14:54,139 dedicated e-ink displays. You can read them in sunlight, they last forever, battery 183 00:14:54,139 --> 00:14:59,060 wise. Unfortunately, it's hard to add other input mechanisms to them. They don't have 184 00:14:59,060 --> 00:15:03,569 microphones or other sensors and you can't really install custom software on them. 185 00:15:03,569 --> 00:15:07,250 But for Android based devices, and there are also like e-book reading apps for 186 00:15:07,250 --> 00:15:10,399 Android devices, they have everything: you can install custom software and they have 187 00:15:10,399 --> 00:15:15,569 microphones and many other sensors.
So I made two apps that allow you to read 188 00:15:15,569 --> 00:15:21,319 e-books with an e-book reader. The first one is Voice Next Page. It's based on one 189 00:15:21,319 --> 00:15:25,759 of my speech recognition engines called Silvius, and it does do server based 190 00:15:25,759 --> 00:15:29,290 recognition. So you have to capture all the audio, use 300 kilobits a second to 191 00:15:29,290 --> 00:15:35,560 send it to the server and recognize things like next page, previous page. However, it 192 00:15:35,560 --> 00:15:40,329 doesn't cut out every 30 seconds. It keeps going. So that's one win for it I 193 00:15:40,329 --> 00:15:46,470 guess. And it is published in the Play Store. Huge thanks to Sarah Leventhal, who 194 00:15:46,470 --> 00:15:49,670 did a lot of the implementation. Very complicated to make an accessibility app 195 00:15:49,670 --> 00:15:55,819 on Android. But we persevered and it works quite nicely. So I'm going to actually 196 00:15:55,819 --> 00:16:03,149 show you an example of Voice Next Page. This over here is my phone on the left 197 00:16:03,149 --> 00:16:08,649 hand side just captured so that you guys can see it. So here's the Voice Next Page. 198 00:16:08,649 --> 00:16:13,820 And basically the connection is green, the server is up and running and 199 00:16:13,820 --> 00:16:19,700 so on. I just press start and then I'll switch to an Android reading app and say, 200 00:16:19,700 --> 00:16:23,120 next page, previous page. I won't speak otherwise because it will capture 201 00:16:23,120 --> 00:16:26,400 everything I'm saying.
202 00:16:32,910 --> 00:16:34,880 Next Page 203 00:16:36,090 --> 00:16:37,640 Next Page 204 00:16:38,310 --> 00:16:40,100 Previous Page 205 00:16:41,520 --> 00:16:42,860 Center 206 00:16:43,680 --> 00:16:45,030 Center 207 00:16:46,620 --> 00:16:48,120 Foreground 208 00:16:49,155 --> 00:16:50,845 Stop listening 209 00:16:54,960 --> 00:16:58,680 So that's a demo of The Voice Next Page, and it's 210 00:16:58,680 --> 00:17:03,259 extremely helpful. I built it a couple of years ago along with Sarah, and I use it a 211 00:17:03,259 --> 00:17:07,800 lot. So, yeah, you can go ahead and download it if you guys wanna try it out. 212 00:17:07,800 --> 00:17:12,530 And the other one is called Blink Next Page. So the idea for this, I got this 213 00:17:12,530 --> 00:17:18,260 idea from a research paper this year that was studying eyelid gestures. I didn't use 214 00:17:18,260 --> 00:17:24,210 any of their code, but it's a great idea. So the way this works is you detect blinks 215 00:17:24,210 --> 00:17:28,590 by using the Android camera and then you can trigger an action like turning pages 216 00:17:28,590 --> 00:17:34,330 in an e-book reader. This actually doesn't need any networking. It's able to use the 217 00:17:34,330 --> 00:17:38,960 on device face recognition models from Google, and it is still under development. 218 00:17:38,960 --> 00:17:44,630 So it's not on the play store yet, but it is working. And, you know, please contact 219 00:17:44,630 --> 00:17:54,430 me if you want to try it. So just give me one moment to set that demo up here. So 220 00:17:54,430 --> 00:18:00,590 I'm going to use... The main problem with this current implementation is that it 221 00:18:00,590 --> 00:18:07,030 uses two devices. So that was easier to implement. And I use two devices anyway. 222 00:18:07,030 --> 00:18:14,040 But obviously I want a one device version if I'm actually going to use it for 223 00:18:14,040 --> 00:18:18,281 anything. So here's how this works. 
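Editor's sketch: the gesture mapping used in this demo (both eyes held closed for about half a second turns the page, the left eye alone goes back, the right eye held moves forward quickly) could be expressed roughly as below. This is an illustration, not the app's code. The names are hypothetical, the per-eye "closed for how long" values are assumed to come from some face-detection step that is not shown, and applying the same half-second threshold to single-eye gestures is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the half-second hold mentioned in the talk is used for every gesture.
HOLD_SECONDS = 0.5


@dataclass
class EyeState:
    """How long each eye has been continuously closed, in seconds (0.0 = open)."""
    left_closed_for: float
    right_closed_for: float


def classify_gesture(state: EyeState, hold: float = HOLD_SECONDS) -> Optional[str]:
    """Map the current eye state to a page-turning action, or None."""
    left_held = state.left_closed_for >= hold
    right_held = state.right_closed_for >= hold
    if left_held and right_held:
        return "next_page"
    # Single-eye gestures only count if the other eye is fully open, so a
    # slightly uneven two-eye blink is not mistaken for a wink.
    if left_held and state.right_closed_for == 0.0:
        return "previous_page"
    if right_held and state.left_closed_for == 0.0:
        return "fast_forward"
    return None
```

Requiring the other eye to be open is one simple way to reduce false positives; a real detector would also need to cope with lighting and noisy per-frame estimates.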
This device I point at me, at my eyes, the 224 00:18:18,281 --> 00:18:24,010 other device I put wherever it's convenient to read, oops, sorry, and if I blink 225 00:18:24,010 --> 00:18:28,780 my eyes, the phone will buzz once it detects that I blinked my eyes and it will 226 00:18:28,780 --> 00:18:35,410 turn the page automatically on the other Android device. Now I have to blink both 227 00:18:35,410 --> 00:18:41,500 my eyes for half a second. If I want to go backwards, I can blink just my left eye. 228 00:18:41,500 --> 00:18:49,510 And if I want to go forwards like quickly, I can blink my right eye and hold it. (background buzzing) 229 00:18:49,510 --> 00:18:54,640 Anyway, it does have some false positives. That's why like you can go backwards in 230 00:18:54,640 --> 00:18:59,790 case it detects that you've accidentally flipped the page. And lighting is also 231 00:18:59,790 --> 00:19:03,560 very important. Like if I have a light behind me, then this is not going to be 232 00:19:03,560 --> 00:19:07,760 able to identify whether my eyes are open or closed properly. So it has some 233 00:19:07,760 --> 00:19:19,150 limitations, but very simple to use. So I'm a big fan. OK, so that's enough about 234 00:19:19,150 --> 00:19:23,760 Android devices, let's talk very briefly about desktop computers. So if you're 235 00:19:23,760 --> 00:19:27,450 going to use a desktop computer, of course, try using that show labels plugin 236 00:19:27,450 --> 00:19:33,210 in a browser. For native apps you can try Dragon NaturallySpeaking, which is fine if 237 00:19:33,210 --> 00:19:37,190 you're just like using basic things. But if you're trying to do complicated things, 238 00:19:37,190 --> 00:19:40,830 you should definitely use a voice coding system. You could also consider using eye 239 00:19:40,830 --> 00:19:45,810 tracking to replace a mouse. Personally, I don't use that.
I find it hurts my eyes, 240 00:19:45,810 --> 00:19:50,400 but I do use a trackball with very little force and a Wacom tablet. Some people will 241 00:19:50,400 --> 00:19:55,640 even scroll up and down by humming, for example, but I don't have that setup. 242 00:19:55,640 --> 00:20:00,600 There's a bunch of nice talks out there on voice coding. The top left is Tavis Rudd's 243 00:20:00,600 --> 00:20:06,110 talk from many years ago that got many of us interested. Emily Shea gave a talk 244 00:20:06,110 --> 00:20:10,971 there about best practices for voice coding. And then I gave a talk a couple of 245 00:20:10,971 --> 00:20:16,470 years ago at the Hope 11 conference, which you can also check out. It's mostly out of 246 00:20:16,470 --> 00:20:21,560 date by now, but it's still interesting. So there are a lot of voice coding 247 00:20:21,560 --> 00:20:27,660 systems, the sort of grandfather of them all is Dragonfly. It's become a grammar 248 00:20:27,660 --> 00:20:35,370 standard. With Caster, if you're willing to memorize lots of unusual words, you can 249 00:20:35,370 --> 00:20:40,950 become much better, much faster than I currently am at voice coding. aenea is how 250 00:20:40,950 --> 00:20:45,710 you originally used Dragon to work on a Linux machine, for example, because Dragon 251 00:20:45,710 --> 00:20:52,620 only runs on Windows. Talon is a closed source program, but it's very 252 00:20:52,620 --> 00:20:56,790 powerful. Has a big user base, especially for macOS. There are ports now. And Talon 253 00:20:56,790 --> 00:21:04,640 used to use Dragon, but it's now using a speech system from Facebook. Silvius is 254 00:21:04,640 --> 00:21:09,640 the system that I created, the models are not very accurate, but it's a nice 255 00:21:09,640 --> 00:21:12,910 architecture where there's client-server, so it makes it easy to build things like 256 00:21:12,910 --> 00:21:18,130 Voice Next Page. So Voice Next Page was using Silvius.
And then the most 257 00:21:18,130 --> 00:21:22,390 recent one I think on this list is kaldi-active-grammar, which is extremely 258 00:21:22,390 --> 00:21:26,420 powerful and extremely customizable. And it's also open source. It works on all 259 00:21:26,420 --> 00:21:29,590 platforms. So I really highly recommend that. So let's talk a bit more about 260 00:21:29,590 --> 00:21:35,300 kaldi-active-grammar. But first, for voice coding, I've already mentioned, you have 261 00:21:35,300 --> 00:21:38,890 to be careful how you use your voice, right? Breathe from your belly. Don't 262 00:21:38,890 --> 00:21:42,180 tighten your muscles and breathe from your chest. Try to speak normally. And I'm not 263 00:21:42,180 --> 00:21:45,230 particularly good at this. Like you'll hear, when I'm speaking commands, that my 264 00:21:45,230 --> 00:21:50,550 inflection changes. So I do tend to overuse my voice, but you just have to be 265 00:21:50,550 --> 00:21:53,780 conscious of that. The microphone hardware does matter. I do recommend like a Blue 266 00:21:53,780 --> 00:21:59,801 Yeti on a microphone arm that you can pull and put close to your face like this. I 267 00:21:59,801 --> 00:22:04,340 will use this one for my speaking demo. Yeah. And the other thing is your 268 00:22:04,340 --> 00:22:08,190 grammar is fully customizable. So if you keep saying a word and the system doesn't 269 00:22:08,190 --> 00:22:14,190 recognize it, just change it to another word. And it's complete in the sense you 270 00:22:14,190 --> 00:22:17,680 can type any key on the keyboard. And the most important thing for expert use or 271 00:22:17,680 --> 00:22:22,120 customizability is that you can do chaining. So with the voice coding system, 272 00:22:22,120 --> 00:22:27,040 you can say multiple commands at once. And it's a huge time saving, 273 00:22:27,040 --> 00:22:32,140 you'll see what I mean when I give a quick demo.
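Editor's sketch: to make chaining concrete, here is a toy Python decoder that turns one spoken utterance made of several command words into a sequence of keystrokes. The vocabulary table is a small hypothetical subset chosen for illustration (the words themselves appear in the talk's demo phrase "Charlie delta space slash tango mike papa enter"), and the function name is made up. Real systems such as Dragonfly or kaldi-active-grammar define their grammars differently; this does not use or imitate their APIs.

```python
# Hypothetical mini-vocabulary: each spoken word maps to one keystroke.
KEYWORDS = {
    "charlie": "c",
    "delta": "d",
    "tango": "t",
    "mike": "m",
    "papa": "p",
    "space": " ",
    "slash": "/",
    "enter": "\n",
}


def decode_chain(utterance: str) -> str:
    """Translate a chained utterance into the text it would type."""
    keys = []
    for word in utterance.lower().split():
        if word not in KEYWORDS:
            raise KeyError(f"unknown command word: {word!r}")
        keys.append(KEYWORDS[word])
    return "".join(keys)


if __name__ == "__main__":
    typed = decode_chain("Charlie delta space slash tango mike papa enter")
    print(repr(typed))
```

Eight commands go out in one breath instead of eight separate recognize-and-wait round trips, which is why chaining matters so much for expert use.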
When I do voice coding, I'm a very 274 00:22:32,140 --> 00:22:39,150 heavy vim and tmux user. You know, I've worked with many people 275 00:22:39,150 --> 00:22:41,870 before, so I have some cheat sheet information there. So if you're 276 00:22:41,870 --> 00:22:45,130 interested, you can go check that out. But yeah, let's just do a quick demo of voice 277 00:22:45,130 --> 00:22:54,350 coding here. "Turn this mic on". "Desk left two". "Control delta", "open new terminal". 278 00:22:54,350 --> 00:22:59,930 "Charlie delta space slash tango mike papa enter". "Command vim". "Hotel hotel point 279 00:22:59,930 --> 00:23:08,720 charlie papa papa, enter". "India, hash word include space langel", "india oscar word 280 00:23:08,720 --> 00:23:16,030 stream rangel, enter, enter", "india noi tango space word mean", "no mike arch india 281 00:23:16,030 --> 00:23:23,750 noi space len ren space lace enter enter race up tab word print fox scratch nope code 282 00:23:23,750 --> 00:23:31,080 standard charlie oscar uniform tango space langel langel space quote. Sentence hello, 283 00:23:31,080 --> 00:23:40,250 voice coding bang, scratch six delta india noi golf, bang, backslash, noi quote 284 00:23:40,250 --> 00:23:46,340 semicolon act sky fox mike romeo noi oscar word return space number zero semicolon 285 00:23:46,340 --> 00:23:53,450 act vim save and quit. Golf plus plus space hotel hotel tab minus oscar space 286 00:23:53,450 --> 00:24:03,840 hotel hotel enter. Point slash hotel hotel enter. Desk right." So that's just a quick 287 00:24:03,840 --> 00:24:09,010 example of voice coding. You can use it to write any programming language, you can use 288 00:24:09,010 --> 00:24:13,881 it to control anything on your desktop. It's very powerful. It has a bit of a 289 00:24:13,881 --> 00:24:18,990 learning curve, but it's very powerful. So the creator of kaldi-active-grammar is 290 00:24:18,990 --> 00:24:26,050 also named David. I'm named David, but just a coincidence.
And he says of kaldi- 291 00:24:26,050 --> 00:24:31,260 active-grammar, "I haven't typed with the keyboard in many years and kaldi- 292 00:24:31,260 --> 00:24:35,640 active-grammar is bootstrapped in that I have been developing it entirely using the 293 00:24:35,640 --> 00:24:42,490 previous versions of it." So, David has a medical condition that means he has very 294 00:24:42,490 --> 00:24:48,270 low dexterity, so it's hard for him to use a keyboard. And yet he basically got 295 00:24:48,270 --> 00:24:53,000 kaldi-active-grammar working through the skin of his teeth or something and then 296 00:24:53,000 --> 00:24:58,710 continues to develop it using it. And yeah, I'm a huge fan of the project. I 297 00:24:58,710 --> 00:25:02,640 haven't contributed much, but I did give some of the hardware resources like GPU 298 00:25:02,640 --> 00:25:08,100 and CPU compute resources to allow training to happen. But I would also like 299 00:25:08,100 --> 00:25:12,970 to show you a video of David using kaldi-active-grammar, just so you can see it as 300 00:25:12,970 --> 00:25:20,780 well. So, the other thing about David is that he has a speech impediment or a 301 00:25:20,780 --> 00:25:25,000 speech, I don't know, an accent or whatever. So it's difficult for a 302 00:25:25,000 --> 00:25:28,060 normal speech recognition system to understand him. And you might have trouble 303 00:25:28,060 --> 00:25:31,050 understanding him here. But you can see in the lower right what the speech system 304 00:25:31,050 --> 00:25:37,390 understands him to be saying. Oh, I realized that I do need to switch 305 00:25:37,390 --> 00:25:41,502 something in OBS, so that you guys can hear it. Sorry. There you go. 306 00:25:41,502 --> 00:26:03,430 (Other) David using kaldi-active-grammar system (not understandable) 307 00:26:03,430 --> 00:26:05,900 Here, you get the idea and hopefully, you 308 00:26:05,900 --> 00:26:10,530 guys were able to hear that. 
If not, you can also find this on the website that I'm 309 00:26:10,530 --> 00:26:18,350 going to show you at the end. One other thing I want to show you about this is 310 00:26:18,350 --> 00:26:23,010 that David has actually set up this humming to scroll, which I think is pretty cool. Of 311 00:26:23,010 --> 00:26:28,260 course, I've gone and turned off the OBS there. But he's just doing hmmm and it's 312 00:26:28,260 --> 00:26:33,240 understanding that and scrolling down. So, it's something that I'm able to do with my 313 00:26:33,240 --> 00:26:41,730 trackball, but he's using his voice for it, so pretty cool. So I'm almost done here. 314 00:26:41,730 --> 00:26:46,550 In summary, good input accessibility means you need completeness, consistency and 315 00:26:46,550 --> 00:26:49,591 customization. You need to be able to do any action that you could do with the 316 00:26:49,591 --> 00:26:55,110 other input mechanisms. And doing the same input should have the same action. And 317 00:26:55,110 --> 00:27:00,210 remember, your users will become experts, so the system needs to be designed for 318 00:27:00,210 --> 00:27:05,640 that. For e-book reading: Yes, I'm trying to allow anyone to read, even if they're 319 00:27:05,640 --> 00:27:10,860 experiencing some severe physical or motor impairment, because I think that gives you 320 00:27:10,860 --> 00:27:15,031 a lot of power to be able to turn the pages and read your favorite books. And 321 00:27:15,031 --> 00:27:19,270 for speech recognition, yeah, Android speech recognition is very good. Silvius's 322 00:27:19,270 --> 00:27:23,490 accuracy is not so good, but it's easy to use quickly for experimentation and to 323 00:27:23,490 --> 00:27:28,150 make other types of things like Voice Next Page. And please do check out kaldi- 324 00:27:28,150 --> 00:27:33,850 active-grammar if you have some serious need for voice recognition. 
Lastly, I put 325 00:27:33,850 --> 00:27:39,050 all of this onto a website, voxhub.io, so you can see Voice Next Page, Blink Next 326 00:27:39,050 --> 00:27:42,100 Page, kaldi-active-grammar and so on, with instructions for how to use it and how to 327 00:27:42,100 --> 00:27:47,130 set it up. So please do check that out. And tons of acknowledgments, lots of 328 00:27:47,130 --> 00:27:50,030 people that have helped me along the way, but I want to especially call out 329 00:27:50,030 --> 00:27:53,700 Professor Sang-Mook Lee, who actually invited me to Korea a couple of times to 330 00:27:53,700 --> 00:27:58,140 give talks - a big inspiration. And of course, David Zurow, who has actually been 331 00:27:58,140 --> 00:28:02,900 able to bootstrap into a fully voice coding environment. So that's all I have 332 00:28:02,900 --> 00:28:07,300 for today. Thank you very much. 333 00:28:07,300 --> 00:28:15,600 Herald: Alright, I suppose I'm back on the air, so let me see. I want to remind 334 00:28:15,600 --> 00:28:21,780 everyone before we go into the Q&A that you can ask your questions for this talk 335 00:28:21,780 --> 00:28:25,880 on IRC, the link is under the video, or you can use Twitter or the Fediverse with 336 00:28:25,880 --> 00:28:34,380 the hashtag #rc3two. Again, I'll hold it up here, "rc3two". 337 00:28:34,380 --> 00:28:38,680 Thanks for your talk, David. That was really interesting. 338 00:28:38,680 --> 00:28:47,160 I, yeah, I think we have a couple of questions from the Signal Angels. 339 00:28:47,160 --> 00:28:50,600 Before that, I just wanted to say I've recently spent some time playing with, 340 00:28:50,600 --> 00:28:56,900 like, the VoiceOver system in iOS and that can now actually tell you what is on a 341 00:28:56,900 --> 00:29:03,210 photo, which is kind of amazing. Oh, by the way, I can't hear you here on the 342 00:29:03,210 --> 00:29:05,470 Mumble. David: Yeah. 
Sorry, I wasn't saying 343 00:29:05,470 --> 00:29:10,440 anything. Yeah, no, so I focused mostly on input accessibility, right? 344 00:29:10,440 --> 00:29:13,890 Which is like how do you get data to the computer. But there have been huge 345 00:29:13,890 --> 00:29:16,610 improvements the other way around as well, right? The computer doing VoiceOver 346 00:29:16,610 --> 00:29:19,150 things. Herald: So we have about, let's see, 347 00:29:19,150 --> 00:29:25,010 five or six minutes left at least for Q&A. We have a question by Toby++, he asks: "Your 348 00:29:25,010 --> 00:29:29,080 next page application looks cool. Do you have statistics of how many people use it 349 00:29:29,080 --> 00:29:35,650 or found it on the App Store?" David: Not very many. The Voice Next Page 350 00:29:35,650 --> 00:29:40,950 has so far been advertised only on a little academic poster. So I've gotten a few 351 00:29:40,950 --> 00:29:46,310 people to use it. But I run eight concurrent workers and we've never hit 352 00:29:46,310 --> 00:29:51,560 more than that. (laughs) So not super popular, but I do hope that some people will see it 353 00:29:51,560 --> 00:29:54,891 because of this talk and go and check it out. Herald: That's cool. Next question. How 354 00:29:54,891 --> 00:30:00,000 error prone are the speech recognition systems at all? E.g., can you do coding 355 00:30:00,000 --> 00:30:06,490 while doing workouts? David: So one thing about speech 356 00:30:06,490 --> 00:30:09,640 recognition is that it's very sensitive to the microphone, so when you're doing it 357 00:30:09,640 --> 00:30:38,270 Technical malfunction. We'll be back soon. 358 00:30:38,270 --> 00:30:40,650 David (cont.): Any mistakes, right? 359 00:30:40,650 --> 00:30:43,830 That's the thing about having low latency, you just say something and you watch it 360 00:30:43,830 --> 00:30:47,870 and you make sure that it was what you wanted to say. 
I don't know exactly how 361 00:30:47,870 --> 00:30:52,010 many words per minute I can say with voice coding, but I can say it much faster than 362 00:30:52,010 --> 00:30:55,500 regular speech. So I'd say at least like 200, maybe 300 words per minute. 363 00:30:55,500 --> 00:30:57,050 So it's actually a very high bandwidth mechanism. 364 00:30:57,050 --> 00:31:02,590 Herald: That's really awesome. A question from peppyjndivos: "Any advice for software 365 00:31:02,590 --> 00:31:07,760 authors to make their stuff more accessible?" 366 00:31:07,760 --> 00:31:15,420 David: There are good web accessibility guidelines. So if you're just making a 367 00:31:15,420 --> 00:31:19,240 website or something, I would definitely follow those. They tend to be focused more 368 00:31:19,240 --> 00:31:24,350 on people that are blind because that is, you know, it's more of an obvious fail, 369 00:31:24,350 --> 00:31:29,880 like they just can't interact at all with your website. But things like, you know, 370 00:31:29,880 --> 00:31:36,580 if Duolingo, for example, had used the same, like, the same accessibility access 371 00:31:36,580 --> 00:31:40,360 tag on their, like, next button, then it would always be the same letter for me and 372 00:31:40,360 --> 00:31:46,400 I wouldn't have to be like Fox-Charlie, Fox-Delta, Fox-something - it changes all the 373 00:31:46,400 --> 00:31:51,850 time. So I think consistency is very important. And integrating with any 374 00:31:51,850 --> 00:31:57,690 existing accessibility APIs is also very important - Web APIs, Android APIs and so 375 00:31:57,690 --> 00:32:01,730 on, because, you know, we can't make every program out there like voice compatible. 376 00:32:01,730 --> 00:32:05,360 We just have to meet in the middle where they interact at the keyboard layer or the 377 00:32:05,360 --> 00:32:08,490 accessibility layer. Herald: Awesome. 
AmericN has a question, 378 00:32:08,490 --> 00:32:13,730 wonders if these systems use approaches similar to stenography with mnemonics 379 00:32:13,730 --> 00:32:18,530 or if there are any projects working with that in mind. 380 00:32:18,530 --> 00:32:26,830 David: A very good question. So, the first thing everyone uses is the NATO phonetic 381 00:32:26,830 --> 00:32:32,900 alphabet to spell letters, for example, Alpha, Bravo, Charlie. Some people then 382 00:32:32,900 --> 00:32:38,910 will substitute shorter words for the ones that are too long, like November. I use noi. 383 00:32:38,910 --> 00:32:41,690 Sometimes the speech system doesn't understand you. Whenever I said Alpha, 384 00:32:41,690 --> 00:32:45,620 Dragon was like, oh, you're saying "offer". So I changed it. It's Arch for 385 00:32:45,620 --> 00:32:53,300 me, Arch, Brav, Char. So, and also most of these grammars are in a common grammar 386 00:32:53,300 --> 00:32:56,640 format. They are written in Python and they're compatible with Dragonfly. So you 387 00:32:56,640 --> 00:33:00,920 can grab a grammar for, I don't know, for Aenea and get it to work with kaldi- 388 00:33:00,920 --> 00:33:04,550 active-grammar with very little effort. I actually have a grammar that works on both 389 00:33:04,550 --> 00:33:10,970 Aenea and kaldi-active-grammar, and that's what I use. So there's a bit of a lingua 390 00:33:10,970 --> 00:33:14,060 franca, I guess; you can kind of guess what other people are using. But at the 391 00:33:14,060 --> 00:33:19,190 same time there's a lot of customization, you know, because people change words, 392 00:33:19,190 --> 00:33:23,160 they add their own commands, they change words based on what the speech system 393 00:33:23,160 --> 00:33:27,150 understands. Herald: Alright, LEB asks, is there an online 394 00:33:27,150 --> 00:33:32,130 community you can propose for accessibility technologies? 
395 00:33:32,130 --> 00:33:40,460 David: There's an amazing forum for anything related to voice coding. All the 396 00:33:40,460 --> 00:33:51,560 developers of new voice coding software are there. Sorry, I just need to drink. So 397 00:33:51,560 --> 00:33:56,760 it's a really fantastic resource. I do link to it from voxhub.io. I believe it's 398 00:33:56,760 --> 00:34:01,690 at the bottom of the kaldi-active-grammar page. So you can definitely check that 399 00:34:01,690 --> 00:34:07,450 out. For general accessibility, I don't know, I could recommend the accessibility 400 00:34:07,450 --> 00:34:11,530 mailing list at Google, but that's only if you work at Google. Other than that, yeah, 401 00:34:11,530 --> 00:34:16,240 I think it depends on your community, right? I think if you're looking for web 402 00:34:16,240 --> 00:34:20,220 accessibility, you could go for some Mozilla mailing list and so on. If you're 403 00:34:20,220 --> 00:34:24,509 looking for desktop accessibility, then maybe you could go find some stuff about 404 00:34:24,509 --> 00:34:29,579 the Windows Speech API. (unintelligible) Herald: One last question from Joe Neilson. 405 00:34:29,579 --> 00:34:34,730 Could there be legal issues if you make an e-book into audio? I'm not sure what that 406 00:34:34,730 --> 00:34:42,849 refers to. David: Yeah. So if 407 00:34:42,849 --> 00:34:45,780 you're using a screen reader and you try to get it to read out the 408 00:34:45,780 --> 00:34:55,059 contents of an e-book, right? So most of the time there are fair use 409 00:34:55,059 --> 00:35:02,609 exceptions to copyright law, even in the US, and making a copy for yourself for 410 00:35:02,609 --> 00:35:08,661 personal purposes so that you can access it is usually considered fair use. 
If you 411 00:35:08,661 --> 00:35:14,079 were trying to commercialize it or make money off of that or like, I don't know, 412 00:35:14,079 --> 00:35:18,270 you're a famous streamer and all you do is highlight text and have it read out, 413 00:35:18,270 --> 00:35:21,280 then maybe, but I would say that personal use definitely falls under fair use. 414 00:35:21,280 --> 00:35:26,740 Herald: Alright. So I guess that's it for the talk. I think we're hitting the timing 415 00:35:26,740 --> 00:35:30,380 mark really well. Thank you so much, David, for that. That was really, really 416 00:35:30,380 --> 00:35:36,160 interesting. I learned a lot, and thanks, everyone, for watching. Stay on; I think 417 00:35:36,160 --> 00:35:40,369 there might be some news coming up. Thanks, everyone. 418 00:35:40,369 --> 00:35:55,640 rc3 postroll music 419 00:35:55,640 --> 00:36:18,549 Subtitles created by c3subtitles.de in the year 2020. Join, and help us!