rc3 preroll music

Herald: All right, so let's introduce the next talk, "Accessible Inputs for Readers, Coders and Hackers", by David Williams-King, about custom — well, not off-the-shelf, but custom — accessibility solutions. He will give you some demonstrations, including his own custom-made voice input and blink-based input system. Here is David Williams-King.

David: Thank you for the introduction. Let's go ahead and get started. So, yeah, I'm talking about accessibility, particularly accessible input for readers, coders and hackers. What do I mean by accessibility? I mean people who have physical or motor impairments. This could be due to repetitive strain injury, carpal tunnel, all kinds of medical conditions. If you have this type of thing, you probably can't use a normal computer keyboard, computer mouse, or even a phone touch screen. However, technology does allow users to interact with these devices using different forms of input. And that's really valuable to these people, because being able to interact with a device provides some agency — they can do things on their own — and it provides a means of communication with the outside world. So it's an important problem to look at, and it's something I care about a lot.

Let's talk a bit about me for a moment. I'm a systems security person. I did a PhD in cybersecurity at Columbia; if you're interested in low-level software defenses, you can look that up. I'm currently the CTO at a startup called Elpha Secure. I started developing medical issues around 2014, and as a result, in an ongoing fashion, I can only type a few thousand keystrokes per day — roughly fifteen thousand is my maximum. That sounds like a lot, but imagine you're typing at a hundred words per minute. That's five hundred characters per minute, which means it takes you 30 minutes to hit fifteen thousand characters. So essentially I can work like the equivalent of a fast programmer for half an hour, and after that I would be unable to use my hands for anything, including preparing food for myself or opening and closing doors. So I have to be very careful about my hand use, and I actually have a little program — you can see it on the slide there — that measures the keystrokes for me, so I can tell when I'm going over.

So what do I do? Well, I do a lot of pair programming, for sure: I log into the same machine as other people and we work together. I'm also a very heavy user of speech recognition, and I gave a talk about voice coding with speech recognition at the HOPE 11 conference, so you can go check that out if you're interested.

When I talk about accessible input, I mean different ways that a human can provide input to a computer. Ergonomic keyboards are a simple one. Speech recognition. Eye tracking or gaze tracking, so the computer can see where you're looking — or where you're pointing your head, and maybe use that to replace a mouse; that's head gestures, I suppose. And there's always this distinction between bespoke, custom input mechanisms and somewhat mainstream ones. So I'll give you some examples.

You've probably heard of Stephen Hawking. He's a very famous professor, and he was actually a bit of an extreme case. He was diagnosed with ALS when he was 21, so his physical abilities degraded over the years — he lived for many decades after that — and he went through many communication mechanisms.
Initially his speech changed so that it was only intelligible to his family and close friends, but he was still able to speak. After that he would work with a human interpreter and raise his eyebrows to pick various letters — and keep in mind, this is like the 60s or 70s, right? Computers were not really where they are today. Later he would operate a switch with one hand, just on-off, on-off, kind of like Morse code, and select from a bank of words. That was around 15 words per minute. Eventually he was unable to move his hand, so a team of engineers from Intel worked with him. They were trying to do brain scans and all kinds of stuff, but there was not too much they could do, so they basically created some custom software to detect muscle movements in his cheek. He used that with predictive words, the same way that a smartphone keyboard will predict which word you want to type next. Stephen Hawking used something similar to that, except instead of swiping on a phone, he was moving his cheek muscles. So that's obviously a sequence of highly customized input mechanisms, very specialized for that one person.

I also want to talk about someone else, Professor Sang-Mook Lee, whom I've met — that's me in the photo, when I had more of a beard than I do now. He's a professor at Seoul National University in South Korea, and he's sometimes called the Korean Stephen Hawking, because he's a big advocate for people with disabilities. What he uses — you can see a little orange device near his mouth there — is called a sip-and-puff mouse: he can blow into it, suck air through it, and also move it around, and it acts as a mouse cursor on the Android device in front of him. It moves the cursor around and clicks when he blows air, and so on. That, combined with speech recognition, lets him use mainstream Android hardware. He still has access to email apps, web browsers, maps, and everything that comes on a normal Android device. So he's far more capable than Stephen Hawking was — Hawking could communicate, but only to a person right there, and at a very slow rate. Part of that is due to the nature of his condition, but it's also a testament to how far the technology has improved.

So let's talk a little bit about what makes good accessibility. I think performance is very important. You want high accuracy — you don't want typos — and low latency: I don't want to speak and then five seconds later have words appear. That's too long, especially if I have to make corrections. And you want high throughput, which we already talked about. Oh, I forgot to mention: Stephen Hawking had about 15 words per minute, and a normal person speaking is 150, so that's a big difference. (laughs) The higher the throughput you can get, the better.

And for input accessibility — and this is not scientific, just what I've learned from using many of these systems myself and observing others — I think it's important to get completeness, consistency and customization. By completeness I mean: can I do any action? Professor Sang-Mook Lee's orange sip-and-puff device is quite powerful, but it doesn't let him do every action. For example, for some reason when he gets an incoming call, the input doesn't work.
So he has to call over a person to physically tap the accept-call or reject-call button, which is really annoying. If you don't have completeness, you can't be fully independent.

Consistency is very important as well. The same way we develop muscle memory for a keyboard, you develop memory for any patterns of actions you perform. But if the thing you say or do in order to perform the same action keeps changing, that's not good.

And finally, customization. The learning curve for beginners is important for any accessibility device, but designing for expert use is almost more important, because anyone who uses an accessibility interface becomes an expert at it. The example I like to give is screen readers — a blind person using a screen reader on a phone will crank up the speed at which the speech is produced. I actually met someone who made his speech 16 times faster than normal human speech. I could not understand it at all — it sounded like brbrbrbr to me — but he could understand it perfectly. And that's just because he used it so much that he became an expert at its use.

Let's analyze ergonomic keyboards for a moment, because it's fun. They are kind of like a normal keyboard: you'll have a slow pace when you're starting to learn them, but once you're good at it, you have very good accuracy, essentially instantaneous latency — you press the key and the computer receives it immediately — and very high throughput, as high as on a regular keyboard. So they're actually fantastic accessibility devices, and they're completely compatible with regular keyboards. If all you need is an ergonomic keyboard, then you're in luck, because it's a very good accessibility device.

I'm going to talk about two things: desktop computers, but also Android devices. So let's start with Android devices. The built-in voice recognition in Android is really incredible. Even though the microphones on the devices aren't great, Google has collected so much data from so many different sources that they've built better-than-human accuracy for their voice recognition. The Voice Access accessibility interface — the interface where you can control the Android device entirely by voice — is kind of so-so; we'll talk about that in a bit. For other input mechanisms, you could use something like a sip-and-puff device, or you could use physical styluses. That's something I do a lot, actually, because my fingers get sore, and if I can hold a stylus in my hand and not really use my fingers, that's very effective. The Elecom styluses, from a Japanese company, are the lightest I've found and they don't require a lot of force: the ones at the top there are about 12 grams, the one on the bottom is 4.7 grams, and almost no force is needed to use them. Very nice.

On the left there, you can see that Android speech recognition is built into the keyboard now. You can just press that and start speaking. It supports different languages and it's very accurate — it's very nice. Actually, when I was working at Google for a bit, I talked to the speech recognition team and asked: why are you doing speech recognition on the server? You should do it on the devices. But of course Android devices are all very different, and many of them are not very powerful, so they were having trouble getting satisfactory speech recognition on the device.
So for a long time there was some server latency — server lag — where you do speech recognition and then wait a bit. Then sometime this year I was just using speech recognition and it became so much faster. I was extremely excited, and I looked into it, and yes: on my device at least, they had switched on the on-device speech recognition model. So now it's incredibly fast and also incredibly accurate. I'm a huge fan of it.

On the right-hand side we can see the Voice Access interface. This is meant to allow you to use a phone entirely by voice. Again, while I was at Google I tried the beta version before it was publicly released, and I thought it was pretty bad, mostly because it lacked completeness: there would be things on the screen that could not be selected. So here we see "show labels", and then I can say four, five, six, whatever, to tap on that element. But as you can see at the bottom, there was a Twitter web app link with no number on it, so if I want to click on that, I'm out of luck. And this is actually a problem in the design of the accessibility interface: it doesn't expose the full DOM, only a subset of it, so an accessibility mechanism can't ever see those other elements. Furthermore, the way the Google speech recognition works, it has to re-establish a new connection every 30 seconds, and if you're in the middle of speaking, it will just throw away whatever you were saying because it decided it had to reconnect, which is really unfortunate. They later released it publicly, and then sometime this year they did an update, which is pretty nice. It now has a mouse grid, which solves a lot of the completeness problems: you can use the grid to narrow down somewhere on the screen and then tap there. But the server issues and the expert use are still not good. If I want to do something with the mouse grid, I have to say "mouse grid on. 6. 5. mouse grid off", and I can't combine those together. So there's a lot of latency and it's not really that fun to use — but better than nothing? Absolutely!

I just want to briefly show you that this same feature of being able to select links on a screen is available on desktops. This is a plug-in for Chrome called Vimium, and it's very powerful because you can combine it with keyboards or other input mechanisms. And this one is complete: it uses the entire DOM, and anything you can click on will be highlighted. Very nice. (There's a rough sketch below of how these hint labels typically get assigned.)

I just want to give a quick example of me using some of these systems. I've been trying to learn Japanese, and there are a couple of highly regarded websites for this, but they're not consistent. When I use the browser's show-labels feature, the label on the thing I need to press — next page, or "I give up", or whatever it is — keeps changing. The letters being used keep changing, because of the dynamic way the sites generate their HTML. So not really very useful. What I do instead is use a program called Anki, which has very simple shortcuts in its desktop app — one, two, three, four — so it's nice to use and consistent. It syncs with an Android app, and then I can use my stylus on the Android device. So it works pretty well.
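To make the link-hint idea concrete: tools like Vimium enumerate the clickable elements on a page and assign each one a short label drawn from a fixed alphabet. Here is a rough Python sketch of that idea — not Vimium's actual algorithm or alphabet, and the element names are made up — which also shows why dynamically regenerated pages keep shuffling the letters:

```python
import itertools

# A home-row-ish hint alphabet, similar in spirit to what link-hint tools use.
# (Illustrative choice only, not Vimium's actual configuration.)
HINT_CHARS = "sadfjklewcmpgh"

def hint_labels(n):
    """Return n hint labels, all the same length so no label is a prefix of another."""
    length = 1
    while len(HINT_CHARS) ** length < n:
        length += 1
    combos = itertools.product(HINT_CHARS, repeat=length)
    return ["".join(c) for c in itertools.islice(combos, n)]

# Hypothetical clickable elements, in the order the page exposes them.
clickable = ["Continue", "Skip", "I give up", "Next page"]
print(dict(zip(hint_labels(len(clickable)), clickable)))
# {'s': 'Continue', 'a': 'Skip', 'd': 'I give up', 'f': 'Next page'}

# If the regenerated HTML inserts a new clickable element earlier in the list,
# every later element shifts to a different letter.
clickable = ["Banner ad", "Continue", "Skip", "I give up", "Next page"]
print(dict(zip(hint_labels(len(clickable)), clickable)))
# {'s': 'Banner ad', 'a': 'Continue', 'd': 'Skip', 'f': 'I give up', 'j': 'Next page'}
```

Because the labels depend purely on the order in which elements are enumerated, a site that rebuilds its element list dynamically gives the same button a different letter every time — exactly the consistency problem described above.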
Even with a setup like that, though, as you can see from the chart at the bottom there, there are many days when I can't use it even though I would like to, because I've overused my hands or overused my voice. When I'm using voice recognition all day, every day, I do tend to lose my voice — and as you can see from the graph, sometimes I lose it for a week or two at a time. So it's the same with any accessibility interface: you've got to use many different techniques, and it's never perfect, just the best you can do at that moment.

Something else I like to do is read books. I read a lot of books, and I love e-book readers, the dedicated e-ink displays: you can read them in sunlight and the battery lasts forever. Unfortunately, it's hard to add other input mechanisms to them — they don't have microphones or other sensors, and you can't really install custom software on them. Android-based devices, on the other hand — and there are e-book reading apps for Android — have everything: you can install custom software, and they have microphones and many other sensors. So I made two apps that allow you to read e-books without using your hands.

The first one is Voice Next Page. It's based on one of my speech recognition engines, called Silvius, and it does do server-based recognition: it has to capture all the audio, use about 300 kilobits per second to send it to the server, and recognize commands like "next page" and "previous page". However, it doesn't cut out every 30 seconds — it keeps going — so that's one win for it, I guess. And it is published in the Play Store. Huge thanks to Sarah Leventhal, who did a lot of the implementation. It's very complicated to make an accessibility app on Android, but we persevered and it works quite nicely.

So I'm going to show you an example of Voice Next Page. Over here on the left-hand side is my phone, captured so that you can see it. Here's Voice Next Page: the connection indicator is green, the server is up and running, and so on. I just press start, then switch to an Android reading app and say "next page", "previous page". I won't speak otherwise, because it will try to interpret everything I'm saying. "Next page." "Next page." "Previous page." "Center." "Center." "Foreground." "Stop listening." So that's a demo of Voice Next Page, and it's extremely helpful. I built it a couple of years ago along with Sarah, and I use it a lot. So, yeah, you can go ahead and download it if you want to try it out.

The other one is called Blink Next Page. I got the idea from a research paper this year that was studying eyelid gestures. I didn't use any of their code, but it's a great idea. The way this works is that you detect blinks using the Android camera, and then you can trigger an action like turning pages in an e-book reader. This doesn't need any networking — it's able to use the on-device face detection models from Google — and it is still under development, so it's not on the Play Store yet, but it is working. Please contact me if you want to try it.

So just give me one moment to set that demo up. The main problem with the current implementation is that it uses two devices — that was easier to implement, and I use two devices anyway — but obviously I want a one-device version if I'm actually going to use it for anything. The gesture logic itself is simple, roughly like the sketch below.
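The real app relies on Google's on-device face models on Android; the following is only a minimal Python sketch of the page-turning logic, with made-up threshold values, assuming some face tracker hands you a per-frame "eye open" probability for each eye:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative tuning values, not the app's actual thresholds.
CLOSED = 0.3        # eye-open probability below this counts as "closed"
HOLD_SECONDS = 0.5  # both eyes must stay closed this long to count as a deliberate blink

@dataclass
class BlinkPager:
    closed_since: Optional[float] = None   # when both eyes were first seen closed

    def on_frame(self, t: float, left_open: float, right_open: float) -> Optional[str]:
        """Feed one camera frame (timestamp plus per-eye open probabilities).
        Returns "next" when a deliberate blink is detected, else None."""
        both_closed = left_open < CLOSED and right_open < CLOSED
        if both_closed:
            if self.closed_since is None:
                self.closed_since = t
            if t - self.closed_since >= HOLD_SECONDS:
                self.closed_since = None
                return "next"   # in the real app: buzz, then turn the page on the reader
        else:
            self.closed_since = None
        return None

# Simulated frames at roughly 10 fps: eyes open, then held closed for half a second.
pager = BlinkPager()
frames = [(0.0, 0.9, 0.9), (0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (0.6, 0.1, 0.1)]
print([pager.on_frame(*f) for f in frames])   # [None, None, None, 'next']
```

Single-eye winks and held blinks (for going backwards and paging quickly) are extra cases on top of the same per-frame state machine, as described next.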
So here's how the demo works. This device I point at my eyes; the other device I put wherever it's convenient to read — oops, sorry. If I blink my eyes, the phone buzzes once it detects the blink, and it turns the page automatically on the other Android device. I have to blink both eyes for half a second. If I want to go backwards, I can blink just my left eye, and if I want to go forwards quickly, I can blink my right eye and hold it. (background buzzing) It does have some false positives — that's why you can go backwards, in case it detects a blink and flips the page by accident. Lighting is also very important: if I have a light behind me, it's not going to be able to identify properly whether my eyes are open or closed. So it has some limitations, but it's very simple to use, and I'm a big fan.

OK, that's enough about Android devices; let's talk very briefly about desktop computers. If you're going to use a desktop computer, of course try the show-labels plug-in in a browser. For native apps you can try Dragon NaturallySpeaking, which is fine if you're just doing basic things, but if you're trying to do complicated things, you should definitely use a voice coding system. You could also consider using eye tracking to replace a mouse. I personally don't use that — I find it hurts my eyes — but I do use a trackball that needs very little force, and a Wacom tablet. Some people will even scroll up and down by humming, for example, but I don't have that set up.

There are a bunch of nice talks out there on voice coding. The top left is Tavis Rudd's talk from many years ago, which got many of us interested. Emily Shea gave a talk about best practices for voice coding. And then I gave a talk a couple of years ago at the HOPE 11 conference, which you can also check out — it's mostly out of date by now, but it's still interesting.

There are a lot of voice coding systems. The sort of grandfather of them all is Dragonfly; it's become a grammar standard (there's a minimal example of what such a grammar looks like below). Caster: if you're willing to memorize lots of unusual words, you can become much faster than I currently am at voice coding. aenea is how you originally used Dragon to work on a Linux machine, for example, because Dragon only runs on Windows. Talon is a closed-source program, but it's very powerful and has a big user base, especially on macOS — there are ports to other platforms now. Talon used to use Dragon, but it's now using a speech system from Facebook. Silvius is the system that I created; the models are not very accurate, but it has a nice client-server architecture, which makes it easy to build things like Voice Next Page — Voice Next Page was using Silvius. And the most recent one on this list, I think, is kaldi-active-grammar, which is extremely powerful and extremely customizable. It's also open source and works on all platforms, so I really highly recommend it.

So let's talk a bit more about kaldi-active-grammar. But first, for voice coding, as I've already mentioned, you have to be careful how you use your voice. Breathe from your belly, not your chest, and don't tighten your muscles. Try to speak normally. I'm not particularly good at this — you'll hear that my inflection changes when I'm speaking commands — so I do tend to overuse my voice, but you just have to be conscious of that. The microphone hardware does matter; I recommend something like a Blue Yeti on a microphone arm that you can pull in close to your face, like this.
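Most of these systems share the Dragonfly grammar format: commands are defined in ordinary Python as a mapping from spoken phrases to keystroke and text actions. As a rough, minimal sketch — the command names here are made up, and it assumes the dragonfly2 package with its Kaldi back end (kaldi-active-grammar) and a downloaded model are installed — a grammar might look like this:

```python
from dragonfly import Grammar, MappingRule, Key, Text, Dictation, get_engine

class EditingRule(MappingRule):
    # Spoken phrase -> action. These command names are invented for illustration;
    # real users customize them heavily (e.g. "arch"/"brav"/"char" instead of
    # alpha/bravo/charlie when the recognizer keeps mishearing the NATO words).
    mapping = {
        "arch":              Text("a"),
        "brav":              Text("b"),
        "char":              Text("c"),
        "save file":         Key("escape") + Text(":w") + Key("enter"),
        "open new terminal": Key("c-b, c"),          # tmux prefix, then 'c'
        "sentence <text>":   Text("%(text)s"),       # free dictation inside a command
    }
    extras = [Dictation("text")]

# Assumes a Kaldi model directory is set up per the kaldi-active-grammar docs.
engine = get_engine("kaldi")
engine.connect()

grammar = Grammar("example")
grammar.add_rule(EditingRule())
grammar.load()

engine.do_recognition()   # block and recognize until interrupted
```

Real setups also layer a repetition rule on top of this so that several commands can be chained in a single utterance — the time-saving trick mentioned next.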
The other thing is that your grammar is fully customizable: if you keep saying a word and the system doesn't recognize it, just change it to another word. It's also complete, in the sense that you can type any key on the keyboard. And the most important thing for expert use, or customizability, is chaining: with a voice coding system you can say multiple commands at once, and it's a huge time saving — you'll see what I mean in a moment. I'll use this microphone for the speaking demo. When I do voice coding, I'm a very heavy vim and tmux user; I've worked with many people before, so I have some cheat-sheet information out there — if you're interested, you can go check that out.

But yeah, let's just do a quick demo of voice coding here. "Turn this mic on." "Desk left two." "Control delta." "Open new terminal." "Charlie delta space slash tango mike papa, enter." "Command vim." "Hotel hotel point charlie papa papa, enter." "India, hash, word include, space, langel." "India oscar, word stream, rangel, enter, enter." "India noi tango space word mean." "No mike arch india noi space len ren space lace, enter, enter, race, up, tab, word print, fox, scratch, nope, code standard, charlie oscar uniform tango, space, langel langel, space, quote." "Sentence hello voice coding bang, scratch, six delta india noi golf, bang, backslash, noi, quote, semicolon, act." "Sky fox mike romeo noi oscar, word return, space, number zero, semicolon, act." "Vim save and quit." "Golf plus plus space hotel hotel tab minus oscar space hotel hotel, enter." "Point slash hotel hotel, enter." "Desk right."

So that's just a quick example of voice coding. You can use it to write any programming language, and you can use it to control anything on your desktop. It has a bit of a learning curve, but it's very powerful.

The creator of kaldi-active-grammar is also named David — I'm named David too, just a coincidence. He says of kaldi-active-grammar: "I haven't typed with the keyboard in many years, and kaldi-active-grammar is bootstrapped in that I have been developing it entirely using the previous versions of it." David has a medical condition that gives him very low dexterity, so it's hard for him to use a keyboard. And yet he basically got kaldi-active-grammar working by the skin of his teeth, and he continues to develop it using it. I'm a huge fan of the project. I haven't contributed much code, but I did provide some hardware — GPU and CPU compute resources — to allow training to happen.

I would also like to show you a video of David using kaldi-active-grammar, so you can see it as well. The other thing about David is that he has a speech impediment — or an accent, or whatever you want to call it — so it's difficult for a normal speech recognition system to understand him, and you might have trouble understanding him here. But you can see in the lower right what the speech system understands him to be saying. Oh, I realize I do need to switch something in OBS so that you can hear it. Sorry. There you go. (video: the other David using the kaldi-active-grammar system; not clearly audible) So, you get the idea, and hopefully you were able to hear that. If not, you can also find this on the website that I'm going to show you at the end. One other thing I want to show you: David has actually set up humming to scroll, which I think is pretty cool.
Of course, I've gone and turned off the OBS there. But he's just going "hmmm", and it understands that and scrolls down. So that's something I do with my trackball, but he's using his voice for it — pretty cool.

So I'm almost done here. In summary, good input accessibility means completeness, consistency and customization. You need to be able to do any action that you could do with the other input mechanisms; the same input should always produce the same action; and remember, your users will become experts, so the system needs to be designed for that. For e-book reading: yes, I'm trying to allow anyone to read, even if they're experiencing a severe physical or motor impairment, because I think it gives you a lot of power to be able to turn the pages and read your favorite books. And for speech recognition: Android speech recognition is very good; Silvius accuracy is not so good, but it's easy to use for quick experimentation and for building other kinds of things, like Voice Next Page; and please do check out kaldi-active-grammar if you have a serious need for voice recognition. Lastly, I put all of this onto a website, voxhub.io, where you can find Voice Next Page, Blink Next Page, kaldi-active-grammar and so on, with instructions for how to set them up and use them. So please do check that out. And tons of acknowledgments — lots of people have helped me along the way — but I want to especially call out Professor Sang-Mook Lee, who invited me to Korea a couple of times to give talks, a big inspiration; and of course David Zurow, who has actually been able to bootstrap himself into a fully voice-driven coding environment. That's all I have for today. Thank you very much.

Herald: Alright, I suppose I'm back on the air, so let me see. Before we go into the Q&A, I want to remind everyone that you can ask your questions for this talk on IRC — the link is under the video — or you can use Twitter or the Fediverse with the hashtag #rc3two. Again, I'll hold it up here: "rc3two". Thanks for your talk, David, that was really interesting. I think we have a couple of questions from the Signal Angels. Before that, I just wanted to say I've recently spent some time playing with the VoiceOver system in iOS, and it can now actually tell you what is in a photo, which is kind of amazing. Oh, by the way, I can't hear you here on the Mumble.

David: Yeah, sorry, I wasn't saying anything. So, I focused mostly on input accessibility — how you get data to the computer — but there have been huge improvements in the other direction as well: the computer doing VoiceOver-type things.

Herald: So we have about, let's see, five or six minutes left for Q&A. We have a question from Toby++, who asks: "Your next page application looks cool. Do you have statistics of how many people use it or found it on the App Store?"

David: Not very many. Voice Next Page has so far only been advertised as a little academic poster, so I've gotten a few people to use it, but I run eight concurrent workers and we've never hit more than that. (laughs) So not super popular, but I do hope that some people will see it because of this talk and go check it out.

Herald: That's cool. Next question: how error-prone are the speech recognition systems? E.g., can you do coding while doing workouts?

David: So, one thing about speech recognition is that it's very sensitive to the microphone, so when you're doing it — (technical malfunction)
("We'll be back soon.")

David (cont.): — any mistakes, right? That's the thing about having low latency: you just say something, watch it, and make sure it was what you wanted to say. I don't know exactly how many words per minute I can produce with voice coding, but I can say commands much faster than regular speech — I'd say at least 200, maybe 300 words per minute. So it's actually a very high-bandwidth mechanism.

Herald: That's really awesome. A question from peppyjndivos: "Any advice for software authors to make their stuff more accessible?"

David: There are good web accessibility guidelines, so if you're just making a website or something, I would definitely follow those. They tend to be focused more on people who are blind, because that's a more obvious failure — they just can't interact with your website at all. But things like: if Duolingo, for example, had used the same accessibility tag on their next button, it would always be the same letter for me, and I wouldn't have to say Fox-Charlie, Fox-Delta, Fox-something — it changes all the time. So I think consistency is very important. And integrating with any existing accessibility APIs is also very important — web APIs, Android APIs and so on — because we can't make every program out there voice-compatible; we just have to meet in the middle, where they interact at the keyboard layer or the accessibility layer.

Herald: Awesome. AmericN has a question: do these systems use approaches similar to stenography with mnemonics, or are there any projects working with that in mind?

David: A very good question. The first thing everyone uses is the NATO phonetic alphabet to spell letters — Alpha, Bravo, Charlie. Some people will then substitute shorter words for ones that are too long, like November; I use "noi". Sometimes the speech system doesn't understand you: whenever I said Alpha, Dragon thought I was saying "offer", so I changed it. It's Arch for me: Arch, Brav, Char. Also, most of these grammars are in a common grammar format — they're written in Python and they're compatible with Dragonfly — so you can grab a grammar for, say, aenea and get it to work with kaldi-active-grammar with very little effort. I actually have a grammar that works on both aenea and kaldi-active-grammar, and that's what I use. So there's a bit of a lingua franca, I guess — you can kind of guess what other people are using — but at the same time there's a lot of customization, because people change words, add their own commands, and change words based on what the speech system understands.

Herald: Alright, LEB asks: is there an online community you can recommend for accessibility technologies?

David: There's an amazing forum for anything related to voice coding; all the developers of new voice coding software are there. Sorry, I just need a drink. It's a really fantastic resource, and I do link to it from voxhub.io — I believe it's at the bottom of the kaldi-active-grammar page — so you can definitely check that out. For general accessibility, I don't know; I could recommend the accessibility mailing list at Google, but that's only if you work at Google. Other than that, I think it depends on your community. If you're looking for web accessibility, you could go for some Mozilla mailing list and so on.
If you're looking for desktop accessibility, then maybe you could go find some stuff about the Windows Speech API. (unintelligible)

Herald: One last question, from Joe Neilson: could there be legal issues if you make an e-book into audio? I'm not sure what that refers to.

David: Yeah. So if you're using a screen reader and you try to get it to read out the contents of an e-book — most of the time there are fair-use exceptions in copyright law, even in the US, and making a copy yourself for personal purposes so that you can access it is usually considered fair use. If you were trying to commercialize it or make money off it — or, I don't know, you're a famous streamer and all you do is highlight text and have it read out — then maybe, but I would say this definitely falls under fair use.

Herald: Alright. So I guess that's it for the talk; I think we're hitting the timing mark really well. Thank you so much, David, that was really, really interesting — I learned a lot. Thanks everyone for watching, and stay on; I think there might be some news coming up. Thanks, everyone.

rc3 postroll music

Subtitles created by c3subtitles.de in the year 2020. Join, and help us!