
#rC3 - Accessible input for readers, coders, and hackers

  • 0:00 - 0:12
    rc3 preroll music
  • 0:12 - 0:18
    Herald: All right, so again, let's
    introduce the next talk, accessible inputs
  • 0:18 - 0:25
    for readers, coders and hackers, the talk
    by David Williams-King about custom,
  • 0:25 - 0:30
    well, not off-the-shelf, but custom
    accessibility solutions. He will give you
  • 0:30 - 0:35
    some demonstrations and that includes his
    own custom-made voice input and an eye-blink
  • 0:35 - 0:38
    system. Here is David Williams-King
  • 0:40 - 0:46
    David: Thank you for the introduction.
    Let's go ahead and get started. So, yeah,
  • 0:46 - 0:51
    I'm talking about accessibility,
    particularly accessible input for readers,
  • 0:51 - 0:58
    coders and hackers. So what do I mean by
    accessibility? I mean people that have
  • 0:58 - 1:03
    physical or motor impairments. This could
    be due to repetitive strain injury, carpal
  • 1:03 - 1:08
    tunnel, all kinds of medical conditions.
    If you have this type of thing, you
  • 1:08 - 1:12
    probably can't use a normal computer
    keyboard, computer mouse or even a phone
  • 1:12 - 1:19
    touch screen. However, technology does
    allow users to interact with these devices
  • 1:19 - 1:24
    just using different forms of input. And
    it's really valuable to these people
  • 1:24 - 1:29
    because, you know, being able to interact
    with the device provides some agency they
  • 1:29 - 1:33
    can do things on their own, and it
    provides a means of communication with the
  • 1:33 - 1:38
    outside world. So it's an important
    problem to look at. And it's what I care
  • 1:38 - 1:45
    about a lot. Let's talk a bit about me for
    a moment. I'm a systems security person. I
  • 1:45 - 1:50
    did a PhD in cybersecurity at Columbia. If
    you're interested in low level software
  • 1:50 - 1:55
    defenses, you can look that up. And I'm
    currently the CTO at a startup called
  • 1:55 - 2:03
    Elpha Secure. I started developing medical
    issues around 2014. And as a result of
  • 2:03 - 2:08
    that, in an ongoing fashion, I can only
    type a few thousand keystrokes per day.
  • 2:08 - 2:12
    Roughly fifteen thousand is my maximum.
    That sounds like a lot, but imagine you're
  • 2:12 - 2:17
    typing at a hundred words per minute.
    That's five hundred characters per minute,
  • 2:17 - 2:23
    which means it takes you 30 minutes to hit
    fifteen thousand characters.
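
As a quick check, the same back-of-the-envelope arithmetic in a few lines of Python (100 words per minute and the standard 5-characters-per-word convention, as used above):

```python
# Back-of-the-envelope keystroke budget, as described in the talk.
DAILY_BUDGET = 15_000   # maximum keystrokes per day
WPM = 100               # fast typist, in words per minute
CHARS_PER_WORD = 5      # standard typing convention

chars_per_minute = WPM * CHARS_PER_WORD    # 500
minutes = DAILY_BUDGET / chars_per_minute  # 30.0
print(f"Budget exhausted after {minutes:.0f} minutes of fast typing")
```
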
  • 2:23 - 2:30
    So essentially I can work like the
    equivalent of a fast programmer for
  • 2:30 - 2:34
    half an hour. And then after that I would
    be unable to use my hands for anything,
  • 2:34 - 2:38
    including preparing food for myself
    or opening and closing doors and so on. So I
  • 2:38 - 2:42
    have to be very careful about my hand use
    and actually have a little program that
  • 2:42 - 2:47
    you can see on the slide there that
    measures the keystrokes for me so I can
  • 2:47 - 2:52
    tell when I'm going over.
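
The talk doesn't show that program's source, but a minimal daily keystroke counter is easy to sketch. This is a hedged illustration, not the actual tool: it assumes the pynput library (pip install pynput) and the 15,000-keystroke budget mentioned above:

```python
from datetime import date
from pynput import keyboard  # pip install pynput

DAILY_BUDGET = 15_000  # keystrokes per day, per the talk

count = 0
today = date.today()

def on_press(key):
    # Count every keystroke and warn as the daily budget is used up.
    global count, today
    if date.today() != today:          # new day: reset the counter
        count, today = 0, date.today()
    count += 1
    if count == DAILY_BUDGET:
        print("Daily keystroke budget reached!")
    elif count % 1000 == 0:
        print(f"{count}/{DAILY_BUDGET} keystrokes used today")

with keyboard.Listener(on_press=on_press) as listener:
    listener.join()  # runs until interrupted
```
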
    So what do I do? Well, I do a lot of pair programming,
  • 2:52 - 2:57
    for sure. I log into the same machine as
    other people and we work together. I'm
  • 2:57 - 3:00
    also a very heavy user of speech
    recognition, and I gave a talk
  • 3:00 - 3:07
    about voice coding with speech recognition
    at the HOPE 11 conference. So you can go
  • 3:07 - 3:15
    check that out if you're interested. So
    when I talk about accessible input, I mean
  • 3:15 - 3:19
    different ways that a human can provide
    input to a computer. So ergonomic
  • 3:19 - 3:23
    keyboards are a simple one. Speech
    recognition, eye tracking or gaze tracking
  • 3:23 - 3:27
    so the computer can see where you're looking
    or where you're pointing your head and
  • 3:27 - 3:32
    maybe use that to replace a mouse, that's
    head gestures, I suppose. And there's
  • 3:32 - 3:39
    always this distinction between bespoke,
    like custom input mechanisms and somewhat
  • 3:39 - 3:44
    mainstream ones. So I'll give you some
    examples. You've probably heard of Stephen
  • 3:44 - 3:50
    Hawking. He's a very famous professor, and
    he was actually a bit of an extreme case.
  • 3:50 - 3:56
    He was diagnosed with ALS when he
    was 21. So his physical
  • 3:56 - 4:01
    abilities degraded over the years
    because he lived for many decades after
  • 4:01 - 4:05
    that and he went through many
    communication mechanisms. Initially his
  • 4:05 - 4:08
    speech changed so that it was only
    intelligible to his family and close
  • 4:08 - 4:14
    friends, but he was still able to speak.
    And then after that he would work with a
  • 4:14 - 4:19
    human interpreter and raise his eyebrows
    to pick various letters. And keep
  • 4:19 - 4:25
    in mind, this is like the 60s or 70s,
    right? So computers were not really where
  • 4:25 - 4:30
    they are today. Later he would operate a
    switch with one hand, just like on-off, on-
  • 4:30 - 4:35
    off, kind of Morse code, and select from a
    bank of words. And that was around 15
  • 4:35 - 4:41
    words per minute. Eventually, he was
    unable to move his hand, so a team of
  • 4:41 - 4:44
    engineers from Intel worked with him and
    they were trying to do
  • 4:44 - 4:48
    like brain scans and all kinds of stuff.
    But again, this was like in the eighties,
  • 4:48 - 4:55
    so there was not too much they could
    do. So they basically just created some
  • 4:55 - 4:59
    custom software to detect muscle movements
    in his cheek. And he used that with
  • 4:59 - 5:04
    predictive words, the same way
    that a smartphone keyboard will
  • 5:04 - 5:07
    predict which word you want to say next.
    Stephen Hawking used something similar to
  • 5:07 - 5:13
    that, except instead of swiping on a
    phone, he was moving his cheek muscles, so
  • 5:13 - 5:18
    that's obviously a sequence of highly
    customized input mechanisms,
  • 5:18 - 5:24
    very, very specialized for that
    person. I also want to talk about someone
  • 5:24 - 5:30
    else named Professor Sang-Mook Lee, whom
    I've met. That was me when I had more of a
  • 5:30 - 5:36
    beard than I do now. He's a professor
    at Seoul National University in South
  • 5:36 - 5:43
    Korea. And he's sometimes called the
    Korean Stephen Hawking, because he's a big
  • 5:43 - 5:48
    advocate for people with disabilities.
    Anyway, you can
  • 5:48 - 5:52
    see a little orange device near his mouth
    there. It's called a sip-and-puff mouse,
  • 5:52 - 5:57
    so he can blow into it and suck air
    through it and also move it around. And
  • 5:57 - 6:02
    that acts as a mouse cursor on the Android
    device in front of him. It will move the
  • 6:02 - 6:08
    cursor around and click when he
    blows air and so on. So that combined
  • 6:08 - 6:14
    with speech recognition, lets him use
    mainstream Android hardware. He still has
  • 6:14 - 6:21
    access to, you know, email apps,
    web browsers, maps and everything
  • 6:21 - 6:26
    that comes on a normal Android device. So
    he's way more capable than Stephen
  • 6:26 - 6:30
    Hawking was; Stephen Hawking
    could communicate, but just to one person at
  • 6:30 - 6:36
    a very slow rate. Right. Part of it's due
    to the nature of his injury. But it's also
  • 6:36 - 6:44
    a testament to how far the technology has
    improved. So let's talk a little bit about
  • 6:44 - 6:49
    what makes good accessibility. I think
    performance is very important, right? You
  • 6:49 - 6:54
    want high accuracy; you don't want typos.
    And low latency: I don't want to speak and
  • 6:54 - 6:58
    then five seconds later have words appear.
    It's too long, especially if I have to
  • 6:58 - 7:03
    make corrections. Right. And you want high
    throughput, which we already talked about.
  • 7:03 - 7:06
    Oh, I forgot to mention Stephen Hawking
    had like 15 words per minute. A normal
  • 7:06 - 7:12
    person speaks at about 150. So that's
    a big difference. (laughs) The higher
  • 7:12 - 7:16
    throughput you can get, the better. And
    for input accessibility, I think, and this
  • 7:16 - 7:21
    is not scientific, this is just what I've
    learned from my own use and from observing
  • 7:21 - 7:25
    many of these systems. I think it's
    important to get completeness, consistency
  • 7:25 - 7:31
    and customization. By completeness I
    mean, can I do any action? So Stephen or
  • 7:31 - 7:41
    Professor Sang-Mook Lee, his, his orange
    mouth input device, the sip and puff is
  • 7:41 - 7:44
    quite powerful, but it doesn't let him do
    every action. For example, for some reason
  • 7:44 - 7:48
    when he gets an incoming call, the
    input doesn't work. So he has to call over
  • 7:48 - 7:52
    a person physically to tap the accept call
    button or the reject call button, which is
  • 7:52 - 7:56
    really annoying. Right. If you don't have
    completeness, you can't be fully
  • 7:56 - 8:02
    independent. Consistency is very important
    as well. The same way we develop
  • 8:02 - 8:08
    muscle memory for a keyboard,
    you develop memory for the types of
  • 8:08 - 8:12
    patterns that you do. But if the thing you
    say or the thing you do keeps changing in
  • 8:12 - 8:18
    order to do the same action, that's not
    good. And finally, customization. So the
  • 8:18 - 8:23
    learning curve for beginners is important
    for any accessibility device, but
  • 8:23 - 8:27
    designing for expert use is almost more
    important because anyone who uses an
  • 8:27 - 8:31
    accessibility interface becomes an expert
    at it. The example I like to give is
  • 8:31 - 8:35
    screen readers: a blind person using a
    screen reader on a phone will crank
  • 8:35 - 8:42
    up the speed at which the speech is being
    produced. And I actually met someone who
  • 8:42 - 8:46
    made his speech 16 times faster than
    normal human speech. I could not
  • 8:46 - 8:51
    understand it at all; it sounded like brbrbrbr to me, but
    he could understand it perfectly. And that's just
  • 8:51 - 8:56
    because he used it so much that he's
    become an expert at its use. Let's analyze
  • 8:56 - 9:01
    ergonomic keyboards just for a moment,
    because it's fun. You know, they are kind
  • 9:01 - 9:04
    of like a normal keyboard.
    you'll have a slow pace when you're
  • 9:04 - 9:08
    starting to learn them. But once you're
    good at it, you have very good accuracy,
  • 9:08 - 9:12
    and instantaneous low latency: you
    press the key, the computer receives it
  • 9:12 - 9:18
    immediately. And very high throughput,
    as high as you have on a regular keyboard.
  • 9:18 - 9:20
    So they're actually fantastic
    accessibility devices, right. They're
  • 9:20 - 9:24
    completely compatible with regular
    keyboards. And if all you need is an
  • 9:24 - 9:29
    ergonomic keyboard, then you're in luck
    because it's a very good accessibility
  • 9:29 - 9:34
    device. I'm going to talk about two
    things: computers, but also Android
  • 9:34 - 9:40
    devices, so let's start with Android
    devices. Yes, the built-in voice
  • 9:40 - 9:43
    recognition in Android is really
    incredible. So even though the microphones
  • 9:43 - 9:47
    on the devices aren't great, Google has
    just collected so much data from so many
  • 9:47 - 9:52
    different sources that they've built
    better-than-human accuracy for their
  • 9:52 - 9:57
    voice recognition. The voice accessibility
    interface is kind of so-so; we'll talk
  • 9:57 - 10:00
    about that in a bit. That's the interface
    where you can control the Android device
  • 10:00 - 10:04
    entirely by voice. For other input
    mechanisms, you could use a sip-and-
  • 10:04 - 10:09
    puff device or you could use physical
    styluses. That's something that I do a
  • 10:09 - 10:13
    lot, actually, because for me, my fingers
    get sore. And if I can hold a stylus in my
  • 10:13 - 10:19
    hand and kind of not use my fingers, then
    that's very effective. And the Elecom
  • 10:19 - 10:24
    styluses from a Japanese company are the
    lightest I've found and they don't require
  • 10:24 - 10:30
    a lot of force. The ones at the top
    there are about 12 grams and the
  • 10:30 - 10:34
    one on the bottom is 4.7 grams. And you
    need almost no force to use them. So, very
  • 10:34 - 10:38
    nice. On the left there you can see the
    Android speech recognition is built into
  • 10:38 - 10:42
    the keyboard now. Right. You can just
    press that and start speaking. It
  • 10:42 - 10:46
    supports different languages, and it's
    very accurate, it's very nice. And
  • 10:46 - 10:51
    actually, when I was working at Google for
    a bit, I talked to the speech recognition
  • 10:51 - 10:54
    team and asked: why are you doing on-
    server speech recognition? You should do
  • 10:54 - 10:58
    it on the devices. But of course, Android
    devices are all very different
  • 10:58 - 11:03
    and many of them are not very powerful. So
    they were having trouble getting
  • 11:03 - 11:06
    satisfactory speech recognition on the
    device. So for a long time, there was some
  • 11:06 - 11:11
    server latency: you do
    speech recognition and you wait a bit. And
  • 11:11 - 11:14
    then sometime this year, I was just using
    speech recognition and it became so much
  • 11:14 - 11:18
    faster. I was extremely excited and I
    looked into it and yeah, they just
  • 11:18 - 11:22
    switched it on, at least on my device: the
    on-device speech recognition
  • 11:22 - 11:26
    model. And so now it's incredibly fast and
    also incredibly accurate. I'm a huge fan
  • 11:26 - 11:31
    of it. On the right-hand side, we can
    actually see the Voice Access interface.
  • 11:31 - 11:35
    So this is meant to allow you to use a
    phone entirely by voice. Again, while I
  • 11:35 - 11:38
    was at Google, I tried the beta
    version before it was publicly released
  • 11:38 - 11:44
    and I was like, this is pretty bad, mostly
    because it lacked completeness.
  • 11:44 - 11:47
    There would be things on the screen that
    could not be selected. So here we see show
  • 11:47 - 11:53
    labels. And then I can say four,
    five, six, whatever, to tap on that
  • 11:53 - 11:57
    thing. But as you can see at the bottom,
    there was a Twitter web app link and
  • 11:57 - 12:00
    there's no number on it. So if I want to
    click on that, I'm out of luck. And this
  • 12:00 - 12:06
    is actually a problem in the design of the
    accessibility interface: it
  • 12:06 - 12:12
    doesn't expose the full DOM. It exposes
    only a subset of it. And so an
  • 12:12 - 12:19
    accessibility mechanism can't ever see
    those other things. And furthermore, the
  • 12:19 - 12:22
    way the Google speech recognition works,
    they have to reestablish a new connection
  • 12:22 - 12:26
    every 30 seconds. And if you're in the
    middle of speaking, it will just throw
  • 12:26 - 12:30
    away whatever you were saying because it
    just decided it had to reconnect, which is
  • 12:30 - 12:35
    really unfortunate. They later released
    that publicly and then sometime this year
  • 12:35 - 12:40
    they did the update, which is pretty nice.
    It now has a mouse grid,
  • 12:40 - 12:44
    which solves a lot of the completeness
    problems. You can use a grid
  • 12:44 - 12:50
    to narrow down somewhere on the screen and
    then tap there. But the server issues and
  • 12:50 - 12:55
    the expert use are still not good. If
    I want to do
  • 12:55 - 13:00
    something with the mouse grid, I have to
    say "mouse grid on. 6. 5. mouse grid off".
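
To illustrate what grid narrowing computes, here is a hedged Python sketch. It assumes a 3x3 grid with cells numbered 1 through 9 in row-major order, each spoken digit narrowing the previously chosen cell; the real Voice Access layout may differ:

```python
# Hypothetical grid pointing: each digit picks one cell of a 3x3 grid
# (numbered 1..9, row-major) nested inside the previously chosen cell.
def grid_point(width, height, digits):
    x, y, w, h = 0.0, 0.0, float(width), float(height)
    for d in digits:                  # e.g. [6, 5] for "6. 5."
        row, col = divmod(d - 1, 3)   # digit 1..9 -> (row, col)
        w, h = w / 3, h / 3           # shrink to the chosen cell
        x, y = x + col * w, y + row * h
    return x + w / 2, y + h / 2       # tap the cell's center

# "mouse grid on. 6. 5. mouse grid off" on a 1920x1080 screen:
print(grid_point(1920, 1080, [6, 5]))  # about (1600.0, 540.0)
```
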
  • 13:00 - 13:03
    And I can't combine those together. So
    there's a lot of latency and it's not
  • 13:03 - 13:10
    really that fun to use, but better than
    nothing? Absolutely! I just want to really
  • 13:10 - 13:13
    briefly show you as well that this same
    feature of being able to select links
  • 13:13 - 13:17
    on a screen is available on desktops. This
    is a plug-in for Chrome called Vimium. And
  • 13:17 - 13:23
    it's very powerful because you can then
    combine this with keyboards or other input
  • 13:23 - 13:27
    mechanisms. And this one is complete. It
    uses the entire DOM and anything you can
  • 13:27 - 13:31
    click on will be highlighted. So very
    nice. I just want to give a quick example
  • 13:31 - 13:35
    of me using some of these systems. So I've
    been trying to learn Japanese and there's
  • 13:35 - 13:39
    a couple of highly regarded websites for
    this, but they're not consistent. When I
  • 13:39 - 13:44
    use the browser's show labels, you
    know, the thing to press next page or
  • 13:44 - 13:48
    something like that, or "I
    give up", or whatever it is, it keeps
  • 13:48 - 13:52
    changing. So the letters that are being
    used keep changing. And that's because of
  • 13:52 - 13:56
    the dynamic way that they're generating
    the HTML. So not really very useful. What
  • 13:56 - 14:01
    I do instead is I use a program called
    Anki and that has very simple shortcuts in
  • 14:01 - 14:06
    its desktop app. One, two, three, four. So
    it's nice to use and consistent, and it
  • 14:06 - 14:12
    syncs with an Android app and then I can
    use my stylus on the Android device. So it
  • 14:12 - 14:16
    works pretty well. But even so, as you can
    see from the chart in the bottom there,
  • 14:16 - 14:20
    there are many days when I can't use this,
    even though I would like to, because I've
  • 14:20 - 14:26
    overused my hands or overused my voice.
    When I'm using voice recognition all day,
  • 14:26 - 14:29
    every day, I do tend to lose my voice. And
    as you can see from the graph, sometimes I
  • 14:29 - 14:34
    lose it for a week or two at a time. So
    same thing with any accessibility
  • 14:34 - 14:38
    interface, you know, you've got to use
    many different techniques, and it's
  • 14:38 - 14:44
    never perfect, just the best you
    can do at that moment. Something else I
  • 14:44 - 14:50
    like to do is read books. I read a lot of
    books and I love e-book readers, the
  • 14:50 - 14:54
    dedicated e-ink displays. You can read them
    in sunlight and they last forever, battery-
  • 14:54 - 14:59
    wise. Unfortunately, it's hard to add other
    input mechanisms to them. They don't have
  • 14:59 - 15:04
    microphones or other sensors and you can't
    really install custom software on them.
  • 15:04 - 15:07
    But Android-based devices, and there
    are also e-book reading apps for
  • 15:07 - 15:10
    Android devices, have everything: you
    can install custom software, and they have
  • 15:10 - 15:16
    microphones and many other sensors. So I
    made two apps that allow you to read
  • 15:16 - 15:21
    e-books with an e-book reader. The first
    one is Voice Next Page. It's based on one
  • 15:21 - 15:26
    of my speech recognition engines called
    Silvius, and it does do server-based
  • 15:26 - 15:29
    recognition. So you have to capture all
    the audio, use 300 kilobits a second to
  • 15:29 - 15:36
    send it to the server, which recognizes
    commands like next page and previous page.
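
The app's source isn't shown in the talk, but the client side of a server-based recognizer like this can be sketched in Python. Everything here is a hedged stand-in, not the real Silvius protocol: the endpoint URL and the plain-text replies are hypothetical, and it assumes the sounddevice and websockets packages. Note that 16 kHz, 16-bit mono audio is roughly the 300 kilobits per second mentioned above:

```python
import asyncio
import sounddevice as sd   # pip install sounddevice
import websockets          # pip install websockets

SERVER = "ws://example.invalid/recognize"  # hypothetical endpoint
RATE, CHUNK = 16_000, 1_600                # 16 kHz mono, 100 ms chunks

async def stream_and_listen():
    loop = asyncio.get_running_loop()
    async with websockets.connect(SERVER) as ws:
        def on_audio(indata, frames, time, status):
            # Forward raw microphone audio to the recognition server.
            asyncio.run_coroutine_threadsafe(ws.send(bytes(indata)), loop)
        with sd.RawInputStream(samplerate=RATE, blocksize=CHUNK,
                               channels=1, dtype="int16",
                               callback=on_audio):
            async for text in ws:  # server replies with recognized commands
                if text == "next page":
                    print("-> turn page forward")
                elif text == "previous page":
                    print("-> turn page backward")

asyncio.run(stream_and_listen())
```
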
  • 15:36 - 15:40
    However, it doesn't cut out every 30 seconds;
    it keeps going. So that's one win for it, I
  • 15:40 - 15:46
    guess. And it is published in the Play
    Store. Huge thanks to Sarah Leventhal, who
  • 15:46 - 15:50
    did a lot of the implementation. It's very
    complicated to make an accessibility app
  • 15:50 - 15:56
    on Android. But we persevered and it works
    quite nicely. So I'm going to actually
  • 15:56 - 16:03
    show you an example of voice next page.
    This over here is my phone on the left
  • 16:03 - 16:09
    hand side just captured so that you guys
    can see it. So here's the Voice Next Page.
  • 16:09 - 16:14
    And basically the connection indicator is
    green, the server is up and running and
  • 16:14 - 16:20
    so on. I just press start and then I'll
    switch to an Android reading app and say,
  • 16:20 - 16:23
    next page, previous page. I won't speak
    otherwise because it will pick up
  • 16:23 - 16:26
    everything I'm saying.
  • 16:33 - 16:35
    Next Page
  • 16:36 - 16:38
    Next Page
  • 16:38 - 16:40
    Previous Page
  • 16:42 - 16:43
    Center
  • 16:44 - 16:45
    Center
  • 16:47 - 16:48
    Foreground
  • 16:49 - 16:51
    Stop listening
  • 16:55 - 16:59
    So that's a demo of
    The Voice Next Page, and it's
  • 16:59 - 17:03
    extremely helpful. I built it a couple of
    years ago along with Sarah, and I use it a
  • 17:03 - 17:08
    lot. So, yeah, you can go ahead and
    download it if you guys wanna try it out.
  • 17:08 - 17:13
    And the other one is called Blink Next
    Page. I got the idea for this from a
  • 17:13 - 17:18
    research paper this year that
    was studying eyelid gestures. I didn't use
  • 17:18 - 17:24
    any of their code, but it's a great idea.
    So the way this works is you detect blinks
  • 17:24 - 17:29
    by using the Android camera and then you
    can trigger an action like turning pages
  • 17:29 - 17:34
    in an e-book reader. This actually doesn't
    need any networking. It's able to use the
  • 17:34 - 17:39
    on-device face recognition models from
    Google, and it is still under development.
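
Blink Next Page itself isn't shown here, but the core mechanism, a deliberate half-second blink triggering a page turn, can be approximated on a desktop. A hedged sketch, not the app's actual code: it assumes OpenCV and MediaPipe's face mesh (the Android app uses Google's on-device face models instead), and the landmark indices and threshold are rough, commonly used values that will need tuning:

```python
import cv2              # pip install opencv-python
import mediapipe as mp  # pip install mediapipe
import pyautogui        # pip install pyautogui

# Commonly used MediaPipe face-mesh indices for upper/lower eyelids.
LEFT_EYE, RIGHT_EYE = (159, 145), (386, 374)
CLOSED_THRESHOLD = 0.012  # eyelid gap in normalized units; tune per setup
HOLD_FRAMES = 15          # ~0.5 s at 30 fps, matching the talk

def eye_gap(lm, top, bottom):
    return abs(lm[top].y - lm[bottom].y)

face_mesh = mp.solutions.face_mesh.FaceMesh(max_num_faces=1)
cap = cv2.VideoCapture(0)
closed_for = 0
while True:  # runs until interrupted
    ok, frame = cap.read()
    if not ok:
        break
    results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_face_landmarks:
        lm = results.multi_face_landmarks[0].landmark
        both_closed = (eye_gap(lm, *LEFT_EYE) < CLOSED_THRESHOLD and
                       eye_gap(lm, *RIGHT_EYE) < CLOSED_THRESHOLD)
        closed_for = closed_for + 1 if both_closed else 0
        if closed_for == HOLD_FRAMES:  # deliberate half-second blink
            pyautogui.press("right")   # turn the page in the reader
```
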
  • 17:39 - 17:45
    So it's not on the Play Store yet, but it
    is working. And, you know, please contact
  • 17:45 - 17:54
    me if you want to try it. So just give me
    one moment to set that demo up here. So
  • 17:54 - 18:01
    I'm going to use... The main problem with
    this current implementation is that it
  • 18:01 - 18:07
    uses two devices. So that was easier to
    implement. And I use two devices anyway.
  • 18:07 - 18:14
    But obviously I want a one-device version
    if I'm actually going to use it for
  • 18:14 - 18:18
    anything. So here's how this works. This
    device I point at my eyes, and the
  • 18:18 - 18:24
    other device I put wherever it's
    convenient to read, oops, sorry, and if I blink
  • 18:24 - 18:29
    my eyes, the phone will buzz once it
    detects that I blink my eyes and it will
  • 18:29 - 18:35
    turn the page automatically on the other
    Android device. Now I have to blink both
  • 18:35 - 18:42
    my eyes for half a second. If I want to go
    backwards, I can blink just my left eye.
  • 18:42 - 18:50
    And if I want to go forwards quickly,
    I can blink my right eye and hold it. (background buzzing)
  • 18:50 - 18:55
    Anyway, it does have some false positives.
    That's why you can go backwards in
  • 18:55 - 19:00
    case it detects that you've accidentally
    flipped the page. And lighting is also
  • 19:00 - 19:04
    very important. Like if I have a light
    behind me, then this is not going to be
  • 19:04 - 19:08
    able to identify whether my eyes are open
    or closed properly. So it has some
  • 19:08 - 19:19
    limitations, but very simple to use. So
    I'm a big fan. OK, so that's enough about
  • 19:19 - 19:24
    Android devices, let's talk very briefly
    about desktop computers. So if you're
  • 19:24 - 19:27
    going to use a desktop computer, of
    course, try using that show labels plugin
  • 19:27 - 19:33
    in a browser. For native apps you can try
    Dragon NaturallySpeaking, which is fine if
  • 19:33 - 19:37
    you're just doing basic things. But
    if you're trying to do complicated things,
  • 19:37 - 19:41
    you should definitely use a voice coding
    system. You could also consider using eye
  • 19:41 - 19:46
    tracking to replace a mouse. Personally,
    I don't use that; I find it hurts my eyes,
  • 19:46 - 19:50
    but I do use a trackball with very little
    force and a Wacom tablet. Some people will
  • 19:50 - 19:56
    even scroll up and down by humming, for
    example, but I don't have that setup.
  • 19:56 - 20:01
    There's a bunch of nice talks out there on
    voice coding. The top left is Tavis Rudd's
  • 20:01 - 20:06
    talk from many years ago that got many of
    us interested. Emily Shea gave a talk
  • 20:06 - 20:11
    there about best practices for voice
    coding. And then I gave a talk a couple of
  • 20:11 - 20:16
    years ago at the HOPE 11 conference, which
    you can also check out. It's mostly out of
  • 20:16 - 20:22
    date by now, but it's still interesting.
    So there are a lot of voice coding
  • 20:22 - 20:28
    systems. The sort of grandfather of them
    all is Dragonfly. It's become a grammar
  • 20:28 - 20:35
    standard. With Caster, if you're willing to
    memorize lots of unusual words, you can
  • 20:35 - 20:41
    become much better, much faster than I
    currently am at voice coding. aenea is how
  • 20:41 - 20:46
    you originally used Dragon to work on a
    Linux machine, for example, because Dragon
  • 20:46 - 20:53
    only runs on Windows. Talon is a closed-
    source program, but it's very
  • 20:53 - 20:57
    powerful and has a big user base, especially
    on macOS; there are ports now. And Talon
  • 20:57 - 21:05
    used to use Dragon, but it's now using a
    speech system from Facebook. Silvius is
  • 21:05 - 21:10
    the system that I created. The models are
    not very accurate, but it's a nice
  • 21:10 - 21:13
    architecture with a client-server split,
    so it makes it easy to build things like
  • 21:13 - 21:18
    Voice Next Page. So Voice Next Page
    was using Silvius. And then the most
  • 21:18 - 21:22
    recent one I think on this list is kaldi-
    active-grammar, which is extremely
  • 21:22 - 21:26
    powerful and extremely customizable. And
    it's also open source. It works on all
  • 21:26 - 21:30
    platforms. So I really highly recommend
    that. So let's talk a bit more about
  • 21:30 - 21:35
    kaldi-active-grammar. But first, for voice
    coding, I've already mentioned, you have
  • 21:35 - 21:39
    to be careful how you use your voice,
    right? Breathe from your belly. Don't
  • 21:39 - 21:42
    tighten your muscles and breathe from your
    chest. Try to speak normally. And I'm not
  • 21:42 - 21:45
    particularly good at this. Like you'll
    hear me when I'm speaking commands that my
  • 21:45 - 21:51
    inflection changes. So I do tend to
    overuse my voice, but you just have to be
  • 21:51 - 21:54
    conscious of that. The microphone hardware
    does matter. I do recommend a Blue
  • 21:54 - 22:00
    Yeti on a microphone arm that you can pull
    and put close to your face like this. I
  • 22:00 - 22:04
    will use this one for my speaking demo
    and, yeah. The other thing is your
  • 22:04 - 22:08
    grammar is fully customizable. So if you
    keep saying a word and the system doesn't
  • 22:08 - 22:14
    recognize it, just change it to another
    word. And it's complete in the sense you
  • 22:14 - 22:18
    can type any key on the keyboard. And the
    most important thing for expert use or
  • 22:18 - 22:22
    customizability is that you can do
    chaining. So with the voice coding system,
  • 22:22 - 22:27
    you can say multiple commands at once,
    and it's a huge time saving;
  • 22:27 - 22:32
    you'll see what I mean when I give a quick
    demo.
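
To make "grammar" and "chaining" concrete, here is a minimal sketch of a chained command grammar in the Dragonfly style discussed above. The spoken phrases and key bindings are made-up examples, not anyone's actual grammar, and it assumes the dragonfly2 package with a working speech engine behind it:

```python
from dragonfly import (Grammar, MappingRule, CompoundRule,
                       Key, Text, Repetition, RuleRef)

class CommandRule(MappingRule):
    exported = False
    # Spoken phrase -> keystroke action. Illustrative bindings only.
    mapping = {
        "save file":         Key("c-s"),
        "open new terminal": Key("c-a, c"),  # e.g. a tmux binding
        "word include":      Text("include"),
        "langel":            Text("<"),
        "rangel":            Text(">"),
    }

class ChainRule(CompoundRule):
    # Chaining: up to eight commands in one utterance, so
    # "word include langel rangel" types "include<>" in one breath.
    spec = "<chain>"
    extras = [Repetition(RuleRef(CommandRule()), min=1, max=8,
                         name="chain")]

    def _process_recognition(self, node, extras):
        for action in extras["chain"]:
            action.execute()

grammar = Grammar("chaining example")
grammar.add_rule(ChainRule())
grammar.load()
```
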
  • 22:32 - 22:39
    When I do voice coding, I'm a very heavy
    vim and tmux user. I've worked with many people
  • 22:39 - 22:42
    before, so I have some cheat sheet
    information there. So if you're
  • 22:42 - 22:45
    interested, you can go check that out. But
    yeah, let's just do a quick demo of voice
  • 22:45 - 22:54
    coding here. "Turn this mic on". "Desk left
    two". "Control delta", "open new terminal".
  • 22:54 - 23:00
    "Charlie delta space slash tango mike papa
    enter". "Command vim". "Hotel hotel point
  • 23:00 - 23:09
    charlie papa papa, enter". "India, hash
    word include space langel", "india oscar word
  • 23:09 - 23:16
    stream rangel, enter, enter", "india noi
    tango space word mean", "no mike arch india
  • 23:16 - 23:24
    noi space len ren space lace enter enter
    race up tab word print fox scratch nope code
  • 23:24 - 23:31
    standard charlie oscar uniform tango space
    langel langel space quote sentence hello,
  • 23:31 - 23:40
    voice coding bang, scratch six delta india
    noi golf, bang, backslash, noi quote
  • 23:40 - 23:46
    semicolon act sky fox mike romeo noi oscar
    word return space number zero semicolon,
  • 23:46 - 23:53
    act vim save and quit". "Golf plus plus
    space hotel hotel tab minus oscar space
  • 23:53 - 24:04
    hotel hotel enter". "Point slash hotel hotel
    enter". "Desk right".
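
For readers trying to follow that dictation, here is a hedged sketch of how such a spoken alphabet maps to keystrokes. The word list is reconstructed by ear from the demo (arch, brav, char are the substituted letter words explained in the Q&A below; langel and rangel are the angle brackets), so treat it as illustrative, not exact:

```python
# Rough decoder for the spoken alphabet used in the demo above.
# Word list reconstructed by ear; illustrative, not an official grammar.
WORDS = {
    "arch": "a", "brav": "b", "charlie": "c", "delta": "d",
    "fox": "f", "golf": "g", "hotel": "h", "india": "i",
    "mike": "m", "noi": "n", "oscar": "o", "papa": "p",
    "romeo": "r", "tango": "t", "uniform": "u",
    "space": " ", "enter": "\n", "tab": "\t",
    "point": ".", "slash": "/", "quote": '"', "semicolon": ";",
    "bang": "!", "hash": "#", "langel": "<", "rangel": ">",
    "len": "(", "ren": ")", "lace": "{", "race": "}",
}

def decode(utterance):
    return "".join(WORDS.get(w, w) for w in utterance.split())

# "charlie delta space slash tango mike papa enter" -> "cd /tmp\n"
print(repr(decode("charlie delta space slash tango mike papa enter")))
```
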
  • 24:04 - 24:09
    So that's just a quick example of voice
    coding: you can use it to write any
    programming language, you can use
  • 24:09 - 24:14
    it to control anything on your desktop.
    It's very powerful. It has a bit of a
  • 24:14 - 24:19
    learning curve, but it's very powerful. So
    the creator of kaldi-active-grammar is
  • 24:19 - 24:26
    also named David. I'm named David, but
    just a coincidence. And he says of kaldi-
  • 24:26 - 24:31
    active-grammar: "I haven't typed with
    the keyboard in many years, and kaldi-
  • 24:31 - 24:36
    active-grammar is bootstrapped in that I
    have been developing it entirely using the
  • 24:36 - 24:42
    previous versions of it." So, David has a
    medical condition that means he has very
  • 24:42 - 24:48
    low dexterity, so it's hard for him to use
    a keyboard. And yet he basically got
  • 24:48 - 24:53
    kaldi-active-grammar working through the
    skin of his teeth or something and then
  • 24:53 - 24:59
    continues to develop it using it. And
    yeah, I'm a huge fan of the project. I
  • 24:59 - 25:03
    haven't contributed much, but I did give
    some hardware resources, like GPU
  • 25:03 - 25:08
    and CPU compute resources to allow
    training to happen. But I would also like
  • 25:08 - 25:13
    to show you a video of David using kaldi-
    active-grammar, just, so you can see it as
  • 25:13 - 25:21
    well. So, the other thing about David is,
    that he has a speech impediment or a
  • 25:21 - 25:25
    speech, I don't know, an accent or
    whatever. So it's difficult for a
  • 25:25 - 25:28
    normal speech recognition system to
    understand him. And you might have trouble
  • 25:28 - 25:31
    understanding him here. But you can see in
    the lower right what the speech system
  • 25:31 - 25:37
    understands him to be saying. Oh, I
    realized that I do need to switch
  • 25:37 - 25:42
    something in OBS, so that you guys can
    hear it. Sorry. There you go.
  • 25:42 - 26:03
    (The other David using the kaldi-active-grammar system; not understandable)
  • 26:03 - 26:06
    So, you get the idea, and hopefully you
  • 26:06 - 26:11
    guys were able to hear that. If not, you
    can also find this on the website that I'm
  • 26:11 - 26:18
    going to show you at the end. One other
    thing I want to show you about this is that
  • 26:18 - 26:23
    David has actually set up this humming to
    scroll, which I think is pretty cool. Of
  • 26:23 - 26:28
    course, I've gone and turned off the OBS
    there. But he's just doing hmmm and it's
  • 26:28 - 26:33
    understanding that and scrolling down. So,
    something that I'm able to do with my
  • 26:33 - 26:42
    trackball, he's using his voice for. So,
    pretty cool.
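
How that humming setup is implemented isn't shown, but the mechanism can be approximated with a simple loudness gate. A hedged sketch assuming the sounddevice, numpy, and pyautogui packages, with a crude RMS threshold standing in for real hum (pitch) detection:

```python
import numpy as np
import sounddevice as sd  # pip install sounddevice
import pyautogui          # pip install pyautogui

RATE, CHUNK = 16_000, 1_600  # 100 ms of audio per loop iteration
THRESHOLD = 0.02             # RMS loudness; tune for your microphone

# While a sustained sound (like humming) is heard, keep scrolling down.
with sd.InputStream(samplerate=RATE, channels=1,
                    blocksize=CHUNK, dtype="float32") as stream:
    while True:  # runs until interrupted
        chunk, _overflowed = stream.read(CHUNK)
        rms = float(np.sqrt(np.mean(chunk ** 2)))
        if rms > THRESHOLD:
            pyautogui.scroll(-3)  # negative scrolls down
```
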
  • 26:42 - 26:47
    So, I'm almost done here. In summary, good input accessibility means
    you need completeness, consistency and
  • 26:47 - 26:50
    customization. You need to be able to do
    any action that you could do with the
  • 26:50 - 26:55
    other input mechanisms. And doing the same
    input should have the same action. And
  • 26:55 - 27:00
    remember, your users will become experts,
    so the system needs to be designed for
  • 27:00 - 27:06
    that. For e-book reading: Yes, I'm trying
    to allow anyone to read, even if they're
  • 27:06 - 27:11
    experiencing some severe physical or motor
    impairment, because I think that gives you
  • 27:11 - 27:15
    a lot of power to be able to turn the
    pages and read your favorite books. And
  • 27:15 - 27:19
    for speech recognition, yeah, Android
    speech recognition is very good. Silvius
  • 27:19 - 27:23
    accuracy is not so good, but it's easy to
    use quickly for experimentation and to
  • 27:23 - 27:28
    make other types of things like Voice Next
    Page. And please do check out kaldi-
  • 27:28 - 27:34
    active-grammar if you have some serious
    need for voice recognition. Lastly, I put
  • 27:34 - 27:39
    all of this onto a website, voxhub.io, so
    you can see Voice Next Page, Blink Next
  • 27:39 - 27:42
    Page, kaldi-active-grammar and so on, with
    instructions for how to use it and how to
  • 27:42 - 27:47
    set it up. So please do check that out.
    And tons of acknowledgments, lots of
  • 27:47 - 27:50
    people that have helped me along the way,
    but I want to especially call out
  • 27:50 - 27:54
    Professor Sang-Mook Lee, who actually
    invited me to Korea a couple of times to
  • 27:54 - 27:58
    give talks - a big inspiration. And of
    course, David Zurow, who has actually been
  • 27:58 - 28:03
    able to bootstrap into a full voice-
    coding environment. So that's all I have
  • 28:03 - 28:07
    for today. Thank you very much.
  • 28:07 - 28:16
    Herald: Alright, I suppose I'm back on the
    air, so let me see. I want to remind
  • 28:16 - 28:22
    everyone before we go into the Q&A that
    you can ask your questions for this talk
  • 28:22 - 28:26
    on IRC, the link is under the video, or
    you can use Twitter or the Fediverse with
  • 28:26 - 28:34
    the hashtag #rc3two. Again, I'll hold it
    up here, "rc3two".
  • 28:34 - 28:39
    Thanks for your talk, David. That was
    really interesting.
  • 28:39 - 28:47
    Yeah, I think we have a couple
    of questions from the Signal Angels.
  • 28:47 - 28:51
    Before that, I just wanted to say I've
    recently spent some time playing with
  • 28:51 - 28:57
    the VoiceOver system in iOS, and that
    can now actually tell you what is on a
  • 28:57 - 29:03
    photo, which is kind of amazing. Oh, by
    the way, I can't hear you here on the
  • 29:03 - 29:05
    Mumble.
    David: Yeah. Sorry, I wasn't saying
  • 29:05 - 29:10
    anything. Yeah, so I focused
    mostly on input accessibility, right?
  • 29:10 - 29:14
    Which is like how do you get data to the
    computer. But there's been huge
  • 29:14 - 29:17
    improvements in the other direction as
    well, right? The computer doing VoiceOver
  • 29:17 - 29:19
    things.
    Herald: So we have, let's see, about
  • 29:19 - 29:25
    five or six minutes left at least for Q&A. We
    have a question from Toby++, who asks: "Your
  • 29:25 - 29:29
    next page application looks cool. Do you
    have statistics on how many people use it
  • 29:29 - 29:36
    or found it on the App Store?"
    David: Not very many. The Voice Next Page
  • 29:36 - 29:41
    was advertised only so far as a little
    academic poster. So I've gotten a few
  • 29:41 - 29:46
    people to use it. But I run eight
    concurrent workers and we've never hit
  • 29:46 - 29:52
    more than that. (laughs) So not super popular,
    but I do hope that some people will see it
  • 29:52 - 29:55
    because of this talk and go and check it out.
    Herald: That's cool. Next question. How
  • 29:55 - 30:00
    error-prone are the speech recognition
    systems in general? E.g., can you do coding
  • 30:00 - 30:06
    while doing workouts?
    David: So one thing about speech
  • 30:06 - 30:10
    recognition: it's very sensitive to the
    microphone, so when you're doing it
  • 30:10 - 30:38
    Technical malfunction. We'll be back soon.
  • 30:38 - 30:41
    David (cont.): Any mistakes, right?
  • 30:41 - 30:44
    That's the thing about having low latency,
    you just say something and you watch it
  • 30:44 - 30:48
    and you make sure that it was what you
    wanted to say. I don't know exactly how
  • 30:48 - 30:52
    many words per minute I can say with voice
    coding, but I can say it much faster than
  • 30:52 - 30:56
    regular speech. So I'd say at least like
    200, maybe 300 words per minute.
  • 30:56 - 30:57
    So it's actually a very high bandwidth
    mechanism.
  • 30:57 - 31:03
    Herald: That's really awesome. A question from
    peppyjndivos: "Any advice for software
  • 31:03 - 31:08
    authors to make their stuff more
    accessible?"
  • 31:08 - 31:15
    David: There are good web accessibility
    guidelines. So if you're just making a
  • 31:15 - 31:19
    website or something, I would definitely
    follow those. They tend to be focused more
  • 31:19 - 31:24
    on people that are blind, because that is,
    you know, a more obvious failure:
  • 31:24 - 31:30
    they just can't interact at all with
    your website. But things like, you know,
  • 31:30 - 31:37
    if Duolingo, for example, had used the
    same accessibility access
  • 31:37 - 31:40
    tag on their next button, then it
    would always be the same letter for me and
  • 31:40 - 31:46
    I wouldn't have to say Fox-Charlie,
    Fox-Delta, Fox-something; it changes all the
  • 31:46 - 31:52
    time. So I think consistency is very
    important. And integrating with any
  • 31:52 - 31:58
    existing accessibility APIs is also very
    important: web APIs, Android APIs and so
  • 31:58 - 32:02
    on, because, you know, we can't make every
    program out there voice-compatible.
  • 32:02 - 32:05
    We just have to meet in the middle where
    they interact at the keyboard layer or the
  • 32:05 - 32:08
    accessibility layer.
    Herald: Awesome. AmericN has a question,
  • 32:08 - 32:14
    wonders if these systems use approaches
    similar to stenography with mnemonics,
  • 32:14 - 32:19
    or if there are any projects working with
    that in mind.
  • 32:19 - 32:27
    David: A very good question. So, the first
    thing everyone uses is the NATO phonetic
  • 32:27 - 32:33
    alphabet to spell letters, for example,
    Alpha, Bravo, Charlie. Some people then
  • 32:33 - 32:39
    will substitute other words for ones that
    are too long, like November; I use noi.
  • 32:39 - 32:42
    Sometimes the speech system doesn't
    understand you. Whenever I said Alpha,
  • 32:42 - 32:46
    Dragon was like, oh, you're saying
    "offer". So I changed it. It's Arch for
  • 32:46 - 32:53
    me: Arch, Brav, Char. Also, most of
    these grammars are in a common grammar
  • 32:53 - 32:57
    format. They are written in Python and
    they're compatible with Dragonfly. So you
  • 32:57 - 33:01
    can grab a grammar for, I don't know,
    Aenea and get it to work with kaldi-
  • 33:01 - 33:05
    active-grammar with very little effort. I
    actually have a grammar that works on both
  • 33:05 - 33:11
    Aenea and kaldi-active-grammar, and that's
    what I use. So there's a bit of a lingua
  • 33:11 - 33:14
    franca, I guess; you can kind of guess
    what other people are using. But at the
  • 33:14 - 33:19
    same time there's a lot of customization,
    you know, because people change words,
  • 33:19 - 33:23
    they add their own commands, they change
    words based on what the speech system
  • 33:23 - 33:27
    understands.
    Herald: Alright, LEB asks, is there an online
  • 33:27 - 33:32
    community you can recommend for
    accessibility technologies?
  • 33:32 - 33:40
    David: There's an amazing forum for anything
    related to voice coding. All the
  • 33:40 - 33:52
    developers of new voice coding software
    are there. Sorry, I just need to drink. So
  • 33:52 - 33:57
    it's a really fantastic resource. I do
    link to it from voxhub.io. I believe it's
  • 33:57 - 34:02
    at the bottom of the kaldi-active-grammar
    page. So you can definitely check that
  • 34:02 - 34:07
    out. For general accessibility, I don't
    know, I could recommend the accessibility
  • 34:07 - 34:12
    mailing list at Google, but that's only if
    you work at Google. Other than that, yeah,
  • 34:12 - 34:16
    I think it depends on your community,
    right? I think if you're looking for web
  • 34:16 - 34:20
    accessibility, you could go for some
    Mozilla mailing list and so on. If you're
  • 34:20 - 34:25
    looking for desktop accessibility, then
    maybe you could go find some stuff about
  • 34:25 - 34:30
    the Windows Speech API. (unintelligible)
    Herald: One last question from Joe Neilson.
  • 34:30 - 34:35
    Could there be legal issues if you make an
    e-book into audio? I'm not sure what that
  • 34:35 - 34:43
    refers to.
    David: Yeah. So say
  • 34:43 - 34:46
    you're using a screen reader and you
    try to get it to read out the
  • 34:46 - 34:55
    contents of an e-book, right? Most
    of the time, there are fair use
  • 34:55 - 35:03
    exceptions for copyright law, even in the
    US, and making a copy yourself for
  • 35:03 - 35:09
    personal purposes so that you can access
    it is usually considered fair use. If you
  • 35:09 - 35:14
    were trying to commercialize it or make
    money off of that or like, I don't know,
  • 35:14 - 35:18
    you're a famous streamer and all you do is
    highlight text and have it read it out,
  • 35:18 - 35:21
    then maybe; but otherwise, I would say that
    definitely falls under fair use.
  • 35:21 - 35:27
    Herald: Alright. So I guess that's it for
    the talk. I think we're hitting the timing
  • 35:27 - 35:30
    mark really well. Thank you so much,
    David, for that. That was really, really
  • 35:30 - 35:36
    interesting. I learned a lot and thanks
    everyone for watching and stay on. I think
  • 35:36 - 35:40
    there might be some news coming up. Thanks,
    everyone.
  • 35:40 - 35:56
    rc3 postroll music
  • 35:56 - 36:19
    Subtitles created by c3subtitles.de
    in the year 2020. Join, and help us!