1
00:00:00,000 --> 00:00:12,260
rc3 preroll music
2
00:00:12,260 --> 00:00:17,930
Herald: All right, so again, let's
introduce the next talk, accessible inputs
3
00:00:17,930 --> 00:00:25,320
for readers, coders and hackers, the talk
by David Williams-King about custom off-,
4
00:00:25,320 --> 00:00:30,230
well, not off the shelf, but custom
accessibility solutions. He will give you
5
00:00:30,230 --> 00:00:35,420
some demonstrations and that includes his
own custom-made voice input and eyelid blink
6
00:00:35,420 --> 00:00:38,110
system. Here is David Williams-King.
7
00:00:40,020 --> 00:00:46,440
David: Thank you for the introduction.
Let's go ahead and get started. So, yeah,
8
00:00:46,440 --> 00:00:50,650
I'm talking about accessibility,
particularly accessible input for readers,
9
00:00:50,650 --> 00:00:57,840
coders and hackers. So what do I mean by
accessibility? I mean people that have
10
00:00:57,840 --> 00:01:02,780
physical or motor impairments. This could
be due to repetitive strain injury, carpal
11
00:01:02,780 --> 00:01:08,030
tunnel, all kinds of medical conditions.
If you have this type of thing, you
12
00:01:08,030 --> 00:01:11,860
probably can't use a normal computer
keyboard, computer mouse or even a phone
13
00:01:11,860 --> 00:01:18,720
touch screen. However, technology does
allow users to interact with these devices
14
00:01:18,720 --> 00:01:23,780
just using different forms of input. And
it's really valuable to these people
15
00:01:23,780 --> 00:01:28,909
because, you know, being able to interact
with the device provides some agency: they
16
00:01:28,909 --> 00:01:32,781
can do things on their own, and it
provides a means of communication with the
17
00:01:32,781 --> 00:01:38,439
outside world. So it's an important
problem to look at. And it's what I care
18
00:01:38,439 --> 00:01:45,200
about a lot. Let's talk a bit about me for
a moment. I'm a systems security person. I
19
00:01:45,200 --> 00:01:49,920
did a PhD in cybersecurity at Columbia. If
you're interested in low level software
20
00:01:49,920 --> 00:01:54,509
defenses, you can look that up. And I'm
currently the CTO at a startup called
21
00:01:54,509 --> 00:02:03,360
Elpha Secure. I started developing medical
issues in around 2014. And as a result of
22
00:02:03,360 --> 00:02:07,770
that, in an ongoing fashion, I can only
type a few thousand keystrokes per day.
23
00:02:07,770 --> 00:02:12,090
Roughly fifteen thousand is my maximum.
That sounds like a lot, but imagine you're
24
00:02:12,090 --> 00:02:17,069
typing at a hundred words per minute.
That's five hundred characters per minute,
25
00:02:17,069 --> 00:02:23,349
which means it takes you 30 minutes to hit
fifteen thousand characters. So
26
00:02:23,349 --> 00:02:29,519
essentially I, I can work like the
equivalent of a fast programmer for, for
27
00:02:29,519 --> 00:02:33,700
half an hour. And then after that I would
be unable to use my hands for anything,
28
00:02:33,700 --> 00:02:38,420
including like preparing food for myself
or opening, closing doors and so on. So I
29
00:02:38,420 --> 00:02:42,189
have to be very careful about my hand use
and actually have a little program that
30
00:02:42,189 --> 00:02:46,690
you can see on the slide there that
measures the keystrokes for me so I can
31
00:02:46,690 --> 00:02:51,809
tell when I'm going over. So what do I
do? Well, I do a lot of pair programming,
32
00:02:51,809 --> 00:02:56,650
for sure. I log into the same machine as
other people and we work together. I'm
33
00:02:56,650 --> 00:03:00,430
also a very heavy user of speech
recognition, and I gave a talk
34
00:03:00,430 --> 00:03:06,900
about voice coding with speech recognition
at the HOPE 11 conference. So you can go
35
00:03:06,900 --> 00:03:15,419
check that out if you're interested. So
when I talk about accessible input, I mean
36
00:03:15,419 --> 00:03:18,779
different ways that a human can provide
input to a computer. So ergonomic
37
00:03:18,779 --> 00:03:23,019
keyboards are a simple one. Speech
recognition, eye tracking or gaze tracking
38
00:03:23,019 --> 00:03:26,699
so you can see where you're looking
or where you're pointing your head and
39
00:03:26,699 --> 00:03:32,229
maybe use that to replace a mouse, that's
head gestures, I suppose. And there's
40
00:03:32,229 --> 00:03:38,650
always this distinction between bespoke,
like custom input mechanisms and somewhat
41
00:03:38,650 --> 00:03:44,499
mainstream ones. So I'll give you some
examples. You've probably heard of Stephen
42
00:03:44,499 --> 00:03:50,230
Hawking. He's a very famous professor, and
he was actually a bit of an extreme case.
43
00:03:50,230 --> 00:03:56,142
He had, he was diagnosed with ALS when he
was 21. So his physical
44
00:03:56,142 --> 00:04:00,669
abilities degraded over the years
because he lived for many decades after
45
00:04:00,669 --> 00:04:05,059
that and he went through many
communication mechanisms. Initially his
46
00:04:05,059 --> 00:04:08,309
speech changed so that it was only
intelligible to his family and close
47
00:04:08,309 --> 00:04:14,239
friends, but he was still able to speak.
And then after that he would work with a
48
00:04:14,239 --> 00:04:19,440
human interpreter and raise his eyebrows
to pick various letters. And then and keep
49
00:04:19,440 --> 00:04:24,690
in mind, this is like the 60s or 70s,
right? So computers were not really where
50
00:04:24,690 --> 00:04:29,840
they are today. Later he would operate a
switch with one hand, just like on off on
51
00:04:29,840 --> 00:04:35,009
off, kind of Morse code and select from a
bank of words. And that was around 15
52
00:04:35,009 --> 00:04:41,080
words per minute. Eventually, he was
unable to move his hand, so a team of
53
00:04:41,080 --> 00:04:44,490
engineers from Intel worked with him and
they figured out, they were trying to do
54
00:04:44,490 --> 00:04:48,229
like brain scans and all kinds of stuff.
But again, this was like in the eighties,
55
00:04:48,229 --> 00:04:54,599
so there was not too much they could
do. So they basically just created some
56
00:04:54,599 --> 00:04:59,120
custom software to detect muscle movements
in his cheek. And he used that with
57
00:04:59,120 --> 00:05:03,550
predictive words, the same way
that a phone, smartphone keyboard will
58
00:05:03,550 --> 00:05:07,180
predict which word you want to say next.
Stephen Hawking used something similar to
59
00:05:07,180 --> 00:05:12,689
that, except instead of swiping on a
phone, he was moving his cheek muscles, so
60
00:05:12,689 --> 00:05:17,810
that's obviously a sequence of like highly
customized input mechanisms for, for
61
00:05:17,810 --> 00:05:23,979
someone very, very specialized for that
person. I also want to talk about someone
62
00:05:23,979 --> 00:05:29,592
else named Professor Sang-Mook Lee, whom
I've met. That was me when I had more of a
63
00:05:29,592 --> 00:05:36,180
beard than I do now. He's a professor
at Seoul National University in South
64
00:05:36,180 --> 00:05:42,969
Korea. And he's sometimes called like the
Korean Stephen Hawking, because he's a big
65
00:05:42,969 --> 00:05:47,990
advocate for people with disabilities.
Anyway, what he uses is you can
66
00:05:47,990 --> 00:05:52,360
see a little orange device near his mouth
there. It's called a sip and puff mouse
67
00:05:52,360 --> 00:05:56,930
so he can blow into it and suck air
through it and also move it around. And
68
00:05:56,930 --> 00:06:02,280
that acts as a mouse cursor on the Android
device in front of him. It will move the
69
00:06:02,280 --> 00:06:08,229
cursor around and click when he
blows air and so on. So that combined
70
00:06:08,229 --> 00:06:13,909
with speech recognition, lets him use
mainstream Android hardware. He still has
71
00:06:13,909 --> 00:06:21,249
access to, you know, email apps and like
web browsers and like Maps and everything
72
00:06:21,249 --> 00:06:26,159
that comes on a normal Android device. So
he's way more capable than Stephen
73
00:06:26,159 --> 00:06:29,949
Hawking was. Stephen Hawking
could communicate, but just to a person at
74
00:06:29,949 --> 00:06:35,830
a very slow rate. Right. Part of it's due
to the nature of his injury. But it's also
75
00:06:35,830 --> 00:06:43,939
a testament to how far the technology has
improved. So let's talk a little bit about
76
00:06:43,939 --> 00:06:49,480
what makes good accessibility. I think
performance is very important, right? You
77
00:06:49,480 --> 00:06:53,889
want high accuracy. You don't want typos.
Low latency: I don't want to speak and
78
00:06:53,889 --> 00:06:58,389
then five seconds later have words appear.
It's too long, especially if I have to
79
00:06:58,389 --> 00:07:02,509
make corrections. Right. And you want high
throughput, which we already talked about.
80
00:07:02,509 --> 00:07:06,240
Oh, I forgot to mention Stephen Hawking
had like 15 words per minute. A normal
81
00:07:06,240 --> 00:07:12,349
person speaking is 150. So that's
a big difference. (laughs) The higher
82
00:07:12,349 --> 00:07:16,479
throughput you can get, the better. And
for input accessibility, I think and this
83
00:07:16,479 --> 00:07:20,879
is not scientific. This is just what I've
learned from using myself and observing
84
00:07:20,879 --> 00:07:25,330
many of these systems. I think it's
important to get completeness, consistency
85
00:07:25,330 --> 00:07:31,479
and customization. By completeness I
mean, can I do any action? So Stephen or
86
00:07:31,479 --> 00:07:40,590
Professor Sang-Mook Lee, his, his orange
mouth input device, the sip and puff is
87
00:07:40,590 --> 00:07:44,199
quite powerful, but it doesn't let him do
every action. For example, for some reason
88
00:07:44,199 --> 00:07:48,379
when he gets an incoming call, the
input doesn't work. So he has to call over
89
00:07:48,379 --> 00:07:52,430
a person physically to tap the accept call
button or the reject call button, which is
90
00:07:52,430 --> 00:07:55,729
really annoying. Right. If you don't have
completeness, you can't be fully
91
00:07:55,729 --> 00:08:01,729
independent. Consistency, very important
as well. The same way we develop motor
92
00:08:01,729 --> 00:08:07,580
memory, or muscle memory, for a keyboard.
You develop memory for any types of
93
00:08:07,580 --> 00:08:11,690
patterns that you do. But if the thing you
say or the thing you do keeps changing in
94
00:08:11,690 --> 00:08:18,220
order to do the same action, that's not
good. And finally, customization. So the
95
00:08:18,220 --> 00:08:22,809
learning curve for beginners is important
for any accessibility device, but
96
00:08:22,809 --> 00:08:27,150
designing for expert use is almost more
important because anyone who uses an
97
00:08:27,150 --> 00:08:31,229
accessibility interface becomes an expert
at it. The example I like to give is
98
00:08:31,229 --> 00:08:35,440
screen readers like a blind person using a
screen reader on a phone. They will crank
99
00:08:35,440 --> 00:08:41,880
up the speed at which the speech is being
produced. And I actually met someone who
100
00:08:41,880 --> 00:08:46,321
made his speech 16 times faster than
normal human speech. I could not
101
00:08:46,341 --> 00:08:51,020
understand it at all, it sounded like brbrbrbr, but
he could understand it perfectly. And that's just
102
00:08:51,020 --> 00:08:56,190
because he used it so much that he's
become an expert at its use. Let's analyze
103
00:08:56,190 --> 00:09:01,040
ergonomic keyboards just for a moment,
because it's fun. You know, they are kind
104
00:09:01,040 --> 00:09:04,260
of like a normal keyboard. They'll have a,
you'll have a slow pace when you're
105
00:09:04,260 --> 00:09:07,630
starting to learn them. But once you're
good at it, you have very good accuracy,
106
00:09:07,630 --> 00:09:11,709
like instantaneous low latency. Right. You
press the key, the computer receives it
107
00:09:11,709 --> 00:09:17,510
immediately and very high throughput. As
high as you are on a regular keyboard.
108
00:09:17,510 --> 00:09:20,329
So they're actually fantastic
accessibility devices, right. They're
109
00:09:20,329 --> 00:09:23,950
completely compatible with original
keyboards. And if all you need is an
110
00:09:23,950 --> 00:09:28,600
ergonomic keyboard, then you're in luck
because it's a very good accessibility
111
00:09:28,600 --> 00:09:34,480
device. I'm going to talk about two
things, computers, but also Android
112
00:09:34,480 --> 00:09:39,750
devices, so let's start with Android
devices. Yes, the built in voice
113
00:09:39,750 --> 00:09:43,340
recognition in Android is really
incredible. So even though the microphones
114
00:09:43,340 --> 00:09:47,000
on the devices aren't great, Google has
just collected so much data from so many
115
00:09:47,000 --> 00:09:51,590
different sources that they've built like
better than human accuracy for their
116
00:09:51,590 --> 00:09:56,570
voice recognition. The voice accessibility
interface is kind of so-so. We'll talk
117
00:09:56,570 --> 00:09:59,649
about that in a bit. That's the interface
where you can control the Android device
118
00:09:59,649 --> 00:10:04,230
entirely by voice. For other input
mechanisms, you could use like a sip and
119
00:10:04,230 --> 00:10:09,010
puff device or you could use physical
styluses. That's something that I do a
120
00:10:09,010 --> 00:10:13,320
lot, actually, because for me, my fingers
get sore. And if I can hold a stylus in my
121
00:10:13,320 --> 00:10:19,220
hand and kind of not use my fingers, then
that's very effective. So, the Elecom
122
00:10:19,220 --> 00:10:23,750
styluses from a Japanese company are the
lightest I've found and they don't require
123
00:10:23,750 --> 00:10:30,131
a lot of force. So the ones at the top
there, they're like 12 grams and the
124
00:10:30,131 --> 00:10:34,160
one on the bottom is 4.7 grams. And you
need almost no force to use them. So very
125
00:10:34,160 --> 00:10:38,040
nice. On the left there you can see the
Android speech recognition is built into
126
00:10:38,040 --> 00:10:41,860
the keyboard now. Right. You can just
press that and start speaking. It
127
00:10:41,860 --> 00:10:46,160
supports different languages, and it's
very accurate, it's very nice. And
128
00:10:46,160 --> 00:10:51,470
actually, when I was working at Google for
a bit, I talked to the speech recognition
129
00:10:51,470 --> 00:10:54,470
team and was like: Why are you doing
on-server speech recognition? You should do
130
00:10:54,470 --> 00:10:58,029
it on the devices. But of course, Android
devices are, they're all very different
131
00:10:58,029 --> 00:11:02,529
and many of them are not very powerful. So
they were having trouble getting
132
00:11:02,529 --> 00:11:06,450
satisfactory speech recognition on the
device. So for a long time, there was some
133
00:11:06,450 --> 00:11:10,630
server latency, server lag, where you do
speech recognition and you wait a bit. And
134
00:11:10,630 --> 00:11:14,190
then sometime this year, I just was using
speech recognition and it became so much
135
00:11:14,190 --> 00:11:18,360
faster. I was extremely excited and I
looked into it and yeah, they just
136
00:11:18,360 --> 00:11:22,000
switched, on my device at least, they
switched on the on-device speech recognition
137
00:11:22,000 --> 00:11:25,710
model. And so now it's incredibly fast and
also incredibly accurate. I'm a huge fan
138
00:11:25,710 --> 00:11:30,949
of it. On the right-hand side, we can
actually see the voice access interface.
139
00:11:30,949 --> 00:11:34,899
So this is meant to allow you to use a
phone entirely by voice. Again, while I
140
00:11:34,899 --> 00:11:37,940
was at Google, I tried the beta
version before it was publicly released
141
00:11:37,940 --> 00:11:43,529
and I was like, this is pretty bad, mostly
because it did, it lacked completeness.
142
00:11:43,529 --> 00:11:47,209
There would be things on the screen that
would not be selected. So here we see "show
143
00:11:47,209 --> 00:11:52,510
labels", and then I can say like four,
five, six, whatever, to tap on that
144
00:11:52,510 --> 00:11:57,070
thing. But as you can see at the bottom,
there was like a Twitter Web app link and
145
00:11:57,070 --> 00:12:00,140
there's no number on it. So if I want to
click on that, I'm out of luck. And this
146
00:12:00,140 --> 00:12:06,500
is actually a problem in the design of the
accessibility interface that it only, it
147
00:12:06,500 --> 00:12:11,519
doesn't expose the full DOM. It exposes
only a subset of it. And so an
148
00:12:11,519 --> 00:12:18,959
accessibility mechanism can't ever see
those other things. And furthermore, the
149
00:12:18,959 --> 00:12:22,279
way the Google speech recognition works,
they have to reestablish a new connection
150
00:12:22,279 --> 00:12:26,480
every 30 seconds. And if you're in the
middle of speaking, it will just throw
151
00:12:26,480 --> 00:12:29,959
away whatever you were saying because it
just decided it had to reconnect, which is
152
00:12:29,959 --> 00:12:34,610
really unfortunate. They later released
that publicly and then sometime this year
153
00:12:34,610 --> 00:12:39,860
they did the update, which is pretty nice.
It now has like a mouse grid, which lets,
154
00:12:39,860 --> 00:12:44,050
which solves a lot of the completeness
problems. Like you can, you can use a grid
155
00:12:44,050 --> 00:12:50,040
to narrow down somewhere on the screen and
then tap there. But the server issues and
156
00:12:50,040 --> 00:12:54,870
the expert use is still not good, like, if
I want to turn it, if I want to do
157
00:12:54,870 --> 00:12:59,540
something with the mouse grid, I have to
say "mouse grid on. 6. 5. mouse grid off".
158
00:12:59,540 --> 00:13:02,899
And I can't combine those together. So
there's a lot of latency and it's not
159
00:13:02,899 --> 00:13:09,611
really that fun to use, but better than
nothing? Absolutely! I just want to really
160
00:13:09,611 --> 00:13:13,149
briefly show you as well that this same
feature of like being able to select links
161
00:13:13,149 --> 00:13:17,209
on a screen is available on desktops. This
is a plugin for Chrome called Vimium. And
162
00:13:17,209 --> 00:13:22,670
it's very powerful because you can then
combine this with keyboards or other input
163
00:13:22,670 --> 00:13:26,650
mechanisms. And this one is complete. It
uses the entire DOM and anything you can
164
00:13:26,650 --> 00:13:31,130
click on will be highlighted. So very
nice. I just want to give a quick example
165
00:13:31,130 --> 00:13:35,380
of me using some of these systems. So I've
been trying to learn Japanese and there's
166
00:13:35,380 --> 00:13:39,130
a couple of highly regarded websites for
this, but they're not consistent. When I
167
00:13:39,130 --> 00:13:43,829
use the browser show labels like, you
know, the thing to press next page or
168
00:13:43,829 --> 00:13:47,970
something like that or like, you know, I
give up or whatever it is, it keeps
169
00:13:47,970 --> 00:13:51,980
changing. So the letters that are being
used keep changing. And that's because of
170
00:13:51,980 --> 00:13:56,500
the dynamic way that they're generating
the HTML. So not really very useful. What
171
00:13:56,500 --> 00:14:01,160
I do instead is I use a program called
Anki and that has very simple shortcuts in
172
00:14:01,160 --> 00:14:06,410
its desktop app. One, two, three, four. So
it's nice to use and consistent and it
173
00:14:06,410 --> 00:14:11,530
syncs with an Android app and then I can
use my stylus on the Android device. So it
174
00:14:11,530 --> 00:14:16,450
works pretty well. But even so, as you can
see from the chart in the bottom there,
175
00:14:16,450 --> 00:14:20,220
there are many days when I can't use this,
even though I would like to, because I've
176
00:14:20,220 --> 00:14:25,770
overused my hands or overused my voice.
When I'm using voice recognition all day,
177
00:14:25,770 --> 00:14:28,649
every day, I do tend to lose my voice. And
as you can see from the graph, sometimes I
178
00:14:28,649 --> 00:14:33,700
lose it for a week or two at a time. So
same thing with any accessibility
179
00:14:33,700 --> 00:14:38,410
interface, you know, you've got to use
many different techniques and it's always,
180
00:14:38,410 --> 00:14:44,259
it's never perfect, it's just the best you
can do at that moment. Something else I
181
00:14:44,259 --> 00:14:49,770
like to do is read books. I read a lot of
books and I love e-book readers, the
182
00:14:49,770 --> 00:14:54,139
dedicated e-ink displays. You can read them
in sunlight, they last forever, battery
183
00:14:54,139 --> 00:14:59,060
wise. Unfortunately, it's hard to add other
input mechanisms to them. They don't have
184
00:14:59,060 --> 00:15:03,569
microphones or other sensors and you can't
really install custom software on them.
185
00:15:03,569 --> 00:15:07,250
But for Android-based devices, and there
are also like e-book reading apps for
186
00:15:07,250 --> 00:15:10,399
Android devices, they have everything: you
can install custom software and they have
187
00:15:10,399 --> 00:15:15,569
microphones and many other sensors. So I
made two apps that allow you to read
188
00:15:15,569 --> 00:15:21,319
e-books with an e-book reader. The first
one is Voice Next Page. It's based on one
189
00:15:21,319 --> 00:15:25,759
of my speech recognition engines called
Silvius, and it does do server-based
190
00:15:25,759 --> 00:15:29,290
recognition. So you have to capture all
the audio, use 300 kilobits a second to
191
00:15:29,290 --> 00:15:35,560
send it to the server and recognize things
like next page, previous page. However, it
192
00:15:35,560 --> 00:15:40,329
doesn't cut out every 30 seconds. It keeps
going. So that's one win for it, I
193
00:15:40,329 --> 00:15:46,470
guess. And it is published in the Play
Store. Huge thanks to Sarah Leventhal, who
194
00:15:46,470 --> 00:15:49,670
did a lot of the implementation. Very
complicated to make an accessibility app
195
00:15:49,670 --> 00:15:55,819
on Android. But we persevered and it works
quite nicely. So I'm going to actually
196
00:15:55,819 --> 00:16:03,149
show you an example of Voice Next Page.
This over here is my phone on the left
197
00:16:03,149 --> 00:16:08,649
hand side just captured so that you guys
can see it. So here's the Voice Next Page.
198
00:16:08,649 --> 00:16:13,820
And basically the connection is green. I
can do, the server is up and running and
199
00:16:13,820 --> 00:16:19,700
so on. I just press start and then I'll
switch to an Android reading app and say,
200
00:16:19,700 --> 00:16:23,120
next page, previous page. I won't speak
otherwise because it will capture
201
00:16:23,120 --> 00:16:26,400
everything I'm saying.
202
00:16:32,910 --> 00:16:34,880
Next Page
203
00:16:36,090 --> 00:16:37,640
Next Page
204
00:16:38,310 --> 00:16:40,100
Previous Page
205
00:16:41,520 --> 00:16:42,860
Center
206
00:16:43,680 --> 00:16:45,030
Center
207
00:16:46,620 --> 00:16:48,120
Foreground
208
00:16:49,155 --> 00:16:50,845
Stop listening
209
00:16:54,960 --> 00:16:58,680
So that's a demo of
the Voice Next Page, and it's
210
00:16:58,680 --> 00:17:03,259
extremely helpful. I built it a couple of
years ago along with Sarah, and I use it a
211
00:17:03,259 --> 00:17:07,800
lot. So, yeah, you can go ahead and
download it if you guys wanna try it out.
212
00:17:07,800 --> 00:17:12,530
And the other one is called Blink Next
Page. So the idea for this, I got this
213
00:17:12,530 --> 00:17:18,260
idea from a research paper this year that
was studying eyelid gestures. I didn't use
214
00:17:18,260 --> 00:17:24,210
any of their code, but it's a great idea.
So the way this works is you detect blinks
215
00:17:24,210 --> 00:17:28,590
by using the Android camera and then you
can trigger an action like turning pages
216
00:17:28,590 --> 00:17:34,330
in an e-book reader. This actually doesn't
need any networking. It's able to use the
217
00:17:34,330 --> 00:17:38,960
on-device face recognition models from
Google, and it is still under development.
218
00:17:38,960 --> 00:17:44,630
So it's not on the Play Store yet, but it
is working. And, you know, please contact
219
00:17:44,630 --> 00:17:54,430
me if you want to try it. So just give me
one moment to set that demo up here. So
220
00:17:54,430 --> 00:18:00,590
I'm going to use... The main problem with
this current implementation is that it
221
00:18:00,590 --> 00:18:07,030
uses two devices. So that was easier to
implement. And I use two devices anyway.
222
00:18:07,030 --> 00:18:14,040
But obviously I want a one device version
if I'm actually going to use it for
223
00:18:14,040 --> 00:18:18,281
anything. So here's how this works. This
device I point at me, at my eyes, the
224
00:18:18,281 --> 00:18:24,010
other device I put wherever it's
convenient to read, oops, sorry, and if I blink
225
00:18:24,010 --> 00:18:28,780
my eyes, the phone will buzz once it
detects that I blink my eyes and it will
226
00:18:28,780 --> 00:18:35,410
turn the page automatically on the other
Android device. Now I have to blink both
227
00:18:35,410 --> 00:18:41,500
my eyes for half a second. If I want to go
backwards, I can blink just my left eye.
228
00:18:41,500 --> 00:18:49,510
And if I want to go forwards like quickly,
I can blink my right eye and hold it. (background buzzing)
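[Editor's sketch] The gesture mapping described above could be modeled roughly as follows. This is a minimal illustration, not the app's actual code: it assumes a face model that reports a per-frame eye-open probability for each eye, and the `BlinkDetector` class, the `CLOSED_BELOW` threshold, and applying the same half-second hold to every gesture are all hypothetical choices. Only the half-second figure and the three gestures come from the talk.

```python
from typing import Optional

CLOSED_BELOW = 0.3   # assumed: eye-open probability below this counts as closed
HOLD_SECONDS = 0.5   # "half a second", as stated in the talk

# Gesture-to-action mapping as described in the talk.
ACTIONS = {"both": "next_page", "left": "previous_page", "right": "next_page"}


class BlinkDetector:
    """Turns per-frame eye-open probabilities into page-turn actions."""

    def __init__(self) -> None:
        self.state = "open"   # which eyes are currently closed
        self.since = 0.0      # time at which the current state began
        self.fired = False    # whether this closure already triggered an action

    def update(self, t: float, left_open: float, right_open: float) -> Optional[str]:
        left_closed = left_open < CLOSED_BELOW
        right_closed = right_open < CLOSED_BELOW
        if left_closed and right_closed:
            state = "both"
        elif left_closed:
            state = "left"
        elif right_closed:
            state = "right"
        else:
            state = "open"

        # A change of state restarts the hold timer.
        if state != self.state:
            self.state, self.since, self.fired = state, t, False
            return None
        if state == "open" or self.fired or t - self.since < HOLD_SECONDS:
            return None
        if state == "right":
            # Holding the right eye keeps turning pages: restart the timer.
            self.since = t
        else:
            self.fired = True
        return ACTIONS[state]
```

Requiring a hold before firing means an ordinary quick blink produces no action, which is one simple way to limit the false positives mentioned next.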
229
00:18:49,510 --> 00:18:54,640
Anyway, it does have some false positives.
That's why like you can go backwards in
230
00:18:54,640 --> 00:18:59,790
case it detects that you've accidentally
flipped the page. And lighting is also
231
00:18:59,790 --> 00:19:03,560
very important. Like if I have a light
behind me, then this is not going to be
232
00:19:03,560 --> 00:19:07,760
able to identify whether my eyes are open
or closed properly. So it has some
233
00:19:07,760 --> 00:19:19,150
limitations, but very simple to use. So
I'm a big fan. OK, so that's enough about
234
00:19:19,150 --> 00:19:23,760
Android devices, let's talk very briefly
about desktop computers. So if you're
235
00:19:23,760 --> 00:19:27,450
going to use a desktop computer, of
course, try using that show labels plugin
236
00:19:27,450 --> 00:19:33,210
in a browser. For native apps you can try
Dragon NaturallySpeaking, which is fine if
237
00:19:33,210 --> 00:19:37,190
you're just like using basic things. But
if you're trying to do complicated things,
238
00:19:37,190 --> 00:19:40,830
you should definitely use a voice coding
system. You could also consider using eye
239
00:19:40,830 --> 00:19:45,810
tracking to replace a mouse. I personally,
I don't use that. I find it hurts my eyes,
240
00:19:45,810 --> 00:19:50,400
but I do use a trackball with very little
force and a Wacom tablet. Some people will
241
00:19:50,400 --> 00:19:55,640
even scroll up and down by humming, for
example, but I don't have that setup.
242
00:19:55,640 --> 00:20:00,600
There's a bunch of nice talks out there on
voice coding. The top left is Tavis Rudd's
243
00:20:00,600 --> 00:20:06,110
talk from many years ago that got many of
us interested. Emily Shea gave a talk
244
00:20:06,110 --> 00:20:10,971
there about best practices for voice
coding. And then I gave a talk a couple of
245
00:20:10,971 --> 00:20:16,470
years ago at the HOPE 11 conference, which
you can also check out. It's mostly out of
246
00:20:16,470 --> 00:20:21,560
date by now, but it's still interesting.
So there are a lot of voice coding
247
00:20:21,560 --> 00:20:27,660
systems, the sort of grandfather of them
all is Dragonfly. It's become a grammar
248
00:20:27,660 --> 00:20:35,370
standard. Caster is if you're willing to
memorize lots of unusual words, you can
249
00:20:35,370 --> 00:20:40,950
become much better, much faster than I
currently am at voice coding. aenea is how
250
00:20:40,950 --> 00:20:45,710
you originally used Dragon to work on a
Linux machine, for example, because Dragon
251
00:20:45,710 --> 00:20:52,620
only runs on Windows. Talon is a closed
source program, which is, but it's very
252
00:20:52,620 --> 00:20:56,790
powerful. Has a big user base, especially
for Mac OS. There are ports now. And Talon
253
00:20:56,790 --> 00:21:04,640
used to use Dragon, but it's now using a
speech system from Facebook. Silvius is
254
00:21:04,640 --> 00:21:09,640
the system that I created, the models are
not very accurate, but it's a nice
255
00:21:09,640 --> 00:21:12,910
architecture where there's client-server,
so it makes it easy to build things like
256
00:21:12,910 --> 00:21:18,130
the Voice Next Page. So Voice Next Page
was using Silvius. And then the most
257
00:21:18,130 --> 00:21:22,390
recent one I think on this list is kaldi-
active-grammar, which is extremely
258
00:21:22,390 --> 00:21:26,420
powerful and extremely customizable. And
it's also open source. It works on all
259
00:21:26,420 --> 00:21:29,590
platforms. So I really highly recommend
that. So let's talk a bit more about
260
00:21:29,590 --> 00:21:35,300
kaldi-active-grammar. But first, for voice
coding, I've already mentioned, you have
261
00:21:35,300 --> 00:21:38,890
to be careful how you use your voice
right. Breathe from your belly. Don't
262
00:21:38,890 --> 00:21:42,180
tighten your muscles and breathe from your
chest. Try to speak normally. And I'm not
263
00:21:42,180 --> 00:21:45,230
particularly good at this. Like you'll
hear me when I'm speaking commands that my
264
00:21:45,230 --> 00:21:50,550
inflection changes. So I do tend to
overuse my voice, but you just have to be
265
00:21:50,550 --> 00:21:53,780
conscious of that. The microphone hardware
does matter. I do recommend like a Blue
266
00:21:53,780 --> 00:21:59,801
Yeti on a microphone arm that you can pull
and put close to your face like this. I
267
00:21:59,801 --> 00:22:04,340
will use this one for my speaking demo.
And, yeah. And the other thing is your
268
00:22:04,340 --> 00:22:08,190
grammar is fully customizable. So if you
keep saying a word and the system doesn't
269
00:22:08,190 --> 00:22:14,190
recognize it, just change it to another
word. And it's complete in the sense you
270
00:22:14,190 --> 00:22:17,680
can type any key on the keyboard. And the
most important thing for expert use or
271
00:22:17,680 --> 00:22:22,120
customizability is that you can do
chaining. So with the voice coding system,
272
00:22:22,120 --> 00:22:27,040
you can say multiple commands at once. If
there's, and it's a huge time saving,
273
00:22:27,040 --> 00:22:32,140
you'll see what I mean when I give a quick
demo. When I do voice coding, I'm a very
274
00:22:32,140 --> 00:22:39,150
heavy vim and tmux user. You know, there
have been I've worked with many people
275
00:22:39,150 --> 00:22:41,870
before, so I have some cheat sheet
information there. So if you're
276
00:22:41,870 --> 00:22:45,130
interested, you can go check that out. But
yeah, let's just do a quick demo of voice
277
00:22:45,130 --> 00:22:54,350
coding here. "Turn this mic on". "Desk left
two". "Control delta", "open new terminal".
278
00:22:54,350 --> 00:22:59,930
"Charlie delta space slash tango mike papa
enter". "Command vim". "Hotel hotel point
279
00:22:59,930 --> 00:23:08,720
charlie papa papa, enter". "India, hash
word include space langel", "india oscar word
280
00:23:08,720 --> 00:23:16,030
stream rangel, enter, enter", "india noi
tango space word mean", "no mike arch india
281
00:23:16,030 --> 00:23:23,750
noi space len ren space lace enter enter
race up tab word print fox scratch nope code
282
00:23:23,750 --> 00:23:31,080
standard charlie oscar uniform tango space
langel langel space quote. Sentence hello,
283
00:23:31,080 --> 00:23:40,250
voice coding bang, scratch six delta india
noi golf, bang, backslash, noi quote
284
00:23:40,250 --> 00:23:46,340
semicolon act sky fox mike romeo noi oscar
word return space number zero semicolon
285
00:23:46,340 --> 00:23:53,450
act vim save and quit. Golf plus plus
space hotel hotel tab minus oscar space
286
00:23:53,450 --> 00:24:03,840
hotel hotel enter. Point slash hotel hotel
enter. Desk right. So that's just a quick
287
00:24:03,840 --> 00:24:09,010
example of voice coding, you can use it to
write any programming language, you can use
288
00:24:09,010 --> 00:24:13,881
it to control anything on your desktop.
It's very powerful. It has a bit of a
289
00:24:13,881 --> 00:24:18,990
learning curve, but it's very powerful. So
the creator of kaldi-active-grammar is
290
00:24:18,990 --> 00:24:26,050
also named David. I'm named David, but
just a coincidence. And he says of kaldi-
291
00:24:26,050 --> 00:24:31,260
active-grammar, that I haven't typed with
the keyboard in many years and kaldi-
292
00:24:31,260 --> 00:24:35,640
active-grammar is bootstrapped in that I
have been developing it entirely using the
293
00:24:35,640 --> 00:24:42,490
previous versions of it. So, David has a
medical condition that means he has very
294
00:24:42,490 --> 00:24:48,270
low dexterity, so it's hard for him to use
a keyboard. And yet he basically got
295
00:24:48,270 --> 00:24:53,000
kaldi-active-grammar working through the
skin of his teeth or something and then
296
00:24:53,000 --> 00:24:58,710
continues to develop it using it. And
yeah, I'm a huge fan of the project. I
297
00:24:58,710 --> 00:25:02,640
haven't contributed much, but I did give
some of the hardware resources like GPU
298
00:25:02,640 --> 00:25:08,100
and CPU compute resources to allow
training to happen. But I would also like
299
00:25:08,100 --> 00:25:12,970
to show you a video of David using kaldi-
active-grammar, just, so you can see it as
300
00:25:12,970 --> 00:25:20,780
well. So, the other thing about David is
that he has a speech impediment or a
301
00:25:20,780 --> 00:25:25,000
speech, I don't know, an accent or
whatever. So it's difficult for a
302
00:25:25,000 --> 00:25:28,060
normal speech recognition system, to
understand him. And you might have trouble
303
00:25:28,060 --> 00:25:31,050
understanding him here. But you can see in
the lower right how the speech system
304
00:25:31,050 --> 00:25:37,390
understands what he's saying. Oh, I
realized that I do need to switch
305
00:25:37,390 --> 00:25:41,502
something in OBS, so that you guys can
hear it. Sorry. There you go.
306
00:25:41,502 --> 00:26:03,430
(Other) David using kaldi-active-grammar system (not understandable)
307
00:26:03,430 --> 00:26:05,900
Here, you get the idea and hopefully, you
308
00:26:05,900 --> 00:26:10,530
guys were able to hear that. If not, you
can also find this on the website that I'm
309
00:26:10,530 --> 00:26:18,350
going to show you at the end. One other
thing I want to show you about this is that
310
00:26:18,350 --> 00:26:23,010
David has actually set up this humming to
scroll, which I think is pretty cool. Of
311
00:26:23,010 --> 00:26:28,260
course, I've gone and turned off the OBS
there. But he's just doing hmmm and it's
312
00:26:28,260 --> 00:26:33,240
understanding that and scrolling down. So,
something that I'm able to do with my
313
00:26:33,240 --> 00:26:41,730
trackball, but he's using his voice for,
so pretty cool. So I'm almost done here.
314
00:26:41,730 --> 00:26:46,550
In summary, good input accessibility means
you need completeness, consistency and
315
00:26:46,550 --> 00:26:49,591
customization. You need to be able to do
any action that you could do with the
316
00:26:49,591 --> 00:26:55,110
other input mechanisms. And doing the same
input should have the same action. And
317
00:26:55,110 --> 00:27:00,210
remember, your users will become experts,
so the system needs to be designed for
318
00:27:00,210 --> 00:27:05,640
that. For e-book reading: Yes, I'm trying
to allow anyone to read, even if they're
319
00:27:05,640 --> 00:27:10,860
experiencing some severe physical or motor
impairment, because I think that gives you
320
00:27:10,860 --> 00:27:15,031
a lot of power to be able to turn the
pages and read your favorite books. And
321
00:27:15,031 --> 00:27:19,270
for speech recognition, yeah, Android
speech recognition is very good. Silvius
322
00:27:19,270 --> 00:27:23,490
accuracy is not so good, but it's easy to
use quickly for experimentation and to
323
00:27:23,490 --> 00:27:28,150
make other types of things like Voice Next
Page. And please do check out kaldi-
324
00:27:28,150 --> 00:27:33,850
active-grammar if you have some serious
need for voice recognition. Lastly, I put
325
00:27:33,850 --> 00:27:39,050
all of this onto a website, voxhub.io, so
you can see Voice Next Page, Blink Next
326
00:27:39,050 --> 00:27:42,100
Page, kaldi-active-grammar and so on, just
instructions for how to use it and how to
327
00:27:42,100 --> 00:27:47,130
set it up. So please do check that out.
And tons of acknowledgments, lots of
328
00:27:47,130 --> 00:27:50,030
people that have helped me along the way,
but I want to especially call out
329
00:27:50,030 --> 00:27:53,700
Professor Sang-Mook Lee, who actually
invited me to Korea a couple of times to
330
00:27:53,700 --> 00:27:58,140
give talks - a big inspiration. And of
course, David Zurow, who has actually been
331
00:27:58,140 --> 00:28:02,900
able to bootstrap into a fully voice
coding environment. So that's all I have
332
00:28:02,900 --> 00:28:07,300
for today. Thank you very much.
333
00:28:07,300 --> 00:28:15,600
Herald: Alright, I suppose I'm back on the
air, so let me see. I want to remind
334
00:28:15,600 --> 00:28:21,780
everyone before we go into the Q&A that
you can ask your questions for this talk
335
00:28:21,780 --> 00:28:25,880
on IRC, the link is under the video, or
you can use Twitter or the Fediverse with
336
00:28:25,880 --> 00:28:34,380
the hashtag #rc3two. Again, I'll hold it
up here, "rc3two".
337
00:28:34,380 --> 00:28:38,680
Thanks for your talk, David. That was
really interesting. Thanks for the talk,
338
00:28:38,680 --> 00:28:47,160
David. I, yeah, I think we have a couple
of questions from the Signal Angels.
339
00:28:47,160 --> 00:28:50,600
Before that, I just wanted to say I've
recently spent some time playing with a
340
00:28:50,600 --> 00:28:56,900
like the VoiceOver system in iOS and that
can now actually tell you what is on a
341
00:28:56,900 --> 00:29:03,210
photo, which is kind of amazing. Oh, by
the way, I can't hear you here on the
342
00:29:03,210 --> 00:29:05,470
Mumble.
David: Yeah. Sorry, I wasn't saying
343
00:29:05,470 --> 00:29:10,440
anything. Yeah, no, it's so I focused
mostly on input accessibility, right?
344
00:29:10,440 --> 00:29:13,890
Which is like how do you get data to the
computer. But there's been huge
345
00:29:13,890 --> 00:29:16,610
improvements in the other way around as
well, right? The computer doing VoiceOver
346
00:29:16,610 --> 00:29:19,150
things.
Herald: So we have about let's see,
347
00:29:19,150 --> 00:29:25,010
five-six minutes left at least for Q&A. We
have a question by Toby++, he asks: "Your
348
00:29:25,010 --> 00:29:29,080
next page application looks cool. Do you
have statistics of how many people use it
349
00:29:29,080 --> 00:29:35,650
or found it on the App Store?"
David: Not very many. The Voice Next Page
350
00:29:35,650 --> 00:29:40,950
was advertised only so far as a little
academic poster. So I've gotten a few
351
00:29:40,950 --> 00:29:46,310
people to use it. But I run eight
concurrent workers and we've never hit
352
00:29:46,310 --> 00:29:51,560
more than that. (laughs) So not super popular,
but I do hope that some people will see it
353
00:29:51,560 --> 00:29:54,891
because of this talk and go and check it out.
Herald: That's cool. Next question. How
354
00:29:54,891 --> 00:30:00,000
error prone are the speech recognition
systems at all? E.g., can you do coding
355
00:30:00,000 --> 00:30:06,490
while doing workouts?
David: So one thing about speech
356
00:30:06,490 --> 00:30:09,640
recognition is that it's very sensitive to the
microphone, so when you're doing it
357
00:30:09,640 --> 00:30:38,270
Technical malfunction. We'll be back soon.
358
00:30:38,270 --> 00:30:40,650
David (cont.): Any mistakes, right?
359
00:30:40,650 --> 00:30:43,830
That's the thing about having low latency,
you just say something and you watch it
360
00:30:43,830 --> 00:30:47,870
and you make sure that it was what you
wanted to say. I don't know exactly how
361
00:30:47,870 --> 00:30:52,010
many words per minute I can say with voice
coding, but I can say it much faster than
362
00:30:52,010 --> 00:30:55,500
regular speech. So I'd say at least like
200, maybe 300 words per minute.
363
00:30:55,500 --> 00:30:57,050
So it's actually a very high bandwidth
mechanism.
364
00:30:57,050 --> 00:31:02,590
Herald: That's really awesome. A question from
peppyjndivos: "Any advice for software
365
00:31:02,590 --> 00:31:07,760
authors to make their stuff more
accessible?"
366
00:31:07,760 --> 00:31:15,420
David: There are good web accessibility
guidelines. So if you're just making a
367
00:31:15,420 --> 00:31:19,240
website or something, I would definitely
follow those. They tend to be focused more
368
00:31:19,240 --> 00:31:24,350
on people that are blind because that is,
you know, it's more of an obvious fail,
369
00:31:24,350 --> 00:31:29,880
like they just can't interact at all with
your website. But things like, you know,
370
00:31:29,880 --> 00:31:36,580
if Duolingo, for example, had used the
same, like, the same accessibility access
371
00:31:36,580 --> 00:31:40,360
tag on their, like, next button, then they
would always be the same letter for me and
372
00:31:40,360 --> 00:31:46,400
I wouldn't have to be like Fox-Charlie,
Fox-Delta, Fox-something - changes all the
373
00:31:46,400 --> 00:31:51,850
time. So I think consistency is very
important. And integrating with any
374
00:31:51,850 --> 00:31:57,690
existing accessibility APIs is also very
important - Web APIs, Android APIs and so
375
00:31:57,690 --> 00:32:01,730
on, because, you know, we can't make every
program out there like voice compatible.
376
00:32:01,730 --> 00:32:05,360
We just have to meet in the middle where
they interact at the keyboard layer or the
377
00:32:05,360 --> 00:32:08,490
accessibility layer.
Herald: Awesome. AmericN has a question,
378
00:32:08,490 --> 00:32:13,730
wonders if these systems use similar
approaches like stenography with mnemonics
379
00:32:13,730 --> 00:32:18,530
or if there are any projects working with
that in mind.
380
00:32:18,530 --> 00:32:26,830
David: A very good question. So, the first
thing everyone uses is the NATO phonetic
381
00:32:26,830 --> 00:32:32,900
alphabet to spell letters, for example,
Alpha, Bravo, Charlie. Some people then
382
00:32:32,900 --> 00:32:38,910
will substitute letters for things that
are too long, like November. I use noi.
383
00:32:38,910 --> 00:32:41,690
Sometimes the speech system doesn't
understand you. Whenever I said Alpha,
384
00:32:41,690 --> 00:32:45,620
Dragon was like, oh, you're saying
"offer". So I changed it. It's Arch for
385
00:32:45,620 --> 00:32:53,300
me, Arch, Brav, Char. So, and also most of
these grammars are in a common grammar
386
00:32:53,300 --> 00:32:56,640
format. They are written in Python and
they're compatible with Dragonfly. So you
387
00:32:56,640 --> 00:33:00,920
can grab a grammar for, I don't know, for
Aenea and get it to work with kaldi-
388
00:33:00,920 --> 00:33:04,550
active-grammar with very little effort. I
actually have a grammar that works on both
389
00:33:04,550 --> 00:33:10,970
Aenea and kaldi-active-grammar, and that's
what I use. So there's a bit of a lingua
390
00:33:10,970 --> 00:33:14,060
franca, I guess, you can kind of guess
what other people are using. But at the
391
00:33:14,060 --> 00:33:19,190
same time there's a lot of customization,
you know, because people change words,
392
00:33:19,190 --> 00:33:23,160
they add their own commands, they change
words based on what the speech system
393
00:33:23,160 --> 00:33:27,150
understands.
Herald: Alright, LEB asks, is there an online
394
00:33:27,150 --> 00:33:32,130
community you can propose for
accessibility technologies?
395
00:33:32,130 --> 00:33:40,460
David: There's an amazing forum for anything
related to voice coding. All the
396
00:33:40,460 --> 00:33:51,560
developers of new voice coding software
are there. Sorry, I just need to drink. So
397
00:33:51,560 --> 00:33:56,760
it's a really fantastic resource. I do
link to it from voxhub.io. I believe it's
398
00:33:56,760 --> 00:34:01,690
at the bottom of the kaldi-active-grammar
page. So you can definitely check that
399
00:34:01,690 --> 00:34:07,450
out. For general accessibility, I don't
know, I could recommend the accessibility
400
00:34:07,450 --> 00:34:11,530
mailing list at Google, but that's only if
you work at Google. Other than that, yeah,
401
00:34:11,530 --> 00:34:16,240
I think it depends on your community,
right? I think if you're looking for web
402
00:34:16,240 --> 00:34:20,220
accessibility, you could go for some
Mozilla mailing list and so on. If you're
403
00:34:20,220 --> 00:34:24,509
looking for desktop accessibility, then
maybe you could go find some stuff about
404
00:34:24,509 --> 00:34:29,579
the Windows Speech API. (unintelligible)
Herald: One last question from Joe Neilson.
405
00:34:29,579 --> 00:34:34,730
Could there be legal issues if you make an
e-book into audio? I'm not sure what that
406
00:34:34,730 --> 00:34:42,849
refers to.
David: Yeah. So if you are like doing, if
407
00:34:42,849 --> 00:34:45,780
you're using a screen reader and you're
like, you try to get it to read out the
408
00:34:45,780 --> 00:34:55,059
contents of an e-book, right? So most,
most of the time there are fair use
409
00:34:55,059 --> 00:35:02,609
exceptions for copyright law, even in the
US, and making a copy yourself for
410
00:35:02,609 --> 00:35:08,661
personal purposes so that you can access
it is usually considered fair use. If you
411
00:35:08,661 --> 00:35:14,079
were trying to commercialize it or make
money off of that or like, I don't know,
412
00:35:14,079 --> 00:35:18,270
you're a famous streamer and all you do is
highlight text and have it read it out,
413
00:35:18,270 --> 00:35:21,280
then maybe, but I would say that
definitely falls under fair use.
414
00:35:21,280 --> 00:35:26,740
Herald: Alright. So I guess that's it for
the talk. I think we're hitting the timing
415
00:35:26,740 --> 00:35:30,380
mark really well. Thank you so much,
David, for that. That was really, really
416
00:35:30,380 --> 00:35:36,160
interesting. I learned a lot and thanks
everyone for watching and stay on. I think
417
00:35:36,160 --> 00:35:40,369
there might be some news coming up. Thanks,
everyone.
418
00:35:40,369 --> 00:35:55,640
rc3 postroll music
419
00:35:55,640 --> 00:36:18,549
Subtitles created by c3subtitles.de
in the year 2020. Join, and help us!