1
00:00:00,000 --> 00:00:17,860
35C3 Intro music
2
00:00:17,860 --> 00:00:23,065
Herald Angel: OK. So this talk is called
"A deep dive into the world of DOS
3
00:00:23,065 --> 00:00:33,500
viruses" and if you happened to be at the
8C3, that is 27 years ago, you would have
4
00:00:33,500 --> 00:00:38,599
seen a very young and awkward, even more
awkward than I am at the moment, version
5
00:00:38,599 --> 00:00:46,120
of myself, speaking on basically the same
subject. The stage of course was a lot
6
00:00:46,120 --> 00:00:50,491
smaller than this, this would have really
intimidated me back then, but I was
7
00:00:50,491 --> 00:00:55,160
talking about a university project that we
had run for about 3 years at that point,
8
00:00:55,160 --> 00:01:05,500
and our possibilities were very limited.
Meanwhile, 27 years later, our speaker, in
9
00:01:05,500 --> 00:01:13,040
between fighting battleships over the
public BGP network and trying to encode
10
00:01:13,040 --> 00:01:18,690
data in dubstep music, was able to
actually do all of the stuff that we were
11
00:01:18,690 --> 00:01:25,650
trying to do, with a lot of effort,
basically, and I guess 4 hours of CPU time
12
00:01:25,650 --> 00:01:32,610
or something like that. Please help me in
welcoming Ben to our stage, to talk about
13
00:01:32,610 --> 00:01:35,820
a bygone era.
Applause
14
00:01:35,820 --> 00:01:40,920
Applause
15
00:01:40,920 --> 00:01:48,340
Ben: Thank you. Hi, I'm Ben Cartwright-
Cox, as the slide suggests. So I have an
16
00:01:48,340 --> 00:01:53,100
admission to make: So this is a thing to
be aware of.
17
00:01:53,100 --> 00:01:56,970
Laughter
Ben: And you know, things also to be aware
18
00:01:56,970 --> 00:02:07,110
of. Anyway. So what is DOS? To get
straight into it. You can do it in a
19
00:02:07,110 --> 00:02:10,947
bullet points way. You know, DOS is an
upgrade from CP/M, another very old legacy
20
00:02:10,947 --> 00:02:14,819
system, but another thing to be aware of
is that DOS covers a wide range of
21
00:02:14,819 --> 00:02:19,950
vendors. Might not just be like those old
IBM PCs. Some of the DOSes had
22
00:02:19,950 --> 00:02:23,950
compatibility with each other, meaning
that some of the DOSes had shared malware
23
00:02:23,950 --> 00:02:31,390
with each other. But to be honest, most
people know DOS as these lovely old beige
24
00:02:31,390 --> 00:02:37,709
boxes; the same era gave us our loved
Model M keyboard. Hated by some, loved by
25
00:02:37,709 --> 00:02:42,840
others, for the sound. But, you know, most
people's knowledge of DOS came from
26
00:02:42,840 --> 00:02:59,599
computers, a user interface that looked
like this. Pretty basic. Okay so this is
27
00:02:59,599 --> 00:03:04,340
Wordstar, some of you may not know that
Game of Thrones was written on Wordstar.
28
00:03:04,340 --> 00:03:09,281
George R. R. Martin is apparently not a
big fan of modern word processing. He
29
00:03:09,281 --> 00:03:16,340
admitted he had some issue with disliking
how spell checking worked. So just uses,
30
00:03:16,340 --> 00:03:18,700
and I also guess it's a good security
quality, you know, you can't get hacked,
31
00:03:18,700 --> 00:03:24,680
if it literally has no Internet access.
So, also though, for a lot of people this
32
00:03:24,680 --> 00:03:28,310
is also their first experience into
programming. For some of the older
33
00:03:28,310 --> 00:03:36,500
crowd. This is also the invention of
QBasic, which, you know, gave a very basic
34
00:03:36,500 --> 00:03:40,940
language to program creatively in DOS. For
some people this was the gateway drug into
35
00:03:40,940 --> 00:03:47,160
programming and perhaps the gateway drug
into what they started as a career. For
36
00:03:47,160 --> 00:03:52,800
other people the experience of DOS was not
so great. For example, you know, let's
37
00:03:52,800 --> 00:03:57,640
just say you were doing some work in an
infinite loop and at some point stuff like
38
00:03:57,640 --> 00:04:04,001
this happens. Unfortunately I don't have
sound for this one, but you can just, in
39
00:04:04,001 --> 00:04:09,200
your head, imagine like our PC speakers
playing some small techno music, on like,
40
00:04:09,200 --> 00:04:14,310
you know, but only one frequency at a
time. This might get especially incredibly
41
00:04:14,310 --> 00:04:18,589
embarrassing, if you are in an office
environment, just slowly beeping away. You
42
00:04:18,589 --> 00:04:22,770
can't exit this. It has to finish fully and
if you touch the keyboard it reminds you
43
00:04:22,770 --> 00:04:30,069
not to touch the keyboard, and continues
playing this music. So, you know, this would be
44
00:04:30,069 --> 00:04:34,319
fun, but this wouldn't be fun, especially
in an office environment. But, you know,
45
00:04:34,319 --> 00:04:40,339
ultimately it's not malicious. And that
trend continues. This is another good
46
00:04:40,339 --> 00:04:45,240
example of a DOS virus. This is ambulance,
for when you run it, an ambulance just
47
00:04:45,240 --> 00:04:50,589
drives past and then your normal program
just continues running. I think this is
48
00:04:50,589 --> 00:04:56,729
amazing, it's an interesting era of
viruses. It was all, the history of it was
49
00:04:56,729 --> 00:05:01,270
collected very well by a website called VX
Heavens, which sort of still lives, but
50
00:05:01,270 --> 00:05:06,629
unfortunately, at one point was raided by
the Ukrainian police, for what is the
51
00:05:06,629 --> 00:05:11,469
fantastic wording they used. Basically,
someone told them they were distributing
52
00:05:11,469 --> 00:05:16,770
malware. Unfortunately not malware that
operates in this century. But I guess
53
00:05:16,770 --> 00:05:21,710
that's good enough for a raid. But luckily
for the archivists there are archivists of
54
00:05:21,710 --> 00:05:28,809
archivists, and so we have a saved capture
of VX Heavens. This is actually an old
55
00:05:28,809 --> 00:05:32,770
snapshot, there are way more modern
snapshots, but thankfully the MS-DOS virus
56
00:05:32,770 --> 00:05:38,189
era doesn't move very quickly. So, but the
interesting thing here is, like, there's
57
00:05:38,189 --> 00:05:44,349
66000 items in this tarball and it's 6.6
gigabytes of code. And these viruses are
58
00:05:44,349 --> 00:05:48,580
like super dense. There's not much to
them, like they are just blobs of machine
59
00:05:48,580 --> 00:05:51,520
code. They are not like your Electron app
these days that ships an entire Chrome
60
00:05:51,520 --> 00:05:57,219
browser, and normally an out of date
Chrome browser, you know, this is just
61
00:05:57,219 --> 00:06:00,429
basic, like, you know, how to draw an
ambulance and, you know, some infection
62
00:06:00,429 --> 00:06:06,629
routines. The normal distribution also
changes with it as well. For example, the
63
00:06:06,629 --> 00:06:11,059
normal lifecycle of an MS-DOS virus is,
you know, you download, or for some other
64
00:06:11,059 --> 00:06:17,560
reason run an infected program that
presumably does nothing; to you it looks
65
00:06:17,560 --> 00:06:22,129
like it does nothing, so, you know,
remains roughly undetected. Then you go
66
00:06:22,129 --> 00:06:27,830
and run more files, the DOS virus infects
more files and at some point you're
67
00:06:27,830 --> 00:06:31,069
probably going to give one of those
executables to some other computer, or some
68
00:06:31,069 --> 00:06:35,409
other person, whether it was by giving
someone or copying a floppy disk of some
69
00:06:35,409 --> 00:06:38,880
software, maybe some expensive software,
so they didn't have to pay for it, or
70
00:06:38,880 --> 00:06:44,900
uploading it to a BBS, where it could be
downloaded by many people. So the
71
00:06:44,900 --> 00:06:49,689
distribution mechanism is a far cry from
the EternalBlues of this era, where, you
72
00:06:49,689 --> 00:06:54,449
know, we can have a strain of malware
spread across the world very brutally,
73
00:06:54,449 --> 00:07:01,709
very quickly. So most DOS viruses are
pretty simple: They start, they say "have
74
00:07:01,709 --> 00:07:06,839
my payload conditions been met?" If not,
then they'll go on, and if they are
75
00:07:06,839 --> 00:07:11,799
met they'll go and display the payload.
And the payloads are definitely more,
76
00:07:11,799 --> 00:07:16,949
I don't know, nice. You know, you have stuff
like this, which is pretty and it uses VGA
77
00:07:16,949 --> 00:07:20,580
colors and all sorts of pretty nice stuff.
You get also some very demoscene vibes
78
00:07:20,580 --> 00:07:26,270
from this. Another good example is this
like VGA, like super trippy thing, which
79
00:07:26,270 --> 00:07:29,909
is really impressive, 'cause this is
really small. This is less than 1 kilobyte
80
00:07:29,909 --> 00:07:34,870
of code. It's in fact way less than 1
kilobyte, it's like 64 bytes. Or you just get
81
00:07:34,870 --> 00:07:38,591
like interesting screen effects as well.
For example, it's quick, but like, you can
82
00:07:38,591 --> 00:07:43,580
just watch the entire computer just
dissolve away, which also might be quite
83
00:07:43,580 --> 00:07:47,929
worrying, if you weren't expecting that.
Alternatively, if the payload conditions
84
00:07:47,929 --> 00:07:52,860
are not met, then, you know, you hook
syscalls and you, or alternatively, if you
85
00:07:52,860 --> 00:07:56,870
want to be way more aggressive, as a
malware author, you scan for files on the
86
00:07:56,870 --> 00:08:02,649
system to infect proactively. And the way
you infect DOS programs is pretty simple:
87
00:08:02,649 --> 00:08:07,219
Imagining you have like one giant tape of
all the code you have for the target
88
00:08:07,219 --> 00:08:11,499
program. Most of them work like this: They
replace the first 3 bytes of the program
89
00:08:11,499 --> 00:08:16,909
with an x86 jump. They append their malware
onto the end of the executable, and so the
90
00:08:16,909 --> 00:08:19,779
first thing that you do, when you run the
executable, is it jumps to the end of the
91
00:08:19,779 --> 00:08:25,489
file, effectively, runs the malware chunk,
and then it optionally will return control
92
00:08:25,489 --> 00:08:33,800
back to the original program. But there's
also the thing about hooking syscalls, right?
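As a rough illustration of the layout just described (a 3-byte jump at the start, with the added code at the end of the file), here is a minimal Python sketch that checks where a .COM image's first instruction jumps to. The function names and the "last half of the file" threshold are assumptions made for this example; it only recognises the near jump opcode 0xE9 and is a heuristic, not a detector.

```python
import struct

COM_LOAD_OFFSET = 0x100  # DOS loads a .COM image at offset 0x100 of its segment


def entry_jump_file_offset(image: bytes):
    """Return the file offset the first instruction jumps to, or None.

    Only the 3-byte near jump (opcode 0xE9 + signed 16-bit displacement)
    is recognised; anything else is reported as "no entry jump".
    """
    if len(image) < 3 or image[0] != 0xE9:
        return None
    (disp,) = struct.unpack_from("<h", image, 1)
    # The displacement is relative to the instruction after the jump.
    target = (COM_LOAD_OFFSET + 3 + disp) & 0xFFFF
    return target - COM_LOAD_OFFSET


def looks_appended(image: bytes, tail_fraction: float = 0.5) -> bool:
    """Heuristic: does the entry jump land in the last part of the file?"""
    offset = entry_jump_file_offset(image)
    if offset is None or not 0 <= offset < len(image):
        return False
    return offset >= len(image) * (1 - tail_fraction)
```

Plenty of clean programs also start with a jump, so a result of True only says that the file has the shape described in the talk.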
93
00:08:33,800 --> 00:08:39,219
So, you know, MS-DOS is an
operating system, it does have syscalls,
94
00:08:39,219 --> 00:08:43,779
programs can reach out to MS-DOS, to do
things like file access and stuff, so as
95
00:08:43,779 --> 00:08:48,990
you expect, you run a software interrupt
to get there. Thankfully though, MS-DOS
96
00:08:48,990 --> 00:08:55,829
does also allow you to extend MS-DOS by
adding handlers itself, or even
97
00:08:55,829 --> 00:08:59,029
overwriting existing handlers, which is
very convenient, if you are trying to
98
00:08:59,029 --> 00:09:02,160
write drivers, but it's also incredibly
convenient, if you're trying to write
99
00:09:02,160 --> 00:09:09,410
malware. For some of the examples of the
syscalls, most of them relevant towards
100
00:09:09,410 --> 00:09:15,530
DOS virus making. Here's a decent example
of the things that DOS will provide you. A lot
101
00:09:15,530 --> 00:09:21,180
of them are just very useful in general
for producing functional executables the
102
00:09:21,180 --> 00:09:25,660
end users want to use. This is what an
average program looks like. This is almost
103
00:09:25,660 --> 00:09:29,269
the shortest hello world you can make,
minus the actual hello world string. In
104
00:09:29,269 --> 00:09:34,870
fact, the hello world string might be the
largest part of this binary. It's a pretty
105
00:09:34,870 --> 00:09:40,480
simple binary. Here we're moving a
pointer to the message we just set. We
106
00:09:40,480 --> 00:09:50,410
then set the AH register to 9, or hex 9.
That's the syscall for printing a string,
107
00:09:50,410 --> 00:09:58,300
and then we run a software interrupt, 21h,
which is short for 21 hex, and we continue on.
108
00:09:58,300 --> 00:10:06,589
We then set AH again, to 4C, which is
exit with a return code, and the program
109
00:10:06,589 --> 00:10:12,439
will return. So, in the meantime, this is
roughly the loop that just happened.
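The slide itself is not part of this transcript, so the following is only a sketch of the program as described: load a pointer to the message into DX, set AH to 9, call interrupt 21h, set AH to 4Ch, call interrupt 21h again. The message text and the variable names are assumptions; the opcode bytes are the standard 16-bit x86 encodings for those instructions.

```python
MESSAGE = b"Hello, world!$"  # DOS function 09h stops printing at '$'

CODE_LENGTH = 11                       # total size of the five instructions below
MESSAGE_OFFSET = 0x100 + CODE_LENGTH   # .COM images are loaded at offset 0x100

program = bytes([
    0xBA, MESSAGE_OFFSET & 0xFF, MESSAGE_OFFSET >> 8,  # mov dx, offset message
    0xB4, 0x09,                                        # mov ah, 09h (print string)
    0xCD, 0x21,                                        # int 21h
    0xB4, 0x4C,                                        # mov ah, 4Ch (exit)
    0xCD, 0x21,                                        # int 21h
]) + MESSAGE
```

Because only AH is set before the second interrupt, the exit code is whatever happens to be in AL at that point. As the talk says, the string is the largest part of the binary: 14 of its 25 bytes.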
110
00:10:12,439 --> 00:10:18,470
You have your program code, that calls an
interrupt and that gets passed over to the
111
00:10:18,470 --> 00:10:22,189
interrupt handler. In the process of doing
this, the CPU has quickly looked at the
112
00:10:22,189 --> 00:10:28,430
first 1024 bytes of memory, the interrupt
vector table, IVT, as it's abbreviated,
113
00:10:28,430 --> 00:10:32,300
and then it's effectively a router. If
anyone has written like a small piece of
114
00:10:32,300 --> 00:10:36,149
code to route HTTP requests, or anything,
it's basically like that, but in the 80s,
115
00:10:36,149 --> 00:10:41,029
with syscalls. So it's just basically
saying "Compare this, compare that, jump
116
00:10:41,029 --> 00:10:46,240
there, jump there." Then the thing gets
passed to the call handler, it goes and
117
00:10:46,240 --> 00:10:49,740
does the syscall, the thing that was
required. Normally it will leave some
118
00:10:49,740 --> 00:10:55,130
registers behind, a state, or results of
actions it has performed, and it returns
119
00:10:55,130 --> 00:10:59,519
control back to the program. So,
theoretically speaking, if we wanted to go
120
00:10:59,519 --> 00:11:04,199
and look at what a program actually does
we need to set a break point here, because
121
00:11:04,199 --> 00:11:11,030
this is the only place that we can be sure
the location exists, because this is way
122
00:11:11,030 --> 00:11:15,760
before the era of ASLR, address space layout
randomisation, and this is way, way before
123
00:11:15,760 --> 00:11:19,819
the era of kernel space randomisation, in
fact, MS-DOS has almost no memory
124
00:11:19,819 --> 00:11:24,610
protection whatsoever. Once you run a
program you are basically putting the full
125
00:11:24,610 --> 00:11:29,430
control of the system to that program,
which means you can happily also boot
126
00:11:29,430 --> 00:11:33,870
things like Linux directly from a COM
file, which is handy if you want to
127
00:11:33,870 --> 00:11:43,860
upgrade. So, if we look at certain files
we can go and see what they do. So in this
128
00:11:43,860 --> 00:11:50,110
case, here is one example. This is a goat
file. A goat file is like a sacrificial
129
00:11:50,110 --> 00:11:54,699
goat. It is a file that is purely designed
to be infected. So what you do is you
130
00:11:54,699 --> 00:11:59,790
bring a virus into into memory in the
system and then you run a goat file, in
131
00:11:59,790 --> 00:12:03,879
the vague hope that the virus will infect
it, and then you have a nice clean sample
132
00:12:03,879 --> 00:12:08,450
of just that virus and not another program
inside the virus, which makes it way
133
00:12:08,450 --> 00:12:12,079
easier to test and reverse engineer. So,
we can see things are happening here. For
134
00:12:12,079 --> 00:12:16,600
example, we can see it opening a file,
moving like where it's looking into the
135
00:12:16,600 --> 00:12:19,770
file, reading some data from the file,
just 2 bytes, though, and it closes a
136
00:12:19,770 --> 00:12:23,839
file. We see the same sort of thing repeat
itself, except at one point it reads a
137
00:12:23,839 --> 00:12:27,529
large amount of data, moves the file
pointer, writes another large amount of
138
00:12:27,529 --> 00:12:32,769
data, does some more stuff, and yeah, we
parse some filenames, we display a string,
139
00:12:32,769 --> 00:12:39,230
which is almost definitely the goat file
message and yeah, we pretty much exit
140
00:12:39,230 --> 00:12:42,860
after that. So, there were a few syscalls
here that we would really like to know
141
00:12:42,860 --> 00:12:48,790
more about. So, for that, it's the open
files, we'd really like to know what files
142
00:12:48,790 --> 00:12:52,870
were being opened. We would also want to
know what, we'd like to know, what data
143
00:12:52,870 --> 00:12:55,950
was being written to the file, rather than
having to fish it out of the virtual
144
00:12:55,950 --> 00:13:00,550
machine later, and we'd also, just out of
curiosity, really want to know what
145
00:13:00,550 --> 00:13:05,420
filenames it was asking MS-DOS to parse.
Display string is also a nice test to
146
00:13:05,420 --> 00:13:08,519
know, whether your code is working. So to
do this you're gonna have to look a little
147
00:13:08,519 --> 00:13:14,529
bit deeper into how the MS-DOS runtime
and, by proxy, how x86 in 16-bit mode
148
00:13:14,529 --> 00:13:20,250
works, or legacy mode, I guess. This is
basically all the registers you have in
149
00:13:20,250 --> 00:13:26,120
16-bit mode, and some nice computations at
the bottom, to make it easier to read.
150
00:13:26,120 --> 00:13:33,550
So, as we mentioned, AH is the one that you
use to specify, which syscall you want,
151
00:13:33,550 --> 00:13:40,339
and you'll notice it's not there. AH is
actually the upper half of AX. AH is an
152
00:13:40,339 --> 00:13:46,320
8-bit register, because sometimes people
really just wanted only 8 bits. It's very
153
00:13:46,320 --> 00:13:53,579
obscure that we were saving that much
space. And so, this is what a, this is the
154
00:13:53,579 --> 00:13:57,660
definition of the syscall of a print
string. So you have AH needs to be set to
155
00:13:57,660 --> 00:14:02,839
9, this is once you, in order to call the
syscall for printing string, you set AH to
156
00:14:02,839 --> 00:14:09,070
9, and then you need to set DS and DX to a
pointer to a string that ends in a dollar.
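To make the print-string definition concrete, here is a minimal Python sketch that decodes such a call from a register snapshot. It assumes the snapshot is a dict with "ax", "ds" and "dx" keys and that guest memory is available as a bytes object; both are choices made for this example, not the format of any real debugger. The address arithmetic is the usual real-mode rule, segment times 16 plus offset, wrapped to 20 bits as on an 8086.

```python
def linear_address(segment: int, offset: int) -> int:
    """Real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


def decode_print_string(regs: dict, memory: bytes, limit: int = 256):
    """Return the '$'-terminated string for an AH=09h call, else None."""
    ah = (regs["ax"] >> 8) & 0xFF  # AH is the upper byte of AX
    if ah != 0x09:
        return None
    start = linear_address(regs["ds"], regs["dx"])
    end = memory.find(b"$", start, start + limit)
    if end == -1:
        return None  # no terminator within the limit
    return memory[start:end].decode("cp437")
```

The limit guards against a pointer into memory that contains no dollar sign nearby, which would otherwise return an arbitrarily long string.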
157
00:14:09,070 --> 00:14:11,890
And that doesn't make a lot of sense, or
it didn't make a lot of sense to me, when
158
00:14:11,890 --> 00:14:15,579
I first read that and so, to do this,
we need to learn a little bit more about
159
00:14:15,579 --> 00:14:19,730
how memory works, on these old CPUs, or
the CPUs that are probably in your
160
00:14:19,730 --> 00:14:25,720
laptops, but running in an older mode. So
this is effectively what it looks like.
161
00:14:25,720 --> 00:14:31,839
They have a 16-bit CPU, 2 to the 16 is 64
kilobytes, and we have a 20-bit memory
162
00:14:31,839 --> 00:14:36,350
addressing space. 2 to 20 is 1 megabyte,
so if you ever see an MS-DOS machine like
163
00:14:36,350 --> 00:14:39,519
limiting at 1 megabyte, or some old
operating system, saying like the maximum
164
00:14:39,519 --> 00:14:43,980
memory you can have is 1 megabyte, it's
because it's running in 16 bit mode. And
165
00:14:43,980 --> 00:14:50,249
the maximum it can physically see is 20
bits. So the question is: How do we
166
00:14:50,249 --> 00:14:58,580
address anything above 64K? If the CPU can
only fundamentally see 16 bits. So, this
167
00:14:58,580 --> 00:15:02,399
is where segment registers come in. We
have 4 segment registers, actually we
168
00:15:02,399 --> 00:15:05,899
might have more, but these are the ones we
need to care about. There's the code
169
00:15:05,899 --> 00:15:10,819
segment, the data segment, the stack
segment and the extra segment, in case you
170
00:15:10,819 --> 00:15:15,420
need just another one. So anyway, with
that in mind, let's have a quick crash
171
00:15:15,420 --> 00:15:21,419
course on segment registers. So, imagine
if you have a very long piece of memory,
172
00:15:21,419 --> 00:15:30,430
and we can only see 16 bits at a time. So,
however, we can move the sliding window
173
00:15:30,430 --> 00:15:36,180
around in the memory, to go and see, like,
to move our view of where it is. So, we
174
00:15:36,180 --> 00:15:42,410
can do this and put data around the
system, and we can use the final pointer
175
00:15:42,410 --> 00:15:48,589
to specify, how far in to the memory
segment we should go. So the DS and DX
176
00:15:48,589 --> 00:15:55,360
really just means a multiplier. So, where
the data segment is 100, you need to just
177
00:15:55,360 --> 00:16:01,350
move 100 times 16 to get to the correct
place in memory, and then DX is the
178
00:16:01,350 --> 00:16:09,170
offset. This continues on, so, where we
have a 16 bit cpu, we have a bunch of
179
00:16:09,170 --> 00:16:13,220
general use registers or general purpose
registers. They're quite useful for
180
00:16:13,220 --> 00:16:17,379
ensuring, you don't need to touch RAM too
often. x86 actually has a fairly small
181
00:16:17,379 --> 00:16:25,240
amount of general purpose registers. Some
architectures have way more. I think more
182
00:16:25,240 --> 00:16:32,139
modern chips like GPUs have hundreds, well
hundreds, maybe thousands. However, this
183
00:16:32,139 --> 00:16:34,699
doesn't really change over time in x86
because we have to force backwards
184
00:16:34,699 --> 00:16:38,139
compatibility. So, really what actually
ends up happening, when we move up the
185
00:16:38,139 --> 00:16:42,709
bittage, is that the same registers just
get wider, and we add some more ones for
186
00:16:42,709 --> 00:16:45,499
the programmers, that want them, and the
exact same thing happened to 64 bit: The
187
00:16:45,499 --> 00:16:52,970
registers just got wider. So thinking
about it, we have a lot of malware now,
188
00:16:52,970 --> 00:16:58,319
what if we want to know everything that's
happened in this entire archive. So we
189
00:16:58,319 --> 00:17:01,420
kind of want to trace all of these
automatically, but we might not know what
190
00:17:01,420 --> 00:17:04,480
we're looking for, so let's go through the
checklist of what we need to do, to trace
191
00:17:04,480 --> 00:17:09,335
all of this malware. We need to break
point on the syscall handler. When we get
192
00:17:09,335 --> 00:17:13,260
that breakpoint, we need to save all the
registers, so we know which syscall was
193
00:17:13,260 --> 00:17:19,880
run and potentially what data is being
given to the syscall. Ideally, we're going
194
00:17:19,880 --> 00:17:25,130
to save one hundred bytes from that data
pointer, not especially because we need
195
00:17:25,130 --> 00:17:28,149
it, but it's quite handy in a lot of
registers in a lot of syscalls. It's for
196
00:17:28,149 --> 00:17:34,429
example what you use to get the open file
path, when you're opening files. We should
197
00:17:34,429 --> 00:17:37,649
also, probably, record the screen for
quick analysis, rather than just staring
198
00:17:37,649 --> 00:17:43,870
at HTML tables, and so we can do that, we
burn a lot of CPU time and probably cause
199
00:17:43,870 --> 00:17:51,120
some minor amounts of environmental
damage. And we get nothing. We just run a
200
00:17:51,120 --> 00:17:55,080
bunch of stuff and most of them don't
return anything. At best they return a
201
00:17:55,080 --> 00:18:02,770
goat file string. They just do nothing.
So, if we look deeper into the reason why,
202
00:18:02,770 --> 00:18:05,490
it's sort of a smoking gun here, so we can
see the syscalls that run on this file
203
00:18:05,490 --> 00:18:09,840
that does nothing, and the smoking gun
here is the date. So it's asking for the
204
00:18:09,840 --> 00:18:15,190
date from the system, and this sort of
flags out the first issue, is that a lot
205
00:18:15,190 --> 00:18:18,750
of MS-DOS viruses don't really have a lot
to go on, because they have no internet
206
00:18:18,750 --> 00:18:24,180
connection, and there's not really any
other state they can decide to activate on.
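Spotting that smoking gun in a recorded trace can be sketched as below. The trace format, a plain list of AX values captured at each interrupt 21h breakpoint, is an assumption for this example; the function numbers are the standard DOS ones for "get date" (AH=2Ah) and "get time" (AH=2Ch).

```python
GET_DATE = 0x2A  # INT 21h, AH=2Ah: returns CX=year, DH=month, DL=day
GET_TIME = 0x2C  # INT 21h, AH=2Ch: returns CH=hour, CL=minute, DH=second


def clock_queries(ax_values):
    """Return (index, name) for every date or time request in a trace."""
    names = {GET_DATE: "get date", GET_TIME: "get time"}
    hits = []
    for index, ax in enumerate(ax_values):
        ah = (ax >> 8) & 0xFF  # the syscall number lives in the upper byte
        if ah in names:
            hits.append((index, names[ah]))
    return hits
```

A sample whose trace produces any hits here is a candidate for the date-dependent behaviour discussed next.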
207
00:18:24,180 --> 00:18:28,600
So the date syscall is pretty simple.
The get date and get time just return all
208
00:18:28,600 --> 00:18:34,360
of their values as registers. And, you
know, some using the 8-bit halves, to save
209
00:18:34,360 --> 00:18:44,970
space. So, a naive way of doing this, is
what we do, is we would run the sample,
210
00:18:44,970 --> 00:18:50,030
we'd wait for the syscall for date or
time, we would just fiddle the values,
211
00:18:50,030 --> 00:18:53,240
'cause in this case we're using a debugger,
so we can automatically change, what the
212
00:18:53,240 --> 00:18:56,760
state registers are, and we can then
observe to see, if any of the syscalls
213
00:18:56,760 --> 00:18:59,580
that the program ran changed, which is a
pretty good indication that you've hit
214
00:18:59,580 --> 00:19:04,330
some behavior that is different. And then,
you know, we can say "Hooray, we found a
215
00:19:04,330 --> 00:19:08,330
new test case!" The downside is: running
every one of these samples takes 15
216
00:19:08,330 --> 00:19:13,940
seconds of CPU-time because MS-DOS, well,
15 seconds of wall-time, which,
217
00:19:13,940 --> 00:19:18,080
when you are emulating MS-DOS is 15
seconds of CPU-time because of the fact
218
00:19:18,080 --> 00:19:20,610
that MS-DOS doesn't have power saving
mode, so when it's not doing anything, it
219
00:19:20,610 --> 00:19:27,120
just goes into a busy loop which makes it
very hard to optimize. Or we could take a
220
00:19:27,120 --> 00:19:33,350
cleverer look. So when we think about it,
we are in the interrupt handler where all
221
00:19:33,350 --> 00:19:36,830
we ever see is the insides of the
interrupt handler because we don't know
222
00:19:36,830 --> 00:19:40,990
where the program code is. The interrupt
handler is the only place that we know is
223
00:19:40,990 --> 00:19:45,450
consistent because MS-DOS could
potentially load the code for the malware
224
00:19:45,450 --> 00:19:50,610
or the program anywhere. But we want to
know where the code is. It would be really
225
00:19:50,610 --> 00:19:54,250
handy to know what the code is that we'd
be about to run. So for this we need to
226
00:19:54,250 --> 00:19:59,190
look towards the stack. Just like the DS and
DX registers, the stack is located by a
227
00:19:59,190 --> 00:20:02,970
stack segment and a stack pointer.
Luckily, the first two values on it are the
228
00:20:02,970 --> 00:20:07,130
return instruction pointer and the code
segment, so we can use that to grab
229
00:20:07,130 --> 00:20:10,779
exactly what code will be run
afterwards. So we just need to add a few
230
00:20:10,779 --> 00:20:14,440
things to our checklist. We need to grab 4
bytes from the stack pointer and then
231
00:20:14,440 --> 00:20:18,370
using that, we can calculate the
destination that the syscall will return
232
00:20:18,370 --> 00:20:22,549
to. And if we look at some of them - we
can look at an example here - well, this
233
00:20:22,549 --> 00:20:27,243
is what a piece of what one of the calls
returns to us. So we see we're running a compare
234
00:20:27,243 --> 00:20:36,640
on DL against the HEX of 0x1E. And then
if that comparison is equal it will
235
00:20:36,640 --> 00:20:43,171
jump to 1 memory address. And if not it
will jump to another. So if we look back
236
00:20:43,171 --> 00:20:52,560
at the definition of those syscalls we can
see that DL is the day. So with this we
237
00:20:52,560 --> 00:21:01,150
can conclude that, if 0x1E is 30 and DL
is the day this malware effectively is
238
00:21:01,150 --> 00:21:07,120
saying if the day of month is 30 we need
to go down a different path. If we run
239
00:21:07,120 --> 00:21:11,950
these all over time across the whole
dataset what we see is roughly this as a
240
00:21:11,950 --> 00:21:21,740
polydome bar chart. We see out of the 17,500
samples we have around 4,700 of them
241
00:21:21,740 --> 00:21:24,330
checked for the date and time and these
are the ones that are really tricky
242
00:21:24,330 --> 00:21:27,590
because they're really hard to activate.
They're also the most interesting though, because
243
00:21:27,590 --> 00:21:33,900
those are the ones trying to hide. So, with
that in mind, we need to, we have the code
244
00:21:33,900 --> 00:21:38,100
segment that we're about to run, when we
return and we can't really brute force
245
00:21:38,100 --> 00:21:43,730
because it takes a lot of CPU time and we
can't brute force it inside a 'real' or
246
00:21:43,730 --> 00:21:47,419
emulated machine but we can brute force it
in a significantly more interesting way.
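The two steps described above, reading the return address off the stack and then checking which branch each date value takes, can be sketched in Python as follows. This is a toy written for illustration, not the speaker's emulator: it understands only `cmp dl, imm8` (bytes 80 FA ib) and the short jumps `jz` (74) and `jnz` (75), tries only the day of the month, and ends the simulation on anything else. All function names are assumptions.

```python
import struct
from collections import defaultdict


def return_address(stack_top: bytes) -> int:
    """Linear address an interrupt returns to.

    INT pushes FLAGS, CS, then IP, so the first 4 bytes at SS:SP
    are IP followed by CS, both little-endian.
    """
    ip, cs = struct.unpack_from("<HH", stack_top, 0)
    return ((cs << 4) + ip) & 0xFFFFF


def trace_path(code: bytes, dl: int, max_steps: int = 64):
    """Follow compare-and-branch instructions, logging each offset visited."""
    pc, zero_flag, path = 0, False, []
    for _ in range(max_steps):
        if not 0 <= pc < len(code):
            break
        path.append(pc)
        chunk = code[pc:pc + 3]
        if len(chunk) == 3 and chunk[0] == 0x80 and chunk[1] == 0xFA:
            zero_flag = dl == chunk[2]          # cmp dl, imm8
            pc += 3
        elif len(chunk) >= 2 and chunk[0] in (0x74, 0x75):
            rel = chunk[1] - 256 if chunk[1] >= 128 else chunk[1]
            taken = zero_flag if chunk[0] == 0x74 else not zero_flag
            pc += 2 + (rel if taken else 0)     # jz / jnz rel8
        else:
            break  # unsupported opcode: end the simulation
    return tuple(path)


def group_days_by_path(code: bytes):
    """Map each distinct code path to the days of the month that reach it."""
    groups = defaultdict(list)
    for day in range(1, 32):
        groups[trace_path(code, day)].append(day)
    return dict(groups)
```

Days that share a path behave identically as far as this snippet can tell, so only one day from each group needs a full run in the real emulated machine.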
247
00:21:47,419 --> 00:21:53,960
We need to build something: we need to
build the world's worst x86 emulator so
248
00:21:53,960 --> 00:22:02,019
dubbed BenX86, it's 16-bit only. Any
attempt to access memory effectively ends
249
00:22:02,019 --> 00:22:06,029
the simulation. It's got a fake stack if
you try and push something onto the stack
250
00:22:06,029 --> 00:22:09,640
it says sure, fine if you try and pop it
it's like oh actually I never held any of
251
00:22:09,640 --> 00:22:13,690
that data anyway so we are ending the
simulation. 80 opcodes, most of them are
252
00:22:13,690 --> 00:22:18,900
jumps. Because that's the primary
purpose, comparing and jumping. The
253
00:22:18,900 --> 00:22:23,630
difference is it logs every opcode every
address that it went through and it can be
254
00:22:23,630 --> 00:22:29,210
run with just a small x86 code segment and
a register snapshot. This means that we
255
00:22:29,210 --> 00:22:34,909
can test all dates from 1980 to 2005 in
roughly 100 milliseconds, and most
256
00:22:34,909 --> 00:22:40,860
programs ended up having just 3 different
code paths on average so that yields us
257
00:22:40,860 --> 00:22:48,019
with 17,000 virus samples and about 10,000
of samples that had date variations as in:
258
00:22:48,019 --> 00:22:53,539
Once you exploit the complexity. So I'm
going to now use my final remaining time
259
00:22:53,539 --> 00:22:59,769
to go through some of my favorites. So
this is an example of a virus that just
260
00:22:59,769 --> 00:23:04,440
doesn't do anything on the 1st of 1980.
However if you'd happen to be running this
261
00:23:04,440 --> 00:23:08,477
on New Year's Day you would get this.
Laughter
262
00:23:08,477 --> 00:23:10,610
No matter what you do, every program you can't
263
00:23:10,610 --> 00:23:14,940
exit out of this, your machine is hung. This
might be great, right? You might be like:
264
00:23:14,940 --> 00:23:19,040
'Oh cool, I don't need to do work anymore
because my computer will literally not let me'
265
00:23:19,040 --> 00:23:21,049
This also might be terrible, because
you might need to do some work on New
266
00:23:21,049 --> 00:23:28,100
Year's day. Here's another example. This
does nothing as well just another innocent
267
00:23:28,100 --> 00:23:33,600
.com file. Of course, remember, these
pieces of malware will be wrapped around
268
00:23:33,600 --> 00:23:37,620
something else. Almost anything could be
infected in here. In this case though
269
00:23:37,620 --> 00:23:46,880
this binary is nice and stripped down.
However instead we get this, which I think
270
00:23:46,880 --> 00:23:53,564
is super interesting and is basically the
author is aware - they're telling you they
271
00:23:53,564 --> 00:23:57,110
are actually like self disclosing in
saying the previous year I've infected
272
00:23:57,110 --> 00:24:04,800
your computer. And for some reason it's
being nice. They're just saying. Actually
273
00:24:04,800 --> 00:24:11,580
you have been infected. And as a - I guess a
pity - I'm just going to remove myself now.
274
00:24:11,580 --> 00:24:17,120
I don't really. For some reason it's also
encouraging you to buy McAfee. This is
275
00:24:17,120 --> 00:24:26,179
back in the day when John McAfee himself
actually wrote McAfee. Interesting times.
276
00:24:26,179 --> 00:24:33,059
Definitely interesting times. Here is
another example. This one I found
277
00:24:33,059 --> 00:24:41,450
particularly obscure. On the 8th of
November 1980 or any year I think actually
278
00:24:41,450 --> 00:24:51,110
it turns all zeroes on the system into
tiny little glyphs that say "hate" if
279
00:24:51,110 --> 00:24:54,760
anyone understands this I'd really like to
know like I've been thinking about this a
280
00:24:54,760 --> 00:25:01,950
lot. What does it mean? Is it an artistic
statement? Is it. I wish I knew.
281
00:25:01,950 --> 00:25:05,669
Someone in the audience: it says MATE
Ben: There could be a CCC variant that says
282
00:25:05,669 --> 00:25:12,630
MATE. Another good one in that it's the
last thing I ever want to see any program
283
00:25:12,630 --> 00:25:19,669
tell me is this one here where you run it
and it says "error eating drive C:". I
284
00:25:19,669 --> 00:25:25,070
never ever want an error in any program
that unexpectedly just says 'Sorry, I
285
00:25:25,070 --> 00:25:30,159
failed to remove your root file system,
don't know why, could you like change your
286
00:25:30,159 --> 00:25:35,940
settings so I can remove it?' Cheers. And
finally this is one of my absolute
287
00:25:35,940 --> 00:25:41,420
favorites in that it's just brilliant in
that it also stops you from running the
288
00:25:41,420 --> 00:25:46,490
program you want to run; it exits
prematurely. This is the virus version of
289
00:25:46,490 --> 00:25:50,607
the Navy SEAL copypasta. It says "I am an
assassin. I want to and I shall kill you."
290
00:25:50,607 --> 00:25:59,809
"I also hate Aladdin and I also will kill
it. I will eliminate you with ...". You know where
291
00:25:59,809 --> 00:26:04,880
this is going. It says "Fear
the virus that is more powerful than God."
292
00:26:04,880 --> 00:26:10,830
It only activates on one day though, so
it's fine. Thank you for your time. I know
293
00:26:10,830 --> 00:26:15,480
it's late and I will happily take any
questions or corrections if you know this
294
00:26:15,480 --> 00:26:27,029
topic better than me.
Applause
295
00:26:27,029 --> 00:26:33,410
Herald: This totally brings tears to my
eyes with nostalgia. So if there are any
296
00:26:33,410 --> 00:26:37,970
questions, we have microphones distributed around
the room; there's like 1, 2, 3, 4 and
297
00:26:37,970 --> 00:26:42,630
one in the back. We also have questions
perhaps from the internet. If you want to
298
00:26:42,630 --> 00:26:47,980
ask a question, come up to the microphone and
ask the question. Just as a reminder, a
299
00:26:47,980 --> 00:26:53,789
question is one or two sentences with a
question mark behind it and not a life
300
00:26:53,789 --> 00:27:00,840
story attached. So let's see what we have.
I'm going to start with microphone number
301
00:27:00,840 --> 00:27:04,470
1 just because I can see it easiest, let's
go for it.
302
00:27:04,470 --> 00:27:09,559
Microphone 1: Hi Ben, thanks for the talk.
Really interesting. My question would be
303
00:27:09,559 --> 00:27:16,297
did you do any analysis on what ratio of
the viruses was more artistic
304
00:27:16,297 --> 00:27:20,690
and which ones actually did damage?
Ben: So most of them surprisingly don't do
305
00:27:20,690 --> 00:27:26,450
damage. I actually really struggled to
find a date-varying sample that
306
00:27:26,450 --> 00:27:30,140
specifically activated on a certain day
and decided to delete every file. There
307
00:27:30,140 --> 00:27:35,259
are some very good ones; some of them
are like virus scanning utilities that just
308
00:27:35,259 --> 00:27:37,990
don't do anything on certain dates, and on
one day, while they're telling you all
309
00:27:37,990 --> 00:27:41,120
the files they are scanning, they're actually
telling you all the files they're
310
00:27:41,120 --> 00:27:46,120
deleting. So that's particularly cruel but
it's actually surprisingly hard to find a
311
00:27:46,120 --> 00:27:50,480
virus sample that actually was brutally
malicious. There were some that would just,
312
00:27:50,480 --> 00:27:53,910
you know, infect binaries, but it's very hard
to find one that I think was brutally
313
00:27:53,910 --> 00:27:58,100
malicious, which is a far cry from the days
well from the days that we live in right
314
00:27:58,100 --> 00:28:03,549
now, where we're taking down hospitals with
Windows bugs.
315
00:28:03,549 --> 00:28:09,210
Herald: As everybody is leaving the room,
please do it quietly. I see a question at
316
00:28:09,210 --> 00:28:12,200
(microphone) 3, on that side.
Microphone 3: Yes. Since a lot of
317
00:28:12,200 --> 00:28:19,970
industrial control systems still run DOS,
what's the threat from DOS malware that
318
00:28:19,970 --> 00:28:27,150
might be written today?
Ben: It's probably unlikely that an
319
00:28:27,150 --> 00:28:31,009
industrial control system that's running
DOS would come into contact with DOS malware.
320
00:28:31,009 --> 00:28:36,010
The only way I can think of is if a vendor,
or a factory or supplier or
321
00:28:36,010 --> 00:28:41,049
whatever, was basically downloading
warez onto industrial control
322
00:28:41,049 --> 00:28:47,419
boxes. I wouldn't be surprised but it
would be pretty irresponsible. But it
323
00:28:47,419 --> 00:28:52,510
would be quite surprising to find MS-DOS
malware today on industrial controllers
324
00:28:52,510 --> 00:28:57,110
that was installed recently and not just a
lingering infection from the last 20
325
00:28:57,110 --> 00:29:00,029
years.
Herald: Microphone 2
326
00:29:00,029 --> 00:29:05,000
Microphone 2: Did you find any conditions
that weren't date based?
Ben: Some of them do
327
00:29:05,000 --> 00:29:09,610
attempt to... some of them try and circumvent
the date recognition. Unfortunately it's
328
00:29:09,610 --> 00:29:12,809
very hard to brute force those. Some of
them install themselves as what's called
329
00:29:12,809 --> 00:29:19,710
a TSR, or Terminate and Stay Resident, which
basically means that they will exit out,
330
00:29:19,710 --> 00:29:23,750
run in the background and continuously ask
the actual system timer what time it is.
331
00:29:23,750 --> 00:29:27,639
It's a bit of a more risky strategy
because the system timer might not exist
332
00:29:27,639 --> 00:29:31,650
which would be unfortunate for the virus.
So definitely there are viruses that have
333
00:29:31,650 --> 00:29:38,340
way more complicated execution conditions.
I observed one sample that only activated
334
00:29:38,340 --> 00:29:43,850
after I believe it was something silly
like 100 keypresses which is very hard to
335
00:29:43,850 --> 00:29:49,770
automatically test. Those sort of viruses
require static analysis and statically
336
00:29:49,770 --> 00:29:54,480
analyzing 17,000 samples is a time-
consuming task.
337
00:29:54,480 --> 00:30:02,009
Herald: So we have a question from the Internet.
Signal Angel: Do you have the source? What
338
00:30:02,009 --> 00:30:07,990
is the source of the malware that you
analyzed here, is it published somewhere?
339
00:30:07,990 --> 00:30:13,400
Ben: You can still find dumps of VX
Heavens, and more modern dumps of VX
340
00:30:13,400 --> 00:30:17,990
Heavens, on popular torrent websites.
But I'm sure there are also copies
341
00:30:17,990 --> 00:30:21,399
floating about on non-popular torrent
websites.
342
00:30:21,399 --> 00:30:24,810
Laughter
Herald: Over to microphone 1.
343
00:30:24,810 --> 00:30:32,240
Microphone 1: Hi Ben. I'm Jope. Thank you
for your talk. I was wondering: did you
344
00:30:32,240 --> 00:30:36,639
learn anything from your studies of these
viruses that should be taught in modern
345
00:30:36,639 --> 00:30:42,820
day computer science classes, like a more
efficient sorting algorithm or some hidden
346
00:30:42,820 --> 00:30:47,080
gem that actually should be part of
computing these days?
347
00:30:47,080 --> 00:30:53,570
Ben: My primary takeaway was x86 was a
mistake.
348
00:30:53,570 --> 00:31:01,320
Laughter & applause
Herald: So I'm not seeing any more
349
00:31:01,320 --> 00:31:04,480
questions. Oh no, there is. OK, one more
question from the internet.
350
00:31:04,480 --> 00:31:11,389
Signal angel: Have you found malware
samples that did like try to detect dummy
351
00:31:11,389 --> 00:31:14,617
binaries or whatever, to avoid easy
analysis?
352
00:31:14,617 --> 00:31:20,007
Ben: Oh actually, that's a really good question.
So it is... it's complicated.
353
00:31:20,007 --> 00:31:24,580
So some viruses would... so, maybe let's be
354
00:31:25,027 --> 00:31:29,770
dangerous, let's try and go backwards on my
home-written presentation software. So
355
00:31:29,770 --> 00:31:41,160
humming Too many slides. I have
regrets. Yes. OK. Here we are. This slide.
356
00:31:41,160 --> 00:31:45,450
OK. So you know here I'm saying that the
malware infection goes to the end. Well
357
00:31:45,450 --> 00:31:49,850
some samples are really cool. They don't
change the size of the file. They just
358
00:31:49,850 --> 00:31:54,590
find areas in the files that are full of
null bytes and just say this is probably
359
00:31:54,590 --> 00:32:00,230
fine. I'm just going to put myself here
which may have unintended consequences. It
360
00:32:00,230 --> 00:32:04,960
may mean that if a program has, like, a statically
typed, statically defined byte array of
361
00:32:04,960 --> 00:32:10,039
like a certain size and the program is
relying on it being zeros when it accesses
362
00:32:10,039 --> 00:32:14,440
it for the first time it may get very
surprised to find some malware code in
363
00:32:14,440 --> 00:32:20,159
there. But generally speaking as far as
I'm aware, this deployment
364
00:32:20,159 --> 00:32:26,220
procedure works pretty well and actually
is very good at avoiding antivirus of the
365
00:32:26,220 --> 00:32:30,390
era, which would just be checking, like,
common system files and their size. And you
366
00:32:30,390 --> 00:32:35,059
know, if the size of COMMAND.COM increases,
then that's clearly bad news.
367
00:32:35,059 --> 00:32:38,450
Herald: We have a question on microphone
1.
368
00:32:38,450 --> 00:32:45,620
Microphone 1: Are there any viruses that
try to eliminate or manipulate virus
369
00:32:45,620 --> 00:32:48,970
scanners of the day?
Ben: Oh yeah. So a lot of the samples will
370
00:32:48,970 --> 00:32:52,960
actively go and look for files of other
anti-viruses.
371
00:32:52,960 --> 00:32:57,159
But I am generally under the impression
that it's kind of hard to find them. There
372
00:32:57,159 --> 00:33:01,750
weren't actually that many antivirus
products back in the day.
373
00:33:01,750 --> 00:33:06,410
I feel like, it was a bit of a niche thing to
be running. Microsoft did for a while ship
374
00:33:06,410 --> 00:33:14,330
their own antivirus with MS-DOS. So I
guess you know what's new is old. So there
375
00:33:14,330 --> 00:33:17,860
were antiviruses out there. I don't think
many of them were very effective.
376
00:33:17,860 --> 00:33:27,260
Herald: Any more questions? There, where?
Oh right. Another one from the Internet.
377
00:33:27,260 --> 00:33:32,049
It's interesting that the internet is
querying MS-DOS all the time. Go ahead.
378
00:33:32,049 --> 00:33:38,000
Signal angel: Did you do the diagrams by
hand or do you have a tool?
379
00:33:38,000 --> 00:33:42,559
Ben: So many hours. No. So there's a
couple of good tools to do it.
380
00:33:42,559 --> 00:33:46,429
asciiflow.org. I think is a fantastic
tool. I would highly recommend it. I think
381
00:33:46,429 --> 00:33:52,779
it's not maintained very well, though.
Herald: Microphone 1.
382
00:33:52,779 --> 00:33:55,519
Microphone 1: Are you publishing the tools
you wrote?
383
00:33:55,519 --> 00:34:02,429
Ben: I will be publishing the tools at
some point when they are less... when they
384
00:34:02,429 --> 00:34:08,320
are less ugly. I will be publishing all of
the automatic malware runs and the GIFs
385
00:34:08,320 --> 00:34:12,929
generated by them so that people can
easily search Google for the virus names
386
00:34:12,929 --> 00:34:16,890
and get like actual real time versions.
The hardest thing that I've found when
387
00:34:16,890 --> 00:34:21,710
looking at virus names was literally just
finding any information about them and one
388
00:34:21,710 --> 00:34:25,220
of the things I really wish existed at the
time of writing this talk, was being able
389
00:34:25,220 --> 00:34:29,580
to just query a name and be like oh yeah
this virus it looks like it does this.
390
00:34:29,580 --> 00:34:33,420
Herald: Since I saw microphone 1 first,
let's go with that.
391
00:34:33,420 --> 00:34:40,260
Microphone 1: Did you find any viruses
that had signatures in them, not signatures of
392
00:34:40,260 --> 00:34:43,520
today, but the name of the author, like he
was very proud of what he wrote?
393
00:34:43,520 --> 00:34:47,450
Ben: Yeah, there are some notable
examples. Quite a few of them will try and
394
00:34:47,450 --> 00:34:52,870
name - so DOS-viruses do like have
[incomprehensible] sample names in the same way
395
00:34:52,870 --> 00:34:57,470
that we'd still today give viruses names.
A lot of the time you will just encode a
396
00:34:57,470 --> 00:35:01,131
string that you want the virus to be
named, you know, somewhere in the file
397
00:35:01,131 --> 00:35:04,472
just a random string doing nothing. It's
like oh, ok, they clearly wanted the virus
398
00:35:04,472 --> 00:35:11,430
to be called Tempest. So that does happen.
One of the favorite examples is the Brain
399
00:35:11,430 --> 00:35:16,750
malware which literally encodes an address
and phone number of the author, I believe
400
00:35:16,750 --> 00:35:22,720
in Pakistan, and there's a fantastic mini
documentary by F-Secure where they go and
401
00:35:22,720 --> 00:35:25,850
visit the people who wrote it. It's a
super interesting watch and I would really
402
00:35:25,850 --> 00:35:29,990
recommend it.
Herald: Indeed it is. Microphone 2?
403
00:35:29,990 --> 00:35:36,260
Microphone 2: Did you have any chance to
look at any kind of viruses that did not
404
00:35:36,260 --> 00:35:42,330
modify the files themselves? For example,
one of the largest virus infections at the time was a
405
00:35:42,330 --> 00:35:46,080
virus called [incomprehensible] which modified
the master boot record.
406
00:35:46,080 --> 00:35:51,060
Ben: Yes, Master boot record, I did
consider. It was more of a time problem
407
00:35:51,060 --> 00:35:55,320
that I had in getting to the point where
you could brute force time and date
408
00:35:55,320 --> 00:36:01,020
combinations and look for master boot
record changes. It was really hard. I am
409
00:36:01,020 --> 00:36:06,610
super interested in reviewing what are in effect
the rootkits of the era. But yes, that's
410
00:36:06,610 --> 00:36:10,220
definitely something I will look into in
the future.
411
00:36:10,220 --> 00:36:14,410
Herald: And we have yet another question
from the Internet.
412
00:36:14,410 --> 00:36:17,400
Signal angel: And it's even from the same
guy.
413
00:36:17,400 --> 00:36:22,830
Ben: Oh damn.
Signal angel: Is the BenX86 software open-
414
00:36:22,830 --> 00:36:25,530
source, or can it be found on the web
somewhere?
415
00:36:25,530 --> 00:36:29,870
Ben: It probably will be. I wouldn't
expect it to work in, well, in any use-case
416
00:36:29,870 --> 00:36:36,360
though. It's effectively designed to like
not work correctly, right? Like what
417
00:36:36,360 --> 00:36:40,880
was the spec? It basically like fails at
every single awkward thing. I just went
418
00:36:40,880 --> 00:36:46,660
like oh that's fine. We're probably far
enough down there anyway. Are we? Be aware
419
00:36:46,660 --> 00:36:50,740
this is the feature list.
Herald: So is that a follow up question
420
00:36:50,740 --> 00:36:57,010
from the internet?
Signal angel: No it's a new one. I don't
421
00:36:57,010 --> 00:37:02,660
know how serious it is but would it be
possible or a good idea to use machine
422
00:37:02,660 --> 00:37:09,500
learning to create new DOS malware from
the existing samples?
423
00:37:09,500 --> 00:37:17,021
Laughter & applause
Ben: It would not be a good idea. But I
424
00:37:17,021 --> 00:37:24,230
like how you think.
Herald: Actually I saw somebody trying to
425
00:37:24,230 --> 00:37:27,640
use NLP to generate viruses but ok that's
enough for now.
426
00:37:27,640 --> 00:37:32,400
Ben: You could probably do Markov chains
with x86 to be honest. Please don't do
427
00:37:32,400 --> 00:37:34,530
that, please!
Herald: Don't try this at home.
428
00:37:34,530 --> 00:37:37,480
Ben: I have seen things I've seen. Just
please don't do that.
429
00:37:37,480 --> 00:37:43,461
Herald: So I think we've run out of
questions. Going once, going twice. Let's
430
00:37:43,461 --> 00:37:49,520
thank Ben for this marvelous retrospective
talk.
Big applause
431
00:37:49,520 --> 00:37:58,785
35C3 postroll music
432
00:37:58,785 --> 00:38:12,000
subtitles created by c3subtitles.de
in the year 2020. Join, and help us!