36c3 preroll music
Herald: Our next talk will be "The
Ultimate Acorn Archimedes Talk", which
will cover everything
about the Archimedes computer. There's a
promise in advance that there will be no
eureka jokes in it. Give a warm
welcome to Matt Evans.
applause
Matt Evans: Thank you. Okay. Little bit of
retro computing first thing in the
morning, sort of. Welcome. My name is Matt
Evans. The Acorn Archimedes was my
favorite computer when I was a small
hacker and I'm privileged to be able to
talk a little bit about it with you
today. Let's start with: What is an Acorn
Archimedes? So I'd like an interactive
session, I'm afraid. Please indulge me,
like a show of hands. Who's heard of the
Acorn Archimedes before? Ah, OK, maybe 50,
60%. Who has used one? Maybe 10%,
maybe. Okay. Who has programmed -
who has coded on an Archimedes? Maybe
half? Two, three people. Great. Okay.
Three. laughs Okay, so a small
percentage. I don't see these machines as
being as famous as say the Apple Macintosh
or IBM PC. And certainly outside of Europe
they were not that common. So this is kind
of interesting just how many people here
have seen this. So it was the first ARM-
based computer. This is an astonishingly
1980s picture - I think one of them is
drawing, actually. But it's not just the
first ARM-based machine; it's the machine
that the ARM was originally designed to drive.
It's a... Is that a comment for me?
Mic?
I'm being heckled already. It's only slide
two. Let's see how this goes. So it's a
two-box computer. It looks a bit like a
Mega ST, to me. It's a main unit with
the processor and disks and expansion
cards and so on. Now this is an A3000.
This is mine, in fact, and I didn't bother
to clean it before taking the photo. And
now it's on this huge screen. That was a
really bad idea. You can see all the
disgusting muck in the keyboard. It has a
bit of ink on it, I don't know why. But
this machine is 30 years old. And
this was luckily my machine, as I said, as
a small hacker. And this is why I'm doing
the talk today. This had a big influence
on me. I'd like to say as a person, but
more as an engineer, in terms of my
programming experience when I was learning
to program and so on. So I live and work
in Cambridge in the U.K., where this
machine was designed. And through a
funny sort of turn of events, I ended up
there and actually work in the building
next to the building where this was
designed. And a bunch of the people that
were on that original team that designed
this system are still around and
relatively contactable. And I thought this
is a good opportunity to get on the phone
and call them up or go for a beer with a
couple of them and ask them: Why are
things the way they are? There's all sorts
of weird quirks to this machine. I was
always wondering this, for 20 years. Can
you please tell me - why did you do it
this way? And they were a really good bunch
of people. So I talked to Steve Ferber,
who led the hardware design, Sophie
Wilson, who was the same with software.
Tudor Brown, who did the video system.
Mike Miller, the IO system. John Biggs and
Jamie Urquhart , who did the silicon
design, I spoiled one of the
surprises here. There's been some silicon
design that's gone on in building this
Acorn. And they were all wonderful people
that gave me their time and told me a
bunch of anecdotes that I will pass on to
you. So I'm going to talk about the
classic Arc. There's a bunch of different
machines that Acorn built into the 1990s.
But the ones I'm talking about started in
1987. There were two models, effectively a
low end and a high end. One had an option
for a 20-megabyte hard disk, at 2300
pounds, with up to 4 MB of RAM. They all share
the same basic architecture; they're all
basically the same. So the A3000 that I
just showed you came out in 1989. That was
the machine I had. That again is the same:
it had the memory controller slightly
updated and was slightly faster. They all had
an ARM 2. This was the released version of
the ARM processor designed for this
machine, at 8 MHz. And then finally in
1990, what I call the last of the classic
Arcs: the A540. This was the
top end machine - could have up to
16 MB of memory, which is a fair bit
even in 1990. It had a 30 MHz ARM 3. The
ARM 3 was the evolution of the ARM 2, but
with a cache and a lot faster. So this
talk will be centered around how these
machines work, not the more modern
machines. So around 1987, what else
was available? This is a random selection
of machines. Apologies if your favorite
machine is not on this list. It wouldn't
fit on the slide otherwise. So at the
start of the 80s, we had the exotic things
like the Apple Lisa and the Apple Mac.
Very expensive machines. The Amiga - I had
to put it in here. It started off relatively
expensive, though the Amiga 500 was, you
know, very good value for money, a very
capable machine. But I'm comparing the
Archimedes more to PCs and Macs, because
that was the sort of, you know, market it
was going for. And although it was an
expensive machine, compared to a Macintosh
it was pretty cheap. I even put the NeXT Cube on there,
I figured that... I'd heard that they were
incredibly expensive. And actually
compared to the Macintosh, they're not
that expensive at all. Well I don't know
which one I would have preferred. So the
first question I asked them - the first
thing they told me: Why was it built? I've
used them in school and as I said, had one
at home. But I was never really quite sure
what it was for. And I think a lot of the
Acorn marketing wasn't quite sure what it
was for either. They told me it was the
successor to the BBC Micro, this 8 bit
machine. Lovely 6502 machine, incredibly
popular, especially in the UK. And the
goal was to make a machine that was 10
times the performance of this. The
successor would be 10 times faster at the
same price. And the thing I didn't know is
they had been inspired. The team at Acorn had
seen the Apple Lisa and the Xerox Star,
which comes from the famous Xerox Alto,
Xerox PARC, first GUI workstation in the
70s, monumental machine. They'd been
inspired by these machines and they wanted
to make something very similar. So this is
the same story as the Macintosh. They
wanted to make something that was a desktop
machine for business, for office
automation, desktop publishing and that
kind of thing. But I never really
understood this before. So this
inspiration came from the Xerox machines.
It was supposed to be obviously a lot more
affordable and a lot faster. So this is
what happens when Acorn marketing gets
hold of this vision. So Xerox Star on the
left is this nice, sensible business
machine. Someone's wearing a nice, crisp
suit - bangs microphone - and it gets
turned into the very Cambridge tweed
version on the right.
It's apparently illegal to program one of
these if you're not wearing a top hat. But
no one told me that when I was a kid. And
my court case comes up next week. So
Cambridge is a bit of a funny place. And
for those that have been there, this picture on
the right sums it all up. So they began
Project A, which was to build this new
machine. And they looked at the
alternatives. They looked at the
processors that were available at that
time: the 286, the 68K, the National
Semiconductor 32016, which was an early
32-bit chip, a bit of a weird processor. And
they all had something in common:
they were ridiculously expensive and, in
Tudor's words, a bit crap. They weren't a
lot faster than the BBC Micro. They're a
lot more expensive. They're much more
complicated in terms of the processor
itself. But also the system around them
was very complicated. They need lots of
weird support chips. This just drove the
price up of the system and it wasn't going
to hit that 10 times performance, let
alone at the same price point. They'd
visited a couple of other companies
designing their own custom silicon. They
got this idea in about 1983. They were
looking at some of the RISC papers coming
out of Berkeley and they were quite
impressed by what a bunch of grad students
were doing - they had managed to get a working
RISC processor. And they went to Western
Design Center and looked at the 6502
successors being designed there. They had a
positive experience: they saw a bunch of
high school kids with Apple IIs doing
silicon layout, and they thought, "OK,
well". They'd never designed a CPU before
at Acorn. Acorn hadn't done any custom
silicon to this degree, but they were
buoyed by this and they thought, okay,
well, maybe RISC is the secret and we can
do this. And this was not really the done
thing in this timeframe, and not for a
company the size of Acorn, but they
designed their computer from scratch. They
designed all of the major pieces of
silicon in this machine. And it wasn't a
case of designing the ARM chip - "hey,
we've got a processor core, what should we
do with it?" - which is the framing that
ARM, and the history of that company, has
kind of benefited from. This was all about
designing the machine as a whole. They were
a tiny team - a handful of people, about a
dozen-ish,
that did the hardware design, a similar
sort of order for software and operating
systems on top, which is orders of
magnitude different from IBM and Motorola
and so forth that were designing computers
at this time. RISC was the key. It
needed to be incredibly simple. One of the
other experiences they had was they went
to a CISC processor design center. They
had a team of a couple of hundred people
and they were on revision H and it still
had bugs and it was just this unwieldy,
complex machine. So RISC was the secret.
Steve Furber has an interview somewhere.
He jokes about Acorn management giving him
the special sauce - two things
that no one else had: he had no people and
no money.
simple. It had to be built on a
shoestring, as Jamie said to me. So there
are lots of corners cut, but in the right
way. I would say "corners cut", that
sounds ungenerous. There's some very
shrewd design decisions, always weighing
up cost versus benefit. And I think they
erred on the correct side for all of them.
So Steve sent me this picture. He's
got a cameo here - that's the outline of
him in the reflection on the glass there.
He's got this up in his office. So he
led the hardware design of all of these
chips at Acorn. Across the top, we've got
the original ARM, the ARM 1, ARM 2 and the
ARM 3 - guess the naming scheme - and the
video controller, memory controller and IO
controller. You can sort of see their
relative sizes, and it's kind of pretty.
This was also an era of processor where you
could really point at it and say, "oh,
that's the register file, and you can see
the cache over there". You can't really do
that nowadays with modern processors. So
a bit about the specification - what it
could do, the end product. So I mentioned
they all had this ARM 2 at 8 MHz, up to 4
MB of RAM, and 26-bit addresses - remember
that, that's weird. A lot of 32-bit
machines had 32-bit addresses, or the ones
that we know today do. That wasn't the
case here, and I'll explain why in a
minute. The A540 had an updated CPU. The
memory controller had an MMU, which was
unusual for machines of the mid 80s. So it
could support, the hardware would support
virtual memory, page faults and so on. It
had decent sound: 8-channel sound,
hardware-mixed, in stereo. It was 8-bit,
but it was logarithmic - a bit
like u-law, if anyone knows that - instead
of linear PCM, so you got more precision at the
low end, and it sounded to me a little bit
like 12-bit PCM sound. So this was quite
good. Storage-wise, it's the same floppy
controller as the Atari ST. It's fairly
boring. Hard disk controller was a
horrible standard called ST506, MFM
drives, which were very, very crude
compared to disks we have today. Keyboard
and mouse, nothing to write home about. I
mean, it was a normal keyboard. It was
nothing special going on there. And
printer port, serial port and some
expansion slots which, I'll
outline later on. The thing I really liked
about the ARC was the graphics
capabilities. It's fairly capable,
especially for a machine of that era and
of the price. It just had a flat frame
buffer, so it didn't have sprites, which is
unfortunate. It didn't have a blitter or
bitplanes and so forth. But the upshot
of that is it's dead simple to program. It had
a 256 color mode, 8 bits per pixel, so
it's a byte, and it's all just laid out as
a linear string of bytes. So it was dead
easy to just write some really nice
optimized code to just blit stuff to the
screen. Part of the reason why there isn't
a blitter is actually the CPU was so good
at doing this. Colorwise, it's got
paletted modes out of a 4096 color
palette, same as the Amiga. It has this
256 color mode, which is different. The
big high end machines, the top end
machines, the A540 and the A400 series
could also do this very high res 1152 by
900, which was more of a workstation
resolution. If you bought a Sun
workstation - a Sun 3 - in those days, it could
do this and some higher resolutions. But
this was really not seen on computers that
you might have in the office or school, at
that end of the market. And
it's quite clever the way they did that.
I'll come back to that in a sec. But for
me, the thing about the ARC: For the
money, it was the fastest machine around.
It was definitely faster than 386s and all
the stuff that Motorola was doing at the
time by quite a long way. It is almost
eight times faster than a 68k at about the
same clock speed. And that's to do with its
pipelining, with it having a 32-bit
word, and a couple of other tricks.
I'll show you later on what the
secret to that performance was. It was about
minicomputer speed. And compared to some of
the other RISC machines at the time - it
wasn't the first RISC in the world, but it was
the first cheap RISC, and the first RISC
machine that people could feasibly buy and
have on their desks at work or in
education. If you compare it to
something like the MIPS or the SPARC, it
was not as fast as a MIPS or SPARC chip,
but it was also a lot smaller, a lot cheaper.
Both of those other processors had very
big dies. They needed other support chips.
They had huge packages, lots of pins, lots
of cooling requirements. So all this
really added up. So I priced up
a Sun 4 workstation at the time and
it was well over four times the price of
one of these machines. And that was before
you add on extras such as disks and
network interfaces and things like that.
So it's very good, very competitive for
the money. And if you think about building
a cluster, then you could get a lot more
throughput, you could network them
together. So this is about as far as I got
when I was a youngster - I wasn't brave
enough to really take the machine apart
and poke around. Fortunately, now it's 30
years old and I figure I'm qualified to do
this. I'm going to take it apart.
Here's the motherboard. Quite a nice clean
design. This was built in Wales, for anyone
that's been to the UK - very unusual these
days for anything to be built in the UK. It's
got several main sections around these
four chips. Remember the Steve photo
earlier on? This is the chipset: the ARM,
MEMC, VIDC, IOC. So the IOC side of things
happens over on the left, video and sound
in the top right. And the memory and the
processor in the middle. It's got a
megabyte onboard and you can plug in an
expansion for 4 MB. So memory map
from the software view. I mentioned this
26-bit addressing and I think this is one
of the key characteristics of one of these
machines. So you have a 64MB address
space, it's quite packed. That's quite a
lot of stuff shoehorned into here. So
there's the memory. The bottom half of the
address space is 32 MB. The
processor has got user space and a
privileged mode - it's got a concept of
privilege within the processor execution.
So when you're in user mode, you only get
to see the bottom half, and that's
virtually mapped. There's the MMU, that will
map pages into that space and then when
you're in supervisor mode, you get to see
the whole of the rest of the memory,
including the physical memory and various
registers up the top. The thing to notice
here is there's stuff hidden behind the
ROM; this address space is very packed
together. So there's a requirement
for control registers - for the memory
controller, for the video controller and
so on - and they're write-only registers in
ROM, basically. So you write to the ROM and
you get to hit these registers. Kind of
weird when you first see it, but it was
quite a clever way to fit this stuff into
the address space. So let's start with
the ARM1. So Sophie Wilson designed the
instruction set late 1983, Steve took the
instruction set and designed the top
level, the block, the micro architecture
of this processor. So this is the data
path and how the control logic works. And
then the VLSI team, then implemented this,
did their own custom cells. There's a
custom data path and custom logic
throughout this. It took them about a
year, all in - well, 1984, that sort of...
This Project A really kicked off early
1984, and this taped out first thing
early 1985. On the design process, the guys
gave me a little bit of... So Jamie
Urquhart and John Biggs gave me a bit of
an insight into how they worked on the
VLSI side of things. So they had an Apollo
workstation, just one Apollo workstation,
the DN600. This is a 68K based washing
machine, as Jamie described it. It's this
huge thing. It cost about £50,000.
It's incredibly expensive. And they
designed all of this with just one of
these workstations. Jamie got in at 5:00
a.m., worked until the afternoon and then
let someone else on the machine. So they
shared the workstation, they worked
shifts so that they could design this
whole thing on one workstation. So this
comes back to that. It was designed on a
bit of a shoestring budget. When they got
a couple of other workstations later on in
the project, there was an allegation that
the CAD software might not have been
licensed initially on the other
workstations. I can
neither confirm nor deny whether that's
true. So Steve wrote a BBC Basic
simulator for this when he was designing
the block-level microarchitecture; it ran on
his BBC Micro. So this could then run real
software. There could be a certain amount
of software development, but then they
could also validate that the design was
correct. There's no cache on this. This is
quite a large chip - 50 square
millimeters was the economic limit in
those days for this part of the market.
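That die-size economics comes up again later with MEMC; the non-linear cost effect can be sketched with a standard Poisson defect-yield model. The defect density here is purely illustrative, not Acorn's or their foundry's actual figure:

```python
import math

def yield_fraction(area_mm2, defects_per_mm2):
    """Poisson defect model: probability a die of the given area
    has zero fatal defects."""
    return math.exp(-defects_per_mm2 * area_mm2)

def relative_cost(area_mm2, defects_per_mm2):
    """Cost per *good* die: silicon area consumed divided by the
    fraction of dies that actually work."""
    return area_mm2 / yield_fraction(area_mm2, defects_per_mm2)

d = 0.01  # defects per mm^2 -- an assumed, illustrative number

# Doubling the die from 50 mm^2 to 100 mm^2 more than doubles the
# cost per working die: roughly 3.3x here, the non-linear effect.
print(relative_cost(100, d) / relative_cost(50, d))
```

With these numbers the ratio works out to 2·e^0.5 ≈ 3.3, which is why staying at or under the "economic limit" mattered so much on a shoestring budget.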
There's no cache - that also would have
been far too complicated. So this was
also, I think, quite a big risk, no pun
intended, the aim of doing this
with such a small team. They're all
very clever people, but they hadn't all
got experience of building chips before,
and I think they knew what they were up
against. And so not having a cache or
complicated things like that was the right
choice to make. I'll show you later that
that didn't actually affect things. So
this was a RISC machine. If anyone has not
programmed ARM in this room, then get out
at once. But if you have programmed ARM,
this is quite familiar, with some
differences. It's a classical
three-operand RISC; it's got a shift on one
of the operands for most of the
instructions. So you can do things like
multiplies by constants quite easily. It's
not purist RISC though: it does have load
and store multiple instructions. These
will, as the name implies, load or store a
number of registers in one go - one
register per cycle, but it's all done
through one instruction. This is not RISC.
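What that free shift on the second operand buys you can be sketched in Python; the ARM mnemonics in the comments show the idiomatic way a hand-coder would multiply by a constant without a multiply instruction (my example, not one from the talk):

```python
def shifted_add(rn, rm, shift):
    """One ARM data-processing instruction: Rd = Rn + (Rm << shift).
    The barrel shifter applies the shift in the same cycle as the
    add, so the shift is free."""
    return (rn + (rm << shift)) & 0xFFFFFFFF  # registers are 32-bit

def times_ten(x):
    # ADD Rt, Rx, Rx, LSL #2   ; Rt = x + 4x = 5x
    t = shifted_add(x, x, 2)
    # MOV Rd, Rt, LSL #1       ; Rd = 10x (MOV with a shifted operand)
    return shifted_add(0, t, 1)

print(times_ten(7))  # 70: a constant multiply in two one-cycle instructions
```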
Again, there's a good reason for doing
that. So when ARM1 comes back, it gets
plugged into a board that looks a bit like
this. This is called the A2P, the ARM second
processor. It plugs into a BBC Micro. It's
basically - there's a thing called the Tube,
which is sort of a FIFO-like arrangement.
The BBC Micro can send messages one way
and this can send messages back. The
BBC Micro has the discs, it has the I/O,
keyboard and so on. And that's used as the
host to download code into one
megabyte of RAM up here, and then you
can run the code on the ARM. So this was
the initial system, at 6 MHz. The
thing I found quite interesting about
this, I mentioned that Steve had built
this BBC Basic simulation, one of the
early bits of software that could run on
this. So he'd ported BBC Basic to ARM and
written an ARM version of it. The Basic
interpreter was very fast, very lean, and
it was running on this board early on.
They then built a simulator called ASIM,
which was an event-based simulator for
doing logic design, and all of the other
chips in the chipset were simulated
using ASIM on the ARM1, which is
quite nice. So this was the fastest
machine that they had around. They didn't
have, you know, the thousands of machines
in the cluster like you'd have in a
modern company doing EDA. They had
a very small number of machines, and these
were the fastest ones they had. So
ARM2 was simulated on ARM1, and so was the
rest of the chipset. So then ARM2 comes along.
So it's a year later, this is a shrink of
the design. It's based on the same basic
micro architecture but has a multiplier
now. It's a Booth multiplier, so it's at
worst case a 16-cycle multiply - it does two
bits per clock. Again, no cache. But one
thing they did add in ARM2 is banked
registers. Some of the processor modes I
mentioned - there's an interrupt mode; next
slide - will basically give you a
different view of the registers, which is
very useful. These were all validated at
8 MHz, so the product was designed for
8 MHz, and the company that built them
said, okay, put the stamp on the outside
saying 8 MHz. There are two
versions of this chip, and I've got a
suspicion that they're actually the same
silicon - they just tested a batch and
said: that one works at 10
or 12. So on my project list is
overclocking my A3000 to see how fast
it'll go and see if I can get it to 12 MHz.
Okay. So the banking of the registers.
ARM - even modern 32-bit ARM - has got two
types of interrupts: IRQ,
pronounced "erk" in English, and FIQ,
pronounced "fic" in English - I appreciate it
doesn't mean quite the same thing in
German. So I'll call it FIQ from here on in.
FIQ mode has this property where
the top half of the registers are effectively
different registers when you get into
this mode. So, first of all,
you don't have to back up those registers
in your FIQ handler. And
secondly, if you can write an FIQ handler
using just those registers - and there's
enough for doing most basic tasks - you
don't have to save and restore anything
when you get an interrupt. So this is
designed specifically to be very, very low
overhead interrupt mode. So I'm coming to
why there's a 26-bit address space - and
I found this link very unintuitive. So
unlike 32-bit ARM - the more modern,
1990s-onwards ARMs - the program counter
register, R15, doesn't just contain the
program counter, but also contains the
status flags and processor mode;
effectively, all of the machine state is
packed in there as well. So I asked the
question, well why, why 64 megabytes of
address space? What's special about 64.
And Mike told me, well, you're asking the
wrong question. It's the other way round.
What we wanted was this property that all
of the machine state is in one register.
So this means you just have to save one
register. Well, you know, what's the harm
in saving two registers? And he reminded
me of this FIQ mode: if you're
already in a state where you've really
optimized your interrupt handler so that
you don't need any other registers to deal
with, and you're not saving or restoring
anything apart from your PC, then saving
another register is 50 percent overhead on
that operation. So the prime motivator
was to keep all of the state in one word.
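The packing can be sketched like this; the bit layout follows the ARM2's 26-bit R15 (flags and mode wrapped around a 24-bit, word-aligned PC), though the model ignores pipeline details such as the PC reading ahead:

```python
# All of the machine state lives in R15 on a 26-bit ARM:
#   bits 31..28 : N Z C V condition flags
#   bit  27     : I (IRQ disable)      bit 26 : F (FIQ disable)
#   bits 25..2  : program counter (24 bits, word-aligned)
#   bits  1..0  : mode (0 user, 1 FIQ, 2 IRQ, 3 supervisor)

PC_MASK = 0x03FFFFFC

def pack_r15(pc, mode, flags=0):
    assert pc % 4 == 0 and pc < (1 << 26), "PC is word-aligned and 26-bit"
    return (flags << 26) | pc | mode

def unpack_pc(r15):
    return r15 & PC_MASK

# 24 bits of word-aligned PC -> exactly 64 MB of address space.
assert (1 << 24) * 4 == 64 * 1024 * 1024

r15 = pack_r15(pc=0x8000, mode=3, flags=0b100001)  # supervisor, N and F set
print(hex(unpack_pc(r15)))  # 0x8000
```

Because the flags travel with the PC, an interrupt handler saves and restores exactly one register, and any instruction that writes R15 can restore the whole machine state in one go.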
And then once you take all of the flags
away, you're left with 24 bits for a word
aligned program counter, which leads to
26-bit addressing. And 64 MB was then seen
as enough. There were
machines in 1985 that, you know, could
conceivably have more memory than that.
But for a desktop that was still seen as a
very large, very expensive amount of
memory. The other thing: you don't need to
invent another instruction to do
return-from-exception; you can return
using one of your existing instructions.
In this case, it's a subtract into the PC,
which looks a bit strange, but trust me,
it does the right thing. So the memory
controller. I mentioned the
address translation - this has an MMU in
it; in fact, it's the thing directly on the
left-hand side. I was
worried that these slides might not be the
right resolution and might be sort of too
small for people to see, but in fact it
being the size of a house is really useful
here. So the left-hand side of this chip
is the MMU. This chip is the same size as
ARM2 - yeah,
pretty much. So that's part of the reason
why the MMU is on another chip: ARM2 was
as big as they could make it to hit the
price point - has anyone here done
silicon design? As the area goes
up, your yield effectively goes down, and
it's a non-linear effect on
price. So the MMU had to be on a separate
chip, and it's half the size of that as
well. MEMC does the more mundane things,
like driving DRAM: it does refresh for
DRAM and it converts from linear addresses
into the row and column addresses which DRAM
takes. So the key thing about this
ARM-and-MEMC pairing - the key factor in
performance - is making use of memory
bandwidth. When the team had looked at all
the other processors in Project A before
designing their own, one of the things
they looked at was how well they utilized
DRAM, and the 68K and the National
Semiconductor chips made very,
very poor use of DRAM bandwidth.
Steve said, well, okay. The DRAM is the
most expensive component of any of these
machines and they're making poor use of
it. And I think a key insight here is if
you maximize that use of the DRAM, then
you're going to be able to get much higher
performance in those machines. And so it's
32 bits wide; the ARM is pipelined, so it
can do a 32-bit word every cycle. And it
also indicates whether it's doing
sequential or non-sequential addressing.
This then lets the MEMC decide whether to
do an N-cycle or an S-cycle - there's a
fast one and a slow one, basically. When
you access a new random address in DRAM,
you have to open that row, and that takes
twice the time: it's a 4 MHz cycle. But then once
you've accessed that address, and while
you're accessing linearly ahead of that
address, you can do fast-page-mode
accesses, which are 8 MHz cycles.
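A rough sketch of the arithmetic behind this, ahead of the speaker's own numbers: N-cycles at 4 MHz take 250 ns, S-cycles at 8 MHz take 125 ns; the couple of cycles of instruction overhead is my assumption, not a figure from the talk:

```python
N_CYCLE_NS = 250  # non-sequential access: open a new DRAM row (4 MHz)
S_CYCLE_NS = 125  # sequential access: fast page mode (8 MHz)

def stm_throughput_mb(n_regs, overhead_s_cycles=2):
    """MB/s achieved by a store-multiple of n_regs 32-bit registers:
    one N-cycle to open the row, then S-cycles streaming down the page.
    overhead_s_cycles stands in for fetch/address setup (assumed)."""
    ns = N_CYCLE_NS + (n_regs - 1 + overhead_s_cycles) * S_CYCLE_NS
    return n_regs * 4 * 1000 / ns  # bytes per microsecond == MB/s

def word_at_a_time_mb():
    """Worst case: every 32-bit access is to a random address."""
    return 4 * 1000 / N_CYCLE_NS

print(round(stm_throughput_mb(14)))  # ~26 MB/s of a ~32 MB/s peak
print(round(word_at_a_time_mb()))    # 16 MB/s if every access opens a row
```

Under these assumptions a 14-register store-multiple lands in the mid-twenties of MB/s, which matches the speaker's "about 25 out of 30" figure in the next breath.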
So ultimately, that's the reason
why these load store multiples exist. The
non-RISC instructions, they're there so
that you can stream out registers and back
in and make use of this DRAM bandwidth. So
store multiple: this is just a simple
calculation - for 14 registers, you're
hitting about 25 megabytes a second out of
30. So it's not 100%, but it's way
more than the tenth or the eighth
which a lot of the other processors
were achieving. So this was really good. This
is the prime factor of why this machine
was so fast. It's effectively the load store
multiple instructions and being able to
access the stuff linearly. So the MMU is
weird. It's not a TLB in the traditional
sense. TLBs today - if you take your
MIPS chip or something, where the TLB is
visible to software - will map a virtual
address into a chosen physical address;
you'll have some number of entries and you
more or less arbitrarily, you know, poke
an entry with the mapping set in it.
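A toy model of that conventional, software-visible TLB, next to the inverted lookup MEMC uses instead (simplified: the real MEMC has 128 entries, one per physical page, plus protection bits):

```python
PAGE = 32 * 1024  # MEMC page size with 4 MB fitted

# Conventional TLB: indexed by virtual page; software pokes in a
# (virtual page -> physical page) mapping of its choosing.
tlb = {0x100: 7}  # virtual page 0x100 lives in physical page 7

def tlb_translate(vaddr):
    phys = tlb.get(vaddr // PAGE)
    if phys is None:
        raise MemoryError("page fault")
    return phys * PAGE + vaddr % PAGE

# MEMC, upside down: one entry PER PHYSICAL PAGE records which virtual
# page currently owns it.  Every entry compares against the incoming
# address at once (a comparator per entry); the owner "lights up".
memc = [None] * 128  # index = physical page number
memc[7] = 0x100      # physical page 7 claims virtual page 0x100

def memc_translate(vaddr):
    vpage = vaddr // PAGE
    for phys, owner in enumerate(memc):  # parallel in hardware
        if owner == vpage:
            return phys * PAGE + vaddr % PAGE
    raise MemoryError("page fault")

addr = 0x100 * PAGE + 0x123
print(tlb_translate(addr) == memc_translate(addr) == 7 * PAGE + 0x123)  # True
```

The cost of the inversion is that the number of on-chip entries has to equal the number of physical pages, which is exactly the memory limit the talk arrives at.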
The MEMC does it upside down: it's
got a fixed entry for every
page of DRAM, and for each of those
entries, it checks an incoming address to
see whether it matches. So it has all of
those entries that we saw on the
chip diagram a couple of slides ago - that
big left-hand side had that big array. Each
of those is effectively just storing a
virtual address, matching against it with
a comparator, and then one of them
lights up and says: yes, it's mine. So
effectively, the physical page says "that
virtual address is mine", instead of the
other way round. So this also limits your
memory: you have to have
one of these entries on chip per page of
physical memory, and you don't want pages
to be enormous. If you do the
maths, 4 MB over 128 pages is a
32 KB page. If you don't want the page to
get much bigger than that - and trust me, you
don't - then you need to add more of these
entries, and those entries already take up
half the chip. So effectively, this is one of
the limits of why you can only have 4 MB
on one of these memory
controller chips. OK. So VIDC is the core
of the video and sound system. It's a set
of FIFOs and a set of digital-to-analog
converters for doing video and sound. You
stream stuff into the FIFOs and it does
the display timing and palette lookup and
so forth. It has the 8-bit mode I
mentioned, which is slightly strange. It also
has an output for a transparency bit: in
your palette you can set 12 bits of
color, but you can set a bit of
transparency as well, so you can do video
genlocking quite easily with this. So
there was a revision later on - Tudor
explains that the very first one had a bit
of crosstalk between the video and the
sound, so you'd get sound with noise on
it - basically video noise - and
it was quite hard to get rid of. So they
did this revision, and the way he fixed it
was quite cool. They shuffled the power
supply around and did all the sensible
engineering things. But he also filtered
off a bit of the noise that was being
output on the sound,
inverted it, and then fed that back in as
the reference current for the DACs. So
that's sort of self-compensating: it took
out the noise, a bit like noise-canceling
headphones. It was kind of a nice hack.
And that was VIDC1. OK, the final
one - I'm going to stop showing you chip
plots after this, unfortunately, so just
get your fill while we're here. And again,
I'm really glad this is enormous for the
people in the room, and maybe those zooming
in online. There's a cool little
Illuminati eye logo in the bottom left
corner. I feared that you weren't gonna
be able to see it and I didn't have time to
do a zoomed-in version, but OK. So IOC
is the center of the IO system: as much of
the IO system as possible - all the random
bits of glue logic to do things like
timing, since some peripherals are slower
than others - lives in IOC. It contains a
UART for the keyboard: the keyboard is
looked after by an 8051 microcontroller,
which is nice and easy - you don't have to
do scanning in software. The
microcontroller just sends
stuff up a serial port to this chip. So
the UART is the keyboard asynchronous receiver and
transmitter. It was at one point called
the fast asynchronous receiver and
transmitter; Mike got forced to change the
name. Not everyone has a 12-year-old's sense
of humor, but I admire his spirit. So the
other thing it does is interrupts: all the
interrupts go into IOC, which has masks
and consolidates them, effectively, for
sending an interrupt up to the ARM.
The ARM can then check the status and
respond quickly. So the eye of providence
there, the little logo I pointed out, Mike
said he put that in for future
archaeologists to wonder about. Okay.
That was it. I was hoping there'd be
this big back story about, you know, he
was in the Illuminati or something. Maybe
he is, but he's not allowed to say. Anyway, just
like the other dev board I showed you -
this one's the A500 2P - it's still a second
processor that plugs into a BBC Micro.
It's still got this host having disk
drives and so forth attached to it and
pushing stuff down the tube into the
memory here. But now, finally,
all of this - the chipset - is
assembled in one place. So this is
starting to look like an Archimedes. It's
got video out. It's got a keyboard
interface. It's got some expansion stuff.
So this is for bring-up and an early
software head start. But very shortly
afterwards, we got the A500, internal to Acorn. And
this is really the first Archimedes: the
prototype Archimedes. It's actually got a
gorgeous gray brick sort of look to it,
kind of concrete. It weighs like concrete,
too, but it has all the hallmarks. It's
got the IO interfaces, it's got the
expansion slots you can see at the back,
and it runs the same operating
system. Now, this was used for the OS
development. There were only a couple of
hundred of these made. Well, this is
serial 222, so this is one of the last,
I think. But yeah, only internal to
Acorn. There are lots of nice tweaks to this
machine. So the hardware team had designed
this; Tudor designed this as well as the
video system. And his A500
was the special one: he had a video
controller, he'd hand-picked one
of the VIDCs, so that instead of running
at 24 MHz it ran at 56, exploiting silicon
variation in manufacturing. So he found a
56 MHz part. I think it did
1024 x 768, which is way out
of reach for the rest of the Archimedes range.
So he had the really, really cool machine.
They also ran some of them at 12 MHz
instead of 8. This is a massive
performance improvement, but I think it used
expensive memory, which was kind of out of
reach for the product. Right. So
believe me, this is the simplified
circuit diagram. The technical reference
manuals are available online if anyone wants
the complicated one. The main parts of the
diagram are the ARM, MEMC, VIDC and some RAM,
and we'll have a little walk through them. So
the clocks are actually generated by the
memory controller; the memory controller gives
the clocks to the ARM. The main reason for
this is that the memory controller has to
do some slow things now and then: it has
to open pages of DRAM, do refresh cycles and
things. So it generates the clock, and it
pauses the CPU by stopping that clock
from time to time.
When you do a DRAM access, the address
bus runs along the top: the ARM outputs an
address that goes into the MEMC. The
MEMC does an address translation and
converts that into row and column
addresses suitable for DRAM. And then, if
you're doing a read, the DRAM outputs the
data onto the data bus, which the ARM then
sees. MEMC is the critical path on
this; the address flows through MEMC,
effectively. Notice that MEMC is not on
the data bus. It just gets addresses
flowing through it; this is important later
on. ROM is another slow thing.
Another reason why MEMC might slow down
an access from the CPU; it works in a
similar sort of way. There is also a
permission check done when you're doing
the address translation: user
permission versus OS, supervisor.
And this information is output as part
of the cycle when the ARM does that access.
If you miss in that translation, if you get
a page fault or permission fault, then an
abort signal comes back and you
take an exception,
and the ARM deals with that in software.
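The translate, check, abort flow just described can be sketched as a toy model. The page size is the MEMC-style 32 KB, but the tiny page table, names and field layout here are illustrative assumptions, not the real MEMC's:

```python
# Toy model of the MEMC-style flow: translate an address, check
# permissions, and raise an "abort" that the CPU would then handle
# in software. The table contents are illustrative only.

PAGE_SIZE = 32 * 1024  # MEMC used large pages, e.g. 32 KB

# logical page -> (physical page, user_accessible?)
page_table = {0: (5, True), 1: (9, False)}

class Abort(Exception):
    """Models the abort signal fed back to the ARM."""

def translate(addr, supervisor=False):
    page, offset = divmod(addr, PAGE_SIZE)
    entry = page_table.get(page)
    if entry is None:
        raise Abort(f"page fault at {addr:#x}")
    phys_page, user_ok = entry
    if not user_ok and not supervisor:
        raise Abort(f"permission fault at {addr:#x}")
    return phys_page * PAGE_SIZE + offset

print(hex(translate(0x100)))       # user access to page 0: fine
try:
    translate(PAGE_SIZE + 4)       # page 1 is supervisor-only
except Abort as e:
    print("abort:", e)             # exception handler in software
```

The key point the diagram makes is that all of this happens as the address flows through MEMC, with only the abort wire coming back to the CPU.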
The data bus is a critical path, and so
the IO stuff is buffered and kept away
from it. The IO bus is 16 bits; not a lot
of 32-bit peripherals were around
in those days, all the peripherals were 8 or
16 bits, so that's the right thing to do.
The IOC decodes that and there's a
handshake with MEMC if it needs more
time: if it's accessing one of the
expansion cards and the expansion card
has something slow on it, then that's dealt
with in the IOC. So I mentioned the
interrupt status that gets funneled into
IOC and then back out again. There's a
VSync interrupt, but not an HSync
interrupt; you have to use timers for that,
really annoyingly. There's a timer
available, clocked at 2 MHz; I think I had
that in a previous slide but forgot to
mention it. So if you want to do funny
palette-switching stuff or copper
bars or something, that's possible with the
timers. It's also a simple hardware mod to
make a real HSync interrupt as well;
there are some spare interrupt inputs on the
IOC, as an exercise for you. So the bit I
really like about this system: I mentioned
that MEMC is not on the data bus. The VIDC
is only on the data bus, and it doesn't
have an address bus either. The VIDC is the
thing responsible for turning the frame
buffer into video: reading that frame
buffer out of RAM, and so on. So how does it
actually do that RAM read without an
address? Well, the MEMC contains all of
the registers for doing this DMA: the
start of the frame buffer, the current
position and size, and so on. They all
live in the MEMC. So there's a handshake
where VIDC sends a request up to the MEMC
when its FIFO gets low. The MEMC then
actually generates the address into the
DRAM, the DRAM outputs the data, and
then the MEMC gives an acknowledge
to the ARM... excuse me - too many
chips. The MEMC gives an acknowledge to
VIDC, which then latches that data
into the FIFO. So this partitioning is
quite neat. All of the video DMA stuff
lives in MEMC, and there's this kind of
split across the two chips. For sound, I've
just highlighted one interrupt that comes
from MEMC. Sound works exactly the same way,
except there's a double-buffering scheme
that goes on: when one half of the buffer
becomes empty, you get an interrupt so you
can refill it, so you don't glitch your
sound. So this all works really very
smoothly. So finally, the high-res mono
thing that I mentioned before: it's quite a
novel way they did that. Tudor had realized
that with one external component, a
shift register running very fast, he
could implement this very high resolution
mode without really affecting the rest of
the chip. So VIDC still runs at
24 MHz, at sort of VGA resolution. It
outputs on a digital bus that was
originally for test: it outputs 4 bits, so 4
pixels in one chunk, at 24 MHz, and
this external component then shifts
that out at 4 times the speed. That's
one extra component. I mean, this is a
very cheap way of doing this. And as I
said, this high-res mode is very
unusual for machines of this era.
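The arrangement can be modelled roughly like this; the MSB-first bit order is an assumption for illustration, the clock figures are from the talk:

```python
# Toy model of the high-res mono path: VIDC emits 4 pixels per
# 24 MHz tick as one 4-bit chunk; an external shift register clocks
# them out serially at 4x that rate, giving a 96 MHz mono pixel
# stream without touching the rest of the chip.

VIDC_CLOCK_MHZ = 24
SHIFT_FACTOR = 4
PIXEL_RATE_MHZ = VIDC_CLOCK_MHZ * SHIFT_FACTOR  # 96 Mpixels/s

def shift_out(chunks):
    """Serialize 4-bit chunks MSB-first, one mono pixel per bit."""
    pixels = []
    for chunk in chunks:
        for bit in range(3, -1, -1):
            pixels.append((chunk >> bit) & 1)
    return pixels

print(PIXEL_RATE_MHZ)               # 96
print(shift_out([0b1010, 0b0011]))  # [1, 0, 1, 0, 0, 0, 1, 1]
```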
If anyone's got one of these top-end
machines and wants to try this trick,
please get in touch: I've got a feeling an
A500 will do 1280 x 1024 by
overclocking this. I think all of the
parts would survive it. But for some reason,
Acorn didn't support that on the board.
And finally, clock selection: VIDC, on
some of the machines, has quite a flexible
set of clocks for different resolutions,
basically. So MEMC is not on the data bus.
How do we program it? It's got registers
for DMA and it's got all this address
translation. So the memory map I showed
before has an 8 MB space reserved for
the address translation registers. It
doesn't have 8 MB of them; I mean, it
doesn't have two million 32-bit registers
behind there, which is a hint of what's
going on here. So what you do is you write
any value to this space, and you encode the
information that you want to put into one
of these registers in the address. So in this
address, the top three bits are 1 (it's
in the top 8 MB of the 64 MB
address space), and you format your
logical/physical page information in this
address, and then you write any byte,
effectively. This sort of feels
really dirty, but it's also really a very
nice way of doing it, because there's no
other space in the address map. And this
comes down to the price balance: it's not
worth having an address bus going into
MEMC, costing 32 more pins, just to write
these registers, as opposed to playing
this sort of trick. If you had that
address bus and data bus just for that,
then you'd have to go to a more
expensive package. And this was
really in their minds: a 68-pin chip
versus an 84-pin chip was a big deal.
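As a hedged sketch of this write-to-address trick: the base address below does match the top 8 MB of a 64 MB map, but the field positions are invented for illustration, not the real MEMC bit layout:

```python
# Toy model of programming a register through the address bus alone:
# the data you want to write is packed into the address, and the
# value on the data bus is ignored. Field positions are invented
# for illustration; the real MEMC bit layout differs.

TOP_8MB = 0x3800000  # top 8 MB of a 64 MB address space

def encode_mapping(logical_page, physical_page):
    """Build the magic address that programs one translation entry."""
    return TOP_8MB | (logical_page << 10) | physical_page

registers = {}

def bus_write(addr, data):
    """What the 'chip' sees: decode fields back out of the address."""
    if addr >= TOP_8MB:
        logical = (addr >> 10) & 0x1FFF
        physical = addr & 0x3FF
        registers[logical] = physical  # the data byte is ignored
    # ...a normal RAM write otherwise...

# "Write any byte" to the magic address to set up one page mapping
bus_write(encode_mapping(logical_page=42, physical_page=7), 0)
print(registers)  # {42: 7}
```

The point is that the chip never needs to see a register address on dedicated pins: the existing address bus carries both "which register" and "what value".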
So they really strived to make sure
everything was in the very smallest
package possible, and this system
partitioning effort led to these sorts of
tricks to then program it. So on the
A540, we get multiple MEMCs. Each one is
assigned a colored stripe here of the
physical address space. So you have a
16 MB space, each one looks after
4 MB of it. But then when you do a
virtual access in the bottom half of the
user space, regular program access, all of
them light up and all of them will
translate that address in parallel. And
one of them hopefully will translate and
then energize the RAM to do the read, for
example. When you put an ARM 3 in this
system, the ARM 3 has its cache, and then
the address leads into the MEMC. That
means that the address is being
translated outside of, or after,
the cache. So you're caching virtual
addresses, and as we all know, this is kind
of bad for performance, because whenever
you change that virtual address space, you
have to invalidate your cache. Or tag it,
but they didn't do that; there are other ways
of solving this problem. Basically, on this
machine, what you need to do is invalidate
the whole cache. It's quite a quick
operation, but it's still not good for
performance to have an empty cache. The
only DMA present in the system is for the
video and sound. I/O
doesn't have any DMA at all. And this is
another area where, as a younger engineer,
I thought: "crap, why didn't they have DMA?
That would be way better." DMA is the solution
to everyone's problems, as we all know.
And I think the quote on the right
ties in with the Acorn team's discovery
that all of these other processors needed
quite complex chipsets, quite expensive
support chips. The quote on the right
says that vendors were charging more for
their DMA devices even than the CPU. So not
having a dedicated DMA engine on board is a
massive cost saving. The comment I made on
the previous two slides about the system
partitioning, putting a lot of attention
into how many pins were on one chip versus
another, how many buses were going around
the place, applies here too: not having IOC
have to access memory was a massive saving
in cost, in the number of pins, for the
system as a whole. The other thing is that FIQ mode
was effectively the means for doing IO.
FIQ mode was therefore designed to be an
incredibly low-overhead way of doing
programmed IO: having the CPU do the
IO. So this was saying the CPU is
going to be doing all of the IO stuff, but
let's optimize it, let's make it
as good as it could be, and that's
what led to the programmed IO. Also
remember the ARM 2 didn't have a cache. If
you don't have a cache on your CPU, then
DMA is going to hold up the CPU anyway,
stealing its cycles, so DMA is not any
performance gain. You may as well get
the CPU to do it, and then get the CPU to
do it in the lowest-overhead way possible.
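Why FIQ is so low-overhead can be sketched in a toy model: the FIQ handler on ARM has its own banked registers (R8 to R14), so state like a buffer pointer lives in registers across interrupts, with no save/restore on entry. The class attributes below stand in for those banked registers; the device details are invented for illustration:

```python
# Toy model of FIQ-style programmed IO. On ARM, FIQ mode banks
# R8-R14, so the handler's working state persists between
# interrupts and nothing needs saving or restoring on entry.

class FiqHandler:
    def __init__(self, buffer_size):
        # "Banked registers": live across invocations,
        # never saved or restored.
        self.buffer = bytearray(buffer_size)
        self.ptr = 0

    def on_fiq(self, device_fifo):
        # Entered directly on the device interrupt: just move bytes.
        while device_fifo:
            self.buffer[self.ptr] = device_fifo.pop(0)
            self.ptr += 1

h = FiqHandler(8)
h.on_fiq([0x41, 0x42])   # first interrupt: two bytes arrive
h.on_fiq([0x43])         # next interrupt carries on where we left off
print(h.buffer[:h.ptr])  # bytearray(b'ABC')
```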
I think this can be summarized as bringing
the "RISC principles" to the system. The
RISC principles say, for your CPU, you
don't put anything in the CPU that you can
do in software; and this is saying, okay,
well actually software can do the IO just
as well, without a cache, as a DMA
system could. So let's get software to do that.
And I think this is kind of a nice way
of seeing it. This is part of the cost
optimization, for really very little
degradation in performance compared to
doing it in hardware. So this is an IO card.
These Eurocards are nice and easy. The
only thing I wanted to say here was: this
is my SCSI card, and it has a ROM on the
left-hand side. This is the
expansion ROM, basically many, many years
before PCI made this popular. Your drivers
are on this ROM. This is a SCSI disk
plugging into this, and you can plug this
card in and then boot off the disk; you
don't need any other software to make it
work. So this is just a very nice user
experience. There is no messing around
with configuring IO windows or interrupts
or any of the ISA sort of stuff that was
going on at the time. So to summarize some
of the hardware stuff that we've seen:
the ARM is pipelined, and it has the
load/store-multiple instructions, which
make for very high bandwidth utilization.
That's what gives it its high performance.
The machine was really simple. So
attention to detail about separating,
partitioning the work between the chips,
and reducing the chip cost as much as
possible, keeping that balanced, was really
a good idea. The machine was designed when
memory and CPUs were about the same speed,
so this is before that kind of flipped
over. An 8 MHz ARM 2 was
designed to use 8 MHz memory.
There's no need to have a cache at all on
there. These days it sounds really crazy
not to have a cache on the CPU, but if your
memory is not that much slower, then this
is a huge cost saving; it is also a risk
saving. This was the first real, proper CPU,
if we don't count ARM 1: ARM 1 was a
test, but ARM 2 is, you know, the
first product CPU. And having a cache on
that would have been a huge risk for a
design team that hadn't dealt with
structures that complicated at that
point. So that was the right
thing to do, I think.
And I talked about DMA. I've actually
come around on this: I thought this was crap,
and actually, I think this was a really
good example of balanced design. What's
the right tool for the job? Software is
going to do the IO, so let's make sure
that FIQ mode has as low an overhead
as possible. We
talked about system partitioning. The MMU:
I still think it's weird and
backward. I think there is a
strong argument, though, that a more
familiar TLB is massively complicated
compared to what they did here. And I
think the main drive here was not just
area on the chip, but also to make it much
simpler to implement. So it worked. And I
think they really didn't have
that many shots at doing this. This wasn't
a company or a team that could afford to
have many goes at this product. And I
think that says it all. I think they did a
great job. Okay. So the OS story is a
little bit more complicated. Remember,
it's gonna be this office automation
machine, a bit like a Xerox Star. It was
going to have this wonderful high-res mono
mode, and people were gonna be laser
printing from it. So, just like Xerox PARC,
Acorn started a Palo Alto-based research
center: Californians and beanbags writing
an operating system using a microkernel,
in Modula-2, all of the trendy boxes ticked
here for the mid-80s. It was, by the sounds
of it, a very advanced operating system; it
did virtual memory and so on. It was very
resource hungry, though, and it was never
really very performant. Ultimately, the
hardware got done quicker than the
software, and after a year or two,
management got the jitters. Hardware was
looming and said, well, next year we're
going to have the computer ready. Where's
the operating system? And the project got
canned. And this is a real shame. I'd love
to know more about this operating system.
Virtually nothing is documented outside of
Acorn. Even the people I spoke to didn't
work on this; it was a bunch of people in
California that kind of disappeared with
it. So if anyone has this software
archived anywhere, then get in touch.
The computer museum around the corner from
me is raring to go on that; that'd be a
really cool thing to archive. So anyway, they
had now a desperate situation. They had to
go to Plan B, which was: in under a year,
write an operating system for the machine
that was on its way to being delivered.
And it kind of shows. Arthur was, I mean, I
think the team did a really good job in
getting something out of the door in half
a year, but it was a little bit flaky.
RISC OS then developed from Arthur
a year later. I don't know if anyone's
heard of RISC OS, but Arthur is
very, very niche and basically got
completely replaced by RISC OS, because
it was a bit less usable than RISC OS.
Another really strong point this
had: it's quite a big ROM. So 2 MB going
up... sorry, 0.5 MB in the 80s, going
up to 2 MB in the early 90s.
There's a lot of stuff in ROM. One of
those things is BBC Basic 5. I know
it's 2019, and I know Basic is basic, but
BBC Basic is actually quite good: it has
procedures and it's got support for all
the graphics and sound. You could write GUI
applications in Basic, and a lot of people
did. It's also very fast; Sophie Wilson
wrote this very, very optimized Basic
interpreter. I talked about the modules
and podules; these are the expansion
ROM things, and a really great user
experience there. But speaking of user
experience, this was Arthur. I never used
Arthur; I just dug out a ROM and had a
play with it. It's bloody horrible. So that
went away quickly. At the time, also,
part of this emergency Plan B was to take
the Acornsoft team, who were supposed to
be writing applications for this, and get
them to quickly knock out an operating
system. So at launch, basically, this is
one of the only things that you could do
with the machine. There was a great demo
called Lander, of a great game called
Zarch, which is a 3D space game; you could
fly around. It didn't have serious business
applications. And, you know, there
was not much you could do with this
really expensive machine at launch, and
that really hurt it, I think. Then we
get RISC OS 2 in 1988, and this is now
looking less like a vomit sort of thing,
a much nicer machine. And then eventually
RISC OS 3. There was drag and drop between
applications, it's all multitasking,
it does outline-font anti-aliasing
and so on. So just lastly, I want to
quickly touch on the really interesting
operating systems: Acorn had a Unix
operating system. As well as being a
geek, I'm also a UNIX geek, and I've always
been fascinated by RISC iX. These machines
were astonishingly expensive. They were
the existing Archimedes machines with a
different sticker on; that's an A540 with
a sticker on the front. And this OS
was developed after the Archimedes was
already designed, so
there's a lot of stuff about the hardware
that wasn't quite right for a Unix
operating system. A 32K page size on a 4
megabyte machine really, really killed you
in terms of your page cache and that
kind of thing. They turned this into a bit
of an opportunity. At least they made good
on some of this. There was a quite novel
online decompression scheme: when you
demand-page text from a binary,
it would decompress into your 32K
page, but it was stored in a
sparse, compressed way on disk. So actually
the on-disk use was a lot less than you'd
expect; it was the only way it fit on some
of the smaller machines.
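The scheme can be sketched as a toy model; the storage format and the use of zlib are assumptions for illustration, not the actual RISC iX on-disk format:

```python
# Toy model of RISC iX-style demand paging with on-disk compression:
# each 32 KB text page of a binary is stored compressed, and a page
# fault decompresses it into a full 32 KB frame in memory.
import zlib

PAGE_SIZE = 32 * 1024

# "On disk": compressed pages of a binary's text segment
text = bytes(range(256)) * (PAGE_SIZE // 256)
disk_pages = [zlib.compress(text)]

page_cache = {}

def demand_page(page_no):
    """Fault handler: decompress the stored page into a 32 KB frame."""
    if page_no not in page_cache:
        page_cache[page_no] = zlib.decompress(disk_pages[page_no])
    return page_cache[page_no]

page = demand_page(0)
print(len(page), len(disk_pages[0]) < PAGE_SIZE)  # 32768 True
```

The memory cost of the huge page is paid only when the page is touched, while the disk pays only for the compressed form.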
Also, Acorn's design department, the
department that designed the Cybertruck,
it turns out: this was their view of the
A680, which is an unreleased workstation.
I love this picture.
I like the piece of cheese, or
cake, as the mouse; that's my favorite
part. But this is the real machine. So
this is an unreleased prototype I found at
the computer museum. It's notable: it's
got 2 MEMCs, it's got 8 MB of
RAM. It's only designed to run RISC iX,
the Unix operating system, and it has a
high-res monitor only, no color; it was
designed to run FrameMaker and drive
laser printers and be a kind of desktop
publishing workstation. I've always been
fascinated by RISC iX, as I said; a while
ago I hacked around in ArcEm and got it
booting. I'd never seen this before, and I
never used a RISC iX machine. So there we
go: it boots, it is multi-user. But wait,
there's more. It has a really cool
X server, a very fast one; I think Sophie
Wilson again worked on the X server here.
So it's very well optimized and very fast
for a machine of its era. And it makes
quite a nice little
its era. And it makes quite a nice little
Unix workstation. It's quite a cool little
system, by the way Tudor, the guy that
designed the VIDC and the IO system called
me a sado forgetting this working in
there. That's my claim to fame. Finally,
and I want to leave some time for
questions. There's a lot of useful stuff
in ROM. One of them is BBC Basic. Basic
has an assembler, so you can walk up to
this machine with a floppy disk and write
assembler; it has a special bit of syntax,
and then you can just call it. So this is
really powerful: at school or something,
with a floppy disk, you could do something
that's a bit more than Basic programming.
Bizarrely, I mostly wrote that
with only two or three tiny syntax errors,
after about 20 years away from this. It's
in there somewhere. Legacy-wise, the
machine didn't sell very many: under a
hundred thousand, easily. I don't think it
really made a massive impact; PCs had
already taken off by then. The ARM
processor: I'm not going to go on about the
company, but it's clear that that
obviously has changed the world in many
ways. The thing I really took away from
this exercise was that a handful of smart
people, not that many, on the order of a
dozen, designed multiple chips, designed a
custom computer from scratch, got it
working, and it was quite good. And I think
that this really turned people's heads. It
made people think differently: that people
who were not Motorola and IBM, really,
really big companies with enormous
resources, could do this and could make it
work. I think actually that led to the
thinking that people could design their
own systems on a chip in the 90s, and that
market taking off. So I think this is
really key in getting people thinking that
way: it was possible to design your own
silicon. And finally, I just want to thank
the people I spoke to, and Adrian and
Jason at the Centre for Computing History
in Cambridge. If you're in Cambridge, then
please visit; it's a really cool
museum. And with that, I'll wrap up. If
there's any time for questions... then I'm
getting a blank look. No time for
questions?
Herald: There's about 5 minutes left for
questions.
Matt: Fantastic! Or come up to me afterwards.
I'm happy to chat more about this.
applause
Herald: The first question is for the
Internet. Signal angel, will you?
Well, grab your microphones and take the
first from the audience in the room here.
There, at that microphone, please ask a question.
Mic1: You mentioned that the system is
making good use of the memory, but how is
it actually not completely
stalled on memory, having no cache and
the same cycle time for the
memory as for the CPU?
M: Good question. So, how is it not always
stalled on memory? Well, it's
sometimes stalled on memory: when you do
something that's non-sequential, you have
to take one of the slow cycles; this was
the N-cycle. The key is you try and
maximize the amount of time that you're
doing sequential stuff.
So on the ARM 2 you wanted to unroll loops
as much as possible, so you're fetching
your instructions sequentially, right? You
wanted to make as much use of load/store
multiples as possible. You could load
single registers with individual register
loads, but it was much more efficient to
pay that cost just once, at the start of
the instruction, and then stream stuff
sequentially. So you're right that it is
still stalled sometimes, but that was
still a good tradeoff, I think, for a
system that didn't have a cache for other
reasons.
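The tradeoff described in that answer can be put in a rough cycle-cost model. The exact ARM 2 cycle counts varied with memory; the numbers below are illustrative assumptions, chosen only to show why LDM amortizes the non-sequential cost:

```python
# Rough cycle-cost model of why LDM beats single loads on ARM 2.
# A non-sequential memory access (N-cycle) pays for opening a new
# DRAM row; sequential accesses (S-cycles) are cheaper. The exact
# counts are illustrative.

N_CYCLE = 2  # non-sequential access
S_CYCLE = 1  # sequential access

def single_loads(n):
    # Each LDR: an instruction fetch plus one non-sequential
    # data access.
    return n * (N_CYCLE + N_CYCLE)

def load_multiple(n):
    # One LDM: one instruction fetch, one N-cycle to start the
    # burst, then (n - 1) cheap sequential accesses.
    return N_CYCLE + N_CYCLE + (n - 1) * S_CYCLE

print(single_loads(8), load_multiple(8))  # 32 11
```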
M1: Thanks.
Herald: Next question is for the Internet.
Signal Angel: Are there any Acorns on
sale right now or if you want to get into
this kind of hardware where do you get it?
Herald: Can you repeat the first sentence,
please? Sorry, the first part.
S: If you want to get into this kind of
hardware right now, if you want to buy it
right now.
M: Yeah, good question. How do you
get hold of one? Drive prices up on eBay,
I guess, I hate to say it. It might be fun
to play around in emulators, though I
always prefer to hack around on the
real thing; emulators always feel a bit
strange. There are a bunch of really good
emulators out there, quite complete. Yeah,
I think I would just go on
auction sites and try and find one.
Fortunately, they're not completely
rare. I mean, that's the thing, they
did sell. I'm not quite sure of the exact
figure, but, you know, there were tens and
tens of thousands of these things made. I
would look in Britain more than elsewhere,
although I do understand that Germany had
quite a few. If you can get hold of one,
though, I do suggest doing so. I think
they're really fun to play with.
Herald: OK, next question.
M2: So I found myself looking at the
documentation for the LDM/STM instructions
while developing something on ARM just last
week, and I was wondering what your
thoughts are: are there any quirks of the
Archimedes that have crept into the modern
ARM design and instruction set that you
are aware of?
M: Most of them got purged. So there was
the 26-bit addressing. There were a
couple of strange uses, like an XOR
instruction into the PC for changing flags.
There was a great purge when the ARM 6 was
designed; the ARM 6, I should know
this, is ARMv3. That's got 32-bit addressing,
and these weirdnesses got moved out.
I can't think of much, aside from the
resulting ARM 32-bit instruction set
being quite quirky and having a lot of good
quirks. There's this shifted register as
sort of a free thing you can do: for
example, you can add one register to a
shifted register in one cycle. I think
that's a good quirk.
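That "free shift" quirk, e.g. the real ARM instruction `ADD r0, r1, r2, LSL #2`, computes r1 + (r2 << 2) in a single data-processing operation. A minimal Python equivalent of what the barrel shifter gives you for free:

```python
# Toy model of ARM's shifted second operand:
# ADD r0, r1, r2, LSL #2  ==  r0 = r1 + (r2 << 2), one instruction.

def add_shifted(a, b, shift):
    """One ARM data-processing op: a + (b << shift), modulo 32 bits."""
    return (a + (b << shift)) & 0xFFFFFFFF

# e.g. indexing a word array for free: base + (index << 2)
print(hex(add_shifted(0x8000, 10, 2)))  # 0x8028
```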
So in terms of inheriting that
instruction set and not changing those
things, maybe that counts?
Herald: Any further questions? Internet,
any new questions? No? Okay, so in that
case one round of applause for Matt Evans.
M: Thank you.
applause
postroll music
Subtitles created by c3subtitles.de
in the year 2021. Join, and help us!