index.html

<!doctype html>
<html lang="en-US">
<title>A Simple CPU</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width">
<style>
    html {
      background-color: #002b36;
    }
    .content {
      color: #839496;
      font-family: Georgia, serif;
      line-height: 1.5;
      max-width: 60ch;
      margin-left: auto;
      margin-right: auto;
    }
    h1,h2,h3,h4,h5,h6 {
      color: #93a1a1;
      font-family: sans-serif;
      line-height: normal;
    }
    h1 {
      text-align: center;
    }
    h2 {
      font-size: 1.2em;
    }
    p,ul,ol {
      break-inside: avoid;
    }
    a:link {
      color: #268bd2;
    }
    a:visited {
      color: #d33682;
    }
    .video {
      max-width: 100%;
      /* This is relative to the width, preserving the aspect ratio. */
      padding-bottom: 56.25%;
      position: relative;
      width: 560px;
    }
    iframe {
      height: 100%;
      left: 0;
      position: absolute;
      top: 0;
      width: 100%;
    }
    a:not(:hover) {
      text-decoration: none;
    }
</style>
<div class="content">
<h1>A Simple CPU</h1>
<h2>Magic</h2>
<p>When I was a child, I marveled at our family computer. Occasionally, when I
was bored of shooting things in Duke Nukem 3D or Doom, I'd lean my face in
really close to the monitor and start nudging the arrow keys. As the game
camera moved about the virtual space, the pixelated rectangles on screen would
shift and distort in complex patterns. I was convinced that if I just stared
at it long enough, I'd figure out how they made the colors on screen change
to look like a 3D room.

<p>In high school, I finally learned enough math to figure it out. I even built
a little ray-casting 3D engine. (I later learned neither game actually
used ray-casting. Que sera.) It was one thing to work out on paper how to use
wall distances to shrink or expand a pixel-wide column of texture. It was
another thing to walk around a room made of your own math.

<p>In my childhood, I'd been shown a magic trick. Hocus pocus, now you see a
room! Later, I'd learned the trick myself.

<p>In college, I learned computer architecture, compilers, network stacks,
algorithms, and all the trappings of software. I took a digital circuit
design class and made sequential logic circuits on breadboards. A deeper
magicks had been revealed: computing itself.

<p>I learned how these tricks were done, but I couldn't really do them
myself. I wanted to make a CPU to embody what I'd learned, but every path
towards this goal felt either trivial or tedious. I could program it all into
a FPGA, but this felt just like weirder programming. I could hand wire relays
into flip-flops and half adders, but that would take forever. I could wire
together a bunch of store-bought ALUs and registers, but that felt like
assembling really complicated LEGOs. So I put the project on the shelf for a
long time.

<h2>Simplicity</h2>

<p>The project popped back into my mind a while ago with a twist. Even toy
CPUs usually have instruction decoding, data path switching, an ALU, etc. But
these toys are simplified models of production CPUs, and production CPUs need
to be efficient, performant, and practical.

<p>My CPU wouldn't need to be efficient, performant, or particularly
practical. I'd just want it to compute the Fibonacci sequence in under a day
or so. This opened up the enormous space of less-than-practical designs.

<p>Any <a href="https://en.wikipedia.org/wiki/Turing_completeness">Turing-complete</a>
computer architecture can be reduced to any other via software rewriting. So
long as the transformation wouldn't make a Fibonacci program completely
infeasible, any would work. I didn't want to do a lot of wiring, so I wanted
one I could build from very few electronic parts.

<h2>One instruction to rule them all</h2>

<p>A dazzling variety of Turing-complete abstract machines have only one
instruction type. These are appropriately named
<a href="https://en.wikipedia.org/wiki/One_instruction_set_computer">one
instruction set computers</a> (OISCs). Much of production CPUs hardware
exists to switch the behavior of the processor when different types of
instructions are read. With only one instruction type, I could avoid needing
any hardware for this.

<p><a href="https://en.wikipedia.org/wiki/One_instruction_set_computer#Subtract_and_branch_if_less_than_or_equal_to_zero">SUBLEQ</a>
(subtract and branch if result &lt;= 0) is probably the most famous OISC.
I explored it thoroughly, but implementing it seemed hardly any easier than a
normal CPU. SUBLEQ requires an ALU for subtraction with a status bit for the
comparison. The CPU would also need to do have different behavior depending
on the value of the status bit.

<p>I consulted
<a href="https://en.wikipedia.org/wiki/One_instruction_set_computer">various</a>
<a href="https://esolangs.org/wiki/OISC">lists</a> of OISCs until I found the perfect
one: <a href="https://esolangs.org/wiki/ByteByteJump">byte-byte-jump</a>. Any
CPU must, at the very least, accept input and deliver output. This
instruction requires little else.

<p>Byte-byte-jump takes three addresses as arguments: source, destination,
and jump. It reads a byte from the source address, writes it to the
destination address, and changes the instruction pointer to the jump address.

<h2>Can't add, doesn't even try</h2>

<p>It seemed a bit magical to me that such a simple operation could implement
arithmetic, comparison, and conditional branching. As with all magic tricks,
it's a bit less impressive once you know how.

<p><a href="https://en.wikipedia.org/wiki/Lookup_table">Tables</a>. A program
can perform any arithmetic operation by including a table from all possible
inputs to the corresponding output. Production programs use lookup tables for
DES S-boxes, hash functions, etc. This is no different, except that usually
lookup tables make a program more efficient. Here they make a program
<i>possible</i>.

<p>Self-modifying code allows byte-byte-jump to perform table lookups. Say a
program includes a byte-byte-jump with the table as the source address. At
runtime, before the instruction executes, the program can use another
byte-byte-jump to overwrite the low byte of the source address (in the
program itself) with the table index. When the modified instruction later
runs, it then reads from the correct location inside the table. Overwriting
the jump address in this fashion yields conditional branching.

<p>One snag. The naive tables for even 8-bit binary operations (like
additions) are 64 KiB large. This plus a program would require addresses to
be larger than 16 bits. Once the adresses get past 16 bits, my nostalgia
goggles fall right off.

<p>No matter. 4-bit operations chained together can do any 8-bit operation,
and 4-bit binary operations tables are only 256 bytes. This allows my CPU to
have 8-bit bytes and 16-bit addresses, as God intended.

<p>The cheapest configuration of the commercially-sold <a
href="https://en.wikipedia.org/wiki/IBM_1620">IBM 1620</a> replaced expensive
logic with arithmetic tables in cheaper program storage. As a result, users
took it's internal code name, CADET, to stand for "Can't Add, Doesn't Even
Try."

<h2>Engine</h2>

<p>Let's talk about hardware. We can start by analyzing byte-byte-jump.
Byte-byte-jump involves two reads and one write to main memory. Simple main
memories can only read or write one address at a time, so these reads and
writes must happen at different times. In between, the CPU will need to
remember some data.

<p>Specifically, the CPU must remember the source and destination addresses,
the instruction pointer, and the byte to be written. The CPU also needs a way
to access or overwrite each of these values without disturbing the others.

<p>There are two easy ways to differentiate things in hardware: space and
time. Parallel systems include multiple copies of a thing with different
wires and pins for each. Serial systems use the same pins and wires to handle
different items at different times, with the timing coordinated by some
protocol. Parallel takes more work to physically wire up, so serial is my
jam.

<p>So, I need a way to remember some bits with a serial interface. That means
either a serial register bank or a serial RAM. I couldn't find any cheap
serial registers, so I started shopping for serial SRAMs. Eventually, I
selected the 64 KiB
<a href="https://www.microchip.com/wwwproducts/en/23LC512">Microchip 23LC512</a>.
This chip uses
<a href="https://en.wikipedia.org/wiki/Serial_Peripheral_Interface">SPI</a>
to communicate data, addresses, and read/write operations with the chip. SPI
is literally the simplest serial communications protocol I can imagine, and
the chip only costs around $1.

<p>I'll revel for a moment in the absurdity of using 64 KiB of RAM to remember
48 bits. This one chip has more switching elements than the first ten of
mankind's computers combined. Yet, the smallest SPI SRAM I could find was still
8 KiB, and the 64 KiB one was the smallest with a sequential mode I needed.

<p>The SRAM also has an instruction set, address pointer, operating modes,
registers, and a communications protocol. Is this SRAM really just a CPU
masquerading as a SRAM? Sort of, but not really. It doesn't have an
instruction pointer or control flow. This means it cannot request
instructions for itself, and the outside system must spoon-feed them in. It's
a powerful engine, but without a driver.

<h2>Driver</h2>

<p>Something needs to drive the SRAM's pins. The SRAM has eight, but only
three are interesting: chip select (CS), serial input (SI), and serial output
(SO).

<p>The chip can't actually input and output at the same time. When SI is in
use, SO is always electrically disconnected, and vice versa. Thus, instead of
treating them separately, I wired them together into one "SIO" pin. This
reduces the number of electrical connections needed with the rest of the
system.

<p>The CS pin controls the timing of the operations sent over the SIO line.
Once CS is activated, the operation type (read/write) and address are sent
over SIO. Data then flows over SIO from/to the given address (and successive
addresses) until CS is deactivated.

<p>The main memory also needs to communicate with the CPU. For this, I used a
slightly tweaked version of the SRAM's protocol over SPI. The main memory
gets its own CS line, but it shares SIO with the SRAM, allowing bidirectional
communications between the two.

<p>With main memory and the SRAM connected together in this fashion, the
stage is set. Enter the players. Four more identical SRAMs can be hooked
together with the SRAM and main memory. Together, the five SRAMs can operate
as a byte-byte-jump CPU.

<p>How does adding more SRAMs help? Well, performing byte-byte-jump requires
issuing various instructions and addresses to the SRAM and main memory. These
take the form of specific binary signals for SRAM CS, main memory CS, and
SIO. An SRAM can generate any binary signal.

<p>To generate a signal, fill the SRAM with repeated copies of the signal and
issue a read to address zero. The signal will then come out of SO
indefinitely, since the SRAM reads each address successively, and the
internal address pointer wraps back around to zero once it reaches the end.

<p>For the two CS lines, these signals are just a sequence of high and low
voltages. They thus require one SRAM each. SIO is a little different, since
it should only be driven when neither the SRAM or main memory is sending data
across it. This means it needs three states: high, low, and disconnected.

<p>Luckily, the SRAMs have HOLD pins that disconnect their outputs. The third
SRAM drives SIO, and the fourth SRAM drives the third SRAM's HOLD pin. This
allows the third SRAM to only drive SIO sometimes, completing the setup.

<p>While I would have <b>loved</b> to be able to say I built a CPU out of
five identical SRAM chips, reality was not so kind. I'd have to program four
of them with four different signals every single time the system was powered
on. So instead, I used four Microchip
<a href="http://www.microchip.com/wwwproducts/25LC010A">EEPROMs</a>.
These have nearly identical interfaces to the SRAMs, but once programmed,
they retain their behavior without power.

<h2>Build</h2>

<p>I needed:

<ul>
  <li>1 Microchip SRAM
  <li>4 Microchip EEPROMs
  <li>2 pullup resistors
  <li>4 LEDs w/ resistors, to provide blinkenlights for each signal
  <li>~40 wires
</ul>

<p>I bought all this from Mouser and waited for it to arrive. It cost me about
~$15, and $7 of that was shipping.

<p>The part of main memory was to be played by a Raspberry PI and its GPIO
pins. I wrote a
<a href="https://github.com/mysterymath/simple_cpu/blob/master/bus.py">program</a>
to accept read and write operations over the SIO and main memory CS lines,
using a 64 KiB array as backing storage. Whenever the CPU writes a byte to
the address 0xFFFF, the program prints it to the terminal. This gives
programs for the CPU a simple output device.

<p>I wrote a
<a href="https://github.com/mysterymath/simple_cpu/blob/master/eeprom.py">script</a>
to program each of the EEPROMs with its control signal, then verified that
each EEPROM would output that signal indefinitely. Finally, I wired the four
EEPROMs, the SRAM, and the Raspberry Pi together.

<p>I turned it on for the first time. It immediately started executing
byte-byte-jumps, instead the expected behavior, catching fire. All things
considered, it took around 200 clock cycles to execute a single byte-byte-jump.

<p>I incrementally built up a
<a href="https://github.com/mysterymath/simple_cpu/blob/master/bus.py">library</a>
of routines to write a Fibonacci-calculating program into the memory array.
All things considered, it took around 200 byte-byte-jump instructions to
execute an 8-bit addition.

<p>You can see it in action at various clock rates below. The second-rightmost
LED is SIO, and all of the others are control lines. The Turing magic happens
on the SIO LED; the others are perfectly cyclical.

<div class="video">
  <iframe
    src="https://www.youtube-nocookie.com/embed/videoseries?list=PL4Aym2GNCg4hb7xvSkjSgP3m2xWIY_TZK"
    frameborder="0"
    allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture"
    allowfullscreen></iframe>
</div>

<h2>Discussion</h2>

<p>Compared to the time it took to conceive, design, and program this CPU,
building it was incredibly straightforward. It operated reliably at all the
clock rates I could get Python to generate, up to 80 kHz. Admittedly, that's
not very fast, but it <i>only</i> takes around 40,000 cycles to do an
addition. That is, it can do around 2 additions per second, which is more
than I can do, and I bet it can do them all night!

<p>The final design resembles early computers far more than modern ones. All
early computers were bit-serial, and very limited instruction sets
supplemented with lookup tables not unheard-of. In both cases, simple
implementation is king. I wanted it to be simple because I'm lazy. They
<i>needed</i> it to be simple, since technology placed tight limitations on
what even major world governments and universities could build.

<p>I can safely say that running this CPU for the first time was exactly as
satisfying as I'd hoped. I still can't believe the damn thing actually
<b>computes</b>, and faster than I can! Given how many things could have broken
along the way, the fact that it does what I wanted is, well, pretty magical.

<p>I've collected a fair amount of additional
<a href="details.html">details</a>
 from my notes. Besides the schematics in that document, the canonical
 control codes and software are all defined in the two Python files I linked
 to above. If there's anything I left out, shoot me an email.

<address>Daniel Thornburgh (<a href="mailto:mysterymath@gmail.com">mysterymath@gmail.com</a>)</address>
</div>