Introducing JRISC, the new early startup processor
by Jens Schönfeld
Note: Jeri Ellsworth has abandoned JRISC. This information only stays online
to keep the overview of the development history complete.
Here's the long-awaited news update after many months of intense work on the
C-One. We were so close to releasing the boards with a working C64 core, but
there was something that kept us from completing the early startup procedure:
The size of the design. It simply did not fit in the "small" FPGA, the 1K30
that launches the whole system.
Many weeks of optimizing freed a few logic cells, but did not make the design
fit. Taking a bigger FPGA was not an option, because the boards had already
been produced a long time ago. Exchanging the FPGAs on 300 boards (where some
of them are already spread all around the globe at the developers' sites)
would have been too costly, and too risky for those who want to do it on their
own. Exchanging a 208-pin QFP package is something you can do at home, but
you should have lots of experience and the right tools.
We dropped the idea of a bigger FPGA. This would have been the Wintel way of
doing things: If it does not work, use a bigger/faster computer. We're
Commodore people, and that means we're taking hardware restrictions as a
challenge, not as a limit of our brains.
The lion's share of logic cells was eaten up by the 6502 processor, AKA the
drive CPU. Although its cut-down version with no BCD support is 50% smaller
than any other implementation you can find on the net, it was still
too big to be fitted with the DRAM controller, the video controller, keyboard
controller, floppy controller, audio engine and the early startup DMA engine
that gets the 64K of early startup code into the memory.
The other parts of the design are as small as can be. Neither a memory
controller, nor video or the DMA engine can be considered "big". It's the CPU
that takes up too much space, so Jeri tried to optimize that, and having
squeezed everything out of it, it was clear that the 6502 had to go.
The new idea was to create a microcode engine that loads a 6502 emulator with
the DMA engine. A new processor from scratch. During the design phase, Jeri
discovered that the microcode engine she designed was nearly a full processor.
Just add a few things here and there, cut out the microcode overhead, and
all of a sudden we have a RISC processor, the JRISC. Even the DMA engine can
be dropped, because the CPU has a small bootstrap that can load its code from
the flash memory.
Before anyone complains about not having the 6502 any more, and all the work
on that being done in vain: We still need the 6502 for the C64 "compatibility"
core. The work was necessary, we're glad it's done, and we're confident that
this implementation of a 6502 on an FPGA beats any other in size, speed and
cycle-exact execution. It's just that the 6502 is not used in early startup
any more, but the C64 will of course stay with this classic processor, and
the "native C-One" will stay with the 65816, as promised in all the technical
data that has been spread over the course of the project.
Now let's come to the new processor that has so many benefits over the 6502:
JRISC is a 32-bit processor. It has the full address range of the SIMM module,
and a little more (currently 27 bits, but that might change in order to
optimize the design). No matter what size SIMM you put in the multimedia
memory socket, it can all be used as program memory for the drive CPU. Up to
128 megs can be used as shared memory for program, data, video and audio.
JRISC has a von Neumann architecture. As opposed to the Harvard architecture,
this has common memory areas for program and data. In other words: The same
as you're used to from C64 or Amiga experience. I want to make this clear, as
Harvard architecture, which separates code and data, is often used on
microcontroller systems such as the PIC micro or the all-time classic, much-hated
Intel 8031, and since the goal of the design was small size, one could
consider JRISC a microcontroller, which it really isn't.
The basic architecture is about everything that JRISC and the 6502 have in
common. You will have to re-think some aspects of processors if you haven't
dealt with RISC processors before. RISC means "reduced instruction set
computer", and it's the opposite of the CISC (complex instruction set
computer), the concept of most popular processor architectures of today: x86,
M68K, PPC, just to mention a few.
Those of you who know the CISC architecture might have real problems getting
the idea of Jeri's new approach to having a small processor. Everyone who
wants to take part in future discussions should read the preliminary
documentation, and everyone is welcome to discuss the new aspects of the
processor on this list. Ladies and Gentlemen, may I present:
JRISC!
The document is preliminary, and it's already outdated in some aspects. Please
do not base any work on this document!
Reducing the instruction set does not mean crippling the capabilities of the
processor. Reading the document, you might have found the "shift right"
opcode, but no shift left opcode. Such an operation must be done with other
operations, but we'll come to that later. Let's first discuss the form of an
opcode, and the power you can squeeze into one 32-bit word:
Every opcode is 32 bits wide. There's no smaller unit in the program memory.
Every opcode is conditional, that means that the execution of each command
depends on the state of the flags. 6502 programmers only know the branch
commands like
BNE (branch not equal)
BEQ (branch equal)
BCS (branch carry set)
BCC (branch carry clear)
and so on. These are conditional commands, they jump if a certain condition
is met. The drawback is that only one bit of the flags field can be checked,
and the action taken can only be a jump to a different location in program
memory. JRISC allows checking of more than just one flag, giving fairly
complex comparisons such as "greater than" or "less or equal" in just one
opcode. Further, not only jump commands can be executed conditionally, but also
any other command. I'll spare you any examples and leave them for the
discussion later.
Any opcode can either update the flags with the result of the operation it has
just performed, or leave them alone. Further, the result of any ALU
(arithmetic-logical unit) operation can be stored
or not. This makes comparisons possible without having to destroy the contents
of a register. As opposed to the preliminary PDF document, this is not
achieved by the "store result" bit, but by choosing register 0 as the
destination register. It is now write-protected, which frees one bit in the
opcode word for other purposes. To compare two registers, just subtract them,
and choose register 0 as the destination. Set the "store flags" bit. Next you
can use the flags as a condition for some command. For example the Zero flag
will be set if the two registers were equal.
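The compare trick can be modelled in a few lines of Python. This is only a sketch and not JRISC code: the function name sub_and_flags, the flag dictionary and the restriction to the Zero and Negative flags are assumptions made for illustration, and the real flag semantics are whatever the final JRISC documentation says.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def sub_and_flags(regs, src_a, src_b, dest, store_flags=True):
    """Toy model of 'dest = src_a - src_b' on a 16-entry register file.

    Register 0 is write-protected, so choosing it as the destination
    discards the result and turns the subtraction into a pure compare.
    """
    result = (regs[src_a] - regs[src_b]) & MASK32
    if dest != 0:  # writes to register 0 are ignored
        regs[dest] = result
    if not store_flags:
        return None
    # Only Z and N are modelled here (an assumption of this sketch).
    return {"Z": result == 0, "N": bool(result >> 31)}


regs = [0] * 16
regs[1], regs[2] = 1234, 1234
flags = sub_and_flags(regs, 1, 2, dest=0)  # compare r1 with r2
```

After the call, both source registers still hold their values, register 0 is still 0, and the Zero flag in the returned dictionary tells you the two registers were equal.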
Speaking of registers, there are ten of them, and they're 32 bits wide. There
are six more registers for special purposes, which emphasizes the load-store
architecture of the processor: Simply everything is done in the registers,
and after preparing data in them, you can store the data in memory, which
leads to the next aspect of the CPU, the addressing modes.
The only immediate mode is loading data into a register. Immediate means that
the data is located in the program itself, as a parameter to an opcode.
Loading some value to a register not only takes the opcode, but also the data
as a parameter, so the full command is two 32-bit words long (one for the
opcode, and one for the data). This is the only command that takes two
instruction words, all other commands only have one word. Other addressing
modes are generated by the ALU: You can use ADD, SBC, AND, OR, NOT, shift and
combinations of those to generate the target address where you load or store
data.
The experienced readers among you might already see the power in this: Weird
addressing modes like "indirect indexed masked", where the index can point
over the whole memory space, are possible. The only thing that the processor
lacks is a direct addressing mode; you always have to take the detour of
loading the address into a register with an immediate load, and then using that
register as a pointer to your memory location - this is as "direct" as it gets!
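To make the idea of ALU-generated addressing modes a bit more concrete, here is a small Python model. It is a sketch under assumptions: the helper names load_immediate and load_indirect, the tiny word-addressed memory and the chosen register numbers are all made up for illustration and are not taken from the JRISC documentation.

```python
memory = [0] * 64  # toy word-addressed memory (an assumption of this sketch)
regs = [0] * 16


def load_immediate(dest, value):
    """The only immediate mode: put a constant into a register."""
    regs[dest] = value & 0xFFFFFFFF


def load_indirect(dest, addr_reg):
    """Load from the memory location that a register points to."""
    regs[dest] = memory[regs[addr_reg]]


# "Direct" access to location 40: load the address, then use it as a pointer.
memory[40] = 0xCAFE
load_immediate(1, 40)
load_indirect(2, 1)

# "Indexed, masked" access: base + (index AND mask), e.g. an 8-word ring buffer.
memory[16:24] = range(100, 108)
load_immediate(3, 16)         # base address of the buffer
load_immediate(4, 11)         # index, deliberately larger than the buffer
load_immediate(5, 7)          # mask for 8 entries
regs[6] = regs[4] & regs[5]   # AND: 11 & 7 = 3
regs[6] = regs[3] + regs[6]   # ADD: 16 + 3 = 19
load_indirect(7, 6)           # fetches memory[19]
```

The point is that the address arithmetic is ordinary ALU work on registers, so any combination of the available operations gives you another "addressing mode".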
There are three register fields in the bit pattern of an opcode. Two source
fields, and one destination field. Each of them is 4 bits wide, so each of
them can point to one of the possible 16 registers of the processor
(restrictions apply, because only ten of them are general-purpose registers!).
The two source registers are fed into the ALU, and the destination register
field points to the register file where the result of the ALU operation shall
be stored. You can choose the registers freely, so the two source registers
can be the same, and even the destination register can be the same, because
the ALU latches the values and therefore does not cause any glitches in the
data consistency. The result is stored in the register file after the source
data has been read - if you choose to store the result at all (see above).
The final part of a 32-bit word is the remaining 8 bits of "branch". This
means that every opcode gives you the opportunity to jump after its
execution. Since every opcode is conditional, this jump is also conditional,
and the same condition code applies to the jump itself. The jump is relative,
you can jump 128 locations forward, or 127 locations back. To emphasize the
meaning of these last bits: Every opcode can perform two actions, where the
second is always jumping. If you don't want to jump anywhere, just set these
bits to the default pattern that makes the CPU fetch the next instruction, and
that's all 0's.
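The fields described above can be sketched as a pack/unpack pair in Python. Please note the assumptions: the text only gives the field widths (three 4-bit register fields and 8 bits of branch), so the bit positions used here are hypothetical, the remaining 12 bits are treated as one opaque "control" value, and the branch offset is assumed to be a signed two's complement number relative to the next instruction. That last assumption matches the stated range of 128 locations forward and 127 back, with all 0's meaning "next instruction".

```python
def pack(control, src_a, src_b, dest, branch):
    """Build a 32-bit word from its fields (bit positions are hypothetical)."""
    assert 0 <= control < (1 << 12)
    for reg in (src_a, src_b, dest):
        assert 0 <= reg < 16
    assert -128 <= branch <= 127
    return ((control << 20) | (src_a << 16) | (src_b << 12)
            | (dest << 8) | (branch & 0xFF))


def unpack(word):
    """Split a 32-bit word back into the same fields."""
    branch = word & 0xFF
    if branch >= 128:          # undo two's complement for negative offsets
        branch -= 256
    return {
        "control": (word >> 20) & 0xFFF,
        "src_a": (word >> 16) & 0xF,
        "src_b": (word >> 12) & 0xF,
        "dest": (word >> 8) & 0xF,
        "branch": branch,
    }


def next_pc(pc, branch):
    """Offset 0 fetches the next word; other offsets are relative to it."""
    return pc + 1 + branch


word = pack(0xABC, src_a=1, src_b=2, dest=3, branch=-2)
fields = unpack(word)
```

With this model, next_pc(1000, 127) lands 128 locations ahead of the current instruction and next_pc(1000, -128) lands 127 locations behind it.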
Let's shock you a little more: There's no JSR (jump to subroutine) and no RTS
(return from subroutine) in the instruction set. How we still manage to jump
to subroutines and return from them in a clean way requires knowledge about
the six "special purpose" registers of JRISC:
Register number 0 always contains 0. It cannot be used for anything else, only
as a constant 0.
As mentioned earlier, the register is write-protected, so it can be used as a
dummy destination for results that shall not be stored.
Registers 1-10 are general purpose.
Register 11 IRQ vector (upper 5 bits reserved)
Register 12 software stack (upper 5 bits reserved)
Register 13 program counter (upper 5 bits reserved)
Register 14 Last address, Upper 5 bits are flags: NZCVI (MSB first)
Register 15 current opcode fetched (do not modify!)
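The layout of register 14 can be illustrated with a short Python sketch: the upper 5 bits hold the flags N, Z, C, V, I (MSB first), the lower 27 bits hold the last address. The function names pack_r14 and unpack_r14 are made up for this example, and the 27-bit address width is the current value from the text, which might still change.

```python
ADDR_BITS = 27                      # current address width, may change
ADDR_MASK = (1 << ADDR_BITS) - 1
FLAG_NAMES = "NZCVI"                # bit 31 = N, bit 30 = Z, ... bit 27 = I


def pack_r14(address, flags):
    """Combine a last address and a dict of flags into one 32-bit word."""
    word = address & ADDR_MASK
    for i, name in enumerate(FLAG_NAMES):
        if flags.get(name):
            word |= 1 << (31 - i)
    return word


def unpack_r14(word):
    """Split a register 14 value into (address, flags)."""
    flags = {name: bool((word >> (31 - i)) & 1)
             for i, name in enumerate(FLAG_NAMES)}
    return word & ADDR_MASK, flags


r14 = pack_r14(0x123456, {"Z": True, "C": True})
address, flags = unpack_r14(r14)
```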
Every instruction can find out where the previous instruction was located just by
looking at the last address register (14). You can now make a subroutine-entry
routine that stores the flags and the last address on the stack, and restores
this before leaving the subroutine. The difference between a "real" JSR and
this way is just that you're doing the necessary steps to find the way back
into the main program instead of the CPU doing the work for you. Overall
execution time will be the same compared to CISC units, because there's
hardly a difference between the CPU doing the things on its own or the
programmer telling it what to do.
The idea for subroutines on JRISC is to jump to the routine, have the first
instruction of that routine take the last address and the flags from
register 14 and push them on the stack, then do the subroutine work. To exit
and return to the location where it came from, just take that word from the
stack, write the flags back to register 14, set the program counter to the
old address and the CPU will automatically continue at the next address after
the jump to the subroutine. Saving the flags is of course optional, the 6502
doesn't save flags on subroutine calls either.
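Here is a toy Python model of that calling convention. It is a sketch, not a description of the real hardware: the class ToyCPU and its method names are invented, the stack is assumed to grow downward from an arbitrary address, and the "continue at the next address" step that the CPU does on its own is modelled by simply adding one to the restored address.

```python
ADDR_MASK = (1 << 27) - 1
SP, PC, LAST = 12, 13, 14   # stack pointer, program counter, last address


class ToyCPU:
    def __init__(self):
        self.regs = [0] * 16
        self.memory = {}
        self.regs[SP] = 0x1000           # arbitrary stack start (assumption)

    def jump(self, target):
        """A jump records where it came from in the last address register."""
        flags = self.regs[LAST] & ~ADDR_MASK & 0xFFFFFFFF
        self.regs[LAST] = flags | (self.regs[PC] & ADDR_MASK)
        self.regs[PC] = target

    def enter_subroutine(self):
        """First instruction of the routine: push register 14."""
        self.regs[SP] -= 1               # stack grows downward (assumption)
        self.memory[self.regs[SP]] = self.regs[LAST]

    def leave_subroutine(self):
        """Pop the saved word, restore the flags, continue after the call."""
        saved = self.memory[self.regs[SP]]
        self.regs[SP] += 1
        self.regs[LAST] = saved
        self.regs[PC] = (saved & ADDR_MASK) + 1


cpu = ToyCPU()
cpu.regs[PC] = 200
cpu.jump(500)             # "call" the routine at 500 from address 200
cpu.enter_subroutine()
cpu.leave_subroutine()    # execution continues at 201
```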
A similar workaround will have to be done for shifting left. Since that is not
directly supported by the ALU, you'll have to do it with some knowledge about
bit operations. Here's the theory:
Shifting left is the same as multiplying by two. Multiplying by two means
adding the same number to itself, and that's what the ALU can do natively:
Choose the same register for both sources for an Add, and the result will be
that register shifted one bit left.
More theory: The result of a multiplication by two is always an even number.
Even numbers represented in binary always have a cleared least significant
bit, so "add with carry" will never make the operation wrap around if the
carry bit is set. Instead, it will "shift" in the carry bit, another benefit.
If the MSB of the to-be-shifted value was set, the carry bit will be set
after this operation, giving a full rotate command with the few features the
CPU has.
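The theory can be checked with a few lines of Python that model a 32-bit adder with carry. The function names are made up for this sketch; the only thing taken from the text is that the ALU can add a register to itself, with or without the carry.

```python
MASK32 = 0xFFFFFFFF


def add(a, b, carry_in=0):
    """32-bit add, returns (result, carry_out)."""
    total = a + b + carry_in
    return total & MASK32, total >> 32


def shift_left(value):
    """Same register as both sources of an add: value * 2."""
    return add(value, value)


def rotate_left_through_carry(value, carry):
    """Add with carry: the old carry is shifted into bit 0."""
    return add(value, value, carry)


doubled = shift_left(0x00000003)                       # plain shift
overflow = shift_left(0x80000001)                      # MSB falls into carry
rotated = rotate_left_through_carry(0x80000000, 1)     # carry in, MSB out
```

Because value + value is always even, adding the carry can only set bit 0 and never causes a second overflow, which is why the operation behaves like a rotate through carry.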
The one big exception to the flexibility of conditionals and branching into
subroutines from any command is the immediate load: The condition code must
be set to "always", and a jump into a subroutine is not allowed either.
This is because of the data being located right behind the opcode: If you use
a condition code which is not met, the data would be executed as code, which
would crash the CPU. This is also the reason for the no-subroutine rule: The
return code of the subroutine would not know that it was called from an
immediate load, and would return into the data instead of the word after it. The
final restriction is that the branch may not be set to all 0's, because this
would also lead to the data being interpreted as code. Immediate load requires
the user to set the branch destination to the word after the data word, or
some other valid instruction word.
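The restrictions on the immediate load can be illustrated with a very small Python interpreter. Everything about it is an assumption made for this sketch: instructions are written as tuples instead of real bit patterns, the names LOADI and NOP are invented, and the branch offset is taken to be relative to the next word, so an offset of 1 skips the data word.

```python
def run(program):
    """Execute a toy program; returns (registers, list of executed addresses)."""
    regs = [0] * 16
    pc = 0
    executed = []
    while pc < len(program):
        word = program[pc]
        if not isinstance(word, tuple):
            raise RuntimeError("data word fetched as code at %d" % pc)
        executed.append(pc)
        if word[0] == "LOADI":           # ("LOADI", dest, branch) + data word
            _, dest, branch = word
            if branch == 0:
                raise ValueError("immediate load must branch past its data")
            regs[dest] = program[pc + 1] & 0xFFFFFFFF
        elif word[0] == "NOP":           # ("NOP", branch)
            branch = word[1]
        else:
            raise ValueError("unknown instruction %r" % (word,))
        pc = pc + 1 + branch             # offset 0 means "next word"
    return regs, executed


program = [
    ("LOADI", 1, 1),   # address 0: load r1, branch over the data word
    0xDEADBEEF,        # address 1: the data, never executed
    ("NOP", 0),        # address 2: execution continues here
]
regs, executed = run(program)
```

In this run only addresses 0 and 2 are executed. Setting the branch field of the load to 0 is rejected, because the CPU would otherwise fetch the data word as its next instruction.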
Interrupts are a little tricky: They can only be served after instructions
that have the branch destination set to 0. The IRQ service routine will have
to do similar things as the subroutine entry code, because it has to take care
of flags and the return address. The full detail of interrupts would go beyond
the scope of a preliminary introduction of this processor, so please accept
that it can do IRQs with some restrictions.
Speed: The PDF document from last week still states that every instruction
takes 7 cycles to execute. This will be reduced in order to boost performance.
Code: All the startup code developed until now has been written in 6502
assembler. We don't want to re-write all that code, so the JRISC has some
features that assist in emulating 8-bit processors in general, not only 6502.
We have already started to develop the necessary code and will document the
new opcodes and ALU features in a few days. The next few days are reserved
for implementing and testing some of the 6502 emulation code. I will be
reading the mailing list and will try to answer whatever questions you have.
Happy Valentine's Day, everyone,
Jens Schönfeld