Sunday, November 27, 2016

Working on the MEGA65 IDE

It's been a bit of a while since the last post, and there are some good reasons for that.  Chief among those is that I have been busy temporarily moving to Germany with my family for work.  We are staying just long enough that we have to go through all the formalities as though we were staying forever, but not long enough to be able to settle properly.  The net result is even less time available than normal.  However, on the upside, I am located nearby to most of the other MEGA folks for a few months, which is very nice.

What I have been working on in the meantime in what spare moments I can find, is the beginning of the MEGA65 IDE.  This IDE will integrate with the m65dbg debugger, and will run mostly from within LGB's MEGA65 emulator.  That is, the user interface will in fact be a C65 program.  While this might sound strange, it has the added benefit that we will end up with a (hopefully rather nice) native text editor/viewer for the MEGA65.

I have a large part of the editor working now, including the ability to load multiple files (even if they don't all fit in RAM at the same time), switch between them, and even display multiple files on screen at the same time, as the following screen-shot shows, showing some of the C source code of the editor itself loaded (it is being written using the cc65 6502 C compiler):

The idea of the split-screen mode is to make it easier for debugging: The current stack-trace could be displayed visually showing the recent call graph.  Also it is handy to be able to look at two files at once when programming.

The editor also natively uses ASCII instead of PETSCII, so that it is easier to work on source files intended for cross-compilers and cross-assemblers (or in the future at some point, cc65 or ca65 running natively on the MEGA65)

Here is a more ordinary looking screen-shot with a single file in view:

You can see here that we have curly-braces as part of our ASCII-goodness.

The next step is to finish cursor movement functions, and then work on editing functions.  The editor controls are based rather loosely on Emacs and the UNIX shell conventions, e.g., control-A and control-E to jump to start and end of line, respectively.

Thursday, September 1, 2016

MEGA65 Emulator can run kickstart and disk menu

Just a quick post while I am on the road to share Gábor's latest progress on the MEGA65 emulator: It can now run kickstart and supports hypervisor traps and SD card access, sufficient to boot, and run the disk menu program (which makes heavy use of hypervisor DOS calls):

Monday, August 22, 2016

Double-tap RESTORE to enter Hypervisor "freezer"

After a lot of little bug fixing, I finally got the keyboard based trap-to-Hypervisor function working.  While this might sound dull at first, it is actually super important, and enables a bunch of really interesting and fun features.

The best way to think of this, is as an integrated "freeze" button, that is triggered by tapping the RESTORE key twice in quick succession (between 50ms and 800ms apart).  When this occurs, it triggers a transparent trap to the Hypervisor. That is, the running program doesn't have the slightest idea that it has happened.  This allows the Hypervisor to run some arbitrary routine, before exiting back to the running program.  In other words, it really is just an integrated freeze function. Right now, that routine just toggles between slow and fast CPU speed, which is kind of fun, but not really that exciting.  Here is the current Hypervisor RESTORE trap routine:

; For now we just want to toggle the CPU speed between 48MHz and
; 1MHz

; enable 48MHz for fast mode instead of 3.5MHz
lda $D054
eor #$40
sta $D054

; enable FAST mode,
lda $D031
ora #$40
sta $D031

; bump border colour so that we know something has happened
lda $D020
and #$0f
sta $D020

; return from hypervisor
sta hypervisor_enterexit_trigger

There isn't anything really to exciting to see there: It just fiddles a couple of registers to toggle the CPU speed between normal and 48MHz, increments the border colour as a bit of a debug aid, and then exits from the hypervisor.

For the technically inclined, what you might be noticing what isn't there.  As I have talked about previously, the Hypervisor trap process is super efficient: The entire CPU state is saved to shadow registers, resulting in a 1 cycle entry and exit time from the Hypervisor, because you don't need to save or restore any CPU or memory mapping registers.  The CPU is automatically set to a known configuration on entry to the Hypervisor, and restores the running program's configuration on exit.  Thus, this entire Hypervisor trap takes only about 40 cycles, or about 850 nanoseconds.  Most modern desktop processors probably would have trouble beating that.  Indeed, as previously mentioned, a minimalistic Hypervisor trap can complete in under 200 nanoseconds.

Anyway, back to the story at hand...

Like on a freeze cartridge, we will implement a freeze menu, that will allow a number of useful operations.  The usual staples will be there, including memory monitor, the option to reset the machine, probably some poke finder type functions, the option to freeze the currently running program to disk, and so on.

However, the MEGA65 has been designed from the outset to do much more in the freeze menu.

First up, you will be able to switch tasks, by browsing through the list of tasks, complete with 80x50 pixel thumbnails that are drawn using the previously described hardware thumbnail generator, that continuously generates little screen captures of the running program.

Similarly, you will be able to delete tasks and start new ones.

So, for example, if you have a sudden need to show off your BASIC programming prowess half-way through a game of Ghosts and Goblins, you can just double-tap RESTORE, choose the menu option to create a new C64-mode task, demonstrate your elite status by typing something like:


and then when you have demonstrated your mastery over coding to whoever was doubting it, you can double-tap RESTORE again, and switch back to Ghosts and Goblins, which you can easily find from the thumbnails.

Similarly, when approaching a hard part of a game, you could freeze it, make a back up of the game where you are up to, and then go on to play that hard level, and reload the saved state until you can conquer it.  In this way, mere mortals should be able to get a score of at least 7 in Flappy Birds without too much trouble.

While these use-cases might be a bit simplistic and contrived, it is hopefully not too hard to see how the Hypervisor freeze menu will likely play a central role in the use and experience of the MEGA65 for many.  Thus it is really nice to have the hardware side of it implemented.  The next step is to start working on the menu program itself, the freeze/unfreeze routines, and getting saving to the SD card actually working, so that things can get saved.

Sunday, August 14, 2016

Tutorial Video for m65dbg

Gürçe who has been working on the very nice m65dbg symbolic debugger for the MEGA65 has released a nice video providing an introduction to the current feature-set of m65dbg:

Saturday, August 13, 2016

We can include GEOS with the MEGA65

Just a very quick but super exciting note to say that we have received permission to include GEOS with the MEGA65, provided that we do not charge any extra for doing so.  Since we are creating our software suite as open-source, this is no problem at all for us, and means that we can potentially use GEOS to make the MEGA65 configuration menus etc using GEOS, which would make them look nice, and probably be much quicker to write as well.

Sunday, August 7, 2016

Booting GEOS on the MEGA65

Ralph Egas, CEO of Abstraction Games has been helping to port GEOS to the MEGA65, using the disassembly of the GEOS 2.0 Kernal by Maciej Witkowiak.

This has been super exciting, because we have been wanting for some time to get GEOS running on the MEGA65, partly because we know that it should be VERY fast on the MEGA65, even without a RAM expander, because the SD card interface can transfer data faster than the REU on a C64 or C128 can.

However, we weren't sure that it would be easy to do, because GEOS is infamous for its horrible copy protection, which I hadn't realised JUST how horrible/clever it was until I read several pages at that link.

However, Maciej's disassembly of the GEOS kernal removes all such problems for us, and presents the disk drivers as nice discrete modules.  Thus, in theory, all that was needed, was to write a C65/MEGA65 disk routine.

For simplicity and speed of development, Ralph decided to make a version of GEOS that would access the floppy drive(s) via the normal CBM DOS routines,  without any fast loader. This allowed him to test that version under VICE, for very a rapid development cycle, especially since VICE could be run in warp mode, without having to exactly emulate the floppy drive, since it was only being accessed using the official C64 KERNAL routines.

Once Ralph had that working, the plan was to start implementing the MEGA65 SD card routines.  However, he decided to try this de-fast-loaded version on the MEGA65, and was pleasantly surprised to find that it worked:

This is because, like in VICE, by using only the official KERNAL disk routines, the C65's 1581 emulation DOS was able to service the sector reads. He only hit trouble at this point, when trying to write sectors, because the MEGA65's emulation of the C65's floppy controller currently has some problems with writing sectors to the SD card.  We'll fix that as soon as we get the chance to do so.

My first comment to Ralph, after congratulating him, of course, was how slow it was to load.  This was slightly tongue in cheek, because it clearly loads VERY fast.  However, it is still using the C65's 1581 DOS emulation routines, which context switch (very slowly!) on every byte read or written to the internal drive. This costs hundreds of cycles per byte, yielding a maximim disk speed of somewhere around 15 - 30KB/sec.  In contrast, the SD interface is capable (currently) of a theoretical maximum of 3MB/sec, and speeds in the 100s of KB/second are quite easy to achieve. Also, GEOS doesn't know about the MEGA65's DMA controller, and so memory fills are much slower than they could be*.  Thus, I think it should be possible to speed up the loading time by an order of magnitude or so, so as to seem instantaneous after hitting "return" after loading the program.

You can see the current state of the source code on github.  Ralph hopes to implement the native SD card routines soon, which would get us a fully working, and much faster booting GEOS.  He might then look into using MEGA65/C65 features, such as the extra RAM, DMA controller, and improved screen resolutions and colour depths.

* Probably "only" 1MB - 2MB/second using a typical 6502 memory copy routine.

Wednesday, August 3, 2016

We finally have the new disk menu mostly working

Finally we have the new disk menu (mostly) working on the MEGA65:

This is no small milestone, because it doesn't touch the SD card directly at all.

It uses Hypervisor calls to do everything: to check that it is running on a MEGA65, to list the files on the SD card, and then to ask the Hypervisor to mount the disk image, which in turn does some checks to make sure the disk image is fit for mounting. We can now proceed to implementing more Hypervisor calls with confidence.

Also, by using the Hypervisor, the program is able to be quite compact: less than 2KB, despite having a full screen browser to select the disk images, and a (currently disabled) sort routine to show the images in the right order. Indeed, almost 1/4 of the size is text messages.

This now sets the scene for us to progress all of the other previously blocked progress on the Hypervisor, to make the whole machine pleasant to use.

Monday, August 1, 2016

Revenge of the Decimal Flag

[Deutsch Übersetzen unten]

I first learned to hate the 6502's decimal flag when I was about 17.  I was still at school, and was offered a job of writing the software for a dual-6502 based 30' (8+m) industrial roll former (the machine that puts the corrugations into corrugated iron sheeting).  If was a bit older and wiser, I might have thought twice about it, but at the time it seemed like a great idea. Anyway, the process went surprisingly well except for one intermittent bug: Sometimes it wouldn't count the distances out correctly.

This caused a hair raising last 3 days before the blasted thing was due to be shipped off to Argentina while we tried to track down the source.  The cause turned out to be that the 6502, while specified in the data-sheet to start up with the decimal flag clear, starts up, in fact, with the decimal flag only usually clear.  Needless to say, one of the absolute first things that I did with the MEGA65 was make sure that its CPU always starts up with the decimal flag clear.  End of problem. Well, so I thought...

Some of you will be aware that I have been trying to track down the source of some nasty bugs in the Hypervisor, where making trap calls into the Hypervisor would sometimes fail for seemingly inexplicable reasons.  This came to a head when I tried to build the new Disk Menu program into the Hypervisor ROM, instead of the amazingly horrible diskchooser thing that I cobbled together back in the beginning.

With the Disk Menu built in (running in user mode, not in the Hypervisor, but installed automatically at $C000 by the Hypervisor), the trap bug was happening consistently.  Big problem.  Especially since months of looking at it intermittently failed to provide a simple explanation.

Today in the lab we had a breakthrough.  Ben pointed out that the Hypervisor checkpoint debug system was displaying a corrupted message when we tried to debug the Disk Menu program.  This gave us a clue that I started to follow, and with a bit of poking around and following the single-step trace output (Gurce, please feel free to make that program that can show the instruction disassembly for the serial monitor as soon as you like :), I realised that the Hypervisor was incorrectly calculating the skip address when stepping over the message for a checkpoint.  It was adding #$01 to #$1D and getting #$24 as the result.

I was about to start pulling my hair out and wonder exactly what had gone so badly to pot that the ALU was now not even able to compute a simple addition.  Then some little neuron in the back of my mind told me that it looked like the result of a Binary Coded Decimal (BCD) addition.  I then cast my eyes to the right on the trace output to look at the CPU status register, and sure enough, there was the evil "D" staring at me: The CPU was in decimal mode! In the Hypervisor!

It then occurred to me that when the CPU traps to the hypervisor, it preserves the D flag's value, as well as saving it in the hyper_p register for restoration on exit.  A single line change to the code there has fixed that (and I also added a CLD instruction to the main Hypervisor trap entry point because I am paranoid).  Then it was time to go home, so we will have to try it out in the morning.  Hopefully it will let us finally get the Disk Menu program working, which will be a very big step forward.

This still leaves the mystery of how the D flag got set in the processor flags to begin with.  It really shouldn't have, as the Disk Menu program doesn't use decimal mode, either.  That will have to wait until tomorrow as well to be investigated.

For now, I am just happy that we have finally found the main bug, and can hopefully start moving forward again.


Ich war nur 17, wenn ich erst gelernt das "Dezimal Modus" Fahne des 6502s zu hassen. Ich war noch in der Schule und bekam ein Job, das Software ein 8,7m lang Roll Former Maschine zu schreiben. Es benutzt ein doppel-6502 Platte und hatte nur 8KB RAM pro CPU. Wenn ich weiser war, dann würde ich es nicht akzeptiert haben.  Alles ging ziemlich gut. Dass ist, bis wir hatten nur drei Tage, es zu fertigstellen, bevor es an Bord ein Schiff nach Argentinien muss. Alles funktioniert gut, außer dass manchmal zählt es die Lange falsch. Endlicht entdeckt uns, dass nur meistens gestartet ein 6502 mit dem D-Fahne leer, obwohl das Data-Sheet sagt es immer ohne D-Fahne starten würde.

Natürlich mit dem MEGA65, einige die erste Sache, dass ich gemacht hatte, war die D-Fahne aus am Reset machen. Ich hatte gedacht, dass alles war am Ende mit der blöden D-Fahne. Aber das war nicht so.

Wir haben für ein paar Monaten ein böse Bug im Hypervisor gekriegt. Manchmal, wenn eine Programme ein Hypervisor-Trap gerufen, dann wird es falsch gehen.  Ab und zu hatten wir es untersucht ohne Erfolg. Dann Heute im unseren Labor hat Ben bemerkt, dass wann wir die neue Disk Menu Programme debuggt, dass eine von Checkpoint Nachrichten war immer beschädigt. Endlich hatten wir ein reproduzierbar Bug, dass wir könnten untersuchen.

Nur eine Stunde später hatte ich entdeckt, dass die blöde D-Fahne war auf. Im Hypervisor! Ich war gar nicht glücklich.  Aber wie bekam die D-Fahne auf?  Dann erkannte ich, dass ich der D-Fahne Wert konserviert, wenn ein Hypervisor Trap auftritt.  Es braucht nur Eine Linie zu reparieren, nach wir hatten das Problem entdeckt. Aber es war schon Feierabend. Deshalb müssen wir Morgen prüfen, wenn es das Problem wirklich repariert.

Thursday, April 21, 2016

On cycle count predictability and related things

Some folks have expressed their concern that this CPU redesign takes away from the genuine 8-bit computer feel of the MEGA65.  My feeling is that it doesn't which I will explain below, but at the same time I don't want to be dismissive of anyone's concerns. Our goal remains to make something that is authentic and enjoyable for a wide range of people to use and program.  So please poke me, either in the comments or elsewhere if you wish.

But for now, I will take a little time to explain how the CPU looks from the user-perspective, to hopefully provide some assurance that it is not really a great departure from what we already had.  Indeed, from what I understand, what we doing here is not greatly different from how the Chameleon's CPU operates, i.e., some more modern CPU construction techniques are used behind the scenes, to provide what is very much (in their case) a 6502.

The main difference is that we are being transparent how we are making the CPU behind the scenes, so that it gives the end result of being a 6502 and 4502 compatible CPU.  We're sorry if that spoils the "magic trick" for some, but we strongly believe that transparency is always best in the long run.

The out-of-order instruction retirement is just a fancy way of saying that the CPU takes and executes the instructions in order, but some can take longer to complete, for example if they need to read or write from memory.

What doesn't change, is if an instruction requires the value read from memory, that it can't be completed until the thing it depends on is complete.  That is, it still behaves exactly as one expects a 6502 to behave, for any given program.  This is quite similar in many ways to the way that the SuperCPU has a 1-byte write-through "cache."  We are just using a different mechanism (register renaming, or reservation slots, depending on how you want to look at it), but to achieve much the same goal.

So if we look at a simple loop:

l1: lda $1000,x
sta $2000,x 
bne l1         

The simulation of this loop for the new CPU (in its current unfinished form, so there might be some changes) below shows how a couple of loop iterations go through. Note that register contents are BEFORE the instruction is executed, just because of how the simulation outputs stuff.  i.e., it shows the CPU state just before it executes the instruction, instead of just after.

-- LDA / STA / INX / BNE instructions all execute on consecutive cycles, taking
-- a total of only 20ns
@450ns: PC $8104 A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--I--  :  BD 00 10
@455ns: PC $8107 A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  9D 00 20
@460ns: PC $810A A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  E8 D0 F7
@465ns: PC $810B A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--I--  :  D0 F7 4C
-- 60ns ( = 12 CPU cycles) elapse between the branch and the next instruction
@525ns: PC $8104 A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--I--  :  BD 00 10
@530ns: PC $8107 A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  9D 00 20
@535ns: PC $810A A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  E8 D0 F7
@540ns: PC $810B A:00 X:03 Y:00 Z:00 B:00 SP:01FF --E--I--  :  D0 F7 4C
-- 60ns ( = 12 CPU cycles) elapse between the branch and the next instruction
@600ns: PC $8104 A:00 X:03 Y:00 Z:00 B:00 SP:01FF --E--I--  :  BD 00 10

What can basically be seen above is that the non-branching instructions all take one cycle to run, whether or not they need a memory access, because all the out-of-order retirement and register renaming hides that. The result is that the timing is actually somewhat simpler and easier to predict for the most part than on a real 6502.  Note that we will still have a ~1MHz, ~2MHz and ~3.5MHz speed settings, where we will emulate the normal 6502 and 4502 timing of all instructions, and when we get time to do it, to make the memory access cycles also match that of a 6502 exactly, and naturally also the same for 3.5MHz 4502 mode. (One of the key reasons for reimplementing the CPU this time, is actually to make sure it has two "personalities", where in 6502 mode, all illegal opcodes work properly, and when in 4502 mode, all 4502 opcodes work properly, and can match the timing exactly -- so that we can have a real C64 mode and a real C65 mode, both of which are as compatible as possible.

The other obvious thing is that the branch instruction suffers a pretty big penalty, which is because the pipeline takes a bit of time to start feeding the new instructions.  However, because the clock speed is 4x, and the main pipeline is 4-stage, the end result is that the branch actually takes exactly the same amount of time as on our previous 48MHz CPU design.

It's also worth mentioning that most of the sources of timing uncertainty in modern PC processors etc don't actually come from the pipeline and other features that we are talking about here.  (In fact, the 6502 already pipelines between instructions a little). They come from the cache, from virtual memory, from the operating system that is hiding behind and pre-empting your process all the time and filling the cache with rubbish as a result. We are not having any of that stuff in the MEGA65: What you get is a 6502 or 4502 processor, that behaves how you expect.  We have just implemented it using some lessons learned over the past few decades of CPU implementation.

Otherwise, I think that this work has got some folks thinking about what makes a machine have character, instead of just being an 8-bit version of another wise soulless kind of PC, or FPGA-centric thing that people build.  For us, there are some key things, of which the following are a few.  Of course, we are thinking about many other things, such as C64 compatibility, but we take these simply for granted. 

First, the video generation MUST be rasterised, without a frame-buffer, just like on a real C64 or C65.  That is, the video chip needs to be deciding, cycle by cycle, what colour the next pixel will be, and allow the programmer to do horrible things to it that were never intended.  It is already possible, for example, on the VIC-IV in the MEGA65 to cause a monitor to totally lose sync, because you can trick it into moving the HSYNC pulses on a raster line.  I get back to this again below, but it is really a very important point.  In fact, I would say that what really makes the C64 interesting is the VIC-II and the SID. The CPU, while still important, is really secondary in many ways. It is the custom chips and overall combination that really define the "character" and "personality" of the C64.  The MEGA65 will of course have a its own personality, but we still feel that it will indeed have a personality, and that it will be a very strong one.

Second, it has to still be a simple bare-metal machine, where you have effectively full access to all the hardware when you are running on it.  The only piece we have outside that is the Hypervisor, which is best understood as an integrated freeze cartridge, so that you can easily load, save and switch what you are doing.

Third, the machine must still have fundamental limitations, that provide opportunity for programmers to try to stretch what the machine can do.  This is why we have the combination of CPU and resolution improvements together, for example, so that the relationship of CPU performance and the number of bits on screen at a time remain in reasonable relation.  The C64 has 64000 pixels from not more than 64KB = 512kbit on screen at a time, and 1x10^6 cycles or 3x10^5 instructions per second, so that there is approximately one instruction per bit of displayed graphics per second.  The MEGA65 has about 2x10^6 pixels, and is expected to have some multiple of 10^7 instructions per second.  Thus the instructions per pixel-bit is increased by an order of magnitude over the C64, so that it offers a nice bit of extra freedom, but without removing the limitation completely.  (Compare that with a modern PC, which instead has about 10^10 instructions per second, not counting the 10^12 or more GPU instructions per second). Moreover, the number of bits per pixel available from RAM is still in proportion: A C64 has about 8 bits per pixel available (64KB / 64000 pixels). The C65 actually has less, because while it has 128KB of RAM, it can do, for example 640x400 or 1280x400 resolutions. The MEGA65 goes further, having the same RAM as the C65, but with many more pixels, much more creativity will be required to find solutions to having full screen full-colour displays -- just as this presents special challenges and opportunities for ingenuity on the C64 (and C65). My point here is really that while the boundaries of what is "possible" on the MEGA65 are naturally different to those of the C64, we have retained this sense of a limited computer, so that it still has character, and will still require years of careful thought and experimentation to find its limits.

Finally, the specification has to be fixed for the long-term, like the C64's, so that people can program it with confidence, knowing that their code will "just work" on MEGA65s for years and decades to come, because otherwise the limits of the machine are not real.  This is actually why we want to get this CPU matter sorted out sooner rather than later, so that we can say with authority, "This is the CPU of the MEGA65. It shall be no faster."  Similarly, we want to pin down the last few points on the VIC-IV

Anyway, as I have said, we want this machine to be fun for the community, and something with a stable and fixed specification once we release it, so that it can have a long life, including so that stuff that you write with cycle-by-cycle timing will keep on working.  So please don't hesitate to let us know if you have concerns about our approach, or suggestions how we can do it better. This is one of the great things about an open-source project, that people can look and provide feedback and help to make sure that the end result is as good as possible. We can't guarantee that we can take everyone's requests and include them (partly because some of you ask for opposite things ;), but we do listen and think carefully about them all.

Tuesday, April 19, 2016

Overview of the new CPU design (as of today)

This is totally subject to change without notice, but the following gives an over-view of the design of the new CPU:

The gs4502b will, at this stage, be a pipelined, triple-core, out-of-order instruction retirement, register-renaming processor with parallel instruction pre-fetch buffer and self-modifying code hazard avoidance.  

We also plan for it to run at 192MHz.

Okay, that's all fairly technical gobbledey-gook, so lets break it down:

First, for the most part, the features of the processor will increase the rate at which instructions can be processed, i.e., decrease the number of cycles that many instructions will take. The main exception is the fact that it is pipelined, means that in some cases instructions may take more cycles, in particular branches or self-modifying code, because the pipeline will have to flush.  However, the increase in clock speed means that it will be extremely hard to craft a set of instructions that run slower on this processor than on our existing 48MHz one.

Now moving to the specific features:

A pipelined processor is one that breaks instructions down into separate little bits, like reading the instruction, decoding the instruction, actually doing the instruction, and writing back any results to memory, and so on.  One instruction can be doing each of those things at any point in time, so while an individual instruction takes longer, the number of instructions per cycle doesn't have to drop.  The big advantage is that a pipeline usually allows the clock speed to be increased -- this is exactly why we are employing one on the new 4502b.

The down side to a pipeline is that if the pipeline has to be flushed, it takes a while for it to start executing instructions again.  This was the problem with the Pentium 4 processors that used crazy pipelines to push the clock-speed way high, but didn't have enough cache memory to sustain the pipeline, meaning that actual performance was often quite poor. However, on the MEGA65, the CPU is effectively operating from cache the whole time, as the BRAM we use for the main memory is internal to the FPGA, and can be accessed as fast as the cache on a typical processor.

The 4502b will also be triple core. The first core will be the "CPU", and the 2nd and 3rd cores will be primarily for floppy drive emulation.  However, when you don't want or need to emulate a floppy drive, they will be available for use by the programmer.  Also, at this stage, the cores will be able to be set in two different performance modes: In the one mode, the primary core gets priority, so that it can run as fast as possible.  In the other mode, all cores will share the memory bus more fairly, and so while the first core will likely still run fast, it won't be as fast as in the first mode, but this will be offset in most cases by the increased performance of the 2nd and 3rd cores. Of course, this will require software that is designed to take advantage of the extra cores, of which none currently exists -- although I did write some dual-processor 6502 code back in the 1990s, but that's a story for another day.

The CPU will also support out-of-order instruction retirement.  This means that while instructions will start executing in the correct order, quick instructions will be allowed to finish while slower ones will continue in the background.  This will allow more instructions to be processed per unit time, by reducing the amount of the time the CPU sits blocked waiting for memory accesses to complete. In particular, memory reads and writes will continue in the back-ground, without blocking the CPU from executing new instructions, unless the new instructions depend on the results of the old instructions, for example if we have LDA $1234  followed by ADC $3456, the ADC would normally need to wait for the LDA to finish so that we can have the result ready to use as input to the ADC instruction.  However, even then, it will sometimes be possible to continue processing, where we can easily predict where the result will come from, as in this example, by using register renaming.

Register renaming is a fancy trick, where we can have multiple versions of a register at the same time. Using the example from above, we can say that one version of the accumulator register will get its value from location $1234.  Then when we want to use ADC to calculate based on that renamed register, we can tell the appropriate part of the CPU that the input to ADC is in fact the output from the previous instruction, by giving it the name that the result will have.  If that all sounds crazily complicated, don't worry too much. Just understand that it helps the CPU to go a lot faster, especially when there are a lot of memory accesses.  For those interested, wikipedea has a good page on this.

The CPU will also have a parallel instruction pre-fetch buffer. This is really a simple little thing that holds the up-coming 16 instruction bytes, and allows an entire instruction to be dispatched every single cycle, unless the buffer is empty, or the CPU pipeline stalls.  This means that instructions that used to take upto 7 cycles on a regular 6502 can sometimes be executed in just one cycle*!

Of course the asterisk is there, because there can be a lot of reasons why this might not happen in practice.  But in theory, the new CPU will be able to execute 192 million instructions per second.  This compares with the approximately 10 - 20 million instructions per second that the existing 48MHz CPU can achieve, and of course looks quite absurd next to the ~250,000 - 300,000 instructions that a real C64 could execute per second.  And that is using just one core on the 4502b.  The theoretical peak performance will be 576 million instructions per second, although as anyone who knows CPU benchmarks will know, that the reality might be only 10% - 50% of that figure.  Nonetheless, that is still very, very fast for an 8-bit CPU.

Finally, to make sure that all existing software can run on it, the CPU will include self-modifying code hazard avoidance. This is just a fancy way of saying that the CPU will realise when a program modifies itself, and flush the pipeline whenever it needs to.  The only trade-off is that code that modifies itself might suffer a penalty of about 10 - 20 cycles each time it modifies itself.  Of course, at 192MHz, that is still less than 105 nano-seconds.  That is, a worse-case pipeline stall on this processor will stall the CPU for only about 1/10th of a cycle when compared to a 1MHz 6502.

So anyway, that's the current thinking on this processor, and when I get the chance, I will provide an update on how far along the implementation currently is, and give some tentative simulation results to give an idea of how fast the processor might end up in practice.

Planning for stability, and the case for re-working the CPU

One of the things that we are determined to do with the MEGA65, is to have the core functions of the machine ready and stable from the outset.

In particular, we want people to be able to have dependable cycle timing for the CPU and VIC-IV, so far as is possible, so that people can safely write games and demos for it, without worrying about future updates breaking things.

The problem at the moment is that our existing 48MHz CPU doesn't meet timing closure, i.e., it is too fast for what the FPGA can guarantee will work, and it's also much too big: It takes up somewhere between 15% - 30% of the very large FPGA we are using.  This is a Bad Thing. Especially since we don't yet have a CPU core for an emulated floppy drive, and we would really like to be able to emulate two floppy drives at the same time, so we would need 3 CPUs.  Of course the floppy drive CPUs can be 6502s instead of 4502s, which simplifies things a bit, but it was still running the risk of being much too big.

Also, the current CPU has a couple of weird bugs that are proving hard to track down, because the existing CPU has been built by accretion, as I have realised things that need to be in it.

So, while on the one hand, it feels like we are going backwards in the short-term, I have started implementing an all new CPU, that will be much smaller, will meet timing closure, and will generally be simpler and easier to understand, and therefore to debug.

This will in fact be the 3rd or 4th CPU design for the MEGA65, depending on how you count things, and will also incorporate what I have learnt through that process, and also some other modern CPU features that I have been reading up on.  The net result is that the new CPU should be quite a lot faster than the current design, but you will have to wait for future blog posts to find out how fast, because even I don't yet know how fast it will end up being.

So expect a few blog posts over the coming days and weeks as I go through the design of the CPU, and document the process of getting it to work.

Sunday, April 17, 2016

We've entered the MEGA65 into this year's hackaday contest

Hop over to the link below and take a look, and if you like, help us spread the word:

Meanwhile, we have not been idle in the background, but have some fun progress to report as soon as I get the chance to write about it.

Tuesday, March 22, 2016

First community developed title with MEGA65 support?

We recently found out about the following release:

This is basically a Wolfenstein-like ray-tracing engine for the C64, which seems to be amazingly memory efficient and rich in functionality. In fact, it seems to be capable enough to make an actual Wolfenstein-style game on the C64.

It is also EXTREMELY slow on a real C64. In fact, even on a SuperCPU, or in VICE running in warp-mode on my i7 Mac, it is still quite slow.

What is very nice is that the author appears to have added support for the MEGA65.

Here is a video of it running on the MEGA65 at 48MHz.  Compared to my 2.7GHz i7 Mac running VICE, the MEGA65 is somewhere between 2x and 5x faster, as far as I can judge.  It would be great if the demo included a frames per second measure, so that we can benchmark more precisely.  It would also be great if someone with a Chameleon could test it on there, so that we can get a sense of the relative speed between the Chameleon and current MEGA65 state.

Prototyping the keyboard

We have just received a prototype keyboard that we have had WASD Keyboards build for us. This keyboard is based on a normal 87-key keyboard layout, and is intended to be a design that can be used for people building their own MEGA65 using the Nexys4 FPGA boards.

As the keyboard layout doesn't exactly match the C65's keyboard, we have had to make some adjustments, which I will explain in a moment. But first, the pictures:

First up, we think that WASD have done a great job, and made a very beautiful keyboard. We thought quite hard about the keyboard layout, to make it both useful and as true as possible to the original keyboard layout, and preserve the "8-bit" feel.

First, the top line from RUN/STOP to HELP match the C65 keyboard exactly. We then also put CLR/HOME and RUN/STOP on two of the three keys to the right, so that it would be easy to find those keys using muscle memory, to overcome the interference of the six key block below these three keys.  The pound key is also put on that row, because there wasn't room for it in the main block. This is also why the up-arrow and equals keys are moved to the key blocks on the right.

We then used the remaining four spare keys there for some common ASCII characters that were missing from the original keyboard, to make the MEGA65 keyboard more useful.  We also added a FAST/SLOW key for changing the CPU speed more conveniently.

The other most obvious changes are to have duplicated the C= / MEGA key, so that there is one on the right as well as left, and the inclusion of the FIRE and JOY LOCK keys.  Those two keys allow use of the cursor keys as a make-shift joystick.  The FIRE key simply mimics the fire button on a joystick, and by pressing the JOY LOCK key, the cursor keys toggle between acting as normal keys or as a joystick. SHIFT or one of the other modifier keys (yet to be decided) will be able to toggle which joystick port is to be controlled in this way.

We are still working on having a completely custom keyboard made, which would also have symbols printed on the front of the keys like on a real C65, instead of just on the top of the keys, and would exactly match the C65 keyboard in key layout.  More on that as we make progress.

Sunday, March 13, 2016

We're still alive / Wir leben noch


Just a short post to break the radio silence, and let you all know that after a quiet period over Christmas and start of University semester, we are picking up steam again.

First, we now have a couple of students working on VHDL and other aspects of the project, together with another volunteer.  So we have gone from having just me working on all the VHDL alone to having a lovely team now! I am extremely glad of this, partly because I know there are plenty of better VHDL coders than me, but also because tracking down bugs in VHDL code is best done with plenty of time and focus, which I haven't had at my disposal lately.  Thus we will hopefully get the known CPU bugs fixed over the next few weeks, and progress forward with implementing 6502 illegal opcodes, improving the CIA implementation, adding bitplanes and doing all the other stuff that needs to happen on the VHDL front.

Second, we have been making some nice progress on the keyboard front, including with some really great volunteer assistance, for which I am also very grateful.  We are simultaneously exploring options ranging from custom key-caps on a standard 87-key USB keyboard, through to full custom manufacture of a keyboard. We have no firm idea of the costs at either end of the spectrum, but we do have some clear paths now to get to that point.  Hopefully I will be able to share some images of the keyboard designs, so that we can have some nice pictures instead of just text in the next blog post.

So, that's about it for now, just, as I said, to let you know that we are still alive and working on stuff.


Vielleicht  wissen schon einige von euch, dass ich Deutsch lerne. Deshalb werde ich einige Posts auch auf deutsch schreiben.  Natürlich ist mein Deutsch noch nicht sehr gut. Aber ich werde es versuchen.

Es hat eine ganze Weile gedauert, seit dem letzten Post.  Weihnachten und Sylvester ist vorbei und auch der Anfang des Universitätsjahres.  In Australien beginnt das Universitätsjahr im Februar und im November ist es zu Ende.  Inzwischen habe ich nicht viel Zeit gehabt, aber jetzt habe ich ein bisschen. Viel besser, wir haben neuen Studenten und Freiwillige, die dem MEGA65 Projekt helfen. Wir haben zwei Studenten und einen Freiwilligen, um mit dem VHDL zu helfen und einen Freiwilligen mit der Tastatur zu helfen. Deshalb sollten wir bald neue Bitstreams und auch interessante Fortschritte mit der Tastatur haben.  Ich freue mich sehr darüber, dass wir die CPU Fehler endlich reparieren und die andere VHDL Sache abhaken können. Das braucht natürlich seine Zeit.

In Bezug auf der Tastatur, schauen wir um verschiedene Optionen an. Auf der einen Seite, denken wir über eine 87-Taste USB Tastatur Lösung nach. Das wir natürlich billiger sein, aber nich so schön wie eine ganz maßgeschneiderte Tastatur. Aus diesem Grund kommen auch maßgeschneiderte Tastaturen in Frage. Er wird mindestens ein paar Wochen brauchen, bevor wir eine genauere Vorstellung von den Kosten haben. Unser super Freiwilligen helfen bei dieser Sache sehr.

Saturday, January 2, 2016

Making Hypervisor debugging easier

The hypervisor has bugs. This isn't really a surprise.

The problem is how to make it easy and efficient to debug the hypervisor.  I can already put break-points on the CPU via the serial monitor, however, that is a rather frustrating process, partly because the assembler I am using doesn't emit a nice symbol table to let me easily work out where to break.  So I wanted an easier approach.

I know there is a bug in the DOS function for opening and reading from a directory from user-land, which makes the whole machine reset back to kickstart trying to boot the machine.  This was the catalyst for me to find a nicer way.

So what I have done is to implement a checkpoint function in the hypervisor, which writes characters directly to the serial monitor interface.  With that, I could easily emit messages indicating where in the hypervisor the code has reached, and thus help to track down where things are going west.

I wanted this checkpoint system to be as simple to use as possible, so that I could easily add and remove checkpoints throughout the hypervisor code, but at the same time I want the checkpoint routine to show useful information, such as the address it was called from, and the current contents of the major registers.  So with something like:

   jsr checkpoint

I want it to produce output on the serial interface like:

Checkpoint @ $886C A:$0C, X:$04, Y:$10, Z:$03, P:$34 :

That would be a big step forward, but what I would really like, is to be able to attach text messages to each checkpoint, so that I can very easily relate them back to the source code, e.g.:

   jsr checkpoint
   .byte 0,"dos_readdir <1>",0

Should generate something like:

Checkpoint @ $886C A:$0C, X:$04, Y:$10, Z:$03, P:$34 :dos_readdir <1>  

That way, with checkpoints throughout the code, I should easily be able to generate progress logs of the hypervisor like the following, which traces a single call to dos_readdir:

Checkpoint @ $886C A:$0C, X:$04, Y:$10, Z:$03, P:$34 :dos_readdir <1>           
Checkpoint @ $8885 A:$00, X:$04, Y:$10, Z:$03, P:$37 :dos_readdir <2>
Checkpoint @ $889C A:$02, X:$00, Y:$1A, Z:$03, P:$37 :dos_readdir <3>
Checkpoint @ $88DC A:$80, X:$10, Y:$1A, Z:$03, P:$37 :drd_isdir
Checkpoint @ $88F4 A:$00, X:$FF, Y:$1A, Z:$03, P:$B5 :drce1
Checkpoint @ $8904 A:$02, X:$00, Y:$1A, Z:$03, P:$37 :drce1 <2>           
Checkpoint @ $8915 A:$81, X:$00, Y:$1A, Z:$03, P:$B5 :drce1 <3>           
Checkpoint @ $8923 A:$81, X:$00, Y:$1A, Z:$03, P:$B5 :drce_next_piece <1>
Checkpoint @ $8960 A:$00, X:$1B, Y:$0C, Z:$03, P:$37 :drce_next_piece <2>
Checkpoint @ $8986 A:$00, X:$00, Y:$0C, Z:$03, P:$36 :drce_next_piece <3>
Checkpoint @ $8A1A A:$00, X:$01, Y:$03, Z:$04, P:$36 :drce_ignore_lfn_piece
Checkpoint @ $8A39 A:$C0, X:$01, Y:$10, Z:$04, P:$B5 :drce_ignore_lfn_piece <2>
Checkpoint @ $8923 A:$C0, X:$01, Y:$10, Z:$04, P:$B5 :drce_next_piece <1>
Checkpoint @ $8960 A:$00, X:$1B, Y:$0C, Z:$04, P:$37 :drce_next_piece <2>
Checkpoint @ $8986 A:$00, X:$00, Y:$0C, Z:$04, P:$36 :drce_next_piece <3>
Checkpoint @ $89B8 A:$67, X:$05, Y:$0B, Z:$00, P:$36 :drce2
Checkpoint @ $89DA A:$2E, X:$0B, Y:$1A, Z:$00, P:$36 :drce3
Checkpoint @ $89FC A:$38, X:$0D, Y:$20, Z:$00, P:$36 :drce4
Checkpoint @ $8A1A A:$00, X:$0D, Y:$20, Z:$00, P:$36 :drce_ignore_lfn_piece
Checkpoint @ $8A39 A:$E0, X:$0D, Y:$10, Z:$00, P:$B5 :drce_ignore_lfn_piece <2>
Checkpoint @ $8923 A:$E0, X:$0D, Y:$10, Z:$00, P:$B5 :drce_next_piece <1>
Checkpoint @ $8A77 A:$22, X:$1B, Y:$0B, Z:$00, P:$35 :drce_shortname
Checkpoint @ $8A96 A:$5F, X:$1B, Y:$00, Z:$00, P:$34 :drce_shortname <2>
Checkpoint @ $8ABA A:$31, X:$0B, Y:$0B, Z:$00, P:$37 :drce5
Checkpoint @ $8B28 A:$0D, X:$0B, Y:$0B, Z:$00, P:$35 :drce_already_have_long_name
Checkpoint @ $8B78 A:$22, X:$04, Y:$0B, Z:$00, P:$35 :drce_fl <1>
Checkpoint @ $8BAB A:$01, X:$04, Y:$10, Z:$00, P:$37 :drce_noteof

The source code for part of the hypervisor that produces the above log has simple and easy to read (and later remove) checkpoint annotations:

; Get the current file entry, and advance pointer
; This requires parsing the current directory entry onwards, accumulating
; long filename parts as required.  We only support filenames to 64 chars,
; so long names longer than that will get ignored.
; LFN entries have an attribute byte of $0F (normally indicates volume label)
; LFN entries use 16-bit unicode values. For now we will just keep the lower
; byte of these

jsr checkpoint
.byte 0,"dos_readdir <1>",0

; clear long file name data from last call
lda #0
sta dos_dirent_longfilename_length

jsr checkpoint
.byte 0,"dos_readdir <2>",0
jsr dos_file_read_current_sector

jsr checkpoint
.byte 0,"dos_readdir <3>",0

ldx dos_current_file_descriptor_offset
lda [dos_file_descriptors+dos_filedescriptor_offset_mode],x
cmp #dos_filemode_directoryaccess
beq drd_isdir
cmp #dos_filemode_end_of_directory
bne drd_notadir

jsr checkpoint
.byte 0,"dos_readdir <4>",0

The checkpoint function itself is not too complex. The main things it needs to do is to save the major registers into some scratch space, pop the return address of the stack, so that we can show the call address in the output, and also update the return address to skip any message to be displayed.  

Other than that, it is mostly focused on preparing the message to be written to the serial monitor interface.  To keep the hardware simple, the serial monitor doesn't look after the timing between characters, so if the checkpoint routine sends characters too quickly, they will be lost. Thus the checkpoint routine has a loop to that waits between sending successive characters.  Since each checkpoint message is about 100 bytes long, and the serial monitor runs at 230400bps, this means we should be able to output 200 checkpoint messages per second, if we get the delay loop exactly right.

For those with a morbid interest in these things (or who would like to spot bugs for me), here is the checkpoint function in its entirety:

; Routine to record the progress of code through the hypervisor for
; debugging problems in the hypervisor.
; If the JSR checkpoint is followed by $00, then a text string describing the
; checkpoint is inserted into the checkpoint log.
; Checkpoint data is recorded in the 2nd 16KB of colour RAM.

; Save all registers and CPU flags
sta checkpoint_a
stx checkpoint_x
sty checkpoint_y
stz checkpoint_z
sta checkpoint_p

; pull PC return address from stack
; (JSR pushes return_address-1, so add one)
adc #$01
sta checkpoint_pcl
adc #$00
sta checkpoint_pch

; Only do checkpoints visibly if switch 12 is set
lda $d6f1
and #$10
beq cp9

inc $d020

; Write checkpoint byte values out as hex into message template
ldx checkpoint_a
jsr checkpoint_bytetohex
sty msg_checkpoint_a+0
stx msg_checkpoint_a+1
ldx checkpoint_x
jsr checkpoint_bytetohex
sty msg_checkpoint_x+0
stx msg_checkpoint_x+1
ldx checkpoint_y
jsr checkpoint_bytetohex
sty msg_checkpoint_y+0
stx msg_checkpoint_y+1
ldx checkpoint_z
jsr checkpoint_bytetohex
sty msg_checkpoint_z+0
stx msg_checkpoint_z+1
ldx checkpoint_p
jsr checkpoint_bytetohex
sty msg_checkpoint_p+0
stx msg_checkpoint_p+1
ldx checkpoint_pch
jsr checkpoint_bytetohex
sty msg_checkpoint_pc+0
stx msg_checkpoint_pc+1
ldx checkpoint_pcl
jsr checkpoint_bytetohex
sty msg_checkpoint_pc+2
stx msg_checkpoint_pc+3

; Clear out checkpoint message
ldx #39
lda #$20
cp4: sta msg_checkpointmsg,x
bpl cp4
; Read next byte following the return address to see if it is $00,
; if so, then also store the $00-terminated text message that follows.
; e.g.:
; jsr checkpoint
; .text 0,"OPEN DIRECTORY",0
; to record a checkpoint with the string "OPEN DIRECTORY"
ldy #$00
lda (<checkpoint_pcl),y

bne nocheckpointmessage

; Copy null-terminated checkpoint string
ldx #$00
cp3: lda (<checkpoint_pcl),y
beq endofcheckpointmessage
sta msg_checkpointmsg,x
cpy #40
bne cp3

; Skip $00 at end of message

; Advance return address following any checkpoint message
adc checkpoint_pcl
sta checkpoint_pcl
lda checkpoint_pch
adc #$00
sta checkpoint_pch

; Only do checkpoints visibly if switch 12 is set
lda $d6f1
and #$10
beq checkpoint_return

; output checkpoint message to serial monitor
ldx #0
cp5: lda msg_checkpoint,x
sta hypervisor_write_char_to_serial_monitor

; delay at least 2,000 cycles to allow character to be written
; each inner loop is 2 + 256 * (2+3) = ~1,250 cycles
; so 2 such loops should take long enough
ldy #2
ldz #0
cp6: inz
bne cp6
bpl cp6

cmp #10
bne cp5
; restore registers
lda checkpoint_p
lda checkpoint_a
ldx checkpoint_x
ldy checkpoint_y
ldz checkpoint_z

; return by jumping to the 
jmp (checkpoint_pcl)

and #$f0
jsr checkpoint_nybltohex
and #$0f
jsr checkpoint_nybltohex
and #$0f
ora #$30
cmp #$3a
bcs cpnth1
cpnth1: adc #$06

; checkpoint message
msg_checkpoint:      .byte "Checkpoint @ $"
msg_checkpoint_pc:    .byte "%%%% A:$"
msg_checkpoint_a:     .byte "%%, X:$"
msg_checkpoint_x:     .byte "%%, Y:$"
msg_checkpoint_y:     .byte "%%, Z:$"
msg_checkpoint_z:     .byte "%%, P:$"
msg_checkpoint_p:     .byte "%% :"
msg_checkpointmsg:    .byte "                                        "
     .byte 13,10  ; CR/LF