Wednesday 30 May 2018

Resurrecting the MEGA65 VNC server interface

Ages back we had a VNC server for the MEGA65, partly just for fun, and partly as a way to get nice digital screen shots of the MEGA65 for use here on the blog and elsewhere.  This all fell into disrepair after a while as we focussed on other things, including the change of video mode and the related activity around that, to make sure that the M65 platform is stable for some time to come.

Anyway, now I want to be able to make nice digital screen captures again, instead of taking photos all the time, so I have spent a couple of days getting it all working again, and trying to make it quite a bit better than it was before.

Back almost four years ago (gadzooks, we have been working on the MEGA65 for a long time now!) the VNC display was extremely lethargic, in part because at 1920x1200 we could transmit only 1 in every 13 raster lines over 100Mbit ethernet, as any more would have simply eaten all the bandwidth: 1920x1200 = ~2Mpixels x 60 frames per second = ~120MiB/sec, which is roughly 10x more than 100Mbit ethernet can carry.

But now that we are at 800x600, the bandwidth equation is quite a bit different.  800x600 = ~0.5Mpixels, and so a full 50 frames per second in PAL needs only about 25MiB/sec -- still too much, but so tantalisingly close that I thought about how I could compress the data stream enough for it to work in the typical case.

100Mbit ethernet can realistically do about 10MiB of useful data per second, so we need to reduce the average data per pixel to less than 10/25 of a byte, i.e., about 3.2 bits per pixel.  At the same time, it would be nice to have at least 12-bit colour depth, so that the images look nicer than the 3-3-2 8-bit colour cube I had to use in the old implementation.
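
Just as a sanity check, that budget arithmetic can be expressed as a few lines of C (the 25MiB/sec and 10MiB/sec figures are the rough estimates from above, not measured values):

#include <stdio.h>

int main(void)
{
  /* Rough estimates from above: 800x600 at 50Hz is about 25MiB/sec at one
     byte per pixel, and 100Mbit ethernet carries about 10MiB/sec usefully. */
  const double raw_mib_per_sec    = 25.0;
  const double usable_mib_per_sec = 10.0;

  double bytes_per_pixel = usable_mib_per_sec / raw_mib_per_sec;
  printf("budget: %.2f bytes/pixel = %.1f bits/pixel\n",
         bytes_per_pixel, bytes_per_pixel * 8.0);
  return 0;
}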

I figured a nice bit-packed compressed format should do it: a 0 means repeat the same colour as the last pixel, and 10 means use the next most recently used colour.  With this, a 2-colour screen, like the C64 start-up screen, should average (1+2)/2 = 1.5 bits per pixel -- easily within our envelope. But of course most of the time the border is a solid colour, as is much of the screen, so I also added a 16-bit sequence that indicates that up to 255 pixels of the same colour occur in a line.  Then I added some four-bit sequences so that we can cheaply switch among the five most recently used colours, as I figure this should probably be sufficient for most purposes.
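
To make the shape of that format a bit more concrete, here is a minimal sketch of a decoder for such a stream, in C like the real decoder.  The exact prefix codes, the layout of the 16-bit run sequence, and the way brand new colours get introduced are not spelled out above, so the bit assignments below are illustrative assumptions only, not the actual MEGA65 encoding:

#include <stdint.h>
#include <stdio.h>

#define RECENT 5   /* track the five most recently used colours */

struct bits {
  const uint8_t *buf;
  size_t len;   /* buffer length in bytes */
  size_t pos;   /* current bit position, MSB first within each byte */
};

/* Pull n bits from the stream; returns -1 if the stream runs dry. */
static int take(struct bits *b, int n)
{
  int v = 0;
  while (n-- > 0) {
    if (b->pos >= b->len * 8) return -1;
    v = (v << 1) | ((b->buf[b->pos >> 3] >> (7 - (b->pos & 7))) & 1);
    b->pos++;
  }
  return v;
}

/* Move entry idx of the recently-used colour list to the front. */
static void promote(uint16_t recent[RECENT], int idx)
{
  uint16_t c = recent[idx];
  for (int i = idx; i > 0; i--) recent[i] = recent[i - 1];
  recent[0] = c;
}

/* Decode at most max pixels (12-bit colours) into out[]; returns the count. */
static size_t decode(struct bits *b, uint16_t *out, size_t max)
{
  uint16_t recent[RECENT] = { 0 };      /* assume the list starts out all black */
  size_t n = 0;

  while (n < max) {
    int bit = take(b, 1);
    if (bit < 0) break;
    if (bit == 0) {                     /* "0": repeat the previous colour */
      out[n++] = recent[0];
      continue;
    }
    bit = take(b, 1);
    if (bit < 0) break;
    if (bit == 0) {                     /* "10": the next most recent colour */
      promote(recent, 1);
      out[n++] = recent[0];
      continue;
    }
    int sel = take(b, 2);
    if (sel < 0) break;
    if (sel < 3) {                      /* "1100".."1110": 3rd..5th recent colour */
      promote(recent, 2 + sel);
      out[n++] = recent[0];
      continue;
    }
    int op = take(b, 4);                /* "1111" + 4-bit opcode (assumed) */
    if (op < 0) break;
    if (op == 0) {                      /* 16-bit code: run of up to 255 pixels */
      int run = take(b, 8);
      if (run < 0) break;
      while (run-- > 0 && n < max) out[n++] = recent[0];
    } else if (op == 1) {               /* literal new 12-bit colour (assumed) */
      int c = take(b, 12);
      if (c < 0) break;
      for (int i = RECENT - 1; i > 0; i--) recent[i] = recent[i - 1];
      recent[0] = (uint16_t)c;
      out[n++] = recent[0];
    } else {
      break;                            /* reserved in this sketch */
    }
  }
  return n;
}

int main(void)
{
  /* A tiny hand-assembled stream: one literal colour, one repeat, then a
     run of three -- five pixels of the same colour in total. */
  const uint8_t stream[] = { 0xF1, 0x01, 0x07, 0x80, 0x18 };
  struct bits b = { stream, sizeof stream, 0 };
  uint16_t px[16];
  size_t n = decode(&b, px, 5);
  for (size_t i = 0; i < n; i++)
    printf("pixel %zu = $%03X\n", i, (unsigned)px[i]);
  return 0;
}

Feeding it the tiny hand-assembled stream in main() produces five pixels of the same colour: one literal colour code, one repeat, and then a run of three.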

Of course, this is all quite a bit more complex than the old one, and so most of the time has been spent debugging all the special cases of bit shuffling in the encoder, which is implemented fully in hardware in VHDL, which makes it all a bit interesting.

After some effort, it mostly works. I can see the C64 and C65 start-up screens, and the C64 screen takes only about 0.5MiB per second to stream at 50Hz, so about 10KiB per frame, which is more than acceptable.

There are always lots of little fiddly bits with these sorts of things, as the state machine for the encoder (in VHDL) and the state machine for the decoder (in C) have to follow each other exactly.  I think it now matches nicely, but I am still seeing the occasional glitched raster, which I think happens when it switches from one packet to the next, or else it could be the little 32-bit buffer for the compressor overrunning, although since it happens even on the 2-colour C64 start-up screen, which definitely lacks the complexity to cause overruns, I suspect this is not the case. It could feasibly be lost packets as well, as the bit stuffing spans packets without resetting.  Anyway, it results in only a few glitched lines per second, which while noticeable when it is running freely, are usually not there if you do a single-frame screen grab, which is the primary purpose of implementing it.

The bigger problem is that the pixel valid signal that tells the frame packer when to capture a pixel is not in sync with the output of the VIC-IV.  This is because the pixel valid signal does not pass through some of the compositing and filtering output stages, and thus arrives a few cycles early.  Because the output pixel clock is a non-integer fraction of the internal clock of the VIC-IV, there is jitter between the two, which means that if the two aren't in sync you don't just get the wrong pixel in a consistent manner, but rather there is some variation as you scan across the line, which effectively distorts the display, like this:


The effect is particularly noticeable here in 80 column mode, because each pixel corresponds exactly to one display pixel -- so we see some pixels doubled in width, while others disappear.  There are also some other little glitches here, like the display not being properly centred (which should be easy enough to fix), and the funny notching into the right border (which I am not sure whether it is caused by the encoder or the decoder; I'll have to do some more testing to work out the cause of that).

So, I added a single cycle delay to the pixel strobe signal, and now it is displaying much more nicely:

While it can't be seen here, the notching of the border is still happening, as is the occasional glitching line.  Nonetheless, it is now at the point of basic usability.

What is annoying is that in the process a memory corruption bug has crept in. I suspect that this is due to lack of timing closure, but I can't be immediately sure.

The next step was to think about how I could set up an easy workflow for capturing high-quality video streams direct from the MEGA65 via this VNC feed.

A bit of digging around revealed that ffmpeg can capture directly from an X11 desktop.  If I started the VNC server and viewer automatically, and worked out where the window was, I could indeed make it record automatically.  This is still a bit of a work in progress, but it already works (Linux only for now):

if [ "x$1" == "x" ]; then
  echo "usage: record-m65 <network interface>"
  echo ""
  echo "NOTE: You must first enable the ethernet video stream on the MEGA65"
  echo "      sffd36e1 29 from the serial monitor interface will do this."
  exit
fi
make bin/videoproxy bin/vncserver
pkill vncserver
sudo echo
sudo bin/videoproxy $1 &
sleep 1
bin/vncserver &
sleep 1
vncviewer localhost &
sleep 2
xwininfo  -name "VNC: MEGA65 Remote Display"
x1=`xwininfo  -name "VNC: MEGA65 Remote Display" \

    | grep "Absolute upper-left X:" | cut -f2 -d: | sed s'/ //g'`
y1=`xwininfo  -name "VNC: MEGA65 Remote Display" \

    | grep "Absolute upper-left Y:" | cut -f2 -d: | sed s'/ //g'`
wmctrl -a "VNC: MEGA65 Remote Display"
rm output.mp4

ffmpeg -video_size 800x600 -framerate 50 -f x11grab \
    -show_region 1 -i :0.0+${x1},${y1} output.mp4
pkill vncserver

Basically it makes sure you have told it which network interface to listen on, makes sure that the necessary tools in the MEGA65 source tree have been built, and then runs the video proxy (this requires root, because at the moment it has to operate as a packet sniffer, since the video-carrying ethernet frames the MEGA65 produces are effectively raw frames), starts the VNC server to use that, uses xwininfo to figure out where the window is on the screen, uses wmctrl to bring that window to the foreground, and then runs ffmpeg to do the capture.
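
For the curious, capturing raw ethernet frames on Linux boils down to something like the sketch below.  This is not the actual bin/videoproxy code, just a simplified illustration of why it has to run as root: a raw AF_PACKET socket sees every frame on the interface.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <netpacket/packet.h>
#include <linux/if_ether.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
  if (argc != 2) {
    fprintf(stderr, "usage: %s <network interface>\n", argv[0]);
    return 1;
  }

  /* A raw packet socket sees every ethernet frame on the interface,
     which is why this (like bin/videoproxy) needs root. */
  int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
  if (fd < 0) { perror("socket"); return 1; }

  struct sockaddr_ll sll;
  memset(&sll, 0, sizeof sll);
  sll.sll_family   = AF_PACKET;
  sll.sll_protocol = htons(ETH_P_ALL);
  sll.sll_ifindex  = if_nametoindex(argv[1]);
  if (bind(fd, (struct sockaddr *)&sll, sizeof sll) < 0) {
    perror("bind");
    return 1;
  }

  unsigned char frame[2048];
  for (;;) {
    ssize_t n = recv(fd, frame, sizeof frame, 0);
    if (n <= 0) break;
    /* A real video proxy would check here that the frame is one of the
       MEGA65's video frames before handing it on to the VNC server. */
    printf("got a %zd byte frame\n", n);
  }
  close(fd);
  return 0;
}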

The main wrinkles at the moment are that the video stream does not contain any audio, so the recorded video has no sound, and that you have to manually stop the script to stop it recording, which means you typically end up with a second or two of terminal window output at the end.

Apart from that, however, it works very nicely, as the following video shows.  The capture is at a full 50Hz, and the quality is great.  Indeed, the quality is so great it records those glitches I mentioned earlier. Because it is a direct digital capture, the files are also quite small.  So in this case, where most of the screen is a single colour, 23 seconds of video requires only 330KiB, and I suspect a lot of that will actually be the second or so of Linux terminal window you see at the end.


You can also see the timing closure problem I mentioned, in the form of the corruption of the last few bytes of the screen.  Finding and fixing the root cause of that is the priority for me now, followed by fixing the visual glitches.

Fixing the timing closure problem turned out to be quite simple, and with that fixed, I could again run some software.  This time we have a 5.5MiB file for about 1.5 minutes, which is still very nice: around 60KiB per second, or a little over 1KiB per frame.   Of course if we add audio in, then this will go up a bit.  But for now, here is me trying to remember how to get the joystick working on PS/2 keyboard input and play a little Impossible Mission:


Some of the glitching in this video confirms that there is an encoder or decoder problem, as the border notching is in fact one too many pixels being decoded somewhere along the line.  It is possible that whatever that problem is, it might in fact be the cause of the glitch lines, if it is actually the encoder and decoder failing to track state correctly. Hopefully those problems won't be too hard to track down.

[Edit: It looks like Blogger has munged the video from the nice crisp videos I uploaded.  However, overall the effect isn't all bad, as the videos look almost like a real CRT display.  So I'll just leave them as they are for now.  The full crispness can be seen in the screen grab above, in any case.]

Wednesday 16 May 2018

Migrating from Xilinx ISE to Xilinx Vivado FPGA software

Until now, we have been using the old (and deprecated) Xilinx ISE software to compile the VHDL for the MEGA65 project. This is all a bit of an accident of history, because when the project first began in 2014, ISE was only just at end of life, and deficiencies in my VHDL programming style meant that I couldn't get it to work in Vivado.  Also, Vivado was less mature at the time.

Now, all that has changed: ISE is well and truly end of life, and approaching the zombie stage.  Vivado is now much more mature. But perhaps most importantly, Kenneth, one of our volunteers, has put in a LOT of work silently in the background helping to move the project over to Vivado.

The work involved, and the value to the project of doing this, cannot be overstated.

First, synthesis under Vivado is fully 10x faster than under ISE.  This means we can do a synthesis run in ~10-15 minutes, instead of ~2-10 hours.  The benefit of this cannot be overstated.

Second, by fixing the memory access semantics of the internal memories in the FPGA, a whole raft of instabilities has been resolved. These instabilities were causing differences in behaviour on different FPGA chips of the same model, and generally causing many lost hours chasing my tail on the symptoms of the problem, rather than the cause.

Third, Vivado achieves better timing closure.  This means that it is easier to get the design to run at the correct clock speed.  It also opens the door to increasing the clock speed in the future.

Finally, we are now somewhat future-proofed for ongoing development.

While most of the changes have been in the background, there are a few practical differences.  One of those is that various fixes along the way have improved our Bouldermark score somewhat to 38,980 (up from about 31,000).  This means we have a Bouldermark score 124x that of a stock C64. However, as previously explained, Bouldermark is a bit non-linear, in that the first few hundred points are quite a lot harder to get than the majority. This is consistent with the results of the Chameleon 64, which gets a Bouldermark score of 44.62x, but only 10.79x on Synthmark (our current Synthmark score is, for reference, 51x).



Kenneth gets an extra gold star for having found and fixed a problem with self-modifying code that was previously causing Bouldermark (and presumably other things) to not run stably.  This was all part of the same memory access semantics problems: In this case, it was possible for the CPU to begin fetching the next instruction before the RAM had time to update internally to present the updated value.  While this sounds absurd, when the clock cycles are only 20ns, propagation time inside components becomes a real consideration.