Still going

Been a little bit lax with the updates, but I am going to put that down to there not being much to write about, although progress and changes have been made to my code.

I have moved the mixing of 4 channels of 8-bit audio over to the DSP, along with the output buffer, which now lives in its cache RAM.  I believe I know what the distortion is now: it looks like the output is missing samples for short periods.  I suspect this is either the CPUs fighting over RAM or catching each other up.  The actual render code is working as intended; I took some dumps of its output from RAM: no distortion before playback, distortion added after playback.
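As a rough illustration of the mixing itself (the real version lives in DSP RISC code, so this C sketch is only a hypothetical model with made-up names): sum the four signed 8-bit channels in a wider accumulator, then clamp back to the 8-bit range so loud passages saturate instead of wrapping.

```c
#include <stdint.h>

/* Hypothetical sketch of 4-channel 8-bit mixing: accumulate four signed
 * 8-bit samples in 16 bits (worst case -512..508), then clamp to int8_t. */
static int8_t mix4(int8_t a, int8_t b, int8_t c, int8_t d)
{
    int16_t sum = (int16_t)a + b + c + d;
    if (sum > 127)  sum = 127;    /* clamp positive overflow */
    if (sum < -128) sum = -128;   /* clamp negative overflow */
    return (int8_t)sum;
}
```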

Moving to the RISC did highlight a bug that had me scratching my head.  I had erroneously operated the interrupt latches within the DSP, and hence not cleared the correct one.  Unbeknown to me, this was preventing the DSP's main loop from running, a main loop that accesses main RAM a fair amount... so when I fixed the latches, the DSP went on a mad bender of RAM access, almost halting the 68K from running at all.  Once that was fixed, all ran as I had hoped.  Alas, the distortion persisted, but the new changes have reduced the amount of time the 68K spends on the job, probably now down to around 5-10% CPU utilisation on the 68K.  Amazing what shaving a few instructions from within a loop will do 🙂

Once I have conquered the distortion issue, I will focus more on getting four channels working and playing through a whole module, rather than looping a single channel.  Tonight is the Computer club at The Lass, so no code tonight, but hopefully the day off tomorrow will have plenty of coding time in it 🙂

Getting there

Whilst the new approach was mostly working, it also had a negative effect: the introduction of some hideous distortion.  I tried a few different things to isolate the cause and find a solution, but was rapidly losing hope of finding the problem.

Thankfully I decided to take a break from it and my latest attempts at solving it, and not more than about 20 minutes into the film (V for Vendetta, if you are interested), it struck me!

The production of the sample data during the VBI was fine; however, I realised that whilst it was generating this sample data it was also blocking all other CPU-based interrupts, which would of course block the 8kHz feeder interrupt that feeds the sample data to the DSP!  I managed to resist the urge to stop the film and try it there and then... which meant I got to enjoy the rest of the film, but tried my idea out well past coding o'clock.

The result?  Near perfect success!  Instead of running the whole sample rendering routine within the VBI, I simply triggered it from the VBI, and bingo!  This is less than ideal of course, but as this is prototyping at this point it's not really a concern.
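The trigger-from-the-VBI pattern can be sketched like this (a hypothetical C model, not the actual 68K code; all names are mine): the VBI handler only sets a flag, and the main loop runs the long render outside the interrupt, so the 8kHz feeder interrupt is no longer blocked while samples are generated.

```c
#include <stdbool.h>

/* Sketch of deferring work out of an interrupt handler. */
static volatile bool render_pending = false;
static int renders_done = 0;                 /* stand-in for the real renderer's effect */

static void render_samples(void) { renders_done++; }

/* Called once per vertical blank: only REQUEST the work. */
static void vbi_handler(void)
{
    render_pending = true;
}

/* Polled from the main loop: the long render runs here, with interrupts enabled. */
static void main_loop_step(void)
{
    if (render_pending) {
        render_pending = false;
        render_samples();
    }
}
```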

I did also try clearing the interrupt latches earlier in the VBI handler to allow other interrupts whilst the sample render ran, but this didn't seem to work too well.  I shall investigate further; I suspect it may be due to the processor having not yet returned from the exception, possibly making it less tolerant of other interrupts.  Quite pleased with myself 🙂

A different approach

Despite heavy optimisation, it seems that my current approach simply will not work the way I have implemented it.  There are too many cycles wasted on updating variables to make it feasible.  It may be possible on the DSP thanks to its high-speed local cache, but on the 68K with DRAM, nope.  So, time for a change of plan.

An alternative approach, mentioned a while ago by Zerosquare, was to simply write the sample data out to a circular buffer and play that, only updating it during the VBI.  The buffer obviously has to contain sufficient sample data to maintain constant playback for the period between VBIs, but this is typically a very small time period (20ms at 50Hz, or 16.6ms at 60Hz), which translates to only needing quite a small buffer (8kHz with a 60Hz VBI is around 134 samples; double that for 16kHz, etc.).  Triggering the code less frequently also means less time is spent saving and restoring CPU state between interrupts; the more complex the code, the more you have to save/restore, so doing so less frequently is a time saving as well.  More time doing actual work and not the associated 'paperwork' around it.
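The buffer sizing above works out as a simple rounded-up division, which can be sketched like this (the helper name is mine, not from the real code):

```c
/* Samples needed to cover one VBI period at a given playback rate,
 * rounded up: 8000 Hz / 60 Hz = 133.3..., so 134 samples. */
static unsigned samples_per_vbi(unsigned sample_rate_hz, unsigned vbi_rate_hz)
{
    return (sample_rate_hz + vbi_rate_hz - 1) / vbi_rate_hz;  /* ceiling division */
}
```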

One additional big advantage to this approach is that there is no need to save incremental updates to the counters for every sample; instead, these counters can be held in the CPU's data registers for the duration of the processing and only written back at the end, significantly reducing the number of memory accesses the CPU makes.
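A sketch of what that looks like (hypothetical C with made-up names; in the real 68K code the locals would simply live in data registers): the channel's position counter is hoisted into a local before the fill loop and written back to memory exactly once afterwards.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative channel state: a 16.16 fixed-point position and step. */
struct channel {
    const int8_t *samples;
    uint32_t pos;    /* playback position, 16.16 fixed point */
    uint32_t step;   /* increment per output sample, 16.16 fixed point */
};

static void fill_buffer(struct channel *ch, int8_t *out, size_t n)
{
    uint32_t pos = ch->pos;           /* hoist counter into a local/register */
    const uint32_t step = ch->step;
    for (size_t i = 0; i < n; i++) {
        out[i] = ch->samples[pos >> 16];
        pos += step;                  /* no memory traffic for the counter */
    }
    ch->pos = pos;                    /* single write-back at the end */
}
```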

So last night, after much mulling over on the sofa, I started this rewrite, already past my stop-coding deadline (if I code much past 21:00 at night it becomes hard to shut down the brain and get some sleep 🙂).  I was quite pleased to have a very rough working version of this code already, and in under an hour of fairly half-arsed hacking.  I have mostly repurposed my existing code, so it is still horrifically inefficient; however, it has already demonstrated a significant time saving even in its current form.  With this working base, reworking it to take full advantage of the new benefits of this technique shouldn't be too difficult.

Hopefully, if I get home tonight with sufficient time, I will have an opportunity to try it.

Timings

I am working away on something I have always wanted to write: a tracker mod player.  Many years ago I mentioned to Tyr that I wanted to write one that resided entirely within the DSP of the Jaguar, and this is my current plan.  Of course, rather than try to learn both the RISC CPU and how to write a tracker player at the same time, I have decided to prototype the design on the 68K.  This should be challenging enough; however, I am very pleased with my initial progress, which I have left to fester for a while due to frustrations and having other things I need to do.

The main issue seems to be that I simply cannot squeeze sufficient time out of the 68K, which is crazy given it's faster than the one in an Atari ST, and there are some rather quick players on the ST.  For some reason I have been struggling to push more than one channel past 8kHz, which is pretty naff (although I am most chuffed with the general code; it just needs more work and optimisation).

This evening I decided to try out one of my ideas: I pondered whether the object processor, having been ignored, had gone into a sulk mode and decided that if it wasn't going to be played with, it would steal some of the bus bandwidth.  So I gave it something to do... no difference... bollocks...

I decided to try some other things before I reached "stop coding o'clock" (if I code past that time, I simply will not sleep! 🙂).  So I decided to look at how much time my code was taking on the CPU, as well as see if I could fire the interrupt faster than I presently was.  To achieve this I removed my playback routines from the interrupt handler and simply replaced them with 2 instructions: one to change the background colour to blue at the start, and one to change it to black at the end, with nothing between them.  This produced some nice small blue lines on the screen.  Perfect: I can now SEE how long it takes the CPU to perform these 2 simple tasks (longer than I thought, too!).
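The raster-bar trick can be modelled like this (a C sketch in which a plain variable stands in for the memory-mapped background colour register; the names and colour values are mine, not the Jaguar's real ones): bracket the measured code with two colour writes, and the band of colour on screen shows how long it took.

```c
#include <stdint.h>

#define COLOUR_BLUE  0x00FFu   /* illustrative values, not real hardware colours */
#define COLOUR_BLACK 0x0000u

static uint16_t fake_reg;                               /* stand-in for the MMIO register */
static volatile uint16_t *border_colour = &fake_reg;    /* point at the real register on hardware */

/* Bracket the workload with colour writes: the raster is blue for exactly
 * as long as work() takes, painting a band proportional to its cost. */
static void timed_handler(void (*work)(void))
{
    *border_colour = COLOUR_BLUE;    /* band starts */
    work();                          /* the code being measured */
    *border_colour = COLOUR_BLACK;   /* band ends */
}

/* Stand-in workload that samples the register mid-measurement. */
static uint16_t colour_during_work;
static void probe_work(void) { colour_during_work = *border_colour; }
```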

For my next trick I re-introduced my playback code, but crippled it down to just the portion of the routine that checks for new samples to play, and re-ran it.  There was now a LOT of blue on the screen, although the amount of code was quite small.  Hmmm, clearly the code is less than optimal time-wise; naturally, as I am coding a prototype, I have not been as stringent as I would be for a final version.

My thoughts at this stage are that I am asking the 68K to access main RAM too much.  As it has no instruction cache of its own, its instructions, and most of its data, have to be dragged from main RAM; in the case of 32-bit data, that alone requires 2 fetches.  To test the effect of reads on main RAM, I removed my playback routine from the interrupt handler again and replaced it with 10 NOP instructions.  Running the test again, I get a slightly longer line than with nothing between the colour changes.  OK, so I replaced the NOPs with ten 16-bit memory reads, and the amount of blue increased GREATLY; it is hard to tell, but instead of being only about 6cm of blue it now seems to wrap across whole scanlines!  Changing this to ten 16-bit moves from one internal data register to another reduces the line back down to a much nicer 8cm.

So... it looks like I am going to have to figure out a way of doing this without the simplicity and comfort main RAM was giving me.  Good job I do this for the fun of the challenge, really 🙂