OK - here is the idea.
Each channel (wave or noise) takes 27 clock cycles.
The most expensive part of each channel's 27 clocks is all the LDS and STS instructions.
For example, here is channel 1:
lds r16, tr1_step_lo
lds r17, tr1_pos_frac
add r17, r16 ;add step to fractional part of sample pos
lds r16, tr1_step_hi
lds ZL, tr1_pos_lo
lds ZH, tr1_pos_hi
adc ZL, r16 ;add step to low byte of sample pos
lpm r16, Z ;load sample
sts tr1_pos_lo, ZL
sts tr1_pos_frac, r17
lds r17, tr1_vol
mulsu r16, r17 ;(sample*mixing vol)
sbc r0,r0 ;sign extend
mov r28,r1 ;set (sample*vol>>8) to mix buffer lsb
mov r29,r0 ;set mix buffer msb
nop
16 of those 27 clock cycles are for loading and storing parameters
7 of them are for frequency divider, reading the sample and applying the volume <<< THE ACTUAL WORK
4 of them are for mixing
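To make the "actual work" part concrete, here is a rough C model of the position update and sample fetch in the listing above. It is only a sketch: the struct, its field names and wave_next() are made up for illustration, and it assumes the wave table is a 256-byte page selected by tr1_pos_hi (which the listing loads but never changes).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C mirror of the tr1_* variables in the listing. */
typedef struct {
    uint16_t pos;        /* high byte = tr1_pos_lo, low byte = tr1_pos_frac */
    uint16_t step;       /* high byte = tr1_step_hi, low byte = tr1_step_lo */
    const int8_t *wave;  /* 256-entry table, stands in for the page in tr1_pos_hi */
} wave_channel;

/* Advance the position by one step and return the sample at the new position. */
int8_t wave_next(wave_channel *ch)
{
    assert(ch != 0 && ch->wave != 0);
    /* add r17,r16 / adc ZL,r16: 8.8 fixed-point add, wrapping inside the table */
    ch->pos = (uint16_t)(ch->pos + ch->step);
    /* lpm r16, Z: fetch the sample at the integer part of the position */
    return ch->wave[ch->pos >> 8];
}
```

In this model everything except the two statements in wave_next() is parameter traffic, which is the 16-clock part that batching is meant to amortise.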
If we change the whole thing so it is partly like the VSync mixer and partly like the current inline mixer, we can almost double the channels. What I mean is that instead of processing each line separately, you bundle them into runs of 4 lines and save samples to RAM, much like the VSync mixer saves 252 samples to RAM. It will be slightly different, as we will be saving pre-mixed 8-bit values for 8 channels for 4 samples (32 bytes) instead. (Maybe you can pre-mix channel pairs and make that 16 bytes. I am not sure how that maths works yet.)
Split the inline-mixer into 4 phases
Phase 1
Read 4 Samples from Channel 1 (Wave)
Write 3 Samples to RAM
Read 4 Samples from Channel 2 (Wave)
Write 3 Samples to RAM
Read Sample 1 of Channel 3,4,5,6,7 and 8 from RAM
Mix all 8 channels for sample 1
Phase 2
Read 4 Samples from Channel 3 (Wave)
Write 3 Samples to RAM
Read 4 Samples from Channel 4 (Wave)
Write 3 Samples to RAM
Read Sample 2 of Channel 1,2,5,6,7 and 8 from RAM
Mix all 8 channels for sample 2
Phase 3
Read 4 Samples from Channel 5 (Wave)
Write 3 Samples to RAM
Read 4 Samples from Channel 6 (Wave)
Write 3 Samples to RAM
Read Sample 3 of Channel 1,2,3,4,7 and 8 from RAM
Mix all 8 channels for sample 3
Phase 4
Read 4 Samples from Channel 7 (Noise)
Write 3 Samples to RAM
Read 4 Samples from Channel 8 (PCM)
Write 3 Samples to RAM
Read Sample 4 of Channel 1,2,3,4,5 and 6 from RAM
Mix all 8 channels for sample 4
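The four phases above can be modelled in C roughly like this. It is a sketch under assumptions, not the real mixer: ring, produced, phase, next_sample() and update_sound() are all invented names, the generator is a placeholder for the real wave/noise/PCM code, and for simplicity it writes all 4 samples to RAM rather than using the first one straight from a register.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CHANNELS 8
#define BATCH 4

static int8_t  ring[NUM_CHANNELS][BATCH]; /* 32 bytes of buffered samples */
static uint8_t produced[NUM_CHANNELS];    /* placeholder per-channel state */
static uint8_t phase;                     /* 0..3: which channel pair is refilled */

/* Placeholder generator: stands in for the real channel code and is assumed
   to return a sample already scaled by volume to a signed 8-bit value. */
static int8_t next_sample(uint8_t ch)
{
    assert(ch < NUM_CHANNELS);
    uint8_t n = produced[ch]++;
    return (int8_t)(((ch + 1) * n) % 16 - 8);
}

/* One "update sound" call: refill one channel pair, then mix one output sample. */
int16_t update_sound(void)
{
    uint8_t p = phase;

    /* Generate 4 samples for the two channels owned by this phase.
       Slot (p + k) & 3 is the output slot k calls from now. */
    for (uint8_t ch = (uint8_t)(2 * p); ch < 2 * p + 2; ch++)
        for (uint8_t k = 0; k < BATCH; k++)
            ring[ch][(p + k) & 3] = next_sample(ch);

    /* Mix slot p of all 8 channels; 6 of them were written by earlier phases. */
    int16_t mix = 0;
    for (uint8_t ch = 0; ch < NUM_CHANNELS; ch++)
        mix = (int16_t)(mix + ring[ch][p]);

    phase = (uint8_t)((p + 1) & 3);
    return mix;
}
```

One thing the model shows is that each channel pair starts one call later than the previous pair, so channels 3 to 8 contribute silence for the first 1 to 3 calls after start-up.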
There is a little more overhead and RAM usage from writing and reading the other 3 phases' samples each time, but you save a lot on those 16 clocks of loading/storing the channel parameters, because they are paid once per 4 samples instead of once per sample.
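As a rough sanity check on the saving, the per-channel cost can be tallied with a couple of throwaway C functions. This is a back-of-envelope sketch: the function names are invented, the 16/7/4 split comes from the channel 1 breakdown, and the buffer cost per stored sample (one STS plus one LDS at 2 clocks each, so 4) and the unchanged 4 clocks of mixing are my assumptions. It ignores the stack and phase-dispatch overhead.

```c
#include <assert.h>

/* Clocks for one channel over n output samples with the current inline mixer. */
unsigned inline_clocks(unsigned params, unsigned work, unsigned mix, unsigned n)
{
    assert(n >= 1);
    return n * (params + work + mix);
}

/* Same for the batched scheme: parameters are loaded/stored once per batch,
   and n - 1 of the samples take a round trip through RAM (buf_rw clocks each). */
unsigned batched_clocks(unsigned params, unsigned work, unsigned mix,
                        unsigned buf_rw, unsigned n)
{
    assert(n >= 1);
    return params + n * (work + mix) + (n - 1) * buf_rw;
}
```

With the numbers above, inline_clocks(16, 7, 4, 4) is 108 and batched_clocks(16, 7, 4, 4, 4) is 72, so on these assumptions the per-channel cost drops to about two thirds before the extra overhead is counted.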
Other things to consider to fully flesh this out
PCM currently takes 45 clocks, but if I make that a fixed-frequency, non-wrapping channel it can be ~18 clocks
Noise also takes 27 clocks but has less LDS/STS fat in it (so may only be able to share with PCM)
There will be about another 20 clocks of overhead pushing and popping more registers on the stack
There will be more overhead (and a counter) deciding which phase to run each time "update sound" is called (guess at 20 clocks also)
There will be 32+ bytes of extra RAM used, and I will have to find that somewhere. At present I have 36 bytes free in T2K below the object store
There will be 600 words of extra flash taken up, which is about 2x what I currently have free in T2K
I don't fully understand the mixing maths. It seems to be 16-bit maths, but the samples are only 8 bits, so this might be a showstopper if I can't just save an 8-bit value for each sample.
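On the mixing maths, here is a small C model of what the channel 1 listing appears to do, offered as a sketch rather than a statement about the whole mixer. MULSU leaves the high byte of (signed sample * unsigned vol) in r1, and "sbc r0,r0" only sign-extends that byte into a second register. If the other channels work the same way (an assumption), each channel's contribution is a signed 8-bit value and the 16 bits are just headroom for the sum. The names scaled_sample() and mix8() are made up.

```c
#include <assert.h>
#include <stdint.h>

/* High byte of signed sample * unsigned volume, i.e. what MULSU leaves in r1.
   Floor division is spelled out to avoid relying on >> of a negative int. */
int8_t scaled_sample(int8_t sample, uint8_t vol)
{
    int prod = (int)sample * (int)vol;   /* range -32640 .. 32385 */
    int hi = prod >= 0 ? prod / 256 : -((-prod + 255) / 256);
    assert(hi >= -128 && hi <= 127);     /* the high byte always fits in 8 bits */
    return (int8_t)hi;
}

/* Sum 8 stored 8-bit values in a 16-bit accumulator, as the mixer would. */
int16_t mix8(const int8_t s[8])
{
    int16_t acc = 0;
    for (int i = 0; i < 8; i++)
        acc = (int16_t)(acc + s[i]);     /* each s[i] is sign-extended here */
    return acc;
}
```

In this model storing one byte per channel per sample loses nothing, and the sign extension can be redone when the byte is read back. Pre-mixing a channel pair is different: two values in -128..126 can sum to anything in -256..252, which needs 9 bits, so a pair would only fit in one byte after halving or clipping.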