Inline Mixer - saving clocks

CunningFellow
Posts: 1445
Joined: Mon Feb 11, 2013 8:08 am
Location: Brisbane, Australia

Inline Mixer - saving clocks

Post by CunningFellow »

I have found a few clock cycles to save in the inLineMixer.s file.

They are not enough to add another sound channel there, but they may become useful for something.

Here are 3 clocks to be saved (I already showed this in the T2K thread). Of course you could move the code that uses T to update r18 to just after the first OUT if you need to free up T earlier.

Code: Select all

in    r18, SYNC_PORT            ; read current state of port
sbrc  ZL, 0
ori   r18, (1<<SYNC_PIN)        ; SYNC_PIN is a bit number, so build the mask
bst   ZL, 1                     ; copy the hsync pulse bit to T register for later use

; blah
; blah
; blah
; blah

;*** Video sync update ***
out   SYNC_PORT, r18
;*************************


; blah
; blah
; blah
; blah

;*** Video sync update ***
brtc  SKIP                      ; if the T bit was previously set
ori   r18, (1<<SYNC_PIN)        ; then set the SYNC_PIN
SKIP:
out   SYNC_PORT, r18
;*************************
Then in the LFSR noise generator there are 3 more clocks to be saved (1 from reordering the shift, 2 from moving the wait loop).

Code: Select all

mov   r0, r16       ;copy barrel shifter
lsr   r0
eor   r0, r16       ;xor bit0 and bit1
bst   r0, 0
lsr   r17
ror   r16
bld   r17, 6        ;15 bits mode
sbrs  ZH, 0
bld   r16, 6        ;7 bits mode
Instead of copying into r0, explicitly shifting the copy and then XORing, just copy into r0, shift the original (which was going to happen anyway) and then XOR.

so change to

Code: Select all

mov   r0, r16       ;copy barrel shifter
lsr   r17
ror   r16
eor   r0, r16       ;xor bit0 and bit1
bst   r0, 0
bld   r17, 6        ;15 bits mode
sbrs  ZH, 0
bld   r16, 6        ;7 bits mode
And that saves 1 clock
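A quick way to check that the reordering is safe is to model both instruction sequences and compare them over every possible LFSR state. The Python below is only a sketch for that purpose, not kernel code: the function names are made up for illustration, and `seven_bit_mode=True` is assumed to mean bit 0 of ZH is clear (the case where `sbrs` does not skip the second `bld`).

```python
def _shift_right_16(lo, hi):
    """lsr r17 / ror r16: 16-bit logical shift right, returns (lo, hi)."""
    return ((hi & 1) << 7) | (lo >> 1), hi >> 1


def _insert_feedback(lo, hi, t, seven_bit_mode):
    """bld r17,6 always; bld r16,6 only when sbrs ZH,0 does not skip it."""
    hi = (hi & 0xBF) | (t << 6)
    if seven_bit_mode:
        lo = (lo & 0xBF) | (t << 6)
    return (hi << 8) | lo


def lfsr_original(state, seven_bit_mode):
    lo, hi = state & 0xFF, state >> 8
    tmp = (lo >> 1) ^ lo              # mov r0,r16 / lsr r0 / eor r0,r16
    t = tmp & 1                       # bst r0,0
    lo, hi = _shift_right_16(lo, hi)
    return _insert_feedback(lo, hi, t, seven_bit_mode)


def lfsr_reordered(state, seven_bit_mode):
    lo, hi = state & 0xFF, state >> 8
    tmp = lo                          # mov r0,r16 (copy taken before the shift)
    lo, hi = _shift_right_16(lo, hi)  # new bit 0 of lo is the old bit 1
    t = (tmp ^ lo) & 1                # eor r0,r16 / bst r0,0
    return _insert_feedback(lo, hi, t, seven_bit_mode)


def orderings_match():
    """Compare both orderings for all 65536 states in both modes."""
    return all(
        lfsr_original(s, m) == lfsr_reordered(s, m)
        for s in range(0x10000)
        for m in (False, True)
    )
```

Calling `orderings_match()` walks all states in both modes, so it covers the whole input space rather than a few samples.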

Finally

Code: Select all

    sts   tr4_barrel_lo, r16
    sts   tr4_barrel_hi, r17

    rjmp ch4_end
ch4_no_shift:
    ;wait loop 21 cycles
    ldi r17,6
    dec r17
    brne .-4

ch4_end:
    sts tr4_divider, ZL
    ldi r17, 0x80         ;-128
Make "ch4_no_shift" out of band. It is just a wait loop, so it does not matter if it is not efficient. That saves the 2 clock cycles wasted on jumping over it.

So there we have 6 extra clocks. There was already a NOP in there. So just 20 more clocks to save to squeeze in an extra wave channel (no, I don't think it is really possible).
CunningFellow
Posts: 1445
Joined: Mon Feb 11, 2013 8:08 am
Location: Brisbane, Australia

Re: Inline Mixer - saving clocks

Post by CunningFellow »

Oh - I think that 6 clock saving might be enough to fit in an extra WAVE channel if the PCM is the simple PCM that can only play a fixed speed/length sample.

I know the use case of a PCM sample in normal music could not use that, but in cases like "The Minds Eye" where the PCM is just a sample of the word "play" it works fine.
Jubatian
Posts: 1564
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Inline Mixer - saving clocks

Post by Jubatian »

It could be useful for digital effects, if one decides to stream such in from SD card (or SPI RAM), for example (saving on flash). I even had ideas for simple digital music compression, of course irrelevant at the point of implementation there (requiring a 262 byte RAM buffer filled, no matter what), just that such things could make it more useful to have a fixed frequency PCM channel on top of the rest.

(At any rate, it is definitely more useful than unconditionally dumping six or so NOPs in there! :P )
CunningFellow
Posts: 1445
Joined: Mon Feb 11, 2013 8:08 am
Location: Brisbane, Australia

Re: Inline Mixer - saving clocks

Post by CunningFellow »

Jubatian wrote:It could be useful for digital effects, if one decides to stream such in from SD card (or SPI RAM), for example (saving on flash). I even had ideas for simple digital music compression, of course irrelevant at the point of implementation there (requiring a 262 byte RAM buffer filled, no matter what), just that such things could make it more useful to have a fixed frequency PCM channel on top of the rest.

(At any rate, it is definitely more useful than unconditionally dumping six or so NOPs in there! :P )
It means losing the current full-featured PCM channel and replacing it with a Wave channel + dumb PCM channel, NOT adding an extra PCM channel. It takes 18 clocks to add an extra dumb PCM channel, so we are still 12 away there.

BUT now that you mention SD card: those 6 clocks could be used to stream music from SD card if something in VSync could point to the start of a sector each field (and waste 250 bytes from each sector of the SD card).
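For reference, the 250 bytes come out of simple arithmetic. The sketch below assumes 512-byte SD sectors and one 8-bit sample per scanline, using the 262-byte per-field buffer mentioned above; the constant names are made up for illustration.

```python
SECTOR_BYTES = 512          # assumed SD card sector size
SAMPLES_PER_FIELD = 262     # one sample per scanline, per the 262-byte buffer above

# Starting each field on a fresh sector leaves the rest of that sector unused.
wasted_per_sector = SECTOR_BYTES - SAMPLES_PER_FIELD
used_fraction = SAMPLES_PER_FIELD / SECTOR_BYTES

print(wasted_per_sector, round(used_fraction, 3))
```

So roughly half of the card space holding the stream would be padding, which is the price of never having to track a position inside a sector.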
CunningFellow
Posts: 1445
Joined: Mon Feb 11, 2013 8:08 am
Location: Brisbane, Australia

Re: Inline Mixer - saving clocks

Post by CunningFellow »

OK - my version of Channel 4 (LFSR Noise channel)

I think it is correct. Comment if you think it is wrong.

It is 5 clocks faster than the present implementation.
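The rewrite drops the MULSU and uses half the volume directly as the mixed sample. For anyone wondering how close that is to the multiply, here is a rough Python check. It is a sketch only: the function names are illustrative, the volume is assumed to be an unsigned byte, and the samples are assumed to be +127 and -128 as in the current kernel code.

```python
def mulsu_high_byte(sample, vol):
    """Signed high byte of sample*vol, which is what MULSU leaves in r1."""
    return (sample * vol) >> 8        # Python's >> floors, like two's complement


def shortcut(vol, negate):
    """lsr r1, then an optional neg r1."""
    half = vol >> 1
    return -half if negate else half


# Largest difference between the two methods over every possible volume.
worst_error = max(
    max(
        abs(mulsu_high_byte(127, v) - shortcut(v, False)),
        abs(mulsu_high_byte(-128, v) - shortcut(v, True)),
    )
    for v in range(256)
)
```

Under those assumptions the two methods never differ by more than one step of the 8-bit output.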

Code: Select all

    ; Note - for the Noise channel the SAMPLE is only ever +127 or -127
    ; Therefore (Volume * sample)  will always be either (volume/2) or -(volume/2)


    ;channel 4 - 7/15 bit LFSR 
    lds   r16, tr4_barrel_lo    ; Get the LFSR (16 barrel shifter)
    lds   r17, tr4_barrel_hi

    lds   r1, tr4_vol           ; get the Volume
    lsr   r1                    ; Divide it by 2 to get the "sample" ((+127 * vol) >> 8) ~= (vol >> 1)
    sbrc  r16,0                 ; if the LSB of the LFSR is one
    neg   r1                    ; then make "sample" negative

    lds   ZL, tr4_divider       ; load the divider
    dec   ZL	                ;
    brpl  ch4_no_shift	        ; if not enough ticks have elapsed then don't shift the LFSR

                                ; Otherwise reload the divider by
    lds   ZH,tr4_params         ; Getting the parameters (7bits of divider + 1 bit of mode)
    mov   ZL,ZH                 ; copy the parameters to ZL (the divider register)
    lsr   ZL                    ; shift right to remove the mode bit and keep bits7:1

                                ; then perform the actual LFSR shifting by
    mov   r0, r16               ; copying low byte of LFSR to a temp for XOR operation
    lsr   r17                   ; shift the 16 bits of the barrel shifter
    ror   r16                   ; leaving the old bit 0 in Carry (Same bit used to decide +ve or -ve "sample" above)
    eor   r0, r16               ; perform the XOR of bit 0 and bit 1
    bst   r0, 0                 ; Save that XOR'd bit to T
    bld   r17, 6                ; Write T to the 15th bit of the LFSR (regardless of mode as 7 bit will overwrite it)
    sbrs  ZH, 0                 ; If the 7/15 mode bit indicates 7 bit mode then
    bld   r16, 6                ; Store T to the 7th bit of the LFSR

    sts   tr4_barrel_lo, r16    ; save the LFSR
    sts   tr4_barrel_hi, r17

ch4_end:
    sts   tr4_divider, ZL       ; Save the divider

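    sbc   r0, r0                ; Sign extend the "sample": r0 = 0xFF if C is set (C is still the old LFSR bit 0 from "ROR r16")
                                ; NOTE: the ch4_no_shift path (not shown) must leave C set the same way before rejoining here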
    add   r28, r1               ; add (sample*vol>>8) to mix buffer
    adc   r29, r0
CunningFellow
Posts: 1445
Joined: Mon Feb 11, 2013 8:08 am
Location: Brisbane, Australia

Re: Inline Mixer - saving clocks

Post by CunningFellow »

Instead of having the divider in ZL be shifted out of the parameter byte, you can leave bit 0 in place.

Then instead of DEC to see if you are ready for a shift, you can SUBI Rx, 2.

That will cause a carry when it wraps and leave bit 0 unaffected.

So that saves 2 more clocks.
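To see that the SUBI form reloads after the same number of ticks as DEC/BRPL, and that the mode bit really survives, here is a small Python model. It is a sketch with made-up function names, and it assumes the parameter byte holds the divider in bits 7..1 and the mode in bit 0, as described in this thread.

```python
def subi_step(counter):
    """subi ZL,2 -> (new 8-bit counter, carry flag set on borrow)."""
    return (counter - 2) & 0xFF, counter < 2


def ticks_until_reload_dec(params):
    """Old scheme: counter = params >> 1, reload once DEC makes it negative."""
    counter, ticks = params >> 1, 0
    while True:
        ticks += 1
        counter = (counter - 1) & 0xFF   # dec ZL
        if counter & 0x80:               # brpl falls through -> reload
            return ticks


def ticks_until_reload_subi(params):
    """New scheme: counter = params unshifted, reload when SUBI borrows."""
    counter, ticks = params, 0
    while True:
        ticks += 1
        counter, carry = subi_step(counter)
        if carry:                        # brcc falls through -> reload
            return ticks


same_period = all(
    ticks_until_reload_dec(p) == ticks_until_reload_subi(p) for p in range(256)
)
mode_bit_kept = all((subi_step(c)[0] & 1) == (c & 1) for c in range(256))
```

Both checks pass for every possible parameter byte, so the period stays at divider + 1 ticks and bit 0 is never disturbed by the countdown.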

Code: Select all

    ; Note - for the Noise channel the SAMPLE is only ever +127 or -127
    ; Therefore (Volume * sample)  will always be either (volume/2) or -(volume/2)


    ;channel 4 - 7/15 bit LFSR 
    lds   r16, tr4_barrel_lo    ; Get the LFSR (16 barrel shifter)
    lds   r17, tr4_barrel_hi

    lds   r1, tr4_vol           ; get the Volume
    lsr   r1                    ; Divide it by 2 to get the "sample" ((+127 * vol) >> 8) ~= (vol >> 1)
    sbrc  r16,0                 ; if the LSB of the LFSR is one
    neg   r1                    ; then make "sample" negative

    lds   ZL, tr4_divider       ; load the divider
    subi  ZL,2                  ; Decrement bits 1..7 leaving bit 0 untouched by subtracting 2
    brcc  ch4_no_shift          ; if not enough ticks have elapsed then don't shift the LFSR

                                ; Otherwise reload the divider / Parameters which
    lds   ZL,tr4_params         ; consist of 7 bits of divider + 1 bit of mode

                                ; then perform the actual LFSR shifting by
    mov   r0, r16               ; copying low byte of LFSR to a temp for XOR operation
    lsr   r17                   ; shift the 16 bits of the barrel shifter
    ror   r16                   ; leaving the old bit 0 in Carry (Same bit used to decide +ve or -ve "sample" above)
    eor   r0, r16               ; perform the XOR of bit 0 and bit 1
    bst   r0, 0                 ; Save that XOR'd bit to T
    bld   r17, 6                ; Write T to the 15th bit of the LFSR (regardless of mode as 7 bit will overwrite it)
    sbrs  ZL, 0                 ; If the 7/15 mode bit indicates 7 bit mode then
    bld   r16, 6                ; Store T to the 7th bit of the LFSR

    sts   tr4_barrel_lo, r16    ; save the LFSR
    sts   tr4_barrel_hi, r17

ch4_end:
    sts   tr4_divider, ZL       ; Save the divider (plus 7/15 mode bit in LSB)

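    sbc   r0, r0                ; Sign extend the "sample": r0 = 0xFF if C is set (C is still the old LFSR bit 0 from "ROR r16")
                                ; NOTE: BRCC only branches with C clear, so the ch4_no_shift path (not shown) must set C itself when r1 is negative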
    add   r28, r1               ; add (sample*vol>>8) to mix buffer
    adc   r29, r0
That is 32 clocks so 7 clocks faster than at present.
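If you want to re-check the clock counts, a rough tally is easy to script. The Python below is a sketch: it assumes the usual AVR timings (LDS, STS, RJMP and MULSU take 2 clocks, everything else used here takes 1), counts the conditional branch as not taken, and counts a skip plus the one-word instruction after it as 2 clocks either way. The listing names are made up for illustration.

```python
TWO_CLOCK_OPS = {"lds", "sts", "rjmp", "mulsu"}   # all other ops below cost 1


def clocks(listing):
    """Sum the clocks along the LFSR-shifting path of a mnemonic listing."""
    return sum(2 if op in TWO_CLOCK_OPS else 1 for op in listing.split())


CURRENT_KERNEL = """
    lds lds lds dec brpl
    lds mov lsr
    mov lsr eor bst lsr ror
    bld sbrs bld
    sts sts rjmp
    sts ldi sbrc ldi
    lds mulsu sbc add adc
"""

FIRST_REWRITE = """
    lds lds lds lsr sbrc neg
    lds dec brpl
    lds mov lsr
    mov lsr ror eor bst
    bld sbrs bld
    sts sts sts
    sbc add adc
"""

SUBI_REWRITE = """
    lds lds lds lsr sbrc neg
    lds subi brcc
    lds
    mov lsr ror eor bst
    bld sbrs bld
    sts sts sts
    sbc add adc
"""

print(clocks(CURRENT_KERNEL), clocks(FIRST_REWRITE), clocks(SUBI_REWRITE))
```

With those assumptions the three listings come out at 39, 34 and 32 clocks, matching the 5 and 7 clock savings quoted in this thread.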
D3thAdd3r
Posts: 3222
Joined: Wed Apr 29, 2009 10:00 am
Location: Minneapolis, United States

Re: Inline Mixer - saving clocks

Post by D3thAdd3r »

I wish I could brain download your asm thought process. Looks good to me as far as I understand it, though I never would have thought of any of that. Sure it's all in the manual, but that is a lot of details and pattern recognition I think you need in your tool belt to actually put out code like that continually as you do.

Keep it up, these kinds of things are like Uzebox wow factor presents for everyone!
CunningFellow
Posts: 1445
Joined: Mon Feb 11, 2013 8:08 am
Location: Brisbane, Australia

Re: Inline Mixer - saving clocks

Post by CunningFellow »

I will try to teach you :)

It's not actually very glamorous; it is mostly just being methodical.

The first step (as in any endeavour) is to define the problem. If you write all your ASM code something like this, it is not only easier for someone else (or future you) to understand, it is also easier for you to see the big picture.

Code: Select all

; r0  : trash
; r1  : trash
; r16 : Low  8 bits of LFSR Barrel Shifter  /  Later reused for "Volume"
; r17 : High 8 bits of LFSR Barrel Shifter
; r28 : Low  byte of sample accumulator (to add up Channels 1..5)
; r29 : High byte of sample accumulator (to add up Channels 1..5)
; ZL  : Divider / Countdown till next action
; ZH  : Channel Parameters Bit 7..1 = Divider Reload Value   Bit 0 = 7/15 bit LFSR mode

; Each 15.734 kHz
;
; Decrement a counter
; If the counter has rolled over then
;    Reload the counter
;    Perform the Linear Feedback Shift Register Function
;        Calculate the XOR of Bit 0 and Bit 1
;        if LFSR mode is 15 bit then store this result in bit 15
;                               else store the result in bit 7
; else do nothing
; If bit 0 of the LFSR is 1 then "Sample" = +127  (The noise channel is a square wave
;                           else "Sample" = -128    that can only be MAX/MIN SIGNED BYTE)
; Multiply the "Sample" by "Volume"
; Add the Sample to the 'Accumulator" 


    ;channel 4 - 7/15 bit LFSR 
    lds   r16, tr4_barrel_lo        ; Load in the 16 bit LFSR
    lds   r17, tr4_barrel_hi
    lds   ZL,  tr4_divider          ; Load the divider
    dec   ZL	                    ; Decrement the Divider
    brpl  ch4_no_shift	            ; If no overflow then do nothing

    lds   ZH, tr4_params            ; Otherwise get the parameters (bit7..1 divider reload val, bit 0 LFSR mode)
    mov   ZL, ZH                    ; copy the parameters to "divider"
    lsr   ZL                        ; Shift Divider to keep bits7:1 (and get rid of the mode bit)

    mov   r0, r16                   ; Make a copy of the low byte of LFSR for XOR operation
    lsr   r0                        ; Shift the copy so Bit 1 in the copy aligns with bit 0 in the original
    eor   r0, r16                   ; XOR bit0 and bit1
    bst   r0, 0                     ; And copy that result to T

    lsr   r17                       ; Shift the 16 bit LFSR
    ror   r16

                                    ; Copy the XOR'd value into bit 15 (Regardless of mode as writing to bit 7
    bld   r17, 6                    ;                                   after this will overwrite this)
    sbrs  ZH, 0                     ; Check to see if 7 or 15 bit LFSR mode
    bld   r16, 6                    ; and if 7 bits mode then copy the XOR'd value into bit 7

    sts   tr4_barrel_lo, r16        ; Save the 16 bit LFSR
    sts   tr4_barrel_hi, r17

    rjmp ch4_end                    ; Skip over the wait routine
ch4_no_shift:
    ;wait loop 21 cycles            ; Wait routine to make both paths same length
    ldi   r17, 6
    dec   r17
    brne  .-4

ch4_end:

    sts   tr4_divider, ZL           ; Save the divider (This may have been decremented OR reloaded)


    ldi   r17, 0x80                 ; If the lowest bit of the LFSR is 1 then load "Sample" with
    sbrc  r16, 0                    ; +127 otherwise load "Sample" with -128
    ldi   r17, 0x7f                 ; 

    lds   r16, tr4_vol              ; Get the channel volume
    mulsu r17, r16                  ; R1 = (sample*mixing vol) >> 8
    sbc   r0,  r0                   ; sign extend 8 bit R1 to 16 bits (MULSU leaves C = MSB of result)
    add   r28, r1                   ; add low  byte of (sample*vol>>8) to mixing accumulator
    adc   r29, r0                   ; add high byte to mixing accumulator

; 39 Clocks
Now the first and most obvious thing when looking at it: there are two paths through this
  1. Process LFSR
  2. Do Nothing but wait
(Note here that the second path is not actually "Do nothing" but is "Do nothing but wait". It is the "but wait" part that makes it a separate path. If it was just "Do nothing" it would just be a skip down the same path.)

Optimization Rule : Minimize the critical path.

Now the "critical path" is not always the longest. Sometimes the critical path is the one run most often. Sometimes it is just the most important (Save the girl OR save Gotham City).

In this case the critical path is very obvious as the other path is "Do nothing but wait"

So we make the "Do nothing but wait" branch "Out of band". Every time a path splits and then has to come back together there is going to be a BRANCH to split the path and a JUMP to rejoin them.

If we make the BRANCH to somewhere outside of the code then the JUMP to join the paths back together can be in the "Do nothing but wait" path that is otherwise just sitting there twiddling its thumbs.

Now the Critical Path is not being lumbered with the RJMP.
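To put numbers on that for channel 4, here is a rough Python tally of both layouts. It is a sketch under assumed AVR timings (branch taken 2 clocks, not taken 1, RJMP 2); the constants are counted by hand from the listing above and the names are made up for illustration.

```python
PREFIX = 7             # lds, lds, lds, dec before the branch
BRANCH_NOT_TAKEN = 1   # brpl falling through into the LFSR code
BRANCH_TAKEN = 2       # brpl jumping to the wait code
SHIFT_WORK = 17        # reload the divider, shift the LFSR, store it back
RJMP = 2


def layout(wait_in_band):
    """Return (clocks to reach ch4_end, padding the wait path must burn)."""
    if wait_in_band:
        # The critical path pays for the RJMP that hops over the wait loop.
        critical = PREFIX + BRANCH_NOT_TAKEN + SHIFT_WORK + RJMP
        padding = critical - (PREFIX + BRANCH_TAKEN)
    else:
        # The wait path pays for the RJMP back, out of its own idle time.
        critical = PREFIX + BRANCH_NOT_TAKEN + SHIFT_WORK
        padding = critical - (PREFIX + BRANCH_TAKEN + RJMP)
    return critical, padding


print(layout(True), layout(False))
```

Both paths still have to be the same length, so the 2 clocks are saved on every scanline, and the wait path simply has less time left to burn.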

There are many other ways you might optimize the critical path. Another example is when the decision is based on a modified variable: the critical path needs the variable modified, but the non-critical path needs the original value.

Code: Select all

Read variable from memory
Make a copy of the variable
Modify the copy of the variable
Make Decision based on the copy of the variable

Critical Path                              non-Critical Path
Do something with the Modified copy        Do something with original variable
Do other critical stuff
could be changed to

Code: Select all

Read variable from memory
Modify the variable
Make Decision based on the modified variable

Critical Path                              non-Critical Path
Do something with the Modified Variable    Re-Load the un-modified variable from RAM
Do other critical stuff                    Do something with un-modified variable
So you can see we have made the non-critical path slower - but have sped up the critical path by not having to make a COPY of the variable.
CunningFellow
Posts: 1445
Joined: Mon Feb 11, 2013 8:08 am
Location: Brisbane, Australia

Re: Inline Mixer - saving clocks

Post by CunningFellow »

1 - NOP in current inLineMixer.s
3 - HSync update (code shown in T2K thread)
7 - LFSR noise channel (code shown above)
4 - Wave channels (3 clocks saved off each wave channel and loss of 5 clocks to realise this gain)

That is 15 clocks so far.

Swapping the PCM channel for a dumb PCM + an extra Wave channel yields another 7 clocks

22 clocks free. An extra noise channel is 32 clocks.

If I can find another 10 clocks somewhere I can get

4 Wave
2 Noise
1 Sample Playback

for T2K without losing the 32 bytes of RAM the split phase version needed.
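The budget above as a quick Python tally, in case anyone wants to play with the numbers. It is a sketch only: the names are made up, and the count of 3 wave channels is inferred from "3 clocks saved off each" giving a net 4 after the 5 clock cost.

```python
WAVE_CHANNELS = 3   # inferred: 3 * 3 clocks saved - 5 clocks spent = 4

savings = {
    "NOP already in inLineMixer.s": 1,
    "HSync update": 3,
    "LFSR noise channel": 7,
    "wave channels": WAVE_CHANNELS * 3 - 5,
}

found_so_far = sum(savings.values())
with_pcm_swap = found_so_far + 7          # dumb PCM + extra wave channel
NOISE_CHANNEL_COST = 32
still_missing = NOISE_CHANNEL_COST - with_pcm_swap

print(found_so_far, with_pcm_swap, still_missing)
```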
uze6666
Site Admin
Posts: 4801
Joined: Tue Aug 12, 2008 9:13 pm
Location: Montreal, Canada

Re: Inline Mixer - saving clocks

Post by uze6666 »

Nice! All the tricks you found are pretty awesome, especially for the noise channel. I looked at this code so many times to free up some cycles; sometimes you just need an extra pair of eyeballs! :P