Assembler Tips

From Uzebox Wiki

Interfacing with C language components from assembler

You can call C routines or write C-callable assembly routines by understanding the calling convention and how the C compiler uses the AVR's registers.

One very important thing to remember is that r1 *must always* be cleared before invoking C functions, because the compiler assumes that r1 always holds zero. Similarly, r1 must be cleared when returning to a C function from assembler.
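
As a minimal sketch, assuming the usual avr-gcc convention (first 8-bit argument in r24, second in r22, 16-bit result returned in r25:r24), a C-callable routine that uses mul could look like this. The names mul_u8 and c_function are hypothetical and only serve the illustration:

; C prototype (assumed): extern uint16_t mul_u8(uint8_t a, uint8_t b);
.global mul_u8
mul_u8:
    mul   r24, r22     ; r1:r0 = a * b, this destroys the zero register r1
    movw  r24, r0      ; Return value goes into r25:r24
    clr   r1           ; Restore r1 = 0 before returning to C
    ret

; Calling a C function from assembler after r1 was used:
    clr   r1           ; The C code expects r1 = 0
    call  c_function   ; May clobber r0, r18-r27, r30 and r31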


16 bit and 32 bit math

Atmel has a nice cheat-sheet that outlines the algorithms to perform arithmetic on an 8-bit processor.

AVR 201: multiplication and division:

AVR 202: addition, subtraction and comparisons:


Use SUBI+SBCI for Fast additions

When you need to add constant values to registers in tight loops, you can save a register and a load instruction by using SUBI + SBCI instead of ADD + ADC. The trick is to subtract the negated (two's complement) constant.

So instead of:

    ldi   r20, lo8(0x1234)
    ldi   r21, hi8(0x1234)
    add   ZL,  r20
    adc   ZH,  r21

do:

    subi  ZL,  lo8(-(0x1234))
    sbci  ZH,  hi8(-(0x1234))

If the constant you want to add is small, keep in mind that you can also use the ADIW and SBIW instructions (for registers r25:r24, XH:XL, YH:YL and ZH:ZL)!
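
For illustration, a short sketch of the ADIW / SBIW variant. The constant must be in the range 0 to 63, and the register choices here are arbitrary:

    adiw  ZL,  32      ; Z = Z + 32 in a single instruction word (2 cycles)
    sbiw  XL,  1       ; X = X - 1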


Branch by a Single Bit

Sometimes, you just want to test if a bit is set - like a flag. Normally you could use the SBRS or SBRC instructions for this purpose like:

    sbrs  r0,  1   ; Skip if bit 1 is set ((r0 & 0x02) != 0)
    rjmp  bit1_was_clear
    (...)
bit1_was_clear:
    (...)

Of course, if you only have a single instruction to perform when the bit is set or clear, you can place it directly after the skip instruction instead of jumping.
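
A sketch of this single-instruction form (the inc and r24 are arbitrary, chosen only as an example):

    sbrc  r0,  1       ; Skip the next instruction if bit 1 is clear
    inc   r24          ; Executed only when bit 1 of r0 is set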

In some cases, you might not want to keep the value in a register (such as when you are short on registers, for example when designing video mode or inline mixer code). In this case the T flag can come in handy:

    bst   r0,  1   ; Store bit 1 of the register for later use
    (...)
    brtc  bit1_was_clear
    (...)
bit1_was_clear:
    (...)

The T flag also has the peculiar property of not being set or cleared by any operation other than bst, set and clt (apart from writing the whole SREG). So it is immune to arithmetic, comparisons and other bit operations.
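
A small sketch of this property, with arbitrary register choices. It also uses bld, the counterpart of bst, which copies T into a register bit:

    bst   r0,  1       ; T = bit 1 of r0
    add   r24, r22     ; Arithmetic changes C, Z, N, V, S and H, but not T
    cpi   r24, 10      ; Comparisons leave T alone as well
    bld   r18, 7       ; Copy the saved bit into bit 7 of r18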


Jump Table

Jump to a location based on a calculation, much like a switch() in C:

    mov   ZL,  r0  ; Load index (in this example in r0)
    ldi   ZH,  0
    subi  ZL,  lo8(-(pm(jump_table)))  ; Add the word address of the table
    sbci  ZH,  hi8(-(pm(jump_table)))  ; (SUBI + SBCI trick from above)
    ijmp

jump_table:
    rjmp  jump_target_0
    rjmp  jump_target_1
    rjmp  jump_target_2

jump_target_0:
    (...)
jump_target_1:
    (...)
jump_target_2:
    (...)

You can also call functions this way via icall, instead of ijmp.
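
A sketch of the icall variant, using hypothetical names (dispatch, call_table, handler_0, handler_1). Each handler must end with ret so that execution continues after the icall:

dispatch:              ; Index expected in r0 (0 or 1, not range-checked)
    mov   ZL,  r0
    ldi   ZH,  0
    subi  ZL,  lo8(-(pm(call_table)))
    sbci  ZH,  hi8(-(pm(call_table)))
    icall              ; Pushes the return address, then jumps to Z
    ret                ; The handler's ret comes back here

call_table:
    rjmp  handler_0
    rjmp  handler_1

handler_0:
    ldi   r24, 10
    ret
handler_1:
    ldi   r24, 20
    ret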


Multiply without MUL

In AVR assembler, the mul family of instructions treats r1:r0 like an accumulator: the pair is always destroyed and replaced with the result. If you can afford to destroy your operands instead, use bit shifts and addition.

; r25:r24 * 2 -> r25:r24
    lsl   r24      ; high bit goes into carry
    rol   r25      ; move the carry into low bit of r25
; r25:r24 / 2 -> r25:r24
    lsr   r25      ; low bit goes into carry
    ror   r24      ; move the carry into high bit of r24
; r24 * 2 -> r25:r24
    ldi   r25, 0
    lsl   r24      ; high bit goes into carry
    rol   r25      ; move the carry into low bit of r25

Each shift left is equivalent to multiplying by two, and each shift right is equivalent to dividing by two. Keep in mind that since the mul family of instructions is fast (two cycles), it is most often better to just use them!
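
Shifts and additions can be combined for other constants. As a sketch, multiplying r25:r24 by 5 (as 4x + x), with r23:r22 as an arbitrarily chosen scratch pair:

; r25:r24 * 5 -> r25:r24 (r1:r0 stay untouched)
    movw  r22, r24     ; Keep a copy of x
    lsl   r24
    rol   r25          ; x * 2
    lsl   r24
    rol   r25          ; x * 4
    add   r24, r22
    adc   r25, r23     ; x * 4 + x = x * 5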


Use sbc for fast set/clear of a register based on the carry

    lsl   r24      ; shift the msb of r24 in the carry
    sbc   r0,  r0  ; if carry == 0 -> r0 = 0, if carry == 1 -> r0 = 0xff

You may use this technique to sign-extend after a muls processing a fixed-point number (often useful in inline mixer design).
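
A sketch of the muls case, assuming a 24-bit accumulator in r20:r19:r18 and r23 as scratch (both arbitrary choices). It relies on muls leaving bit 15 of the product in the carry, so no extra shift is needed:

    muls  r24, r22     ; r1:r0 = r24 * r22 (signed), C = bit 15 of the product
    sbc   r23, r23     ; r23 = 0x00 if the product is positive, 0xff if negative
    add   r18, r0      ; Accumulate the low byte
    adc   r19, r1      ; Accumulate the high byte
    adc   r20, r23     ; Accumulate the sign extension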


Video mode & Inline mixer design

Timing

    nop        ; 1 cycle
    rjmp  .    ; 2 cycles
    lpm        ; 3 cycles - but destroys a register (r0 if plain "lpm")

To kill off 3N cycles: (N > 0)

.macro delay3N value
    ldi   r19, \value
    dec   r19
    brne  .-4
.endm

You can use any register r16 or above for the counter. The branch costs one less cycle on the last iteration, but that is "paid for" by the LDI instruction up front.
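
As a usage sketch, counts that are not a multiple of three can be reached by padding with the single instructions listed above:

    delay3N 10         ; 30 cycles (clobbers r19)
    rjmp  .            ; +2 cycles, 32 in total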

To kill off variable number of cycles

You may also write a routine which waits an arbitrary number of cycles as follows:

delay_cycles:
    lsr   r24
    brcs  .    ; +1 if bit0 was set
    lsr   r24
    brcs  .    ; +1 if bit1 was set
    brcs  .    ; +1 more if bit1 was set (bit1 is worth 2 cycles)
    dec   r24
    nop
    brne  .-6  ; 4 cycle loop
    ret

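This produces a delay of 12 cycles (excluding the CALL or RCALL used to call it, but including the RET) when r24 is 4. Each increment of r24 adds exactly one cycle, so the delay is r24 + 8 cycles for values from 4 to 255, up to 263 cycles. Values 0 to 3 leave zero in the loop counter, which then wraps around and runs the 4 cycle loop 256 times, giving 1032 to 1035 cycles.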
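
A usage sketch, assuming a device with a 16-bit program counter where RCALL takes 3 cycles. The total is 1 (ldi) + 3 (rcall) + r24 + 8 cycles, and r24 is zero on return:

    ldi   r24, 32 - 8 - 3 - 1  ; 32 cycles in total, so r24 = 20
    rcall delay_cycles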

Useful Macros

Tired of the verbose way to send a pixel out? I always cut/paste another one myself. Here's a better way:

.macro pixel reg
    out _SFR_IO_ADDR(DATA_PORT),\reg ; 1
.endm


Timing measurements

The emulators can report cycles consumed between wdr instructions. You can use this to measure the performance or the timing of a block of code like this:

    wdr
    (...)      ; Block of code to measure
    wdr

Note that you don't need this to check whether your HSync timing is right when designing a video mode, as the CUzeBox emulator can indicate whether the timing is correct.


Indirect Jump Without Z

Since the Z register is the only pointer allowed to index Program Memory, it tends to get used heavily. Unfortunately, Z is also necessary in order to execute an indirect jump, suitable for a jump table. If your Z register is busy, you can use the program stack instead and execute a RET instruction to perform the jump:

    mov   r24, r0  ; Load index (in this example in r0)
    ldi   r25, 0
    subi  r24, lo8(-(pm(jump_table)))  ; Add the word address of the table
    sbci  r25, hi8(-(pm(jump_table)))
    push  r24      ; Low byte first, as ret pops the high byte first
    push  r25
    ret

Keep in mind that this is costly (8 cycles for the two pushes and the RET versus the 2 cycles of the IJMP), so you will rarely want it, but it can be useful to know.