Now for a rather annoying optimization quirk. It came up at work, in a routine for an XMega that is executed often enough to benefit from optimization. The routine:
Code:
void dstream_enc_setm8(auint ch, uint16 addr, uint8 const* val, auint mask)
{
  uint8* dst;
  uint8* mod;
  uint8  mdv;
  uint8  t8;
  auint  sh;

  addr = addr & 0xFFF8U;                  /* Align to the start of a group of 8 */
  if (ch >= DSTREAM_CH_CNT){ return; }
  if (addr > (0x380U - 8U)){ return; }

  dst = &(dstream_state[ch][addr]);
  mod = &(dstream_modif[ch][addr >> 3]);  /* One marker byte per group of 8 */
  mdv = *mod;
  sh  = 1U;

  do{
    t8 = (*val);
    val ++;
    if ((mask & sh) == 0U){ t8 = 0U; }    /* Masked out: store zero instead */
    if ((*dst) != t8){ mdv |= sh; }       /* Value changes: set its marker bit */
    *dst = t8;
    dst ++;
    sh <<= 1;
  }while ((sh & 0xFFU) != 0U);

  *mod = mdv;
}
What it does is set a group of 8 values and update their modification markers. The "mask" parameter can inhibit values from the source propagating into the array: where a mask bit is clear, zero is stored instead of the source value. The critical part is of course the loop, which GCC compiles into this:
Code:
5386: 18 81 ld r17, Y
5388: ad 01 movw r20, r26
538a: 48 5f subi r20, 0xF8 ; 248
538c: 5f 4f sbci r21, 0xFF ; 255
538e: 31 e0 ldi r19, 0x01 ; 1
5390: cd 91 ld r28, X+
5392: d3 2f mov r29, r19
5394: d2 23 and r29, r18
5396: 09 f4 brne .+2 ; 0x539a <dstream_enc_setm8+0x72>
5398: c0 e0 ldi r28, 0x00 ; 0
539a: d0 81 ld r29, Z
539c: dc 13 cpse r29, r28
539e: 13 2b or r17, r19
53a0: c1 93 st Z+, r28
53a2: 33 0f add r19, r19
53a4: a4 17 cp r26, r20
53a6: b5 07 cpc r27, r21
53a8: 99 f7 brne .-26 ; 0x5390 <dstream_enc_setm8+0x68>
53aa: 20 e7 ldi r18, 0x70 ; 112
53ac: 28 9f mul r18, r24
53ae: f0 01 movw r30, r0
53b0: 29 9f mul r18, r25
53b2: f0 0d add r31, r0
53b4: 11 24 eor r1, r1
53b6: e6 0f add r30, r22
53b8: f7 1f adc r31, r23
53ba: e7 56 subi r30, 0x67 ; 103
53bc: ff 4d sbci r31, 0xDF ; 223
53be: 10 83 st Z, r17
I also included the code generated for the last line, "*mod = mdv;", which writes back the modification markers. It is a bit shocking that the compiler decided to recalculate that address instead of saving it somewhere. True, it ran out of registers, but then take a look at the loop...
The optimization GCC performs there is often beneficial: it removes the variable used for looping. Here, however, the variable is also used in the calculations, so it has to stay anyway, and the most ridiculous result is the instruction at 53a2. That's the "sh <<= 1;" line, after which the loop condition is a test for zero. That add instruction already sets the zero flag correctly, and the compiler was seemingly well aware of it, since it could work out how many iterations the loop makes and compute an end position for the pointer. So it discarded the zero flag and instead compared the pointer against an end marker which it made up itself. Facepalm.
Double facepalm, even, since with this decision it ran out of registers, which possibly led to that ridiculous decision to recalculate the "mod" pointer.
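To make the loop transformation concrete, here is a rough sketch of it expressed back in C. The function names are made up for illustration, and "sh" is narrowed to uint8_t here (an assumption, matching the 8 bit register r19 in the listing rather than the "auint" of the source). The first function ends the loop the way the source asks for, the second the way the generated code effectively does.

```c
#include <stdint.h>

/* Loop as written in the source: ends when the marker bit shifts out.
   On AVR, the "add r19, r19" doing the shift leaves the zero flag set
   on the 8th pass, so a single brne could close the loop. */
uint8_t mark_by_zero_flag(uint8_t *dst, const uint8_t *val,
                          uint8_t mask, uint8_t mdv)
{
    uint8_t sh = 1U;
    do {
        uint8_t t8 = *val++;
        if ((mask & sh) == 0U) { t8 = 0U; }
        if (*dst != t8) { mdv |= sh; }
        *dst++ = t8;
        sh = (uint8_t)(sh << 1);
    } while (sh != 0U);
    return mdv;
}

/* Roughly what the listing does instead: compute an end pointer up
   front (subi/sbci), keep shifting sh for the body, but close the loop
   with a 16 bit pointer compare (cp/cpc/brne). The end pointer costs an
   extra register pair. */
uint8_t mark_by_end_pointer(uint8_t *dst, const uint8_t *val,
                            uint8_t mask, uint8_t mdv)
{
    const uint8_t *end = val + 8;
    uint8_t sh = 1U;
    do {
        uint8_t t8 = *val++;
        if ((mask & sh) == 0U) { t8 = 0U; }
        if (*dst != t8) { mdv |= sh; }
        *dst++ = t8;
        sh = (uint8_t)(sh << 1);   /* zero flag result goes unused */
    } while (val != end);
    return mdv;
}
```

Both produce the same result; the difference is only in what the loop needs to keep alive in registers.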
Then when I unrolled the loop, it decided to use the 'X' and 'Z' registers with displacement addressing. Displacement addressing! On an XMega (3 cycles instead of 2). And with the X register, which doesn't even have a displacement mode, so it sprayed "adiw" and "sbiw" instructions (2 cycles each) all over the thing. All the while the Y pointer, which at least does have a displacement addressing mode, stayed allocated with its registers used as temporaries for calculating stuff.
What to make of this?
I guess if there is nothing against using assembly, and you have such a performance-critical, addressing-intensive routine, just write it in assembler.
(Otherwise, staying in C, I found that using pointer math often results in faster code than doing the equivalent with indexing.)
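As a minimal illustration of the two styles (function names are hypothetical, and whether the pointer form actually wins depends on the compiler version and flags, so check the listing):

```c
#include <stddef.h>
#include <stdint.h>

/* Indexed form: every access is expressed as base + i. */
void copy_indexed(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i;
    for (i = 0U; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Pointer form: both pointers advance on each access, which maps
   naturally onto the AVR post-increment modes (ld X+ / st Z+). */
void copy_pointer(uint8_t *dst, const uint8_t *src, size_t n)
{
    const uint8_t *end = src + n;
    while (src != end) {
        *dst++ = *src++;
    }
}
```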