The compiler is playing evil

Post by **Jubatian** » Thu Nov 02, 2017 7:41 pm

Aargh!

I guess I am not alone with such baldness-inducing horror stories of compiler misfeatures. So here is a topic to share some, hopefully helping each other to spot them easier later.

So the first one is... What's the bug in this section of assembly code?

Code: Select all

	ldi   r24,     12
	ldi   r25,     0
	ldi   XL,      lo8(v_store)
	ldi   XH,      hi8(v_store)
psrl1:
	ld    r0,      X
	sbrs  r0,      7
	rjmp  psrl1f           ; Free slot: search completed
	sbrs  r0,      6
	mov   r25,     r24     ; Low pri. slot found (r25: How far back it is from end + 1)
	adiw  XL,      9
	dec   r24
	brne  psrl1
	cpi   r25,     0
	breq  psre2            ; No free slot, no low pri. slot: Totally occupied, exit
	sbrs  r22,     6
	rjmp  psre2            ; Input is low priority: No further allocations will happen, exit
	mov   r24,     9
	mul   r25,     r24
	sub   XL,      r0
	sbc   XH,      r1      ; Step back onto low priority slot to occupy it

This is an extract of Flight of a Dragon (here), actually you don't need to understand any of it, carefully examining the code should reveal it. The buggy behavior appears when there are so many actors active that the game has to prioritize among them, so I could get quite far without experimenting it. However when it happens, it makes it count.

CunningFellow · Post by **CunningFellow** » Thu Nov 02, 2017 9:59 pm

Compiler horror story? The file name ends in S ?

But I understand how horrible compiler errors are. I know in T2K I spent a week searching for a hard to replicate bug. You had to die and spawn just as the correct sequence of other objects where being garbage collected to trigger the bug.

It turned out to be GCC4.9.x doing something screwy with a pointer that was fixed in a later release of the compiler (4.9.2 from memory)

Post by **Jubatian** » Fri Nov 03, 2017 6:15 am

CunningFellow wrote: ↑Thu Nov 02, 2017 9:59 pmCompiler horror story? The file name ends in S ?

No, preprocessor is turned on for the assembly files in the Makefile with the appropriate parameter. The bug is not that, if it was such, then most likely it wouldn't even compile. Part of the problem is that this compiles without the compiler giving a fart about it. (hint)

Post by **D3thAdd3r** » Sat Nov 04, 2017 9:45 pm

Weird. So the compiler does something beforehand that allows the assembler to assemble a mov with a non-register/immediate value? I guess I wouldn't expect that to assemble at all but maybe there is more I don't know on this item.

Post by **Jubatian** » Sun Nov 05, 2017 9:49 pm

D3thAdd3r wrote: ↑Sat Nov 04, 2017 9:45 pm Weird. So the compiler does something beforehand that allows the assembler to assemble a mov with a non-register/immediate value? I guess I wouldn't expect that to assemble at all but maybe there is more I don't know on this item.

Yes, it assembles. I deducted that internally the assembler uses numeric literals, so that "9" translates to "r9". Which is quite nasty since then the game starts to spit stuff all over the RAM. I already had a similar bug somewhere else in the game, that made it a little easier to spot. So if you use a numeric literal below 32 for a mov, that will translate to a register without the compiler farting anything about it. Of course likely applies everywhere else where a register is accepted (But maybe here it is the worst since you would expect that happening what you see, forgetting about that it would require an ldi. Maybe not if your first assembly language is the AVR-8).

Post by **Jubatian** » Thu Dec 14, 2017 11:17 am

Now just a rather annoying optimization quirk. This appeared at work, in a routine for an XMega which is often executed, benefiting from optimization. The routine:

Code: Select all

void   dstream_enc_setm8(auint ch, uint16 addr, uint8 const* val, auint mask)
{
 uint8* dst;
 uint8* mod;
 uint8  mdv;
 uint8  t8;
 auint  sh;

 addr = addr & 0xFFF8U;
 if (ch >= DSTREAM_CH_CNT){ return; }
 if (addr > (0x380U - 8U)){ return; }

 dst = &(dstream_state[ch][addr]);
 mod = &(dstream_modif[ch][addr >> 3]);
 mdv = *mod;

 sh  = 1U;

 do{
  t8  = (*val);
  val ++;
  if ((mask & sh) == 0U){ t8 = 0U; }
  if ((*dst) != t8){ mdv |= sh; }
  *dst = t8;
  dst ++;
  sh <<= 1;
 }while ((sh & 0xFFU) != 0U);
 
 *mod = mdv;
}

What it does is setting a group of values (8 values) with modification markers, the "mask" parameter being able to inhibit values from the source propagating into the array. The critical part of course is the loop, which going through GCC becomes this:

Code: Select all

    5386:	18 81       	ld	r17, Y

    5388:	ad 01       	movw	r20, r26
    538a:	48 5f       	subi	r20, 0xF8	; 248
    538c:	5f 4f       	sbci	r21, 0xFF	; 255
    538e:	31 e0       	ldi	r19, 0x01	; 1
    5390:	cd 91       	ld	r28, X+
    5392:	d3 2f       	mov	r29, r19
    5394:	d2 23       	and	r29, r18
    5396:	09 f4       	brne	.+2      	; 0x539a <dstream_enc_setm8+0x72>
    5398:	c0 e0       	ldi	r28, 0x00	; 0
    539a:	d0 81       	ld	r29, Z
    539c:	dc 13       	cpse	r29, r28
    539e:	13 2b       	or	r17, r19
    53a0:	c1 93       	st	Z+, r28
    53a2:	33 0f       	add	r19, r19
    53a4:	a4 17       	cp	r26, r20
    53a6:	b5 07       	cpc	r27, r21
    53a8:	99 f7       	brne	.-26     	; 0x5390 <dstream_enc_setm8+0x68>

    53aa:	20 e7       	ldi	r18, 0x70	; 112
    53ac:	28 9f       	mul	r18, r24
    53ae:	f0 01       	movw	r30, r0
    53b0:	29 9f       	mul	r18, r25
    53b2:	f0 0d       	add	r31, r0
    53b4:	11 24       	eor	r1, r1
    53b6:	e6 0f       	add	r30, r22
    53b8:	f7 1f       	adc	r31, r23
    53ba:	e7 56       	subi	r30, 0x67	; 103
    53bc:	ff 4d       	sbci	r31, 0xDF	; 223
    53be:	10 83       	st	Z, r17

I also added the code for the last "*mod = mdv;" line refreshing the modification markers. That's a bit shocking, how the compiler ended up with the decision of recalculating that address instead of saving it somewhere. True it ran out of registers, but then take a look at the loop...

The optimization GCC is doing there is often beneficial, removing the variable used for looping. However here the variable is used for calculations, the most ridiculous being the line 53a2. That's the "sh <<= 1;" line, after which the loop condition is a test for zero. That add instruction already sets the zero flag right, and the compiler seemingly was even very much aware of it as it could calculate the iterations the loop will make, and calculated an end position by pointer. Thus it discarded the zero flag, and rather compared to a pre-generated end marker by pointer which it made up. Facepalm.

Double facepalm even as by this decision it ran out of registers, possibly ending up with that ridiculuous decision to recalculate the "mod" pointer.

Then when I unrolled the loop, it decided to use the 'X' and 'Z' registers, and displacement addressing. Displacement addressing! On an XMega (3 cycles instead of 2). With the X register, where it can't even use displacement, so sprayed around the whole thing with "adiw" and "sbiw" instructions (2 cycles each), all the time the Y pointer being allocated and its registers used as temporaries for calculating stuff (which pointer at least has a displacement addressing mode).

What to make of this?

I guess if there is nothing against using assembly, and you have such a performance dependent addressing-intensive routine, just write it in assembler.

(Otherwise staying in C I found that using pointer math often results faster code than doing the equivalent with indexing)

Post by **Jubatian** » Thu Dec 14, 2017 11:55 am

The end result was this:

Code: Select all

void   dstream_enc_setm8(auint ch, uint16 addr, uint8 const* val, auint mask)
{
 uint8* dst;
 uint8* end;
 uint8  mod;
 uint8  t8;

 addr = addr & 0xFFF8U;
 if (ch >= DSTREAM_CH_CNT){ return; }
 if (addr > (0x380U - 8U)){ return; }

 dst = &(dstream_state[ch][addr]);
 end = dst + 8U;
 mod = 0U;

 do{
  t8  = (*val);
  val ++;
  if ((mask & 1U) == 0U){ t8 = 0U; }
  mask >>= 1;
  mod  >>= 1;
  if ((*dst) != t8){ mod |= 0x80U; }
  *dst = t8;
  dst ++;
 }while (dst != end);

 dstream_modif[ch][addr >> 3] |= mod;
}

You can see a few fundamental changes, tailoring the code to optimize better understanding the behaviour of GCC. In particular the followings were done:

The loop now terminates by pointer comparison. Since GCC will optimize to this anyway, there is no point trying to be creative with the loop variable (even though it was also intuitive the way it was). Note that "dst + 8U" may evaluate to 0x380, which is one past the last valid element of the array (its size is 0x380). This is not undefined as of C99 (in C89 it was).
Applying the modification markers is moved to the end of the code. This reduces register usage since that pointer doesn't have to be remembered while looping which is the most register-demanding part.
The "sh" variable was eliminated, calculating intermediates differently, reducing variable, thus potentially register usage (it isn't something clear to exploit for optimization: it is actually not the variable count, rather interdependencies which matter, how well the compiler might untangle them to utilize fewer registers).

The resulting assembly of the loop:

Code: Select all

    5360:	ad 01       	movw	r20, r26
    5362:	48 5f       	subi	r20, 0xF8	; 248
    5364:	5f 4f       	sbci	r21, 0xFF	; 255
    5366:	30 e0       	ldi	r19, 0x00	; 0

    5368:	cd 91       	ld	r28, X+
    536a:	20 ff       	sbrs	r18, 0
    536c:	c0 e0       	ldi	r28, 0x00	; 0
    536e:	26 95       	lsr	r18
    5370:	36 95       	lsr	r19
    5372:	d0 81       	ld	r29, Z
    5374:	dc 13       	cpse	r29, r28
    5376:	30 68       	ori	r19, 0x80	; 128
    5378:	c1 93       	st	Z+, r28
    537a:	a4 17       	cp	r26, r20
    537c:	b5 07       	cpc	r27, r21
    537e:	a1 f7       	brne	.-24     	; 0x5368 <dstream_enc_setm8+0x40>

It is quite decent, now the compiler didn't even mess around with jumping far-far away just to execute single instructions (sometimes it does). It could be one clock shorter and could use one less register by using a normal loop variable, but GCC wouldn't do that in such pointer loops (it just prefers doing this).

Uzebox Forums

The compiler is playing evil

The compiler is playing evil

Re: The compiler is playing evil

Re: The compiler is playing evil

Re: The compiler is playing evil

Re: The compiler is playing evil

Re: The compiler is playing evil

Re: The compiler is playing evil