Code optimization

paul
Posts: 457
Joined: Sat May 02, 2009 8:41 am
Location: Brisbane, Australia

Re: Code optimization

Post by paul »

avr-size reports that my build uses 3352 bytes of 'data' and 10746 bytes of 'program' - this is confusing. Where should I look to determine how much space is left on the chip?
To expand on Lee's post, you've got 4k ram ('data') and 64k flash ('program') on the ATmega644.

The ram will be used for global/static variables and for the call stack. You can explicitly request that items be allocated memory from the 'data' section (ram) in avr-asm with ".section .bss". You can place data in 'program' space (flash) with ".section .text".

If you're coding in C, any variables declared outside of a function (or inside of one, but preceded with the static keyword) will be placed in the 4k of ram, which on the ATmega644 spans addresses 0x0100 to 0x10ff (4096 bytes; the addresses below 0x0100 hold the registers and I/O space). Variables declared inside of a function will be reserved space on the stack when the function is called (unless registers are available), along with the return address and function parameters for which no registers are available. The stack (also residing in ram) starts at the top of ram, 0x10ff, and works downwards (although we think of it as building upwards). Thus, if you eat up all your ram with static and global variables (or have deep call chains or recursive calls), your stack can collide with your globals/statics. So it's a good idea to leave some room for the stack to grow.
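As a rough illustration of the storage classes described above, here is a minimal C sketch. The names `level_map`, `lives` and `count_calls` are made up for the example. On an AVR build the two globals and the static local would count toward the 'data' figure, while `scratch` would only exist in a register or on the stack during the call.

```c
#include <assert.h>
#include <stdint.h>

uint8_t level_map[64];   /* global, no initializer: .bss (ram, zeroed at startup) */
uint8_t lives = 3;       /* global with initializer: .data (ram, initial value also kept in flash) */

/* Returns how many times it has been called so far. */
uint8_t count_calls(void)
{
    static uint8_t calls;    /* static local: .bss, keeps its value between calls */
    uint8_t scratch = 1;     /* automatic local: register or stack, gone on return */

    assert(calls < UINT8_MAX);   /* guard against wrap-around in this demo */
    calls = (uint8_t)(calls + scratch);
    return calls;
}
```

Calling `count_calls()` twice returns 1 and then 2, because `calls` lives in ram for the whole run rather than on the stack.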

'Program' space in C will be used by your code expressions/logic. You can also place data in program space with the PROGMEM macro; it looks like this:
const char blinkyName[] PROGMEM = "\"BLINKY\"";
You can load it into ram with the strcpy_P and memcpy_P functions (you'll need to include <avr/pgmspace.h>). This is useful because these chips have so little ram, so we accept the extra cycles needed to load this data when required. Keep in mind that this data cannot be changed at runtime (although once loaded into ram, that copy may be altered).
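A minimal sketch of that load step might look like the following. The helper `load_name` is hypothetical, and the `#else` branch is only an assumption added so the example also compiles on a PC, where there is no separate program space. The array is declared `const`, which newer avr-gcc releases require for PROGMEM data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef __AVR__
#include <avr/pgmspace.h>
#else
/* Host fallback (assumption for illustration): no flash/ram split off-target. */
#define PROGMEM
#define strcpy_P strcpy
#endif

const char blinkyName[] PROGMEM = "\"BLINKY\"";

/* Copies the flash-resident name into a ram buffer supplied by the caller.
 * Returns the string length, or 0 if the buffer is too small. */
size_t load_name(char *dst, size_t dst_size)
{
    assert(dst != NULL);
    if (dst_size < sizeof blinkyName) {   /* sizeof includes the terminating NUL */
        return 0;
    }
    strcpy_P(dst, blinkyName);            /* the ram copy may be modified afterwards */
    return strlen(dst);
}
```

The ram buffer only needs to exist while the name is in use, so a short-lived local buffer keeps the permanent 'data' footprint down.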

So you can see that ram ('data') is dynamic and very scarce, while flash ('program') is static but relatively plentiful.

As for the malloc library include, maybe someone else will know for sure, but I'd guess that you're using a library which in turn makes use of malloc. I'm not entirely sure, otherwise.

My apologies if much of this was recapping information that you already know, but it might be a useful intro to others.
orthopteroid
Posts: 12
Joined: Mon Jan 18, 2010 8:08 am
Location: Vancouver, Canada

Re: Code optimization

Post by orthopteroid »

D3thAdd3r wrote: For the '644 it would just be 4k(4096-3352)=744 data free and 64k(65536-10746)=54790 program free, if you don't mind manually figuring it out. Otherwise I'm not sure.
LOL! Yes, I did seem to come across as a simpleton! I was actually wondering why .data is used for both sums...

Output from avr-size:

Code:

Program:   15898 bytes (24.3% Full)
(.text + .data + .bootloader)

Data:       3359 bytes (82.0% Full)
(.data + .bss + .noinit)
Contents of my .lss file:

Code:

Sections:
Idx Name          Size      VMA       LMA       File off  Algn

  0 .data         00000012  00800100  00003e08  00003f08  2**0
  1 .text         00003e08  00000000  00000000  00000100  2**8
  2 .bss          00000d0d  00800120  00800120  00003f20  2**5
But in this case .data is only 18 bytes - no big deal. I'll worry about what it means when it gets larger.... :roll:

As an aside, can this chip execute from any address? Could I conceivably load code from SD for execution? What I can't store as data I could store as code...
paul wrote:

Code:

const char blinkyName[] PROGMEM = "\"BLINKY\"";
You can load it into ram with the strcpy_P and memcpy_P functions (you'll need to include <avr/pgmspace.h>).
Thanks for the extra clarity on the memory layout. And yes, I'm using strcpy_P to make working copies of level maps.

As for my malloc/free problem, I seem to have had some dirty .o files lying around. :oops: I don't think I'll be going the way of malloc/free...
uze6666
Site Admin
Posts: 4801
Joined: Tue Aug 12, 2008 9:13 pm
Location: Montreal, Canada

Re: Code optimization

Post by uze6666 »

I was actually wondering why .data is used for both sums...
Since .data is for initialized variables, the initial values to load in RAM must be stored in flash/program memory.
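The two totals can be checked against the section sizes from the .lss listing above. The sketch below uses made-up type and function names, and assumes the .bootloader and .noinit sections are empty in this build.

```c
#include <assert.h>
#include <stdint.h>

/* Section sizes in bytes, as listed in the .lss file. */
typedef struct {
    uint32_t text;   /* code and constants in flash */
    uint32_t data;   /* initialized variables: ram, plus initial values in flash */
    uint32_t bss;    /* zero-initialized variables: ram only */
} section_sizes;

/* Flash usage: .text plus the stored initial values of .data. */
uint32_t program_bytes(section_sizes s) { return s.text + s.data; }

/* Ram usage: .data plus .bss. */
uint32_t data_bytes(section_sizes s) { return s.data + s.bss; }

/* Usage in tenths of a percent, rounded to nearest. */
uint32_t percent_tenths(uint32_t used, uint32_t capacity)
{
    assert(capacity > 0);
    return (used * 1000u + capacity / 2u) / capacity;
}
```

With .text = 0x3e08, .data = 0x12 and .bss = 0xd0d this gives 15898 program bytes and 3359 data bytes, matching the avr-size output, so the 18 bytes of .data are simply counted once in each sum.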
As an aside, can this chip execute from any address? Could I conceivably load code from SD for execution? What I can't store as data I could store as code...
The AVRs use a Harvard architecture with separate buses for program memory and RAM. They can only execute code from flash, not RAM. To execute code from an SD card, the AVR has to reprogram itself with it. This is how the Gameloader bootloader works.

Hope that helps

-Uze
orthopteroid
Posts: 12
Joined: Mon Jan 18, 2010 8:08 am
Location: Vancouver, Canada

Re: Code optimization

Post by orthopteroid »

I've noticed that avr-gcc appears to do some static code analysis that inlines and factors generated code.

Code:

// note: b is not joypad mask but joypad bit-number (ie 12 instead of 2048)
// this way we can use b as a guide to access the right bit in the second byte of the joypad memory.
// pjs is a pointer to the joypad memory (it's an arg here so I can use same code for player1 or player2
// as well as catch button transitions on those player joypads).
INLINE UINT8 input_down_core( UINT8 p, UINT8 b, UINT8* pjs )
{
	const UINT8 mask[] = {1,2,4,8,16,32,64,128};
	UINT8 jsa = 0, ba = 1, poff = 4 * p;
	if( b>8 ) { jsa = 1; ba = 9; }
	UINT8 uByte = *(pjs + poff + jsa); // hack: we know how memory is organized...
	return uByte & mask[ b-ba ];
}

#define input_down( p, b )  ( input_down_core( p, b, joypad1_status_lo ) )
Exploiting this, I've found that (after setup) my joypad code now comes down to 2 instructions per button test. I get similar code generation when checking for button down/up transitions as I also save the joypad states every tick.

Code:

	if( input_down( 0, INP_LEFT ) ) { player_turn( 0, +math_angleTurnIncr() ); }
    32c0:	06 ff       	sbrs	r16, 6
    32c2:	04 c0       	rjmp	.+8      	; 0x32cc <TickControls+0x64>
    32c4:	80 e0       	ldi	r24, 0x00	; 0
    32c6:	62 e0       	ldi	r22, 0x02	; 2
    32c8:	0e 94 36 18 	call	0x306c	; 0x306c <player_turn>
	if( input_down( 0, INP_RIGHT ) ) { player_turn( 0, -math_angleTurnIncr() ); }
    32cc:	07 ff       	sbrs	r16, 7
    32ce:	04 c0       	rjmp	.+8      	; 0x32d8 <TickControls+0x70>
    32d0:	80 e0       	ldi	r24, 0x00	; 0
    32d2:	6e ef       	ldi	r22, 0xFE	; 254
    32d4:	0e 94 36 18 	call	0x306c	; 0x306c <player_turn>
Anyone else tweaking to take advantage of these compiler features? I seem to recall that c++ template args could generate similarly optimized code.
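To experiment with this kind of folding off-target, a self-contained variant can be handy. The sketch below is not the code from the post: it assumes definitions for `UINT8` and `INLINE` (the post does not show them), swaps the lookup table for a shift, and uses a made-up `fake_joypad` buffer laid out as 4 bytes per player, as implied by `poff = 4 * p`.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t UINT8;          /* assumed typedef, not shown in the post */
#define INLINE static inline    /* assumed meaning of the post's INLINE macro */

/* Hypothetical stand-in for joypad memory: 4 bytes per player, low byte first. */
static UINT8 fake_joypad[8];

/* b is a 1-based bit number (1..16), e.g. 12 instead of the mask 2048.
 * Returns non-zero if that button bit is set for player p. */
INLINE UINT8 input_down_core(UINT8 p, UINT8 b, const UINT8 *pjs)
{
    assert(b >= 1 && b <= 16);
    UINT8 byte_index = (UINT8)((b - 1) / 8);   /* 0 = low byte, 1 = high byte */
    UINT8 bit_index  = (UINT8)((b - 1) % 8);
    return (UINT8)(pjs[4 * p + byte_index] & (1u << bit_index));
}
```

When `p` and `b` are compile-time constants, the byte index and the mask are constants too, which is presumably what lets the compiler reduce each test to a single bit check on a register.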
uze6666
Site Admin
Posts: 4801
Joined: Tue Aug 12, 2008 9:13 pm
Location: Montreal, Canada

Re: Code optimization

Post by uze6666 »

Anyone else tweaking to take advantage of these compiler features? I seem to recall that c++ template args could generate similarly optimized code.
Yeah, I just noticed that yesterday when trying to shrink the kernel. At first, creating short loops (for example to initialize an array of structs) yielded no benefit at all. After much searching I discovered that AVR GCC always unrolls short loops if there are not "too many" instructions in them. Naturally, that's a problem for me, since I'm trying to reduce size as much as possible. Moreover, I discovered there's a bug that causes GCC to ignore all "-fno-unroll-loops" kinds of switches. Also, the compiler will inline lots of stuff, even if we use -Os. Fortunately, this one can be disabled globally using -fno-inline. I recall there's an attribute you can append to disable it per function, I just could not find it.
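The per-function attribute is most likely GCC's `noinline`. A minimal sketch, using a made-up helper `clear_scores`, could look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* noinline asks GCC to keep one out-of-line copy of this function
 * instead of duplicating its body at every call site. */
__attribute__((noinline))
void clear_scores(uint8_t *scores, uint8_t count)
{
    assert(scores != NULL);
    for (uint8_t i = 0; i < count; i++) {
        scores[i] = 0;
    }
}
```

This only controls inlining of `clear_scores` itself. Whether the loop inside gets unrolled is a separate decision by the compiler, so it is worth checking the .lss output after the change.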

Btw, I did some optimization that shrank the kernel footprint by ~930 bytes. Results may vary since the sound waves must be aligned to 256 bytes and there's user code before them. So at worst you may be wasting 255 bytes. Porting all the C code to assembly could reduce it significantly, but the tradeoff in code manageability holds me back on this one.
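The 255-byte worst case follows from simple alignment arithmetic. Here is a small sketch with a hypothetical helper, `align_padding`, that computes how many filler bytes are needed before an aligned table:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of padding needed so that addr moves up to the next multiple
 * of alignment (0 if addr is already aligned). */
uint32_t align_padding(uint32_t addr, uint32_t alignment)
{
    assert(alignment > 0);
    return (alignment - (addr % alignment)) % alignment;
}
```

If user code ends one byte past a 256-byte boundary, 255 bytes are lost; if it ends exactly on a boundary, nothing is wasted.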

Thanks to Filipe Rinaldi's global build script, it's damn easy to run regression tests on all projects at once! Oh and the GDB support in uzem in Eclipse just rocks and made this optimization a breeze! :D

-Uze