I've rethought the 16x4 mode again and found that it allows a dynamic palette of 16 out of 256 colors, half the flash for tile data, and 240x224 resolution at the same time, at the expense of 256 bytes of RAM.
(If the 256-byte RAM penalty is too high, the same mode also works with fewer colors. With only 8 of 256 colors, for example, 136 bytes are freed en bloc, plus another 56 bytes scattered in 7 blocks of 8 bytes of unused palette entries. With 10 colors the table needs 100 bytes instead of 256, with 12 colors 144 bytes, whatever you like.)
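To make the RAM figures above easier to check, here is a small host-side Python model of the lookup table (not firmware code). It assumes the table is indexed by the whole packed byte and returns the color of the low nibble; which nibble holds which pixel is my assumption, and the names `build_lookup_table` and `free_runs` are made up for this sketch.

```python
def build_lookup_table(colors):
    """Return a 256-entry table where table[b] is the color of b's low nibble.

    Entries the video loop can never read (a nibble >= len(colors))
    are left as None to mark them as free RAM.
    """
    n = len(colors)
    assert 1 <= n <= 16
    table = [None] * 256
    for b in range(256):
        hi, lo = b >> 4, b & 0x0F
        if hi < n and lo < n:
            table[b] = colors[lo]
    return table


def free_runs(table):
    """Return (start_address, length) for each run of unused entries."""
    runs, start = [], None
    for addr, entry in enumerate(table + [0]):  # sentinel closes a trailing run
        if entry is None and start is None:
            start = addr
        elif entry is not None and start is not None:
            runs.append((start, addr - start))
            start = None
    return runs


# Example: 8 placeholder color bytes (hypothetical values)
table8 = build_lookup_table([0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17])
used8 = sum(entry is not None for entry in table8)
runs8 = free_runs(table8)
```

With 8 colors this gives 64 used entries, seven free runs of 8 bytes, and one free run of 136 bytes starting at address 0x78. In general the table uses n*n bytes for n colors, which is where 100 and 144 come from.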
Current mode 3 has too few cycles for a palette lookup because we have 6 cycles per pixel, so one tile of 8 pixels has 48 cycles. For the usual direct video we need one LPM (3 cycles) and one OUT (1 cycle) per pixel, which adds up to 32 cycles per 8-pixel tile. This leaves 16 cycles for calculating the next tile address from the index, and video mode 3 uses them up completely:
cycles | code
=======+======================================================================
3 | lpm r16,Z+ ; load pixel 1 direct video
1 | out vport,r16 ; output pixel 1 data
2 | Used for calculation of next tile address from index (step 1)
-------+----------------------------------------------------------------------
3 | lpm r16,Z+ ; load pixel 2 direct video
1 | out vport,r16 ; output pixel 2 data
2 | Used for calculation of next tile address from index (step 2)
==============================================================================
12 cycles for 2 pixels, 8 calculation steps per 8-pixel tile.
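The budget above can be written down as a few lines of arithmetic. This is only a Python sketch of the cycle accounting, using the numbers from this post; the function name `spare_cycles` and the constant names are made up for illustration.

```python
CYCLES_PER_PIXEL = 6       # pixel budget in mode 3, as stated above
LPM_CYCLES = 3             # load one byte from flash
OUT_CYCLES = 1             # write one byte to the video port
ADDRESS_CALC_CYCLES = 16   # next-tile address calculation, once per tile


def spare_cycles(pixels_per_tile, output_cycles_per_pixel):
    """Cycles left per tile after pixel output and address calculation."""
    total = pixels_per_tile * CYCLES_PER_PIXEL
    output = pixels_per_tile * output_cycles_per_pixel
    return total - output - ADDRESS_CALC_CYCLES


# Current mode 3: 8-pixel tiles, one LPM and one OUT per pixel
direct_spare = spare_cycles(8, LPM_CYCLES + OUT_CYCLES)
```

For the current mode 3 this comes out to 48 - 32 - 16 = 0 spare cycles, so there is no room for a palette lookup.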
If we had only one tile lookup per 16 pixels, we would still need the same 16 cycles to calculate the next tile address, but now they come from a pool of 96 cycles (16 pixels) instead of only 48. So we have 48 cycles for LPM/SWAP and OUT, 16 cycles for the tile address calculation, and another 32 cycles left for the palette lookup:
cycles | code
=======+======================================================================
3 | lpm r28,Z+ ; load pixel 1+2 palette indices from tile. (r28 is yl)
2 | ld r16,Y ; palette lookup
1 | out vport,r16 ; output pixel 1 data
-------+----------------------------------------------------------------------
1 | swap r28 ; other pixel palette index (r28 is yl)
2 | ld r16,Y ; palette lookup
2 | Used for calculation of next tile address from index (step 1)
1 | out vport,r16 ; output pixel 2 data
==============================================================================
12 cycles for 2 pixels, still 8 calculation steps per 16-pixel tile.
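As a sanity check of the loop above, here is a host-side Python model of what it outputs for a row of packed tile bytes. It is a sketch under assumptions: the 256-byte table returns the color of the low nibble, so pixel 1 sits in the low nibble and pixel 2 in the high nibble. The names `make_table`, `swap_nibbles` and `render_packed` are hypothetical.

```python
def make_table(colors):
    """256-entry lookup: table[b] is the color of b's low nibble (assumed layout)."""
    assert len(colors) == 16
    return [colors[b & 0x0F] for b in range(256)]


def swap_nibbles(b):
    """Model of the AVR SWAP instruction on one byte."""
    return ((b << 4) | (b >> 4)) & 0xFF


def render_packed(tile_bytes, table):
    """Return the output colors, two per packed byte, in display order."""
    out = []
    for b in tile_bytes:
        out.append(table[b])                # ld r16,Y / out: pixel 1
        out.append(table[swap_nibbles(b)])  # swap r28 / ld r16,Y / out: pixel 2
    return out


# Cycle counts per pixel pair, copied from the table above
PAIR_CYCLES = {"lpm": 3, "ld_1": 2, "out_1": 1,
               "swap": 1, "ld_2": 2, "calc": 2, "out_2": 1}
cycles_per_16_pixels = 8 * sum(PAIR_CYCLES.values())
```

The point of indexing the table by the whole byte is that no masking is needed: SWAP alone brings the other index into place, and the 12 cycles per pair add up to exactly the 96 cycles available for 16 pixels.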
Please note that this is still mode 3 with all its features, just slightly tweaked.