Ideas around an STM32G0

Jubatian
Posts: 1560
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Post by Jubatian »

Just throwing in some ideas I am playing around with (there is even some actual code already) to get a Uzebox-like small ARM game console going.

Of course I am also still messing around with that Square Kernel, now trying to get a much faster sprite blitter to make those 85 RAM tiles more useful. I would really like to have it for a certain game idea...

But back to the ARM. The STM32G0 is quite a new chip, however it already has Nucleo boards, so maybe prototyping stuff wouldn't be too difficult. At least if I wasn't in the mess I am in now, which definitely puts off any approach to hardware until much later. It has 128 KBytes of ROM and 36 KBytes of RAM; the odd number is because it optionally supports a hardware parity check for safety-relevant stuff (it then has 32K of RAM, with the remaining 4K used for the parity bits).

My general approach would be to have the kernel in the ROM as a fixed thing, while games would only use the RAM. So something favoring small games and working around the tight space, but with more flexibility for larger games than on the Uzebox, as you can replace anything at any time (code or data of your game) by loading parts from an SD card. The benefits are that there is no Flash wear, so the hardware could theoretically work forever, and that since the kernel provides the API for the hardware, anything capable of running ARM Thumb2 code could theoretically run the games (so you could have working implementations on bigger ARMs, potentially even producing display on HDMI with the appropriate part, or using USB game controllers or a USB flash drive instead of the SD card).

The STM32G0 would run at ~43MHz (12 * NTSC colorburst); with the flexibility of the PLL, any NTSC-based crystal may be used, including the most common 14.31818MHz ones. The video output would be 7 cycles / pixel, which gives exactly square pixels; visible dimensions would be at most 288x216. The significance of square pixels is that it is possible to design a variant which drives a 320x240 LCD directly.
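The clock arithmetic above can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, and it relies on two standard figures that are not from the post itself, namely that the NTSC colorburst is exactly 315/88 MHz and that the usual square-pixel sampling rate for NTSC is 135/11 MHz. At 7 cycles per pixel the pixel clock comes out at exactly half of that square-pixel rate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* NTSC colorburst is exactly 315/88 MHz (about 3.579545 MHz).
   A 14.31818 MHz crystal is 4x colorburst, so the CPU clock is 3x crystal. */
#define COLORBURST_NUM_HZ 315000000ULL
#define COLORBURST_DEN    88ULL

/* CPU clock = 12 * colorburst, rounded down to whole Hz. */
static inline uint32_t cpu_clock_hz(void)
{
    return (uint32_t)((12ULL * COLORBURST_NUM_HZ) / COLORBURST_DEN);
}

/* Pixel clock when every pixel is held for cycles_per_pixel CPU cycles. */
static inline uint32_t pixel_clock_hz(uint32_t cycles_per_pixel)
{
    assert(cycles_per_pixel != 0U);
    return cpu_clock_hz() / cycles_per_pixel;
}

/* Exact rational check, no rounding involved:
   (12 * 315/88) / cycles == (135/11) / 2
   <=> 12 * 315 * 22 == 135 * 88 * cycles */
static inline bool is_half_square_pixel_rate(uint32_t cycles_per_pixel)
{
    return (12ULL * 315ULL * 22ULL) == (135ULL * 88ULL * cycles_per_pixel);
}
```

With these helpers `cpu_clock_hz()` gives 42954545 Hz, `pixel_clock_hz(7)` gives 6136363 Hz, and `is_half_square_pixel_rate(7)` holds, while no other divider satisfies it.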

Palette would be RRGGBBII; I already designed the appropriate resistor network DAC for this. The 'I' bits are the least significant Intensity bits shared between the channels, so you could have 16 gray levels with this setup, and overall smoother intensity ramps for any color. On the ARM it could be argued that more than 8 bits should be possible; yes, it is, but it reduces the flexibility for video modes. And I could come up with something quite wicked for this setup.
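To make the RRGGBBII idea concrete, here is a minimal sketch of how such a byte maps to three 4-bit channel values, which is what the resistor network would effectively do. The bit order (R in bits 7-6 down to I in bits 1-0) is an assumption read off the name, and `rgb444_t`, `rrggbbii_decode` and `gray16_to_rrggbbii` are hypothetical names, not anything from an existing kernel.

```c
#include <assert.h>
#include <stdint.h>

/* One 4-bit value (0..15) per channel, as seen at the DAC output. */
typedef struct {
    uint8_t r;
    uint8_t g;
    uint8_t b;
} rgb444_t;

/* Assumed layout: bits 7-6 = R, 5-4 = G, 3-2 = B, 1-0 = shared intensity.
   Each channel's 4-bit level is its own 2 bits on top, with the shared
   intensity bits as the 2 least significant bits. */
static inline rgb444_t rrggbbii_decode(uint8_t px)
{
    uint8_t i = (uint8_t)(px & 0x3U);
    rgb444_t out;
    out.r = (uint8_t)((((px >> 6) & 0x3U) << 2) | i);
    out.g = (uint8_t)((((px >> 4) & 0x3U) << 2) | i);
    out.b = (uint8_t)((((px >> 2) & 0x3U) << 2) | i);
    return out;
}

/* Build the pixel byte for one of the 16 gray levels: all three channels
   get the same upper 2 bits, the intensity bits carry the lower 2. */
static inline uint8_t gray16_to_rrggbbii(uint8_t level)
{
    assert(level < 16U);
    uint8_t c = (uint8_t)(level >> 2);
    uint8_t i = (uint8_t)(level & 0x3U);
    return (uint8_t)((c << 6) | (c << 4) | (c << 2) | i);
}
```

Decoding `gray16_to_rrggbbii(n)` yields r = g = b = n for every n from 0 to 15, which is where the 16 gray levels come from.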

So the main video mode would look as follows:
  • Background VRAM is 16 bits / tile, 6 bits specifying palette, 10 bits the tile.
  • The background tiles are 2bpp, but all the 4 colors are selectable by the 6 palette bits in the VRAM entry.
  • Sprites use a blitter tile approach like Uzebox RAM tiles, using 4bpp blitter tiles. 12 sprite colors can be specified.
  • The sprite blitter I designed uses 2bpp sprites; the 3 sprite colors are freely selectable from the 12 available for sprites.
Overall this would have pretty much a NES-like look and feel, 2bpp all around with very flexible palette selections. The sprite blitter, clock for clock, would as far as I can see surpass anything on the Uzebox, so there shouldn't be a bottleneck in getting those sprites on-screen. A 256 tile background would take 4K of RAM, 128 blitter tiles another 4K, and the VRAM about 2K, which is quite agreeable for the RAM size of the STM32G0 (the maximums are 1024 bg tiles, which can cover the whole screen, and 255 blitter tiles for sprite output; using 1024 bg tiles can be interesting as it could give a surface resembling the Commodore 64's Multicolor graphics mode).
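The VRAM entry format and the RAM figures can be sketched in C as follows. Two things here are assumptions rather than part of the design as posted: that the 6 palette bits sit above the 10 tile bits, and that tiles are 8x8 pixels (the latter is at least consistent with the 4K / 4K / 2K numbers). All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SCREEN_W 288U
#define SCREEN_H 216U
#define TILE_W   8U   /* assumed tile size */
#define TILE_H   8U

/* Pack a 16 bit VRAM entry: assumed layout is palette in bits 15-10,
   tile index in bits 9-0. */
static inline uint16_t vram_entry(uint32_t palette, uint32_t tile)
{
    assert(palette < 64U && tile < 1024U);
    return (uint16_t)((palette << 10) | tile);
}

static inline uint32_t vram_palette(uint16_t entry)
{
    return (uint32_t)entry >> 10;
}

static inline uint32_t vram_tile(uint16_t entry)
{
    return (uint32_t)entry & 0x3FFU;
}

/* Bytes taken by one tile at the given bit depth. */
static inline uint32_t tile_bytes(uint32_t bpp)
{
    return (TILE_W * TILE_H * bpp) / 8U;
}

/* Number of VRAM entries needed to cover the visible screen. */
static inline uint32_t vram_entries(void)
{
    return (SCREEN_W / TILE_W) * (SCREEN_H / TILE_H);
}

/* Bytes taken by the VRAM at 2 bytes per entry. */
static inline uint32_t vram_bytes(void)
{
    return vram_entries() * 2U;
}
```

Under these assumptions 256 background tiles take 256 * 16 = 4096 bytes, 128 blitter tiles take 128 * 32 = 4096 bytes, and the 36x27 VRAM takes 1944 bytes. The screen needs 972 entries, so 1024 distinct background tiles are indeed enough to give every cell its own tile.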

There are some uncertainties, but it seems like this video mode would work, and overall it seems just right for the RAM budget while providing great flexibility. I also have an RLE mode which could be useful for vector stuff.

Otherwise the system is quite simple. SD card through SPI, no need for level shifters as the MCU is 3.3V already, stereo audio out by PWM, SNES controllers which as far as I see were reported to run just fine from 3.3V. It should have a similar overall part count to the original Uzebox (the video output using more parts due to the intensity bits, essentially needing a 4-4-4 RGB DAC, however less voltage shifting). The current design I sketched up can work with the smallest 32 pin LQFP package, such as the STM32G071K8.
bodyjarrocks
Posts: 1
Joined: Wed Jan 23, 2019 11:06 am

Re: Ideas around an STM32G0

Post by bodyjarrocks »

Hi Jubatian,

What are your thoughts on pixel output timing? I have successfully got the STM32F4 to produce PAL output but I had to use DMA. At least on that part you couldn't rely on the number of cycles per instruction; it is a bit non-deterministic. On the flip side, DMA allows this to be done without loading up the CPU. A much cleaner example of the DMA approach is on GitHub within the BitBox project. Timers and DMA get it all done.

What are your thoughts?
Mike
Jubatian
Posts: 1560
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Ideas around an STM32G0

Post by Jubatian »

Hi,

Not a bad idea at all, and on larger STM32s, when implementing the firmware capable of running the games, I would do it with DMA. I was thinking about something DMA driven here at first, but on the Cortex M0+ the GPIO is not connected to the AHB, but rather directly to the CPU (see for example here: Microchip Cortex M0+; of course official docs and datasheets also contain similar images, I just couldn't find one easily linkable). The GPIO is thus accessible with single cycle loads and stores, however the DMA module doesn't "see" it any more as it is not on the AHB.

When your code can barely fit, if it otherwise interleaves well with the pixel outputs, then DMA wouldn't help a whole lot anyway.

I like the Cortex M0+ as it has an easier to understand pipeline, and it can be totally deterministic (which also means it is easier to emulate). There will be one problem though: the Flash. At 43MHz it has to run with 1 wait state, with the small instruction cache turned on for better performance. However stuff within RAM is not affected by this, and I would place the cycle-perfect part of the video driver there.

On larger STM32 devices (or possibly other ARMs with sufficient RAM starting at 0x20000000, or an MMU to make it look like it starts at 0x20000000) of course all this wouldn't apply; the graphics mode would just render into a DMA buffer, making it simple to do scandoubling, for example, to get VGA output. I actually have a BitBox, and it is on my mind to eventually get an implementation of this going on it. But I would really like to start off with the STM32G0 to see how it works out.

Otherwise the whole sync timing would work out well with synchronized timers and PWM; I even sketched up how I can get 31KHz audio on the STM32G0 employing some tricks with these timers.
Jubatian
Posts: 1560
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Ideas around an STM32G0

Post by Jubatian »

Wow... I was quite surprised to find another candidate, the Kinetis KL28, a quite spacious Cortex-M0+ with 512K Flash and 128K RAM! (The latter is what is particularly interesting to me.) And contrary to the STM32G0, this one has DMA capability on the IO ports; the reference manual clearly explains that they chose to implement it that way so you can access them either single-cycle or through DMA (and they even describe that the single cycle access by the CPU would take precedence in a race condition with the DMA, wow!). The RAM is at 0x20000000 like on ST devices.

Since I really wanted an option for games in the 100Kbytes range, I guess I will just go with this one and stick to DMA for graphics, which would be more portable even between various ARMs (for the console's kernel, that is; the games themselves are meant to be runnable on anything having a working kernel and capable of executing the Cortex-M0+'s limited instruction set fast enough with the game sitting at address 0x20000000). The graphics mode described above should do well with DMA output as well (as a line buffered mode), and there are also further possibilities with the DMA.
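For illustration, a line buffered version of the background layer could look roughly like the sketch below: the CPU fills a buffer with one byte per pixel for the next scanline, and the DMA later streams that buffer to the video port. Everything about the data layout is an assumption made for the example (8x8 tiles, 2 bytes per tile row with the leftmost pixel in the top bits, palette index in the upper 6 bits of the VRAM entry, 64 palettes of 4 colors stored back to back), and the names are hypothetical. Sprites and the DMA setup itself are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    LINE_PIXELS = 288,                      /* visible width */
    TILE_PIXELS = 8,                        /* assumed tile width and height */
    LINE_TILES  = LINE_PIXELS / TILE_PIXELS /* 36 VRAM entries per row */
};

/* Render one background scanline into line[], one palette-resolved
   byte per pixel, ready to be pushed out by DMA.
   vram_row:  LINE_TILES entries of the tile row containing this scanline
   tiles:     2bpp tile data, 16 bytes per tile (assumed layout)
   palettes:  64 palettes * 4 colors, flat array of 256 bytes
   y_in_tile: which of the 8 rows inside the tiles to render (0..7) */
static void render_bg_line(uint8_t *line,
                           const uint16_t *vram_row,
                           const uint8_t *tiles,
                           const uint8_t *palettes,
                           unsigned y_in_tile)
{
    assert(y_in_tile < (unsigned)TILE_PIXELS);
    for (unsigned tx = 0; tx < (unsigned)LINE_TILES; tx++) {
        uint16_t entry = vram_row[tx];
        /* Upper 6 bits select one of 64 four-color palettes. */
        const uint8_t *pal = &palettes[(size_t)(entry >> 10) * 4U];
        /* Lower 10 bits select the tile; fetch its 16 bit row. */
        const uint8_t *row =
            &tiles[(size_t)(entry & 0x3FFU) * 16U + (size_t)y_in_tile * 2U];
        uint32_t bits = ((uint32_t)row[0] << 8) | row[1];
        for (unsigned px = 0; px < (unsigned)TILE_PIXELS; px++) {
            /* Leftmost pixel lives in the two most significant bits. */
            line[tx * (unsigned)TILE_PIXELS + px] =
                pal[(bits >> (14U - 2U * px)) & 0x3U];
        }
    }
}
```

On real hardware one would typically keep two such buffers and render into one while the DMA sends out the other, but that part depends on the chip and is not shown.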

ESP8266, or well, any other Wifi module. I was thinking about these for a while, and I feel like I would go with one of them. I like Nicksen's ideas; if the Uzebox was more capable, those could lead to a quite easy to use system for the end users. With the Uzebox, the primary problem is Flash lifetime, which severely limits the feasibility of the kind of game selector he is envisioning.

With that solution, I could skip pondering about USB, and drop the SD card too, replacing it with an SPI Flash chip and getting rid of all the related complications (software: SD and filesystem, and user experience too, as more people have Wifi these days than SD card readers). You would connect to the device with Wifi and manage games and saves that way, including the possibility of exchanging games or saves easily between two of these consoles.
nicksen782
Posts: 714
Joined: Wed Feb 01, 2012 8:23 pm
Location: Detroit, United States

Re: Ideas around an STM32G0

Post by nicksen782 »

Jubatian wrote: Sat Feb 02, 2019 6:44 pm ESP8266, or well, any other Wifi module. I was thinking about these for a while, and I feel like I would go with one of them. I like Nicksen's ideas, if the Uzebox was more capable, those could lead to a quite easy to use system for the end users. With the Uzebox, the primary problem is Flash lifetime severely limiting the feasibility of such a game selector he is envisioning.
Perhaps I have misrepresented my ideas. The idea was to use a Uzebox program (which would need to be flashed from the bootloader) to update your local games library. After you have the game, you would then use the bootloader to play. My program would use the 8266 to download files and would then write the files to the SD. I would need PetitFs since I would be creating new files... potentially with fragmentation. I do want to at least detect fragmented files though.
Jubatian
Posts: 1560
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Ideas around an STM32G0

Post by Jubatian »

I mostly thought about the use-case you envisioned for game development, no longer needing to swap around SD cards. However even when not developing games, in the normal use case, your game downloader has to be programmed into the ATmega before it can do anything; that's a unit of wear on the Flash for every occasion you want to connect to the Net for this purpose.

Sure, I pushed the idea maybe a bit farther than you imagined it for the Uzebox; in my mind the game loader of this system would be more elaborate, offering network options directly. I even thought, what if I could configure the Wifi module to act as an access point running a simple http server, so you could access it like a Wifi router's admin interface and upload games directly (our mindsets may be a bit different, I always tend to think about solutions not involving a centralized resource which could fail, become inaccessible, or worse, get compromised, so I rather consider these types of peer-to-peer solutions or other simple means to pass around games and data locally as primary features).
nicksen782
Posts: 714
Joined: Wed Feb 01, 2012 8:23 pm
Location: Detroit, United States

Re: Ideas around an STM32G0

Post by nicksen782 »

Yes, it would absolutely require the flash for my game downloader and then of course for every game ever played.

What if the bootloader could trigger an alternative program inside the ESP, which would then provide the bootloader with the files for writing and also with what to display on screen? We might not need an ATMEGA game downloader. Perhaps the game downloader should be in the ESP instead. You would need to modify the bootloader for this and I do not know the feasibility. I'll respond to the ESP bootloader thread since this is technically off-topic.
Jubatian
Posts: 1560
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Ideas around an STM32G0

Post by Jubatian »

Some further sketching and thoughts.

So I found it quite annoying to stick to square pixels just for the possibility of interfacing an LCD, and started to look into solutions for getting around this problem. There are powerful ARMs around, so couldn't one actually do the horizontal scaling as well while running some game designed with 43MHz in mind? Poking around with some code and theory, this seems possible.

And I still have the BitBox, and a repo which compiles well enough to get me a second-stage bootloader which I can run on the thing. Something is broken somewhere deep in the SD / Filesystem, so it can't actually load anything (I never managed to get it working), but the first stage bootloader can start the second-stage one (as that's burned in and is capable of accessing the card, fetching the image and copying it into RAM). That is sufficient for the immediate purpose: the particular Cortex M4 on the BitBox is exactly the sort of micro which I have in mind for running with horizontal scaling. If I can get it working, then at least the proof-of-concept would be there.

This would allow much nicer ports including Uzebox stuff to this system, as you could carry over graphics maintaining the correct aspect ratio, and of course it may also be fun to pixel for.

It also crossed my mind to just drop the Cortex M0/M0+, rather aiming for the Cortex M3 as a minimal core. The point is that the M3 has the full Thumb-2 instruction set instead of the quite limited subset in the M0/M0+, and from a DMA based video mode's perspective, the pipelined loads & stores are also quite useful. The full instruction set wouldn't incur extra processing cost in an emulator, so at the same host load games could potentially do more. Still 43MHz of course (12 * NTSC Colorburst), just a lot more instructions.

(Note: I am not aiming for cycle-perfect emulation with this as in general it would be a quite loosely defined system regarding the clocking, only the 43MHz minimum set with a well-defined VBlank time allowance for games).

Sort of random rambling, at least the BitBox starts to look even more appealing to be used as a base for real HW experimenting.
Jubatian
Posts: 1560
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Ideas around an STM32G0

Post by Jubatian »

Just thought to share. This is just pretty slick.

void __attribute__((noinline)) test_4bpp(uint_fast32_t data, uint8_t const* pal, uint8_t* out)
{
20000264:	b410      	push	{r4}
 out[0] = pal[(data >> 28) & 0xFU];
20000266:	0f03      	lsrs	r3, r0, #28
 out[1] = pal[(data >> 24) & 0xFU];
20000268:	f3c0 6c03 	ubfx	ip, r0, #24, #4
 out[0] = pal[(data >> 28) & 0xFU];
2000026c:	5ccc      	ldrb	r4, [r1, r3]
2000026e:	7014      	strb	r4, [r2, #0]
 out[1] = pal[(data >> 24) & 0xFU];
20000270:	f811 400c 	ldrb.w	r4, [r1, ip]
20000274:	7054      	strb	r4, [r2, #1]
 out[2] = pal[(data >> 20) & 0xFU];
20000276:	f3c0 5303 	ubfx	r3, r0, #20, #4
 out[3] = pal[(data >> 16) & 0xFU];
2000027a:	f3c0 4c03 	ubfx	ip, r0, #16, #4
 out[2] = pal[(data >> 20) & 0xFU];
2000027e:	5ccb      	ldrb	r3, [r1, r3]
20000280:	7093      	strb	r3, [r2, #2]
 out[3] = pal[(data >> 16) & 0xFU];
20000282:	f811 400c 	ldrb.w	r4, [r1, ip]
20000286:	70d4      	strb	r4, [r2, #3]
 out[4] = pal[(data >> 12) & 0xFU];
20000288:	f3c0 3303 	ubfx	r3, r0, #12, #4
 out[5] = pal[(data >>  8) & 0xFU];
2000028c:	f3c0 2c03 	ubfx	ip, r0, #8, #4
 out[4] = pal[(data >> 12) & 0xFU];
20000290:	5ccb      	ldrb	r3, [r1, r3]
20000292:	7113      	strb	r3, [r2, #4]
 out[5] = pal[(data >>  8) & 0xFU];
20000294:	f811 400c 	ldrb.w	r4, [r1, ip]
20000298:	7154      	strb	r4, [r2, #5]
 out[6] = pal[(data >>  4) & 0xFU];
2000029a:	f3c0 1303 	ubfx	r3, r0, #4, #4
 out[7] = pal[(data      ) & 0xFU];
2000029e:	f000 000f 	and.w	r0, r0, #15
 out[6] = pal[(data >>  4) & 0xFU];
200002a2:	5ccb      	ldrb	r3, [r1, r3]
200002a4:	7193      	strb	r3, [r2, #6]
 out[7] = pal[(data      ) & 0xFU];
200002a6:	5c09      	ldrb	r1, [r1, r0]
200002a8:	71d1      	strb	r1, [r2, #7]
}
200002aa:	f85d 4b04 	ldr.w	r4, [sp], #4
200002ae:	4770      	bx	lr
So this is what GCC generates for a 4bpp paletted tile row output on the Cortex-M4 (experimenting with the BitBox), the tile row itself coming as a 32 bit input, with the highest nybble for the leftmost pixel. On these ARMs loads and stores can pipeline together, so those groups of 4 loads and stores would take 5 cycles. The ubfx instruction is pretty wicked, not unlike some wild instructions I came up with for my own fantasy designs. It is described for example here; here it just fits wonderfully for the task at hand.

So overall this does the job consuming about 3 cycles to generate one pixel, and it is still C code.
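For readers who find the interleaved disassembly hard to follow, the same operation can be written as a short loop. This is a sketch and not the exact source of `test_4bpp` above: `bitfield_extract` and `expand_4bpp_row` are hypothetical names, and the helper simply spells out in C what a single ubfx (unsigned bit field extract) does. A compiler targeting the Cortex-M4 may or may not unroll this into the same instruction sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Take `width` bits of `value` starting at bit `lsb` and return them
   zero-extended, which is the job of one ubfx instruction. */
static inline uint32_t bitfield_extract(uint32_t value,
                                        unsigned lsb,
                                        unsigned width)
{
    assert(width >= 1U && width < 32U && lsb + width <= 32U);
    return (value >> lsb) & (((uint32_t)1 << width) - 1U);
}

/* Expand one 8 pixel row of a 4bpp tile through a 16 entry palette.
   The highest nybble of data is the leftmost pixel. */
static void expand_4bpp_row(uint32_t data,
                            const uint8_t *pal,
                            uint8_t *out)
{
    for (unsigned i = 0; i < 8U; i++) {
        out[i] = pal[bitfield_extract(data, 28U - 4U * i, 4U)];
    }
}
```

Calling `expand_4bpp_row(0x01234567, pal, out)` fills `out[0]` to `out[7]` with `pal[0]` to `pal[7]` in order, since the nybbles are consumed from the top down.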

The horizontal scaler seems to average somewhere around 13 cycles per pixel, so it feels pretty much like a higher performance Cortex-M4 would be capable of doing the job, including the one on the BitBox console (the BitBox itself might not even really need the scaler, I am just experimenting, aiming towards making it actually possible to get a system directly interfacing a 320x240 LCD panel without the rigidity of square pixels).