Experimentation with Adafruit ItsyBitsy M4 board

Discuss anything not related to the current Uzebox design like successors and other open source gaming hardware
mast3rbug
Posts: 81
Joined: Sat Jan 08, 2011 8:15 am

Re: Experimentation with Adafruit ItsyBitsy M4 board

Post by mast3rbug »

By the way for some tricks on setting up stuff on an ARM, you may check the BitBox repo, it is mostly done at bare metal level, although for an STM32F4. But some ideas could be useful for approaching the problem.
I looked at the BitBox console; it has really good hardware. I may be wrong, but the community and software collection seem small. Why? I didn't know this project existed. Maybe it is too hard to build?
Jubatian
Posts: 1563
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Experimentation with Adafruit ItsyBitsy M4 board

Post by Jubatian »

mast3rbug wrote: Tue Feb 19, 2019 11:33 pm I looked at the BitBox console; it has really good hardware. I may be wrong, but the community and software collection seem small. Why? I didn't know this project existed. Maybe it is too hard to build?
Currently for me it just doesn't work, there is some problem somewhere within the SD interfacing (See issue here).

I am pondering some ARM ideas of my own as well, and the BitBox will still be a useful asset for that (I have one). In summary, my idea is a binary API which could be implemented on almost any ARM based solution and which could run games written for this API (the games themselves being in RAM and not interacting with any of the hardware directly). There is a lot to consider here, but if I could make it work, it could be fun: you could really build your own hardware for it.
mast3rbug
Posts: 81
Joined: Sat Jan 08, 2011 8:15 am

Re: Experimentation with Adafruit ItsyBitsy M4 board

Post by mast3rbug »

A little interesting test I made just out of curiosity: I wanted to know the difference in rendering between the ATxmega and the ARM, so I made two small videos of VRML object rendering:

In red: Cortex-M4 at 120 MHz (SAMD51G19A).

In green: ATxmega256A3, a 32 MHz part overclocked to 48 MHz.

The line routine is in C. I need to write it in ASM for both platforms, as a lot of cycles are wasted here. But the point is that for the same C code without any optimisation, the 8 bit ATxmega performs very well versus the 32 bit ARM processor.



Jubatian
Posts: 1563
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Experimentation with Adafruit ItsyBitsy M4 board

Post by Jubatian »

Interesting.

The 8 bit AVRs are great micros in terms of performance and code density. Clock for clock, in tasks not involving intensive math wider than 8 bits, I would expect them to easily surpass the Cortex-M0 in both performance and code size. The XMegas are a bit worse than the Megas at the same clock speed, but they can go up to 32MHz, and might run rather reliably even at 64MHz in hobbyist projects. I actually tested XMega ALUs under such conditions with XMBurner; when increasing the clock further, to 72 - 80MHz, the first thing to give up was, as might be expected, the internal Flash.

The Cortex-M3 and above should do slightly better clock for clock even at 8 bits, as they have a bunch of extra instructions covering most of what the AVR core can do and even more, making good use of the 32 bit instruction size (Thumb-2) to pack useful things into a single one-clock instruction (like doing a shift and a move within an add instruction). Some quite surprising results can show up when examining the code an optimizing compiler generates for them, and how these instructions can be exploited. Loads and stores are also significantly better clock for clock, especially compared to the XMega's 3 clock load with displacement (seriously, compared to the Mega's 2 clock load with displacement that is the worst trade-off ever, this being a critical instruction for accessing structure members and variables on the stack). However, even at 16 bit math the AVR core might keep up clock for clock by having more than twice as many usable registers as the ARM, and a C optimizer might also have an easier job with it due to this, not having to resort to swapping stuff in and out of RAM. It is just rare to see the AVR run out of registers when dealing with anything.

The ARM one still appears to run significantly faster, maybe more than 2 times faster. It is just difficult to tell, as on neither does the frame rate quite reach the level where the brain would see it as fluid motion.

The C code itself might also be optimized for the AVR through its choice of types. If so, you might consider replacing for example "uint16_t" and similar with "uint_fast16_t" and similar where appropriate: on the AVR these will have exactly the width specified, while on the ARM they would typically map to 32 bit types. Of course, keep in mind that they may be wider than the type's name indicates, so take care wherever exact wrap-around matters. On the ARM this could remove some unnecessary masking and sign / zero extension. Data buffers don't need such a change (for size reasons, too!) as the ARM instruction set is very well suited to handling narrower data buffers.
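A minimal sketch of such a type swap, assuming a made-up helper (`sum_pixels` is purely illustrative and not from the actual renderer): the data buffer keeps its exact-width `uint8_t` elements, while the loop counter and accumulator use the "fast" types. The actual width of `uint_fast16_t` depends on the toolchain, so the result is masked where a 16 bit wrap is required.

```c
#include <stdint.h>

/* Hypothetical example: sum a row of 8 bit pixels.
 * The buffer stays uint8_t (exact width, compact in RAM), while the
 * counter and accumulator use "fast" types, which the toolchain may
 * make wider than 16 bits where that is cheaper for the CPU. */
uint_fast16_t sum_pixels(const uint8_t *row, uint_fast16_t count)
{
    uint_fast16_t sum = 0;
    for (uint_fast16_t i = 0; i < count; i++) {
        sum += row[i];
    }
    /* Mask explicitly so the result wraps at 16 bits on every target. */
    return sum & 0xFFFFu;
}
```

Only the final mask remains; inside the loop the compiler is free to work in whatever register width is natural for the target.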
mast3rbug
Posts: 81
Joined: Sat Jan 08, 2011 8:15 am

Re: Experimentation with Adafruit ItsyBitsy M4 board

Post by mast3rbug »

Interesting experiment: just by enabling the cache in the main program and disabling it in the video interrupt, the frame rate becomes 2.5x faster.

Also, when running smooth code in the main loop (I mean no heavy calculation like the VRML object, only loops moving memory data to the screen), the jitter in the video interrupt almost disappeared. I think this is because the cache is able to do a great job and run at almost 0 wait states... Is the cache 0 wait state?

Jubatian:
I have only a little experience with caches, and I still don't understand how branch prediction works. But like you said in a previous post, maybe it's possible to lock the bit-bang video output code in the cache and still have some cache remaining for the main code? I will read the chip's cache documentation and also look for a tutorial on how caches work, specifically in ARM processors.

Update:
Also, just by switching the compiler to -O1 I now get about 20 frames / second.
Jubatian
Posts: 1563
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Experimentation with Adafruit ItsyBitsy M4 board

Post by Jubatian »

mast3rbug wrote: Sat Feb 23, 2019 10:37 pm Update:
Also, just by switching the compiler to -O1 I now get about 20 frames / second.
Yikes! It is easy to bring something down to a crawl for sure by forgetting that! :lol:

By the way, as far as I know, the general rule with GCC is to use -O2 for code sensitive to performance and -Os for code which you just want to get as small as possible; it is worth selecting this on a file by file basis. -O3 might bump performance up somewhat, but may just as well end up with slower and bigger code, slower due to being bigger and not fitting well in caches.

Be wary that the cache, and other methods to get around slow Flash and bus contention on ARM micros, vary vendor by vendor, as this isn't something covered by the ARM core itself. So for your chip you need to look specifically at the Microchip (formerly Atmel) documentation.

What can happen when there is no room in the cache at 120 MHz is about the following:

The Flash has to run at some 4-6 wait states; I don't remember exactly which applies to your chip, but something like that. Typically the Flash offers 64 bit wide fetches (at least I saw this being common on the STM32s we use), which is enough to grab anything between 4 16bit wide ARM instructions and 2 32bit wide ones in one Flash cycle. So for example suppose you set your clock to 120MHz and the Flash to 4 wait states, and you run code directly from the Flash with only prefetching (for sequential code, speculatively loading in advance to keep the instruction fetch unit filled, a common practice). Then at best the code will execute as if the MCU was clocked at about 96MHz. This is because 4 wait states means that you get a load only every 5th cycle, and you can have at most 4 instructions in that load, so in a typical ARM instruction stream where every instruction takes one clock to execute, there will be one cycle in every five when the MCU is just waiting for the next instruction.
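The arithmetic above can be written down as a rough model. This is only a sketch under simplifying assumptions (purely sequential one-clock instructions, perfect prefetch, no cache, no branches), and `effective_khz` is a name made up for illustration. With 120MHz, 4 wait states and 4 instructions per fetch it gives 4 instructions per 5 cycles, that is 96MHz effective.

```c
#include <assert.h>
#include <stdint.h>

/* Rough model only: one Flash fetch completes every (wait_states + 1)
 * core cycles and delivers instr_per_fetch one-clock instructions.
 * Returns the clock (in kHz) the code appears to run at. */
uint32_t effective_khz(uint32_t clock_khz, uint32_t wait_states,
                       uint32_t instr_per_fetch)
{
    assert(instr_per_fetch > 0);
    uint32_t cycles_per_fetch = wait_states + 1;
    if (instr_per_fetch >= cycles_per_fetch) {
        return clock_khz;   /* Flash keeps the core fully fed */
    }
    /* 64 bit intermediate avoids overflow for large clock values. */
    return (uint32_t)((uint64_t)clock_khz * instr_per_fetch / cycles_per_fetch);
}
```

With only 32 bit wide instructions in the stream (2 per fetch), the same model drops to 48MHz, which shows why code density matters so much when running straight from Flash.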

As you may imagine, jumps can get quite expensive. The pipeline on the ARM is short, so that's not a big problem, but when you jump and it wasn't predicted right, the MCU first has to wait for a cycle in which a fetch from the Flash can happen. Compilers typically try to align possible jump targets, or at least function entry points, to boundaries matching the Flash fetch width, so that once the fetch can happen, it loads a full fetch worth of instructions to execute.

The cache itself is 0 wait states, or at least it should be (check it, but I expect it to be so). The RAM too is 0 wait states (as far as I saw on STM32s). However, the problem which can arise there is bus contention. If you do DMA and data loads and stores, all happening on the same bus where the ARM core itself is trying to guzzle up instructions all the time, then things eventually get stalled a lot. The STM32s which have Core Coupled Memory offer a solution to this by providing RAM on two separate buses, so you can have code in the CCM and data / DMA going on in the regular RAM, coexisting peacefully. On your micro I expect the cache to be something similar, living on a different bus from your main RAM.

So something like that.
mast3rbug
Posts: 81
Joined: Sat Jan 08, 2011 8:15 am

Re: Experimentation with Adafruit ItsyBitsy M4 board

Post by mast3rbug »

Thanks for all that information, it helps me better understand some points of how caches work.

I made a small cool effect yesterday. My goal with this microcontroller is maybe to make a demoscene production with it, Amiga style, so I started on my first effect. Not too bad for a first one. I use no sprites, only plain rectangle fills and page erase with double buffering. (Just as a reminder, the video memory is RAM mapped, unlike the Uzebox with its sprites, so the technique is different.) My second goal is to take the source code of a Uzebox game and try to recompile it (only the main game code) to work on the ARM. So maybe we can recompile all the game source codes for the ARM and be able to run all of them on it. All the kernel and video modes can be emulated in function calls. The final aim is to offer the community a new compatible platform, able to work with the current console's games and also with a new kernel able to do more advanced things.

Jubatian
Posts: 1563
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Experimentation with Adafruit ItsyBitsy M4 board

Post by Jubatian »

Nice and quite interesting ideas!

I am working on something similar, however approaching the problem from a different perspective. I am trying to come up with an ARM game kernel which you could adapt onto any vendor's ARM, and the games for it would execute only Thumb instructions (suitable for Cortex-M3 and above), not interfacing with any hardware themselves. Nothing to demonstrate yet, but I feel like I am making good progress (for real HW I am using the BitBox Console, not even digging into it too deep, just enough to get the ARM game kernel interfacing it okay).

Games for this could be compiled by targeting a generic Cortex-M3 and including the ARM game kernel's header (which provides the definitions and the locations of the interface functions; they are in a fixed area in a vector table). I will design a binary format for it, but would make it so that loading HEX files or raw binaries also works.

What would be really awesome about this is that you could build hardware for it around any ARM having sufficient RAM and supporting at least the Cortex-M3 instruction set, realizing the peripherals and even the display in any suitable manner (the palette would be BBGGGRRR, that is, the same as the Uzebox, as this is simple to build from resistors while offering a good diversity of colors; for higher bit depths, the ARM can do palette lookup quite fast). This could thus be quite a long-term solution, as compatible HW should exist for it for decades!

So of course the ItsyBitsy should also be capable of running this once I get it sufficiently complete and it has been adapted to it.
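To illustrate the BBGGGRRR layout and the palette lookup mentioned above, here is a small sketch. The function names and the 16 bit output format are assumptions made for the example, not part of any existing kernel; the bit positions simply follow the name of the format (blue in bits 7-6, green in bits 5-3, red in bits 2-0).

```c
#include <stddef.h>
#include <stdint.h>

/* Pack 8 bit per channel RGB into one BBGGGRRR byte by keeping the
 * top 2 bits of blue and the top 3 bits of green and red. */
uint8_t pack_bbgggrrr(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)(((b >> 6) << 6) | ((g >> 5) << 3) | (r >> 5));
}

/* Expand a line of BBGGGRRR pixels to a deeper format through a
 * 256 entry lookup table (hypothetical 16 bit panel format). */
void expand_line(const uint8_t *src, uint16_t *dst, size_t count,
                 const uint16_t palette[256])
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = palette[src[i]];
    }
}
```

The lookup is a single indexed load per pixel, which is the kind of operation the ARM handles well.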
mast3rbug
Posts: 81
Joined: Sat Jan 08, 2011 8:15 am

Re: Experimentation with Adafruit ItsyBitsy M4 board

Post by mast3rbug »

It looks like quite a challenge if you take into account that making a stable picture on just one micro is already tricky. How will you code all the kernels for all the different microcontrollers? You will have to purchase a lot of development kits and debuggers to achieve this. You will also have to write the music kernel for all those different micros.

Cedric
Jubatian
Posts: 1563
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Experimentation with Adafruit ItsyBitsy M4 board

Post by Jubatian »

mast3rbug wrote: Wed Mar 06, 2019 4:38 am How will you code all the kernels for all the different microcontrollers?
I am approaching the problem from a different perspective: what kind of simple interface could the cross-platform kernel provide, which you could easily attach to the display drive mechanism once you have built it?

In this particular regard the interface currently looks as follows:


/**
 * @brief  Request scanline
 *
 * Requests next scanline from the game kernel. There are 216 scanlines, so it
 * has to be called 216 times after entering the Video Frame to get a complete
 * picture. The target has to have a size corresponding to the pixel aspect
 * ratio returned, up to 512 bytes. The active image is centered on the buffer
 * specified for the pixel aspect ratio (for example the 288 pixels wide
 * square pixels mode would start at index 16, if only 256 pixels were used,
 * then at index 32). The edges are always blanked out. Palette is BBGGGRRR.
 *
 * @param  buf:  Target buffer to fill.
 */
void GKH_Video_NextLine(uint8_t* buf);
So if you drive your display by DMA tied to a port wired to output BBGGGRRR, you can pass the DMA buffer itself to this function from your display interrupt, and it will fill it up properly. However, if you have something more complex and run the ARM sufficiently fast, you can also do additional post-processing on the data (such as scaling if your hardware uses a 320x240 LCD panel).
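A minimal sketch of how a hardware port might drive this interface. Everything except the `GKH_Video_NextLine` prototype, the 216 line count and the 512 byte maximum is an assumption for illustration: the function body here is only a stand-in so the sketch compiles on its own, `render_frame` and `line_sink_t` are made-up names, and on real hardware each call would come from the display interrupt rather than from a loop.

```c
#include <stdint.h>
#include <string.h>

#define GK_SCANLINES  216u   /* scanlines per frame, per the description */
#define GK_LINE_BYTES 512u   /* maximum line buffer size, per the description */

/* Stand-in for the real game kernel call: blanks the line and counts
 * how many lines were requested. */
unsigned stub_lines_served;
void GKH_Video_NextLine(uint8_t* buf)
{
    memset(buf, 0, GK_LINE_BYTES);
    stub_lines_served++;
}

/* Hypothetical hook receiving a finished line (e.g. to start a DMA). */
typedef void (*line_sink_t)(const uint8_t* line, unsigned index);

/* Request one full frame, alternating between two line buffers so one
 * can be output while the next is being filled. */
void render_frame(line_sink_t sink)
{
    static uint8_t line[2][GK_LINE_BYTES];
    for (unsigned y = 0; y < GK_SCANLINES; y++) {
        uint8_t* buf = line[y & 1u];
        GKH_Video_NextLine(buf);
        if (sink) {
            sink(buf, y);
        }
    }
}
```

Post-processing such as scaling for an LCD panel would fit naturally inside the sink, working on the filled line before it is sent out.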

Audio is similarly buffer based, you pass the DMA buffer and the number of samples you need for a frame worth of audio, and it will fill it.

The point is that all the game kernel code (including video modes and a Uzebox VSync-mixer like audio engine) is cross-platform C, with a thin interface towards the hardware which is reasonably simple to implement once you have built something capable of driving a display. Put another way: if this existed and had games, then once you brought up some ARM based hardware to the point where it was capable of realizing some games, you could easily get this running on it.

Of course I am just mentioning that I am aiming towards something like this. Nothing is coming anytime soon, as I am pretty much bogged down and drained nowadays (work and related stuff), but I do have some experience in realizing easily portable software.