XMega-Uzebox ideas

Discuss anything not related to the current Uzebox design like successors and other open source gaming hardware
D3thAdd3r
Posts: 3175
Joined: Wed Apr 29, 2009 10:00 am
Location: Minneapolis, United States

Re: XMega-Uzebox ideas

Post by D3thAdd3r »

If I had more confidence in this stuff, I would almost like to see what something like a Cortex-M3 @ 32*NTSC could do. At that point there seems no doubt that SNES graphics are doable, but then the question is at what point it becomes too powerful. Pushing past 100 MHz probably nears the 32-bit Jaguar/Saturn/PS1 era, which is probably a bad idea for lone developers.
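
For reference, a small sketch of what "32*NTSC" works out to. It assumes the multiplier refers to the NTSC colorburst frequency (315/88 MHz), the same base the Uzebox uses at 8x; the function name `clock_mhz` is made up for this illustration.

```python
from fractions import Fraction

# NTSC colorburst is exactly 315/88 MHz (about 3.579545 MHz).
NTSC_COLORBURST_MHZ = Fraction(315, 88)

def clock_mhz(multiplier):
    """System clock in MHz for a given colorburst multiple (hypothetical helper)."""
    return float(NTSC_COLORBURST_MHZ * multiplier)

# Assumption: "32*NTSC" in the post means 32 x colorburst.
print(f"Uzebox (8x):        {clock_mhz(8):.3f} MHz")
print(f"Cortex-M3 idea (32x): {clock_mhz(32):.3f} MHz")
```

Under that assumption the 32x figure lands a little above the 100 MHz mark mentioned in the post.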
Jubatian
Posts: 1560
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: XMega-Uzebox ideas

Post by Jubatian »

I think that shouldn't be done here since it already exists: it is called BitBox. There aren't too many people interested in these, so I guess it is best to join forces where possible: if you want 32 bits, go for a BitBox.

A nice thing about the Uzebox is that emulation is possible, allowing us to do many things (rather accurately) even without real hardware, and it also allows sharing games, as the upcoming browser interface shows.

I am occasionally also working on a 16-bit "AVR" concept, to be used for RRPGE, using a very simple line-oriented GPU. But this is rather off-topic here (I only mention it because I wish there were some nice 16-bit RISC architectures, but there aren't).

To me the 128K XMega with 8K of RAM feels nice, as it offers wider possibilities for graphics mode designs (graphics generated by code blocks may rely on extensive icall + ret use), and its CPU is overall a little faster than that of the larger ones (due to the faster call and ret). 8K of RAM is still a lot more than 4K (it would feel about 2.5 - 3 times larger, since bare necessities like VRAM, stack and mixer data are present on both systems).
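
The "2.5 - 3 times larger" estimate follows from subtracting the same fixed overhead from both RAM sizes. A minimal sketch of that arithmetic; the overhead values tried below are hypothetical and are not measured Uzebox kernel figures.

```python
def usable_ratio(small_ram, large_ram, fixed_overhead):
    """Ratio of free RAM when both systems pay the same fixed overhead."""
    return (large_ram - fixed_overhead) / (small_ram - fixed_overhead)

# Hypothetical overheads (VRAM + stack + mixer data), in bytes.
for overhead in (1024, 1365, 2048):
    ratio = usable_ratio(4096, 8192, overhead)
    print(f"overhead {overhead:4d} B -> 8K feels {ratio:.2f}x as large as 4K")
```

So the quoted range corresponds to roughly 1.3K to 2K of RAM being taken by those necessities on both systems.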
Jubatian
Posts: 1560
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: XMega-Uzebox ideas

Post by Jubatian »

Just some extra notes on the XMega, as I am working with these chips.

I realized the USART can also be used as an SPI master. This applies to the ATMega too, but only the XMega has sufficient USARTs for it. The advantage is that the USART SPI master is buffered, so you can do one transfer every 16 cycles with it, and the port accesses don't have to be spaced with rigorous cycle counting (you only need to keep them 16 clocks apart on average). Of course, with 16K of RAM an SPI RAM isn't as much of a necessity as it might be for the normal Uzebox (where it is still unexploited, but could potentially enable more complex games).

For the audio part, the most advantageous approach on the XMega would likely be to drop the inline mixer completely (as it is impractical with the 3-cycle LDS instruction). Instead, I would design it so that the SYNC pin's toggles can be sourced from PWM (or maybe the event system, as I mentioned a couple of posts before), which allows returning from the interrupt during HSync (instead of running an inline mixer). For Uzebox compatibility it could simply run a VSync mixer in HSync, but for new games this area could also be opened up for the user. Such an option would enable "racing with the beam" from user code (a simpler approach to more complex video mode effects than coding those directly within the video mode).
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm

Re: XMega-Uzebox ideas

Post by Artcfox »

Jubatian wrote: I realized the USART can also be used as an SPI master, this would apply to the ATMega too, but only the XMega has sufficient USARTs for this.
Indeed! That's the only way that I was able to get the max SPI speed out of an ATmega328 chip (without resorting to ASM).

Why do you say only the XMega has sufficient USARTs for this?
Jubatian
Posts: 1560
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: XMega-Uzebox ideas

Post by Jubatian »

Artcfox wrote: Why do you say only the XMega has sufficient USARTs for this?
I had the ATMega644P in the Uzebox in mind, which has only one. Even when resorting to ASM you can't get the maximum speed out of the SPI peripheral: at best you can pump SPI at a rate of one transfer per 18 clocks. With the USART master, you could do it in 16 clocks (which would be quite useful for the Uzebox if it was possible).
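
To put the 18 versus 16 clock figures in perspective, here is a rough best-case sketch. The clock frequency and the scanline length are assumptions about the standard Uzebox (8 x NTSC colorburst, 1820 cycles per line) and do not come from the post; real code would also spend part of each line on other work.

```python
# Assumptions (not from the post): Uzebox system clock and scanline length.
CPU_HZ = 28_636_363      # 8 x NTSC colorburst, truncated to whole Hz
CYCLES_PER_LINE = 1820   # CPU cycles in one scanline at that clock

def bytes_per_line(cycles_per_byte, cycles_per_line=CYCLES_PER_LINE):
    """Whole SPI bytes that fit in one scanline at a fixed transfer pace."""
    return cycles_per_line // cycles_per_byte

def bytes_per_second(cycles_per_byte, cpu_hz=CPU_HZ):
    """Whole SPI bytes per second if transfers run back to back."""
    return cpu_hz // cycles_per_byte

def speedup(slow_cycles, fast_cycles):
    """Throughput ratio between two transfer paces."""
    return slow_cycles / fast_cycles

for name, cycles in (("SPI peripheral", 18), ("USART SPI master", 16)):
    print(f"{name}: {bytes_per_line(cycles)} bytes/line, "
          f"{bytes_per_second(cycles)} bytes/s")
print(f"throughput gain: {speedup(18, 16):.3f}x")
```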
Jubatian
Posts: 1560
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: XMega-Uzebox ideas

Post by Jubatian »

I think the previously available instruction set documents caused a very significant misunderstanding of the XMega timings.

http://www.atmel.com/images/Atmel-0856- ... Manual.pdf

The document was updated some time ago, and it now shows that on the XMega just about every load / store operation's timing actually improved (the timings in the previous document, which were thought to relate to the internal RAM, apparently referred to the external memory interface instead).

So according to the new document, the timings for the Mega and the XMega should be as follows:

                    Mega   XMega
LD reg, ptreg       2cy    1cy
LD reg, ptreg+      2cy    1cy
LD reg, -ptreg      2cy    2cy
LDD reg, ptreg+imm  2cy    2cy
LDS reg, imm16      2cy    2cy
ST ptreg, reg       2cy    1cy
ST ptreg+, reg      2cy    1cy
ST -ptreg, reg      2cy    2cy
STD ptreg+imm, reg  2cy    2cy
STS imm16, reg      2cy    2cy

So what presumably happened on the XMega is that those load / store instructions which could perform their instruction and data bus accesses in parallel were optimized to do so. What I think happens in the core is roughly as below (remember, it is a 2-stage pipeline):

i1:     lds   r0,      variable
i2:     add   r1,      r0

       |<-- i1 fetch -->|<-- i1 fetch -->|<-- i2 fetch -->|<-- i3 fetch -->|
                        |<- i1 process ->|                |<- i2 process ->|
                                         |<- i1 databus ->|

i1:     ld    r0,      X+
i2:     add   r1,      r0

Mega   |<-- i1 fetch -->|<-- i2 fetch -->|                |<-- i3 fetch -->|
                        |<- i1 process ->|                |<- i2 process ->|
                                         |<- i1 databus ->|

XMega  |<-- i1 fetch -->|<-- i2 fetch -->|<-- i3 fetch -->|
                        |                |<- i2 process ->|
                        |<- i1 databus ->|

I think this is how the LDS / STS instructions manage to work in 2 cycles on the Mega: the second word of the opcode is the immediate address itself, so the decoding of the instruction can be fully completed while that word is being fetched, and the address is then simply forwarded as-is to the data bus.

The XMega seemingly eliminated the need for a decoding step (part of processing the instruction) for load / store instructions, so it can access its internal bus immediately after the fetch. This works only if the address is already there to use, which is why only the plain and post-incrementing accesses work in 1 cycle, and not the pre-decrementing ones or the instructions with displacement (which require some calculation to get the address).

The significance, if the XMega truly works this way (that is, if the timing information is correct), is that it would be able to do everything that the ATMega does in the Uzebox (porting would mostly require adding nops where necessary). Moreover, it would be notably faster for anything data-intensive!

(It's a pity that, if these timings and their implications are right, they didn't care about the LPM instruction, which could be done in 2 cycles using the same logic that is needed for the 1-cycle loads from data space.)

The documentation is still ambiguous though, since at the individual instructions, loads are still described as taking an extra cycle for accessing the internal SRAM (contradicting the notes on the instruction set summary). Instead of the behavior suspected above, it could still be that the XMega parallelizes stores with the processing of the subsequent instruction (with load timings inferior to the Mega's, possibly necessary to reach 32MHz). Some timing tests should be done on an XMega to verify this.

So if this new interpretation is right, it could make the idea quite viable again, since it would be possible to get a system on an XMega that is mostly source-compatible with the Uzebox.
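
A minimal sketch of what the documented timings would mean for a data-intensive inner loop, and how many nops a port would need to keep Mega-accurate timing. The cycle counts are copied from the table in this post (so they are the unverified documented figures); the mnemonic keys, the helper names and the unrolled byte copy are made up for illustration.

```python
# Cycle counts from the table in this post (documented, not yet measured).
DOCUMENTED = {
    "ld":  {"mega": 2, "xmega": 1},
    "ld+": {"mega": 2, "xmega": 1},
    "-ld": {"mega": 2, "xmega": 2},
    "ldd": {"mega": 2, "xmega": 2},
    "lds": {"mega": 2, "xmega": 2},
    "st":  {"mega": 2, "xmega": 1},
    "st+": {"mega": 2, "xmega": 1},
    "-st": {"mega": 2, "xmega": 2},
    "std": {"mega": 2, "xmega": 2},
    "sts": {"mega": 2, "xmega": 2},
}

def sequence_cycles(sequence, core):
    """Sum the cycles of a straight-line load / store sequence on one core."""
    return sum(DOCUMENTED[op][core] for op in sequence)

def nops_to_pad(sequence):
    """1-cycle nops needed on the XMega to match the Mega's timing."""
    return sequence_cycles(sequence, "mega") - sequence_cycles(sequence, "xmega")

# Hypothetical unrolled copy of one byte: ld r0, X+ followed by st Y+, r0.
copy_byte = ["ld+", "st+"]
print("Mega: ", sequence_cycles(copy_byte, "mega"), "cycles per byte")
print("XMega:", sequence_cycles(copy_byte, "xmega"), "cycles per byte")
print("nops to keep Mega timing:", nops_to_pad(copy_byte))
```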
Last edited by Jubatian on Thu Jun 08, 2017 7:20 am, edited 1 time in total.
D3thAdd3r
Posts: 3175
Joined: Wed Apr 29, 2009 10:00 am
Location: Minneapolis, United States

Re: XMega-Uzebox ideas

Post by D3thAdd3r »

Wow, that would be quite a difference with 1-cycle LD+; even with 3-cycle LPM it is a big jump. Just having Uzebox backwards compatibility gives some real credentials to a branch/successor/whatever, since such a project would start with a host of retro games (which wouldn't even seem so primitive compared to the specs of the XMega). I'd guess the 4bpp modes especially become much more reasonable, and I have to think you could then beat the NES at everything graphical.

Out of my league. Figuring things out where the data sheet might even be wrong, and deducing things from experiments is pretty hardcore. I'd be very interested to hear how this comes out after you get a chance to play with the real hardware.
Jubatian
Posts: 1560
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: XMega-Uzebox ideas

Post by Jubatian »

I did some timing tests at last.

Sadly the XMega's SRAM access proved to be slow, as follows:

                    Mega   XMega
LD reg, ptreg       2cy    2cy
LD reg, ptreg+      2cy    2cy
LD reg, -ptreg      2cy    3cy
LDD reg, ptreg+imm  2cy    3cy
LDS reg, imm16      2cy    3cy
ST ptreg, reg       2cy    1cy
ST ptreg+, reg      2cy    1cy
ST -ptreg, reg      2cy    2cy
STD ptreg+imm, reg  2cy    2cy
STS imm16, reg      2cy    2cy

An interesting thing, however, is that in the IO area (tested with GPIO and UART registers) the loads are one cycle faster:

                    Mega   XMega
LD reg, ptreg       2cy    1cy
LD reg, ptreg+      2cy    1cy
LD reg, -ptreg      2cy    2cy
LDD reg, ptreg+imm  2cy    2cy
LDS reg, imm16      2cy    2cy
ST ptreg, reg       2cy    1cy
ST ptreg+, reg      2cy    1cy
ST -ptreg, reg      2cy    2cy
STD ptreg+imm, reg  2cy    2cy
STS imm16, reg      2cy    2cy

Weird! This, however, corresponds with the timing information in the XMega AU manual (here). The weird part is that peripheral access is fast, which matches the speculations I made above on how the chip could work internally. Given that, I don't see why accessing the SRAM has to be slower. What I think is most likely is that the SRAM is placed on its own bus requiring full clock periods, and that stores aren't waited for.

It isn't good for us, but the timing is still better than if all peripheral accesses had to adhere to those slower timings.
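
A small sketch that plugs the measured figures from the two tables above into a cycle counter, to show where the XMega gains and where it loses against the Mega. The dictionary keys, the helper name and the example sequences are made up for illustration; only the cycle counts come from the tables.

```python
OPS = ["ld", "ld+", "-ld", "ldd", "lds", "st", "st+", "-st", "std", "sts"]

# Measured cycle counts from the two tables in this post.
MEGA       = dict(zip(OPS, [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]))
XMEGA_SRAM = dict(zip(OPS, [2, 2, 3, 3, 3, 1, 1, 2, 2, 2]))
XMEGA_IO   = dict(zip(OPS, [1, 1, 2, 2, 2, 1, 1, 2, 2, 2]))

def cycles(sequence, table):
    """Sum the cycles of a straight-line load / store sequence."""
    return sum(table[op] for op in sequence)

# Hypothetical sequences, chosen only to illustrate the differences.
copy_byte = ["ld+", "st+"]           # pointer-based copy of one byte
lds_heavy = ["lds", "lds", "sts"]    # fixed-address variable access

print("copy byte:", cycles(copy_byte, MEGA), "on Mega,",
      cycles(copy_byte, XMEGA_SRAM), "on XMega SRAM")
print("lds heavy:", cycles(lds_heavy, MEGA), "on Mega,",
      cycles(lds_heavy, XMEGA_SRAM), "on XMega SRAM")
print("IO read:  ", cycles(["ld"], MEGA), "on Mega,",
      cycles(["ld"], XMEGA_IO), "on XMega IO")
```

Pointer-based copies gain a cycle from the faster stores, while code that leans on LDS to reach SRAM variables gets slower than on the Mega.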
Last edited by Jubatian on Thu Jun 08, 2017 7:21 am, edited 1 time in total.
D3thAdd3r
Posts: 3175
Joined: Wed Apr 29, 2009 10:00 am
Location: Minneapolis, United States

Re: XMega-Uzebox ideas

Post by D3thAdd3r »

In the most common cases we at least gain a cycle, so probably something can be done with it, especially with lower bpp and mode 2 like stuff. It seems maybe a bit too close to the standard Uzebox for my taste, since ultimately I think free cycles are going to be an issue, as they always have been.

Maybe 2 ATmega1284s (overvolted from the start), where one is dedicated to sound and graphics (bit-banged NTSC without the AD725?): video modes are selected, drawing is requested, etc., and it just happens on the other side. A bit more like other retro machines which used parallel processing in this way.
Jubatian
Posts: 1560
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: XMega-Uzebox ideas

Post by Jubatian »

The big problem with the XMega is the 3-cycle LD variants, especially LDS, which prevents using the inline mixer as-is. For graphics modes you mostly do loads, so there will be no real difference. What would make a difference is the larger RAM, allowing for entirely RAM-based video modes that were previously impossible.

In an XMega-Uzebox aiming for source compatibility, I would most likely create a VSync mixer running within the video frame: the larger RAM allows this, and that's the fastest solution for mixing (it can be done so that it uses no VBlank time at all). This could be source-compatible with both inline and VSync mixer games using an unmodified kernel.

The Z-Berry topic made me think a bit about how viable an AVR-8 GPU would be (so the 2-microcontroller alternative). I don't think it would be very useful.

The problem is that the video processor has to run with precise timing, so it cannot serve interrupts coming from the other processor. It could only read stuff from the other processor if it was the master determining the timings of the communication. Likely the processor responsible for the main game logic wouldn't be able to use any other interrupt: if fast SPI is required (and it is, for viable bandwidth), the main game logic processor would have to respond with very low latency.

Maybe it is reasonably possible if it was made so that in VBlank the video processor "takes over", interrupting and driving the main processor to milk from it the information it needs (or that the main CPU wants to send). This also makes it necessary for the video processor to generate the sound. Probably a viable system, but not an Uzebox any more.

For emulation it would be a good system if the video processor had a fixed firmware: only one AVR would have to be emulated then (as the video / audio processor's logic could be implemented directly). This, however, probably very much demands the ATxmega384D3 with 32K SRAM to have room for uploaded graphics assets.

Visually it wouldn't be a blast: it wouldn't be capable of doing a lot more than what a "normal" single-processor system can accomplish. You could have more game logic and more RAM (due to the graphics not taking it), but that's about all it offers.
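
As a rough upper bound on how much the video processor could pull over SPI if it took over during VBlank, here is a back-of-the-envelope sketch. Every constant is an assumption for illustration: 262 lines per frame, 224 of them displayed, 1820 CPU cycles per line (Uzebox-like timing) and one byte per 16 cycles as with the USART SPI master discussed earlier in the thread. Sync generation and audio would eat into this in practice.

```python
# All constants are illustrative assumptions, not measurements.
LINES_PER_FRAME = 262    # NTSC non-interlaced frame
VISIBLE_LINES = 224      # hypothetical displayed height
CYCLES_PER_LINE = 1820   # CPU cycles per scanline (Uzebox-like clock)
CYCLES_PER_BYTE = 16     # USART SPI master pace from earlier in the thread
FRAMES_PER_SECOND = 60

def vblank_bytes(visible_lines=VISIBLE_LINES):
    """Upper bound on SPI bytes transferable during one VBlank."""
    vblank_lines = LINES_PER_FRAME - visible_lines
    return (vblank_lines * CYCLES_PER_LINE) // CYCLES_PER_BYTE

per_frame = vblank_bytes()
print(f"at most {per_frame} bytes per VBlank")
print(f"at most {per_frame * FRAMES_PER_SECOND} bytes per second")
```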