XMega-Uzebox ideas

Jubatian
Posts: 1560
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Post by Jubatian »

Just some ideas for what I think is a reasonable upgrade that retains as much compatibility as possible, for people who are OK with SMD parts and the XMega's 3.3V operating voltage.

(This is a different, more conservative idea than the XMegaBox idea I posted before)

As far as I can see, the majority of Uzebox games don't have custom timing-critical components, so compiled with a kernel that has adjusted video modes they should continue to work just fine on an XMega. Given this, I think it is quite reasonable to create an XMega version of the Uzebox that retains source compatibility.

The major constraints are the following then:
28.6MHz oscillator frequency (the XMega has a PLL, so it could possibly use a 3.58MHz NTSC colorburst crystal, which is more plentiful)
The palette must stay the same BBGGGRRR.
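
To make the palette constraint concrete, here is a minimal sketch of packing a color into one BBGGGRRR byte. The function name is made up for illustration; the bit layout (red in bits 0-2, green in bits 3-5, blue in bits 6-7) is simply what the BBGGGRRR notation spells out.

```python
def pack_bbgggrrr(r, g, b):
    """Pack 3-bit red, 3-bit green and 2-bit blue into one BBGGGRRR byte."""
    if not (0 <= r <= 7 and 0 <= g <= 7 and 0 <= b <= 3):
        raise ValueError("r and g must be 0..7, b must be 0..3")
    return (b << 6) | (g << 3) | r


def unpack_bbgggrrr(value):
    """Inverse of pack_bbgggrrr: returns (r, g, b)."""
    return value & 0x07, (value >> 3) & 0x07, (value >> 6) & 0x03


if __name__ == "__main__":
    print(hex(pack_bbgggrrr(7, 0, 0)))   # pure red
    print(hex(pack_bbgggrrr(7, 7, 3)))   # white
```

Keeping this byte layout is what lets existing tile and sprite data be reused without conversion.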

Possibilities for improvements:
Sync signals could be generated using the PWM modes of timers or the event system automatically toggling an appropriate pin (so the sync pin should be selected accordingly), which could free up lots of CPU power for the user and could make a wider variety of HSync tasks possible (no need to interleave sync toggles).
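
As a rough sketch of what a timer-generated HSync would need, the numbers below assume a CPU clock of 8x the NTSC colorburst, the standard 227.5 colorburst cycles per scanline, and a nominal 4.7 us sync pulse. The function name and the idea of a "TOP" and "compare" value are illustrative only and do not refer to specific XMega registers.

```python
COLORBURST_HZ = 315e6 / 88        # NTSC colorburst, about 3.579545 MHz
CPU_HZ = 8 * COLORBURST_HZ        # about 28.636 MHz
CYCLES_PER_LINE = 1820            # 227.5 colorburst cycles * 8


def hsync_timer_values(cpu_hz=CPU_HZ, pulse_us=4.7):
    """Return (period_top, compare) for a hypothetical PWM-driven sync.

    period_top: timer counts 0..period_top, one wrap per scanline.
    compare:    number of CPU cycles the sync pin stays low.
    """
    period_top = CYCLES_PER_LINE - 1
    compare = round(cpu_hz * pulse_us * 1e-6)
    return period_top, compare


if __name__ == "__main__":
    top, cmp_value = hsync_timer_values()
    print("line rate (Hz):", CPU_HZ / CYCLES_PER_LINE)
    print("timer top:", top, "sync pulse cycles:", cmp_value)
```

Once such a timer is running, the pin toggles with no CPU involvement, which is where the freed cycles would come from.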

Otherwise the benefits are that a high-end XMega has, for example, 256K ROM + 16K RAM, and there is no overclocking, so system stability should improve. There is also less risk of the overclocking-tolerant ATMegas being discontinued, and it might serve portables better (no need to drive the chip with increased voltage).

Of course the 16K RAM would be the biggest deal with this system, making lots of things possible which previously weren't (for example, you can easily have a full 8x8 tile tileset for a 4bpp mode in RAM, loaded from SD card), allowing the design of drastically different games with ease.

What I wouldn't do with the XMegas:
No DMA usage. It is very complex to emulate, and it is a very different system (from a graphics generation perspective).

If anyone is interested in this from the hardware end, I really think I would jump in porting the Uzebox kernel for such a design.
D3thAdd3r
Posts: 3175
Joined: Wed Apr 29, 2009 10:00 am
Location: Minneapolis, United States

Post by D3thAdd3r »

I pretty much agree with all statements here, especially source compatibility. With the PLL, as I understand it, newer games need not stick to ~28.6MHz, though it should be the default. Going up to 10*NTSC would allow some amazing things; how many times have you wished for just another cycle or 2?!

The ram is the biggest one, like you say, a complete tileset in ram coupled with huge amounts of flash. SD->ram tile sets, enough said.

I might disagree on DMA, even if I don't fully understand the details. When you say it is difficult to emulate, I imagine it is something like SPI complexity (is that totally correct in Uzem or CUzeBox?). I just think of the benefits, whatever they might be with my understanding. I would say a huge thing that would make it feel like the next generation would be more than 1 tile layer, or an approximation of it. Then you are looking at something more in line with an SNES, which makes sense as a target for such a machine.

That, and I see no reason for a '725 if it can be emulated with another chip. It is already surface mount, so the AD725 is not doing us favors there anyway. Maybe it is not possible. If it is possible, it allows whoever is crazy enough to go even further and have that chip do processing on the video data coming from the Uzebox (not default behavior of course). Then, most critically, the main CPU must be able to reprogram the "graphics processor" itself. Otherwise it retains default emulation that allows existing games to work.

Post by Jubatian »

D3thAdd3r wrote:for newer games there is no need to stick to ~28.6, though it should be the default.
True, but this could raise compatibility issues across Uzebox-XM units. Those which produce actual NTSC output would demand some chip which does this conversion, which needs an input clock derived from the colorburst. So if you choose to allow this, that would possibly complicate hardware design (you can't pass an input clock for the NTSC chip from the AVR). 28.6MHz isn't that much smaller than the max (32MHz), so no big loss (you would have 9 cycles where with the Uzebox you have 8), and if you don't (or can't for HW constraints) design to allow driving the CPU with higher voltages, overclocking could fail, so again, compatibility issues across units depending on game. Nasty, unpleasant thing.
D3thAdd3r wrote:DMA I might disagree ... I imagine it is something like SPI complexity(is that totally correct in Uzem or CUzebox?).
Yes, implementing it is SPI complexity (without the bus stalls). But running it is a massive impact on performance, and for the browser version of the emu we are still having problems dealing with the ordinary Uzebox. If you look at the SPI RAM demo with CUzeBox you would notice it running slower (when you turn off frame rate limiting) than most other things, Tempest is also slightly slower than other games. This is so since every 18 cycles there is a hardware event, the necessity to emulate an SPI transmission. If you start producing graphics with DMA at 4-5 cycles per pixel, that would be four times more frequent, causing emulation to perform around 30% worse, and this is just the inaccurate emulation not considering bus stalls. Emulating bus stalls caused by the DMA unit would need an entirely different emulator architecture, which is considerably slower (cycle level emulation versus instruction level emulation, currently we are doing instruction level emulation). Not emulating it would cause things running differently in the emu and the real machine.

DMA won't give you an extra tile layer. The current tight video modes of Uzebox wouldn't benefit from it at all unless somehow we understood the exact silicon level behavior of the DMA unit well enough to time memory accesses so the DMA doesn't stall instruction processing due to a bus conflict. In tight video modes the only thing the DMA would provide is that there are no OUT instructions, but somewhere there will be 1 cycle stalls when the DMA locks the bus for itself. DMA would only help line buffered concepts. There the benefit would be very notable, but it would demand different approaches, no longer doing cycle-precise coding, and it could even hinder generating a cycle-precise sync (due to bus conflicts messing up cycle counts, especially if you tried to use DMA outside the scanline loop for copies, where it may interfere for example with the trickier vsync generation), so this could prove to be problematic even on real hardware.

What will give a very notable improvement in the selection of viable video modes is the 16Kbytes of RAM, allowing the design of RAM only modes (no longer needing lpm instructions, nor ROM / RAM tile selects), which are capable of reaching higher resolutions.

A graphics processor could be nice for achieving specific goals (like driving an LCD panel as it appeared with CunningFellow's portable), but it shouldn't be a "standardized" part as it would again constrain design choices. The Uzebox-XM I think should also be a system which at its simplest build would work happily with an RGB output (like E-Uzebox) and an SNES controller.

Post by Jubatian »

I think it could be nice to list some things the Uzebox-XM could do, and how those could work.

Video modes are the most interesting part.

16 bit VRAM entries could be very useful for many games which have ROM / RAM to hold more than 256 tiles, having 16K of RAM makes this viable (it doesn't take away such a large fraction of the RAM to become impractical for anything also using RAM tiles).
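
A quick back-of-the-envelope check of the VRAM cost, assuming a 30x28 tile grid (240x224 pixels with 8x8 tiles, as in a typical Mode 3 setup) and comparing 4K of RAM against the XMega's 16K. The grid size and function names are assumptions made for this sketch.

```python
def vram_bytes(cols, rows, entry_bytes):
    """Size of the tile map in bytes."""
    return cols * rows * entry_bytes


def ram_share(nbytes, ram_total):
    """Percentage of total RAM taken by nbytes."""
    return 100.0 * nbytes / ram_total


if __name__ == "__main__":
    for entry_bytes in (1, 2):
        size = vram_bytes(30, 28, entry_bytes)
        print(entry_bytes * 8, "bit entries:", size, "bytes,",
              round(ram_share(size, 4096), 1), "% of 4K,",
              round(ram_share(size, 16384), 1), "% of 16K")
```

Under these assumptions a 16 bit VRAM takes roughly two fifths of a 4K machine but only about a tenth of 16K, which is why it becomes practical.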

Blitting becomes faster since the XMega does stores in 1 cycle. This affects how many sprites can be rendered in RAM tile modes without running out of VBlank.

Mode 3: the concept of this video mode remains as-is, just with a lot more RAM tiles (and possibly 16 bit VRAM entries), there are no real improvements to be made here.

Mode 13, 74 and other 4bpp ideas: 256 tiles for these modes fit in 8K of RAM, so it is quite reasonable to have such 4bpp modes which use only RAM tiles. Mode 13 is possible even at 5 cycles per pixel if using RAM tiles only, allowing the creation (or porting) of games which have near-square pixels. Of course so many RAM tiles offer plenty of possibilities for SD loaded data, opening up new genres (large RPG or adventure games).

Framebuffered modes: 16K is enough for an almost full screen 5 cycles / pixel 2bpp mode (or you may have a full screen C64 Multicolor style mode, like demonstrated in Mode 74 on a smaller region).
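
The memory figures for the RAM tile and framebuffer ideas can be checked with a short sketch. It assumes 1440 visible CPU cycles per scanline (240 pixels at 6 cycles each, as in Mode 3) and 224 visible lines; both values and the helper names are assumptions for illustration.

```python
def tileset_bytes(tiles, bpp, width=8, height=8):
    """RAM needed for a tileset of width x height pixel tiles."""
    return tiles * width * height * bpp // 8


def framebuffer_bytes(width, height, bpp):
    """RAM needed for a linear framebuffer."""
    return width * height * bpp // 8


VISIBLE_CYCLES = 1440   # assumption: 240 px * 6 cycles per pixel
VISIBLE_LINES = 224     # assumption: usual full height screen

if __name__ == "__main__":
    print("256 4bpp tiles:", tileset_bytes(256, 4), "bytes")
    width = VISIBLE_CYCLES // 5          # pixels per line at 5 cycles/pixel
    size = framebuffer_bytes(width, VISIBLE_LINES, 2)
    print(width, "x", VISIBLE_LINES, "2bpp framebuffer:", size, "bytes")
    print("left of 16K:", 16384 - size, "bytes")
```

With these assumptions a full screen 2bpp buffer leaves only a few hundred bytes for everything else, which matches the "almost full screen" wording.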

Mode 2 variants: Since stores are 1 cycle faster, higher resolution and more sprites are possible with these concepts. Using RAM tiles only for the background layer can also improve resolution, overall offering quite viable modes.

RLE modes: They become useful at full resolution, which may enable the creation of functional flat polygon 3D games or vector graphics concepts.

CunningFellow's 2bpp + SD underlay: The underlay can be replaced with a 1bpp framebuffer, which permits for example a Tempest with actual vector graphics (assuming there are cycles to draw the webs).

SD card usage

This notably improves with the extra RAM: there is space to load into. Large graphics & story oriented games are possible. It is perfectly possible to support fragmentation with so much RAM (you can have the parts you need loaded, or if you need fast access, you have enough RAM to store sector addresses).

Audio

VSync mixing would no longer be such a large problem (large in memory), which is beneficial for video modes which demand all the HSync (Mode 2 variants mostly). Larger memory and faster stores also allow for creating a faster VSync mixer (or one capable of doing more channels using the same amount of CPU resources).

Game logic

Just what more RAM implies. If you don't use it elsewhere, you might have a 100x100 map sitting in memory for example, or you may track many sprites at once. These are useful for strategy games, or anything where the player needs to interact with the map in some manner. It is of course also useful for any other genre if its maps were to be loaded from SD card.

Post by D3thAdd3r »

Jubatian wrote:28.6MHz isn't that much smaller than the max (32MHz), so no big loss (you would have 9 cycles where with the Uzebox you have 8), and if you don't (or can't for HW constraints) design to allow driving the CPU with higher voltages, overclocking could fail, so again, compatibility issues across units depending on game. Nasty, unpleasant thing.
10*3.579545 is only about 10% over the max 32 rating, and I believe in all things the max rating contains a safety margin of at least 10%. Testing could confirm it, and whether the EEPROM or flash can keep up, but at least I think it should be pushed to the edge and no further (probably not even Uzebox's aggressive 40% overclock). Eh, I am not stuck on a number I made up, but an extra "ld" per pixel is pretty attractive for next gen stuff.

Some years back the 1284 had more than a couple of people excited, but when it finally came out it simply wouldn't overclock reliably. Of course with a ~10% overvolt it works. By that time there was so much invested in the PCB kits, etc., that it didn't seem worth the redesign and market fragmentation for what that overvolt bought... well, I am a bit more radical than most so I would have followed the 1284, but I digress. If one designed it from the start with conservative overclocking + overvolting, I would bet $5 every one of them works, and for all time after there is that extra "ld". It seems worth a lot of consideration on this point anyway.
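
For reference, the clock options discussed in this exchange can be tabulated with a few lines. The 32MHz rating is the figure quoted in the thread; the function name is made up for this sketch.

```python
COLORBURST_MHZ = 315 / 88      # NTSC colorburst in MHz (about 3.579545)
XMEGA_MAX_MHZ = 32.0           # rated maximum as quoted in the thread


def clock_option(multiplier):
    """Return (clock in MHz, percent above or below the 32 MHz rating)."""
    mhz = multiplier * COLORBURST_MHZ
    over_percent = (mhz / XMEGA_MAX_MHZ - 1) * 100
    return mhz, over_percent


if __name__ == "__main__":
    for mult in (8, 9, 10):
        mhz, over = clock_option(mult)
        print(f"{mult}x colorburst = {mhz:.3f} MHz ({over:+.1f}% vs 32 MHz)")
```

By this arithmetic 10x the colorburst comes out closer to 12% above the rating than 10%, and 9x lands within 1% of it.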
Jubatian wrote:DMA won't give you an extra tile layer. The current tight video modes of Uzebox wouldn't benefit from it at all unless somehow we understood the exact silicon level behavior of the DMA unit well enough to time memory accesses so the DMA doesn't stall instruction processing due to a bus conflict. In tight video modes the only thing the DMA would provide is that there are no OUT instructions, but somewhere there will be 1 cycle stalls when the DMA locks the bus for itself. DMA would only help line buffered concepts. There the benefit would be very notable, but it would demand different approaches, no longer doing cycle-precise coding, and it could even hinder generating a cycle-precise sync (due to bus conflicts messing up cycle counts, especially if you tried to use DMA outside the scanline loop for copies, where it may interfere for example with the trickier vsync generation), so this could prove to be problematic even on real hardware.
DMA is a pretty grey area to me, but it makes sense that it must stall the bus on memory access like you say. In that case it does sound pretty horrible, especially if it is not predictable; it doesn't sound like too much fun to try and predict it either. Not much else to say on that, I am convinced on this aspect. Just the slow emulation would make adoption pretty difficult for most.

Post by D3thAdd3r »

D3thAdd3r wrote:10*3.579545 is only about 10% over the max 32 rating
I should clarify what I mean by this, as it relates to XM compatibility. I would stress starting at 28MHz and having some mechanism to change this in the "Mega Uzebox Kernel". I don't know of a reason it isn't practical, but perhaps one exists?
Jubatian wrote:Audio
Well, if we want better sound than Uzebox, I don't think the biggest problem is more channels. If you want SNES sound (I do!) you need SNES methods, which basically means you cannot use the convenient 256 byte samples. They simply have to be larger PCM versions of a realistic instrument sound, and that means it takes more cycles and flash (we have flash). If it is overclocked, that is OK. If one will drop a measly 256 bytes out of 16K for a vsync mixer, that is OK too. But the vsync mixer was not just about RAM, as Mode 13 or 74 runs out of cycles before RAM tiles if you want a full height screen. I think we were already slightly more cycle limited than RAM limited in the current Uzebox; just my opinion. Like you say, of course it is a complex subject, since more RAM can mean faster everything by moving away from "lpm".

Extra clocks could make a Mode 3 with gluttonous RAM tiles and full 8bpp color possible... with 16 bit VRAM. That seems like the end-all to me. All that typing and all I have really said is: more cycles could be very important here. What you point out in the video modes, though, is pretty inspiring even at 28MHz.

Post by Jubatian »

Cycles... Depends on what you are doing.

Audio: A PCM channel takes a little less than 2 waves, but its processing is very much determined by the tons of loads. I referred to faster Vsync mixing by more RAM with this: if you can throw about a kilobyte at it, you can process every channel individually into a 16 bit mix buffer, and only convert that to 8 bit at the end. This wipes out all the loads from the individual channel mixing, plus the XMega has 1cy stores. So doing PCM in a Vsync mixer on the XMega can be a whole lot faster than what you can reasonably accomplish on the Uzebox (no room for a 16 bit mix buffer, the 2cy stores also kill some of its benefit).
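
One way to read the 16 bit mix buffer idea is sketched below: each channel is processed over the whole frame in turn, accumulating into a wide buffer, and the result is clamped to 8 bits only once at the end. The frame length of 262 samples (one per scanline) and all names are assumptions; this illustrates the loop structure and is not the kernel's actual mixer.

```python
SAMPLES_PER_FRAME = 262   # assumption: one audio sample per scanline


def mix_frame(channels):
    """Mix channels into one frame of unsigned 8-bit output samples.

    channels: list of (samples, volume) pairs, where samples is a list of
    signed 8-bit values (looped if shorter than a frame) and volume is 0..255.
    """
    mix = [0] * SAMPLES_PER_FRAME            # stands in for the 16-bit buffer
    for samples, volume in channels:         # one channel at a time
        for i in range(SAMPLES_PER_FRAME):
            mix[i] += (samples[i % len(samples)] * volume) >> 8
    out = []
    for acc in mix:                          # single conversion pass
        clamped = max(-128, min(127, acc))   # saturate to signed 8 bit
        out.append(clamped + 128)            # unsigned value for the PWM
    return out


if __name__ == "__main__":
    square = [100] * 16 + [-100] * 16
    frame = mix_frame([(square, 255), (square, 128)])
    print(frame[:4], frame[16:20])
    print("mix buffer size:", SAMPLES_PER_FRAME * 2, "bytes")
```

Because saturation happens only in the final pass, the inner per-channel loop is just a multiply, shift and accumulate.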

Video: I just realized that on the XMega I could do that "almighty 7cy/pixel Mode 2 with in-scanline calculated sprites" at 5 cycles per pixel. That is pretty high resolution, and this mode has no video related VSync tasks, so if the XMega Uzebox had it, you would have something quite different in your toolbox than with the normal Uzebox (well, if I completed it for the normal Uzebox, that would also be a thing... RPG game plan ongoing).

In Flight of a Dragon I went for 30FPS, but still, I think it demonstrates that it is possible to do stuff with this many cycles. I traded off cycles for RAM everywhere in that game... Even the map coordinates of sprites are compressed, and the thing depacks and repacks them at every usage. The biggest cycle eaters, though, are of course the sprites: blitting that many sprites. There is no way to improve the code there (Mode 74's blitter is I think quite well optimized), and it needs Mode 74's ld-swap-ld concept (with Mode 13 it wouldn't be possible to do what this game does).

Without overclocking, the way to add cycles is to take them away from the kernel.

One thing which will work for this is that pushes take 1 cycle on the XMega, so every hsync interrupt will take fewer cycles (they add up).
Another is if we pay attention to the HW design and build it so that the sync signal can be generated by PWM, which removes the need for syncing interrupts to the timer on non-graphics lines, and of course there are no sync pin toggles in the mixer. This also frees up a good amount of cycles for user code in the VBlank.
Yet another, again relying on the PWM driven sync, is to use a VSync mixer with a RAM tiled video mode and, instead of running a graphics frame from start to end, return to the user program on every HSync. The Vsync mixer is a lot faster than the aggregate of the inline mixer calls, and as a bonus, you also get the capability to write "race with the beam" code with such a video mode to change palette or mode features. If you don't want that, you may either have the Vsync mixer running there (which is not as silly as it sounds: if it finishes processing the frame within the display area, you get all the vblank time, which you wouldn't with a Hsync mixer), or process game logic (non-graphics).

Also keep in mind that overclocking also affects the emulator: if you crank the CPU's clock up by 20%, you need 20% more processing power to emulate it as well.

Post by D3thAdd3r »

Jubatian wrote:One thing which will work for this is that pushes take 1 cycle on the XMega, so every hsync interrupt will take fewer cycles (they add up).
Another is if we pay attention to the HW design and build it so that the sync signal can be generated by PWM, which removes the need for syncing interrupts to the timer on non-graphics lines, and of course there are no sync pin toggles in the mixer. This also frees up a good amount of cycles for user code in the VBlank.
Do you have a rough estimate of how many cycles this might be? Am I wrong to think it is upwards of 40,000 cycles gained? In that case it blits many extra RAM tiles past the 64 RAM tiles which is really the maximum the Uzebox could possibly do, even with a pretty short screen and optimized game code.

If you come up with a spec and want to build it, I will follow and try it out to verify and try some experiments. I would particularly like to see what else can be done with audio, while retaining the core ideas for video and minimal extra parts. I don't know exactly what it would be, but possibly a simple capacitor or 2 grounded to toggling pins could do something bigger on filtering. Can we actually do that PWM faster than ~15kHz with a ~28MHz clock? Also, is there a way we can have 2 PWM pins free for audio?
Jubatian wrote:"almighty 7cy/pixel Mode 2 with in-scanline calculated sprites"
How many sprites was this one again? It seems like you could calculate so many lines in advance that it might be past the tipping point where mode 2 like beats out mode 3 like.

Post by Jubatian »

D3thAdd3r wrote:Do you have a rough estimate of how many cycles this might be?
The current master kernel burns away some 70 cycles for syncing; in my most extreme kernel hack, used in FoaD, it's 11 cycles. In the original kernel's inline mixer you have 10 cycles for sync toggles; in the FoaD kernel it is 5 cycles. So for normal HSync lines you would get 80 extra cycles to use with the original kernel, 16 with FoaD's. In the VBlank there are many more cycles lost to syncing, of which a substantial amount could also be recovered by PWM. Anyway, with the original kernel, assuming 37 normal HSync lines for VBlank (224 lines tall screen, +1 line for video mode entry), it gets you 3000 extra cycles (5% improvement).

The 1 cycle stores by the original kernel would get you some 20 extra cycles (pushes, various stores in sync management and the mixer), for 37 HSync lines, ~750 cycles.

The most extreme gain would be running a VSync mixer within the Hsync time of the display frame. A Vsync mixer takes around 10000 - 20000 cycles depending on features. 100 cycles per hsync could easily become available assuming the same ~200 cycles reserved for it in ordinary video modes (a conservative estimate assuming pushing & popping most registers on IT entry / return and a FoaD style sync to timer). This means it is pretty much doable. A 4 channel inline mixer in the current kernel takes 167 cycles, and the Vsync mixer's sound byte output would take some thirty, so let's just assume 130 cycles gained. That's ~5000 extra VBlank cycles for 37 lines.

So in total you would get about 8500 extra cycles with these for a minimal-height VBlank of 37 lines, about a 13% improvement (about the same as if you ran the CPU at its max frequency instead of 28.6MHz), while you likely also get much better audio capabilities at the same time (using up all the available in-frame HSync for the mixer) or extra non-vblank cycles to do game logic.
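
The estimate above can be reproduced with a short tally. The per-line gains are the figures from this post; the 1820 CPU cycles per scanline (28.636MHz divided by the NTSC line rate) is an assumption added here to turn the totals into percentages.

```python
CYCLES_PER_LINE = 1820    # assumption: CPU cycles per NTSC scanline
VBLANK_LINES = 37         # minimal VBlank height used in the post

# Per-line gains quoted in the post (original kernel as the baseline).
GAINS_PER_LINE = {
    "pwm_sync": 80,           # no timer sync, no sync toggles in the mixer
    "one_cycle_stores": 20,   # faster pushes and stores
    "vsync_mixer": 130,       # inline mixer replaced by sound byte output
}


def vblank_gain(per_line, lines=VBLANK_LINES):
    """Cycles recovered over the whole VBlank for a per-line gain."""
    return per_line * lines


if __name__ == "__main__":
    vblank_cycles = CYCLES_PER_LINE * VBLANK_LINES
    total = 0
    for name, per_line in GAINS_PER_LINE.items():
        gain = vblank_gain(per_line)
        total += gain
        print(f"{name}: {gain} cycles")
    print(f"total: {total} cycles, "
          f"{100 * total / vblank_cycles:.1f}% of a {vblank_cycles} cycle VBlank")
```

The exact products are a little below the rounded figures in the text, but the total lands in the same 12 to 13% range.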

PWM doesn't run at 15KHz, it runs faster (Didn't check, but I would say 111 KHz: 28.6MHz divided by 256 as the 8 bit timer generating it counts). You only update it at 15KHz. A characteristic part of the Uzebox audio experience is a 60Hz aliasing noise on the envelopes which you may notice in some games. In FoaD this is not present (as I use a more costly method to interpolate the envelope). 2 PWM pins would demand reserving 2 (8 bit) timers for the audio. For me it looks like the XMega has plenty of timers to get this, and it isn't very costly to produce mono sound on a stereo interface (just write both ports), otherwise if you want stereo, you will of course need to throw more RAM and CPU to it.
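
A small sketch of the two rates involved, assuming the CPU clock is 8x the NTSC colorburst, the PWM timer is a free-running 8 bit counter, and the duty cycle is updated once per scanline (1820 CPU cycles). Names are illustrative.

```python
CPU_HZ = 8 * 315e6 / 88       # about 28.636 MHz
TIMER_STEPS = 256             # 8-bit timer period
CYCLES_PER_LINE = 1820        # assumption: one sample update per scanline


def pwm_carrier_hz(cpu_hz=CPU_HZ, steps=TIMER_STEPS):
    """Frequency of the PWM waveform itself."""
    return cpu_hz / steps


def sample_rate_hz(cpu_hz=CPU_HZ, cycles_per_line=CYCLES_PER_LINE):
    """How often the PWM duty cycle is updated."""
    return cpu_hz / cycles_per_line


if __name__ == "__main__":
    print("PWM carrier (kHz):", pwm_carrier_hz() / 1000)
    print("sample rate (kHz):", sample_rate_hz() / 1000)
    print("PWM periods per sample:", CYCLES_PER_LINE / TIMER_STEPS)
```

So each audio sample is held for roughly seven PWM periods before the next update.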
D3thAdd3r wrote:How many sprites was this one again?
The sprite code I have now can render 8 pixels wide 3 color + transparency sprites from RAM using 71 cycles. This is for a pair of arbitrary height multiplexed sprites (so when they are on the same scanline, only one shows). You can have as many of these as fit in the available HSync (and unlike Mode 2 they can overlap). Other sizes: 12 pixels: 91 cycles, 16 pixels: 111 cycles. These are with all features utilized, so every color of every individual sprite can be set and they may be X mirrored. If you wanted small sprites (bullets for example) with no mirroring / new colors, that could be done at 42 cycles per 4 pixels wide sprite. When this mode gets to have a functional demo, I will post it.
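
The three full-feature costs quoted above happen to fit a straight line (31 cycles plus 5 per pixel of width), which makes it easy to play with budgets. The HSync budget used in the example is purely hypothetical, and the linear fit is an observation about the quoted numbers, not a statement about how the code is structured.

```python
# Cycle costs quoted in the post for full-feature sprite slots.
FULL_FEATURE_COST = {8: 71, 12: 91, 16: 111}
SMALL_SPRITE_COST = 42        # 4 px wide, no mirroring or recoloring


def fitted_cost(width):
    """Linear fit through the three quoted full-feature costs."""
    return 31 + 5 * width


def slots_that_fit(hsync_budget, width=8):
    """How many sprite slots of a given width fit in a cycle budget."""
    return hsync_budget // FULL_FEATURE_COST[width]


if __name__ == "__main__":
    budget = 300   # hypothetical number of HSync cycles left for sprites
    for width in sorted(FULL_FEATURE_COST):
        print(width, "px:", slots_that_fit(budget, width), "slots")
    print("4 px bullets:", budget // SMALL_SPRITE_COST)
```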

There is a problem with the XMega, though, which I didn't notice before. Until now I recalled only the decrementing loads / stores being slower by 1 cycle (so 3 cy for load, 2 cy for store). However, the STS and LDS instructions are also slower, which can affect certain types of code. Not a showstopper, but for example the inline mixer would become impractical with it. These instructions take 3 cycles (LDS) and 2 cycles (STS) instead of 2 on the ATmega (I never understood how that was possible for LDS, as the instruction is 2 words: fetch word 1, fetch word 2, magically perform a load in 1 cycle??? In every other LD type instruction it is 2 cycles). LDD is also 3 cycles.
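
To see how these differences net out for a given code fragment, a tiny cycle counter can help. The table holds only the instructions and cycle counts mentioned in this thread (with 2 cycles assumed for ATmega PUSH and ST); it should be checked against the datasheets before relying on it.

```python
# Cycle counts as discussed in the thread; verify against the datasheets.
CYCLES = {
    "atmega": {"lds": 2, "sts": 2, "ldd": 2, "st": 2, "push": 2},
    "xmega":  {"lds": 3, "sts": 2, "ldd": 3, "st": 1, "push": 1},
}


def count_cycles(instructions, cpu):
    """Sum the cycle cost of a list of mnemonics on the given CPU."""
    table = CYCLES[cpu]
    return sum(table[name] for name in instructions)


if __name__ == "__main__":
    # Hypothetical fragments: an interrupt prologue and a mixer-like body.
    prologue = ["push"] * 6
    mixer_body = ["lds", "lds", "ldd", "sts"]
    for name, code in (("prologue", prologue), ("mixer body", mixer_body)):
        print(name, "ATmega:", count_cycles(code, "atmega"),
              "XMega:", count_cycles(code, "xmega"))
```

Push-heavy code gets cheaper while LDS / LDD-heavy code gets more expensive, which is the trade-off described above.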

Post by Jubatian »

Just some extra for the XMega I found.

The XMega has 16 GPIO registers, all in the bit check range, which of course are accessible in 1 cycle for either loading or storing. These could be useful to get around the problem of slower loads (LDS and LDD instructions). Also they don't have their port I/O registers in this range, rather they support mapping them here to 4 port groups (virtual ports), but this just means it is still possible to produce pixel output all right.

I like the overall architecture of these micros; it seems like they cleaned them up nicely (I just wish LDD was a 2 cycle instruction, which would solve a lot of the problems caused by the slower LDS...).

Another nasty thing, though, which applies to both the Mega and the XMega, is that above 128Kbytes of ROM the Program Counter receives an extra byte, which means an extra cycle in all calls and returns, destroying some tight code. With 128K ROM you can only get 8K of RAM.
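
The 128K boundary follows from AVR flash being addressed in 16 bit words, so a 16 bit Program Counter covers 64K words. A minimal sketch (the function name is made up):

```python
def pc_bytes(flash_bytes):
    """Bytes of Program Counter pushed on a call for a given flash size."""
    words = flash_bytes // 2                 # flash is addressed in 16-bit words
    bits = max(1, (words - 1).bit_length())  # bits needed for the top address
    return 2 if bits <= 16 else 3


if __name__ == "__main__":
    for kbytes in (64, 128, 256):
        print(kbytes, "K flash:", pc_bytes(kbytes * 1024), "byte PC")
```

Anything up to and including 128K keeps the 2 byte return address; the 256K parts need 3 bytes.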