
SPI Ram Video Mode

Posted: Wed Aug 09, 2017 3:05 am
by D3thAdd3r
I have not gotten anywhere with it yet, but I started on a new video mode which I think will hit the right trade-offs for a lot of the things Mode 3 is currently being used for. The mode will be based on Mode 3, and the simplest explanation is that it uses "in/out" (SPI ram) instead of "ld" ('644 ram) for everything related to vram. Mainly, this video mode will give ~10 extra ram tiles over Mode 3, so at the extremes, with a short screen, it could perhaps blit sprites onto close to 50 ram tiles, with 200+ flash tiles, 256 colors, and the fastest possible blitting. I believe this will be higher sprite output than any palette mode can achieve, due to cycles. This is half the purpose of the mode: it sacrifices the space savings of palette modes for this gain.

The other purpose of this mode is to eliminate the need to scroll data into vram, which can be complicated and computationally expensive. This could be useful even for non-scrolling games, because you could have multiple screens prepared (loaded from SD perhaps) and shift the window onto the pre-rendered screens with low overhead. Doing all screen loads in 1 shot can be more efficient. It is a bigger deal for scrolling games, since you can load a very large level into SPI vram at once and then just move the window as the game camera travels through the game world, using the cycles that would normally have gone to decompressing map data into vram for blitting sprites instead. I see this first and foremost as something that makes development of good platformers easier.
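A minimal sketch of the "move the window" idea, assuming the whole level sits in SPI ram as one linear array of 1-byte tile indices. The level and screen sizes, the base address parameter and the function name are all made up for illustration; none of this is existing kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizes in tiles; not taken from the Uzebox kernel. */
#define LEVEL_W_TILES  256u
#define LEVEL_H_TILES   64u
#define SCREEN_W_TILES  30u
#define SCREEN_H_TILES  26u

/* Address in SPI ram of the first tile index of one visible tile row,
   for a camera position given in tiles. Assumes 1 byte per tile index
   and a level stored row by row (linear, no wrap around). */
static uint32_t window_row_addr(uint32_t vram_base, uint16_t cam_x,
                                uint16_t cam_y, uint8_t row)
{
    assert(cam_x + SCREEN_W_TILES <= LEVEL_W_TILES); /* window inside level */
    assert(cam_y + SCREEN_H_TILES <= LEVEL_H_TILES);
    assert(row < SCREEN_H_TILES);
    return vram_base + ((uint32_t)cam_y + row) * LEVEL_W_TILES + cam_x;
}
```

Scrolling then only changes cam_x/cam_y; nothing is copied. Each tile row would start a sequential read at the returned address, which is why the layout has to be linear.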

There is another advantage of sorts, which is tied to a disadvantage. There is no need for interleaved vram like the current Mode 3 with scrolling, so vram will be, and necessarily must be, linear like Mode 3 without scrolling. It must be linear because it is too slow to seek around SPI ram. There is then no "wrap around" effect like M3 scrolling, but that is unnecessary, as M3 only does it so you don't have to redraw vram every frame it scrolls by a tile. The idea here would be to draw the entire level beforehand and scroll nothing in afterwards. The only updates most games should need to make are for animating/removing items like coins, pickups, etc. Large numbers of animated items could be handled by updating ram tiles instead of vram values. Because random access is slower, this probably does not work for games that require large and frequent updates of vram (lots of background stuff that is not static and not practical to do with the extra ram tiles). For platformers like Super Mario Bros., I do not believe there is anything that could beat this for general cases. Bold statement, I know!

The issue with using SPI ram in the interrupt, outside of user control, arises when user code is fiddling with SPI ram and the kernel needs it. The user code could then get the SPI ram "stolen" before it is able to finish what it was doing. It would not even know it happened unless there was a signaling system to repeat the last action. To eliminate that, user code should in general not access the SPI ram directly. Instead, there would be 2 statically sized buffers: the first holds data to be pushed to vram, and the second receives data read back from SPI ram. Sacrificing a sound channel if necessary, the kernel will run a state machine during HSYNC that writes the values the user program requests to the locations it specifies, and fills the second buffer from the requested read locations. Reads are likely too slow for collision detection or similar things, so that data probably needs to be stored compressed in '644 ram. The read buffer would probably be most useful for streaming stored song data (originally from SD) into a '644 side ram buffer. For this purpose, the process might be done directly in the state machine, so that user code never needs to think about the song buffer (or wait for all that to happen a scanline at a time). That is part of my thinking for efficiently negating the space cost of 8bpp as it relates to a NES style platform game, which would generally have ~10K of song data...about what palette modes might save on graphics for the same game (minus the color limits, or the development burden that comes when palettes are thrown into the mix). In reality, though, it ends up being much more than just "in/out" vs "ld" to support all that.
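To make the write buffer idea concrete, here is a rough host-side sketch, assuming a small ring of (address, value) requests that the kernel drains at one request per HSYNC. The names, the queue depth and the simulated SPI ram array are all hypothetical; the real thing would be asm in the kernel and would have to fit the SPI transfer into the HSYNC cycle budget.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WRQ_SIZE      16u   /* hypothetical depth; must be a power of two */
#define SIM_RAM_BYTES 256u  /* tiny stand-in for the SPI ram, host testing only */

typedef struct { uint32_t addr; uint8_t value; } WriteReq;

typedef struct {
    WriteReq q[WRQ_SIZE];
    uint8_t head;           /* next slot user code fills   */
    uint8_t tail;           /* next slot the kernel drains */
} WriteQueue;

static uint8_t sim_spi_ram[SIM_RAM_BYTES];

/* Stand-in for the real SPI write the kernel would clock out. */
static void sim_spi_write(uint32_t addr, uint8_t value)
{
    assert(addr < SIM_RAM_BYTES);
    sim_spi_ram[addr] = value;
}

/* User side: request one byte write. Returns false when the queue is full. */
static bool wrq_push(WriteQueue *w, uint32_t addr, uint8_t value)
{
    uint8_t next = (uint8_t)((w->head + 1u) & (WRQ_SIZE - 1u));
    if (next == w->tail) {
        return false;       /* full: caller retries on a later frame */
    }
    w->q[w->head].addr = addr;
    w->q[w->head].value = value;
    w->head = next;
    return true;
}

/* Kernel side: call once per HSYNC; performs at most one queued write. */
static bool wrq_hsync_step(WriteQueue *w)
{
    if (w->tail == w->head) {
        return false;       /* nothing pending */
    }
    sim_spi_write(w->q[w->tail].addr, w->q[w->tail].value);
    w->tail = (uint8_t)((w->tail + 1u) & (WRQ_SIZE - 1u));
    return true;
}
```

User code only ever touches head and the kernel only ever touches tail, so neither side can "steal" a half finished SPI access from the other. In an actual interrupt driven version the indices would need to be volatile.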

For "one shot" loads the video mode would support a totally black screen where it does not read SPI ram, so the user code can directly access SPI ram for speed. FadeOut(X,true), set mode black, load one shot stuff, unset mode black, FadeIn(X,false), basically.

Anyone have something to poke a hole in this theory, or some feature ideas?

Re: SPI Ram Video Mode

Posted: Wed Aug 09, 2017 6:11 am
by Jubatian
D3thAdd3r wrote:
Wed Aug 09, 2017 3:05 am
Anyone have something to poke a hole in this theory, or some feature ideas?
I thought about this, but the problem is that it is not as simple as it seems. You seem to be forgetting how RAM tiles work here: when you need to blit a sprite fraction, you have to read the VRAM to see what is at the targeted location, and then, unless it is already a RAM tile (because a previous sprite fraction blit ended up there), you have to replace it. So for every sprite fraction you normally get 2 x ~100 clocks of SPI RAM byte access, which is significant. A full 8x8 Mode 3 sprite blit should take somewhere around 1200 - 1400 clocks (judging by the assembly code) without the RAM tile allocation, and that cannot be fully interleaved with those SPI accesses (800 clocks for the 4 fractions). Only the write of the RAM tile's index might interleave (400 clocks), but even that would be difficult (there isn't any stable unconditional code stream there to interleave with).

You would also have to consider how the RAM tile management works so if the blitting can't finish before the next frame, you wouldn't end up with a broken VRAM due to conflicting SPI accesses.

Not that the mode wouldn't be useful, the large scrollable surface could be tempting, but it doesn't appear to be as great a deal as it might seem at first.

(Mode 74 and the not yet merged Mode 748 have a different approach to using SPI RAM: they use it as a sprite source, where I found a way to interleave the SPI accesses rather efficiently so blitting is still reasonably fast. That approach supports rich animation, as sprite images can come from the SPI RAM. This feature would also be present in Mode 64, as it is a property of the 4bpp sprite blitter, not the mode itself).

What might be a fix is if the RAM tile allocation method was changed, that is it wouldn't replace the actual tiles any more, rather add lists of replacements for every row. This of course would demand a different scanline logic too (picking the tile index either from the VRAM or the replace list's next element). This way in VBlank you wouldn't have to access VRAM at all to add sprites (and of course such a mode would also give you 256 ROM tiles).

EDIT: This "fix" is doable, at least it is possible to build a functional scanline loop for it. It might seem to supersede Mode 3 if done, but such a new mode is not capable of supporting user RAM tiles (as RAM tiles only occur through a replacement list in it, the normal VRAM only has ROM tiles).
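As an illustration of the replacement list logic only (not the actual scanline loop, which would be cycle counted asm), here is a sketch in C. It assumes each row carries a list of (column, RAM tile) pairs sorted by column; the struct and function names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One replacement: "at this column, show this RAM tile instead". */
typedef struct { uint8_t column; uint8_t ram_tile; } Replacement;

/* What the scanline code ends up drawing for one tile position. */
typedef struct { bool is_ram; uint8_t index; } TileRef;

/* Resolve one tile row. vram_row holds only ROM tile indices and is never
   modified; repl must be sorted by ascending column with no duplicates. */
static void resolve_row(const uint8_t *vram_row, uint8_t width,
                        const Replacement *repl, uint8_t repl_count,
                        TileRef *out)
{
    uint8_t next = 0;                       /* next unconsumed replacement */
    for (uint8_t col = 0; col < width; col++) {
        if (next < repl_count && repl[next].column == col) {
            out[col].is_ram = true;         /* take it from the list */
            out[col].index = repl[next].ram_tile;
            next++;
        } else {
            out[col].is_ram = false;        /* take it from the VRAM */
            out[col].index = vram_row[col];
        }
    }
    assert(next == repl_count);             /* list was sorted and in range */
}
```

Because the VRAM is only read sequentially here and never written by the sprite code, all 256 index values remain free for ROM tiles, and there is nothing to restore after the frame.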

Re: SPI Ram Video Mode

Posted: Wed Aug 09, 2017 7:14 pm
by D3thAdd3r
That is definitely a hard part, and one of the "support details" that I have not thought out very far. I had some vague notion that the kernel itself could queue commands to happen in hsync, which would hopefully save 100 cycles for the ram tile index write in the blitter. With that, initially...I was thinking to just eat the ~5000 cycle loss (for the read before blit) in the hope that it can interleave with something later. The bad thing is having to eat yet more cycles for ram_tiles_restore, but I thought possibly that could also happen in hsync. I think it is not hard to check the vsync flag, so at least blits can bail out if a frame is missed.

For the blitter, I am almost inclined to unroll all offsets of X and do a jump inside if possible. With each offset on a different code path, there might at least be a prayer of interleaving a bit. Now I think the list of ram tiles, sorted in the order they are reached by the scanline, is drastically better than wasting all those cycles. It is also very nice that it eliminates ram_tiles_restore. I still think there is a way to have user ram tiles with this, since that logic is done in ProcessSprites, merely by not allowing those ram_tiles to be used by the blitter.

Of course none of that is simple at all, and it is a lot of work. With the alternate ram tile allocation, I feel more confident this will work to actually blit some ram tiles. If there is no way to blit 50 then I would be less interested in pursuing it, though 1 big vram with an arbitrary window onto it would still make certain things much easier. Even without using SD, doing non-real time decompression in one shot would then be possible, perhaps keeping some games entirely in flash.

Re: SPI Ram Video Mode

Posted: Wed Aug 16, 2017 5:14 am
by D3thAdd3r
Jubatian wrote:EDIT: This "fix" is doable, at least it is possible to build a functional scanline loop for it. It might seem superseding Mode 3 if done, but such a new mode is not capable to support user RAM tiles (as RAM tiles only occur through a replacement list in it, the normal VRAM only has ROM tiles).
So, with a list sorted from left to right, top to bottom, matching the linear order in which vram indices will be encountered in the scanline code, it can be fed as you said from a pre-sorted list of ram tiles. That negates the need for the random lookups into SPI ram, which is critical. Still, having thought on it a bit more, I am a little worried. Even though the sprite code would not need to read SPI ram to see if there is already a ram tile there, or to set the index to a ram tile, it still must know whether or not there is a ram tile at that location. Otherwise, it would be possible for overlapping sprites to generate 2 separate ram tile entries in the list, since each was unaware of the other. That is not going to work well for the scanline renderer.

It seems there is no clever post sorting method that will work, since if 2 ram tiles are blit that should have been 1, the loss is already there even if it was corrected so that it displayed right (dropping the later ram tile). The obvious solution is, each time a ram tile is needed, to search through the entire list to see if there is already a ram tile with those coordinates, and to use that one (blit over it) if there is. As I see it, the sorts must be done before the entire data set is known, which I think limits how clever it can be versus a brute force linear search. I guess this was related to you saying lists for each row? Either way this seems potentially slow, though perhaps still faster than waiting on SPI ram, and even some crude interleaving might bring that back into consideration (assuming sorting a list is unavoidably slow...I would prefer to have 256 flash tiles anyway).
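The brute force version of that search could look something like the sketch below, assuming one global list of tile coordinates where the list position doubles as the ram tile index. The names and the 50 tile limit are only placeholders.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RAM_TILES 50u     /* placeholder limit */
#define NO_RAM_TILE   0xFFu   /* returned when the list is full */

typedef struct { uint8_t tx; uint8_t ty; } TilePos;

typedef struct {
    TilePos pos[MAX_RAM_TILES]; /* pos[i] = screen tile covered by ram tile i */
    uint8_t used;
} RamTileList;

/* Return the ram tile already covering (tx, ty), or allocate a new one.
   Overlapping sprites therefore share one ram tile instead of creating
   two entries for the same location. */
static uint8_t find_or_alloc(RamTileList *l, uint8_t tx, uint8_t ty)
{
    assert(l->used <= MAX_RAM_TILES);
    for (uint8_t i = 0; i < l->used; i++) {     /* linear search */
        if (l->pos[i].tx == tx && l->pos[i].ty == ty) {
            return i;                           /* blit over the existing tile */
        }
    }
    if (l->used == MAX_RAM_TILES) {
        return NO_RAM_TILE;                     /* out of ram tiles */
    }
    l->pos[l->used].tx = tx;
    l->pos[l->used].ty = ty;
    return l->used++;
}
```

The worst case is a scan of the whole list for every sprite fraction. Splitting the list per tile row would shrink each scan to the handful of entries on that row.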

Just random thoughts, but I still am very interested in this idea and believe there has to be a solution to all the issues that would need to be dealt with along the way to a working implementation.

Re: SPI Ram Video Mode

Posted: Wed Aug 16, 2017 5:36 am
by D3thAdd3r
Ok, maybe I understand now that this really would require fixed size per-row lists, which I am assuming was your original comment. The entries could be very tiny: assuming there are enough cycles to multiply somewhere in there, just 1 byte for the ram tile index, with each tile row having its own fixed size list. If there are no cycles for that, then 2 bytes to pre-multiply the address of the ram tile. Even at 2 bytes it is not that bad if some sane max value were used. I guess even if there were only a few extra ram tiles, just the sheer scrolling ability along with 256 tiles would be something, but I don't think it is that bleak.

Something like 10 ram tiles per tile row (assuming we can slip a multiply in for 1 byte entries) would be reasonable, and assuming a 26 tile high screen, that still leaves 832-260=572 extra bytes for ram tiles, so about 9 ram tiles over M3. There is also no need for ram tiles restore, which gives 150 bytes back assuming 50 ram tiles. So it seems the ram is there to have fast fixed lists. Perhaps even 12 ram tiles max per row still allows a lot of sprites, and might have a practical limit close to the NES 8 hardware sprites per scanline...but with "infinite" storage for the world. 50 scans through ~12 entries does not seem impossible, and at least to me seems about the amount of cycles it takes to decompress level data for the M3 type of scrolling; so somewhere near parity to negate that issue. Past that I can't think of anything that automatically makes this idea impossible, though it is crap tons of work before it would ever fly, for sure.
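The arithmetic above can be checked with a couple of helper functions. This is only a sketch: the 832 bytes is read here as a 32x26 internal vram at 1 byte per tile (an inference from the numbers, not stated above), and 64 bytes per ram tile assumes 8x8 pixels at 8bpp.

```c
#include <assert.h>
#include <stdint.h>

#define RAM_TILE_BYTES 64u  /* 8x8 pixels at 8bpp */

/* RAM used by fixed size per-row lists. */
static uint16_t row_list_bytes(uint8_t rows, uint8_t entries_per_row,
                               uint8_t bytes_per_entry)
{
    return (uint16_t)((uint16_t)rows * entries_per_row * bytes_per_entry);
}

/* Internal RAM freed when the vram moves to SPI ram but the lists stay. */
static uint16_t freed_bytes(uint16_t internal_vram_bytes, uint16_t list_bytes)
{
    assert(list_bytes <= internal_vram_bytes);
    return (uint16_t)(internal_vram_bytes - list_bytes);
}
```

With 26 rows of 10 one-byte entries the lists take 260 bytes and free 572, which is 8 whole ram tiles plus 60 bytes (8.9 tiles). Adding the 150 bytes from ram tiles restore gives 722 bytes, or a bit over 11 tiles. With 2 byte entries the lists grow to 520 bytes and only 312 bytes (under 5 tiles) are freed before counting the restore buffer.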

Re: SPI Ram Video Mode

Posted: Wed Aug 16, 2017 8:06 am
by Jubatian
What I did an experimental sketch for had 3 bytes per entry, and a variable amount of RAM tiles per row (with a fixed max. total, deducted from the requested RAM tile count). Row entry indices were used to address the first RAM tile of the row (if any), so in total 32 + RamTileCount * 3 bytes was the RAM budget (the entries held the following: RAM tile index, next RAM tile location on row, next entry index in list).
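A loose C approximation of such a layout follows, not the actual sketch. It assumes 32 row heads plus a shared pool of 3 byte entries linked per row in column order. As a simplification each entry stores its own column rather than the location of the next RAM tile, and the RAM tile index is simply the entry's pool index; all names are invented.

```c
#include <assert.h>
#include <stdint.h>

#define ROWS        32u
#define MAX_ENTRIES 50u    /* requested RAM tile count (example value) */
#define NONE        0xFFu  /* "no entry" marker */

typedef struct {
    uint8_t ram_tile;      /* RAM tile index */
    uint8_t column;        /* location on the row (simplified, see above) */
    uint8_t next;          /* next entry on this row, or NONE */
} RowEntry;

typedef struct {
    uint8_t  row_head[ROWS];     /* first entry of each row, or NONE */
    RowEntry entry[MAX_ENTRIES]; /* shared pool, variable count per row */
    uint8_t  used;
} RowLists;

/* RAM budget of the row heads plus the entry pool. */
static uint16_t row_lists_bytes(uint8_t ram_tile_count)
{
    return (uint16_t)(ROWS + (uint16_t)ram_tile_count * 3u);
}

static void row_lists_clear(RowLists *r)
{
    for (uint8_t i = 0; i < ROWS; i++) {
        r->row_head[i] = NONE;
    }
    r->used = 0;
}

/* RAM tile for (row, column): reuse an existing entry or link a new one
   in sorted position. Only the entries of that one row are walked. */
static uint8_t row_lists_get(RowLists *r, uint8_t row, uint8_t column)
{
    assert(row < ROWS);
    uint8_t *link = &r->row_head[row];
    while (*link != NONE && r->entry[*link].column < column) {
        link = &r->entry[*link].next;
    }
    if (*link != NONE && r->entry[*link].column == column) {
        return r->entry[*link].ram_tile;    /* already allocated */
    }
    if (r->used == MAX_ENTRIES) {
        return NONE;                        /* pool exhausted */
    }
    uint8_t e = r->used++;
    r->entry[e].ram_tile = e;
    r->entry[e].column = column;
    r->entry[e].next = *link;
    *link = e;
    return e;
}
```

For 50 RAM tiles row_lists_bytes gives 32 + 50 * 3 = 182 bytes.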

Mode74 already has a search in its RAM tile allocator for the prioritizing, it works rather well (as observable on the sprite demo or intensive scenes in FoaD), so I don't think it would have a huge impact. It isn't something too hairy, only the row on which the sprite fragment has to be placed needs to be checked (so usually likely about 4-8 entries from the list, nothing compared to the time required for the blit itself).

So for me this seems to be a totally viable graphics mode, which normally would have very similar capabilities to Mode 3 (or Mode 74 or any other RAM tiled mode if I applied this onto them, a Mode 64 variant doesn't seem to be possible though as it just doesn't fit).

Re: SPI Ram Video Mode

Posted: Fri Aug 18, 2017 9:16 pm
by D3thAdd3r
I am very optimistic then, if you think this is possible. I have far less asm skill for sure, but I still feel confident I know what this can do for games, which might be bigger than is obvious at first (just the perceived improvement in game complexity, beyond what the specs technically are).

So this path would cost 32+(50*3)=182 bytes for 50 ram tiles, and that cost is almost a wash since the 150 bytes of ram_tiles_restore[] are no longer necessary. The slow access for random reads and writes then seems possible to sidestep for the internal kernel stuff, though it is probably a great deal of work to do so. So M74 already has some things applicable to this that could be reworked? Naturally M74 could just as well use SPI vram to good effect too.

Re: SPI Ram Video Mode

Posted: Sat Aug 19, 2017 8:16 am
by Jubatian
Deriving from M74 due to the 7 cycles per pixel is just simpler to do, but the concept is the same. The scanline loop sketch for M3 was an ugly hairy beast which is impossible to avoid as you have to put one "lpm" between each two pixels for the ROM tile path.

M74, due to its half as large tiles, is also more versatile when combined with SPI RAM, as the SPI RAM sprite source feature proves (a similar feature for M3 would take about twice the time for a sprite). M748 is a little different beast which just simplifies M74 and adds SPI RAM sourced row types for some nice graphics. M74 with SPI RAM sprites could actually work together just fine with an SPI RAM VRAM.

I don't think it is that useful, though, as FoaD also indicates that an intensive M74 game would have a render bottleneck, which would only be made worse by moving the VRAM off to SPI RAM, as then in VBlank you would have to access that. At the same time, FoaD's logic frames, even with the very complex physics and crazy RAM saving tricks, take only about half as much time as the render frames - so there is more room to access SPI RAM in a logic frame than in a render frame in a 30FPS game.

So first maybe I would suggest doing some tests, how well it works, how much you can actually blit with the screen height you prefer at what FPS (60 or 30). Of course at 200px moving the VRAM to SPI RAM would be nice as at such a height you would definitely have time to use 50 RAM tiles efficiently possibly even at 60FPS if you have a simple game logic.

Re: SPI Ram Video Mode

Posted: Tue Sep 05, 2017 6:23 pm
by Jubatian
I managed to shoehorn in a 5.5 cycle / pixel variant of this new mode (of course if 5.5 cycles / pixel is possible, then 6 cycles / pixel is just a matter of some NOPs if you preferred that).

While I was at it, though, I threw in some Mode 1 instead, although Mode 3 would likely have been easier. But there is so much of that SPI RAM... So VRAM indices in this sketch are 2 bytes long, which wouldn't matter much for the blitter (the costly part is addressing in the SPI RAM, not the fetching of that additional byte), and the 2 byte VRAM allows for some tile tricks otherwise impossible. So this is rather a Mode 1 with sprites, something only viable when the VRAM is really on the SPI RAM.

Of course this way this could be visually more than just Mode 3 with more sprites. If the frame driver was programmed so, it would also be possible to have a new VRAM on every scanline, allowing some weird tile usage (while RAM tiles would still go on a 8x8 grid), such as for example serving Würgertime's needs.

Another thing on my mind, given the SPI RAM, is that an option for bitmaps would likely be nice. Not the kind Mode 748 supports (its resolution is totally off), but rather just narrow 8bpp bitmaps of the correct resolution (assisted with some preloading during scanlines to get over the limited bandwidth of the SPI RAM), maybe embedded in normal tiled graphics, which could have some good uses (such as character images in an RPG game).

Anyway, for now just a scanline loop, there are still lots of things to finalize in the master branch about the current improvements in Mode 3 and other areas (I also think about porting some mixer tricks I used in FoaD to reduce the inline mixer's size, which also frees up some VBlank time) before jumping into anything new.

Re: SPI Ram Video Mode

Posted: Thu Sep 07, 2017 4:55 pm
by D3thAdd3r
Quite nice, and 16bit vram could allow a lot more flexibility in tilesets, which might help save flash. I guess if sprite blitting is not too affected, that is the ultimate.

Do you think it is possible to squeeze code in the HSYNC with the existing channels, something to give user access to vram? Or perhaps it is best to build into the scanline giving up some width but keeping max sound/UART options? Even if it was nothing more than a circular buffer where it sent 1 raw user byte to SPI per line, it seems it can work. All time requirements are easily met with a scanline in between each one at least. Perhaps even the old UserHsyncCallback() idea could allow the choice of different short asm routines to fit the requirements of the game, and allow a lot of customization without being a different video mode or submode.
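Roughly what I picture, as a host-side C sketch only: the kernel calls whichever short routine the game selected once per scanline, and one such routine sends a single raw byte from a circular buffer. Every name here is hypothetical (including the setter), and the "SPI" is just a variable; the real routines would be short asm sized to fit the HSYNC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u                     /* hypothetical; power of two */

typedef void (*HsyncRoutine)(void);

static HsyncRoutine user_hsync = NULL;   /* routine selected by the game */
static uint8_t  ring[RING_SIZE];
static uint8_t  ring_head, ring_tail;
static uint8_t  last_spi_byte;           /* stand-in for the SPI data register */
static uint16_t spi_bytes_sent;

static void set_user_hsync_callback(HsyncRoutine fn) { user_hsync = fn; }

/* Simulated kernel hook: runs once per scanline. */
static void kernel_hsync(void)
{
    if (user_hsync != NULL) {
        user_hsync();
    }
}

/* User side: queue one raw byte for the SPI ram. */
static void ring_put(uint8_t b)
{
    uint8_t next = (uint8_t)((ring_head + 1u) & (RING_SIZE - 1u));
    assert(next != ring_tail);           /* overflow is a caller bug here */
    ring[ring_head] = b;
    ring_head = next;
}

/* Routine A: send at most one raw byte per scanline. */
static void hsync_send_raw_byte(void)
{
    if (ring_tail == ring_head) {
        return;                          /* nothing queued */
    }
    last_spi_byte = ring[ring_tail];
    ring_tail = (uint8_t)((ring_tail + 1u) & (RING_SIZE - 1u));
    spi_bytes_sent++;
}

/* Routine B: do nothing, for games that do not need vram access. */
static void hsync_idle(void) { }
```

A game would pick its routine once, e.g. set_user_hsync_callback(hsync_send_raw_byte), and then just ring_put() the bytes of whatever SPI command sequence it needs, one scanline apart.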