Smoother rendering of high res modes

The Uzebox now has a fully functional emulator! Download and discuss it here.
Jubatian
Posts: 1561
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary
Contact:

Smoother rendering of high res modes

Post by Jubatian »

I was a bit annoyed by the performance of the emulator when it came to how it renders high resolution modes (such as 3 cycles per pixel, code tiles).

So I did something about it (on top of my speed hacks, but on a different branch): https://github.com/Jubatian/uzebox/tree ... linebuffer

Capturing also seems to work fine with it. I used the 4.67 cycles figure for square pixels, as in this topic, and it seems right (a little narrower than before, 618 pixels versus 630).

It is based on an 8-bit line buffer into which all pixels go (so everything, all 1820 of them). Then when a scanline is to be rendered, the line buffer is processed into the destination surface.
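Roughly it works like this (a simplified sketch with illustrative names, not the actual code in the branch):

Code: Select all

// Simplified sketch of the line buffer idea; the names are illustrative,
// not the identifiers used in the branch.

#include <stdint.h>

#define LINE_CYCLES 1820U                 /* cycles (pixels) in a full scanline */

static uint8_t  line_buffer[LINE_CYCLES]; /* one 8 bit pixel per cycle */
static uint32_t line_pos;                 /* write position within the line */

/* Per-cycle pixel output: everything goes into the line buffer, the visible
 * part and the blanking alike. */
static inline void emit_pixel(uint8_t color)
{
    line_buffer[line_pos++] = color;
}

/* Once per scanline: convert the displayable part of the line buffer into
 * the 32 bit destination surface through the palette, then start over. */
static void flush_line(uint32_t *dest, const uint32_t *pal,
                       unsigned left_edge, unsigned disp_width)
{
    for (unsigned i = 0U; i < disp_width; i++) {
        dest[i] = pal[line_buffer[left_edge + i]];
    }
    line_pos = 0U;
}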

The speed of emulation remained unchanged. By disabling the processing of the line buffer entirely, I also verified that it only makes up about 2 percent of the emulator's CPU demand, and even this seems to be regained by having fewer branches in update_hardware, and maybe also by having a somewhat smaller surface (the original rendered onto a 720 pixel wide one, which it shrunk down to 630 using SDL).

EDIT: I had to go back and fix an unexpected segfault which didn't surface until I pushed the commit. I felt something was wrong there, but apparently my habit of not using signed variables unless strictly necessary for a given algorithm struck again. Many nasty behaviors of the C and C++ languages (a vast pile of undefined behaviors) are tied to signed arithmetic, such as overflow or how the sign is actually represented (not necessarily two's complement), and generally I am just not used to expecting something to be of signed type unless it is documented to be so. Anyway, fixed, it works now. Playing Alter Ego...
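For anyone unfamiliar with the difference, a minimal illustration (nothing to do with the actual bug here, it just shows the hazard):

Code: Select all

#include <limits.h>

int main(void)
{
    unsigned int u = UINT_MAX;
    u = u + 1U;          /* well defined: unsigned arithmetic wraps around to 0 */

    int s = INT_MAX;
    /* s = s + 1; */     /* undefined behavior: signed overflow, the compiler
                            is allowed to assume it never happens */

    (void)u;
    (void)s;
    return 0;
}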
uze6666
Site Admin
Posts: 4801
Joined: Tue Aug 12, 2008 9:13 pm
Location: Montreal, Canada
Contact:

Re: Smoother rendering of high res modes

Post by uze6666 »

Do you really see a benefit in expanding that buffer to 1820 bytes? 1820 cycles is the number of cycles in a full video line, including the active video, the HSYNC pulse, the color burst and everything. We can't display anything past 1440, and the kernel will always set the video DAC to zero after that (for cycles 1440 to 1820) to avoid colliding with the sync signal.
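(For reference, if I have the figures right, the 1820 comes straight from the NTSC timing: the Uzebox clock is 8 times the color burst frequency, and an NTSC scanline is 227.5 color burst cycles long.)

Code: Select all

/* Back-of-the-envelope check of the cycles-per-scanline figure, assuming
 * the usual 28.63636 MHz Uzebox clock (8 x NTSC color burst). */
static const double COLOR_BURST_HZ = 3579545.0;
static const double CPU_HZ         = 8.0 * COLOR_BURST_HZ;   /* 28636360 Hz */
static const double LINE_SECONDS   = 227.5 / COLOR_BURST_HZ; /* ~63.56 us   */

/* CPU_HZ * LINE_SECONDS == 227.5 * 8 == 1820 cycles per scanline, of which
 * only the first 1440 or so carry displayable pixels. */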
Jubatian
Posts: 1561
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary
Contact:

Re: Smoother rendering of high res modes

Post by Jubatian »

Uze6666 wrote:Do you really see a benefit in expanding that buffer to 1820 bytes?
That part gains performance since it eliminates conditional checks from the update_hardware function, the main path of the emulator (which executes some 20 million times a second to get to 28MHz). Of course this only works because it is a line buffer, since then those few kilobytes (2 KBytes here) can always sit in the cache. If the overall rendering algorithm (including sending to the video card) were hindered by having such a line buffer, it wouldn't be worth it, but if it gets along with it well, then it is worth it. I put it on a separate branch to see what it does. Right now I like how I can get a clean render of the higher resolution modes instead of missing entire 3 cycle wide columns (check the Mode 9 demo with this versus the original).
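To illustrate what goes away from the hot path (a sketch only, with made-up names; the real update_hardware in uzem is of course more involved):

Code: Select all

#include <stdint.h>

#define LINE_BUF_SIZE 2048U            /* power of two covering all 1820 cycles */

static uint8_t  line_buffer[LINE_BUF_SIZE];
static uint32_t surface_row[720];      /* one row of a display-sized surface */
static uint32_t palette[256];

/* Display-sized buffer: the per-cycle hot path needs a range check. */
static inline void emit_checked(unsigned cycle, uint8_t pixel,
                                unsigned left_edge, unsigned disp_width)
{
    if (cycle >= left_edge && cycle < left_edge + disp_width) {
        surface_row[cycle - left_edge] = palette[pixel];
    }
}

/* Full line buffer: no branch at all; the AND mask is only a safety net. */
static inline void emit_unchecked(unsigned cycle, uint8_t pixel)
{
    line_buffer[cycle & (LINE_BUF_SIZE - 1U)] = pixel;
}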

It also opens up possibilities, for example having an option like the one in the C64 VICE emulator to see the entire output, without crippling the performance of normal TV-size output emulation. It is not as critical as with the C64, but it may be fun to have, say, if someone adds colored borders and wants to see how far they extend, and the like. And even things like what I saw commented out there, displaying where and how the HSync pulse is output.

(Note: why is rendering into a line buffer better from the point of view of the cache, if it eventually needs to be rendered onto a whole surface anyway? This way the whole surface may be as small as required for the display, while the line buffer covers the entire possible length of the line, eliminating boundary tests from the main path, as above. That's the main benefit. If the line buffer could be sent directly into video RAM for display on every scanline, preferably as-is at 8 bits depth, then this would probably be the fastest overall solution for the rendering process.)
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm
Contact:

Re: Smoother rendering of high res modes

Post by Artcfox »

Jubatian wrote:If the line buffer could be sent directly into video RAM for display on every scanline, preferably as-is at 8 bits depth, then this would probably be the fastest overall solution for the rendering process.
I'll keep looking into whether it's possible to use an 8-bit texture with a palette. I think it should be possible, but I'm not sure that every video card supports it in hardware. I'd hate to make it always render that way, and then find out that some drivers emulate that in software, killing performance. (If that is the case, a command line flag could be added to make 8-bit textures optional.)
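One way to find out at runtime, rather than always rendering that way, would be to ask SDL which texture formats the renderer actually supports (a sketch, not code from the emulator):

Code: Select all

#include <SDL2/SDL.h>
#include <stdio.h>

/* Report whether the given renderer claims native support for 8-bit
 * paletted (INDEX8) textures; if the format is absent from the list, a
 * 32-bit texture is the safer default. */
static int renderer_supports_index8(SDL_Renderer *renderer)
{
    SDL_RendererInfo info;
    if (SDL_GetRendererInfo(renderer, &info) != 0) {
        fprintf(stderr, "SDL_GetRendererInfo failed: %s\n", SDL_GetError());
        return 0;
    }
    for (Uint32 i = 0; i < info.num_texture_formats; i++) {
        printf("supported: %s\n",
               SDL_GetPixelFormatName(info.texture_formats[i]));
        if (info.texture_formats[i] == SDL_PIXELFORMAT_INDEX8) {
            return 1;
        }
    }
    return 0;
}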
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm
Contact:

Re: Smoother rendering of high res modes

Post by Artcfox »

As I mentioned in the other thread, I found out that 8-bit textures aren't supported by SDL. In the limited testing I did, the performance of the line buffer branch is slightly worse for mode 3 games than without the line buffer change, but I understand that it improves the rendering for the higher resolution video modes.

A solution that CunningFellow has proposed that should work for all video modes is to just always render into a 1440 wide buffer, and let the GPU scale that down to the logical size of the rendering window. Compared with the previous 720 pixel wide buffer, that would mean transferring twice as many pixels to the GPU each frame, and sending twice as many pixels over the pipe to ffmpeg when recording, so I'm not sure what the final performance impact would be.

One experiment I did try for mode 3 games was to make the pixel buffer exactly 240 pixels wide, and then I let the GPU scale that up to the logical size of the rendering window. That reduced the number of pixels that needed to be transferred to the GPU each frame by a factor of 3 (compared to 720) in addition to reducing the number of pixels sent over the pipe to ffmpeg when recording.
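Both experiments boil down to the same couple of SDL calls, just with a different width (a sketch with made-up names, not uzem code; render_width would be 1440 in the first case and 240 in the second):

Code: Select all

#include <SDL2/SDL.h>

/* Create a streaming texture of render_width x 224 and let the GPU do all
 * of the scaling when it is copied to the window. */
static SDL_Texture *create_output_texture(SDL_Renderer *renderer,
                                          int render_width)
{
    return SDL_CreateTexture(renderer,
                             SDL_PIXELFORMAT_ARGB8888,
                             SDL_TEXTUREACCESS_STREAMING,
                             render_width, 224);
}

/* Per frame: upload the freshly rendered pixels in one call and stretch
 * them to the window (or to whatever logical size was set). */
static void present_frame(SDL_Renderer *renderer, SDL_Texture *texture,
                          const Uint32 *pixels, int render_width)
{
    SDL_UpdateTexture(texture, NULL, pixels, render_width * (int)sizeof(Uint32));
    SDL_RenderClear(renderer);
    SDL_RenderCopy(renderer, texture, NULL, NULL);
    SDL_RenderPresent(renderer);
}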

I'm envisioning a "best of both worlds" scenario where the scaling is always performed on the GPU, and where the rendering width has a sane default that looks great for any video mode, but can be overridden through a command line parameter for optimization purposes when you have advance knowledge of the video mode that will be used. What are your thoughts on that?
Jubatian
Posts: 1561
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary
Contact:

Re: Smoother rendering of high res modes

Post by Jubatian »

On the software side, the larger your buffer is, the worse the performance will be due to cache misses. If you want to get 60 FPS at 1440x224, that means about 19 megapixels of throughput per second; at 32 bits you can imagine that it won't likely work too well, especially if you traverse it twice (once to fill it up, then once more to ram it down the bus towards the video card), all the time without the help of any cache.

For a line buffer, it doesn't really matter how large it is on anything today. The 1820 pixel wide line buffer (rounded up to 2048 so the pixel output doesn't need conditionals, just throwing data into it with an AND mask on the offset as a safety net) takes 8K if it is at 32 bits, which is not a big deal for today's caches of megabyte quantities (and not even for any older machine capable of dealing with emulating the AVR).

So although for now I did it at 8 bits, it could be done just as well at 32 bits if that can be fed to the video card more easily. You still populate the line buffer and then send it, so you still have that ~20 megapixel throughput, but it all happens within the cache then. (You can experiment with it: just restore write_io producing a 32 bit pixel, change the type of the line buffer to 32 bits, then try to do something with it in the scanline renderer.)

Scaling is a tricky thing. I whacked together a software scaler to get a nice 4:3 output without ending up with such a gigantic 32 bit buffer to feed into SDL for it to scale. If you want something crisp, you would have to use linear interpolation for the horizontal shrink, but nearest neighbour for the vertical expansion. Vertically, at least creating a faint scanline effect would also be nice (old TVs would do that in progressive scan mode). Anyway, the way to handle the horizontal adjustment and the vertical one is entirely different, even more so as you try to get nearer to an old television style picture, simply due to how the CRT operated.
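The horizontal part could look something like this (a sketch only, not the scaler in the branch; it shrinks one 32 bit scanline linearly, and the vertical expansion would then just repeat rows nearest neighbour style, possibly darkening every other one for the scanline effect):

Code: Select all

#include <stdint.h>

/* Horizontally shrink one 32 bit scanline from src_w to dst_w pixels using
 * linear interpolation between the two nearest source pixels. Unoptimized,
 * per channel; assumes src_w >= 2 and dst_w >= 2. */
static void shrink_line_linear(uint32_t *dst, unsigned dst_w,
                               const uint32_t *src, unsigned src_w)
{
    for (unsigned x = 0U; x < dst_w; x++) {
        /* Source position in 16.16 fixed point. */
        uint32_t pos  = (uint32_t)((((uint64_t)x * (src_w - 1U)) << 16) / (dst_w - 1U));
        uint32_t idx  = pos >> 16;
        uint32_t frac = pos & 0xFFFFU;  /* weight of the right hand pixel */

        uint32_t a = src[idx];
        uint32_t b = src[idx + ((idx + 1U < src_w) ? 1U : 0U)];

        uint32_t out = 0U;
        for (unsigned sh = 0U; sh < 32U; sh += 8U) {
            uint32_t ca = (a >> sh) & 0xFFU;
            uint32_t cb = (b >> sh) & 0xFFU;
            out |= (((ca * (0x10000U - frac)) + (cb * frac)) >> 16) << sh;
        }
        dst[x] = out;
    }
}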

So in short: the first thing is maybe trying to see how you could pass anything over to the video card line by line, and how well that performs (does it have some ridiculous overhead, for example). If it can be done, then the line buffered approach is a good thing.
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm
Contact:

Re: Smoother rendering of high res modes

Post by Artcfox »

Jubatian wrote:On the software side, the larger your buffer is, the worse the performance will be due to cache misses. If you want to get 60 FPS at 1440x224, that means about 19 megapixels of throughput per second; at 32 bits you can imagine that it won't likely work too well, especially if you traverse it twice (once to fill it up, then once more to ram it down the bus towards the video card), all the time without the help of any cache.

For a line buffer, it doesn't really matter how large it is on anything today. The 1820 pixel wide line buffer (rounded up to 2048 so the pixel output doesn't need conditionals, just throwing data into it with an AND mask on the offset as a safety net) takes 8K if it is at 32 bits, which is not a big deal for today's caches of megabyte quantities (and not even for any older machine capable of dealing with emulating the AVR).

So although for now I did it at 8 bits, it could be done just as well at 32 bits if that can be fed to the video card more easily. You still populate the line buffer and then send it, so you still have that ~20 megapixel throughput, but it all happens within the cache then. (You can experiment with it: just restore write_io producing a 32 bit pixel, change the type of the line buffer to 32 bits, then try to do something with it in the scanline renderer.)

Scaling is a tricky thing. I whacked together a software scaler to get a nice 4:3 output without ending up with such a gigantic 32 bit buffer to feed into SDL for it to scale. If you want something crisp, you would have to use linear interpolation for the horizontal shrink, but nearest neighbour for the vertical expansion. Vertically, at least creating a faint scanline effect would also be nice (old TVs would do that in progressive scan mode). Anyway, the way to handle the horizontal adjustment and the vertical one is entirely different, even more so as you try to get nearer to an old television style picture, simply due to how the CRT operated.

So in short: the first thing is maybe trying to see how you could pass anything over to the video card line by line, and how well that performs (does it have some ridiculous overhead, for example). If it can be done, then the line buffered approach is a good thing.
The only way that I can see to pass the data to the video card line-by-line would be if you used a separate texture for each scanline (224 textures in total), but since there is a lot of overhead involved for each OpenGL call that is made, I would expect individually sending each scanline to absolutely destroy performance.

To minimize the number of underlying OpenGL calls, we pretty much need the data stored in one big chunk on the CPU side so the entire thing can be DMA'd directly into the GPU's texture memory. Given that the transfer rate for uploading textures from CPU to GPU is measured in GB/sec, I'm guessing that sending 1440*224*4 bytes each frame would cost very little. I was just trying to minimize the amount of pre-processing that we did in software, but I overlooked the fact that even when using a buffer that's 720 pixels wide, we still look at all 1440 pixels when doing the >> 1.

The part that concerns me is using a texture 1440 pixels wide, because I'm not sure that older graphics cards can support a texture > 1024 pixels on a side. (Some older graphics cards don't support texture sizes that aren't powers of two, but in that case I'm guessing that SDL rounds the size of the actual texture it uses up to the next power of two, and scales the texture coordinates appropriately to make it appear that non-power of two texture sizes are supported.)
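For what it's worth, SDL can report the renderer's limit at runtime, so a fallback could be chosen if it really turns out to be too small on some hardware (a sketch, not anything that exists in uzem):

Code: Select all

#include <SDL2/SDL.h>

/* Return the widest texture the current renderer claims to support, so the
 * caller could drop to a narrower render width on very old hardware. */
static int query_max_texture_width(SDL_Renderer *renderer)
{
    SDL_RendererInfo info;
    if (SDL_GetRendererInfo(renderer, &info) != 0) {
        return 0;  /* no information available */
    }
    return info.max_texture_width;
}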

Tonight I'll try the experiment of rendering directly into a 1440x224 pixel buffer with no scaling, and then making a single call to DMA that data into the GPU for scaling. It might end up being faster than hitting every pixel to scale it down to 720 wide before handing it off to the GPU for scaling. I'm not sure it's possible to specify a different scaling method for each axis though.
CunningFellow
Posts: 1445
Joined: Mon Feb 11, 2013 8:08 am
Location: Brisbane, Australia

Re: Smoother rendering of high res modes

Post by CunningFellow »

I like the idea of making the software surface 1820 (or 2048) wide so there is never a conditional branch.

However, why does it need to be a single line? Is there something I am missing about cache access? Could you not make the SW surface 1820x224?

Wouldn't the cache controller move the next page in when you are about to hit the boundary of the current page? I understand that jumping all over the 1.2 megabyte bitmap would mean constant cache penalties, but would a single index that only moves forward really cause more cache misses than rendering a single line and then having to copy it out to the big bitmap afterwards?

(Note: I don't know much about x86 and CPUs with 6 megabytes of cache RAM.)
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm
Contact:

Re: Smoother rendering of high res modes

Post by Artcfox »

So it looks like we can make the texture bigger than the area we want to render (1820x224, or 2048x256). Then, when we call SDL_UpdateTexture, we pass in an SDL_Rect* specifying the 1440x224 area that we changed, and when we call SDL_RenderCopy we pass in that same SDL_Rect* to specify that we only want that 1440x224 portion of the texture mapped to our window.
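Something along these lines (a sketch with made-up names; it assumes the CPU-side pixel buffer is also 2048 pixels per row):

Code: Select all

#include <SDL2/SDL.h>

/* Upload and draw only the 1440x224 region that actually changed, even
 * though the texture (and the CPU-side buffer) is allocated at 2048x256 so
 * that filling it needs no conditionals. */
static void present_active_area(SDL_Renderer *renderer, SDL_Texture *texture,
                                const Uint32 *pixels /* 2048 px per row */)
{
    const SDL_Rect active = { 0, 0, 1440, 224 };

    /* pitch is the byte width of a full source row, padding included */
    SDL_UpdateTexture(texture, &active, pixels, 2048 * (int)sizeof(Uint32));

    SDL_RenderClear(renderer);
    /* Map just that region onto the window; the GPU does the scaling. */
    SDL_RenderCopy(renderer, texture, &active, NULL);
    SDL_RenderPresent(renderer);
}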

That should give us no conditionals, and allow all the scaling to be performed on the GPU. The only question is what we want to specify in the call to SDL_RenderSetLogicalSize. If a user makes the window exactly 1440 pixels wide, we wouldn't want the image to be scaled down to 630, losing pixels, and then scaled back up, so I'm thinking we should set the logical size to 1440x1024?

Edit: Using an SDL_Rect is not necessary; see the patch below.
Last edited by Artcfox on Thu Oct 08, 2015 5:05 am, edited 1 time in total.
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm
Contact:

Re: Smoother rendering of high res modes

Post by Artcfox »

The attached patch, to be applied to your uzem140-linebuffer branch, is what I meant, except it doesn't make the horizontal scaling look blurry.

If that blur is what showed you the extra columns, you can add it back in hardware by changing:

Code: Select all

SDL_SetHint(SDL_HINT_RENDER_SCALE_QUALITY, "nearest");
to:

Code: Select all

SDL_SetHint(SDL_HINT_RENDER_SCALE_QUALITY, "linear");
Here is what the attachment looks like for people who want to see it without downloading and unzipping:

Code: Select all

diff --git a/tools/uzem/avr8.cpp b/tools/uzem/avr8.cpp
index fa6f7c4..5d42c4c 100644
--- a/tools/uzem/avr8.cpp
+++ b/tools/uzem/avr8.cpp
@@ -279,33 +279,16 @@ void avr8::spi_calculateClock(){
 
 
 // Renders a line into a 32 bit output buffer.
-// Performs a shrink by 2,33
-static void render_line(u32* dest, u8 const* src, u32 const* pal)
+static inline void render_line(u32* dest, u8 const* src, u32 const* pal)
 {
-	unsigned int sp;
 	unsigned int dp;
 
 	// Note: This function relies on the destination using a 8 bits per
 	// channel representation, but the channel order is irrelevant.
 
-	for (dp = 0U; dp < ((VIDEO_DISP_WIDTH / 3U) * 3U); dp += 3U)
+	for (dp = 0U; dp < VIDEO_DISP_WIDTH; dp ++)
 	{
-		// Shrink roughly does this:
-		// Source:      |----|----|----|----|----|----|----| (7px)
-		// Destination: |-----------|----------|-----------| (3px)
-		dest[dp + 0U] =
-			(((pal[src[sp + 0U]] & 0xF8F8F8F8U) >> 3) * 3U) +
-			(((pal[src[sp + 1U]] & 0xF8F8F8F8U) >> 3) * 3U) +
-			(((pal[src[sp + 2U]] & 0xFCFCFCFCU) >> 2)     );
-		dest[dp + 1U] =
-			(((pal[src[sp + 2U]] & 0xF8FCFCFCU) >> 2)     ) +
-			(((pal[src[sp + 3U]] & 0xFEFEFEFEU) >> 1)     ) +
-			(((pal[src[sp + 4U]] & 0xFCFCFCFCU) >> 2)     );
-		dest[dp + 2U] =
-			(((pal[src[sp + 4U]] & 0xFCFCFCFCU) >> 2)     ) +
-			(((pal[src[sp + 5U]] & 0xF8F8F8F8U) >> 3) * 3U) +
-			(((pal[src[sp + 6U]] & 0xF8F8F8F8U) >> 3) * 3U);
-		sp += 7U;
+		dest[dp] = pal[src[dp]];
 	}
 }
 
@@ -1909,19 +1892,19 @@ bool avr8::init_gui()
 	atexit(SDL_Quit);
 	init_joysticks();
 
-	window = SDL_CreateWindow(caption,SDL_WINDOWPOS_CENTERED,SDL_WINDOWPOS_CENTERED,VIDEO_DISP_WIDTH,448,fullscreen?SDL_WINDOW_FULLSCREEN_DESKTOP:SDL_WINDOW_RESIZABLE);
+	window = SDL_CreateWindow(caption,SDL_WINDOWPOS_CENTERED,SDL_WINDOWPOS_CENTERED,630,448,fullscreen?SDL_WINDOW_FULLSCREEN_DESKTOP:SDL_WINDOW_RESIZABLE);
 	if (!window){
 		fprintf(stderr, "CreateWindow failed: %s\n", SDL_GetError());
 		return false;
 	}
-	renderer = SDL_CreateRenderer(window, -1, SDL_RENDERER_ACCELERATED | SDL_RENDERER_PRESENTVSYNC);
+	renderer = SDL_CreateRenderer(window, -1, SDL_RENDERER_ACCELERATED);
 	if (!renderer){
 		SDL_DestroyWindow(window);
 		fprintf(stderr, "CreateRenderer failed: %s\n", SDL_GetError());
 		return false;
 	}
 	SDL_SetHint(SDL_HINT_RENDER_SCALE_QUALITY, "nearest");
-	SDL_RenderSetLogicalSize(renderer, VIDEO_DISP_WIDTH, 448);
+	SDL_RenderSetLogicalSize(renderer, VIDEO_DISP_WIDTH, 1024);
 
 	surface = SDL_CreateRGBSurface(0, VIDEO_DISP_WIDTH, 224, 32, 0,0,0,0);
 	if(!surface){
diff --git a/tools/uzem/avr8.h b/tools/uzem/avr8.h
index 3eea7e5..184fe11 100644
--- a/tools/uzem/avr8.h
+++ b/tools/uzem/avr8.h
@@ -57,7 +57,7 @@ THE SOFTWARE.
 #define VIDEO_LEFT_EDGE  166U
 // Video: Display width; the width of the emulator's output (before any
 // scaling applied) and video capturing
-#define VIDEO_DISP_WIDTH 618U
+#define VIDEO_DISP_WIDTH 1440U
 
 //Uzebox keyboard defines
 #define KB_STOP		0
You have mixed line endings in the files, and copying and pasting the above code into a .patch file prevents the patch from being applied, so use the attachment to apply the patch cleanly.

Here is the difference (using "linear" with a maximized Uzem window):
SoftwareScalingVSHardwareScaling.png (74.25 KiB)
And here is the difference (using linear with a normal-sized Uzem window):
SoftwareScalingVSHardwareScaling-618px.png (40.61 KiB)
Attachments
1440.patch.zip (1.26 KiB): patch against Jubatian's uzem140-linebuffer branch to perform the scaling in hardware for all modes (including high res modes)
Last edited by Artcfox on Thu Oct 08, 2015 6:00 am, edited 1 time in total.