Smoother rendering of high res modes

The Uzebox now has a fully functional emulator! Download and discuss it here.

Re: Smoother rendering of high res modes

Post by Jubatian »

There is also an in-between approach to this problem; I have seen it used somewhere. It is neither a line buffer nor a full display surface.

The minimal 32-bit buffer (1440x224) this needs is about 1.2 MB (1440 * 224 * 4 = 1,290,240 bytes). That is too large to expect it to sit well in the cache, except perhaps on the most recent CPUs. Yet I think the emulator could currently run even on a 1.4 GHz Tualatin (Pentium III), which had only 512K of cache (and there may still be several more powerful Celerons around with smaller caches, cut for cost reduction). There may also be many things competing for the cache: when running through Emscripten, for example, the browser's code is spinning there too.

Anyway, here is what it is.

Slice-based rendering. You keep only, say, 32 lines, so a 2048x32 pixel surface of 256 KBytes (at 32 bits per pixel). To implement it, simply use the line-buffer-based code and apply the line offset as "(scanline & 0x1F) * 2048px". Then, whenever a slice is completed, push the entire slice to the video card.
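
A minimal sketch of that slice addressing, assuming a 32-bit pixel format. The names (SliceBuffer, line, completes_slice) are made up for illustration and are not taken from the emulator source:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sizes from the post; all identifiers here are hypothetical.
constexpr std::size_t SLICE_WIDTH  = 2048; // pixels per line in the slice
constexpr std::size_t SLICE_HEIGHT = 32;   // lines per slice (power of two)

struct SliceBuffer {
    // 2048 * 32 pixels * 4 bytes = 256 KBytes
    std::vector<std::uint32_t> pixels =
        std::vector<std::uint32_t>(SLICE_WIDTH * SLICE_HEIGHT, 0);

    // Start of the line inside the slice for an absolute scanline number:
    // the "(scanline & 0x1F) * 2048px" offset.
    std::uint32_t* line(unsigned scanline) {
        return pixels.data() + (scanline & (SLICE_HEIGHT - 1)) * SLICE_WIDTH;
    }

    // True when this scanline is the last line of its slice, which is the
    // moment the whole slice could be pushed to the video card.
    static bool completes_slice(unsigned scanline) {
        return (scanline & (SLICE_HEIGHT - 1)) == SLICE_HEIGHT - 1;
    }
};
```

With 224 visible lines this gives exactly 7 slices per frame, so the last visible line (223) also completes a slice.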

The 1024 pixels versus 2048 pixels problem with slices might be solved by creating a 1024x64 slice instead, and using the 10th bit of the cycle counter (the bit with value 1024) to add the 32*1024 pixel offset. When pushing to video, you then do it in 2 passes (first the top half for the left side, then the bottom half for the right side). Of course, this method can be applied to full-surface rendering as well.
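
Roughly, the index calculation for such a split slice could look like this. It is only a sketch: it assumes the horizontal position comes straight from the cycle counter (0..2047), and the function name is hypothetical:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t HALF_WIDTH  = 1024; // width of the 1024x64 slice
constexpr std::size_t SLICE_LINES = 32;   // scanlines covered by one slice

// Pixel index inside a 1024x64 slice. Rows 0..31 hold the left half of
// each scanline, rows 32..63 hold the right half.
inline std::size_t split_slice_index(unsigned scanline, unsigned x) {
    std::size_t row  = scanline & (SLICE_LINES - 1);
    std::size_t half = (x >> 10) & 1;        // bit with value 1024
    return half * (SLICE_LINES * HALF_WIDTH) // the 32*1024 pixel offset
         + row * HALF_WIDTH
         + (x & (HALF_WIDTH - 1));
}
```

The two upload passes then read rows 0..31 for the left side and rows 32..63 for the right side.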

Returning to the halving approach.

The original code halved the pixel output frequency using a ">> 1" when writing out the pixel in update_hardware. There are multiple ways to smooth this out so that pixels whose widths are an odd number of cycles render more cleanly. The simplest for now is to stick with the line buffer + software scaler, rewriting the latter (trivial) to simply halve instead of dividing by 2.33, which should be notably faster. Another approach may be to skip the line buffer and do this instead:

- Palette is calculated with half intensities.
- Using a memset, clear the scanline (to zero) before its render begins.
- Pixel output is performed at halved frequency, but using an add instead of a set (so the average of 2 pixels ends up in each surface pixel).
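
The three steps above could be sketched as follows, assuming packed 32-bit pixels with 8 bits per channel. The helper names are hypothetical, not the emulator's:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Halve every 8-bit channel of a packed 32-bit color. The mask drops the
// bit that would otherwise leak in from the neighbouring channel.
inline std::uint32_t half_intensity(std::uint32_t c) {
    return (c >> 1) & 0x7F7F7F7Fu;
}

// Step 1: the palette is calculated with half intensities.
inline void build_half_palette(const std::uint32_t* full,
                               std::uint32_t* half, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) half[i] = half_intensity(full[i]);
}

// Step 2: clear the scanline to zero before its render begins.
inline void clear_line(std::uint32_t* line, std::size_t width) {
    std::memset(line, 0, width * sizeof(std::uint32_t));
}

// Step 3: output at halved frequency, adding instead of setting. Every
// channel is at most 127, so the sum of two pixels cannot carry over.
inline void put_pixel(std::uint32_t* line, unsigned cycle,
                      std::uint32_t half_color) {
    line[cycle >> 1] += half_color;
}
```

One side effect of this scheme is that full white comes out as 0xFE per channel instead of 0xFF, since 127 + 127 = 254.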

Re: Smoother rendering of high res modes

Post by Artcfox »

I'm pretty sure that we need the final image to be in actual RAM (not the CPU's cache) in order to perform a DMA transfer into the GPU's texture memory. Also, we're mostly writing to those pixels, and only ever reading them back exactly once for the DMA transfer. If we were using SSE, I would actually choose a non-temporal SSE store instruction so that writing to that final image buffer doesn't pollute the cache at all.

Edit: What do you think about my patch that goes on top of your linebuffer branch? It renders to something exactly 1440 pixels wide, and then uses the GPU to scale that up or down as required by the size of the window it's actually being displayed in.

Re: Smoother rendering of high res modes

Post by Jubatian »

I will look into it this evening; right now I only have a bit of idle time here.

I looked into SSE, and that would indeed be nice! But only for Intel PC builds, which also excludes Emscripten. There may also be cases where not much video memory is involved, windowed mode without some overlay for example (going by this thread on X11; it is a bit old, and SDL can likely almost always use some HW-accelerated overlay to get around this if HW acceleration is requested). But the point on caching is OK, since if there is a DMA transfer, things must be forced out of the cache, line buffer or no line buffer.

I think this is a good point to modularize the code a bit, by decoupling the render from avr8.cpp in a way that opens up the possibility of selectable rendering algorithms. That would come in handy later if, for example, somebody chooses to implement or add an NTSC TV emulation library for the output, so it can be selected as the user prefers; it may also make it easier to add, for example, a more Emscripten-specific renderer once its performance characteristics are figured out. The 8-bit line buffer would be a nice interfacing point for that, so even the palette can sit in the render module. Of course, it then has to be copied on every line into whatever buffer the renderer uses, but having it cached pays off for both the fill and the copy (since these still happen in system RAM, potentially never even leaving the cache).

I would implement it with some kind of callback mechanism, so the avr8.cpp module can call back to the renderer (every scanline), and the renderer may even be selected dynamically. The best would be if every single scanline was passed just as-is, letting the renderer decide what it processes from those. Maybe tomorrow I will try to do something in this regard.
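
As a rough illustration of such a callback interface (only one possible shape, with hypothetical class names; VideoSource merely stands in for the avr8 core, and the 224-line window starting at line 0 is a simplification):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A renderer receives every scanline as-is (8-bit palette indices) and
// decides for itself which ones it processes.
class ScanlineSink {
public:
    virtual ~ScanlineSink() = default;
    virtual void on_scanline(unsigned line, const std::uint8_t* buf,
                             std::size_t len) = 0;
};

// Stand-in for the emulator core: it only knows the interface, so the
// renderer can be swapped at run time.
class VideoSource {
public:
    void select_renderer(ScanlineSink* r) { renderer_ = r; }
    void emit_scanline(unsigned line, const std::uint8_t* buf,
                       std::size_t len) {
        if (renderer_) renderer_->on_scanline(line, buf, len);
    }
private:
    ScanlineSink* renderer_ = nullptr;
};

// Example renderer that only cares about an assumed 224-line window.
class VisibleLineCounter : public ScanlineSink {
public:
    unsigned visible = 0;
    void on_scanline(unsigned line, const std::uint8_t*,
                     std::size_t) override {
        if (line < 224) ++visible;
    }
};
```

A real renderer would expand the 8-bit line through its own palette into whatever buffer it uses, instead of just counting lines.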

What I would like to have myself is some pleasant software renderer, which would work consistently at a fixed ~640x448 size. It is nice when for some reason graphics acceleration just doesn't even want to work right (happens on older machines).

Re: Smoother rendering of high res modes

Post by Artcfox »

I like the idea of pluggable rendering algorithms, and a good software fallback for older computers.

I also discovered how to render directly into a 1440x224 texture's back buffer, rather than having to copy the pixels into a surface first. I create a temporary SDL_Surface to steal its optimal pixel format, and then that's what I use for the texture.


diff --git a/tools/uzem/Makefile b/tools/uzem/Makefile
index 7a94f3c..24321c3 100644
--- a/tools/uzem/Makefile
+++ b/tools/uzem/Makefile
@@ -14,7 +14,7 @@
 TARGETS = debug release
 
 #Uncomment to optimize for local CPU
-#ARCH=native
+ARCH=native
 #TUNE=y
 
 ######################################
@@ -50,7 +50,7 @@ CPPFLAGS += $(SDL_FLAGS) -D$(OS) -D_GNU_SOURCE=1 -DGUI=1 -DJOY_ANALOG_DEADZONE=8
 ######################################
 RELEASE_NAME = uzem$(OS_EXTENSION)
 RELEASE_OBJ_DIR := Release
-RELEASE_CPPFLAGS = $(CPPFLAGS) -O3 
+RELEASE_CPPFLAGS = $(CPPFLAGS) -Ofast -flto -fwhole-program
 
 ######################################
 # Debug definitions
diff --git a/tools/uzem/avr8.cpp b/tools/uzem/avr8.cpp
index 5d42c4c..d8a6fac 100644
--- a/tools/uzem/avr8.cpp
+++ b/tools/uzem/avr8.cpp
@@ -357,7 +357,7 @@ void avr8::write_io_x(u8 addr,u8 value)
 
 				if (scanline_count >= 0){
 					render_line(
-						(u32*)((u8*)surface->pixels + scanline_count * surface->pitch),
+						(u32*)((u8*)surface_pixels + scanline_count * surface_pitch),
 						&scanline_buf[left_edge],
 						palette);
 				}
@@ -368,13 +368,12 @@ void avr8::write_io_x(u8 addr,u8 value)
 				if (scanline_count == 224)
 				{
 
-					SDL_UpdateTexture(texture, NULL, surface->pixels, surface->pitch);
-					SDL_RenderClear(renderer);
+					SDL_UnlockTexture(texture);
 					SDL_RenderCopy(renderer, texture, NULL, NULL);
 					SDL_RenderPresent(renderer);
 
 					//Send video frame to ffmpeg
-					if (recordMovie && avconv_video) fwrite(surface->pixels, VIDEO_DISP_WIDTH*224*4, 1, avconv_video);
+					if (recordMovie && avconv_video) fwrite(surface_pixels, VIDEO_DISP_WIDTH*224*4, 1, avconv_video);
 
 					SDL_Event event;
 					while (singleStep? SDL_WaitEvent(&event) : SDL_PollEvent(&event))
@@ -444,6 +443,7 @@ void avr8::write_io_x(u8 addr,u8 value)
 						buttons[0] |= 0xFFFF8000;
 
 					singleStep = nextSingleStep;
+					SDL_LockTexture(texture,0,&surface_pixels,&surface_pitch);
 					scanline_count = -999;
 				}
 			}
@@ -1906,13 +1906,14 @@ bool avr8::init_gui()
 	SDL_SetHint(SDL_HINT_RENDER_SCALE_QUALITY, "nearest");
 	SDL_RenderSetLogicalSize(renderer, VIDEO_DISP_WIDTH, 1024);
 
-	surface = SDL_CreateRGBSurface(0, VIDEO_DISP_WIDTH, 224, 32, 0,0,0,0);
+	SDL_Surface *surface = SDL_CreateRGBSurface(0, VIDEO_DISP_WIDTH, 224, 32, 0,0,0,0);
 	if(!surface){
 		fprintf(stderr, "CreateRGBSurface failed: %s\n", SDL_GetError());
 		return false;
 	}
 
-	texture = SDL_CreateTexture(renderer,surface->format->format,SDL_TEXTUREACCESS_STREAMING,surface->w,surface->h);
+	pixel_format = *surface->format;
+	texture = SDL_CreateTexture(renderer,pixel_format.format,SDL_TEXTUREACCESS_STREAMING,surface->w,surface->h);
 	if (!texture){
 		SDL_DestroyRenderer(renderer);
 		SDL_DestroyWindow(window);
@@ -1920,10 +1921,12 @@ bool avr8::init_gui()
 		return false;
 	}
 
+	SDL_FreeSurface(surface);
 	SDL_RenderClear(renderer);
 	SDL_RenderCopy(renderer, texture, NULL, NULL);
 	SDL_RenderPresent(renderer);
 
+	SDL_LockTexture(texture,0,&surface_pixels,&surface_pitch);
 
 	if (fullscreen)
 	{
@@ -1968,30 +1971,30 @@ bool avr8::init_gui()
 		int red = (((i >> 0) & 7) * 255) / 7;
 		int green = (((i >> 3) & 7) * 255) / 7;
 		int blue = (((i >> 6) & 3) * 255) / 3;
-		palette[i] = SDL_MapRGB(surface->format, red, green, blue);
+		palette[i] = SDL_MapRGB(&pixel_format, red, green, blue);
 	}
 	
-	hsync_more_col=SDL_MapRGB(surface->format, 255,0, 0); //red
-	hsync_less_col=SDL_MapRGB(surface->format, 255,255, 0); //yellow
+	hsync_more_col=SDL_MapRGB(&pixel_format, 255,0, 0); //red
+	hsync_less_col=SDL_MapRGB(&pixel_format, 255,255, 0); //yellow
 
 	if (recordMovie){
 
 		if (avconv_video == NULL){
 			// Detect the pixel format that the GPU picked for optimal speed
 			char pix_fmt[] = "aaaa";
-			switch (surface->format->Rmask) {
+			switch (pixel_format.Rmask) {
 			case 0xff000000: pix_fmt[3] = 'r'; break;
 			case 0x00ff0000: pix_fmt[2] = 'r'; break;
 			case 0x0000ff00: pix_fmt[1] = 'r'; break;
 			case 0x000000ff: pix_fmt[0] = 'r'; break;
 			}
-			switch (surface->format->Gmask) {
+			switch (pixel_format.Gmask) {
 			case 0xff000000: pix_fmt[3] = 'g'; break;
 			case 0x00ff0000: pix_fmt[2] = 'g'; break;
 			case 0x0000ff00: pix_fmt[1] = 'g'; break;
 			case 0x000000ff: pix_fmt[0] = 'g'; break;
 			}
-			switch (surface->format->Bmask) {
+			switch (pixel_format.Bmask) {
 			case 0xff000000: pix_fmt[3] = 'b'; break;
 			case 0x00ff0000: pix_fmt[2] = 'b'; break;
 			case 0x0000ff00: pix_fmt[1] = 'b'; break;
@@ -2095,7 +2098,7 @@ void avr8::handle_key_down(SDL_Event &ev)
 			case SDLK_PRINTSCREEN:
 				sprintf(ssbuf,"uzem_%03d.bmp",ssnum++);
 				printf("saving screenshot to '%s'...\n",ssbuf);
-				SDL_SaveBMP(surface,ssbuf);
+				//SDL_SaveBMP(surface,ssbuf);
 				break;
 			case SDLK_0:
 				PIND = PIND & ~0b00001100;
diff --git a/tools/uzem/avr8.h b/tools/uzem/avr8.h
index 184fe11..4ac6725 100644
--- a/tools/uzem/avr8.h
+++ b/tools/uzem/avr8.h
@@ -292,7 +292,7 @@ struct avr8
 		timer1_next(0),
 
 		/*SDL*/
-		window(0),renderer(0),surface(0),texture(0),
+		window(0),renderer(0),/* surface(0), */texture(0),surface_pixels(0),surface_pitch(0),
 
 		/*Video*/
 		fullscreen(false),inset(0),
@@ -409,7 +409,10 @@ public:
 
 	SDL_Window *window;
 	SDL_Renderer *renderer;
-	SDL_Surface *surface;
+	//SDL_Surface *surface;
+        SDL_PixelFormat pixel_format;
+	void *surface_pixels;
+	int surface_pitch;
 	SDL_Texture *texture;
 	int sdl_flags;
 	int scanline_count;
Attachments
direct_rendering_into_texture.patch.zip
apply this on top of (after) the 1440.patch

Re: Smoother rendering of high res modes

Post by uze6666 »

I also vote for the modularity goal. It's something I have wanted to address for a long time: making it more object-oriented and decoupling the AVR core from everything external, like the graphics, sound, controllers and SD card.

Re: Smoother rendering of high res modes

Post by Jubatian »

I applied the patches to see how well it performs. Nice: it is definitely smooth with bilinear interpolation turned on, and acceptable without it (better than the original, which dropped entire columns in Mode 9). On my machine it is a little slower (about 2 percent) than the software-scaled method. It's just my personal opinion, but I don't much like the overall look and feel of the interpolated result, although it is definitely better than before. It somehow doesn't have a retro TV look and feel; it is just blurry, the kind of blurry where you can tell right away that it's just interpolation. Anyway, it is definitely fine for a basic output solution. Faint interlace lines would maybe lift its mood: at large sizes, interpolated, it looks odd how it is blurry vertically but crisp horizontally.

Maybe the video card could handle doing this in multiple passes. You could first shrink the surface horizontally with interpolation to some old TV resolution (preferably configurable, maybe 512 pixels). Then upscale vertically with nearest-neighbor to the output height (scanlines would be nice here), and finally upscale horizontally with bilinear. These are just wild guesses at getting a bit of TV look and feel cheaply.

Anyway, tomorrow I will try to do something about modularization, to figure out some nice interface for the thing. I was thinking about also removing the video capture feature into a module which would be callable from the renderer (so the renderer's characteristics may also appear on the recorded video), but I guess I will put this off for later due to the audio dependency and the mess around there. After tomorrow I would go volunteering for the weekend, but it might be cancelled due to the predicted heavy rains (if not, we will be like boars bathing in mud in the forest), I will try to get the interface, and my software renderer done before that, so there is some reference to build upon.

Re: Smoother rendering of high res modes

Post by uze6666 »

If you modularize, one very cool feature I would like to see (and which was asked for a few times before on the forums) is the possibility to dynamically apply NTSC filters like Blargg's http://slack.net/~ant/libs/ntsc.html. Are the methods you guys are discussing compatible with such filters?

Re: Smoother rendering of high res modes

Post by Jubatian »

My attempt is up on the uzem140-linebuffer branch!

This is what I could do today; hopefully it is self-explanatory as-is. Sorry for not including a link: it took half an hour and a reconnect script spinning in the background even to get here, after about ten minutes for the commit (and I am paying for this crap...). I won't be around for the next few days (going volunteering on a narrow-gauge railway; we decided to work in the engine shed if we can't go out to the track due to the predicted heavy rain).

Anyway, what is notably missing for now is the capture function, but I guess that will be manageable. Experiment with it and see how it works out: is the interface useful in general, or is anything missing? Artcfox: try to implement your renderer over the interface, and maybe you could even try to implement a selection feature just as an experiment (it is a mess out there; for now it is just a proof of concept, and later things will hopefully be cleaned up).

And please keep the "-Wall" flag in commits even if it floods the console like hell! At the very least, new code shouldn't produce any warnings!

Re: Smoother rendering of high res modes

Post by Artcfox »

I don't have time for a full test of this right now, but I went through and made some (what I hope are constructive) comments. This is a great step for making the code more manageable!

Re: Smoother rendering of high res modes

Post by Artcfox »

GitHub isn't letting me fork Jubatian's repo, because I already have the official one forked. How do we want to collaborate on this? Make a new branch within the official repo?

Also, doing it scanline by scanline should work with the renderer that renders directly into a texture's backbuffer, as long as we can feed the video recording system scanline by scanline, since technically the texture's backbuffer is a write-only pointer, and may not read back the same (though in my experience, since we pre-match the optimal format for textures it has read back the same for recording purposes).

The only issue I see is implementing screenshots, since we wouldn't be storing an entire frame anywhere that we could get to. SDL_SaveBMP requires an SDL_Surface, but my renderer doesn't use an SDL_Surface at all; it draws directly into the backbuffer of an SDL_Texture, and the documentation says we might not be able to read those pixels back. (And when we do read them back, it's a 1440x224 bitmap!) I'm thinking that with the recording feature, the ability to take screenshots can be dropped, since you can just record a video and save off whichever frame you want from it as a .png file.