Smoother rendering of high res modes

The Uzebox now has a fully functional emulator! Download and discuss it here.
User avatar
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm
Contact:

Re: Smoother rendering of high res modes

Post by Artcfox »

D3thAdd3r wrote:Sucks to lose the screenshot I guess. I feel like there is an efficient way to do it since certainly every first person shooter allows it, with things like live security camera screen written to texture. I think you have to lock the buffer and stall though.

For whatever reason I always hit Print Screen, which copies the screen, and then I paste it, crop it, scale it, whatever, and save it as a .png. Not sure why I do that since it's a bit of a pain, but it works for every program :|
If I use an intermediate SDL_Surface, then I can read back the pixels just fine for a screenshot, so that may be the solution here. The issue I am running into is that SDL2 is not catching my press of the PrintScreen button, because GNOME is catching it first (and doing its own screenshot) instead of passing the keypress through to Uzem.
User avatar
Jubatian
Posts: 1563
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary
Contact:

Re: Smoother rendering of high res modes

Post by Jubatian »

My commit is up with the changes to the renderer interface (uzem140-linebuffer branch in my fork).

I tried to decouple the renderer from whatever capture feature is there. Sure, it costs some performance in some cases, but it also establishes less tight coupling. As of now, for example, it should be clearly visible how libpng could be used for generating screenshots without butchering unrelated parts of the code.

I also outlined how a debug display is supposed to work; next time I think I will complete my software renderer, and in general the renderer's integration, so it becomes functional. It could be very useful for graphics mode design, as some here mentioned before. For now I imagined it with correct aspect ratio (like, for example, the Vice C64 emulator does), but your approach also makes sense (one pixel per cycle). However, I haven't got a sufficiently large monitor to use that (a big 1600x1200 LCD discarded from some graphics workstation sits on my desk; I hate 16:9... Damn it, I want to work with this thing, not only watch videos).

Collaborating on the code is problematic for me right now. It is overcast and rainy here (and predicted to stay so for the coming week), so my net access is terrible; I can't even really access GitHub due to its heavy JavaScript use (I can't get a complete page load). So it's command-line git only for me. Artcfox: please add my fork as a remote, and fetch/pull from it to work on it. I am doing the same with Uze's master branch; at least that works with more or less retries. I can also add your fork as a remote (please post its address) so I can pull back from you to integrate.
User avatar
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm
Contact:

Re: Smoother rendering of high res modes

Post by Artcfox »

Sorry, I didn't realize I had replied in the other thread.
Jubatian wrote:I wrote that function faster than I could load that page of the SDL documentation ... And still on the pages I could already load during this day, I only see SDL_ConvertSurface which as I remember allocated a new surface, so was unsuitable for the purpose (but I already closed the damn page some hours ago). It should be able to handle other 32 bit surface formats fine on the last else if you refer to the rendersupp_convsurf function. Anyway, it is OFFtopic here (belongs to the linebuffer topic, in this one I build over Uze's master branch, so there is nothing from that here).
This is the function I was referring to:

Code: Select all

SDL_Surface* SDL_ConvertSurfaceFormat(SDL_Surface* src,
                                      Uint32       pixel_format,
                                      Uint32       flags)

src            the SDL_Surface structure representing the surface to convert
pixel_format   one of the enumerated values in SDL_PixelFormatEnum; see Remarks for details
flags          the flags are unused and should be set to 0
But I think even that is unnecessary, because I believe SDL_SaveBMP doesn't care what the format of the SDL_Surface is; it should be able to write it to a BMP regardless.

The decoupling looks good, but I think we should decouple the renderer one level higher, at the frame level rather than at the scanline level. For every virtual function call at the scanline level, there will be 224*60 = 13440 virtual function calls per second, while at the frame level there would be only 60. If we store raw pixel values into a 1440x224 chunk of memory, you can still loop over every scanline to apply any NTSC effects (or scaling) you want, without making so many virtual function calls. Also, with the entire frame in one memory block, screenshotting would be just as easy as it was before (you wouldn't have to extract the pixels line by line with a virtual function call per scanline). For the other renderers we need the pixels in one big chunk anyway, so why break the frame up into scanlines and then reassemble them into a big chunk to render each frame? (I benchmarked rendering scanline by scanline into the texture's backbuffer and then locking it to flip buffers, versus rendering into the pixels of an SDL_Surface, and the latter ended up being quite a bit faster.)
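A minimal sketch of the frame-level interface described above (all names here are illustrative, not from the actual uzem code):

Code: Select all

```cpp
#include <cstdint>
#include <vector>

// Hypothetical frame-level renderer: the emulation core fills one
// 1440x224 buffer of raw pixel values, and each consumer (display
// scaler, screenshot writer, video capture) reads the whole frame
// with a single call per frame (60/s) instead of 224 per-scanline
// virtual calls (13440/s).
struct FrameRenderer {
    static constexpr int WIDTH  = 1440;  // one pixel per emulated cycle
    static constexpr int HEIGHT = 224;

    std::vector<uint32_t> frame;

    FrameRenderer() : frame(WIDTH * HEIGHT, 0) {}

    // The emulation core writes pixels directly into the frame buffer.
    void putPixel(int x, int y, uint32_t rgb) {
        frame[y * WIDTH + x] = rgb;
    }

    // One call per frame hands the whole buffer to a consumer.
    const uint32_t* getFrame() const { return frame.data(); }
};
```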

Is there any reason why your software rendering algorithm wouldn't work just as well or better with the rendering abstraction moved up to the frame level rather than at the scanline level?

I'll try to pull your changes down when I get more computer time later tonight. My GitHub fork is at https://github.com/Artcfox/uzebox.git . Uze pulled my changes to uzem140 into his uzem140 branch, so they should be the same right now.
User avatar
Jubatian
Posts: 1563
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary
Contact:

Re: Smoother rendering of high res modes

Post by Jubatian »

Eh, do you know whether a downloadable package of the SDL2 documentation exists? It would make life a whole lot easier here; without that, I think I'd rather give up until the weather clears up somewhat and I get a more consistent network connection.

The line-by-line request has marginal overhead compared to what else the CPU has to do. There are 1820 cycles to emulate per line, and as far as my CPU goes, at least about 35 host cycles are spent emulating one Uzebox cycle, which means you get one single call per roughly 60K cycles. One function call more or less doesn't really matter at this magnitude.

The reason the line-by-line interface is better is that it gives you the possibility to use the cache. If you send an entire frame in a defined format, you may get stuck with two copies unassisted by any cache: first you might be forced to copy within the renderer (into the defined format), and then you might be forced to copy in the consumer (video capture or screenshot module) if the interface format isn't usable right away. If you go line by line instead, in this scenario you would essentially still copy twice, but the line buffer passing through the interface stays cached, so it is faster (unless some forced DMA hinders this).

If you use some SDL structure for the interface (a surface, for example), you might still be stuck with the same problem, and worse, the consumers will have to depend on SDL and convert the data unless they use SDL to do their task. The BMP screenshot feature is a simple enough scenario, but even the video capture or a libpng output wouldn't necessarily work so well with it. (The video capture implementation is just as fragile as some of my new specific code, depending on the surface being 32 bits with 8 bits per channel; thankfully avconv can be fed all of these. But some different target added later may not even be that flexible.)

In scenarios where the target needs some specific format, not even 0x00RRGGBB, it can allocate a single-scanline 0x00RRGGBB buffer to use with the interface, and convert that to its native output line by line. The communication between the modules this way stays within the cache (a single scanline of 0x00RRGGBB fits easily). In some cases the line buffer might not even need to be copied into a surface, just converted line by line and streamed out (as is now the case with avconv: you could fwrite it line by line instead of writing an entire surface).
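A sketch of that per-line consumer pattern (names and the 0x00BBGGRR target format are illustrative assumptions): the consumer receives one 0x00RRGGBB scanline at a time, which stays cache-resident, and converts it into its own native layout before streaming it out.

Code: Select all

```cpp
#include <cstdint>
#include <cstddef>

// Convert one 0x00RRGGBB scanline into a hypothetical native
// 0x00BBGGRR output line. A consumer would call this once per line
// on a small, cache-resident buffer instead of converting a whole
// frame at once.
void convertLine(const uint32_t* src, uint32_t* dst, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint32_t p = src[i];
        dst[i] = ((p & 0x000000FFu) << 16) |  // B moves to the high byte
                 ( p & 0x0000FF00u)        |  // G stays in place
                 ((p & 0x00FF0000u) >> 16);   // R moves to the low byte
    }
}
```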

So I thought that while in some cases (when SDL can do everything, or the target is flexible enough to receive the most common SDL surfaces the renderer might produce) its performance is inferior, the overall generic performance is at least consistent (even in the most adverse scenarios, when neither the renderer nor the receiver can work with 0x00RRGGBB), and the interface is simple (0x00RRGGBB format all the time, no need to insert source-conversion magic on the receiver end).

Cutting the SDL dependence from the interface is beneficial in that it would later be easier to port the emulator onto different (non-SDL) backends, and such a port wouldn't affect the capture features using the renderer's interface.
User avatar
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm
Contact:

Re: Smoother rendering of high res modes

Post by Artcfox »

Not that I can find, but the source code (specifically the include files) has better comments than their terrible wiki page does, and it'd be a one-time 3.9 MB download. If that's too much of a strain because it's coming from far away, you might have better luck grabbing it from a local mirror using:

Code: Select all

apt-get source libsdl2-2.0-0
Okay. Now I understand why you chose the scanline approach. I dug up my 9-year-old laptop (2.0 GHz Core 2) and tried the uzem140 branch, which could do 31 MHz, and then I tried your latest code with the linebuffer and abstracted software renderer, which could do 45 MHz. I guess the virtual function calls don't make that much of a difference, though I'd be curious to add just the linebuffer (without the abstraction) to the uzem140 branch to see how that performs, since it has fewer conditionals. But at this point you have me convinced that abstracting at the scanline level is a good thing. :)
User avatar
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm
Contact:

Re: Smoother rendering of high res modes

Post by Artcfox »

Jubatian, I figured out how to coordinate development among multiple forks (thanks for the tips!), and also how to grab an entire frame of already-processed pixels from your software renderer without adding a dependency on SDL to the renderer interface. I used this to implement video recording in your software renderer, and it works very well. To avoid a moiré pattern from the NTSC effect, I switched ffmpeg's upscaling from nearest to bilinear. I submitted a pull request to you, so hopefully you can get the GitHub page to load (or maybe you can add my repo as a remote and merge the changes in through the command line).

I didn't make this part of the pull request, but if you want it to run even faster, you can do this to the Makefile:

Code: Select all

diff --git a/tools/uzem/Makefile b/tools/uzem/Makefile
index d1c4685..430e55a 100644
--- a/tools/uzem/Makefile
+++ b/tools/uzem/Makefile
@@ -22,7 +22,7 @@ ifeq ($(EMSCRIPTEN_BUILD),1)
 endif
 
 #Uncomment to optimize for local CPU
-#ARCH=native
+ARCH=core2
 #TUNE=y
 
 ######################################
@@ -59,7 +59,7 @@ CPPFLAGS += -Wall
 ######################################
 RELEASE_NAME = uzem$(OS_EXTENSION)
 RELEASE_OBJ_DIR := Release
-RELEASE_CPPFLAGS = $(CPPFLAGS) -O3 $(EMSCRIPTEN_FLAGS)
+RELEASE_CPPFLAGS = $(CPPFLAGS) -Ofast -flto -fwhole-program $(EMSCRIPTEN_FLAGS)
 
 ######################################
 # Debug definitions
@@ -93,7 +93,7 @@ ifeq ($(EMSCRIPTEN_BUILD),1)
     CC := emcc
 else
     SDL_FLAGS := $(shell sdl2-config --cflags)
-    LDFLAGS += $(shell sdl2-config --libs)
+    LDFLAGS += $(shell sdl2-config --libs) -Ofast -flto -fwhole-program
     CC := g++
 endif
 MKDIR := mkdir -p
On my desktop, those changes resulted in a speedup of ~35 MHz!

I also tested the effect of using profile guided optimizations when compiling Uzem, and that gave me an additional ~2 MHz of emulation speed.
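For reference, the standard GCC profile-guided optimization workflow looks roughly like this. The -fprofile flags are standard GCC; the make variable override and game file name are only a sketch, not actual uzem targets:

Code: Select all

```shell
# 1. Build with instrumentation, then run a representative workload so
#    GCC can collect branch/call statistics (written as .gcda files).
make RELEASE_CPPFLAGS="-O3 -fprofile-generate"
./uzem some_game.uze   # play for a while to gather profile data

# 2. Rebuild using the collected profile to guide optimization.
make clean
make RELEASE_CPPFLAGS="-O3 -fprofile-use"
```

The speedup depends heavily on how representative the profiling run is of real emulation workloads.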
User avatar
Jubatian
Posts: 1563
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary
Contact:

Re: Smoother rendering of high res modes

Post by Jubatian »

Now I am in the office, the only good thing about which is the reliable internet connection (and it's early morning, nobody here yet, no mail or issues in the tracker to work on either).

I checked your work on GitHub, just the code. As far as I can see it might be worth consideration, but as of now it is quite fragile (just as fragile as it was in the original code).

For a start, I have the following important item on my TODO list: fix the line rendering code so it supports 32-bit formats that are not 8 bits per channel.

This is not a hard fix; I already know how I will do it. Why is it needed? Because it may be rare, but it is possible. I found that, for example, "deep color" may also be used at 32 bits per pixel, with 10 bits for each component. As of now my 32-bit line renderer would only produce a mess with that. Your part cannot be fixed to support this, since the "-pix_fmt" parameter of avconv might not accept such a format. It may not even accept all possible 8-bit-per-channel encodings, such as the rare case where alpha sits on the low 8 bits (you noted that for my convert algorithm, so here it is back).

There is another big problem right now with your code: it ignores the pitch of the surface. The pitch is not necessarily equal to the requested width (times sizeof(uint32)); it may be larger, depending on what's optimal for the rendering system under SDL (the video card, if it is hardware-accelerated).

For me the concept simply feels too fragile for general use, something prone to bite back at some point when nobody wants to touch this part of the code any more, yet it becomes necessary due to user complaints. If a problem arises somewhere in the line-by-line solution, it likely needs a fix in only one module (either the renderer, if it fails to produce proper 0x00RRGGBB, or the affected capture module, if it fails to accept 0x00RRGGBB), while the full-frame approach would call for a fix in both. And then you wouldn't be able to change the format of the full frame, since you'd have three other capture modules relying on it. So you would end up with two entire frame copies unassisted by any cache, since the defined interface ends up fitting neither the renderer nor the receiver.

The full-frame approach, I think, would only do any good if a fixed format could be established for it (like 0x00RRGGBB), but that is not possible. It is possible for the renderer to build up an entire 0x00RRGGBB frame assisted by SDL, but the problem is that while most receivers will likely work with such a frame, some may not like it, or are of a streaming nature anyway (like avconv), so you have wasted some RAM, and, if the renderer could otherwise have used the cache to generate 0x00RRGGBB lines, also a lot of CPU time (since the entire frame doesn't fit in the cache).

By the way, the performance cost of per-line versus per-frame calls is negligible not because the calls themselves are so cheap, but because the CPU already spends 60K cycles doing other work before it advances a line. So even if the call itself had some ridiculous 600 cycles of overhead, you would barely notice it (a 1% drop in performance).

From a modularization perspective, you also shouldn't add parts of the video capture to the screenshot module; they are different things. The video capture should sit within its own module, which is currently not possible since first the audio concepts of the emulator have to be cleaned up properly (such as incorporating the 48KHz idea and the like). In general, things dealing with avconv-specific stuff should sit in a module called something like "captureavconv.cpp" and nowhere else. As with the renderer, more capture modules might be present later, probably with an interface (maybe "captureif.h") connecting them so they can be selectable. The same goes for the screenshot stuff. But this is still a long way off... (And I also wish to get some spare time to actually code for the Uzebox, like building a video mode. That also helps in understanding how the video signal generation works, which in turn calls for some changes in the related part of the renderer interface.)

Of course your solution is likely the fastest for the cases where it can be operational, especially if anything from the renderer can only be extracted by some DMA (bypassing the cache). But I simply can't see how it could be utilized for general use in a manner that won't bite back later.

Designing good interfaces is simply tricky, a beast, but taming this beast is what separates unmaintainable spaghetti from clean, robust code capable of coping with the passage of time.
User avatar
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm
Contact:

Re: Smoother rendering of high res modes

Post by Artcfox »

That goes back to my earlier point about not rendering into the window's surface. Right now you create a window and then get its surface. If you create a new surface with a fixed bpp, like the original code does, and then ask SDL to render that surface, you don't have to worry about the person's monitor not being 32bpp; SDL will do whatever conversion is necessary to render it, either in hardware if it can, or in software.

Why manually write code to support every possible monitor when you can just let SDL handle that detail?
User avatar
Jubatian
Posts: 1563
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary
Contact:

Re: Smoother rendering of high res modes

Post by Jubatian »

The original code doesn't create a surface with a fixed bpp. The current master branch (uzem140) has this on line 1928:

Code: Select all

surface = SDL_CreateRGBSurface(0, 720, 224, 32, 0,0,0,0);
This, per its documentation, uses default parameters for the layout of the 32-bit surface, which could be anything. The original code then works around this by (correctly) creating a fitting palette, and by composing an avconv pix_fmt parameter (which works for the most common cases, but not for all).

While SDL doesn't document it there, I think there was a good reason for this decision. This way SDL can choose a format that is likely the fastest for rendering, and at least for the display (the palette generation) the original code handles that all right. Asking for a specific output format (such as 0x00RRGGBB when the system is, for example, natively 0x00BBGGRR) might make performance worse. So I wouldn't change that part to force a specific surface format, which would lower performance for everyone, not only the few who actually want to record video.

It is true that rendersupp_convsurf is a kludge; I created it only because I couldn't find anything in SDL at the time that would do something equivalent. (The proper method would likely be to create a small, 1 px tall 0x00RRGGBB surface and blit the source lines into it, but when I had to write that, my net access was so flaky that I got fed up trying to load the damn doc and hacked up something that should work. If the method has some notable overhead, I would allocate a slice a few px tall instead, assuming that the caller requests consecutive lines. Maybe swapping it so that getLine returns a pointer instead of filling a provided location would be better in that regard; then it would be possible to just point into your surface if it already happens to be 0x00RRGGBB. I guess I will make this change, it makes sense.)

In rendersupp_line32 it is necessary to work around the problem of differing targets, to avoid calling for a surface conversion at some point. One will happen unless you write in the native output format of the target; it just may be accelerated by the hardware, if that is functional. My software renderer assumes a situation where the hardware might be flaky or not even available (as in "acceleration") for whatever reason. The process that rendersupp_line32 performs can be implemented well for any format, so overall performance is best if it works in the native format wherever possible.

So the following changes should be made:
  • Make renderIf::getLine return a pointer to the line data instead of filling a provided buffer. This way, if the renderer natively works in 0x00RRGGBB format, it can return lines without doing anything; otherwise it needs to implement the conversion as necessary.
  • Replace rendersupp_convsurf with an SDL_BlitSurface() call where a conversion is actually necessary. That will handle all source => destination conversions properly.
  • Depending on the implementation, it might be beneficial not to create an entire shadow 0x00RRGGBB surface, only a single line or slices of it (in the case of a software renderer this will likely keep things cached). The renderer can expect that the renderIf::getLine method is called with incrementing line numbers. The shadow surface, slices, or line should only be created on a call to renderIf::getLine.
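A sketch of the first bullet, the pointer-returning getLine (class and member names are illustrative, not the actual renderIf code): a natively 0x00RRGGBB renderer returns its own row with zero copies, while any other renderer converts on demand into a single cache-resident scratch line.

Code: Select all

```cpp
#include <cstdint>
#include <vector>

// Illustrative renderer whose getLine returns a pointer to 0x00RRGGBB
// line data instead of filling a caller-provided buffer.
class LineRenderer {
public:
    static constexpr int WIDTH = 1440;

    // nativeRGB: whether the internal frame is already 0x00RRGGBB.
    // (The non-native case below assumes a 0x00BBGGRR internal layout.)
    LineRenderer(bool nativeRGB)
        : nativeRGB_(nativeRGB),
          frame_(WIDTH * 224, 0x00112233u),
          scratch_(WIDTH) {}

    // Callers are expected to request incrementing line numbers.
    const uint32_t* getLine(int y) {
        const uint32_t* row = &frame_[y * WIDTH];
        if (nativeRGB_) {
            return row;  // zero-copy: point straight into the frame
        }
        // Convert the assumed 0x00BBGGRR row into the 0x00RRGGBB
        // scratch line; only this one line needs to stay cached.
        for (int x = 0; x < WIDTH; x++) {
            uint32_t p = row[x];
            scratch_[x] = ((p & 0xFFu) << 16) | (p & 0xFF00u) | (p >> 16);
        }
        return scratch_.data();
    }

private:
    bool nativeRGB_;
    std::vector<uint32_t> frame_;
    std::vector<uint32_t> scratch_;
};
```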
User avatar
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm
Contact:

Re: Smoother rendering of high res modes

Post by Artcfox »

Jubatian wrote:The original code doesn't create a surface with a fixed bpp. The current master branch (uzem140) has this on line 1928:

Code: Select all

surface = SDL_CreateRGBSurface(0, 720, 224, 32, 0,0,0,0);
This, by its documentation, uses default parameters for the layout of the 32 bit surface which could be anything. Then the original code works this all around by (correctly) creating a fitting palette, and by composing an avconv pix_fmt parameter (which works for the most common cases, but not for all).
How does this not cover all possible arrangements of an SDL_PACKEDLAYOUT_8888 (R, G, B, and A)?

Code: Select all

			char pix_fmt[] = "aaaa";
			switch (surface->format->Rmask) {
			case 0xff000000: pix_fmt[3] = 'r'; break;
			case 0x00ff0000: pix_fmt[2] = 'r'; break;
			case 0x0000ff00: pix_fmt[1] = 'r'; break;
			case 0x000000ff: pix_fmt[0] = 'r'; break;
			}
			switch (surface->format->Gmask) {
			case 0xff000000: pix_fmt[3] = 'g'; break;
			case 0x00ff0000: pix_fmt[2] = 'g'; break;
			case 0x0000ff00: pix_fmt[1] = 'g'; break;
			case 0x000000ff: pix_fmt[0] = 'g'; break;
			}
			switch (surface->format->Bmask) {
			case 0xff000000: pix_fmt[3] = 'b'; break;
			case 0x00ff0000: pix_fmt[2] = 'b'; break;
			case 0x0000ff00: pix_fmt[1] = 'b'; break;
			case 0x000000ff: pix_fmt[0] = 'b'; break;
			}
Edit: I see where the confusion may lie. In my above comment, I didn't say a fixed pixel format, I said a fixed bpp. And in that call to SDL_CreateRGBSurface, the 4th parameter is depth, which was hard-coded to 32, implying 8 bits per channel, given that the rest of the parameters are the masks for R, G, B, and A.