Get that emu faster

The Uzebox now has a fully functional emulator! Download and discuss it here.
Jubatian
Posts: 1562
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Get that emu faster

Post by Jubatian »

Uh. I am pretty drained right now. Just a note on performance: if your CPU has enough L1 instruction cache, the most common paths of the inflated code likely fit, so it is fast. My older Core 2 CPU likely doesn't have that much cache, and I was hitting its limits. At least this is what I think, since otherwise there is no reason why adding update_hardware_fast everywhere would cause any slowdown. I think general common sense should be followed here: don't inflate the code unless you gain something really noticeable, since it will likely have an adverse effect on older processors, where the performance is actually needed. In general, optimization should target the lowest-spec hardware on which the thing runs acceptably, since everything faster will cope fine.

I don't think profiling is a good idea in this case, since the instructions used and their layout are very characteristic of the game played (especially the implementation of its video mode). You might gain 5 percent for one game and lose maybe 5 for another. For Emscripten, integrating the emulator with a single game, it is fine of course, but the general-use emulator won't benefit from it.
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm

Re: Get that emu faster

Post by Artcfox »

Jubatian wrote: Uh. I am pretty drained right now. Just a note on performance: if your CPU has enough L1 instruction cache, the most common paths of the inflated code likely fit, so it is fast. My older Core 2 CPU likely doesn't have that much cache, and I was hitting its limits. At least this is what I think, since otherwise there is no reason why adding update_hardware_fast everywhere would cause any slowdown. I think general common sense should be followed here: don't inflate the code unless you gain something really noticeable, since it will likely have an adverse effect on older processors, where the performance is actually needed. In general, optimization should target the lowest-spec hardware on which the thing runs acceptably, since everything faster will cope fine.
I'll try it out on my Core 2 to see if I get the same results.
Jubatian wrote: I don't think profiling is a good idea in this case, since the instructions used and their layout are very characteristic of the game played (especially the implementation of its video mode). You might gain 5 percent for one game and lose maybe 5 for another. For Emscripten, integrating the emulator with a single game, it is fine of course, but the general-use emulator won't benefit from it.
It's not a matter of 5%; it's more like a 20% speed boost, and it's possible to train it across multiple different ROMs. I haven't seen it lose speed compared to the non-PGO build yet, even with games that use completely different video modes. And actually only the natively compiled Uzem can benefit from this, because the Emscripten compiler does not have working PGO. I encourage you to try it yourself, but if I can get time tonight I'll make a core2-specific build and benchmark it with and without.
Artcfox

Re: Get that emu faster

Post by Artcfox »

Okay, so I did some benchmarks on my 2.0 GHz Core2. With ARCH=core2, and as long as you enable PGO (which lets you collect the data from multiple runs, so you can profile all of the different video modes rather than pick favorites), it's faster to call update_hardware_fast() everywhere possible than to call it in a limited number of circumstances. If you don't enable PGO, then having update_hardware_fast() everywhere ends up being slightly slower, but it's still more than fast enough to run at full speed. My vote is to put it everywhere possible, since it does offer a significant speedup when combined with PGO (18% faster!).

My baseline, the latest uzem140 branch with just the ARCH=core2 flag, ran at 55 MHz. With update_hardware_fast() everywhere and PGO enabled (not even trained on Arkanoid), it ran at 65 MHz. Additionally, when running at 100% emulation speed, the build that calls update_hardware_fast() everywhere with PGO enabled uses 10% less CPU than the build that calls update_hardware_fast() a limited number of times without PGO enabled.
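
The 18% figure follows directly from those two measurements. A minimal sketch of the arithmetic, where `speedup_percent` is a hypothetical helper name used only for illustration and is not part of uzem:

```cpp
#include <cassert>

// Hypothetical helper (not part of uzem): relative speedup of a tuned
// build over a baseline build, in percent. Both inputs are emulated
// clock rates in MHz, measured with the emulator running unthrottled.
double speedup_percent(double baseline_mhz, double tuned_mhz) {
    assert(baseline_mhz > 0.0);
    return (tuned_mhz - baseline_mhz) / baseline_mhz * 100.0;
}
```

With the numbers above, `speedup_percent(55.0, 65.0)` comes out at roughly 18.2.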

Perhaps, to please everyone, update_hardware_fast() could be enabled everywhere possible only when compiling with PGO turned on?
Jubatian

Re: Get that emu faster

Post by Jubatian »

I tested your branch. I used Arkanoid since I did most of my previous benchmarks on it, so it was a good reference point.

Something is severely wrong with your build on my system. My peak is 78MHz on Arkanoid, as described in my commit. However, I get the following figures with your branch:

- 55MHz with a plain make (this is how I compiled my branch, resulting in the 78MHz build).
- 59MHz when I make with NOGDB=1.
- 59MHz when I make with NOGDB=1 and ARCH=core2 (so absolutely nothing gained here).
- 62MHz when building with profiling, playing the first two levels of Arkanoid with the GEN build.

So for me the profiling gave only a meager 5% improvement on this game, and the whole thing is still a far cry from the 78MHz I achieved before. The 55MHz figure is the same as I had before adding my "fasten your seatbelt" changes and CunningFellow's instruction decoder. Seriously, it feels like this build is missing both, as if Make had silently pulled that old version back in behind the scenes (even though I see the code containing the changes).
Artcfox

Re: Get that emu faster

Post by Artcfox »

Yes, you pointed out that I messed the merge up, but all the changes are in upstream, so I'm using the upstream uzem140 branch to test. For the fix, I'm just doing a selective search and replace for update_hardware() with update_hardware_fast().

I have a slower core2, but it went from 55MHz up to 65 MHz with update_hardware_fast() everywhere and PGO on the upstream uzem140.
Jubatian

Re: Get that emu faster

Post by Jubatian »

I figured out what was wrong with that build. It can reach 78MHz on a normal clean build.

There is something wrong with how the flag set is implemented which you use for measuring performance. If I revert to the old method of removing SDL_RENDERER_PRESENTVSYNC manually and commenting out the "while(audioRing.isFull())SDL_Delay(1);" line, I get the old figures back. With those changes I get the following results, respectively:

- 78MHz with a plain make (this is how I compiled my branch, resulting in the 78MHz build).
- 81MHz when I make with NOGDB=1.
- 81MHz when I make with NOGDB=1 and ARCH=core2 (so still absolutely nothing gained here).
- 93MHz when building with profiling, playing the first two levels of Arkanoid with the GEN build.

It is interesting how in this case profiling provided a 15% improvement (playing Arkanoid at that speed is nigh impossible). Of course the GEN build necessarily also had the performance measurement tweaks; maybe the original profiling tried hard to improve the waiting part (which was absent here)?

So I say that at least the method of performance measurement using those flags is wrong in some way. Something is still burning away loads of CPU time there which is not the AVR's emulation.

As a reference, I also checked profiling with the flag-based version, by running it with "-vnw" for the GEN stage. I got the following:

- 65MHz when building with profiling, playing the first two levels of Arkanoid with the GEN build (-wnv).

This is versus the 62MHz that could be achieved without -wnv, and the 59MHz achieved without profiling. Compared to the latter it is a 10% improvement, more consistent with what I got with the manually commented-out version. Something fools the profiler there if you play without -wnv. (It might be trying hard to optimize the part which ensures proper timing instead of working on the instruction decoder? That "busy loop" of SDL_Delay(1); is not exactly a nice programming practice...)
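
For comparison, one conventional alternative to polling with a 1 ms delay is to block on a condition variable until the consumer frees a slot. The following is only a sketch of that general technique, not what uzem does: `BlockingSampleQueue` and its members are hypothetical names, and it assumes the consumer runs on a separate audio thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical bounded sample queue (not uzem's audioRing): push() sleeps
// on a condition variable while the queue is full instead of polling
// with a 1 ms delay, and try_pop() wakes the waiting producer.
class BlockingSampleQueue {
public:
    explicit BlockingSampleQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Producer side (emulation thread): blocks until there is room.
    void push(unsigned char sample) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return samples_.size() < capacity_; });
        samples_.push_back(sample);
    }

    // Consumer side (audio thread): returns false when nothing is queued.
    bool try_pop(unsigned char& sample) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (samples_.empty()) return false;
        sample = samples_.front();
        samples_.pop_front();
        not_full_.notify_one();  // a slot was freed, wake a blocked producer
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return samples_.size();
    }

private:
    const std::size_t capacity_;
    mutable std::mutex mutex_;
    std::condition_variable not_full_;
    std::deque<unsigned char> samples_;
};
```

This still throttles the emulation to the audio rate, so it does not change the need to disable sound when benchmarking; it only replaces the polling wait with a blocking one.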

I did further tests, peppering the code with update_hardware_fast() everywhere:

- 68MHz with a plain make (this is how I compiled my branch, resulting in the 78MHz build).
- 61MHz when I make with NOGDB=1 (Yes, strange it is, but it became slower...)
- 58MHz when I make with NOGDB=1 and ARCH=core2 (Even slower!!!).
- 96MHz when building with profiling, playing the first two levels of Arkanoid with the GEN build.

The profiling build managed to pull off a slightly faster result than the one without update_hardware_fast everywhere, but only a very meager 3 percent faster. Judging by this I would ditch it, just as I first ditched my update_hardware-related tweaks when I found that I could only get 61MHz versus the 55MHz without them, and only in very specific circumstances that were not consistent with the changes in the code. For a release build, the thing had better behave consistently.

As for cache sizes, my CPU is the following:

Core 2 Duo T7500 (mobile), 2.2GHz, 4MB L2 cache, 2x32KB 8-way set associative instruction cache, 2x32KB 8-way set associative data cache. So one core has 32K of L1 instruction cache. I am pretty sure the most commonly used code in uzem balances on the edge of this, so as the code inflates with update_hardware_fast() calls, more and more cache misses occur at this level. If you have a larger L1, that would clearly explain the results you experienced (likewise if it were smaller on your Core 2: consistently worse performance, since there is no way the code fits no matter the number of expansions). So it seems that inflating the code size just pounds hard on processors with 32K of L1 instruction cache (maybe others could try to verify this).
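
To put the 32K figure in perspective, the cache geometry can be worked out with a little arithmetic. This is only an illustrative sketch: the helper names are hypothetical, and the 64-byte line size is an assumption that is not stated in the post.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of sets in a set-associative cache, i.e.
// total size divided by (associativity * line size).
std::size_t cache_sets(std::size_t size_bytes, std::size_t ways,
                       std::size_t line_bytes) {
    assert(ways > 0 && line_bytes > 0);
    return size_bytes / (ways * line_bytes);
}

// Hypothetical helper: total number of cache lines available.
std::size_t cache_lines(std::size_t size_bytes, std::size_t line_bytes) {
    assert(line_bytes > 0);
    return size_bytes / line_bytes;
}
```

For a 32 KB, 8-way cache with the assumed 64-byte lines, `cache_sets(32 * 1024, 8, 64)` gives 64 sets and `cache_lines(32 * 1024, 64)` gives 512 lines, so the hot paths of the emulator would have to fit in those 512 lines to avoid instruction cache misses.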
Artcfox

Re: Get that emu faster

Post by Artcfox »

Cool, I'm glad you saw similar gains!
Jubatian wrote: There is something wrong with how the flag set is implemented which you use for measuring performance.
What do you mean? By default it sets some flags, and depending on what options you pass, it can turn off the SDL_RENDERER_PRESENTVSYNC flag, or turn off both SDL_RENDERER_ACCELERATED and SDL_RENDERER_PRESENTVSYNC while turning on SDL_RENDERER_SOFTWARE. If you disable sounds, then it never even hits the while (audioRing.isFull()) code.

Code:

    // init basic flags before parsing args
    uzebox.sdl_flags = SDL_RENDERER_ACCELERATED | SDL_RENDERER_PRESENTVSYNC;

    ...

    case 'n':
        uzebox.enableSound = false;
        break;
    case 'w':
        uzebox.sdl_flags = (uzebox.sdl_flags & ~(SDL_RENDERER_ACCELERATED | SDL_RENDERER_PRESENTVSYNC)) | SDL_RENDERER_SOFTWARE;
        break;
    case 'v':
        uzebox.sdl_flags &= ~SDL_RENDERER_PRESENTVSYNC;
        break;

    ...

    renderer = SDL_CreateRenderer(window, -1, sdl_flags);

    ...

    if (enableSound && TCCR2B)
    {
        // raw pcm sample at 15.7khz
    #ifndef __EMSCRIPTEN__
        while (audioRing.isFull()) SDL_Delay(1);
    #endif // __EMSCRIPTEN__
        SDL_LockAudio();
        audioRing.push(value);
        SDL_UnlockAudio();
        ...
    }
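
The flag handling above can be checked in isolation. Here is a minimal sketch that builds without SDL: the `RENDERER_*` constants are stand-ins redefined for illustration (their values mirror SDL2's renderer flags), and `Options` and `parse_options` are hypothetical names, not uzem's actual argument parser.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in constants so the sketch builds without SDL; the values mirror
// SDL2's SDL_RendererFlags but are redefined here for illustration only.
const std::uint32_t RENDERER_SOFTWARE     = 0x1;
const std::uint32_t RENDERER_ACCELERATED  = 0x2;
const std::uint32_t RENDERER_PRESENTVSYNC = 0x4;

struct Options {
    std::uint32_t sdl_flags;
    bool enableSound;
};

// Hypothetical helper: applies a string of option letters such as "vnw",
// using the same per-letter logic as the excerpt above.
Options parse_options(const char* letters) {
    assert(letters != nullptr);
    // Defaults before parsing: accelerated renderer with vsync, sound on.
    Options o = { RENDERER_ACCELERATED | RENDERER_PRESENTVSYNC, true };
    for (const char* p = letters; *p != '\0'; ++p) {
        switch (*p) {
        case 'n':  // no sound, so the audio wait loop is never reached
            o.enableSound = false;
            break;
        case 'w':  // software renderer: clear accelerated and vsync bits
            o.sdl_flags = (o.sdl_flags &
                           ~(RENDERER_ACCELERATED | RENDERER_PRESENTVSYNC)) |
                          RENDERER_SOFTWARE;
            break;
        case 'v':  // clear only the vsync bit
            o.sdl_flags &= ~RENDERER_PRESENTVSYNC;
            break;
        default:
            break;
        }
    }
    return o;
}
```

Under these assumptions, "vnw" and "wnv" produce the same result (software renderer only, sound off), while "vn" leaves the accelerated renderer enabled without vsync.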
I ran an instrumentation and sampling based profiler, and it didn't find anything that was burning away CPU time other than the AVR emulation (when you profile it this way, be sure to disable link time optimization, disable whole program optimization, and add the -g flag so the profiler has access to the line number information).

Do you still have issues with your GPU when running full screen, or changing the resolution?
Jubatian

Re: Get that emu faster

Post by Jubatian »

Checked, the software renderer (-w) is responsible for this slowdown. It is worse than drawing directly into the window (which didn't have any really noticeable performance impact).

I have no video card problems here, that was my parents' computer far away.
Artcfox

Re: Get that emu faster

Post by Artcfox »

Jubatian wrote: Checked, the software renderer (-w) is responsible for this slowdown. It is worse than drawing directly into the window (which didn't have any really noticeable performance impact).

I have no video card problems here, that was my parents' computer far away.
Oh, I only included the -w because I thought you needed that for it to run. By all means, if just -vn works for you then use that to benchmark. On my core2, I don't think that it has accelerated drivers (it is using Mesa for OpenGL), so adding -w makes it run faster for me.

So what is your final opinion on update_hardware_fast()? 15% faster, and 10% less CPU seems like a big enough improvement to include it everywhere as long as PGO is enabled.
Jubatian

Re: Get that emu faster

Post by Jubatian »

I don't think it should be done. For me, adding update_hardware_fast() everywhere only bumped up performance by 3 percent (see PGO results: 93MHz versus 96MHz), and only with profile guiding (otherwise the results were abysmal). For you, going by this comment from before:
Artcfox wrote: On my desktop, I did notice a slowdown when update_hardware_fast is not called everywhere versus when it's called in the few places that Jubatian chose (225 MHz when called everywhere, versus 215 MHz the way it is now), but I left it the way it is so we have a baseline to build on top of.
That's a 5% increase over the version without update_hardware_fast everywhere. It isn't worth it, especially if the thing really balances on the edge of a 32KB L1 cache (since it is more likely that many people have older CPUs with 32K L1 than, say, 40K or 48K, or 28K at the other end): it would likely deteriorate things for the people for whom performance matters more.