Get that emu faster

The Uzebox now have a fully functional emulator! Download and discuss it here.
User avatar
Jubatian
Posts: 1563
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary
Contact:

Re: Get that emu faster

Post by Jubatian »

What you describe should essentially have very similar end effect to what I did as clean-up. I would except breaking this up the way you describe (CunningFellow) would even be slower without the benefit of microcoding than what I did.

The most characteristic effect of my change is that update_hardware no longer processes multiple cycles, just one, and it is inserted in the decoding process of instructions where it is appropriate. The observed 5% slow-down (if that's really true, and isn't just the effect of the sloppy merge making a little mess around the pixel output) comes from that many frequently used instructions (ld and st) take 2 cycles, and in the original if they accessed RAM, those were "packed" together, processed in one go (both by the decoder and update_hardware).

My changes doesn't break the process of instruction decoding, only update_hardware. Your proposed changes also break up the instruction decoding, which will likely introduce a considerable hit in performance since for every ld and st you will have to do the "which instruction to process" decision twice. And from what I deducted from graphics modes, their code, these are about the most frequently used instructions along with "lpm" (3 cycles). Of course if you think it worths a try, by all means do so, I just think it wouldn't work that well based on what I experienced myself, how the code reacted to my changes.

CunningFellow: Your code is not bad. Just my expectations are usually a bit steep (my job requires such attitude, very strict standards to adhere etc, and I keep noticing potential hazards, fragile code and likes).
User avatar
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm
Contact:

Re: Get that emu faster

Post by Artcfox »

Artcfox wrote:This change makes it 6.5 MHz faster for me:

Code: Select all

diff --git a/tools/uzem/avr8.cpp b/tools/uzem/avr8.cpp
index ae90524..1e73f9f 100644
--- a/tools/uzem/avr8.cpp
+++ b/tools/uzem/avr8.cpp
@@ -1059,14 +1059,14 @@ u8 avr8::exec()
 
        currentPc=pc;
        const instructionDecode_t insnDecoded = progmemDecoded[pc];
-       const u8  opNum  = insnDecoded.opNum;
-       const u8  arg1_8 = insnDecoded.arg1;
-       const s16 arg2_8 = insnDecoded.arg2;
+       const u32  opNum  = insnDecoded.opNum;
+       const u32  arg1_8 = insnDecoded.arg1;
+       const s32 arg2_8 = insnDecoded.arg2;
 
        cycles = 1;                             // Most insns run in one cycle, so assume that
-       u8 Rd, Rr, R, CH;
-       u16 uTmp, Rd16, R16;
-       s16 sTmp;
+       u32 Rd, Rr, R, CH;
+       u32 uTmp, Rd16, R16;
+       s32 sTmp;
 
        //GDB must be first
        if (enableGdb == true)
So this change introduces some bugs, which I'll try to track down and fix. The Mode13ExtendedDemo.hex ends up flickering with these changes.

Edit: specifically it's this one line:

Code: Select all

u32 Rd, Rr, R, CH;
changing that one line back to:

Code: Select all

u8 Rd, Rr, R, CH;
fixes it.
User avatar
Jubatian
Posts: 1563
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary
Contact:

Re: Get that emu faster

Post by Jubatian »

I though about doing the same in my run of performance optimizations, but I discarded it. This is far from being a trivial change, and I have experience on it (my RRPGE emulator's 16 bit CPU's ALU is realized with 32+ bit unsigned integers). For most part it works, but there are a few instructions, such as right shifts, signed multiplies, possibly fetching of the carry of an add or sub, a zero check, which will break unless the operands are properly masked to 8 bits. Having a 8 bit type performs these masks (and, of course, forces the compiler to generate code everywhere in such a manner, even where unnecessary), which are lost if you simply change the type.

So getting there needs going through the entire instruction decoder fixing every instruction proper. That's a big change (bigger than just moving around the instructions for rearranging the decoder, or doing CunningFellow's pre-decoding of the instruction word).
User avatar
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm
Contact:

Re: Get that emu faster

Post by Artcfox »

Jubatian wrote:I though about doing the same in my run of performance optimizations, but I discarded it. This is far from being a trivial change, and I have experience on it (my RRPGE emulator's 16 bit CPU's ALU is realized with 32+ bit unsigned integers). For most part it works, but there are a few instructions, such as right shifts, signed multiplies, possibly fetching of the carry of an add or sub, a zero check, which will break unless the operands are properly masked to 8 bits. Having a 8 bit type performs these masks (and, of course, forces the compiler to generate code everywhere in such a manner, even where unnecessary), which are lost if you simply change the type.

So getting there needs going through the entire instruction decoder fixing every instruction proper. That's a big change (bigger than just moving around the instructions for rearranging the decoder, or doing CunningFellow's pre-decoding of the instruction word).
No it doesn't because the pre-decoder still uses 8-bit types internally. ;) This only performs the upcast on the temporaries used inside exec() based on the 8-bit values from the pre-decoder.

I just tested CunningFellow's branch (without this change) and it runs 6 MHz faster than the branch in your PR. With the change, it runs 11 MHz (10 %) faster. That should make a big difference for the web version, don't you think?
User avatar
Jubatian
Posts: 1563
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary
Contact:

Re: Get that emu faster

Post by Jubatian »

Arctfox wrote:No it doesn't because the pre-decoder still uses 8-bit types internally. ;)
I know, but that's still pretty much enough to break stuff in subtle manners. It makes things only worse that the variables are global to the exec function, so you can only change their type to all the instructions, so you can't go step by step examining stuff, avoiding the change for dubious instructions. The biggest risk is that it breaks some rarely used feature (such as for example the half-carry flag, well, I don't think this case the half-carry could be broken, but imagine a change which introduces a bug to that flag, and how long, how many unrelated commits it may take until the bug surfaces).

If you wish, I could jump in the thing and go through the decoder step by step verifying and weeding out potential undefined behaviors, but that's a long tedious process. Right now I am pretty much eager to work on a few graphics modes, it just started becoming really fun as I compile, test, and see what works and where I missed one damn cycle :) (I tweaked the emu a bit to report exact locations of sync edges for me).
CunningFellow
Posts: 1445
Joined: Mon Feb 11, 2013 8:08 am
Location: Brisbane, Australia

Re: Get that emu faster

Post by CunningFellow »

Jubatian wrote:Right now I am pretty much eager to work on a few graphics modes, it just started becoming really fun as I compile, test, and see what works and where I missed one damn cycle :) (I tweaked the emu a bit to report exact locations of sync edges for me).
Yes - the video modes are where the fun is at. I should get back to working on the next thing to blow the T2K video mode out of the water :)
User avatar
Jubatian
Posts: 1563
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary
Contact:

Re: Get that emu faster

Post by Jubatian »

I am looking at the pre-decoding branch done by CunningFellow.

The biggest hit in performance (that is where the most could be gained) in my opinion could hide along these lines:

Code: Select all

	const u8  opNum  = insnDecoded.opNum;
	const u8  arg1_8 = insnDecoded.arg1;
	const s16 arg2_8 = insnDecoded.arg2;
This is on the top of avr8::exec. The problem is the use of non-native types which usually generates lots of code (like Arctfox experienced through the performance boost when blindly changing stuff to unsigned 32 bit integers). The problem start with the microcode RAM, in there instead of the structure, simply an "unsigned int" should sit, which is broken up proper on the head of avr8::exec. Using non-native types for the opNum, arg1_8, and arg2_8 variables are also part of this problem. From this opNum could be changed to "unsigned int" without problem, the other two need digging in the code. In most parts they are used to address arrays, where the CPU has to convert them back to unsigned ints.

If they were all unsigned, it would all work just fine (since the decoder doesn't write result arithmetic in them, just using them as sources). The big nasty problem is arg2_8, which is a signed int, which even now could trigger undefined behavior at places. Changing this to any other type is not trivial. In the original code, the "k7" and "k12" macros used for relative jumps are at least "implementation defined behavior" by the C standard (shifting a negative value to the right may either use an arithmetic or logical shift on an x86, it relies on them using an arithmetic shift). Anyway, for the cleanest result, the signedness of arg2_8 should be ditched for good (rather calculating those parameters so they produce the intended result with unsigned wrapping arithmetic). Note that if done right, the 16 bit program counter will behave the same even if the value added to it is wider than 16 bits with only the low 16 bits set up correctly (all bits above are irrelevant when it comes to adding and subtracting).

Once there, the program counter logic also should be fixed to use a 0x7FFFU mask as proper for the 64K of flash (and so eliminating a possibility of a segfault from weird or messed up avr code jumping to the high part of the flash). Then it can also have a native (unsigned int) type.

Why "unsigned int" by the way? (And not "u32" and likes) The current set of types used within the program doesn't contain any typedef which can always be native. By using a fixed 32 bit width, you could cripple 64 bit architectures to which working with 32 bits needs slower operations. The x86-64 is not like that: it is engineered to be capable to work just as fast with 32 bit types as 64 bit types in its 64 bit mode (by zero-extending), so even GCC has 32 bit unsigned ints for it (since they are just as fast, and likely this decision caused the breaking of less already existing code when ported to 64 bits).

Also, why the strange "0x7FFFU" and likes, with "U" suffix? Since otherwise by the rules of the C language, the constants are signed, and will promote the operands to signed int. This may stealthily introduce undefined behavior, which could cause disaster in very subtle ways (like the old ubiquitous example of the case of the "x + 1 > x" comparison completely eliminated as evaulating to TRUE by GCC in a piece of code expecting wrapping arithmetic). Using "U" siffixes ensure that things stay unsigned, and for them wrapping arithmetic is defined by the C standard.
User avatar
uze6666
Site Admin
Posts: 4801
Joined: Tue Aug 12, 2008 9:13 pm
Location: Montreal, Canada
Contact:

Re: Get that emu faster

Post by uze6666 »

I merged your last PR and with all those improvements T2K runs at least 100% now. :D File I/O certainly has an impact on performance but with the code improvement it compensated just enough. While checking out the code, I was fed up of seeing all the SD code in there that just makes more scrolling and things to check. So I moved it to the SDEmulator.c file. Perhaps a bit too much, since some SPI stuff too. But it's a start to cleanup the core. All these different ways with u8 and uint8_t etc also annoyed me for a while, so let's have a common style and all use the uint8_t etc. I'll update that to to the uzem140 branch tonight.

Edit: pushed those changes. Please confirm it did not break anything on Linux.
User avatar
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm
Contact:

Re: Get that emu faster

Post by Artcfox »

uze6666 wrote:I merged your last PR and with all those improvements T2K runs at least 100% now. :D File I/O certainly has an impact on performance but with the code improvement it compensated just enough. While checking out the code, I was fed up of seeing all the SD code in there that just makes more scrolling and things to check. So I moved it to the SDEmulator.c file. Perhaps a bit too much, since some SPI stuff too. But it's a start to cleanup the core. All these different ways with u8 and uint8_t etc also annoyed me for a while, so let's have a common style and all use the uint8_t etc. I'll update that to to the uzem140 branch tonight.

Edit: pushed those changes. Please confirm it did not break anything on Linux.
I don't see any of the variable type changes, nor the refactoring pushed to the uzem140 branch, but I have spent quite a few hours manually merging things, so is it reasonable to ask that you don't refactor or change the variable types all over the place until after we get CunningFellow's changes merged into the uzem140 branch?
User avatar
uze6666
Site Admin
Posts: 4801
Joined: Tue Aug 12, 2008 9:13 pm
Location: Montreal, Canada
Contact:

Re: Get that emu faster

Post by uze6666 »

I didn't push, just commit locally. But OK I'll wait for the types changes. Do you have issues that I push the SD stuff?
Post Reply