Get that emu faster

The Uzebox now has a fully functional emulator! Download and discuss it here.
Jubatian
Posts: 1569
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Get that emu faster

Post by Jubatian »

Okay with me. I hope to stay around and contribute; this is a nice system to play around with for some decent 8-bit fun :) (Just a note, then: please use my name as it appears on GitHub, which has both the nick and the real name.)
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm

Re: Get that emu faster

Post by Artcfox »

uze6666 wrote:Damn....it's fast!! :D Tested and merged. I will start checking for the remaining timing issues.

PS: Let me know if it's OK to add you guys' names in the sources as contributors.
If you meant me as well, sure.
uze6666
Site Admin
Posts: 4821
Joined: Tue Aug 12, 2008 9:13 pm
Location: Montreal, Canada

Re: Get that emu faster

Post by uze6666 »

Or you can also add your names when you make changes. For me it's important to list contributors.
CunningFellow
Posts: 1488
Joined: Mon Feb 11, 2013 8:08 am
Location: Brisbane, Australia

Re: Get that emu faster

Post by CunningFellow »

OK - I have made a branch of Uzem140 that reduces the switch statement in the AVR8 exec function down to a single level.

It does this by pre-decoding the 32768 FLASH locations into a 32-bit structure of:

Operation number (an arbitrary number that is a uint8)
Argument 1 (the first argument of the instruction, a uint8)
Argument 2 (the second argument, an int16)

(It then also has to pre-decode any words written with the SPM instruction; see the rough sketch below.)
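
Roughly, each pre-decoded entry is something like this (a sketch only; the exact names and layout in the branch may differ):

Code: Select all

// Sketch only -- one pre-decoded slot per flash word.
typedef struct
{
    u8  opNum;   // operation number: arbitrary, alphabetically ordered
    u8  arg1;    // first argument of the instruction
    s16 arg2;    // second argument of the instruction
} instructionDecode_t;   // 8 + 8 + 16 = 32 bits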

On my machine it boosts the speed from 95 MHz to around 103 MHz, so no great gain there.

It does, however, make it easier to implement the "pure" hardware idea I mentioned previously: a 2-stage pipeline for instruction fetch, like the real silicon. That should make it easier to make things cycle perfect, and as a bonus it should make writing to the software surface a bit faster.

It's a bit of a mess at the moment and things are just shoved in random places, but if people could have a look and say whether they think it is a useful idea, that would be great.

https://github.com/andrewm1973/uzebox/t ... pre-decode
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm

Re: Get that emu faster

Post by Artcfox »

CunningFellow wrote:OK - I have made a branch of Uzem140 that reduces the switch statement in the AVR8 exec function down to a single level.

It does this by pre-decoding the 32768 FLASH locations into a 32-bit structure of:

Operation number (an arbitrary number that is a uint8)
Argument 1 (the first argument of the instruction, a uint8)
Argument 2 (the second argument, an int16)

(It then also has to pre-decode any words written with the SPM instruction.)

On my machine it boosts the speed from 95 MHz to around 103 MHz, so no great gain there.

It does, however, make it easier to implement the "pure" hardware idea I mentioned previously: a 2-stage pipeline for instruction fetch, like the real silicon. That should make it easier to make things cycle perfect, and as a bonus it should make writing to the software surface a bit faster.

It's a bit of a mess at the moment and things are just shoved in random places, but if people could have a look and say whether they think it is a useful idea, that would be great.

https://github.com/andrewm1973/uzebox/t ... pre-decode
I'm getting around a 5% gain in speed on the Core 2. Jubatian, could you maybe take a look at this and see if you can work your x86 asm magic on it to improve it even more?
CunningFellow
Posts: 1488
Joined: Mon Feb 11, 2013 8:08 am
Location: Brisbane, Australia

Re: Get that emu faster

Post by CunningFellow »

Jubatian, if you do want to look at it: I have not commented it or made it very neat, because it was just a proof of concept.

How it works is that after the UZE/HEX file is loaded, another function is called that runs through all 32768 flash locations and decodes each instruction.

Because the 64-bit x86 CPU does not have to squeeze all 86 AVR opcodes into a 16-bit word, it can have the opcode numbers all in line (and I chose alphabetical order).

For example

ST Z+q, rd

which is encoded in the AVR ISA as

10q0 qq1d dddd 0qqq

is the 81st opcode (alphabetically), and it has two arguments, q and rd.

It ends up as:

instruction.opNum = 81;
instruction.arg1 = (insn & 7) | ((insn >> 7) & 0x18) | ((insn >> 8) & 0x20); // this is q
instruction.arg2 = ((insn >> 4) & 31); // this is rd

All the pre-decoded instructions are stored in progmemDecoded[], a 32768-slot array "parallel" to the progmem[] array. The original progmem[] array is now only used for LPM instructions.

The main switch statement is over a simple linear list of opcode numbers, and the arguments are all already shifted/ORed and added before any execution happens.
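
Just to show the shape of it (a sketch only, not the exact code in the branch), the pre-decode pass is basically one loop over the flash, with an opcode-matching branch per instruction pattern, writing the result into the parallel array:

Code: Select all

// Sketch only: the real pre-decode function covers every opcode
// pattern; this just shows the shape of it using the ST Z+q example.
for (unsigned int addr = 0; addr < 32768; addr++)
{
    u16 insn = progmem[addr];
    instructionDecode_t d = {};    // start from an empty entry (sketch)

    if ((insn & 0xD208) == 0x8200) // 10q0 qq1d dddd 0qqq -> ST Z+q, rd
    {
        d.opNum = 81;
        d.arg1  = (insn & 7) | ((insn >> 7) & 0x18) | ((insn >> 8) & 0x20); // q
        d.arg2  = (insn >> 4) & 31;                                         // rd
    }
    // ...else-if branches for the remaining opcode patterns go here...

    progmemDecoded[addr] = d;
}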
Jubatian
Posts: 1569
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Get that emu faster

Post by Jubatian »

Dammit! The boons of having a Harvard architecture! :D

The idea is indeed great; it just needs proper testing and verification. I even had this in mind a long time ago when building my own emulator, but there I planned the instruction set so that it naturally had a single function table (switch), and I forgot that the same could be done here! Skimming through it, the overall idea seems to be implemented properly: a pre-decoding pass over the flash, then running execution over the resulting "microcode". For now its components look nicely and properly engineered (even the components of each instruction are pre-decoded, to exploit that possibility as well).

Some merging should probably be done beforehand; I also submitted a pull request which again did some larger reworking of the instruction decoder to achieve cleaner cycle perfection.

This change might also introduce some undefined behavior. Please check all the new signed variables (such as the uses of arg2) for whether your code depends on overflow or similar things there, which are undefined by the C / C++ language specifications! (For now it looks like it may be OK, but if possible I would rather avoid signed types in code that also wants to rely on wrapping in places.) The instruction table also contains 32K entries, but the program counter has a 16-bit range (on bare metal it is capable of addressing 128K of flash). Buggy or weird AVR code that somehow relies on pushing the PC past the 32K barrier would crash this implementation as it is now.
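
One simple way to guard against that last point (just an illustration, not something in the pull request): mask the PC at the point where the decoded entry is fetched, which should roughly match how the address wraps on a part with 32K words of flash.

Code: Select all

// Illustration only (not from the pull request): clamp the PC before
// indexing the 32768-entry decoded table, roughly mimicking how the
// address would wrap on a device with 32K words of flash.
currentPc = pc;
const instructionDecode_t insnDecoded = progmemDecoded[pc & 0x7FFF];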
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm

Re: Get that emu faster

Post by Artcfox »

This change makes it 6.5 MHz faster for me:

Code: Select all

diff --git a/tools/uzem/avr8.cpp b/tools/uzem/avr8.cpp
index ae90524..1e73f9f 100644
--- a/tools/uzem/avr8.cpp
+++ b/tools/uzem/avr8.cpp
@@ -1059,14 +1059,14 @@ u8 avr8::exec()
 
        currentPc=pc;
        const instructionDecode_t insnDecoded = progmemDecoded[pc];
-       const u8  opNum  = insnDecoded.opNum;
-       const u8  arg1_8 = insnDecoded.arg1;
-       const s16 arg2_8 = insnDecoded.arg2;
+       const u32  opNum  = insnDecoded.opNum;
+       const u32  arg1_8 = insnDecoded.arg1;
+       const s32 arg2_8 = insnDecoded.arg2;
 
        cycles = 1;                             // Most insns run in one cycle, so assume that
-       u8 Rd, Rr, R, CH;
-       u16 uTmp, Rd16, R16;
-       s16 sTmp;
+       u32 Rd, Rr, R, CH;
+       u32 uTmp, Rd16, R16;
+       s32 sTmp;
 
        //GDB must be first
        if (enableGdb == true)
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm

Re: Get that emu faster

Post by Artcfox »

Jubatian wrote:Some merging should probably be done beforehand; I also submitted a pull request which again did some larger reworking of the instruction decoder to achieve cleaner cycle perfection.
Does your PR fix cycle-perfection bugs, or does it just make the code cleaner? These changes slowed it down by ~5 MHz for me, and I didn't just compare the numbers (which you said may differ): I ran two different versions of the emulator at the same time, one above the other, and I could see things moving back and forth more slowly. If it's fixing bugs, then I think the tradeoff is probably worth it. The only reason I ask is that I'm trying to get the web version to run better on more devices, and even minor slowdowns really impact the speed of the web version.

I built a web version using CunningFellow's changes (merged with the upstream uzem140 branch) plus the 32-bit integer changes I described above, and it runs decently, but not quite at 100% in Firefox on my 2.0 GHz Core 2. That's pretty good considering the 50% performance hit of Emscripten versus native. As a comparison, the native version of the same code runs at 43 MHz on the 2.0 GHz Core 2 (when passed the -nw flags for no sound and software rendering).

I very much appreciate the good work you're doing. Code cleanups often unlock even more opportunities for optimization! :)
CunningFellow
Posts: 1488
Joined: Mon Feb 11, 2013 8:08 am
Location: Brisbane, Australia

Re: Get that emu faster

Post by CunningFellow »

Awesome.

I am not as dumb a C programmer as I look :)

I guess someone who knows better than I do will implement this idea better/neater, yes?

After this has all settled down a bit, I will try to knock up a proof of concept for my idea of making the AVR "core" a two-stage pipeline, which will make "cycle perfect" easier and should hopefully speed up the video rendering.

One of the reasons I wanted to make the "differently microcoded" instructions was as part of my pipelined cycle-perfect changes. The multi-clock instructions can actually have internal hidden ticks. That way every internal instruction is one clock cycle, and actual real opcodes can take multiple clock cycles by being made up of smaller internal instructions.

For example

MUL would be 2 ticks long.

The pre-decoded flash table would contain MUL_TICK_1. The first tick would do all the hard work and then load the "next instruction" with MUL_TICK_2. On the next pass through the switch statement, MUL_TICK_2 would do nothing but set a flag saying the instruction pipeline is allowed to fetch the "next instruction" from progmem for the following tick.

RETI would be 4 ticks long and have 4 different internal "microcode" instructions: RETI_tick1, RETI_tick2, RETI_tick3, RETI_tick4.

The execute loop always pulls the instruction to execute from "next instruction" rather than directly from flash. The only thing that is allowed to act on an interrupt is the part that loads an instruction from flash, and it only ever loads the interrupt vector into the "next instruction".

That should match how things work in the silicon, so things like "interrupts being serviced the cycle AFTER the flag bit is set" and "the CPU always executing one normal instruction between any two interrupts" would happen inherently.


This means that multi-clock instructions would have to go through the switch statement multiple times, but the "update hardware" step could be cleaner. I'm punting a guess that the cleaner "update_hardware" would make up for the multiple passes.
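
Something like this is what I have in mind for the switch statement itself (a sketch only; MUL_TICK_1 / MUL_TICK_2 are the microcode names from above, while r[], nextInsn and fetchAllowed are just stand-ins for the pipeline plumbing, and SREG flag updates are left out for brevity):

Code: Select all

// Sketch only: MUL split into two one-cycle "microcode" ticks.
case MUL_TICK_1:
{
    u16 product = (u16)r[arg1_8] * (u16)r[arg2_8];
    r[0] = (u8)(product & 0xFF);   // low byte of the result goes to r0
    r[1] = (u8)(product >> 8);     // high byte goes to r1
    nextInsn.opNum = MUL_TICK_2;   // queue the hidden second tick
    fetchAllowed = false;          // pipeline may not fetch from flash yet
    break;
}

case MUL_TICK_2:
    fetchAllowed = true;           // its only job: let the fetch stage pull
    break;                         // the next real instruction (or vector)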