Jubatian's kernel hackery

Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm

Re: Jubatian's kernel hackery

Post by Artcfox »

I just tried compiling Bugz using this kernel, and it does save some cycles over the previous mode 3! I might have to use this for a Bugz sequel in order to have more cycles available for AI. :)
D3thAdd3r
Posts: 3221
Joined: Wed Apr 29, 2009 10:00 am
Location: Minneapolis, United States

Re: Jubatian's kernel hackery

Post by D3thAdd3r »

Nice, I will have to try it out for non-scrolling stuff. BTW, Bugz 2 would be awesome, the first Uzebox sequel even :ugeek:
Jubatian
Posts: 1561
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Jubatian's kernel hackery

Post by Jubatian »

Artcfox: Just a few questions then. Did you use my previous kernel boost hack before? (That gives about 2000 VBlank cycles without fundamentally changing anything except the timer's config, which also requires readjusting Mode 13's timing.) The approximately 4000 cycles this kernel gives should provide about a 6-7% improvement in usable VBlank time: not a huge thing, but likely noticeable (at least it is in my own game). Of course, it is only 2000 cycles if you are switching from the previous kernel boost hack. What about mixing? Do you use the inline mixer? If so, do you use the PCM channel? (I guess not, but you should check whether it is enabled, as disabling it would regain you 2000 VBlank cycles, even a little more than with the original kernel.)
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm

Re: Jubatian's kernel hackery

Post by Artcfox »

This was the first time I used a modified kernel.

I didn't really do a scientific benchmark. For the unmodified kernel, I just added a bunch of delays at the end of my main loop and used the wdr instruction to measure the number of clocks between calls to wdr. I found that if my main loop used more than about 11,200 clocks, it caused me to miss vsync, as evidenced by the measured clocks suddenly jumping from around 11,200 to 39,000+.

I was able to add more delays to my main loop while using your kernel boost, but it still jumps up to 39,000+ clocks when I get above 11,600 clocks, so I'm not sure how much it is really saving me. I would have expected that I could use 4000 more clocks in my main loop before missing the next vsync, but it seems that I only got roughly 400.

I use the inline mixer, and I don't use PCM. I believe I have that disabled, along with the noise channel.
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm

Re: Jubatian's kernel hackery

Post by Artcfox »

My Makefile has:

KERNEL_OPTIONS  = -DVIDEO_MODE=3 -DINTRO_LOGO=0 -DSCROLLING=0 -DSOUND_MIXER=1 -DSOUND_CHANNEL_5_ENABLE=0
KERNEL_OPTIONS += -DMAX_SPRITES=12 -DRAM_TILES_COUNT=36 -DSCREEN_TILES_V=28
KERNEL_OPTIONS += -DMIXER_WAVES=\"$(MIX_PATH_ESC)\"

And the full source of Bugz is here: https://github.com/artcfox/bugz
Jubatian
Posts: 1561
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Jubatian's kernel hackery

Post by Jubatian »

I see you use a stack monitor: good decision! (It is not strictly necessary for games, but it can make your life a lot easier if you happen to overrun the stack, since this way you can identify the fault.)

What you see there as a big jump in cycle count isn't the end of the frame. A full frame, if the video display runs correctly, takes 262 * 1820 = 476840 clocks. If you use a 224-line-tall screen, you have 38 - 1 lines for VBlank (as the video mode enters one cycle in advance), which is 67340 cycles. In Mode 3 a sprite engine renders your sprites, and this happens as a VSync task. The VSync tasks run at the end of the display lines, a couple of lines after the end of the video mode. So your user VBlank time is split into two halves, one before the VSync tasks and one after.

If I take it that you tried to cram your main loop in before that 11200-cycle barrier, that would mean you left the second half of the VBlank time completely unused. So you have lots of cycles there! You miss a frame only when the cycle count jumps by around 400K clocks. Of course, you should probably determine sprite locations before the 11200-cycle barrier, since after that their positions will lag a frame (I don't particularly like the kernel's method of producing sprites because of this).
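
The frame arithmetic above can be checked with a small host-side C sketch. The constants are the figures quoted in the post; the function names are made up for illustration and are not part of the Uzebox kernel.

```c
#include <assert.h>
#include <stdint.h>

enum {
    LINES_PER_FRAME = 262,  /* scanlines in one full frame (from the post) */
    CLOCKS_PER_LINE = 1820, /* CPU clocks per scanline (from the post) */
    DISPLAY_LINES   = 224   /* visible lines (from the post) */
};

/* Clocks in one complete video frame. */
static uint32_t frame_clocks(void)
{
    return (uint32_t)LINES_PER_FRAME * (uint32_t)CLOCKS_PER_LINE;
}

/* Clocks in VBlank for a given display height. One line is subtracted
   because the video mode is entered early, as described in the post. */
static uint32_t vblank_clocks(uint32_t display_lines)
{
    assert(display_lines < (uint32_t)LINES_PER_FRAME);
    return ((uint32_t)LINES_PER_FRAME - display_lines - 1u)
           * (uint32_t)CLOCKS_PER_LINE;
}

/* Usage example: reproduces the two figures quoted in the post. */
static void frame_budget_self_check(void)
{
    assert(frame_clocks() == 476840u);
    assert(vblank_clocks(DISPLAY_LINES) == 67340u);
}
```

These are raw line counts multiplied out. The cycles actually available to user code will be lower, since the VSync tasks and any per-line interrupt work also run inside VBlank.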
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm

Re: Jubatian's kernel hackery

Post by Artcfox »

Jubatian wrote: I see you use a stack monitor: good decision! (It is not strictly necessary for games, but it can make your life a lot easier if you happen to overrun the stack, since this way you can identify the fault.)
Yeah, I added all kinds of defensive measures and extra checks early on when I was trying to track down a stuck note issue, which we then discovered to be multiple race conditions inside the kernel's sound code. Once those got fixed, I kept all the defensive measures in the code for my sanity.
Jubatian wrote: What you see there as a big jump in cycle count isn't the end of the frame. A full frame, if the video display runs correctly, takes 262 * 1820 = 476840 clocks. If you use a 224-line-tall screen, you have 38 - 1 lines for VBlank (as the video mode enters one cycle in advance), which is 67340 cycles. In Mode 3 a sprite engine renders your sprites, and this happens as a VSync task. The VSync tasks run at the end of the display lines, a couple of lines after the end of the video mode. So your user VBlank time is split into two halves, one before the VSync tasks and one after.
Seriously!?
Jubatian wrote: If I take it that you tried to cram your main loop in before that 11200-cycle barrier, that would mean you left the second half of the VBlank time completely unused. So you have lots of cycles there! You miss a frame only when the cycle count jumps by around 400K clocks. Of course, you should probably determine sprite locations before the 11200-cycle barrier, since after that their positions will lag a frame (I don't particularly like the kernel's method of producing sprites because of this).
Yes, that is exactly what I did! I optimized everything to run the average case in under 8000 cycles, so the pathological worst case of multiple collisions and both players jumping at the same time didn't push the cycles over that 11,200 cycle barrier.

When I cross that 11,200 barrier it does feel like the game lags, so I assumed that it was missing vsync. (Do you think this may be due to the sprite positioning code?) Also, going over that barrier caused the kernel to interrupt my user code, which would push extra stuff on the stack and sometimes blow my stack, hence the stack painting code.

So you're saying that I can use this extra time, and still update everything at 60fps? That would mean that I could do so much more! Can I truly do it from user code, or do I need to schedule a pre-vsync callback?
Jubatian
Posts: 1561
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Jubatian's kernel hackery

Post by Jubatian »

Yes, VBlank works that way as far as the kernel is concerned. I thought about moving the VSync tasks to the return from the video mode in my kernel hack (so you would get one continuous chunk of VBlank without interruption), but for now, for compatibility reasons, I left them there (there are valid reasons to have them there, but probably the best option would be to let the user call them himself as his code requires).

It can blow the stack since it calls a bunch of C code, for which it pushes all non-call-saved registers on top of what is used for the interrupt (this is 17 bytes total: PC, SREG, r0, r1, r18-r27, ZL, ZH). The called C routines are also sloppy with stack usage, and this may happen while your own code is also deep in the stack (video mode entry is more prone to blow it, though, as it takes nearly 45 bytes, but by then your main code should normally be finished). Using my emulator you can now also monitor your stack usage visually (just observe the appropriate location in the memory access blocks at the bottom).
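
The stack painting mentioned earlier in the thread can be illustrated with a host-side C sketch. This is only a simulation on a plain byte array, not code from Bugz or the kernel: the region size, the marker value, the 60-byte main-code depth, and all function names are assumptions. Only the 17-byte interrupt frame comes from the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_SIZE      128u  /* hypothetical size of the stack region */
#define PAINT_BYTE      0xA5u /* arbitrary marker value */
#define IRQ_FRAME_BYTES 17u   /* PC, SREG, r0, r1, r18-r27, ZL, ZH (from the post) */

/* Fill the whole region with the marker before any code uses it. */
static void stack_paint(uint8_t *region, size_t size)
{
    assert(region != NULL);
    memset(region, PAINT_BYTE, size);
}

/* The AVR stack grows downward, so the low end of the region is touched
   last. Counting intact markers from the low end gives the margin that
   was never used. On real hardware this can overcount slightly if stack
   data happens to equal the marker. */
static size_t stack_unused(const uint8_t *region, size_t size)
{
    size_t n = 0;
    assert(region != NULL);
    while (n < size && region[n] == PAINT_BYTE) {
        n++;
    }
    return n;
}

/* Simulate `depth` bytes being pushed from the top of the region. */
static void simulate_usage(uint8_t *region, size_t size, size_t depth)
{
    for (size_t i = 0; i < depth && i < size; i++) {
        region[size - 1u - i] = 0x00u;
    }
}

/* Usage example: 60 bytes of (hypothetical) main-code stack plus the
   17-byte interrupt frame leave 128 - 77 = 51 bytes untouched. */
static void stack_paint_self_check(void)
{
    uint8_t region[STACK_SIZE];
    stack_paint(region, sizeof region);
    simulate_usage(region, sizeof region, 60u + IRQ_FRAME_BYTES);
    assert(stack_unused(region, sizeof region) == 51u);
}
```

On the real target the region would be the RAM between the end of static data and the initial stack pointer, and the check would run periodically rather than once.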

The lagging is a logical consequence of the VBlank code doing the sprite processing. If your code completes before it, the sprites will be placed accordingly on the next frame; if it can't, then some or all sprites may lag a frame (depending on how much of your code generating sprite locations spilled over).

The pre-vsync callback is called from there, before the sprite processing. So whatever you do in that callback will be visible in the next frame (if the sprite processing can complete, of course). If you moved everything there, though, you would lose those 11200 cycles.

The controllers are also read within that VBlank block, so to improve responsiveness, you might want to process them after it (the pre-vsync callback is executed before they are read, and there is nothing between the read and the video mode's sprite processing, so with the current kernel there is no way to get the fastest theoretical response).

So it might take some planning to decide what to do and when. Just to note: the vsync counter is incremented within this kernel code, so you can synchronize to it if you want (you don't need a pre-vsync callback and a state variable if that is all you want to accomplish).
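
One way to use a frame counter for this kind of synchronization is to sample it at the same point in every main loop iteration and compare successive readings. The sketch below is host-side C and assumes a free-running 16-bit counter; the counter width and the function names are assumptions, not the kernel's actual interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Frames that passed between two readings of a free-running 16-bit
   counter. Unsigned subtraction wraps, so this stays correct when the
   counter rolls over from 65535 to 0. */
static uint16_t frames_elapsed(uint16_t before, uint16_t after)
{
    return (uint16_t)(after - before);
}

/* If the counter is sampled once per main loop iteration, a difference
   greater than one means at least one frame went by unprocessed. */
static bool frame_was_missed(uint16_t before, uint16_t after)
{
    return frames_elapsed(before, after) > 1u;
}

/* Usage example with made-up counter readings. */
static void vsync_counter_self_check(void)
{
    assert(frames_elapsed(10u, 11u) == 1u);
    assert(frames_elapsed(65535u, 0u) == 1u); /* rollover */
    assert(!frame_was_missed(10u, 11u));
    assert(frame_was_missed(10u, 12u));
}
```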
Artcfox
Posts: 1382
Joined: Thu Jun 04, 2015 5:35 pm

Re: Jubatian's kernel hackery

Post by Artcfox »

Jubatian wrote: Yes, VBlank works that way as far as the kernel is concerned. I thought about moving the VSync tasks to the return from the video mode in my kernel hack (so you would get one continuous chunk of VBlank without interruption), but for now, for compatibility reasons, I left them there (there are valid reasons to have them there, but probably the best option would be to let the user call them himself as his code requires).
Would it be possible to do this without making the position of the sprites lag by a frame?
Jubatian wrote: It can blow the stack since it calls a bunch of C code, for which it pushes all non-call-saved registers on top of what is used for the interrupt (this is 17 bytes total: PC, SREG, r0, r1, r18-r27, ZL, ZH). The called C routines are also sloppy with stack usage, and this may happen while your own code is also deep in the stack (video mode entry is more prone to blow it, though, as it takes nearly 45 bytes, but by then your main code should normally be finished). Using my emulator you can now also monitor your stack usage visually (just observe the appropriate location in the memory access blocks at the bottom).
Is the C code it calls the pre-vsync callback, or can it be optimized to save more cycles?
Jubatian wrote: The lagging is a logical consequence of the VBlank code doing the sprite processing. If your code completes before it, the sprites will be placed accordingly on the next frame; if it can't, then some or all sprites may lag a frame (depending on how much of your code generating sprite locations spilled over).

The pre-vsync callback is called from there, before the sprite processing. So whatever you do in that callback will be visible in the next frame (if the sprite processing can complete, of course). If you moved everything there, though, you would lose those 11200 cycles.

The controllers are also read within that VBlank block, so to improve responsiveness, you might want to process them after it (the pre-vsync callback is executed before they are read, and there is nothing between the read and the video mode's sprite processing, so with the current kernel there is no way to get the fastest theoretical response).

So it might take some planning to decide what to do and when. Just to note: the vsync counter is incremented within this kernel code, so you can synchronize to it if you want (you don't need a pre-vsync callback and a state variable if that is all you want to accomplish).
I've looked at the diagram for how the vblank thing works, but for some reason I'm just not understanding it fully. Does it interrupt after 11,200 clocks to do the sprite processing? Can we skew this a bit so it interrupts later as long as we know that the sprite processing can still complete?

Being able to do a bunch of work in the second half seems great, but most of what I'd want to do (more complex physics engine) would need to happen before the sprites are positioned, so anything that can give more cycles before that would be a huge win.
Jubatian
Posts: 1561
Joined: Thu Oct 01, 2015 9:44 pm
Location: Hungary

Re: Jubatian's kernel hackery

Post by Jubatian »

So what about those sprites.

The biggest problem with them is that you can't know in advance how long the sprite processing will take.

In Mode 74, and in my UCC game which I am designing, I solved these problems by eliminating the VSync video tasks. Mode 74 doesn't have a sprite engine, but rather a simpler blitter which automatically allocates RAM tiles. You blit transparent 8x8 tiles with it from user code. I also use the frame reset concept, which means that after every video frame the user code begins at a fixed address with an empty stack (so anything that runs past the end of the VBlank is never processed). The blitter is designed to handle this: it can be halted anywhere in its processing without corrupting the display (only some things will be missing, but the video mode overall keeps working, and RAM tile allocation and restoration can't end up corrupted). I prioritize my sprites to work with this concept, so only less important things may end up missing from a frame if there was no time to render them. Of course this is truly as complex as it sounds.
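
The effect of prioritizing sprites under a hard time limit can be modeled with a small host-side C sketch. This is not how Mode 74 is implemented: a real frame reset simply cuts the user code off, whereas the sketch uses an explicit cycle budget to show which sprites would make it into the frame. The struct, the cost figures, and the function name are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  priority; /* lower value = more important (assumed convention) */
    uint16_t cost;     /* estimated cycles to blit (made-up numbers) */
    uint8_t  drawn;    /* set to 1 if blitted this frame */
} Sprite;

/* Blit sprites in array order until the budget runs out. The array is
   expected to be sorted by ascending priority value, so the sprites left
   out are always the least important ones. Returns the number drawn. */
static size_t blit_within_budget(Sprite *sprites, size_t count, uint32_t budget)
{
    size_t drawn = 0;
    assert(sprites != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        sprites[i].drawn = 0;
    }
    for (size_t i = 0; i < count; i++) {
        if (sprites[i].cost > budget) {
            break; /* models being cut off at the end of VBlank */
        }
        budget -= sprites[i].cost;
        sprites[i].drawn = 1;
        drawn++;
    }
    return drawn;
}

/* Usage example: with 800 cycles only the two most important sprites fit. */
static void blit_budget_self_check(void)
{
    Sprite s[3] = { {0, 300, 0}, {1, 400, 0}, {2, 500, 0} };
    assert(blit_within_budget(s, 3, 800u) == 2u);
    assert(s[2].drawn == 0);
}
```

Stopping at the first sprite that does not fit, rather than skipping it and trying cheaper ones, keeps the strict priority order that an abrupt cut-off would produce.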

I don't know how robust the kernel is in this regard. Without frame reset it should work, but I am not sure what kinds of transients would show on the display.

The VBlank tasks cannot be trimmed in terms of function: the things they do are necessary to keep the game going (reading controllers, processing sprites, processing music and sound patches).

To figure out how long blitting your sprites takes, you may use that "wdr" technique to find out where the VBlank tasks end in the worst case, and also how much time you have remaining until the next frame. The cycles between the worst-case VBlank task end and the VBlank end can be exploited using a pre-vsync callback (where you can calculate sprite positions with them still ending up on the next frame). You will possibly have more than 15000 or so cycles to use there.
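
Once both points have been measured, the remaining budget is a simple saturating subtraction. The host-side C sketch below assumes both values are cycle counts taken from the same reference point (the start of VBlank); the function name and the 45000-cycle figure in the example are placeholders, not measurements.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles left between the worst-case end of the VBlank tasks and the end
   of VBlank. Returns 0 if the tasks can run past the end of VBlank. */
static uint32_t cycles_after_tasks(uint32_t vblank_end, uint32_t worst_task_end)
{
    return (worst_task_end >= vblank_end) ? 0u : vblank_end - worst_task_end;
}

/* Usage example: 67340 is the VBlank length quoted earlier in the thread,
   45000 is a placeholder for a measured worst-case task end. */
static void remaining_budget_self_check(void)
{
    assert(cycles_after_tasks(67340u, 45000u) == 22340u);
    assert(cycles_after_tasks(67340u, 70000u) == 0u);
}
```

The result is an upper bound, since any interrupt work that runs during those lines also comes out of it.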