The Case of the Missing Cycles (Mode 41)

charliegreen
Posts: 22
Joined: Wed Jan 14, 2015 7:28 pm
Location: California, USA

The Case of the Missing Cycles (Mode 41)

Post by charliegreen »

Hello everyone! It's been a while!

uzem says it's running at 30 MHz, and Wait100ns in uzeboxCore.c just runs the 3-cycle lpm instruction, which also suggests 30 MHz. At 60 fields per second, that suggests each field should have 500000 cycles available, right? Of course, much of that will be consumed by outputting the video signal and processing sound. But I don't think that should consume anywhere near as many cycles as I'm measuring.

Here's my code:

Code:

#include <avr/pgmspace.h>
#include <uzebox.h>

u32 wait_vsync(void);

int main(void) {
    ClearVram();
    SetFontTilesIndex(32);
    SetBorderColor(0x52U);

    while (true) {
        u32 cycles = wait_vsync();
        PrintLong(10, 5, cycles);
        PrintHexLong(12, 5, cycles);
    }
}

Code:

#include <defines.h>

        .global wait_vsync

        .section .text

;;; ================================================================================
;;; Wait until next vsync, and return the number of cycles that passed.
;;;
;;; C-callable
;;; returns: (u32)
        .section .text.wait_vsync
wait_vsync:
        clr     r1       ; note r1 is assumed by C code to be 0, but better safe
        clr     r22
        clr     r23
        clr     r24
        clr     r25

1:
        ;; See uzeboxVideoEngineCore.s:GetVsyncFlag
        lds     r18, sync_flags      ;2 cycles
        andi    r18, SYNC_FLAG_VSYNC ;1 cycle

        ldi     r20, 11              ;1 cycle
        add     r22, r20             ;1 cycle
        adc     r23, r1              ;1 cycle
        adc     r24, r1              ;1 cycle
        adc     r25, r1              ;1 cycle
        cpi     r18, 0               ;1 cycle
        breq    1b                   ;2 cycles when true, 1 when false

        ;; See uzeboxVideoEngineCore.s:ClearVsyncFlag
        ;; Copying because we already copied GetVsyncFlag right here and this
        ;; shaves off a few extra cycles
        lds     r18, sync_flags
        andi    r18, ~SYNC_FLAG_VSYNC
        sts     sync_flags, r18
        ret
        .size wait_vsync, . - wait_vsync

The building process:

Code:

$ make
mkdir .bin
mkdir .obj
mkdir .gen
avr-gcc -I"../../uzebox/kernel" -mmcu=atmega644 -mmcu=atmega644 -Wall -std=gnu99 -DF_CPU=28636360UL -Os -fsigned-char -ffunction-sections -fno-toplevel-reorder -DVIDEO_MODE=41 -DM40_IBM_ASCII=1 -DINTRO_LOGO=0  -DENABLE_MIXER=0 -fshort-enums  -g3 -gdwarf-2 -x assembler-with-cpp  -c ../../uzebox/kernel/uzeboxVideoEngineCore.s -o .obj/uzeboxVideoEngineCore.o
avr-gcc -I"../../uzebox/kernel" -mmcu=atmega644 -mmcu=atmega644 -Wall -std=gnu99 -DF_CPU=28636360UL -Os -fsigned-char -ffunction-sections -fno-toplevel-reorder -DVIDEO_MODE=41 -DM40_IBM_ASCII=1 -DINTRO_LOGO=0  -DENABLE_MIXER=0 -fshort-enums  -g3 -gdwarf-2 -x assembler-with-cpp  -c ../../uzebox/kernel/uzeboxSoundEngineCore.s -o .obj/uzeboxSoundEngineCore.o
avr-gcc -I"../../uzebox/kernel" -mmcu=atmega644 -Wall -std=gnu99 -DF_CPU=28636360UL -Os -fsigned-char -ffunction-sections -fno-toplevel-reorder -DVIDEO_MODE=41 -DM40_IBM_ASCII=1 -DINTRO_LOGO=0  -DENABLE_MIXER=0 -fshort-enums  -g3 -gdwarf-2 -c ../../uzebox/kernel/uzeboxCore.c -o .obj/uzeboxCore.o
avr-gcc -I"../../uzebox/kernel" -mmcu=atmega644 -Wall -std=gnu99 -DF_CPU=28636360UL -Os -fsigned-char -ffunction-sections -fno-toplevel-reorder -DVIDEO_MODE=41 -DM40_IBM_ASCII=1 -DINTRO_LOGO=0  -DENABLE_MIXER=0 -fshort-enums  -g3 -gdwarf-2 -c ../../uzebox/kernel/uzeboxSoundEngine.c -o .obj/uzeboxSoundEngine.o
avr-gcc -I"../../uzebox/kernel" -mmcu=atmega644 -Wall -std=gnu99 -DF_CPU=28636360UL -Os -fsigned-char -ffunction-sections -fno-toplevel-reorder -DVIDEO_MODE=41 -DM40_IBM_ASCII=1 -DINTRO_LOGO=0  -DENABLE_MIXER=0 -fshort-enums  -g3 -gdwarf-2 -c ../../uzebox/kernel/uzeboxVideoEngine.c -o .obj/uzeboxVideoEngine.o
avr-gcc -I"../../uzebox/kernel" -mmcu=atmega644 -Wall -std=gnu99 -DF_CPU=28636360UL -Os -fsigned-char -ffunction-sections -fno-toplevel-reorder -DVIDEO_MODE=41 -DM40_IBM_ASCII=1 -DINTRO_LOGO=0  -DENABLE_MIXER=0 -fshort-enums  -g3 -gdwarf-2  -c main.c -o .obj/main.o
avr-gcc -I"../../uzebox/kernel" -mmcu=atmega644 -mmcu=atmega644 -Wall -std=gnu99 -DF_CPU=28636360UL -Os -fsigned-char -ffunction-sections -fno-toplevel-reorder -DVIDEO_MODE=41 -DM40_IBM_ASCII=1 -DINTRO_LOGO=0  -DENABLE_MIXER=0 -fshort-enums  -g3 -gdwarf-2 -x assembler-with-cpp  -c asm.s -o .obj/asm.o
avr-gcc -mmcu=atmega644 -Wl,-Map=.obj/test41.map -Wl,-gc-sections   .obj/uzeboxVideoEngineCore.o .obj/uzeboxSoundEngineCore.o .obj/uzeboxCore.o .obj/uzeboxSoundEngine.o .obj/uzeboxVideoEngine.o .obj/main.o .obj/asm.o -o .bin/test41.elf
avr-objcopy -O ihex -R .eeprom  .bin/test41.elf .bin/test41.hex

Also, my copy of the uzebox repo is at commit b1d8914.

When I run this in uzem and cuzebox, it prints 66737 = 0x104b1 free cycles per vsync. This is clearly nowhere near 500000. As you can see, I have -DENABLE_MIXER=0, so sound processing shouldn't be consuming any cycles. I can't imagine that mode 41 consumes 430000 cycles per field, so I figure I must have made a mistake somewhere else.

Is my wait_vsync bad? Did I pass some bad compiler argument, or not pass one I should? Do PrintLong or PrintHexLong take a really long time? Is something other than mode 41's video processing running during interrupts? Was my estimate of 500000 cycles per field wrong?

I've really been scratching my head trying to figure out where these cycles are going, to be honest. Any help is appreciated! Thanks!
CunningFellow
Posts: 1445
Joined: Mon Feb 11, 2013 8:08 am
Location: Brisbane, Australia

Re: The Case of the Missing Cycles (Mode 41)

Post by CunningFellow »

Back of the napkin, I think that sounds about right.

The user loses control of the CPU for the entire visible picture time (video and audio), and for the HSync portion of the remaining lines (for audio).

There are 262 TV lines, of which 224 are visible. That leaves 262 - 224 = 38 lines during which the user gets the CPU.

Of those 38 lines, the user gets about 1800 clock cycles per line. 1800 * 38 = 68400

0x104b1 = 66737
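
The napkin math above can be written out as a small host-side C sketch. The line counts and the rough 1800 cycles per line come from this post; the function names are made up for illustration, and the result is only an upper bound because the kernel still takes some cycles on every free line.

```c
#include <stdint.h>

/* Figures quoted in the thread: 262 lines per field, 224 of them rendered. */
#define LINES_PER_FIELD 262u
#define RENDERED_LINES  224u

/* Lines per field on which the user program gets (most of) the CPU. */
uint32_t free_lines(uint32_t lines_per_field, uint32_t rendered_lines)
{
    return lines_per_field - rendered_lines;
}

/* Upper bound on user cycles per field, ignoring per-line kernel overhead. */
uint32_t user_cycles_per_field(uint32_t lines_free, uint32_t cycles_per_line)
{
    return lines_free * cycles_per_line;
}

/* How much of an estimate a measurement reached, in whole percent. */
uint32_t percent_of(uint32_t measured, uint32_t estimate)
{
    return (measured * 100u) / estimate;
}
```

With the numbers from this thread, `user_cycles_per_field(free_lines(LINES_PER_FIELD, RENDERED_LINES), 1800)` gives 68400, and the measured 66737 cycles come to about 97% of that estimate.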
charliegreen

Re: The Case of the Missing Cycles (Mode 41)

Post by charliegreen »

Ah, I see... video rendering is more demanding than I remember.

After refreshing my memory of NTSC, though, I'm a bit confused; I thought that 486 scanlines made the visible image, meaning 243 visible per field instead of 224? Also, looking at this timing diagram, each scanline is supposed to take 63.6μs; at 30 MHz, each cycle is 33.3ns, giving us 1908 cycles/scanline instead of 1800. Is time being taken up by the kernel during the vblank interval, or was 1800 just a rough number?
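
For reference, the cycles-per-scanline arithmetic can be checked with a short host-side C sketch. It takes the clock from the `-DF_CPU=28636360UL` flag in the build log above rather than the nominal 30 MHz, and assumes an NTSC colour line period of roughly 63.556 µs (a standard figure, not something stated in this thread). The function names are illustrative only.

```c
#include <stdint.h>

/* Cycles in one scanline, rounded to the nearest whole cycle.
 * f_cpu_hz:       CPU clock in Hz
 * line_period_ns: duration of one scanline in nanoseconds */
uint32_t cycles_per_line(uint64_t f_cpu_hz, uint64_t line_period_ns)
{
    const uint64_t ns_per_s = 1000000000ull;
    return (uint32_t)((f_cpu_hz * line_period_ns + ns_per_s / 2u) / ns_per_s);
}

/* Total cycles in one field of the given number of lines. */
uint32_t cycles_per_field(uint32_t lines, uint32_t per_line)
{
    return lines * per_line;
}
```

With 30 MHz and 63.6 µs this reproduces the 1908 figure above; with F_CPU = 28636360 and 63.556 µs it gives 1820 cycles per line, so a 262-line field would be 476840 cycles rather than 500000.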

Side note: I saw you've done some video mode programming, and I was wondering if there are any tools for counting cycles? This could be useful if I want to sacrifice sound for more cycles during hblank, or if I want to get back into writing video modes.

Thanks again for your help!
CunningFellow

Re: The Case of the Missing Cycles (Mode 41)

Post by CunningFellow »

charliegreen wrote: Thu Jan 14, 2021 8:48 pm <SNIP>meaning 243 visible per field instead of 224?
There are only 224 scanlines that the Uzebox draws. In fact, you can set that to a lower number in most video modes with renderLinesCount and firstRenderLine. You could make the screen a bit more "letterbox" by setting the render lines to 186, which doubles the number of free lines (262 - 186 = 76 instead of 38) and so roughly doubles the user CPU time.
charliegreen wrote: Thu Jan 14, 2021 8:48 pm <SNIP>each scanline is supposed to take 63.6μs; at 30 MHz, each cycle is 33.3ns, giving us 1908 cycles/scanline instead of 1800. Is time being taken up by the kernel during the vblank interval, or was 1800 just a rough number?
1800 was just a rough number. But you are correct: some time is taken up on each non-visible scanline for HSync and sound. Even during the invisible flyback time, audio still has to be processed, otherwise the sound would be crackly.
charliegreen wrote: Thu Jan 14, 2021 8:48 pm Side note: I saw you've done some video mode programming, and I was wondering if there are any tools for counting cycles? This could be useful if I want to sacrifice sound for more cycles during hblank, or if I want to get back into writing video modes.
You probably won't like my answer: a spreadsheet and AVR Studio 4.
D3thAdd3r
Posts: 3171
Joined: Wed Apr 29, 2009 10:00 am
Location: Minneapolis, United States

Re: The Case of the Missing Cycles (Mode 41)

Post by D3thAdd3r »

Just to throw in my 2 cents: CUzeBox was developed after Uzem and I think it is superior in most ways. It has visual helpers that indicate scanline timing inconsistencies, which is nice.

Also, sort of related depending on your use case, you can have a timer/interrupt terminated scanline to partially alleviate cycle counting. This can allow higher resolution in some cases, and I found it quite trivial in conjunction with CUzeBox hsync indicators to align a "general idea is right" scanline renderer with the actual NTSC requirements.

On the extreme crazy edge where CunningFellow does video modes, I have to imagine some advanced spreadsheet skills really are required.

For a lot more user time, you can run the game logic at 30 Hz with a short screen. For Columns I didn't have enough cycles for the AI until I did that. With Mode 3 or similar, you need some tricks to stop the kernel blitting RAM tiles at 60 Hz if you go down this twisted road.
charliegreen

Re: The Case of the Missing Cycles (Mode 41)

Post by charliegreen »

CunningFellow wrote: Thu Jan 14, 2021 9:29 pm
There are only 224 scanlines that the Uzebox draws. In fact, you can set that to a lower number in most video modes with renderLinesCount and firstRenderLine. You could make the screen a bit more "letterbox" by setting the render lines to 186, which doubles the number of free lines (262 - 186 = 76 instead of 38) and so roughly doubles the user CPU time.
Oh, rad! I'm using mode 41, which has a border that I can maybe cut into with that. I grepped through the kernel for renderLinesCount and firstRenderLine and couldn't find anything, though; how do I set them?
CunningFellow wrote: Thu Jan 14, 2021 9:29 pm You probably won't like my answer: a spreadsheet and AVR Studio 4.
Oh yikes lol. This seems like something a tool should exist for! Perhaps something that breaks the assembly into basic groups, like the graph view in IDA Pro, and counts total cycles for each group? Maybe I'll give writing one a go at some point. 🤷‍♀️
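
As a very rough idea of what the core of such a tool might look like, here is a host-side C sketch that sums cycle counts over one basic block. The cost table only covers the instructions annotated in wait_vsync above and uses the cycle counts from those comments, with breq counted as taken; a real tool would need the full AVR instruction set and both branch outcomes. All names here are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *mnemonic;
    int         cycles;
} op_cost_t;

/* Costs as annotated in wait_vsync; breq is listed as taken (2 cycles). */
static const op_cost_t COSTS[] = {
    {"lds", 2}, {"andi", 1}, {"ldi", 1}, {"add", 1},
    {"adc", 1}, {"cpi", 1},  {"breq", 2},
};

/* Sum the cycles of a straight-line block of n mnemonics.
 * Returns -1 if a mnemonic is missing from the table. */
int block_cycles(const char *const *ops, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        int found = 0;
        for (size_t j = 0; j < sizeof COSTS / sizeof COSTS[0]; j++) {
            if (strcmp(ops[i], COSTS[j].mnemonic) == 0) {
                total += COSTS[j].cycles;
                found = 1;
                break;
            }
        }
        if (!found) {
            return -1;
        }
    }
    return total;
}
```

Fed the polling loop from wait_vsync (lds, andi, ldi, add, adc, adc, adc, cpi, breq), this returns 11, which matches the constant the loop adds to its counter on each pass.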
D3thAdd3r wrote: Sat Jan 16, 2021 4:47 am Just to throw in my 2 cents: CUzeBox was developed after Uzem and I think it is superior in most ways. It has visual helpers that indicate scanline timing inconsistencies, which is nice.

Also, sort of related depending on your use case, you can have a timer/interrupt terminated scanline to partially alleviate cycle counting. This can allow higher resolution in some cases, and I found it quite trivial in conjunction with CUzeBox hsync indicators to align a "general idea is right" scanline renderer with the actual NTSC requirements.
Oh nice, I may end up using the timer/interrupt terminated scanlines in the future. As for CUzeBox, I've used it some, but I didn't understand the heatmap-like display at the bottom; is there documentation anywhere on what the different colors mean, and what the axes are?
D3thAdd3r wrote: Sat Jan 16, 2021 4:47 am For a lot more user time, you can run the game logic at 30 Hz with a short screen. For Columns I didn't have enough cycles for the AI until I did that. With Mode 3 or similar, you need some tricks to stop the kernel blitting RAM tiles at 60 Hz if you go down this twisted road.
This sounds really useful! How do I do that? Is there a macro somewhere for the video interrupt frequency? Does this just ignore every other field?

Also, thank you both for all your help!
D3thAdd3r

Re: The Case of the Missing Cycles (Mode 41)

Post by D3thAdd3r »

charliegreen wrote: Mon Jan 18, 2021 9:39 pm As for CUzeBox, I've used it some, but I didn't understand the heatmap-like display at the bottom; is there documentation anywhere on what the different colors mean, and what the axes are?
I don't recall where that is documented, but I think the colors are based on frequency of access (i.e. they dim to blue when they aren't used for a bit; white flashes are memory accesses). For scanlines I think I only ever used the red lines. Any scanline that is off time (including non-visible vblank ones) will show a red line to the left or the right. As you dial things in, the line gets shorter until it disappears. It's been a while since I've played with video modes, but possibly a red line to the left of a scanline means it started late, a red line to the right means it ran late, and lines on both sides mean the scanline is too many cycles overall.
charliegreen wrote: Mon Jan 18, 2021 9:39 pm I grepped through the kernel for renderLinesCount and firstRenderLine and couldn't find anything, though; how do I set them?
Use SetRenderingParameters()... though I'm not sure if this actually works in Mode 41. The API documentation needs some things added if anyone gets the time.
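
To make the "letterbox" idea from earlier in the thread concrete, here is a host-side C sketch that works out a centred, shorter render window. The default first line of 20 used in the usage note is an assumption that should be checked against the kernel sources; the 224 and 186 line counts are from this thread, and the type and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint8_t first_line;  /* first scanline to render */
    uint8_t line_count;  /* number of scanlines to render */
} render_window_t;

/* Centre a picture of wanted_lines inside the default render window.
 * Requests taller than the default are clamped to the default height. */
render_window_t centered_window(uint8_t default_first,
                                uint8_t default_lines,
                                uint8_t wanted_lines)
{
    render_window_t w;
    if (wanted_lines > default_lines) {
        wanted_lines = default_lines;
    }
    w.first_line = (uint8_t)(default_first + (default_lines - wanted_lines) / 2u);
    w.line_count = wanted_lines;
    return w;
}
```

For example, `centered_window(20, 224, 186)` yields first line 39 and 186 lines. On the Uzebox those two values would presumably be what gets handed to SetRenderingParameters(), but the argument order and units should be confirmed in the kernel headers, and as noted above it may not work in Mode 41 at all.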
charliegreen wrote: Mon Jan 18, 2021 9:39 pm Does this just ignore every other field?
The 30 Hz concept still rasterizes scanlines at 60 Hz. It is more of a concept than a concrete thing, because it depends highly on the video mode and the specific game. Instead of using WaitVsync(1), which just spins away extra cycles waiting to align with 60 Hz, you use GetVsyncFlag()/ClearVsyncFlag() to keep track of vsyncs yourself. Or instead of that... or in combination with it... you can use SetUserPreVsyncCallback()/SetUserPostVsyncCallback().

So I have not been Uzeboxing for quite a while and I don't want to give incorrect information. But I can speak on the overall concept without digging into old code of mine. The screen is always drawn at 60 Hz and sound/music is always updated at 60 Hz as well (this could be altered with the vsync mixer), but that doesn't necessarily mean you have to update game logic, sprites, etc. that often. This is where it depends on the video mode, and the technique really excels with video modes that have heavy cycle usage, like Mode 3 blitting sprites, or game logic that can be slow, like games with puzzle AI. So by keeping track of vsyncs, we just wait for every other one manually. Or, if using a lot of RAM where the stack gets deep, wait for every vsync, but bypass normal kernel tasks every other frame.
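
A minimal host-side C sketch of the "wait for every other vsync manually" idea follows. The flag and the two accessor functions are stand-ins written for this example so it can run on a PC; on real hardware their role would be played by the kernel's GetVsyncFlag()/ClearVsyncFlag(), and the divider struct is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's vsync flag (assumption for host testing only). */
volatile bool fake_vsync_flag = false;

bool get_vsync_flag(void)   { return fake_vsync_flag; }
void clear_vsync_flag(void) { fake_vsync_flag = false; }

typedef struct {
    uint8_t divider;  /* run logic every N vsyncs, e.g. 2 for 30 Hz logic */
    uint8_t count;    /* vsyncs seen since the last logic frame */
} frame_divider_t;

/* Poll once. Returns true when a vsync arrived AND it is a logic frame. */
bool poll_logic_frame(frame_divider_t *d)
{
    if (!get_vsync_flag()) {
        return false;           /* no vsync yet, keep doing other work */
    }
    clear_vsync_flag();
    d->count++;
    if (d->count >= d->divider) {
        d->count = 0;
        return true;            /* time to run game logic */
    }
    return false;               /* skip logic on this field */
}
```

The main loop would call `poll_logic_frame()` repeatedly and run its game logic only when it returns true, so with a divider of 2 the logic gets roughly two fields' worth of user cycles per update while the screen is still drawn every field.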

Eh, it's hard to explain, but I think when you play with it for a bit it will make sense how it could work for your scenario. For example, Mode 3 blits sprites at 60 Hz, unless there are no sprites. So if you have sprites set up on frame 1, it will blit those to RAM tiles and modify VRAM to point to them. On frame 2, you could turn all the sprites off, and the sprite data from the last frame would still be blitted into the RAM tiles. Normally, Mode 3 uses ram_tiles_restore[] after a frame is rendered to put every VRAM entry it modified back to the state it was in before the sprite blitting began (pointing them back to normal tiles, which is supposed to be transparent to the user program). Now, using a vsync callback or another method, you could prevent the restore from happening. It gets a little tricky depending on the situation.
CunningFellow

Re: The Case of the Missing Cycles (Mode 41)

Post by CunningFellow »

charliegreen wrote: Mon Jan 18, 2021 9:39 pm
CunningFellow wrote: Thu Jan 14, 2021 9:29 pm You probably won't like my answer: a spreadsheet and AVR Studio 4.
Oh yikes lol. This seems like something a tool should exist for! Perhaps something that breaks the assembly into basic groups, like the graph view in IDA Pro, and counts total cycles for each group? Maybe I'll give writing one a go at some point. 🤷‍♀️
I'd use that.
charliegreen wrote: Mon Jan 18, 2021 9:39 pm
D3thAdd3r wrote: Sat Jan 16, 2021 4:47 am Also, sort of related depending on your use case, you can have a timer/interrupt terminated scanline to partially alleviate cycle counting. This can allow higher resolution in some cases, and I found it quite trivial in conjunction with CUzeBox hsync indicators to align a "general idea is right" scanline renderer with the actual NTSC requirements.
Oh nice, I may end up using the timer/interrupt terminated scanlines in the future. As for CUzeBox, I've used it some, but I didn't understand the heatmap-like display at the bottom; is there documentation anywhere on what the different colors mean, and what the axes are?
T2K was the first thing that used the interrupt-ended scanline. Uzem had to be modified to emulate that hardware to be able to run it. Not having to count tiles does not save a HUGE amount of time, just one or two clocks per tile. Where it shines is that, if your loop uses an IJMP to make a decision on loop back, you don't have to use a branch-out-of-loop instruction.

The 2nd video mode that uses a different interrupt is the RLE/filled-polygon/3117 mode (don't know what to call it yet). It uses a UART interrupt as a pseudo-timer to end the scanline. This was done because the timer interrupt I used in the T2K mode was already being used to end a pixel run. Ending the scanline with an interrupt in this mode is vital because the RLE decoder doesn't actually know or keep track of how many pixels have been pushed. It just keeps running until it gets a tap on the shoulder.

The heatmap at the bottom of the screen is broken into 16x16 pixel boxes (256 pixels), where each pixel is a memory location. So each box shows a 256-byte chunk of memory. The first box on the far left is the I/O addresses. The next 16 boxes (16x256=4096) are the RAM. You get a different colour depending on whether you are reading or writing the memory.

I believe red is writing, green is reading and yellow is R/W. These are just my observations from looking at the memory map when T2K is running.
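
The layout described above can be sketched as an address-to-pixel mapping in C. The split into 256-byte boxes is from this post; the row-major ordering inside each box (16 bytes per pixel row) is an assumption, and the names are invented for this example.

```c
#include <stdint.h>

typedef struct {
    uint16_t box;  /* 0 = I/O box, 1..16 = RAM boxes */
    uint8_t  x;    /* pixel column inside the box, 0..15 */
    uint8_t  y;    /* pixel row inside the box, 0..15 */
} heat_pos_t;

/* Map a data-space address to its position in the heatmap.
 * Assumes each box is filled row by row, 16 bytes per row. */
heat_pos_t heatmap_pos(uint16_t addr)
{
    heat_pos_t p;
    p.box = (uint16_t)(addr >> 8);          /* which 256-byte chunk */
    p.x   = (uint8_t)(addr & 0x0Fu);        /* low nibble: column */
    p.y   = (uint8_t)((addr >> 4) & 0x0Fu); /* next nibble: row */
    return p;
}
```

On the ATmega644, SRAM starts at 0x0100 and ends at 0x10FF, so address 0x0100 lands in box 1 (the first RAM box) and 0x10FF in the bottom-right pixel of box 16.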
[Image: T2KMap.png, the CUzeBox memory map while T2K is running]
The object store is a 64-element array of an 8-byte structure. This takes up two blocks of 256 bytes. The first member of each structure is the object type. It gets read every game loop for collision detection. If the object is empty then that read is all that happens. You can see all 64 objects get that read, which is the two green lines across these two squares. There are 4 objects on the screen (claw and 3 flippers), so you can see the first 4 elements of this array have more reads and writes in their 8-byte structures.
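
As an illustration of that layout, here is a C sketch of such an object store. Only the sizes (64 elements of 8 bytes) and the leading type byte come from the description above; the other seven bytes, the "empty" value of 0 and all names are placeholders, not the actual T2K structure.

```c
#include <stdint.h>

#define OBJECT_COUNT 64u
#define OBJECT_EMPTY 0u   /* hypothetical marker for an unused slot */

typedef struct {
    uint8_t type;        /* first member: read every game loop */
    uint8_t payload[7];  /* placeholder for the remaining 7 bytes */
} object_t;

/* 8 bytes each, so 64 of them fill exactly two 256-byte blocks. */
_Static_assert(sizeof(object_t) == 8, "object_t must be 8 bytes");

object_t object_store[OBJECT_COUNT];

/* Scan the store the way a collision pass might: for empty slots,
 * the type byte is the only thing that gets read. */
unsigned count_active(const object_t *store, unsigned n)
{
    unsigned active = 0;
    for (unsigned i = 0; i < n; i++) {
        if (store[i].type != OBJECT_EMPTY) {
            active++;
        }
    }
    return active;
}
```

With 8-byte structures and 16 bytes per heatmap row (if the boxes are laid out row by row), the type bytes would fall at columns 0 and 8 of every row, which would fit the two green lines described above.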

The next 8 squares are the RAMTiles. With most of the screen blank (apart from the blue lines, which are read from SD), you can see that most of these 2 kilobytes are empty and are getting reads and writes.

Next the 900ish bytes of VRAM gets totally erased, partially written and then totally read each frame. So this is fully yellow.

I may be wrong about this - but the assumptions fit what I know about my game there.