Hmmm, just noticed a Reddit thread about the WinUAE implementation of the Lagless VSYNC (beam racing) algorithm!
https://www.reddit.com/r/emulation/comm ... ency_mode/
I totally missed that two days ago.
Tommy wrote:I'm also a bit worried about eGPUs. Has anybody performed any metrics on those? I understand that when an eGPU is running an internal screen it's actually still the original GPU generating the display, with the eGPU piping data? If so then is there much extra latency there?
Beam racing works on Intel iGPUs (integrated GPUs).
(from a post in the multi-page abime.net version of this thread)
Rotareneg wrote:My laptop is running Windows 10 Home on a Core i5-3230M with the integrated GPU (Intel HD Graphics 4000). When using the built-in screen it hardly jitters at all, but has a fixed wrap-around of the last slice:
The wraparound is probably amplified because several laptops have almost no VBI -- probably only 10 or 15 scanlines at most. Scanline #1 begins scanning almost immediately after the last scanline. (This requires surge-execution of VBI code to complete the emulator's VBI tasks in time.)
For this, I highly recommend flipping after the first (topmost) emulator frameslice render, while the realraster is still within the bottom part of the screen. This gives you more safety margin by doing time-offsets from Scan Line #1, so it's more VBI-size-independent (more universal across platforms and displays). Don't worry about odd color strips: as long as the flip boundaries are above the frameslice boundaries, it's harmless, even in the wraparound-scan situation.
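A minimal sketch of this scheduling idea (all names and numbers here are my own illustrative assumptions, not from WinUAE):

```python
# Illustrative sketch (names and numbers are my assumptions, not WinUAE's):
# schedule each frameslice flip as a raster-line offset from Scan Line #1,
# so the logic is independent of how big or small the display's VBI is.

VISIBLE_LINES = 1080               # active scanlines (assumed)
VBI_LINES = 12                     # tiny laptop-style VBI (assumed)
TOTAL_LINES = VISIBLE_LINES + VBI_LINES
SLICES = 4                         # frameslices per refresh (assumed)
SLICE_LINES = VISIBLE_LINES // SLICES

def flip_line_for_slice(i, early_lines=SLICE_LINES // 2):
    """Raster line (modulo TOTAL_LINES) at which to flip frameslice i.

    Slice 0 wraps around: it is flipped while the real raster is still
    in the bottom part of the *previous* refresh -- the extra safety
    margin recommended above for near-zero-VBI laptop panels."""
    return (i * SLICE_LINES - early_lines) % TOTAL_LINES
```

With these example numbers, slice 0 flips at raster line 957 (near the bottom of the previous scanout), while slices 1-3 flip half a slice ahead of their boundaries (lines 135, 405, 675).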
Confirmed so far:
-- Works on NVIDIA
-- Works on AMD
-- Works on Intel iGPUs
-- Works on Android GPUs
Tommy wrote:I don't know how relevant or helpful it is, but on embedded GPUs like those in phones and Intel's various solutions it's often more expensive not to clear a target buffer upon first binding it.
Oh, good point! I hadn't fully thought this through -- you're right that certain GPUs/drivers may actually depend on the assumption that the buffer is already clear. That said, on my desktop at least, buffers aren't pre-cleared, and skipping the clear does save a little bandwidth and slightly increases frameslice throughput (though not as much of a performance increase as I expected). More study needed.
Tommy wrote:Anyway, yes, I'm aware that my priorities are likely distinct. They usually are.
Certainly understandable! There are so many possible priorities -- e.g. optimizing for performance, CRT-behavior authenticity, blur-authenticity, etc. They're often difficult to achieve simultaneously for one reason or another (e.g. hardware limitations, like LCDs not doing perfect blacks).
Tommy wrote:That surely depends on how you are generating audio. If tasked with producing a classic 44100Hz output, I currently get a guaranteed worst-case audio latency of 5.8ms because audio output exhaustion is one of the triggers to do more emulation. So that's not just sub-frame, but completely detached from the frame rate.
In the hypothetical 60->120 mapping of a surge frame then a fallow frame, the worst case is now the length of the fallow frame. So it's gone up to around 8.3ms. If I were 60->240 mapping by a surge frame and then three fallows, I'm up at 12.5ms. Etc.
Or am I labouring under a misapprehension?
Excluding the audio buffer: audio never needs to be delayed beyond the 1/60sec window, no matter how slow or fast the scanout. And large VBIs, even ones larger than Active (e.g. 75% VBI, 25% active), don't shrink the wraparound jitter margin -- you still have 16.7ms of video-delay adjustment (minus the time interval of one frameslice) even in ultrafast-scanout situations, no matter whether the raster is in Active or in VBI.
Surge-scanouts don't prevent you from delaying your "beam-race" into a "beam-chase" -- basically beam racing with a ~15-16ms margin.
Unchanged Constants:
-- Audio buffer exhaustion (meaning 5.8ms after this event, audio stops?)
-- The original emulator refresh cycle target of 1/60sec (or 1/50sec, etc)
-- Full emulator-refresh-cycle jitter margin (unchanged regardless of fallow cycles or large VBI)*
*Remember this is the full emulator refresh cycle jitter margin, so always 1/60sec no matter how fast the scanout is
So the beam-race velocity doesn't interfere with your ability to compensate to an exact audio delay: a fast beam-race can have unchanged-lag audio because the constants remain unchanged. You simply use a bigger audio buffer so you can buffer more audio (e.g. 16.7ms of audio buffered during a 4.2ms fast-scanout), and then also do a video-delay adjustment.
The average lag improvement of a faster scanout is the midpoint of the improvement in the visible-scanlines scanout time (minus VBI). A 60Hz scanout is roughly 16.2ms excluding VBI time (roughly 0.4-0.5ms for a common VBI), so this is (16.2ms - 4.2ms) / 2 = 12ms / 2 = 6ms average lag decrease (averaged over all pixels of the entire screen) for a 1/240sec scanout instead of a 1/60sec scanout. Approaching this from a different direction: the 1/240sec scenario scans top-to-bottom over [0..4.2ms] (average 2.1ms), rather than the 1/60sec scenario's [0..16.2ms] (average 8.1ms). Double-checking the formula via different math: 8.1ms - 2.1ms = 6ms. So: a ~6ms average decrease in visual lag for the refresh cycle as a whole, for a 1/240sec scanout instead of a 1/60sec scanout -- meaning the average photon hits the eyeballs 6ms sooner for any given random pixel on the screen (less for the top edge, more for the bottom edge; this is just the average).
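The same arithmetic, double-checked in a few lines (all values taken from the post itself):

```python
# Average-lag arithmetic from the paragraph above (post's own values):
VISIBLE_60HZ_MS = 16.2    # visible scanout time at 1/60 sec, VBI excluded
VISIBLE_240HZ_MS = 4.2    # visible scanout time at 1/240 sec

# Method 1: half the difference between the two scanout durations
avg_lag_decrease = (VISIBLE_60HZ_MS - VISIBLE_240HZ_MS) / 2

# Method 2: difference between the midpoints of each scanout window
avg_lag_decrease_2 = VISIBLE_60HZ_MS / 2 - VISIBLE_240HZ_MS / 2

# Both methods agree on a ~6 ms average lag decrease:
assert round(avg_lag_decrease, 6) == round(avg_lag_decrease_2, 6) == 6.0
```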
Now, this is clearly within the jitter safety margin, so you simply trail the realraster scanout behind the emuraster scanout with a roughly 6ms chase distance between emuraster and realraster. Voila -- audio is back in sync.
If you preferred to align audio to top-edge refresh or bottom-edge refresh, that's easy -- it's only a 4.2ms range of adjustment, so +/- 2.1ms around 6ms, which is an adjustment range of [3.9ms..8.1ms] of video-lag adjustment for all possible theoretical audio-delay aberrations caused solely by the scanout-velocity difference. This adjustment range fully fits within the jitter margin.
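That [3.9ms..8.1ms] range falls straight out of the numbers above:

```python
# The +/- 2.1 ms adjustment range around the 6 ms average (post's values):
AVG_DECREASE_MS = 6.0     # average lag decrease, 1/240 sec vs 1/60 sec
FAST_SCANOUT_MS = 4.2     # visible 1/240 sec scanout duration

top_edge = AVG_DECREASE_MS - FAST_SCANOUT_MS / 2      # audio aligned to top edge
bottom_edge = AVG_DECREASE_MS + FAST_SCANOUT_MS / 2   # audio aligned to bottom edge
assert (round(top_edge, 1), round(bottom_edge, 1)) == (3.9, 8.1)
```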
Mathematically, this always holds, provided your frameslices are shorter than the active part of the real refresh cycle. Yet your range of adjustment is the FULL emulator refresh cycle. When I said the jitter margin was "a refresh cycle minus a frame slice", that meant "a full 1/60sec minus one frame slice" -- even if the real-world scanout is 1/240sec. Maybe this was the source of confusion. The jitter-range formula is (the number of scanlines in 1/60sec) minus (the number of scanlines in a frame slice) -- that's the video-delay range of adjustment you get. A bigger VBI actually improves your range of video-delay adjustment when you use this formula with the same frameslice size! So that's hopefully a more generous range of video-delay adjustment than you might have thought. My apologies for the confusion.
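The jitter-margin formula stated above, as a sketch (the scanline counts are my illustrative assumptions):

```python
def jitter_margin_ms(emu_hz=60.0, total_lines=1092, slice_lines=270):
    """Video-delay adjustment range: one full *emulator* refresh cycle of
    scanlines minus one frameslice, at the emulator's line rate --
    independent of how fast the real display scans out."""
    ms_per_line = (1000.0 / emu_hz) / total_lines
    return (total_lines - slice_lines) * ms_per_line

# A bigger VBI (more total lines, same frameslice height) widens the margin:
assert jitter_margin_ms(total_lines=2000) > jitter_margin_ms(total_lines=1092)
```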
Since both situations (1/60sec and 1/240sec) have the same unchanged audio-buffer-exhaustion signal (and the same time interval between the software exhaustion event and the sound physically stopping -- which you say is a worst case of 5.8ms, right?), there's no difference between the fast-scanout and slow-scanout situations in terms of buffer behavior. At any one time, no matter how slow the scan or how infinitely fast, there's never a need to generate more than one refresh cycle's worth of audio at any one instant. Even the impossibly improbable scenario of instant scanout + 16.7ms VBI only requires an 8.3ms video-delay adjustment to match the average video-lag decrease -- still within your video-delay adjustability margin (via the beam chase distance between emuraster and realraster).
There's one caveat -- the equally rare combination of "audio stimuli directly clocked from an input read" with "input reads at random places in the emulator scanout". Due to the asymmetry of emu-scanout versus real fast-scanout, this would cause a variable audio delay (+/- a few milliseconds) depending on where in the emulator refresh cycle the keypress or fire button was read. In reality, the two don't happen simultaneously, because either (1) input reads occur at a consistent location in the refresh cycle, or (2) audio events are internally synced to the emulator's clock or the emulator's VBI -- so the audio-lag offset is constant if either is true, and it's easy to adjust the audio/video lag. You'd target the average lag decrease caused by the fast scanout and adjust the beam-race margin as a video delay. For recalibrating to a game that does input reads at the bottom edge of the screen, a slider could adjust the video delay (via the chase distance between emuraster and realraster, as described earlier). In all possible situations, video-ahead-of-audio mathematically never falls outside the video-delay adjustment range.
Now if your audio buffer is fixed at 5.8ms (can't buffer more), that's another ball of wax, but that's a different subtopic. (Please clarify if you've got a limitation on buffering /more/ -- e.g. more than 5.8ms of audio; my assumption is that you can.) Buffering more will potentially be necessary in fast-scanout situations, since the bottom of the screen finishes 12.5ms sooner in a 1/240sec beam-race (12ms sooner, excluding VBI). Then you do the video delay to sync up with the audio delay -- half of the 12ms (6ms) to match the average lag. So a bigger audio buffer (more than 5.8ms) becomes necessary to zero out the audio-video lag.
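The buffer-sizing arithmetic for the fast-scanout case, as a sketch (values taken from the discussion):

```python
# Buffer-sizing arithmetic for the fast-scanout case (discussion's values):
EMU_REFRESH_MS = 16.7        # one emulator refresh cycle's worth of audio
VISIBLE_60HZ_MS = 16.2       # visible scanout at 1/60 sec, VBI excluded
FAST_SCANOUT_MS = 4.2        # visible scanout at 1/240 sec
FIXED_BUFFER_MS = 5.8        # the quoted worst-case audio latency

# A surge scanout emits a full refresh cycle of audio in only ~4.2 ms,
# so the buffer must be able to hold more than the fixed 5.8 ms:
assert EMU_REFRESH_MS > FIXED_BUFFER_MS

# Bottom-of-screen finishes 12 ms sooner (excluding VBI); delaying video
# by half of that (~6 ms) re-centres it on the average audio lag:
video_delay_ms = (VISIBLE_60HZ_MS - FAST_SCANOUT_MS) / 2
assert round(video_delay_ms, 1) == 6.0
```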
(Covering all bases, leaving no stone unturned...) That said, if you're driving your emulator speed 100% directly from an audio clock, and never reclocking audio, it's very hard to precisely calibrate your real display's refresh rate to be perfectly in clock-ratio sync with your audio. For example, different 50Hz EDIDs on different displays may actually result in a graphics clock generating 49.999Hz or 50.001Hz (even if it claims 50.000), so you must be doing some form of compensation for this, especially where error tolerances don't overlap -- and especially if a user wants to sync the emulator to VSYNC ON (so your emulator necessarily runs off the video card's "clock").

With a GPU clock slightly slewing relative to the CPU's clock, and potentially relative to the audio chip's clock (tiny as those slews may be -- even <0.001% differences in the tick rates of the various chips in a modern computer), you're probably making design decisions based on your priorities. How are you balancing synchronization priorities? Beam racing by necessity means slaving your emulator clock to the video output, and you've got to reclock everything anyway, so it all boils down to "doesn't matter, slow-scan vs fast-scan", per the above. But if you're permanently clocking your emulator to audio clocks with zero audio reclocking (meaning frame skips/drops at 50Hz or 60Hz VSYNC ON), that might make it tougher for you to beam race even a simple 60Hz original-scan-velocity output -- you can no longer rely on audio buffer exhaustion during beam racing. Besides, fast-scan is buffer-stuffing, not buffer-exhaustion, and the range-of-adjustment Venn diagrams lay those worries to rest.

I'm curious what synchronization priorities you chose, given the known slew effects between all the various imperfect clocks (video, audio, CPU, etc.). Or is this last paragraph irrelevant (my overthinking), and the rest of my post useful-to-know stuff?
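One common way to cope with those clock slews is to measure the display's true refresh rate from VSYNC timestamps and derive an audio reclocking ratio. A minimal sketch, with hypothetical function names (not from any real emulator):

```python
def estimated_refresh_hz(vsync_timestamps):
    """Average refresh rate (Hz) over a window of VSYNC timestamps (sec).

    An EDID mode claiming 50.000 Hz may really tick at 49.999 or
    50.001 Hz relative to the CPU clock, so measure rather than trust
    the mode's nominal number."""
    intervals = len(vsync_timestamps) - 1
    return intervals / (vsync_timestamps[-1] - vsync_timestamps[0])

def audio_resample_ratio(nominal_hz, measured_hz):
    """Reclocking ratio: stretch/shrink audio by this factor so an
    emulator slaved to the video clock never skips or drops frames."""
    return measured_hz / nominal_hz
```

For instance, a display measured at 49.999Hz against a nominal 50Hz mode yields a resample ratio of 0.99998 -- a tiny, inaudible correction, but enough to prevent slow buffer drift.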