Emulator Developers: Lagless VSYNC ON Algorithm

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 27 Mar 2018, 09:37

Spinning for only 10us without busylooping? That's tough. I just busywait on a microsecond counter if my wait is less than 2ms. If my wait is 2ms or more, I thread-sleep until roughly 1 millisecond remains, then busyloop the rest of the way to microsecond-align with the raster.
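
For anyone wanting a concrete starting point, here is a minimal sketch of that hybrid wait in portable C++ (assumptions: C++14 or later, a steady_clock with microsecond-or-better resolution -- this is the shape of the technique, not my exact code):

Code:
#include <chrono>
#include <thread>

// Hybrid wait: thread-sleep while plenty of time remains, then busyloop
// the last stretch to microsecond-align with the target (e.g. a raster poll).
void precise_wait_until(std::chrono::steady_clock::time_point target)
{
    using namespace std::chrono;

    // Coarse phase: sleep while more than ~2ms remains, always leaving
    // ~1ms of margin in case the OS scheduler wakes us up late.
    for (;;) {
        auto remaining = target - steady_clock::now();
        if (remaining < 2ms)
            break;
        std::this_thread::sleep_for(remaining - 1ms);
    }

    // Fine phase: busywait on the high-resolution clock the rest of the way.
    while (steady_clock::now() < target) {
        // spin
    }
}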

RDTSC is the ultimate in precision but can have backwards-tick problems when a thread changes processor cores (see this link ...). Generally, I try to use the higher-level APIs that are cross-platform:

C# -- System.Diagnostics.Stopwatch
C++ -- std::chrono::high_resolution_clock::now

They both usually use RDTSC under the hood, but now have built-in backwards-tick protection from what I've read (AFAIK).
I find I don't need better precision than these; I'm getting scanline-exact with the above.

But RDTSC is definitely the ultimate, sometimes nanosecond-consistent; just make sure you give the thread affinity. Use RDTSC if you're already using it all over the emulator anyway and are prepared to handle it, but cross-platform emulator authors are also reading this thread thinking beam racing is Windows-only, when that's totally false. I have to fight this myth. It works on Linux, Android and Mac.

The horizontal scanrate of 1080p is only 67.5 kHz -- so a timer precision of 1/67,500 sec (about 15 microseconds) is needed for approximate scanline accuracy. Even 4K60 is under 270 kHz. RDTSC is massive overkill in timer resolution for slice-based beam racing, and requires a bit of hoop-jumping in some languages. Overkill is good whenever possible, but can be unnecessary for cross-platform emulators trying to keep the code KISS. Even simple C# Stopwatch is 0.1us-accurate on my system. Beam chasing is complex enough as it is for some authors, so...

For sub-millisecond sleeping, there are kernel calls available that allow 0.1ms sleeping, but that's platform-specific. You might want to investigate this, so that sub-millisecond sleeps are automatically used whenever possible.
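
On Windows specifically (an assumption for illustration -- this route needs a recent Windows 10 build), high-resolution waitable timers are one such kernel route to roughly half-millisecond sleeps. A hedged sketch:

Code:
#include <windows.h>

#ifndef CREATE_WAITABLE_TIMER_HIGH_RESOLUTION
#define CREATE_WAITABLE_TIMER_HIGH_RESOLUTION 0x00000002
#endif

// Attempt a sub-millisecond sleep via a high-resolution waitable timer.
// Returns false on older Windows, so the caller can fall back to busywaiting.
bool submillisecond_sleep(long long microseconds)
{
    HANDLE timer = CreateWaitableTimerExW(
        nullptr, nullptr,
        CREATE_WAITABLE_TIMER_HIGH_RESOLUTION, TIMER_ALL_ACCESS);
    if (!timer)
        return false;

    LARGE_INTEGER due;
    due.QuadPart = -(microseconds * 10);  // negative = relative time, 100ns units
    SetWaitableTimer(timer, &due, 0, nullptr, nullptr, FALSE);
    WaitForSingleObject(timer, INFINITE);
    CloseHandle(timer);
    return true;
}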

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 27 Mar 2018, 10:09

I've posted a heads-up to RetroArch (libretro) -- https://forums.libretro.com/t/an-input- ... u=mdrejhon

I don't think they'll implement it for a while yet, because RetroArch is much more complicated as a multi-emulator system -- but the post gives them a heads-up that real-time beam racing is now actually possible, so they can decide (later) if they want to begin pointing their codebase in that direction. I did mention to them that they may prefer to wait until other emulators have more fully validated this first!

I think beam racing should arrive in simpler line-exact / cycle-exact emulators first, which would also validate more cross-platform open-source code first. Some (not all) emulator authors who program cycle-exactness for game-preservation purposes could be very interested in recreating original input-lag mechanics with no latency-futzing by buffered 3D APIs. There are a huge number of excellent emulator candidates to implement real-time beam chasing in first, before tackling the complexity of multi-emulator systems like MESS / MAME / RetroArch / etc.

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by twilen » 27 Mar 2018, 12:41

I meant busywaiting, and not for any exact timing or similar reason -- only to keep the spinning CPU core off the bus for a few microseconds or so. (Some kind of stop-for-x-cycles instruction would be even better, but I don't think one exists..)

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by RealNC » 27 Mar 2018, 12:58

You can calculate a delay-loop value for the current CPU. You calibrate it at program startup. The calibration involves finding out how much time the CPU takes to do 1,000,000 busy-wait iterations (I pulled that number out of my ass here; choose something that makes sense.)

Once you've calibrated, you then know how much time an N-iteration busy-wait will consume, and can thus make nanosecond-precise busy-waits. Well, in theory -- the OS will always get in the way. CPU power management can of course also interfere and make it tricky. Compiler optimizations might get in the way too, so it might be necessary to write a tiny bit of assembly that does the busy loop.

In general, you might want to research delay loop calibrations on the web.
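
A minimal sketch of the calibration idea described above (assumptions: a fixed CPU clock during calibration; the volatile sink is there so the optimizer can't delete the loop -- for truly reliable behaviour you'd write the loop in assembly as mentioned):

Code:
#include <chrono>

static volatile unsigned long long g_sink;  // defeats loop elimination

// Run once at startup, e.g. calibrate_iters_per_sec(1000000):
// measures busy-wait iterations per second on this CPU.
double calibrate_iters_per_sec(unsigned long long iters)
{
    auto t0 = std::chrono::steady_clock::now();
    for (unsigned long long i = 0; i < iters; ++i)
        g_sink = i;
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - t0;
    return static_cast<double>(iters) / elapsed.count();
}

// Busy-wait for approximately 'ns' nanoseconds using the calibration.
void delay_loop_ns(double ns, double iters_per_sec)
{
    const auto n = static_cast<unsigned long long>(ns * 1e-9 * iters_per_sec);
    for (unsigned long long i = 0; i < n; ++i)
        g_sink = i;
}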

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by twilen » 27 Mar 2018, 14:06

I didn't ask that (of course I know how delay-loop calibration works and how power saving can interfere with it; that's the simple part). I asked for the most "optimal" busyloop: for example, it should not slow down other cores (at least not too much) from a single core being too active, but it should still have a somewhat reliable maximum delay. For example, repeatedly executing some synchronization instruction may work nicely to reduce useless CPU load during spinning (rough sketch below).

Good luck finding any useful and accurate information about busy loops. ("you are doing it wrong, busy wait is always bad!!11!")

(Yes, this is not really directly related to the topic, but because not continuously busylooping on the scanline position improved stability, I want to make it as good as possible. Unfortunately, sub-1ms idle delays aren't reliable and aren't officially supported, at least on Windows.)
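
For illustration of the synchronization-instruction idea above: on x86, the PAUSE instruction (_mm_pause) is a spin-wait hint that backs the core off the bus and eases the load on a sibling hyperthread. My rough sketch of how it would look (an assumption, not tested code):

Code:
#include <chrono>
#include <immintrin.h>  // _mm_pause (the x86 PAUSE spin-wait hint)

// Spin until 'target', but hint to the CPU that we are only spin-waiting,
// so the core backs off instead of hammering the bus at full speed.
void friendly_spin_until(std::chrono::steady_clock::time_point target)
{
    while (std::chrono::steady_clock::now() < target)
        _mm_pause();
}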

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 27 Mar 2018, 14:07

A more precise way is busylooping on the RDTSC instruction (or another high-performance counter), which is microsecond-accurate.
I think he's looking for 'better' ways to do short precision busywaits, including ways that use less than 100% CPU.
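
A rough sketch of the RDTSC busyloop (assumptions: an x86 CPU with invariant TSC, and the thread pinned to one core so the counter never appears to run backwards):

Code:
#include <stdint.h>
#if defined(_MSC_VER)
#  include <intrin.h>    // __rdtsc on MSVC
#else
#  include <x86intrin.h> // __rdtsc on GCC/Clang
#endif

// Busywait for 'ticks' TSC ticks (ticks = seconds * TSC frequency).
// Requires thread affinity; TSCs may disagree between cores on old CPUs.
void rdtsc_spin(uint64_t ticks)
{
    const uint64_t start = __rdtsc();
    while (__rdtsc() - start < ticks) {
        // optionally _mm_pause() here to use less than 100% of the core
    }
}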

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 27 Mar 2018, 16:12

I've posted a followup on another board, which I think is worth crossposting here:
Chief Blur Buster wrote:While it's a boon for emulator users with CRTs, it's equally majorly lag-reducing for LCDs too.

It's compatible with any display (yes, even VRR, if you use the GSYNC+VSYNC OFF or FreeSync+VSYNC OFF simultaneous-combo technique as linked). Since output onto a display cable is, by necessity, a top-to-bottom raster serialization of pixels from a 2D plane (screen data), even DisplayPort micropackets are still raster-accurate serializations. Everything on any display cable is top-to-bottom. We're just piggybacking on the fact that all video outputs on a graphics card still scan out top-to-bottom.

The only thing that really throws it off is a rotated display -- e.g. left-to-right scanout -- but if you've got a left-to-right-scanning emulator (e.g. Galaxian or Pac-Man), then you can even beam-chase left-to-right scanouts too. To enable beam-racing synchronization (sync between emu raster + real raster), you need to keep both scanning in the same direction.
If you've ever used a Leo Bodnar Lag Tester, you know that it has three flashing squares -- Top/Center/Bottom -- to measure the lag of different parts of a display. It measures lag from VBLANK to the square. So the bottom square often has more lag, unless the display is strobed/pulsed (e.g. LightBoost, DLP, Plasma), in which case the TOP/CENTER/BOTTOM squares are equally laggy.

The latency-reduction offsets are similar regardless of what an LCD does. If an LCD (e.g. a fast TN LCD) had 3ms/11ms/19ms TOP/CENTER/BOTTOM input lag on the Leo Bodnar Lag Tester, beam racing makes Top/Center/Bottom equally 3ms on many LCD panels, because you've bypassed the scanout latency by essentially streaming the rasters onto the cable in realtime. When you use a Leo Bodnar on a CRT, it also measures more lag at the bottom edge -- but it's lagless if you do beam-chased output.
So what you see as "11ms" on DisplayLag.com (the CENTER square on the Leo Bodnar Lag Tester) will actually be "3ms" lag with the beam-racing method, because beam racing equalizes the entire display surface to match the input lag of the TOP-edge square on the Leo Bodnar Lag Tester (see...bypassing scanout lag). The lowest value becomes equalized across the entire screen surface.

(Niggly bit for advanced readers: Well, VSYNC OFF frame slices are their own mini lag-gradients, but a 1000-frame-slice-per-second implementation will have only a 1/1000sec = 1ms lag gradient within each frame-slice strip... The more frame slices per second, the tinier these mini lag-gradients become. Instead of one lag gradient for the whole display in the vertical dimension, the lag gradients shrink to individual frame slices, so each frame slice may be (example numbers only) 3.17ms-thru-4.17ms lag apiece, depending on which scanline within the frame slice. This frame-slice lag-gradient behavior was confirmed via an oscilloscope. That said, these tiny lag gradients are much smaller than full-refresh-cycle lag gradients. Not relevant topic matter for most people here, even emulator developers, but I mention this only for mathematical completeness' sake.)

Whatever the Leo Bodnar Lag Tester measures as input lag for the TOP square becomes the input lag of the MIDDLE and BOTTOM when you use beam-raced output. The lag is essentially equalized for the whole display surface. So there is no additional bottom-edge lag when you do beam-raced output, even to LCDs. As many know, Blur Busters does latency tests, and some emu authors have now posted high-speed video proof in the forum thread, so it's validated -- realtime beam racing bypasses the mandatory scanout latency of full-framebuffer implementations.
So beam racing the real-world raster enables up to half a refresh cycle less lag than websites that measure CENTER lag (e.g. the Leo Bodnar CENTER-square method), and up to a full refresh cycle less lag than websites that measure BOTTOM-EDGE lag (e.g. the TomsHardware full-scanout measurement method).

As a result, beam racing produces lower "emulator-pixel-generated-to-real-pixel-photons" latency than the numbers seen in these overly-conservative worst-case latency test results.

This is because of the way websites measure input lag using the VBI-stopwatching technique, e.g. VBI-to-screen-middle. Leo Bodnar measures different numbers for TOP/CENTER/BOTTOM on a CRT -- but it also does so on an LCD, exhibiting the familiar top-to-bottom scanout behavior.

The bottom line is that the real-world latency numbers of VSYNC OFF techniques are actually lower than what you would see in DisplayLag.com or TomsHardware.com display-lag measurements. This is also why Jorim's Blur Busters GSYNC 101 tests (1000fps high-speed-camera lag measurements) in CS:GO sometimes showed lower button-to-pixels lag numbers than the input-lag numbers shown on certain websites, due to different lag-measuring methodologies that are sometimes "VSYNC ON biased".

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Calamity » 27 Mar 2018, 17:29

Chief Blur Buster wrote: It depends on which is the bigger latency issue:
- Slice size adds latency: 6 slices is about 33 percent laggier than 8 slices. So your VSYNC offset savings have to be big enough to more than compensate for this. Basically, the VSYNC offset savings need to be bigger than the slice-thickness difference for it to be worth it.
- You don't need a flush when using a large number of frameslices, like 40 frameslices per refresh (non-HLSL), except for the between-refresh one (e.g. a flush() at the bottom of the screen, right when it enters VBI).
I've tried adding a single flush on the last slice, right in VBI, before it, etc. I can't seem to get a stable solution either way. On the other hand, putting a flush back before each Present() makes everything solid. I've checked that the lower vsync offset more than compensates for the bigger slice size; I'll post some figures when I have a chance.

My current understanding of the issue is that parallelization causes an offset in the flip position that builds up with each slice sent to the GPU. If the number of slices is high enough, it reaches a point where the GPU is always busy when a new slice is sent, forcing the CPU<->GPU system to stabilize. At that point, a manual vsync offset can be applied to reposition the flip back into place. However, if the number of slices is below a critical number, the GPU will be busy or free in an unstable pattern, making it impossible to correct the offset with a fixed value. Breaking parallelization (as much as possible) makes the behaviour deterministic.
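
For reference, this is roughly what the flush looks like in Direct3D 9 terms (a sketch of the standard event-query technique -- the real code may differ): an event query polled with D3DGETDATA_FLUSH pushes the command buffer out and waits for the GPU to drain it, breaking CPU<->GPU parallelism before each Present().

Code:
#include <d3d9.h>

// Force all queued GPU work to complete before continuing.
void flush_gpu(IDirect3DDevice9* dev)
{
    IDirect3DQuery9* q = nullptr;
    if (FAILED(dev->CreateQuery(D3DQUERYTYPE_EVENT, &q)))
        return;
    q->Issue(D3DISSUE_END);
    BOOL done = FALSE;
    while (q->GetData(&done, sizeof(done), D3DGETDATA_FLUSH) == S_FALSE) {
        // spin until the GPU signals the event
    }
    q->Release();
}
// Usage per frame slice: flush_gpu(dev); then dev->Present(...);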

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Calamity » 27 Mar 2018, 17:52

twilen wrote:My assumption is that continuous D3DKMTGetScanLine() calls waste bus bandwidth and/or require some driver locks, stalling other threads.
Yes, whatever the reason behind it, that is true. Constant polling in a loop stresses the GPU badly. That's why it'd be great to have an interrupt-like mechanism.

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 28 Mar 2018, 07:31

Yes, it's a fairly "expensive" API call. Once called, if I get a scanline number far away from my target scanline, I don't need to call it again for a while. Knowing the horizontal scanrate (the rate of .ScanLine incrementing), I can easily extrapolate the values in between -- or forgo the call altogether (listening to VSYNC heartbeats only). I'm dreaming up ideas for a long-term raster-calculation framework.
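
The extrapolation itself is trivial -- something like this sketch (illustrative names, not an actual emulator API):

Code:
#include <chrono>

// Poll the raster register rarely; estimate the scanline in between from
// elapsed time and the known horizontal scanrate.
struct RasterEstimator {
    double scanrate_hz;   // e.g. ~67500.0 for 1080p60 (scanlines per second)
    int    total_lines;   // vertical total incl. blanking, e.g. 1125
    int    last_scanline; // scanline returned by the last real poll
    std::chrono::steady_clock::time_point last_poll;

    // Estimated scanline right now, without touching the expensive API.
    int estimate_now() const {
        std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - last_poll;
        auto advanced = static_cast<long long>(dt.count() * scanrate_hz);
        return static_cast<int>((last_scanline + advanced) % total_lines);
    }
};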

...Now what I'm doing is trying to refine the VSYNC timestamp dejittering algorithm in my projects (I need it for multiple ongoing projects anyway, so this is important to me)...

Right now, I'm experimenting with VSYNC timestamp dejittering formulas. The ones used on vsynctester.com / testufo.com are fairly complicated dejitterers, and I'm trying some simpler averaging algorithms first, but it looks like there are pros/cons.

So it looks like I might include a straight C# port of the vsynctester.com JavaScript dejitterer routine (with permission of its creator, Jerry of Duckware), since it's extremely hard to beat that vsync-timestamp dejittering algorithm in a CPU-efficient manner. It can calculate refresh rates to an accuracy of several decimal digits (e.g. "120.02316 Hz") in highly noisy/jittery/framedropped JavaScript -- sufficiently accurate VSYNC timestamps for beam racing without a raster register. Even as we validate that this technique actually works with the (relatively) easy RasterStatus.ScanLine or D3DKMTGetScanLine() (...the KMT one is superior because it works with OpenGL / D3D9 / D3D10 / D3D11), my focus is on avoiding those altogether for cross-platformness.

There are millions of ways to do it, but "good" + "CPU-efficient" + "simple open source code" is a hard combination for a VSYNC timestamp dejittering algorithm.
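
To show the shape of the problem, here's about the simplest thing that can work -- a little phase-locked-loop-style filter (this is NOT the vsynctester.com algorithm, just a bare-bones stand-in; it ignores dropped frames for brevity):

Code:
// Feed raw (noisy) VSYNC timestamps in seconds; read back smoothed ones.
// Small gains mean one jittery sample barely moves the estimate.
struct VsyncDejitter {
    double period = 1.0 / 60.0;  // running refresh-period estimate (seconds)
    double next   = 0.0;         // predicted timestamp of the next VSYNC
    double gain   = 0.05;

    double feed(double raw) {
        if (next == 0.0) { next = raw; return raw; }  // first sample: lock phase
        double predicted = next + period;  // where we expected this VSYNC
        double error = raw - predicted;    // jitter of this sample
        predicted += gain * error;         // nudge phase slightly toward raw
        period    += gain * 0.1 * error;   // nudge period even more gently
        next = predicted;
        return predicted;                  // dejittered timestamp
    }
};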