Emulator Developers: Lagless VSYNC ON Algorithm

Talk to software developers and aspiring geeks. Programming tips. Improve motion fluidity. Reduce input lag. Come Present() yourself!
Chief Blur Buster
Site Admin
Posts: 6486
Joined: 05 Dec 2013, 15:44

Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 16 Mar 2018, 21:52

Advanced Post For Emulator Developers

Update: Four emulator authors actively chatting in subsequent pages of this thread

SKIP this post if you aren't a programmer/developer or are unfamiliar with raster scan line operations. This post is aimed at emulators of 8-bit-era consoles and computers that use raster-accurate emulation.

An algorithm for a tearingless VSYNC OFF that creates a lagless VSYNC ON

An approximate synchronization -- a "beam racing" implementation -- of the emulated raster ahead of the real-world raster, with a forgiving jitter margin (<1ms).

Essentially, it is a de facto rolling-window scanline buffer achieved via high-buffer-swap-rate VSYNC OFF (redundant buffer swaps), with the emulator raster scanning ahead of the real-world raster. A sufficient padding margin absorbs performance jitter, making scanline-exact sync unnecessary -- just approximate raster sync (within ~0.1ms to ~0.2ms is practical with C/C++/asm programming).

OPTION 1: Some platforms have raster-polling functions (e.g. RasterStatus.ScanLine as well as D3DKMTGetScanLine ...) that report the real-world raster position.
OPTION 2: Other platforms may need to extrapolate based on time intervals between VSYNC heartbeats. This can be done very accurately with good math formulas and VBI-size awareness. Video: https://www.youtube.com/watch?v=OZ7Loh830Ec
Either way, this can be used to synchronize the emulator's raster to the real one within a rolling-window margin.

Simplified diagram: [image]

Long Explanation:
https://www.blurbusters.com/blur-buster ... evelopers/

The diagram shows a 1/5th-frame version, but in reality frame slices can be as tiny as one or two scanlines tall, computer performance permitting.

Tests show that I can successfully do roughly 7,000 redundant buffer swaps per second (about 2 NTSC scanlines each) of 2560x1440 framebuffers on a GTX 1080 Ti -- in C#, a high-level garbage-collected language.

With proper programming technique, nearly 1:1 sync of the real raster to the virtual raster is now possible. But exact sync is unnecessary thanks to the jitter margin, and on platforms with no raster-scanline register the position can be computed (to within a ~0.5-1ms margin) by extrapolating between VSYNC timings.

  • Only ~10% overhead added to CPU.
  • Tearing-less VSYNC OFF: Lagless VSYNC ON
  • As long as the emulated raster stays ahead of the real raster, the black part of the frame never appears
    (or just use previous frame in place of black, to reduce artifact risk)
  • Same number of pixels per second.
  • Still emulating 1:1 emulated CPU.
  • Still emulating the same number of emulator rasters
  • Still emulating the same number of emulator frames per second (60 fps).
  • Performance can jitter safely in the frame-slice height area, so perfect raster sync not essential.
  • Lagless VSYNC ON achieved via ultra-high-buffer-swap-rate VSYNC OFF
  • It’s only extra buffer swaps mid-raster (simulating a rolling-window buffer)
Even C# was able to do this within a ~0.2-0.3ms jitter margin, except during garbage-collection events (which cause a brief momentary surge of tearing artifacts).

But C++, C or assembler would have no problem, and could probably do it within <0.1ms -- permitting sync within +/- 1 scanline of NTSC (15.6 KHz scan rate). Input lag will be roughly equivalent to 2x the jitter margin you choose; if your performance is good enough for 1-emulated-scanline sync, that's literally 2/15625 second of input lag (less than 0.2ms). All of this is achievable with standard Direct3D or OpenGL APIs during VSYNC OFF operation, on platforms that give you access to poll the graphics card's current-raster register. The jitter margin can be automatic-adaptive or set in a configuration file.


This can be made compatible with HLSL (though a larger jitter margin will be essential, e.g. 1ms granularity rather than 0.1ms) since you're forcing the GPU to continually reprocess the HLSL. As a performance optimization, one could modify the HLSL to only process a frameslice's worth at a time, but I'd imagine this would be a difficult rearchitecturing.

Most MAME drivers emulate graphics in a raster-based fashion, and would be compatible with this lagless VSYNC ON workflow, with some minor (eventual, long-term) architectural changes to provide the necessary hooks for mid-frame busy-waiting on the real-world raster plus mid-frame buffer swaps.

To read more about this algorithm and its programming considerations, see https://www.blurbusters.com/blur-buster ... developers
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter

       To support Blur Busters:
       • Official List of Best Gaming Monitors
       • List of G-SYNC Monitors
       • List of FreeSync Monitors
       • List of Ultrawide Monitors

User avatar
Chief Blur Buster
Site Admin
Posts: 6486
Joined: 05 Dec 2013, 15:44

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 16 Mar 2018, 22:32

Beam-chasing (strip rendering) is being used in virtual reality in some cases:
https://www.imgtec.com/blog/reducing-la ... rendering/
(Thanks RealNC for the heads up)

RealNC
Site Admin
Posts: 2822
Joined: 24 Dec 2013, 18:32

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by RealNC » 16 Mar 2018, 23:18

MPV (the media player) is doing something related for video playback. They do redundant buffer swaps in order to combat clock drift. They have an article:

https://github.com/mpv-player/mpv/wiki/ ... ronization

The last section at the bottom ("display-sync") explains what MPV does by default.

Another thing to take from that article is that getting the raster position is impossible in OpenGL? Not sure about Vulkan.

In any event, the "redundant buffer swaps" approach is useful for video playback too in order to eliminate jitter due to clock drift. Of course in this case it's tens of redundant flips per second rather than thousands, and it's vsync on, not vsync off. As I said, it's related, not the same, but might be of interest.
Twitter • Steam • GitHub • Stack Overflow
The views and opinions expressed in my posts are my own and do not necessarily reflect the official policy or position of Blur Busters.

Chief Blur Buster
Site Admin
Posts: 6486
Joined: 05 Dec 2013, 15:44

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 16 Mar 2018, 23:50

Good stuff... that approach is not raster-accurate, but it does serve a purpose.
RealNC wrote:Another thing to take from that article is that getting the raster position is impossible in OpenGL?
-- You can use a separate process to poll the Direct3D raster, while still using OpenGL for everything
-- You can use graphics-card proprietary APIs (if available)
-- You can approximate it via time interval between vblanks. (accuracy can still be <1ms)

For the approximation approach, there are already standard Windows API calls (graphics-card independent) to poll for the Vertical Total and the Horizontal Scan Rate, so you can use this information to estimate the exact scanline from microsecond-accurate VSYNC-to-VSYNC timestamps (or even from averaging -- like vsynctester and testufo.com/refreshrate do accurately on Windows systems). On Linux systems, you can get the data from the modeline.

You'd simply have to find a VSYNC or VBI heartbeat to listen to (whether it's a flag, an event, or the timing of the return of a buffer-swap call) -- then you can extrapolate an approximate or exact scanline from that timing. The jitter margin will mop up any estimation errors.

RealNC
Site Admin
Posts: 2822
Joined: 24 Dec 2013, 18:32

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by RealNC » 17 Mar 2018, 00:41

I came across an interesting thing while looking at one of the Vulkan samples provided by the Khronos Group:

https://github.com/KhronosGroup/Vulkan- ... w_opengl.c

This was submitted by Oculus, and there's an interesting note in there:
WORK ITEMS
...
- Implement an OpenGL extension that allows rendering directly to the front buffer.
If Vulkan could be used to do front-buffer rendering outside of VR, that would be extremely useful.

I believe you have some contacts at Oculus? You might want to ask them why front-buffer rendering was "locked down" and only allowed in VR, rather than being made available in general. Emulators would benefit a lot from it. Raspberry Pi emulation setups are very popular these days, and Vulkan is becoming more and more usable in the open-source Linux graphics stack. The Pi is quite a weak machine, and the "lagless vsync on" algorithm is probably too much for it; front-buffer rendering should provide much better performance.

Chief Blur Buster
Site Admin
Posts: 6486
Joined: 05 Dec 2013, 15:44

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 17 Mar 2018, 16:37

YES!
We need front buffer rendering again.

The beam-racing rendering would be much better with front-buffer rendering.

You'd simply rasterplot your cycle-exact or line-exact emulator in a beam-racing fashion, directly to the front buffer, and call it a day.

Until recently, 0.1ms beam racing for lagless VSYNC ON wasn't possible for emulators.

I'm going to try to release an open-source demo of real-time beam racing using the MonoGame engine in C#. Yes, C#. (It needs a DLL to call RasterStatus.ScanLine or D3DKMTGetScanLine() ... but you can just as easily extrapolate the raster from VSYNC heartbeats.) That's what my test was written in -- and beam racing even works in C#, with a <0.5ms beam-racing margin, as a proof of concept.

MonoGame is a Nerf kindergarten game engine -- the kind you teach to high-school/university students writing their first simple game. With it, I managed to hit 7,000 frames per second in C# at 2560x1440, permitting <1ms-tight beam racing! If I can demonstrate it in a nerf programming language, it's already possible in languages less ideal than the ones emulators use.

Some emulator authors don't believe reliable beam racing is possible, but it is now -- with a creative jitter margin (making raster-exact sync unnecessary).

Background processes and garbage-collection events occasionally (about once a minute) cause the beam to fall behind (a brief surge of tearing), but the beam racing instantly catches up. C/C++/asm won't have garbage collection, and with admin + realtime priority you could get as tight as 0.1ms beam racing -- possibly near-raster-exact beam racing on NTSC emulation (15,625 scan lines per second = 15,625 frames per second of 320x240 frame buffers). If I can do 7,000fps at 2560x1440, I'm sure they can do 15,625fps at 320x240 -- line-exact beam racing on NTSC emulators (with maybe just a 2-line jitter margin)!

Emulator authors can surely do better in C, C++ and assembler with low-rez 320x240 frame buffers -- yes, even in higan and cycle-exact emulators that use >70% CPU. You get the original device's lag mechanics (including scanout asymmetries) in a software emulator connected to a CRT. A very preservationist-friendly algorithm.

(A consideration, since some cards no longer have VGA outputs: external cable-adaptor lag -- such as DP-to-VGA -- is another story altogether. But you have to use adaptors anyway with many FPGA emulators and MAME setups driving arcade CRTs, no matter what emulator. Use as simple and direct an adaptor as possible for lowest lag; a good DVI-D-to-VGA (with actual D/A conversion) can have only microseconds of lag.)

Calamity
Posts: 24
Joined: 17 Mar 2018, 10:36

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Calamity » 17 Mar 2018, 18:00

Hi Mark,

I just saw your issue on MAMEdev's GitHub. Well, I already have a working implementation of the method you describe -- I coded it last month as an experimental extension of GroovyMAME. I was planning to publish it with a proper video proving current-frame response in an emulator for the first time, but unfortunately real-life issues are delaying that release. I also shared it with Oomek so he could play with it, but I have had no feedback since.

I won't dispute the "invention" with you -- I guess these ideas are just floating around. Anyway, I'm sharing my build so you can see it really works.

https://www.dropbox.com/s/2chq2l29wujuu ... ce.7z?dl=0

I'm on a trip right now and won't be back till Tuesday; I'll share the source then (I'm writing this from a phone). I can't remember if I built that one based on MAME 194 or 195.

Relevant options are:

-frame_slice (0-9): 0 is off, 1-9 divides the screen in n+1 slices.

-vsync_offset: number of lines to synchronize the raster before the target position, required to account for the time it takes to render a frame. Usually a low number of lines on a fast card.

-syncrefresh: required for obvious reasons

-monitor lcd: if you're running this on an LCD

If you press F11 you can see the slices with a color filter.

A good starting point is -frame_slice 4. Press F11 to see the color bands, then play with -vsync_offset until the bottom band doesn't overflow onto the top.

Some drivers in MAME are not ready for this patch, however -- not all are scan-accurate yet. You'll notice tearing and palette glitches on some drivers; others run great.

I'll post more about this when I'm back.

Chief Blur Buster
Site Admin
Posts: 6486
Joined: 05 Dec 2013, 15:44

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 17 Mar 2018, 19:02

Fantastic to hear from you!

Thanks for this simultaneous-invention prior art. I don't want to dispute it either (maybe we can share credit for the simultaneous invention?) -- but I'd love Blur Busters to help give this publicity, given that we are input-lag nuts and display nuts with a CRT heritage. Do you want to co-brand some demonstration open-source software together? We could merge knowledge and experience to deploy this more rapidly into the emulator community -- even a RasterScanLine.(dll|lib|so|etc) library that encapsulates real raster polls and VBI-size calculators, to approach nearly-line-accurate beam chasing. Open to any brainstorm...

I have not checked your build yet, but I have a suggested improvement:

One tricky challenge is that you need to begin plotting the emulator raster before RasterStatus.ScanLine begins incrementing (frustratingly, it stays at a fixed value during the VBI). The VBI length can be estimated by timing how long RasterStatus.InVBlank stays true (in microseconds). This is microsecond-exact, and it becomes your starting pistol for beam racing the emulator raster.

But I have come up with a better method: use the WinAPI QueryDisplayConfig() to get the exact horizontal scan rate AND the Vertical Total, then subtract the vertical resolution to get the size of the VBI in scanlines. From these two numbers (VBI size divided by horizontal scan rate), you get the exact VBI time to the sub-microsecond on any graphics card. Knowing your computer's exact VBI size to the microsecond (fortunately via standard Windows API -- it works!) makes it possible to begin beam racing only a few scanlines ahead of the real raster. Incidentally, for platforms that don't give you a raster register but do give you a VBI heartbeat, this successfully extrapolates the predicted raster value (to single-NTSC-scanline accuracy) from just (A) the Vertical Total, (B) the vertical resolution, and (C) the VSYNC heartbeat. This makes 100 or 200 frameslices possible instead of 9.

Also: VBI size apparently doesn't matter, and scaling doesn't matter (as long as you convert values accordingly, to keep the same relative physical margin). Beam racing works fine with scaling (e.g. line #540 of 1080p corresponding to line #120 of 240p), and if you want border effects, just scale a little differently. The important thing is that the emulated raster stays physically below the real raster, however you decide to map out the emulator layout (HLSL effects, raster fuzz effects, border effects, whatnot).

There may be slight divergences in lag linearity if the VBI-to-active ratio of the emulator versus the real world differs, but typically this is hundred-microsecond-scale stuff. Besides, you can use CRU (Custom Resolution Utility) to make sure the VBI keeps a 480:525 ratio at whatever higher resolution you want (e.g. VT1181 for 1080p, which many newer LCD monitors will sync to, since 1080:1181 is almost identical to the 480:525 active-to-total ratio of NTSC) -- that is, if you indeed want an exact-ratio VBI (but that's cherry-picking microseconds at this stage; at least you have the option). Regardless, the beam-chasing algorithm is VBI-size-independent; any VBI-time-ratio differences only introduce minor vertical lag-gradient nonlinearities (on hundreds-of-microseconds timescales) between the emu "signal" and the real signal.
-frame_slice (0-9): 0 is off, 1-9 divides the screen in n+1 slices.
Nice! But only 9 slices?

My rudimentary redundant-buffer testing shows it should be possible to have 100-frameslice granularity on some fast systems (e.g. an i7-7740X with a 1080 Ti) -- even in garbage-collected C#!

Possibly even single-scanline-tall frame slices (at 240p) if you're running in realtime/admin priority using C/C++/asm and low-rez framebuffers, as current NVIDIA/Radeon GPUs can do 15,625 buffer swaps per second on a 320x240 or 640x480 VSYNC OFF buffer with little overhead. It's neat that we can now blit a whole frame buffer within the time interval of a single NTSC scanline. Maybe overkill, but at least give configurable flexibility all the way down to single-scanline slices for super-tight beam chasing! :D

Obviously this requires much more precision, but modern computers now have 10M-tick-per-second precision clocks, far more than accurate enough for beam-chasing 8-bit platforms. I imagine such precision may not always be easy to slap onto an existing emulator, but we could figure out how to get the precision you need, including the ultra-accurate VBI-size calculation -- e.g. a hook that fires on completion of every MAME scanline, letting the beam-chaser module decide slice granularity (even all the way down to single-scanline slices).

Also, the WinUAE author (who is always eager about Blur Busters ideas, since all of the ideas I've given him have worked -- e.g. software black frame insertion) has confirmed he will be working to implement beam chasing into WinUAE in the coming months.

Let's discuss this concept more, because I really think 2018 is the year widespread real-vs-emu raster synchronization finally arrives in emulators.

Calamity
Posts: 24
Joined: 17 Mar 2018, 10:36

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Calamity » 18 Mar 2018, 02:39

Hi Mark, I'll post back later if I have a chance, anyway I beg you not to harass MAMEdev about this, it won't work.

Calamity
Posts: 24
Joined: 17 Mar 2018, 10:36

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Calamity » 18 Mar 2018, 13:23

Hi Mark,

I didn't find that knowing the VBI exactly was critical for the implementation. Just aligning slice #0 with vblank start works well enough for me; the following slices are then aligned to visible scanlines, for which the system returns valid values. It's not a problem for GM anyway, as we are the ones dictating the video timings, so they're already known beforehand. That's for AMD, at least. As you probably know, AMD, NVIDIA and Intel all return scanline numbers in a different way, Intel being the only specification-compliant one. GM accounts for this.

What's certainly critical is to properly account for borders and scaling -- done right, it's perfect. Besides, rendering a frame still takes a macroscopic time on the cards I've tested, so you need to determine a valid offset to synchronize before the actual line; otherwise you get as many tearing lines as you have slices.

Why 9? It was fine for a proof of concept, which was the purpose of the implementation. Obviously you could add support for more, as many as you want -- though there's a physical limit imposed by the offset explained above.

However, it's not the same thing to push 1,000 frames in a vacuum as doing it in the real emulator. For each frame you push, you must run a full iteration of the emulator loop in order to get the new input and process it, etc. -- otherwise all the effort wouldn't make sense. This has an important overhead that prevents slicing more and more indefinitely.

The most interesting thing I've found is that, for some reason I don't quite grasp, GPU jitter is reduced when the GPU is under load. I can't get this thing to be fully stable with 1, 2 or 3 slices; however, as soon as I use 4 or more slices, the jitter stabilizes and the bands stick nicely in place. This happens with different cards and brands, so there must be a fundamental reason for it.

I'd say I have a reasonable knowledge of MAME's source base, and it took me a good couple of weeks to get this thing right. It requires some non-trivial rewiring of the screen-emulation layer -- a part I'm not as comfortable messing with as other things we touch in GM, since it directly affects the emulation and can (and does) introduce the most unexpected glitches. That's why it won't be accepted by MAMEdev. Even if you coded an acceptable implementation, you'd still need to fix the individual drivers one by one.

I'll follow up later; it's a pain to write from the phone.

Post Reply