Emulator Developers: Lagless VSYNC ON Algorithm

Talk to software developers and aspiring geeks. Programming tips. Improve motion fluidity. Reduce input lag. Come Present() yourself!
Calamity
Posts: 24
Joined: 17 Mar 2018, 10:36

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Calamity » 24 Mar 2018, 04:55

Chief Blur Buster wrote:Nice, the lag is what I expect:
That said, if graphics card performance allows, I'd go 20 slices per refresh cycle (maybe more). That reduces input lag by half and will probably be quite dramatic for high speed video demonstrations.
I'll implement the ability to add more slices soon and see what the actual limit is with this (relatively old) hardware.
* Yes. REALLY. Yes, you're controlling the display's exact timing of refresh cycles -- when a display is in variable refresh rate mode. The display is actually idling for YOU and really does begin its scanout when you Present()
I would have thought that internally the monitor was not actually idling but looping through the last frame at 144 Hz (I know real idling is perfectly possible on an LCD, I just doubted they had been this audacious).

twilen
Posts: 8
Joined: 25 Jan 2014, 13:27

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by twilen » 24 Mar 2018, 06:06

Chief Blur Buster wrote:Toni tells me he already has a functioning WinUAE internally with this too. He said it was easier than he thought.
Yeah, I got it working a few days ago, but it was a very quick and ugly D3D11-only hack. It was easy because the emulation already rendered partial frames to an internal buffer and then synced with real time (while doing extra CPU emulation in fast CPU mode), and repeated.

Today I replaced the first ugly implementation with a less ugly D3DKMTGetScanLine() (+ QueryDisplayConfig()) combination that works with both D3D11 and D3D9 and does not need render-backend-specific hacks. The number of slices is configurable; tested up to 20 without major problems. (http://eab.abime.net/showthread.php?t=88777 end of thread)

I still have one unexplained problem where Present() still randomly shows a few lines from the previous buffer at the top of the screen. D3DKMTGetScanLine() always returns InVerticalBlank set immediately before the Present() call, so it looks like something randomly stalls inside Present() for some reason. G-Sync on or off makes no difference.

Previously this happened all the time; it went almost completely away when I entered and exited the NVidia control panel, without changing anything. (1080Ti + Acer Predator X34)
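For anyone following along, the timing side of this can be sketched portably. A minimal illustration (hypothetical names, not actual WinUAE code), assuming a fixed horizontal scan rate: the raster position is a linear function of time within the refresh cycle, so the current scanline can be estimated from elapsed time alone -- the kind of estimate a D3DKMTGetScanLine() poll then confirms or corrects.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: estimate the scanline currently being output, given
// the time since the refresh cycle started, the refresh period, and the
// vertical total (visible + blanking scanlines, as reported by the mode's
// timing data). All names here are illustrative.
int64_t estimate_scanline(int64_t us_since_refresh_start,
                          int64_t refresh_period_us,
                          int64_t vertical_total)
{
    // Wrap into the current refresh cycle, then scale linearly:
    // the horizontal scan rate is constant across the whole cycle.
    int64_t t = us_since_refresh_start % refresh_period_us;
    return (t * vertical_total) / refresh_period_us;
}
```

For example, halfway through a ~60 Hz refresh (about 8.3 ms in) with a vertical total of 1125 lines, this lands near scanline 562.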
Acer Predator X34, GTX1080Ti, Asus Maximus Hero VIII, i7-6700k @ 4.5GHz, 16G ...

Calamity
Posts: 24
Joined: 17 Mar 2018, 10:36

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Calamity » 24 Mar 2018, 14:16

twilen wrote:I still have one unexplained problem where Present() still randomly shows a few lines from the previous buffer at the top of the screen. D3DKMTGetScanLine() always returns InVerticalBlank set immediately before the Present() call, so it looks like something randomly stalls inside Present() for some reason. G-Sync on or off makes no difference.
Hi Toni,

Yes, that's normal. In order to remove that, you need to implement a configurable vsync offset, i.e. call Present() for the bottom slice a few lines before vblank to account for the time the GPU takes to render a frame. If you watch my previous videos, you'll see I adjust the vsync offset differently for the raw and HLSL cases, because the GPU needs more time in the HLSL case.
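The amount of GPU headroom a vsync offset buys is easy to compute from the horizontal scan rate: presenting N scanlines early gives the GPU N line-periods of extra render time. A sketch with illustrative names (67.5 kHz is just a typical 1080p60 scan rate, not anything from GroovyMAME):

```cpp
#include <cassert>
#include <cmath>

// Sketch: microseconds of GPU render headroom bought by a vsync offset of
// N scanlines. One scanline takes 1/(horizontal scan rate) seconds, so at
// 67.5 kHz each line of offset is worth about 14.8 microseconds.
double vsync_offset_us(int offset_lines, double hscan_rate_khz)
{
    return offset_lines * (1000.0 / hscan_rate_khz);  // us per line = 1000/kHz
}
```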

Chief Blur Buster
Site Admin
Posts: 6676
Joined: 05 Dec 2013, 15:44

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 24 Mar 2018, 14:36

I did notice some extra jitteriness for the topmost tearlines in my beam chasing experiments, so there is definitely something wonky going on. I'm not sure what causes it, but it is easily hidden by a slight offset that costs less than a millisecond of lag.

Great to hear of hugely successful lag-reducing experiments in early tests in two emulators (GroovyMAME test and WinUAE test).

My module will also have an adjustable VSYNC offset, and its VSYNC de-jitterer can also help with raster prediction compensation. I'll try to have it released within 2 weeks (or less), as it is important for a different project.
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter

       To support Blur Busters:
       • Official List of Best Gaming Monitors
       • List of G-SYNC Monitors
       • List of FreeSync Monitors
       • List of Ultrawide Monitors

Chief Blur Buster
Site Admin
Posts: 6676
Joined: 05 Dec 2013, 15:44

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 24 Mar 2018, 18:53

Calamity wrote:
* Yes. REALLY. Yes, you're controlling the display's exact timing of refresh cycles -- when a display is in variable refresh rate mode. The display is actually idling for YOU and really does begin its scanout when you Present()
I would have thought that internally the monitor was not actually idling but looping through the last frame at 144 Hz (I know real idling is perfectly possible on an LCD, I just doubted they had been this audacious).
Yup, they were that audacious.
The display is actually waiting for your software.

That's how random framerates look smooth (see the http://www.testufo.com/vrr software-interpolated VRR demo -- I developed that motion demonstration as well) -- the display and the frame rate are essentially in perfect sync. Random framerates at a random refresh rate, staying in (near) perfect sync, so objects are still exactly where you expect them to be as they move across your screen.

I keep trying to explain to disbelieving software developers that this is REALLY what happens.
Yes, REALLY, your display is idling, waiting for you to Present() or glutSwapBuffers().

From a video cable point of view, the graphics card is simply scanning out "one more last VBI scanline" in a nonstop manner (at the same horizontal scan rate, which is unchanged). Once you Present(), the next scanline is Scanline #1 of your new refresh cycle.

So the blanking interval automatically extends over and over (dynamic-sized Vertical Back Porch) until Present() or glutSwapBuffers(). This is what FreeSync does, this is what HDMI 2.1 VRR does, this is what VESA AdaptiveSync does.
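The dynamic-sized back porch can be made concrete with a toy model (hypothetical names, a sketch rather than what any driver literally does): the card keeps emitting VBI padding scanlines at the fixed line rate until Present() arrives, so a late frame simply means extra padding lines in that cycle.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the dynamic back porch: how many extra VBI padding scanlines
// get emitted in a cycle when the frame arrives later than one minimum
// refresh period. Names are illustrative.
int64_t extra_vbi_lines(int64_t frame_interval_us,
                        int64_t min_refresh_period_us,  // e.g. ~6944 us at 144Hz
                        double line_period_us)
{
    if (frame_interval_us <= min_refresh_period_us)
        return 0;  // frame arrived within one normal refresh period
    return (int64_t)((frame_interval_us - min_refresh_period_us) / line_period_us);
}
```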

And a few of us managed to get FreeSync working on a MultiSync CRT (via a ToastyX tweak to enable FreeSync over HDMI, plus an HDMI-to-VGA adaptor), because this is a very gentle way of varying a refresh rate -- done in a way that is backwards compatible with analog signal standards. Because the horizontal scanrate remains unchanged, many MultiSync CRTs don't even do their usual refresh-rate-change blankout (depending on what they trigger on), and all you see is a dynamically-variable-rate flicker on your CRT, with zero stutter with varying-framerate motion. Just like an LCD GSYNC monitor, except you're doing variable refresh rate on a raster-scanned MultiSync CRT. The fewer "firmware cops" the MultiSync CRT has (refresh-rate-change blankout electronics), the more successfully it works with a FreeSync signal. (FreeSync, HDMI VRR, and VESA AdaptiveSync use the same protocol, so they adapt just fine into each other, sometimes with minor EDID override modifications, etc. -- only G-SYNC is somewhat different, but its raster behaviour is still similar: the scanline counter only begins incrementing when you Present().)

Here is a scan-out diagram of G-SYNC doing 100 frames per second at 144 Hz.

[Image: scan-out diagram of G-SYNC at 100 fps on a 144 Hz display]

As you can see, during 144Hz GSYNC (where the range is 30Hz-144Hz), the intervals between VBI beginnings can vary between 1/144sec and 1/30sec between Present() calls. So if you Present() only every 1/100sec, that's when each scanout begins.

-- If you Present() or glutSwapBuffers() too early, it gets the alternate treatment (e.g. VSYNC ON, VSYNC OFF, FastSync) that is configured as the fallback.

-- If you Present() or glutSwapBuffers() too late, the drivers/display will automatically repeat the last refresh cycle so you cannot go below that Hz.

-- If you Present() or glutSwapBuffers() on time -- i.e. within the VRR range, e.g. between 1/144sec and 1/30sec after your last Present() call -- then you are effectively the master of your very own, personal software-timed refresh cycle.
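Those three cases boil down to a tiny decision rule. A sketch only (the real too-early fallback is driver-configurable, and 30Hz-144Hz is just the example range from the diagram above):

```cpp
#include <cassert>

// Sketch of the three Present() timing cases on a 30-144Hz VRR display.
enum class PresentOutcome { Fallback, RepeatedRefresh, SelfTimed };

PresentOutcome classify_present(double interval_ms)  // time since last Present()
{
    const double min_interval_ms = 1000.0 / 144.0;  // fastest allowed: 1/144 s
    const double max_interval_ms = 1000.0 / 30.0;   // slowest allowed: 1/30 s
    if (interval_ms < min_interval_ms)
        return PresentOutcome::Fallback;         // too early: VSYNC ON/OFF/FastSync fallback
    if (interval_ms > max_interval_ms)
        return PresentOutcome::RepeatedRefresh;  // too late: panel already self-refreshed
    return PresentOutcome::SelfTimed;            // you timed this refresh cycle
}
```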

The most complicated part of VRR is making sure 30Hz looks exactly like 144Hz. Ghosting-wise, overdrive-wise, gamma-wise. LCD pixels can decay differently. GtG can overshoot more or less. So manufacturers come up with very complex variable-refresh-rate dynamic overdrive algorithms to try to make 30Hz as identical as possible to 144Hz, so framerates can randomly spray all over the place without any noticeable color effects or spikes in bright ghosts (e.g. coronas). It's all technically challenging.

Generally (most of the time), NVIDIA has been better at overdriving VRR, but I've seen some really improved FreeSync lately too. One big improvement AMD has made is the certification process. FreeSync 2 certification is AMD applying more rigorous quality rules to a VESA Adaptive-Sync panel before manufacturers are allowed to call it FreeSync 2. There's no cable difference between FreeSync, HDMI 2.1 VRR, and VESA AdaptiveSync -- at least within their venn diagram of compatibility (e.g. 8-bit 1920x1080p) -- but FreeSync 2 also becomes a stamp of quality approval on what would normally be sold as a generic unbranded VESA Adaptive-Sync display. So, if you had to choose between a random FreeSync 2 display and a random VESA AdaptiveSync display, obviously steer toward the FreeSync 2 display, since it means it got tested by a certification lab and passed the criteria (like a Dolby lab, or whatever). NVIDIA had a tighter leash on GSYNC, but AMD is improving quality control, so I'll give them kudos -- it's sometimes cheap Chinese manufacturers releasing generic VRR panels and slapping the FreeSync label on them without AMD's permission (or only minimally passing the criteria), and sometimes you see an overdriven mess (really bad ghosting) during variable refresh rate operation.
The spec is essentially free to use, but certification services (a laboratory) are not free, so that's partly why there has been a much bigger spread of best-versus-worst when it comes to FreeSync than when it comes to GSYNC. At least historically. That's why some people are willing to pay the "GSYNC tax", due to the consistency. Either way, I am a big fan of both technologies; it's just helpful to understand the complications that VRR has foisted upon the display industry.

NOTE: Drivers also have a Low Framerate Compensation (LFC) feature which you may have heard of. It is simply a predictive repeat-refresher (to prevent well-framepaced attempted deliveries of new refresh cycles from occurring while a panel is still scanning out). So if it detects frame rates that run well below the minimum VRR rate, it will trigger predictive repeat-refreshes early, running at a perfect 48Hz to play 24fps material. Or a perfect 40Hz to play 20fps material. Or a perfect ~30Hz to do 10fps. Etc. By repeating refresh cycles early, the display is idling precisely at the exact time of the next frame. So it really is essentially behaving like a 24Hz display when you play 24fps movies in a VRR-compatible app (game, video player like SMplayer, etc); the drivers are simply repeat-refreshing at 1/48sec intervals rather than 1/30sec (where the panel might still be scanning out and colliding with the attempted delivery of a new frame shortly after the 1/30sec interval). So if you frame-pace your Present() events 1/24sec apart, the drivers intelligently force a repeat refresh cycle exactly in between the two, unbeknownst to your app. But this is only important to know for low framerates. This doesn't occur for 50Hz or 60Hz material, so this stuff isn't important for emulators, but I mention it as "technology background" stuff.
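The LFC arithmetic behind those 48Hz/40Hz/30Hz figures is just the smallest integer multiple of the source rate that lands back inside the VRR range. A sketch with a hypothetical helper name:

```cpp
#include <cassert>

// Sketch of the LFC idea: when the source frame rate falls below the panel's
// minimum VRR rate, repeat each refresh at the smallest multiple of the
// source rate that reaches the minimum. 24 fps on a 30-144Hz panel becomes
// 2 x 24 = 48Hz; 10 fps becomes 3 x 10 = 30Hz.
int lfc_multiplier(double source_fps, double min_hz, double max_hz)
{
    if (source_fps >= min_hz)
        return 1;  // already inside the VRR range, no LFC needed
    int m = 2;
    while (source_fps * m < min_hz)
        ++m;  // smallest integer multiple that reaches the minimum
    return (source_fps * m <= max_hz) ? m : 1;
}
```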

RealNC
Site Admin
Posts: 2900
Joined: 24 Dec 2013, 18:32

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by RealNC » 25 Mar 2018, 09:19

(Nit-pick: It's MPV, not SMPlayer. SMPlayer is just a popular front-end for MPV, and there's other front-ends for it too.)
The views and opinions expressed in my posts are my own and do not necessarily reflect the official policy or position of Blur Busters.

Chief Blur Buster
Site Admin
Posts: 6676
Joined: 05 Dec 2013, 15:44

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 25 Mar 2018, 09:24

RealNC wrote:(Nit-pick: It's MPV, not SMPlayer. SMPlayer is just a popular front-end for MPV, and there's other front-ends for it too.)
Thanks for the correction!

Calamity
Posts: 24
Joined: 17 Mar 2018, 10:36

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Calamity » 27 Mar 2018, 05:33

(EDIT: I concur -- raster prediction becomes less reliable with longer intervals between tearlines. It might be some form of background processing in the graphics card or something, but a flush call might solve the problem.)
Indeed, you were right. I've added a flush before Present(), and now I can reliably do 1, 2, 3, 4, etc. slices. The tearline between slices is rock solid even on my laptop. BUT (a big but), this reduces performance greatly. E.g., where before I could do 8 slices with HLSL, now I can only do 6. However, the better performance obtained before was caused by GPU parallelization, which is undesired: it forced me to use higher vsync offset values, while now I can reduce those to nearly zero (I still need a few lines though). So even if performance is lower, latency must also be lower now, as the vsync offset was a latency tax. Now it's a matter of investing in a more powerful GPU (my R9 270 is aging).

Chief Blur Buster
Site Admin
Posts: 6676
Joined: 05 Dec 2013, 15:44

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 27 Mar 2018, 06:02

Calamity wrote:Indeed, you were right. I've added a flush before Present(), and now I can reliably do 1, 2, 3, 4, etc. slices. The tearline between slices is rock solid even on my laptop. BUT (a big but), this reduces performance greatly. E.g., where before I could do 8 slices with HLSL, now I can only do 6. However, the better performance obtained before was caused by GPU parallelization, which is undesired: it forced me to use higher vsync offset values, while now I can reduce those to nearly zero (I still need a few lines though). So even if performance is lower, latency must also be lower now, as the vsync offset was a latency tax. Now it's a matter of investing in a more powerful GPU (my R9 270 is aging).
Interesting!

It depends on which is the bigger latency issue:
- Slice size adds latency: 6 slices is about 33 percent laggier than 8 slices. So your VSYNC offset savings have to be big enough to more than compensate for this. Basically, the vsync offset savings need to be bigger than the slice thickness difference for it to be worth it.
- You don't need a flush when using a large number of frameslices, like 40 frameslices per refresh (non-HLSL) -- except for the between-refresh one (e.g. a flush at the bottom of the screen, right when it enters VBI).
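The 33 percent figure follows directly from slice height: worst-case added latency is roughly one slice of scanout time, so it scales with 1/slices, and 8/6 is about a 1.33x difference. A back-of-envelope sketch (illustrative helper name):

```cpp
#include <cassert>
#include <cmath>

// Worst-case added latency of a frameslice is roughly one slice height of
// scanout time: refresh period divided by slice count.
double slice_latency_ms(double refresh_hz, int slices)
{
    return 1000.0 / refresh_hz / slices;
}
```

At 60Hz, 8 slices is about 2.08 ms per slice versus about 2.78 ms for 6 slices, so the vsync offset saved by flushing needs to beat that ~0.7 ms gap to come out ahead.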

twilen
Posts: 8
Joined: 25 Jan 2014, 13:27

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by twilen » 27 Mar 2018, 09:13

I noticed that stability is clearly better if I spin between each D3DKMTGetScanLine() poll. (I used a simple loop: read the RDTSC value, spin until RDTSC is larger than value+x.) My assumption is that continuous D3DKMTGetScanLine() calls waste bus bandwidth and/or require some driver locks, stalling other threads.

For example, if I keep polling D3DKMTGetScanLine() immediately after the first Present() (even if I present immediately after InVerticalBlank changes to TRUE), the actual starting position is very jittery (as I reported a few posts ago). If I do anything else (like emulate the next slice, sleep 1ms, etc.), the jitter is almost completely gone.

I call flush immediately after each slice's shader rendering pass.

So I guess we need to find the most "optimal" way to spin for a max of 10us or so...
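A portable sketch of such a spin, using std::chrono's steady_clock in place of raw RDTSC (the helper name is illustrative): busy-wait a fixed number of microseconds between D3DKMTGetScanLine() polls instead of hammering the call back to back.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Busy-wait for roughly `us` microseconds without sleeping: no syscall means
// no scheduler-granularity wakeup, so the wait stays fine-grained -- at the
// cost of burning a CPU core while spinning.
void spin_us(int64_t us)
{
    auto deadline = std::chrono::steady_clock::now()
                  + std::chrono::microseconds(us);
    while (std::chrono::steady_clock::now() < deadline)
        ;  // burn cycles until the deadline passes
}
```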
Acer Predator X34, GTX1080Ti, Asus Maximus Hero VIII, i7-6700k @ 4.5GHz, 16G ...
