Blur Busters Forums

Who you gonna call? The Blur Busters! For Everything Better Than 60Hz™

Emulator Developers: Lagless VSYNC ON Algorithm

Talk to software developers and aspiring geeks. Programming tips. Improve motion fluidity. Reduce input lag. Come Present() yourself!

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Postby Chief Blur Buster » 18 Mar 2018, 13:37

Calamity wrote:Hi Mark, I'll post back later if I have a chance, anyway I beg you not to harass MAMEdev about this, it won't work.

I can be eager, but I won't follow up on the MAMEdev bug tracker again after that final follow-up I did. (Apologies)

Sometimes I am too eager, I know!

Calamity wrote:Hi Mark,
I didn't find that knowing vbi exactly was critical for the implementation. Just aligning slice #0 with vblank start works good enough for me.

Agreed.
Knowing the VBI time only buys you back a fudge factor of roughly 5%.

A 480p signal has a 525 vertical total and a 1080p signal usually has a 1125 vertical total. The blanking share, 45/525ths of a refresh for 480p or 45/1125ths of a refresh for 1080p, is small enough that you don't really need to care about the VBI size. This is mostly cherrypicking a couple of milliseconds at the most.

The VBI time can default to a simple percentage constant of ~4% or thereabouts, and optionally be refined automatically if the system is able to successfully detect the Vertical Total of the current video mode. This allows much tighter beam-chasing safety margins (shaving off up to a couple more milliseconds).

One question is which APIs unblock at the beginning of vblank, and which at the end of vblank.

I have to do further work to confirm this, but for VSYNC ON Direct3D, it appears that the wait-for-vblank may unblock Present() at the bottom of the refresh cycle, so for a 1080p signal with VT1125, there's a mandatory pause of ~45/1125th of 1/60th of a second before the top scanline of a 1080p image (about a 0.67 millisecond delay between the exit of Present() and the first scanline appearing). OpenGL would behave the same way.

The return from a VSYNC ON framebuffer flip provides a useful VBI heartbeat reference. With vsynctester.com-style filtering algorithms (even with 10% jitter + 10% skipped VBIs), that heartbeat is accurate enough to software-calculate a predicted raster scan line "register" variable to within ~1-2% error (VBI size known) or ~5% error (VBI size unknown)! This provides a portable cross-platform method of emulating a scan line register as a fallback.

For simplicity, just use a constant based on an average real-world VBI, not the emulated VBI -- so just use the HDTV default....
Code:
const double VBI_DEFAULT_PERCENTAGE = (45.0/1125.0);

...this constant would work fine for the vast majority of default video signals and timing formulas (VESA GTF or whatever) of the last 50 years from NTSC 240p through 4K/8K DisplayPort -- they all scan out top to bottom, and have a VBI of roughly similar percentage (~1% through ~9% but usually 4%). Optional platform frosting (optional video timings, EDID, etc) will simply serve as a bonus to improve this accuracy further.

This constant becomes your estimated VBI time (of a refresh cycle) between the last scanline of previous refresh, and first scanline of next refresh. If using blocking VSYNC ON calls (e.g. Present() ..) this returns at the beginning of the VBI, so you can use this to estimate the time before the raster begins visibly scanning.
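A minimal sketch of that estimate follows. It assumes the heartbeat timestamp marks the start of the VBI, that the whole blanking interval precedes the first visible scanline, and that scanout is linear and top-to-bottom. The function name and parameters are illustrative, not from any existing library.

```cpp
#include <cmath>

const double VBI_DEFAULT_PERCENTAGE = 45.0 / 1125.0;

// Software-calculated "raster register".
// Returns the estimated visible scanline (0 .. active_lines-1),
// or -1 while the raster is presumed to still be inside the VBI.
int estimate_scanline(double seconds_since_vbi_start, double refresh_hz,
                      int active_lines,
                      double vbi_fraction = VBI_DEFAULT_PERCENTAGE) {
    const double period = 1.0 / refresh_hz;
    // Wrap into the current refresh cycle in case a heartbeat was missed.
    double phase = std::fmod(seconds_since_vbi_start, period) / period;
    if (phase < 0.0) phase += 1.0;
    if (phase < vbi_fraction) return -1;  // still blanking
    const double visible = (phase - vbi_fraction) / (1.0 - vbi_fraction);
    const int line = static_cast<int>(visible * active_lines);
    return line < active_lines ? line : active_lines - 1;
}
```

With the default 4% VBI, halfway through a 60Hz refresh of a 1080-line mode the estimate lands a little below the middle of the screen, because the first 4% of the cycle is spent blanking.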

Calamity wrote:I'll follow later, it's a pain to write from the phone.

No rush. I want to gel on this during the coming years to work with a few developers (that I've been privately-or-publicly messaging) -- possibly including you -- to create a generic open-source raster polling library for beam chasing applications.

Basically it's just a raster discovery library:
-- Returns whether the raster is pollable (one of: Impossible, Software-Calculated, Hardware-Poll)
-- Returns current raster line (either hardware poll or software-calculated from a VBI heartbeat)
-- Returns current floating-point refresh rate (important for synchronization)
-- Returns current scanout direction (relative to current screen orientation)
-- Returns microsecond-accurate timestamp of last VBI event (de-jittered internally, maybe similar to vsynctester.com algorithm).
-- Returns True/False if exact video timings are pollable (optional, but improves beam-chasing accuracy)
-- Optional Current video timings in floating-point horizontal scanrate, floating-point vertical refreshrate, Vertical Total, VBI size, etc
(waterfall best-attempt to poll using standard APIs, via EDIDs, via modelines, etc)
-- Optional Current VBI size (floating-point percentage of a refresh cycle -- aka, how long "InVBlank" stays TRUE)
(The latter 3 are only necessary for near-scanline-exact beam chasing algorithms).
-- Other APIs (e.g. microsecond timestamps, precision callback events for VBI or raster (raster interrupts!)) may be provided.
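One way the list above could map onto an interface is sketched here. Every name in it (`RasterDiscovery`, `VideoTimings`, `NullRasterDiscovery`, and so on) is hypothetical; no such library exists yet, so treat this purely as a shape to discuss.

```cpp
#include <cstdint>

enum class RasterSource { Impossible, SoftwareCalculated, HardwarePoll };
enum class ScanDirection { TopToBottom, BottomToTop, LeftToRight, RightToLeft };

struct VideoTimings {
    double horizontal_scanrate_hz = 0.0;
    double vertical_refresh_hz = 0.0;
    int vertical_total = 0;
    double vbi_fraction = 0.0;  // share of a refresh cycle spent in VBI
};

// Hypothetical interface mirroring the feature list.
class RasterDiscovery {
public:
    virtual ~RasterDiscovery() = default;
    virtual RasterSource source() const = 0;
    virtual int current_scanline() const = 0;  // -1 if unknown
    virtual double refresh_rate_hz() const = 0;
    virtual ScanDirection scan_direction() const = 0;
    virtual std::int64_t last_vbi_timestamp_us() const = 0;  // de-jittered
    virtual bool try_get_timings(VideoTimings& out) const = 0;  // optional
};

// Bottom of the waterfall: a platform where nothing can be discovered.
class NullRasterDiscovery : public RasterDiscovery {
public:
    RasterSource source() const override { return RasterSource::Impossible; }
    int current_scanline() const override { return -1; }
    double refresh_rate_hz() const override { return 0.0; }
    ScanDirection scan_direction() const override { return ScanDirection::TopToBottom; }
    std::int64_t last_vbi_timestamp_us() const override { return 0; }
    bool try_get_timings(VideoTimings&) const override { return false; }
};
```

A waterfall would try a hardware-poll backend first, then a software-calculated one, and only fall through to the null backend when no VSYNC heartbeat is available at all.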

I have found collections of several APIs that can be rolled together into a multiplatform Mac/Linux/PC/Android library for making it much easier to do cross platform tile-based beam-chased rendering algorithms. Some VR software for Android is already using raster-synchronized apps, and it is possible to guess the smartphone scan position to a 5% accuracy without needing the native raster poll.

A waterfall cascade on the most-preferred APIs to the least-preferred APIs would be capable of providing sufficient data for accurate 10-tile beam chasing (90% lag reduction) on most platforms, and single-scanline beam chasing on a few platforms.

Even 4-tile beam chasing (75% input lag reduction) would be very doable in many existing emulator architectures, even if it means repeating the HLSL steps four times per emulator refresh.

By making a generic raster-poll library available, more (willing) emulator developers can start to implement optional beamchasing into their emulators with less difficulty, and with less odds of breaking existing architectures. It may take a year, but I already have some source code in Blur Busters Strobe Utility that I am willing to donate to this long-term cause of making beam chasing easier.

This library could potentially become popular a few years from now with VR developers & emulator developers by hiding all the messy complexity. It could even potentially pressure graphics driver makers (NVIDIA, AMD) into making front-buffer rendering a standard API option again. (BlurBusters has influenced the gaming monitor industry with new testing techniques & by influencing the widespread deployment of blur reduction backlights, so some of our eager advocacy does sometimes eventually stick -- after time -- though sometimes after a few years).

So, I feel, getting started on a generic cross-platform raster poll library will be an important world's first step.

I think maybe we all (the developers I've been messaging) should start a GitHub project on this, and obviously maybe you have an interest in doing this.

No rush in reply, until you're back at a computer if you prefer, but once you do.... Let me know what you think about an open-source raster discovery library, to make beam-chasing programming easier for the eager developers who want to work on this.
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter

       To support Blur Busters:
       • Official List of Best Gaming Monitors
       • List of G-SYNC Monitors
       • List of FreeSync Monitors
       • List of Ultrawide Monitors
User avatar
Chief Blur Buster
Site Admin
 
Posts: 5909
Joined: 05 Dec 2013, 15:44

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Postby Chief Blur Buster » 18 Mar 2018, 14:17

Precedents of Android calculating a scan line on oculus.com

Oculus wrote:The mobile displays are internally scanned in portrait mode from top to bottom when the home button is on the bottom, or left to right when the device is in the headset. VrLib receives timestamped events at display vsync, and uses them to determine where the video raster is scanning at a given time.


Several methods of getting the VSYNC (VBI) timestamps (listening to the VSYNC heartbeat):

-- Getting a VSYNC timestamp from the OS (e.g. DwmGetCompositionTimingInfo() ...) -- There are various APIs that work across Mac, PC/Windows, and Android, under both OpenGL and Direct3D
-- Failing this, spinning on a VSYNC flag (preferably non-CPU-consuming method of spinning) to get timestamps of VBI
-- If VSYNC timestamps are "noisy" or randomly missing, it's possible to filter these down to microsecond accuracy (Example: vsynctester.com and http://www.testufo.com/animation-time-graph ....)

By waterfalling down from best-API to worst-API, almost any platform can predict the raster to an accuracy of 5%, assuming the scan direction is known (almost always top-to-bottom, but can be detected on many platforms too).
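As a rough illustration of the filtering step, here is a least-squares fit through noisy timestamps that tolerates missed events. This is not the vsynctester.com or TestUFO algorithm (neither is described in this thread); it is just one plausible approach, with hypothetical names, that assumes you have a nominal refresh period to start from.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Heartbeat {
    double period;  // fitted seconds per refresh
    double phase;   // fitted time of the first observed vsync
};

// Fit t = phase + period * n through noisy VSYNC timestamps (seconds).
// Missed events are handled by rounding each gap to the nearest whole
// number of nominal periods before assigning refresh indices n.
Heartbeat fit_heartbeat(const std::vector<double>& t, double nominal_period) {
    Heartbeat hb{nominal_period, t.empty() ? 0.0 : t.front()};
    if (t.size() < 2) return hb;

    std::vector<double> n(t.size(), 0.0);
    for (std::size_t i = 1; i < t.size(); ++i) {
        const double steps = std::round((t[i] - t[i - 1]) / nominal_period);
        n[i] = n[i - 1] + (steps < 1.0 ? 1.0 : steps);
    }

    double mean_n = 0.0, mean_t = 0.0;
    for (std::size_t i = 0; i < t.size(); ++i) { mean_n += n[i]; mean_t += t[i]; }
    mean_n /= t.size();
    mean_t /= t.size();

    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < t.size(); ++i) {
        num += (n[i] - mean_n) * (t[i] - mean_t);
        den += (n[i] - mean_n) * (n[i] - mean_n);
    }
    hb.period = num / den;  // den > 0 because n is strictly increasing
    hb.phase = mean_t - hb.period * mean_n;
    return hb;
}
```

Feeding in a longer sliding window of timestamps averages the jitter down further, at the cost of reacting more slowly to real refresh rate changes.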

It doesn't matter what 3D API you use, as long as the 3D API supports tearing (aka VSYNC OFF). Once you know the raster, you can precisely control the tearline position (even if the raster is software-calculated).

So boiling down to the simplest, absolute-minimum platform requirements:
1. Either VSYNC OFF rendering --or-- front-buffer-rendering (doesn't matter if OpenGL or Direct3D)
2. Knowing the VSYNC timestamps
...Then it's possible to do beam-chased rendering (to a ~5% accuracy) on that particular platform!

Now, if you add:
3. Ability to know video timings (poll EDID, poll timings, poll modeline, poll CRU, etc)
...Then beam-chasing accuracy improves to <1%!
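When the timings are known, the raster position no longer depends on a guessed VBI percentage. A sketch, assuming the heartbeat marks the start of blanking and using an illustrative function name:

```cpp
#include <cmath>

// Position of the raster in scanlines within the full vertical total
// (0 .. vertical_total), counted from the start of the VBI.
// Subtract the number of blanking lines to get a visible scanline.
double raster_line_from_timings(double seconds_since_vbi_start,
                                double horizontal_scanrate_hz,
                                int vertical_total) {
    const double lines = seconds_since_vbi_start * horizontal_scanrate_hz;
    const double pos = std::fmod(lines, static_cast<double>(vertical_total));
    return pos < 0.0 ? pos + vertical_total : pos;
}
```

For a 1080p60 signal with VT1125 the horizontal scan rate is 67,500 lines per second, so 1 ms after the VBI starts the raster is 67.5 lines into the vertical total.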

Obviously, whatever cake frosting the platform provides (e.g. a raster poll, video timings, etc) is a sweet bonus that can only improve accuracy (to sub-microsecond accuracy).

But those are actually optional, since you minimally need only ~90% of noisy VSYNC timestamps (10% skipped frames) to be able to estimate a raster accurately. I already do filtering algorithms to predict the system refresh rate using TestUFO motion tests -- including http://www.testufo.com/animation-time-graph -- and so does Jerry of http://www.vsynctester.com .... With noisy VSYNC timestamps, raster-register accuracy is still possible with these filtering algorithms!

Anyway, I really think this now belongs in a cross-platform raster-polling library that can also calculate the raster based on only knowing a VSYNC event.

It will take time. There is a lot of beautiful code in bits-and-pieces all over the place, ready to be rolled together (over the course of a year) into a unified accurate cross-platform raster poll/estimator library. Say, under Apache 2.0 License (or similar) as long as credit is given -- dynamic linking, static linking, or just-copy-into-your-code (do what you want approach) -- to make beam racing easier.

Mop away all the complexity, just easy data to the programmer. Send me an email to mark@blurbusters.com if you prefer to discuss this offline.

---------
Note: While raster can be estimated successfully -- here's source code for Windows raster polling on a multiple monitor system: Jerry of Duckware (author of http://www.vsynctester.com ...) created a demo of D3DKMTWaitForVerticalBlankEvent() that doesn't hog the CPU -- and also a D3DKMTGetScanLine() .... This older demo was originally for a Chromium browser bug report. But useful here in this context. I've now converted it to VS2017; it is VSYNC/ScanLine API Examples -- Source Code Download (Visual Studio 2017, Windows 10 SDK, includepath is currently configured to point to 10.0.16299.0 ...)

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Postby Chief Blur Buster » 18 Mar 2018, 14:53

Calamity wrote: I can't get this thing to be fully stable with 1, 2 or 3 slices. However, as soon as I use 4 or more slices the jitter stabilizes and the bands stick nicely to their place. This happens with different cards and brands so there must be a fundamental reason for it.

(EDIT: I concur -- raster prediction becomes less reliable with longer intervals between tearlines. It might be some form of background processing in the graphics card or something. But perhaps a flush call might solve the problem)

That said, here's the raster stability I managed to get in C# programming at 600 frames per second (10 frame slices per refresh).

(My videos, filmed on smartphone)

10-frameslice VSYNC OFF precise tearline steering test
phpBB [video]

-- a little less than 10% GPU load on a GTX 1080 Ti.

Here's the raster stability at 3,600 frames per second (60 frame slices per refresh)

60-frameslice VSYNC OFF precise tearline steering test
phpBB [video]

-- a little less than 50% GPU load on a GTX 1080 Ti
(should drop to 1% if front buffer rendering available instead of emulating front buffer rendering via ultra-high-framerate VSYNC OFF)

I "cheated" a little bit on the 3,600fps version, I used Task Manager -> RealTime Priority
The 600fps version used default Normal Priority.

And this was done without a raster poll (just microsecond-accurate clock).
You can guess the raster position pretty accurately without a raster register!

Resolution 1920x1080 .... So a lot slower than 320x240 framebuffers outputting to a 480p VGA display. Also, I can reach up to 7000+ frame buffer swaps per second, so I wasn't maxing out the graphics card. Minor floaty drift (orbiting back and forth a few scanlines -- I haven't implemented much filtering yet) can still show through the averaged-out VBI timestamp jitter, but it is not critical. Look closely at 3600fps -- it did briefly drift by a few scanlines. Not a biggie, that's very well within the jitter margin.

VSYNC OFF tearlines can be controlled & predicted on any platform with any API (OGL, D3D, etc). As long as you have access to a VSYNC heartbeat, you've got enough information to extrapolate a basic raster register (with sufficient filtering for vsync jitter/skips -- http://www.vsynctester.com style or http://www.testufo.com/animation-time-graph style) -- but which can then be optionally made more accurate if you have additional information (e.g. timings).

These videos were recorded with no access to low-level APIs -- any tearline-compatible 3D APIs work (OGL, D3D, etc).

My suspicion is that some graphics drivers on some graphics cards are trying to do some background rendering, but the key is to flush the background rendering pipeline THEN spin (or busywait) to the desired raster or desired timestamp, THEN swap frame buffers. The precision of frame buffer swapping is the important part. That probably will stabilize more jitter.

Also, your beam-chasing margin is the window of 1 frame slice ago .... So your realraster can't get closer than 1 frameslice to the emuraster, or artifacts begin appearing. Because you're still rendering the next frameslice when the realraster hits the bottom of the already-displayed frameslice. So you gotta back off your beam racing margin to allow the race margin to jitter within "1 frame slice ago".

Also, if you're using HLSL style framebuffers (e.g. curved CRT simulations, so it's not perfectly 1:1 beam chasing), the vertically topmost pixels (highest on the screen's Y dimension) in the curved HLSL scanline are your realraster deadline. Don't let the realraster go below that in your beam chasing margin. Eventually, long term, (hopefully palatable, non-messy) optimizations could force HLSL to render only overlapped slicefuls, improving performance to almost parity with the existing non-beamchased fork of MAME -- if you haven't done so already for GroovyMAME. Ultimately, long term -- if designing HLSL algorithms, design your algorithm to rasterplot a block of scanlines at a time, for efficiency's sake. Basically emulating CRT scanning at the scanline level, or scanlines-block level :D

Make sure you turn off other driver shenanigans that do less-than-fully-VSYNC-OFF (e.g. dynamic "Enhanced Sync" algorithms where the graphics driver "steers" the tearline offscreen if the tearline is close to the bottom edge of screen). I have noticed some graphics drivers provide such optional modes, to balance out the pros/cons of VSYNC OFF and VSYNC ON.

You only need plain, simple, garden-variety VSYNC OFF for beam chasing to work well (doesn't matter which 3D API, as long as you have access to a VSYNC heartbeat).

The biggest gotcha is usually knowing the VSYNC ON heartbeat while running in VSYNC OFF mode, but many workarounds exist without needing access to low-level OS APIs (e.g. multithreaded app with two instances: One visible VSYNC OFF instance and one hidden 1-pixel VSYNC ON instance whose sole purpose is to provide a VSYNC heartbeat source).

Jitter Margin Best Practices

Just created a new diagram to outline where the tearline jitter margin must be, for slice-based beam racing:

Image

If you're rasterplotting your current emulator refresh on top of the previous refresh cycle's frame, then your beam chasing margin can be ANY slice not adjacent to the realraster. If you do this, your emulator raster can be up to almost a full refresh cycle ahead of the real raster -- a much bigger jitter margin than a single frame slice.

But if you want tight beam racing, you want your (invisible) tearline to be in that 2nd most recent frameslice -- that's your beam "pacing" goal for minimum input lag. (This may be part of the fundamental reason it is harder to keep things stable with fewer slices -- large frame slices require further beam-race-ahead for the emulator raster)

Nonetheless, beam racing best practices for emulators are:
-- Always rasterplot your new emulator refresh cycle on top of a copy of the previous emulator refresh cycle's frame
(VSYNC OFF emulation of front buffer rendering)
-- Don't let the real-world raster get into your still-rendering frameslice. The jitter goal is the previous frameslice.
-- Know the approximate pause between refresh cycles (safe assumption if not knowing timings: ~4% of a refresh cycle)
-- Every frame slice, flush your rendering pipeline, then spin until the raster position (real/guessed) is reached, then buffer swap.

That way, you have a very forgiving jitter margin:
-- Anywhere between 1 frameslice behind and 1 refresh cycle behind -- no tearline!
-- More than 1 refresh cycle behind, artifacts appear
-- Less than 1 frameslice behind, artifacts appear
-- Ideal minimum input lag is achieved by aiming your beam pacing at 1 frameslice ago
(basically between the two figurative "tearlines" 1 frameslice ago and 2 frameslices ago)
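The margin rules in this list reduce to a range check. The sketch below uses made-up names and measures everything in visible scanlines, ignoring the VBI for simplicity.

```cpp
enum class Margin { TooClose, Safe, TooFar };

// real_raster_lag_lines: how many scanlines the real raster trails behind
// the emulator raster (the last scanline already rendered and swapped in).
Margin classify_margin(double real_raster_lag_lines, int active_lines,
                       int slices_per_refresh) {
    const double slice_height =
        static_cast<double>(active_lines) / slices_per_refresh;
    if (real_raster_lag_lines < slice_height) return Margin::TooClose;  // artifacts
    if (real_raster_lag_lines > active_lines) return Margin::TooFar;    // artifacts
    return Margin::Safe;  // tearline stays hidden
}
```

For lowest lag, a pacing loop would aim to keep the lag between one and two slice heights, comfortably inside the safe band but near its tight end.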

This way, the jitter margin becomes so forgiving -- that you don't even need to know a direct polled raster register -- just approximate.

Even more forgiving is to time the buffer swap approximately 3 or 4 frameslices higher up in a fine-frameslice implementation (e.g. 60 frameslice per refresh cycle) -- such fine granularity will ideally require knowing approximate length of VBI (safe assumption of 4% of a refresh cycle if you don't have access to video signal timings).

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Postby Chief Blur Buster » 19 Mar 2018, 03:01

Breakthrough!

Dragging exact position of VSYNC OFF tearlines with the computer mouse!

Accurate software-based prediction of raster scanline. Beam chasing without a raster register -- using only ordinary math calculations + ordinary VSYNC OFF tearing. In C# using Monogame (a cross-platform game engine) -- so theoretically runs on any platform that supports tearing.

The UFO bitmap (spriteBatch) is being only drawn once per frame, intentionally aligned to the predicted tearline!

The pixels are displayed on the monitor in only 2-3 milliseconds after the spriteBatch.Draw() API call! API to photons!
(DisplayPort overhead + LCD GtG)

phpBB [video]


Yes, it's the open source cross-platform MonoGame 3.6 for C# -- I'm just simply using real-time modifications of Game.TargetElapsedTime (MonoGame) to target the next tearline position (aka raster position). A pseudo-raster-interrupt, if you will!

There are no System.Runtime.Interop calls being done for this (no Windows API calls).

Internally, I use doubles for time to prevent rounding errors when using clock accumulator variables. I use the platform's maximum high performance counter accuracy (even nanoseconds if available). Beam chasing seems okay with 100KHz counters, and most platforms (Mac/Linux/Android/PC) provide at least a 1MHz Stopwatch.Frequency.

It's only recently that computers have become high-performing enough, with sufficiently precise clocks (on all platforms) that this is possible in a cross-platform way without needing a raster register -- just a reliable VSYNC heartbeat (which you can get from a lot of sources on most platforms) -- and it will work with both OpenGL and Direct3D -- it simply needs a tearing-compatible API.

I can confirm that all one needs for beam chasing is:

(A) A VSYNC OFF compatible API (Direct3D or OpenGL) or front buffer rendering
(B) Listening to a VSYNC heartbeat (needed to extrapolate a simulated raster register)
(C) Sufficiently accurate clock (at least 100,000 ticks per second).

With this data, I'm able to calculate the predicted tearline position really accurately, even in plain old C# and a starter engine. One can simply assume a 4% VBI pause between refresh cycles (no need to know exact video timings) -- not knowing the exact signal timings and only assuming common VBI size -- only results in a few-pixel misalignments away from the mouse cursor at most.
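The demo itself is C# on MonoGame and its source is not shown here, so the following is only a sketch of the underlying math, written in C++ with illustrative names: given the time since the last VSYNC heartbeat, how long to wait before swapping so that the tearline lands on a chosen scanline. It assumes the heartbeat marks the start of a VBI that takes up `vbi_fraction` of the refresh cycle.

```cpp
#include <cmath>

// Seconds to wait from "now" until the raster reaches target_line.
double seconds_until_scanline(int target_line, double seconds_since_vbi_start,
                              double refresh_hz, int active_lines,
                              double vbi_fraction = 45.0 / 1125.0) {
    const double period = 1.0 / refresh_hz;
    // Where in the refresh cycle (0..1) the target scanline is scanned out.
    const double target_phase =
        vbi_fraction + (1.0 - vbi_fraction) * target_line / active_lines;
    const double now_phase =
        std::fmod(seconds_since_vbi_start, period) / period;
    double wait = (target_phase - now_phase) * period;
    if (wait < 0.0) wait += period;  // already passed: aim for next refresh
    return wait;
}
```

The result is the kind of value one would feed to a timer or spin loop, which is what makes it behave like a pseudo-raster-interrupt.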

For software-based raster calculation I think doing a 10 or 20 tile approach is a sweet spot on faster systems. Going tighter frameslices with a purely cross-platform approach will be harder, but doing it in a 100% cross-platform way should still achieve about 10-strip rendering (90% lag reduction of VSYNC ON).

I'll add a strip-render demo (only have one strip of data in the frame buffer at a time) to generate a complete, full-screen VSYNC ON image.

Once I'm sufficiently complete with this code, I'll put it out as open source code for anybody to port the algorithm into any language they want.

EDIT: Just added a mouseghost, and the ghost leads ahead of the mouse arrow pointer -- I'm displaying mouse positions quicker than the Windows mouse arrow! That's very low mouse lag.

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Postby lexlazootin » 19 Mar 2018, 05:22

Color me super impressed. All the luck ahead. I really do hope i can use this someday in pretty much all my applications.
User avatar
lexlazootin
 
Posts: 1251
Joined: 16 Dec 2014, 02:57

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Postby Chief Blur Buster » 19 Mar 2018, 11:26

lexlazootin wrote:Color me super impressed. All the luck ahead. I really do hope i can use this someday in pretty much all my applications.

It's more for renderers that have a sweep pattern
-- Raster-accurate emulators
-- Virtual reality renderers

It is simpler to beam-chase for raster renderers (e.g. emulators) but complicated for 3D graphics.
So that means emulators will probably be the first spinoff applications to use this development. That said, as RealNC has reminded me a couple times, virtual reality is starting to use strip-based rendering to reduce VR lag, and it's quite possible that some software will borrow the improved beam chasing algorithms.

People like me ( http://www.testufo.com/animation-time-graph ) and Jerry Jongerius of Duckware ( http://www.vsynctester.com ) are able to extrapolate an ultra-precise VSYNC to the microsecond from an erratic VSYNC signal (10-20% random jitter in timestamps + missed VSYNC events).

The ability to pull a sub-millisecond VSYNC heartbeat from a noisy VSYNC signal (jitter & misses) is the breakthrough that makes beam chasing cross-platform and more accurate, and eliminates the requirement for a raster register, as long as you meet the minimum cross-platform requirements (access to a tearing-able API such as VSYNC OFF + also access to VSYNC events + an accurate clock).

Although the requirements are simple, it has one tricky consideration: Doing VSYNC OFF (ability to do tearing) while also getting the equivalent of a VSYNC ON timing at the same time. But once you manage that, it is fundamentally all you need to estimate a raster.
Some workarounds are:
(A) Two instances of graphics canvas. One full screen VSYNC OFF instance and one background hidden 1-pixel VSYNC ON window -- running concurrently in a separate thread. That VSYNC ON window only needs to be 1 pixel hiding in the background, only to listen to the timing of the VSYNC intervals.
(B) Optional platform-specific access to rasters and/or VSYNC does improve things, but that is made optional cake frosting, since VSYNC OFF tearline position is really just simple mathematics on modern high-precision counters as a relative time-basis between two VSYNC events.
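For workaround (A), the two threads need a safe way to share the latest heartbeat. A minimal sketch with a hypothetical `VsyncHeartbeat` class is shown below. The hidden VSYNC ON thread would call `on_vsync()` right after each blocking present returns, and the VSYNC OFF render thread reads the elapsed time whenever it needs to estimate the raster. The windowing and present calls themselves are left out.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

class VsyncHeartbeat {
public:
    // Called from the hidden VSYNC ON thread after its blocking present.
    void on_vsync() {
        last_us_.store(now_us(), std::memory_order_release);
        count_.fetch_add(1, std::memory_order_relaxed);
    }

    // Called from the VSYNC OFF render thread.
    std::int64_t us_since_vsync() const {
        return now_us() - last_us_.load(std::memory_order_acquire);
    }

    std::uint64_t vsync_count() const {
        return count_.load(std::memory_order_relaxed);
    }

private:
    static std::int64_t now_us() {
        using namespace std::chrono;
        return duration_cast<microseconds>(
                   steady_clock::now().time_since_epoch()).count();
    }

    std::atomic<std::int64_t> last_us_{0};
    std::atomic<std::uint64_t> count_{0};
};
```

The raw timestamps stored this way would still be jittery; in practice they would be passed through a filter before being used for raster estimates.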

Optional stuff: Accuracy is improved by assuming VBI size (~4%), which improves mouse cursor lock on exact tearline position to within a few pixels. Even more optionally, accuracy is improved further by knowing horizontal scanrate and/or vertical total (one or the other, or both). Beam chasing can still be variable refresh rate (FreeSync, GSYNC) compatible as long as you use the VSYNC OFF option (FreeSync+VSYNC OFF or GSYNC+VSYNC OFF) in the graphics drivers. It's also Quick Frame Transport compatible (the new HDMI 2.1 feature which is simply large vertical totals -- large VBI) -- provided you know the scanout velocity (horizontal scan rate). So you can combine beam-chasing and custom framerates (53.1fps, 60fps, 70fps). That does require a platform-specific system call to grab the signal's horizontal scanrate -- but that's easy for me to do on many platforms. Again, that's still optional cake frosting.

Anyway, as long as you can get a VSYNC heartbeat (and do some reasonable filtering on jitter/misses), then that is all the data you need to control the exact position of VSYNC OFF tearlines accurately!

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Postby Sparky » 19 Mar 2018, 17:28

This is very good news. It opens up a lot of possibilities.

One interesting application is simulating front buffer rendering while using a double buffered render chain. If you can render faster than your refresh rate, and you're GPU limited, you should be able to increase the effective framerate substantially by choosing not to render the part of the screen that will get overwritten before it gets displayed. You add some overhead by serializing a parallel workload, but in many cases it should be worth it. For example, you might be able to render 1920x108 at 1500fps, instead of 1920x1080 at 200fps. The 108 pixel region leads the scanline, and the user sees the equivalent of 1500fps 1080p vsync off.
Sparky
 
Posts: 627
Joined: 15 Jan 2014, 02:29

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Postby Chief Blur Buster » 19 Mar 2018, 17:54

Sparky wrote:One interesting application is simulating front buffer rendering while using a double buffered render chain. If you can render faster than your refresh rate, and you're gpu limited, you should be able to increase the effective framerate substantially by choosing not to render the part of the screen that will get overwritten before it gets displayed.

Yes, ultra-high-framerate VSYNC OFF is a stand-in for front buffer rendering.

Basically redundant repeat-delivery of partially-complete renders -- essentially simulates front-buffer rendering. A huge waste of memory bandwidth, but the bandwidth is so incredibly high, and high end GPUs are literally mostly idling in most emulators, since emulators don't use much GPU horsepower (except for HLSL style processing).

So if you're just plotting out emulator frame buffers, then you have plenty of spare GPU headroom on a modern GPU to simulate front buffer rendering without a front buffer (at least until NVIDIA/AMD finally relent and re-enable front buffer rendering -- then the cap is totally blown off the building and all bets are off on possible performance achievable).

Even under 50% GPU load can allow front-buffer emulation at about 3,600 frame buffer swaps per second (0.3ms render lag) -- and if you go down to 1,000 frame buffer swaps per second (1ms render lag), the GPU load falls to under 10% of a GTX 1080 Ti. Quite doable for emulators.

Provisionally, I'm refactoring my quick-whip modules into this at the moment:
  • (core) Generic cross platform raster position calculator (with optional user-defined roll-your-own vsync heartbeat callback function)
  • (core) Generic vsync heartbeat listener (achieved via using two MonoGame instances: one for VSYNC OFF visible canvas, another for VSYNC ON for VSYNC heartbeat listening; works on Windows platforms)
  • (optional cake frosting) Optional platform specific VSYNC heartbeat monitor and optional hardware raster-poller
  • (optional cake frosting) Optional platform specific display enumerator (for improving accuracy of raster calculator via polling scanrate, refreshrate, and Vertical Total)

Some of the code comes from Blur Busters Strobe Utility 2.1 work so it's essentially my partial open-sourcing of some of the non-brand-specific modules I already use in Strobe Utility (stuff like the display enumerator and VSYNC logic). Since I'm excellent at calculations in terms of display signals, and can calculate a simulated raster register to near-scanline-exact, I think my knowledge needs to be brought to the open source community for beam chasing applications.

The math formulas are really just generic and fundamental -- and work with any VSYNC OFF tearing-compatible API; the tearline is the raster position of the buffer swap, and the exact tearline position can be controlled solely from a high performance clock.

This needs to be ported to C++ but C# is currently my favourite rapid-development language (2x faster development) so I start some of my experimental projects in C#. C# is now cross platform (Linux, Android, MacOS, Windows) and MonoGame is available for Linux, Android, MacOS, Windows too.

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Postby Calamity » 20 Mar 2018, 12:59

Ok, I'm back home.

phpBB [video]


Demonstration of the experimental "frame slice" feature. Tear free rendering at 600 fps, 2560x1600. Emulation of each frame is divided in 10 "slices", synchronized with the physical raster. Input data is polled and processed for each slice, potentially allowing for sub-frame input responsiveness.

On pressing F11, slices are shown with a color filter, revealing the (low) existing jitter.

Intel i7-4771 3.5 GHz, AMD Radeon R9 270, Windows 8.1 64 bits

Command line:
Code: Select all
mame64 -srf -frame_slice 9 -vsync_offset 42 -ues -nouesx -priority 1 -nosleep alexkidd



Now the same test, with HLSL enabled:

[Embedded video]


Apparently the R9 270 can handle 8 slices per frame at most with HLSL enabled. The slice color filter is overridden by HLSL, which is why it isn't shown here, unlike in the previous non-HLSL video.

Command line:
Code: Select all
mame64 -srf -frame_slice 7 -vsync_offset 148 -ues -nouesx -hlsl -priority 1 -nosleep alexkidd

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Postby Calamity » 20 Mar 2018, 13:40

Here are the diffs, which must be applied over the MAME 0.195 baseline code in this order: 0195_groovymame_017g.diff -> d3d9ex.diff -> slice.diff

(slice.diff is the file that contains the "frame slice" implementation. Most of the file just removes previous GM features.)

Now, an important warning: THIS DIFF BREAKS THE EMULATION OF MANY DRIVERS. IT INTRODUCES GRAPHICS AND PALETTE GLITCHES AND OTHER, MORE SUBTLE PROBLEMS. I am posting this only as an illustration of this method in action, for the drivers that are ready for it. Please do not bother MAMEdev about this feature or use this patch in derivative builds or I will abandon this experiment forever.

I'll paste the relevant bits so it's easier than messing with the diff.

This is borrowed from the frame delay implementation; the values were worked out by Intealls. This bit is not exclusive to this implementation, but I thought I'd paste it so you can see where the values used later are calculated. Real timings are only actively applied (and therefore known) in the AMD case; in the other cases they are guessed, calculated from standard timings (I'm thinking it'd be great to use your vtotal guess method here instead).

Code: Select all
   switch (m_vendor_id)
   {
      case 0x1002: // ATI
         m_first_scanline = m_switchres_mode && m_switchres_mode->vtotal ?
            (m_switchres_mode->vtotal - m_switchres_mode->vbegin) / (m_switchres_mode->interlace ? 2 : 1) :
            1;

         m_last_scanline = m_switchres_mode && m_switchres_mode->vtotal ?
            m_switchres_mode->vactive + (m_switchres_mode->vtotal - m_switchres_mode->vbegin) / (m_switchres_mode->interlace ? 2 : 1) :
            m_height;
         break;

      case 0x8086: // Intel
         m_first_scanline = 1;

         m_last_scanline = m_switchres_mode && m_switchres_mode->vtotal ?
            m_switchres_mode->vactive / (m_switchres_mode->interlace ? 2 : 1) :
            m_height;
         break;

      default: // NVIDIA (0x10DE) + others (?)
         m_first_scanline = 0;

         m_last_scanline = m_switchres_mode && m_switchres_mode->vtotal ?
            (m_switchres_mode->vactive - 1) / (m_switchres_mode->interlace ? 2 : 1) :
            m_height - 1;
         break;
   }



Here is the bit where the break scanlines for each slice are calculated. My initial implementation only had 5 slices and the values were hardcoded. I've left that bit commented out, as it makes it easier to see the logic:

Code: Select all
   borders = target->height() - visheight;
   
   float vsync_offset = (float)win->machine().options().vsync_offset() / m_last_scanline;
/*
   m_break_scanline[1] = m_first_scanline + borders / 2 + visheight * (1.00f - vsync_offset) -1;
   m_break_scanline[2] = m_first_scanline + borders / 2 + visheight * (0.20f - vsync_offset) -1;
   m_break_scanline[3] = m_first_scanline + borders / 2 + visheight * (0.40f - vsync_offset) -1;
   m_break_scanline[4] = m_first_scanline + borders / 2 + visheight * (0.60f - vsync_offset) -1;
   m_break_scanline[0] = m_first_scanline + borders / 2 + visheight * (0.80f - vsync_offset) -1;
*/
   int frame_slice = win->machine().options().frame_slice();
   float slice_period = 1.00f / (frame_slice + 1);

   m_break_scanline[0] = m_first_scanline + borders / 2 + visheight * (frame_slice * slice_period - vsync_offset) -1;
   m_break_scanline[1] = m_first_scanline + borders / 2 + visheight * (1.00f - vsync_offset) -1;
   for (int i = 2; i <= frame_slice; i++)
      m_break_scanline[i] = m_first_scanline + borders / 2 + visheight * (float(i - 1) * slice_period - vsync_offset) -1;

   if (visheight != old_visheight)
      for (int i = 0; i <= frame_slice; i++) osd_printf_verbose("Direct3D: break_scanline[%d]: %d\n", i, m_break_scanline[i]);

   old_visheight = visheight;
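
To make the arithmetic above easier to try in isolation, here is a standalone C++ sketch of the same break-scanline calculation. It is not part of the diff: the function `compute_break_scanlines` and its parameter names are hypothetical stand-ins for the member variables and MAME options used in the excerpt, and it keeps the same index layout (index 1 is the end of the visible area, index 0 is the slice before it).

```cpp
#include <cassert>
#include <vector>

// Hypothetical standalone version of the break-scanline calculation.
//   frame_slice         value of the -frame_slice option (slices = frame_slice + 1)
//   first_scanline      stand-in for m_first_scanline
//   last_scanline       stand-in for m_last_scanline
//   target_height       stand-in for target->height()
//   visheight           visible height of the emulated screen on the target
//   vsync_offset_lines  value of the -vsync_offset option, in scanlines
std::vector<int> compute_break_scanlines(int frame_slice, int first_scanline,
                                         int last_scanline, int target_height,
                                         int visheight, int vsync_offset_lines)
{
    assert(frame_slice >= 1 && last_scanline > 0);
    const int borders = target_height - visheight;
    const float vsync_offset = static_cast<float>(vsync_offset_lines) / last_scanline;
    const float slice_period = 1.0f / (frame_slice + 1);

    // Scanline at a given fraction of the visible area, shifted up by the
    // offset; the float result is truncated to int as in the excerpt.
    auto line_at = [&](float fraction) {
        return static_cast<int>(first_scanline + borders / 2
                                + visheight * (fraction - vsync_offset) - 1);
    };

    std::vector<int> breaks(frame_slice + 1);
    breaks[0] = line_at(frame_slice * slice_period); // last slice boundary before the end
    breaks[1] = line_at(1.0f);                       // end of the visible area
    for (int i = 2; i <= frame_slice; i++)
        breaks[i] = line_at((i - 1) * slice_period);
    return breaks;
}
```

With `frame_slice = 3`, a 1000-line target with no borders and a zero offset, this yields break scanlines 749, 999, 249 and 499 for indices 0 to 3. A non-zero `vsync_offset_lines` moves every break scanline earlier by the same fraction of the visible height.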



Now, here is the Present part, where the raster synchronization and busy wait are performed:

Code: Select all
   int curr_slice = 0;
   D3DRASTER_STATUS raster_status;
   memset (&raster_status, 0, sizeof(D3DRASTER_STATUS));

   // sync to break scanline for this slice
   if (video_config.syncrefresh && win->machine().options().frame_slice())
   {
      update_break_scanlines();

      if (m_device->GetRasterStatus(0, &raster_status) == D3D_OK)
         osd_printf_verbose("draw->entered at raster line: %d\n", raster_status.ScanLine);

      curr_slice = win->machine().first_screen() == nullptr? 0 : win->machine().first_screen()->slice_current();
      if (curr_slice > win->machine().options().frame_slice()) curr_slice = 0;

      osd_sleep(1);
      do
      {
         if (m_device->GetRasterStatus(0, &raster_status) != D3D_OK)
            break;
      } while (raster_status.ScanLine < m_break_scanline[curr_slice]);
   }

   // present the current buffers
   result = m_device->Present(nullptr, nullptr, nullptr, nullptr);
   if (FAILED(result))
      osd_printf_verbose("Direct3D: Error %08lX during device present call\n", result);

   // sync slice 0 to VBLANK-begin
   if (video_config.syncrefresh && curr_slice == 1)
   {
      do
      {
         if (m_device->GetRasterStatus(0, &raster_status) != D3D_OK)
            break;
      } while (!raster_status.InVBlank);
   }
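
The busy-wait pattern used twice above can be isolated from Direct3D for experimentation. The following C++ sketch is an illustration only: `spin_until_scanline` is a hypothetical helper, and the `poll` callable stands in for `GetRasterStatus` (it fills in the current scanline and returns false on failure).

```cpp
#include <cassert>

// Hypothetical helper: spin until the polled scanline reaches break_scanline.
// PollFn is any callable of the form bool(int& scanline) that writes the
// current scanline and returns false if the raster could not be read.
// Returns true once the break scanline is reached, false if polling failed.
template <typename PollFn>
bool spin_until_scanline(PollFn poll, int break_scanline)
{
    assert(break_scanline >= 0);
    int scanline = 0;
    do
    {
        if (!poll(scanline))
            return false; // give up instead of spinning on a failing poll
    } while (scanline < break_scanline);
    return true;
}
```

As in the excerpt, the loop exits immediately if the raster is already at or past the break scanline when it is first polled, so a late slice is presented right away rather than waiting for the next refresh.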



Finally, here's the core part where the timing of each slice is adjusted. The emulator will run the emulated devices for the time specified here, then it'll render the slice in sync with the physical raster, then it'll go on with the next slice, and so on:

Code: Select all
   if (machine().options().frame_slice())
   {
      int frame_slice = machine().options().frame_slice();
      float slice_period = 1.00f / (frame_slice + 1);

      for (int i = 0; i < frame_slice; i++)
         m_slice_timer[i]->adjust(attotime(0, (m_frame_period - m_vblank_period) * slice_period * (i + 1)));
         /*
         m_slice_timer[0]->adjust(attotime(0, (m_frame_period - m_vblank_period) * 0.20f));
         m_slice_timer[1]->adjust(attotime(0, (m_frame_period - m_vblank_period) * 0.40f));
         m_slice_timer[2]->adjust(attotime(0, (m_frame_period - m_vblank_period) * 0.60f));
         m_slice_timer[3]->adjust(attotime(0, (m_frame_period - m_vblank_period) * 0.80f));
         */
   }
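
As a worked example of the timer arithmetic above, here is a small standalone C++ sketch. The function `slice_offsets` is hypothetical and works in plain `double` time units rather than the `attotime` values used in the excerpt; it only reproduces the formula that divides the active part of the frame (frame period minus vblank period) into equal slices.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: time offsets (from frame start) at which each slice
// timer fires, in the same unit as the two period arguments.
//   frame_slice    value of the -frame_slice option
//   frame_period   duration of one emulated frame
//   vblank_period  duration of the emulated vertical blanking
std::vector<double> slice_offsets(int frame_slice, double frame_period,
                                  double vblank_period)
{
    assert(frame_slice >= 0 && frame_period > vblank_period);
    const double slice_period = 1.0 / (frame_slice + 1);

    std::vector<double> offsets;
    for (int i = 0; i < frame_slice; i++)
        offsets.push_back((frame_period - vblank_period) * slice_period * (i + 1));
    return offsets;
}
```

For instance, with `frame_slice = 3`, a 20 ms frame and 4 ms of vblank, the active period is 16 ms and the three timers fire at 4, 8 and 12 ms. The final slice boundary falls at the end of the active period and is not given a timer of its own in the excerpt.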
Attachment: frame_slice.zip (71.26 KiB)