Emulator Developers: Lagless VSYNC ON Algorithm

Talk to software developers and aspiring geeks. Programming tips. Improve motion fluidity. Reduce input lag. Come Present() yourself!
Chief Blur Buster
Site Admin
Posts: 11647
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 26 Apr 2018, 14:19

Fantastic points, Calamity.

While it's quite possible RunAhead and beamracing won't be as useful together as I had expected, it's still possible to combine the two -- just with slightly (or a lot) less benefit than I might have thought.
Calamity wrote:In GroovyMAME, we've been routinely achieving next frame response since 2013, thanks to frame delay and direct vblank polling.
Which is fantastic. I'll take your word for it that it's now being done regularly.
Calamity wrote:The fact is that when emulating frame buffered systems, frame delay and beam racing are EQUIVALENT. It's only for beam-raced systems (e.g. Amiga) where beam racing makes a difference.
Agreed. That's where my current beamracing focus is, the 8-bit and 16-bit systems of the raster-interrupt and raster-coprocessor era.
Calamity wrote:The beauty of beam racing is that it's a more natural approach than frame delay. Besides of virtually no latency, it has other added benefits. It increases input granularity, which might be of help even for frame buffered systems (fight games: combos).
Hmmm, I didn't think of that part -- yes, this could still be a benefit there.
Calamity wrote:With all honesty, I believe run-ahead will prevail over beam racing, at least in the short term. I don't see both hybridizing. Run-ahead is easier and effective enough for the people. It's still the wrong approach, you already know my opinion. It will be an incentive for people to stick with crappy hardware.
RunAhead requires more performance than beam racing. Beam racing will work on Raspberry Pi systems; modern Android and Pi GPUs have the GPU memory bandwidth to do 4-to-10-slice beamracing, especially with low-resolution framebuffers designed for low-resolution screens (e.g. CRT outputs). Frameslice beam racing is mostly GPU-bandwidth dependent (unless you do full-screen shader re-renders per frameslice).

Beam racing is much easier for lower-performance systems at low frameslice granularities, since the memory bandwidth of 2017/2018 embedded GPUs far exceeds that of a few-year-old midrange desktop GPU, which is frankly amazing. Thanks to phone/tablet miniaturization, RAM is embedded into the GPU, which helped a lot. The highest-performing Android devices can spray more frameslices per second than the Intel 4000 GPU that did 6 frameslices with WinUAE.

Although many frameslices demand fast GPU memory, from my findings it is very scalable all the way down to low-end systems if you use the generous-jittermargin technique. The raster can safely jitter by almost a full refresh cycle when using generous jittermargins; wider race margins make it much more low-end-friendly.

That allows the emulator to simply execute in real time, with no surge execution. Lower latency on cheaper hardware.
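
To make the pacing concrete, here's a minimal sketch of frameslice beam racing with a generous jittermargin. This is an illustration only, not the WinUAE/GroovyMAME code, and all the helper names (emu_run_scanlines, real_raster_scanline, etc.) are hypothetical:
Code: Select all
extern void emu_run_scanlines(int n);          // emulate n more scanlines into the framebuffer
extern int  real_raster_scanline();            // via D3DKMTGetScanLine() or a time offset from VBI
extern void cpu_busywait_microseconds(int us); // gentle CPU spin, no GPU traffic
extern void present_framebuffer();             // Present() the same persistent buffer again

// The emulator renders into one persistent framebuffer, so everything below
// the emulator raster still holds the previous refresh's pixels.  A late flip
// therefore degrades into ordinary VSYNC ON behaviour instead of an artifact.
void beamrace_one_refresh(int slices_per_frame, int visible_scanlines)
{
    const int slice_height = visible_scanlines / slices_per_frame;

    for (int slice = 0; slice < slices_per_frame; ++slice) {
        // 1. Emulate just enough to finish this frameslice (real-time pace,
        //    no surge execution needed).
        emu_run_scanlines(slice_height);

        // 2. Stay roughly one frameslice ahead of the real beam: wait until
        //    the real raster has entered the previous slice before flipping.
        //    The tearline then lands in pixels identical to the last flip,
        //    so it stays invisible even with a wide jitter margin.
        while (real_raster_scanline() < slice * slice_height)
            cpu_busywait_microseconds(10);

        // 3. Flip the progressively-rendered framebuffer again.
        present_framebuffer();
    }
}
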
Calamity wrote:RA's architecture makes it very difficult to implement beam racing. RA is made up of different kernels. Each kernel is designed differently. The interaction between frontend and kernels is based on full frames.
Actually, I think it's doable if we just take our time. I looked at the RetroArch API, and it mainly requires adding one optional raster callback function to the emulator modules that support beamracing. Basically called after every raster (see the sketch after this list):
(1) Centralizes beam raced renderers (hides complexity)
(2) Same arguments as the full-frame-delivery callback
(3) Gives centralized beam raced renderers an early peek at frame buffer
(4) Can be ignored if you're doing full frame mechanisms
(5) Can be acted upon (every scanline for front buffer rendering, every X scanlines for frameslice rendering)
(6) Centralized beam raced renderers will do their own busywaits if needed before returning from the callback. This delays the emulator to stay closer behind the realraster.
(7) Minimizes footprint of modification for some emulator modules (in some cases, as little as 5-line mods)
(8) The only responsibility the emulator module has is to call the callback every raster. That's it. The central beamracer module handles the rest (deciding if it's time to frameslice, deciding on nanosleeps)
(9) The workflow stays compatible with future beamracing creativity (e.g. front buffer delivery, future line-based HLSL systems like AddRaster() CRT emulator workflows, future combo of RunAhead+beamracing if still desirable/wanted, VRR, 60Hz, 120Hz, or any workflow)
(10) Do it in a staged way.
.......Add the API as an optional per-raster callback function
.......Emulator modules that have this callback are the beamraceable modules
.......We can then begin with one module at a time.
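
Here's a rough sketch of what such a per-raster callback could look like. To be clear, this is NOT the existing libretro API -- every name below (the callback typedef, the beamracer_* helpers) is hypothetical, just to illustrate how small the per-core footprint could be:
Code: Select all
// Signature mirrors a full-frame video callback, plus the scanline just completed.
typedef void (*raster_callback_t)(const void* framebuffer,
                                  unsigned width, unsigned pitch_bytes,
                                  unsigned emulated_scanline);

// Hypothetical centralized beamracer, living in the frontend:
extern bool beamracer_enabled();                 // false during RunAhead pre-frames,
                                                 // rotated screens, etc.
extern unsigned beamracer_slice_height();        // scanlines per frameslice
extern void beamracer_busywait_behind_real_raster(unsigned emu_scanline);
extern void beamracer_present_frameslice(const void* fb, unsigned width,
                                         unsigned pitch_bytes, unsigned emu_scanline);

void beamracer_on_raster(const void* fb, unsigned width, unsigned pitch_bytes,
                         unsigned emu_scanline)
{
    if (!beamracer_enabled())
        return;                                  // callbacks become cheap no-ops

    if (emu_scanline % beamracer_slice_height() != 0)
        return;                                  // act only once per frameslice

    beamracer_busywait_behind_real_raster(emu_scanline);  // stay ~1 slice ahead of real beam
    beamracer_present_frameslice(fb, width, pitch_bytes, emu_scanline);
}

// Inside an emulator core, the entire modification can be as small as:
//     if (raster_cb) raster_cb(framebuffer, width, pitch_bytes, current_scanline);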

I imagine that we could begin with one module, like the one they suggested in their forums.

There is nothing stopping the central beamracer module from dynamically entering/exiting beamracing by ignoring raster callbacks and acting only upon the final full frame. For example, when the screen is rotated on a tablet so the real scan direction is mismatched from the emulator scan direction, beamracing can automatically disable itself whenever the display isn't in its native orientation. It can also occasionally do a larger delay during scanline #1 to realign refresh cycles after a misalignment (e.g. beam-racing 60 out of 75 refresh cycles because, for whatever reason, the user kept the display at 75Hz -- stuttery, but still lower-lag, and it prevents a beamracing failure at odd refresh rates; it simply looks like VSYNC ON stutter at odd-Hz). For situations where beamracing is fully unwanted (e.g. offscreen RunAhead frames, wrong screen rotation, a windowed-mode switch that doesn't support beamracing), the callbacks would continue but simply return immediately. At other times, the callbacks can optionally be used instead to monitor emulator raster-rendering performance, to make input-delay predictions a little more accurate, etc.

I have offered some technical help to RetroArch programmers for this. Two developers behind RetroArch have expressed tentative interest, and I have stressed it is entirely up to them whether they ever want to implement beamracing.

I am thinking I'd like to put up a 3-figure BountySource prize for this, for any multi-emulator system that includes a large 8-bit footprint (MAME/NES/etc). We're latency-crazy people at Blur Busters anyway. Personal funds. Opinions? Gotchas/problems? Etc? Interested?
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter


Tommy
Posts: 6
Joined: 04 Apr 2018, 14:02

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Tommy » 09 May 2018, 10:24

Chief Blur Buster wrote:(Covering all, not leaving stones unturned .... That said, if you're 100% driving your emulator speed directly from an audio clock only, and never reclocking audio, it's very hard to precisely calibrate your real display's refresh rate to be perfectly in clock-ratio sync with your audio -- for example, different 50Hz EDIDs on different displays may actually result in a graphics clock generating 49.999Hz or 50.001Hz (even if it claims 50.000) -- so you must be doing some form of compensation for this, especially where error tolerances don't overlap, and especially if a user wants to sync the emulator to VSYNC ON (so your emulator necessarily runs off the video card's "clock"). With a GPU clock slightly slewing relative to the CPU's clock, and potentially relative to the audio chip's clock (tiny as the slews may be, e.g. even a <.001% difference in the tick-tock rates of all the various chips in a modern computer...), you're probably making design decisions based on your priorities. How are you balancing synchronization priorities? Beam racing means by necessity you're slaving your emulator clock to the video output, and you've got to reclock everything anyway, so it all boils down to a "doesn't-matter-slowscan-vs-fastscan" based on the above. But if you're permanently clocking your emulator to audio clocks with zero audio reclocking (meaning frame skips/drops at 50Hz or 60Hz VSYNC ON), then that might make it tougher for you to beam race even a simple 60Hz original-scan-velocity output -- you can no longer rely on audio buffer exhaustion during beam racing. Besides, fast-scan is buffer-stuffing, not buffer-exhaustion, and the range-of-adjust Venn diagrams lay those worries to rest.) I'm curious what priorities you chose for synchronization, given the known slew effects between all the various imperfect clocks (video, audio, CPU, etc.). Or is this last paragraph irrelevant (my overthinking), and the other parts of my posts the useful-to-know stuff?
As currently implemented, both the vertical sync and audio exhaustion signals are triggers to do more work. If the emulator is not already processing when it receives a do-more-work trigger then a timestamp is requested from the ordinary std::chrono mechanism. The emulator then runs the machine for the amount of time necessary to reach that timestamp.
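
A minimal sketch of that run-to-timestamp idea (an illustration only, not the actual emulator code; Machine and run_for_seconds are assumed names):
Code: Select all
#include <chrono>

struct Machine { void run_for_seconds(double s); };  // hypothetical emulated machine

// Fired by either trigger source (vertical sync or audio-buffer exhaustion).
void on_do_more_work(Machine& machine)
{
    using clock = std::chrono::steady_clock;
    static clock::time_point emulated_up_to = clock::now();

    const clock::time_point now = clock::now();
    machine.run_for_seconds(std::chrono::duration<double>(now - emulated_up_to).count());
    emulated_up_to = now;
}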

Output frames are just whatever the current CRT state is, with my model of persistence of vision (which, I continue to admit, is a bit of a hand wave, but is sourced from the genuine observation that CRTs instantaneously involve a small sliver of the display being lit very brightly, and I need to map that to LCDs, which involve the whole display being lit at a normal brightness).

My planned adaptation to raster chasing is now going to be pretty simple stuff: the audio events are a very regular low-latency interrupt source. So I'll flip my buffers at each do-more-work trigger. Then I'll just add a flywheel-esque phase locked loop on emulated machine frame rate and display frame rate. Whenever they're close enough, allowing for fudging of the machine running speed by a few percentage points this way or that, I'll just adjust emulated machine phase to match computed host machine phase and significantly dial down the persistence.

So video latency will become coupled to audio latency, I won't have a busy poll, and my perception of what my users want in terms of processing cost versus fans versus fidelity will be met.
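
A minimal sketch of that kind of flywheel-style phase lock (an illustration under assumptions, not the actual implementation; the gain and trim limits are made-up numbers):
Code: Select all
struct PhaseLock {
    double speed_multiplier = 1.0;               // applied to the emulated machine's speed
    static constexpr double kMaxTrim = 0.02;     // only a few percent of fudge either way
    static constexpr double kGain    = 0.1;      // how aggressively to chase the phase error

    // emu_phase and host_phase are each in [0,1): fraction of the way through
    // the current emulated frame and the current host refresh cycle.
    void update(double emu_phase, double host_phase)
    {
        double error = host_phase - emu_phase;
        if (error >  0.5) error -= 1.0;          // wrap to the nearest phase distance
        if (error < -0.5) error += 1.0;

        double trim = kGain * error;
        if (trim >  kMaxTrim) trim =  kMaxTrim;
        if (trim < -kMaxTrim) trim = -kMaxTrim;
        speed_multiplier = 1.0 + trim;           // run the machine slightly fast or slow
    }
};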

I'm still much more excited about breaking the whole-frame-length latency barrier to give something in the low milliseconds than I am necessarily about reaching zero — about bringing the outlier into the fold in terms of competing things one can optimise for rather than it having a hard cut-off.

Dwedit
Posts: 4
Joined: 06 Jun 2018, 08:48

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Dwedit » 06 Jun 2018, 09:09

Hello, I think there is one program that needs more help with input lag than anything else, and that is Bizhawk. Not only does it have the 3 frames of input lag that all windowed-mode programs get, it also does not have the "lagfix" patch for Snes9x integrated in there, so that's an extra frame of lag. Then if you're playing Super Mario World, that game has two frames of internal lag.

Just try playing Super Mario World on BizHawk. I dare you. You can really feel 6 frames of input lag.

Chief Blur Buster
Site Admin
Posts: 11647
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 06 Jun 2018, 11:07

Dwedit wrote:Hello, I think there is one program that needs more help with input lag than anything else, and that is Bizhawk. Not only does it have the 3 frames of input lag that all windowed-mode programs get, it also does not have the "lagfix" patch for Snes9x integrated in there, so that's an extra frame of lag. Then if you're playing Super Mario World, that game has two frames of internal lag.

Just try playing Super Mario World on BizHawk. I dare you. You can really feel 6 frames of input lag.
Welcome, Dwedit!
(For readers: he's the author of Snes9x, PocketNES, etc., and participates extensively with RetroArch.)

That's an awfully laggy situation. I would hate playing Super Mario World with that much lag. It'd be like playing with interpolation on a very old laggy HDTV. Except you're not getting the benefits of interpolation. Boo.

Any thoughts of eventually implementing beamracing in your emulators?

BTW, the RTSS software now has a new frame-capping-by-raster feature that I suggested to the RTSS author. So RTSS can now beamrace a specific scanline for page flips, for next-frame-response "VSYNC ON" equivalent via tearingless VSYNC OFF (tearing can be hidden in the VBI if the gametimes are tiny, e.g. CS:GO). See this thread if interested in some extra knowledge.
_____

Oh and while I'm posting in this thread:
Some fresher, more recent observations that also apply to emulator beamracing:

-- If you decide to use D3DKMTGetScanLine() rather than time offsets from a VBI, then never, never, never busyloop D3DKMTGetScanLine() -- it stresses the GPU a lot and delays draw commands, while making tearline jitter worse. In fact, doing a 10 microsecond busyloop on the CPU before the next D3DKMTGetScanLine() really helps. Better yet, use the known horizontal scanrate to predict where the scanline probably is, and begin polling D3DKMTGetScanLine() only a few scanlines prior, with CPU busywaits of about 0.5 or 1 scanline between scanline polls (if the current scanrate is 65KHz, then CPU-busyloop 0.5/65000th or 1/65000th of a second before the next scanline poll). See the sketch after this list of notes.

-- The QueryDisplayConfig() API call can conveniently obtain the current horizontal scanrate, the VBI size (total versus active), and whether the display is currently being GPU-scaled or not. This is mighty useful for assisting beamracing.

-- The GPU often runs background tasks (compositor) roughly when the raster is near the top edge of the screen. The GPU is less busy (less background processing) when it's nearly finished scanning out than when it begins scanning out, so tearline jitter is erratic near the top edge and more solid near the bottom edge. To work around this, flip earlier in the VBI. WinUAE uses a fixed raster offset to compensate.

-- Flipping right after D3DKMTGetScanLine() will result in a tearline below the scanline number read. To lock better sync between the scanline number and the tearline, do Flush() before busylooping, and call Present() roughly one Present()-duration before the predicted scanline. On my system, depending on refresh rate and resolution, Present() often takes roughly 10 scanlines to do a memory copy of the framebuffer (if no draw commands are pending), so performance-profiling Present() can help calibrate the scanline offset of the tearline. Unfortunately, calibration can get erratic if there's a hugely varying page-flip rate. WinUAE simply uses (AFAIK) a fixed scanline-count offset and is done with it.

-- Some GPUs seem to batch the scanlines together (e.g. micropackets) so you may see tearline jitter in 2-scanline or 4-scanline increments instead of 1-scanline increments. 4-scanline-granularity tearline jumping happens on my MacBook.
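
A minimal sketch of the "predict first, then poll gently" approach from the first bullet above. poll_current_scanline() stands in for a D3DKMTGetScanLine() wrapper; it and the other helper names are assumptions:
Code: Select all
#include <chrono>

extern int    poll_current_scanline();     // one D3DKMTGetScanLine() call, wrapped
extern double horizontal_scanrate_hz();    // e.g. 67500.0, from QueryDisplayConfig()

// Wait until the real raster reaches target_scanline without hammering the GPU.
// (Assumes the target is later in the current refresh cycle.)
void wait_for_scanline(int target_scanline)
{
    using clock = std::chrono::steady_clock;
    const double seconds_per_scanline = 1.0 / horizontal_scanrate_hz();

    // 1. One poll, then CPU-spin most of the distance using the known scanrate.
    int current = poll_current_scanline();
    while (current < target_scanline - 4) {                // stop ~4 scanlines early
        const double wait_s = (target_scanline - 4 - current) * seconds_per_scanline;
        const auto until = clock::now() + std::chrono::duration<double>(wait_s);
        while (clock::now() < until) { /* CPU busywait, zero GPU traffic */ }
        current = poll_current_scanline();
    }

    // 2. Final approach: re-poll every ~0.5 scanline until the target is reached.
    while (current < target_scanline) {
        const auto until = clock::now() +
                           std::chrono::duration<double>(0.5 * seconds_per_scanline);
        while (clock::now() < until) { /* ~0.5 scanline CPU busywait */ }
        current = poll_current_scanline();
    }
}
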
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter


Chief Blur Buster
Site Admin
Posts: 11647
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 06 Jun 2018, 11:12

One warning about QueryDisplayConfig() .... CPU clocks are not in perfect sync with GPU clocks.

There's always a bit of slewing between the clocks (unfortunately, big enough to affect a theoretical TasBOT-precision speedrun system if you're trying to use an external device playing button inputs to control an emulator instead of a real device....)

So getting the horizontal scanrate only tells you how rapidly ScanLine will increase to an accuracy of roughly ~0.01% or ~0.001%.

So QueryDisplayConfig() may return "162000" for horizontal scanrate, while ScanLine actually increments at 161997.3598 times per second when hot and 161997.3602 times per second when cold (yes....sigh...thermals even affects the slew rate between CPU & GPU clocks!)

But it's still a very accurate guideline (much more accurate than just knowing the refresh rate). Far more accuracy than necessary to predict where the scanline currently is, for up to about 1 to 5 seconds after one single D3DKMTGetScanLine() call ...

So one call at startup (or mode change) to QueryDisplayConfig() and then one rare call to D3DKMTGetScanLine() .... and you can easily 'guess' the current ScanLine value simply by extrapolating from the known vertical total (received from QueryDisplayConfig) and known horizontal scanrate (also received from QueryDisplayConfig), at least for non-VRR operation.

And it even remains correct during variable refresh rate (which is always constant scanrate, just varying-size VBI). The horizontal scanrate stays supremely consistent and unchanged whether fixed-Hz or VRR. The same is obviously not true for vertical refresh rate on VRR (obviously!)
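
A minimal sketch of that extrapolation (illustrative names; the modulo assumes fixed-Hz operation so the vertical total is constant):
Code: Select all
#include <chrono>

struct ScanlineAnchor {
    std::chrono::steady_clock::time_point when;  // timestamp of the one real poll
    int    scanline;                             // value D3DKMTGetScanLine() returned
    double scanrate_hz;                          // from QueryDisplayConfig (e.g. 162000)
    int    vertical_total;                       // total scanlines incl. VBI (e.g. 1125)
};

// Good for roughly 1-5 seconds after the anchor poll (CPU/GPU clock slew is
// only ~0.001% to ~0.01%).
int extrapolated_scanline(const ScanlineAnchor& a)
{
    const double elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - a.when).count();
    const long long lines = a.scanline +
        static_cast<long long>(elapsed * a.scanrate_hz);
    return static_cast<int>(lines % a.vertical_total);
}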

Another discovery I made: when the Windows compositor is running at the desktop in GSYNC mode at full frame rate (e.g. running Chrome), 144Hz may actually be varying in the <0.01Hz range (or thereabouts). Counting refresh intervals to high precision (counting blanking intervals accurately for 10 minutes and calculating an ultra-precise refresh rate from that, in tests), I was getting numbers like 143.983Hz varying to 143.987Hz, so there will be amplified errors at the fractional scale if you're running in desktop GSYNC mode. The micro-fluctuation disappears when I turn off GSYNC mode. Occasionally it becomes visible at http://www.testufo.com/refreshrate#digit=6 too when displaying 6-decimal precision (close and reopen the browser between repeat measurements -- the 10-minute average numbers will vary a lot less with desktop GSYNC turned off than turned on). It may not be a relevant finding to this thread, but it's "Useful To Know" information that consistent full-framerate GSYNC has more refresh-interval fluctuation (in the microseconds league) than fixed-Hz mode. Such is the nature of asynchronous-refreshing modes like VRR (GSYNC/FreeSync): if you're relying on the full flat-out refresh rate to be a super-precise GPU clock, then you want to turn off windowed GSYNC... unless you're handling the timing of refresh cycles yourself (with your Present() calls).
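
A minimal sketch of that style of measurement (an illustration; wait_for_vblank() is a stand-in for something like a D3DKMTWaitForVerticalBlankEvent wrapper):
Code: Select all
#include <chrono>
#include <cstdio>

extern void wait_for_vblank();   // blocks until the next vertical blanking interval

double measure_refresh_hz(int intervals)       // e.g. 36000 intervals (~10 min at 60Hz)
{
    using clock = std::chrono::steady_clock;
    wait_for_vblank();                         // align to a VBI edge first
    const auto start = clock::now();
    for (int i = 0; i < intervals; ++i)
        wait_for_vblank();
    const double seconds =
        std::chrono::duration<double>(clock::now() - start).count();
    return intervals / seconds;                // long averaging gives many-decimal precision
}

// Example: std::printf("%.6f Hz\n", measure_refresh_hz(36000));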

P.S. In addition to scanrate/total -- and detecting the presence of GPU scaling -- I also use QueryDisplayConfig for screen-rotation detection. This always gives you the scan direction on desktop GPUs, which can let you enter/exit beamracing mode depending on whether the real-screen and emu-screen scanout directions match.
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter


Dwedit
Posts: 4
Joined: 06 Jun 2018, 08:48

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Dwedit » 06 Jun 2018, 11:37

So I was reading about DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL and DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING, and watching the YouTube video from Microsoft saying that you can do lagless partial updates in borderless fullscreen mode as long as you cover up the whole screen. It also claims to support lagless windowed-mode updates if your GPU supports overlays, which my Intel Skylake GPU does not.

It's supported in DirectX 11 and 12, and Windows 8 or later.

Has the Borderless Fullscreen Latency stuff been tested out and measured, and does it really work as well as Microsoft claims it does?

It seems there could also be an opportunity to do partial GPU updates, so that only a partial screen slice is drawn to video memory each frame. Does that actually work? Or do you need to use DISCARD mode and just redraw everything?

I still haven't figured out how to write the code to initialize DirectX 11 or 12 and draw a single textured quad. I have done DX9 before though, which is really finicky about all the presentation parameters being correct before it will give you a device object.

------------------------

I also messed around with CRU (Custom Resolution Utility). It lets you play with the timing parameters, so you can pick any refresh rate you want; 60.098 is no problem.
It seems the pixel clock it shows is only given to 5 significant figures, so you may have 148.75MHz but it could actually vary. Good information to know. Horizontal dots per scanline cannot vary, though; for 1920x1080 you may get 2200 dots per scanline and 1125 lines per frame.
So your measurements for the "162000" horizontal scanrate mean that a 148,750,000 pixel clock is actually between 148,747,575.74 Hz and 148,747,576.11 Hz, and a refresh rate specified as 60.098Hz could actually be 60.097Hz.
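
A quick worked check of that arithmetic (illustrative only, using the "hot" scanrate measurement from the earlier post):
Code: Select all
#include <cstdio>

int main()
{
    const double nominal_scanrate  = 162000.0;      // what QueryDisplayConfig reports
    const double measured_scanrate = 161997.3598;   // the "hot" measurement quoted earlier
    const double nominal_pixclock  = 148750000.0;   // 148.75 MHz from CRU

    // The measured-vs-nominal scanrate ratio scales the nominal pixel clock.
    const double actual_pixclock =
        nominal_pixclock * (measured_scanrate / nominal_scanrate);
    std::printf("actual pixel clock ~= %.2f Hz\n", actual_pixclock);  // ~148,747,575.7 Hz
    return 0;
}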

Chief Blur Buster
Site Admin
Posts: 11647
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 06 Jun 2018, 11:58

Dwedit wrote:Has the Borderless Fullscreen Latency stuff been tested out and measured, and does it really work as well as Microsoft claims it does?
While I can't be the last word on this due to a lack of thorough tests, I did briefly test tearlines, and:

In the best case, it's darn near as lagless as fullscreen mode, at least when tearlines are genuinely enabled in borderless windowed mode. The raster offset between Present() and the tearline is not noticeably worse than in fullscreen mode, so essentially far less than 1ms difference. (1ms would be a 1/8th-screen-height raster offset at 120Hz refresh rate with an 8.3ms scanout. I was not getting noticeable offsets in my brief unscientific test.)

(Tearline movement and jitter is an amazing human visualization of performance profiling! At a 162KHz scanrate, it is amazing to see 1/162000sec timer differences cause a tearline to move down by 1 pixel. Due to the incredible accuracy of the scanrate, we use the tearline position as a realtime "lag-differential" indicator -- a 1ms increase in lag moves the tearline downwards by about 1/8 screen height at 120Hz.)

I'll let other readers/authors chime in. Calamity/Toni/Tom have probably gotten email notifications about this thread by now, and should be chiming in over the coming days.
Dwedit wrote:So your measurements for the "162000" horizontal scanrate means that a 148,750,000 pixel clock is actually between 148,747,575.74 Hz and 148,747,576.11 Hz, and a refresh rate specified at 60.098Hz could actually be 60.097Hz.
Unfortunately, I've seen worse. The error can be as much as 0.01Hz on some of the systems I've tested. Usually it's <0.01Hz, but I've seen "60.000" in CRU result in QPC performance profiling reporting "59.993" or "60.009" (etc.) on some laptops -- very close to 0.01 magnitude instead of 0.001. Some systems seem to be much more inaccurate in the difference between GPU clock and CPU clock. The different power-management modes (and clock rates) cause slew-rate changes in realtime -- force your CPU to 1.5GHz versus 4GHz and you'll see different numbers come up. Likewise for forcing GPU power states.

In reality, the GPU may be unchanged from an external atomic-clock perspective, while the power-managed CPU is what introduces the inaccuracy into QPC itself. It could be vice versa, but I think it's the CPU that gets the bigger error-margin variances -- meaning "time" is getting distorted within the CPU (speeding up and slowing down microscopically) more often than within the GPU, which must remain more precise because demanding displays are intolerant of this stuff.

(Like Einstein, all is relative in time speeding up or slowing down. You would never notice these microsecond variances, unless you monitored externals. Like noticing the slewrate varying in realtime between CPU and GPU!)

P.S. If you're doing extremely tight raster margins (scanline-exact) or running on power-managed systems -- keep the Present() rate intentionally high, and don't let the GPU idle beginning roughly 1ms prior to a raster-perfect Present(). So sometimes do a dummy duplicate-frame Present() right before the actual tearline Present(). This forces GPU power states to stay at maximum, keeping the Present() time interval consistent. Present() may take 10 scanlines at full power and 20 scanlines at power-saver speed (slower memory blitting, I think...), creating a 10-versus-20-scanline offset and thus tearline jitter from power-management erraticness.

High framerates (frequent Present() calls) don't have this problem, but raster-jitter amplification can begin to occur at low frameslice rates on powerful GPUs, due to rapidly switching GPU power-management states. That's why WinUAE raster jitter sometimes becomes worse when you use fewer frameslices, unless workarounds are done -- power-state changes are messing things up.
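
A minimal sketch of that keep-the-GPU-awake trick (an illustration; the helper names are assumptions):
Code: Select all
#include <chrono>

extern void busywait_until(std::chrono::steady_clock::time_point t);
extern void present_duplicate_frame();   // Present() of an unchanged backbuffer
extern void present_frameslice();        // the real, raster-timed Present()

void raster_exact_flip(std::chrono::steady_clock::time_point target_flip_time)
{
    using namespace std::chrono;

    // ~1ms before the real flip, fire a throwaway duplicate-frame flip so the
    // GPU can't drop into a power-saver state.  Present() then keeps taking
    // ~10 scanlines instead of stretching to ~20 in a low-power state.
    busywait_until(target_flip_time - microseconds(1000));
    present_duplicate_frame();

    busywait_until(target_flip_time);
    present_frameslice();
}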

In reality, one doesn't need such exactness if one exercises the jittermargin techniques mentioned throughout. But it's still good-to-know info...
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter


Chief Blur Buster
Site Admin
Posts: 11647
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 06 Jun 2018, 12:10

Dwedit wrote:Seems that there could also be an opportunity to do partial GPU updates, so only a partial screen slice could be drawn to the video memory each frame, does that actually work? Or do you need to use DISCARD mode and just redraw everything?
1. On the fastest GPUs, it seemed not to matter much (~10% difference in frameslice rate for simple graphics, e.g. 7000 frameslices per second versus 8000 frameslices per second on a GTX 1080 Ti). The overhead of blitting the whole framebuffer versus just the "frameslice of interest" within the buffer ends up being almost nil for GPUs capable of blitting a significant fraction of a terabyte of memory per second.

2. Now, for laptop GPUs using much slower shared memory, Raspberry Pi, Android, etc. -- blitting just the frameslice during Present() may actually result in a significant performance improvement.

It's a function of how much time it takes to blit a full-screen framebuffer. The fastest GTX TITAN can do that in under 100 microseconds, but a very slow Android GPU might take more than 2 milliseconds. There's a point where it seemed to matter and a point where it didn't -- so it depends on what your software targets. Lower-resolution framebuffers take less time to blit, so the optimization matters less when running at low resolutions.

In beam-raced emulator rendering workflows one is assumed to render only the frameslice of interest into an existing emulator framebuffer, so that's the workflow I'm assuming for both of the above scenarios. No extra draw API calls*; it seemed to just be a difference in the amount of memory blitted per Present() call.

*Extra overhead will occur if you plan to do fuzzy-line/HLSL/"simulated CRT" re-renders at frameslice-overlap regions (fuzz overlaps and curved scanlines that straddle two frameslices) when doing beamrace-optimized shaders -- e.g. a 2-frameslice or 3-frameslice overlapped shader/HLSL re-render during beamracing, to save you from having to do a full-screen shader/HLSL re-render per frameslice event. Because the frameslice-overlapping regions still need to be redone, the total draw commands are still more than doing it only once per full framebuffer. But such beamrace-optimization tasks for shaders are a totally different ball of wax from this topic at the moment. For simplicity, WinUAE just does full shader passes all over again every frameslice, dramatically reducing framerate.
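
For reference, a minimal Direct3D 11 sketch of uploading only the frameslice of interest instead of the whole framebuffer (an illustration with assumed variable names, not a drop-in implementation):
Code: Select all
#include <d3d11.h>
#include <cstdint>

void upload_frameslice(ID3D11DeviceContext* ctx,
                       ID3D11Texture2D* emu_texture,    // full-size emulator framebuffer texture
                       const uint32_t* emu_framebuffer, // emulator's CPU-side pixels (32-bit)
                       unsigned fb_width_pixels,
                       unsigned slice_top, unsigned slice_bottom)
{
    const unsigned row_pitch_bytes = fb_width_pixels * 4;

    // Only the rows belonging to this frameslice get copied to the GPU.
    D3D11_BOX box = {};
    box.left  = 0;          box.right  = fb_width_pixels;
    box.top   = slice_top;  box.bottom = slice_bottom;
    box.front = 0;          box.back   = 1;

    ctx->UpdateSubresource(emu_texture, 0, &box,
                           emu_framebuffer + slice_top * fb_width_pixels,
                           row_pitch_bytes, 0);
    // ...then draw the textured quad and Present() as usual.
}
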
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter


Chief Blur Buster
Site Admin
Posts: 11647
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 09 Jun 2018, 21:51

For everyone looking to do beamraced tearlines:

New findings on tearline-offset compensation: the best ways to do raster-exact tearlines (+/- 1 or 2 scanlines).

Experiments done in:
-- Time-based offsets
-- Scanline-based offsets
-- Benchmark-time-of-Present() based offsets

All of these have different accuracy pros/cons, but I found that I got maximum accuracy with a time-based offset. The same time-based offset (manually calibrated on my system as 0.2 millisecond, or 0.0002 sec -- but it may vary from system to system) created near scanline-exactness at 480Hz, 144Hz, 120Hz, 60Hz, 4K, and 1080p. The offset may vary from platform to platform, but it can easily be calibrated manually with a slider while watching a tearline-alignment test pattern.

This stayed within 1-2 scanlines of raster-exact tearline accuracy with both approaches:
-- time-based guesses (offsets from microsecond accurate VSYNC timestamps)
-- D3DKMTGetScanLine polls (read scanline poll)

Now Present() gets the tearline exactly where I want it, at least for Tearline Jedi, whether at 120 frameslices per second (mouse-draggable tearline) or 7000 frameslices per second.

So I've added configurable parameters:
-- Time based offset
-- Scanline based offset
For the developer to manually calibrate. 99% of the time you DO NOT need raster-exact tearlines, especially if you're using the jittermargin technique with emulator frameslice beamracing. However, for Tearline Jedi, I really need them to be darn near raster-exact (due to the special nature of the Tearline Jedi demo).
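
A minimal sketch of the time-based offset approach (an illustration; the helper names are assumptions, and the 0.2ms default is just my system's calibration):
Code: Select all
#include <chrono>

extern std::chrono::steady_clock::time_point last_vsync_timestamp();  // microsecond-accurate VSYNC timestamp
extern double seconds_per_scanline();   // 1.0 / horizontal scanrate
extern void busywait_until(std::chrono::steady_clock::time_point t);
extern void present_now();

void present_at_scanline(int target_scanline,
                         double calibrated_offset_s = 0.0002)   // the user-tunable slider
{
    using namespace std::chrono;

    // Aim the flip at the target scanline's timestamp, minus the calibrated
    // lead time so the tearline lands on (not below) the requested scanline.
    const auto lead = duration_cast<steady_clock::duration>(duration<double>(
        target_scanline * seconds_per_scanline() - calibrated_offset_s));
    busywait_until(last_vsync_timestamp() + lead);
    present_now();   // tearline appears within ~1-2 scanlines of target_scanline
}
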
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter


Chief Blur Buster
Site Admin
Posts: 11647
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Post by Chief Blur Buster » 09 Jun 2018, 23:31

Check out my new post on pouet.net
Tearline Jedi Demo Thread
If you're a raster-experienced programmer, I could use some help.

Cheers!
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter

