Hmmm, just noticed a Reddit thread about the WinUAE implementation of the Lagless VSYNC (beam racing) algorithm!
https://www.reddit.com/r/emulation/comm ... ency_mode/
I totally missed that two days ago.
Tommy wrote:I'm also a bit worried about eGPUs. Has anybody performed any metrics on those? I understand that when an eGPU is running an internal screen it's actually still the original GPU generating the display, with the eGPU piping data? If so then is there much extra latency there?
Beam racing works on Intel iGPUs (integrated GPUs).
(from a post in the multi-page abime.net version of this thread)
Rotareneg wrote:My laptop is running Windows 10 Home on a Core i5-3230M with the integrated GPU (Intel HD Graphics 4000). When using the built-in screen it hardly jitters at all, but has a fixed wrap-around of the last slice:
The wraparound is probably amplified because several laptops have almost no VBI -- probably only 10 or 15 scanlines at most. Scanline #1 begins scanning almost immediately after the last scanline. (This requires surge-execution of VBI code to complete the emulator's VBI tasks in time.)
For this, I highly recommend flipping after the first (topmost) emulator frameslice render, while the realraster is still within the bottom part of the screen. This gives you more safety margin by doing time-offsets from Scan Line #1, so it's more VBI-size-independent (more universal across platforms and displays). Don't worry about odd color strips: as long as the flip boundaries are above the frameslice boundaries, it's harmless, even in the wraparound-scan situation.
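A minimal sketch of this scheduling idea (all names and numbers here are my own illustrative assumptions, not from WinUAE):

```python
# Illustrative sketch (names and numbers are my assumptions, not WinUAE's):
# schedule each frameslice flip as a raster-line offset from Scan Line #1,
# so the logic is independent of how big or small the display's VBI is.

VISIBLE_LINES = 1080               # active scanlines (assumed)
VBI_LINES = 12                     # tiny laptop-style VBI (assumed)
TOTAL_LINES = VISIBLE_LINES + VBI_LINES
SLICES = 4                         # frameslices per refresh (assumed)
SLICE_LINES = VISIBLE_LINES // SLICES

def flip_line_for_slice(i, early_lines=SLICE_LINES // 2):
    """Raster line (modulo TOTAL_LINES) at which to flip frameslice i.

    Slice 0 wraps around: it is flipped while the real raster is still
    in the bottom part of the *previous* refresh -- the extra safety
    margin recommended above for near-zero-VBI laptop panels."""
    return (i * SLICE_LINES - early_lines) % TOTAL_LINES
```

With these example numbers, slice 0 flips at raster line 957 (near the bottom of the previous scanout), while slices 1-3 flip half a slice ahead of their boundaries (lines 135, 405, 675).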
Confirmed so far:
-- Works on NVIDIA
-- Works on AMD
-- Works on Intel iGPUs
-- Works on Android GPUs
Tommy wrote:I don't know how relevant or helpful it is, but on embedded GPUs like those in phones and Intel's various solutions it's often more expensive not to clear a target buffer upon first binding it.
Oh, good point! I hadn't fully thought this through -- you're right that certain GPUs/drivers may actually depend on the assumption that the buffer is already clear. That said, on my desktop at least, buffers aren't pre-cleared, and skipping the clear does save a little bandwidth and slightly increases frameslice throughput (though not as much of a performance increase as I expected). More study needed.
Tommy wrote:Anyway, yes, I'm aware that my priorities are likely distinct. They usually are.
Certainly understandable! There are so many possible priorities -- e.g. optimizing for performance, CRT-behavior authenticity, blur-authenticity, etc. They're often difficult to achieve simultaneously for one reason or another (e.g. hardware limitations, like LCDs not doing perfect blacks).
Tommy wrote:That surely depends on how you are generating audio. If tasked with producing a classic 44100Hz output, I currently get a guaranteed worst-case audio latency of 5.8ms because audio output exhaustion is one of the triggers to do more emulation. So that's not just sub-frame, but completely detached from the frame rate.
In the hypothetical 60->120 mapping of a surge frame then a fallow frame, the worst case is now the length of the fallow frame. So it's gone up to around 8.3ms. If I were 60->240 mapping by a surge frame and then three fallows, I'm up at 12.5ms. Etc.
Or am I labouring under a misapprehension?
Excluding the audio buffer: audio never needs to be delayed beyond the 1/60sec window, no matter how slow or fast the scanout. And large VBIs, even ones larger than Active (e.g. 75% VBI, 25% active), don't shrink the wraparound jitter margin -- you still have 16.7ms of video-delay adjustment (minus the time interval of one frameslice) even in ultrafast-scanout situations, no matter whether the raster is in Active or in VBI.
Surge-scanouts don't prevent you from delaying your "beam-race" into a "beam-chase" -- basically beam racing with a ~15-16ms margin.
Unchanged Constants:
-- Audio buffer exhaustion (meaning 5.8ms after this event, audio stops?)
-- The original emulator refresh cycle target of 1/60sec (or 1/50sec, etc)
-- Full emulator-refresh-cycle jitter margin (unchanged regardless of fallow cycles or large VBI)*
*Remember this is the full emulator refresh cycle jitter margin, so always 1/60sec no matter how fast the scanout is
So the beam-race velocity doesn't interfere with your ability to compensate to an exact audio delay: a fast beam-race can have unchanged-lag audio because the constants remain unchanged. You simply use a bigger audio buffer so you can buffer more audio (e.g. 16.7ms of audio buffered during a 4.2ms fast-scanout), and then also do a video-delay adjustment.
The average lag improvement of a faster scanout is the midpoint of the improvement in the visible-scanlines scanout time (minus VBI). A 60Hz scanout is roughly 16.2ms excluding VBI time (roughly 0.4-0.5ms for a common VBI), so this is (16.2ms - 4.2ms) / 2 = 12ms / 2 = 6ms average lag decrease (averaged over all pixels of the entire screen) for a 1/240sec scanout instead of a 1/60sec scanout. Approaching this from a different direction: the 1/240sec scenario scans top-to-bottom over [0..4.2ms] (average 2.1ms), rather than the 1/60sec scenario's [0..16.2ms] (average 8.1ms). Double-checking the formula via different math: 8.1ms - 2.1ms = 6ms. So: a ~6ms average decrease in visual lag for the refresh cycle as a whole, for a 1/240sec scanout instead of a 1/60sec scanout -- meaning the average photon hits the eyeballs 6ms sooner for any given random pixel on the screen (less for the top edge, more for the bottom edge; this is just the average).
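The same arithmetic, double-checked in a few lines (all values taken from the post itself):

```python
# Average-lag arithmetic from the paragraph above (post's own values):
VISIBLE_60HZ_MS = 16.2    # visible scanout time at 1/60 sec, VBI excluded
VISIBLE_240HZ_MS = 4.2    # visible scanout time at 1/240 sec

# Method 1: half the difference between the two scanout durations
avg_lag_decrease = (VISIBLE_60HZ_MS - VISIBLE_240HZ_MS) / 2

# Method 2: difference between the midpoints of each scanout window
avg_lag_decrease_2 = VISIBLE_60HZ_MS / 2 - VISIBLE_240HZ_MS / 2

# Both methods agree on a ~6 ms average lag decrease:
assert round(avg_lag_decrease, 6) == round(avg_lag_decrease_2, 6) == 6.0
```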
Now, this is clearly within the jitter safety margin, so you simply trail the realraster scanout behind the emuraster scanout with a roughly 6ms chase distance between emuraster and realraster. Voila -- audio is back in sync.
If you preferred to align audio to top-edge refresh or bottom-edge refresh, that's easy -- it's only a 4.2ms range of adjustment, so +/- 2.1ms around 6ms, which is an adjustment range of [3.9ms..8.1ms] of video-lag adjustment for all possible theoretical audio-delay aberrations caused solely by the scanout-velocity difference. This adjustment range fully fits within the jitter margin.
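That [3.9ms..8.1ms] range falls straight out of the numbers above:

```python
# The +/- 2.1 ms adjustment range around the 6 ms average (post's values):
AVG_DECREASE_MS = 6.0     # average lag decrease, 1/240 sec vs 1/60 sec
FAST_SCANOUT_MS = 4.2     # visible 1/240 sec scanout duration

top_edge = AVG_DECREASE_MS - FAST_SCANOUT_MS / 2      # audio aligned to top edge
bottom_edge = AVG_DECREASE_MS + FAST_SCANOUT_MS / 2   # audio aligned to bottom edge
assert (round(top_edge, 1), round(bottom_edge, 1)) == (3.9, 8.1)
```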
Mathematically, this always holds, provided your frameslices are shorter than the active part of the real refresh cycle. Yet your range of adjustment is the FULL emulator refresh cycle. When I said the jitter margin was "a refresh cycle minus a frame slice", that meant "a full 1/60sec minus one frame slice" -- even if the real-world scanout is 1/240sec. Maybe this was the source of confusion. The jitter-range formula is (the number of scanlines in 1/60sec) minus (the number of scanlines in a frame slice) -- that's the video-delay range of adjustment you get. A bigger VBI actually improves your range of video-delay adjustment when you use this formula with the same frameslice size! So that's hopefully a more generous range of video-delay adjustment than you might have thought. My apologies for the confusion.
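The jitter-margin formula stated above, as a sketch (the scanline counts are my illustrative assumptions):

```python
def jitter_margin_ms(emu_hz=60.0, total_lines=1092, slice_lines=270):
    """Video-delay adjustment range: one full *emulator* refresh cycle of
    scanlines minus one frameslice, at the emulator's line rate --
    independent of how fast the real display scans out."""
    ms_per_line = (1000.0 / emu_hz) / total_lines
    return (total_lines - slice_lines) * ms_per_line

# A bigger VBI (more total lines, same frameslice height) widens the margin:
assert jitter_margin_ms(total_lines=2000) > jitter_margin_ms(total_lines=1092)
```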
Since both situations (1/60sec and 1/240sec) have the same unchanged audio-buffer-exhaustion signal (and the same time interval between the software exhaustion event and the sound physically stopping -- which you say is a worst case of 5.8ms, right?), there's no difference between the fast-scanout and slow-scanout situations in terms of buffer behavior. At any one time, no matter how slow the scan or how infinitely fast, there's never a need to generate more than one refresh cycle's worth of audio at any one instant. Even the impossibly improbable scenario of instant scanout + 16.7ms VBI only requires an 8.3ms video-delay adjustment to match the average video-lag decrease -- still within your video-delay adjustability margin (via the beam chase distance between emuraster and realraster).
There's one caveat -- the equally rare combination of "audio stimuli directly clocked from an input read" with "input reads at random places in the emulator scanout". Due to the asymmetry of emu-scanout versus real fast-scanout, this would cause a variable audio delay (+/- a few milliseconds) depending on where in the emulator refresh cycle the keypress or fire button was read. In reality, the two don't happen simultaneously, because either (1) input reads occur at a consistent location in the refresh cycle, or (2) audio events are internally synced to the emulator's clock or the emulator's VBI -- so the audio-lag offset is constant if either is true, and it's easy to adjust the audio/video lag. You'd target the average lag decrease caused by the fast scanout and adjust the beam-race margin as a video delay. For recalibrating to a game that does input reads at the bottom edge of the screen, a slider could adjust the video delay (via the chase distance between emuraster and realraster, as described earlier). In all possible situations, video-ahead-of-audio mathematically never falls outside the video-delay adjustment range.
Now if your audio buffer is fixed at 5.8ms (can't buffer more), that's another ball of wax, but that's a different subtopic. (Please clarify if you've got a limitation on buffering /more/ -- e.g. more than 5.8ms of audio; my assumption is that you can.) Buffering more will potentially be necessary in fast-scanout situations, since the bottom of the screen finishes 12.5ms sooner in a 1/240sec beam-race (12ms sooner, excluding VBI). Then you do the video delay to sync up with the audio delay -- half of the 12ms (6ms) to match the average lag. So a bigger audio buffer (more than 5.8ms) becomes necessary to zero out the audio-video lag.
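The buffer-sizing arithmetic for the fast-scanout case, as a sketch (values taken from the discussion):

```python
# Buffer-sizing arithmetic for the fast-scanout case (discussion's values):
EMU_REFRESH_MS = 16.7        # one emulator refresh cycle's worth of audio
VISIBLE_60HZ_MS = 16.2       # visible scanout at 1/60 sec, VBI excluded
FAST_SCANOUT_MS = 4.2        # visible scanout at 1/240 sec
FIXED_BUFFER_MS = 5.8        # the quoted worst-case audio latency

# A surge scanout emits a full refresh cycle of audio in only ~4.2 ms,
# so the buffer must be able to hold more than the fixed 5.8 ms:
assert EMU_REFRESH_MS > FIXED_BUFFER_MS

# Bottom-of-screen finishes 12 ms sooner (excluding VBI); delaying video
# by half of that (~6 ms) re-centres it on the average audio lag:
video_delay_ms = (VISIBLE_60HZ_MS - FAST_SCANOUT_MS) / 2
assert round(video_delay_ms, 1) == 6.0
```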
(Covering all bases, leaving no stone unturned...) That said, if you're driving your emulator speed 100% directly from an audio clock, and never reclocking audio, it's very hard to precisely calibrate your real display's refresh rate to be perfectly in clock-ratio sync with your audio. For example, different 50Hz EDIDs on different displays may actually result in a graphics clock generating 49.999Hz or 50.001Hz (even if it claims 50.000), so you must be doing some form of compensation for this, especially where error tolerances don't overlap -- and especially if a user wants to sync the emulator to VSYNC ON (so your emulator necessarily runs off the video card's "clock").

With a GPU clock slightly slewing relative to the CPU's clock, and potentially relative to the audio chip's clock (tiny as those slews may be -- even <0.001% differences in the tick rates of the various chips in a modern computer), you're probably making design decisions based on your priorities. How are you balancing synchronization priorities? Beam racing by necessity means slaving your emulator clock to the video output, and you've got to reclock everything anyway, so it all boils down to "doesn't matter, slow-scan vs fast-scan", per the above. But if you're permanently clocking your emulator to audio clocks with zero audio reclocking (meaning frame skips/drops at 50Hz or 60Hz VSYNC ON), that might make it tougher for you to beam race even a simple 60Hz original-scan-velocity output -- you can no longer rely on audio buffer exhaustion during beam racing. Besides, fast-scan is buffer-stuffing, not buffer-exhaustion, and the range-of-adjustment Venn diagrams lay those worries to rest.

I'm curious what synchronization priorities you chose, given the known slew effects between all the various imperfect clocks (video, audio, CPU, etc.). Or is this last paragraph irrelevant (my overthinking), and the rest of my post useful-to-know stuff?
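One common way to cope with those clock slews is to measure the display's true refresh rate from VSYNC timestamps and derive an audio reclocking ratio. A minimal sketch, with hypothetical function names (not from any real emulator):

```python
def estimated_refresh_hz(vsync_timestamps):
    """Average refresh rate (Hz) over a window of VSYNC timestamps (sec).

    An EDID mode claiming 50.000 Hz may really tick at 49.999 or
    50.001 Hz relative to the CPU clock, so measure rather than trust
    the mode's nominal number."""
    intervals = len(vsync_timestamps) - 1
    return intervals / (vsync_timestamps[-1] - vsync_timestamps[0])

def audio_resample_ratio(nominal_hz, measured_hz):
    """Reclocking ratio: stretch/shrink audio by this factor so an
    emulator slaved to the video clock never skips or drops frames."""
    return measured_hz / nominal_hz
```

For instance, a display measured at 49.999Hz against a nominal 50Hz mode yields a resample ratio of 0.99998 -- a tiny, inaudible correction, but enough to prevent slow buffer drift.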