Reliable and efficient busy-waiting?

Talk to software developers and aspiring geeks. Programming tips. Improve motion fluidity. Reduce input lag. Come Present() yourself!
silikone
Posts: 57
Joined: 02 Aug 2014, 12:27

Reliable and efficient busy-waiting?

Post by silikone » 14 Dec 2020, 03:56

I made a small OpenGL program from scratch that does little but alternate between black and white frames with some control of V-sync and frame rate using QueryPerformanceCounter spinning.
When it works, it works surprisingly well. With the refresh rate set to exactly 60Hz, the tear line, manifesting in an arbitrary position on the screen, remains static with virtually no drifting. However, once in a while it can suddenly jump, either by a few hundred lines or, in some cases, a whole frame. Considering that I am just checking whether enough time has passed since the last frame began, it's no surprise that it occasionally misses the mark ever so slightly, but the amount it sometimes misses by is concerning. In addition, this occupies an entire CPU core's worth of resources despite being a lightweight workload in principle. It wouldn't surprise me if some internal thread scheduler then causes the aforementioned hitches, in part due to the busy-waiting being a hog.

A further observation: while the tear line's base position remains static, there is some rather major jitter between every frame that is significantly affected by keyboard and mouse input. Holding down the Ctrl key while moving the mouse can push the line almost a quarter of the screen down.

Using the Sleep function before entering the loop mitigates some of the high CPU utilization, but it's far too inaccurate for something this sensitive, especially so at 120Hz where it never stabilizes. I've also tried using YieldProcessor (_mm_pause instruction), but it had no impact I could detect.

What are some further steps that could be taken to, most importantly, ensure that the frame always begins on schedule, but also not unnecessarily leech off CPU cycles?

User avatar
Chief Blur Buster
Site Admin
Posts: 11647
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada
Contact:

Re: Reliable and efficient busy-waiting?

Post by Chief Blur Buster » 15 Dec 2020, 03:33

Firstly, congratulations on becoming a Tearline Jedi -- aka beam-raced control of tearlines.

You just achieved a raster beam-raced tearline: precisely timing a new spliced frame mid-scanout on the GPU output.

You're sort of using a raster-based technique used by RTSS Scanline Sync as well as Tearline Jedi (see Tearline Jedi Demo Thread).

A more advanced implementation of tearline beamracing:
phpBB [video]


Tearline position is based on a time offset between two VSYNCs, so you can control it as long as you know the tick-tock of the VSYNC, or you use API calls such as D3DKMTGetScanLine() (don't busyloop on it -- it's a VERY expensive API call, but it can help you predict how long you need to wait).

If you're familiar with raster interrupts of old 8-bit days, then becoming a Tearline Jedi is relatively easy, since it's now achievable with high level programming languages instead of assembly. But as you've seen, it's very precision-demanding.

Raster-based control over tearlines is extremely hard, because tearlines jitter with mere microseconds of timing error. Your horizontal scan rate determines how fast a tearline moves downwards -- e.g. a 135 KHz scan rate means a 1/135000sec delay moves the tearline downwards by 1 pixel. You can detect this through QueryDisplayConfig() with the timings returned:

Code: Select all

double vertRefreshRate = (double)timings.vSyncFreq.Numerator / (double)timings.vSyncFreq.Denominator;
double horizRefreshRate = (double)timings.hSyncFreq.Numerator / (double)timings.hSyncFreq.Denominator;
horizRefreshRate gives the number of pixel rows scanned per second, which tells you how fast your tearline moves downwards over time. This can help you understand how timing-critical this is -- mere microseconds later, the tearline has moved downwards.

For beam racing, the position of the tearline can be controlled via:

1. Time the tearline via D3DKMTGetScanLine() ...(windows specific)
2. Time the tearline via time-offset between two VSYNC's (more crossplatform)
3. Time the tearline via known precise refresh rate (though position may be a bit more random, if you don't know time of VSYNC).

Monitoring the VSYNC clock is hard during VSYNC OFF mode, so you have two methods:

1. (easier but Windows-specific) Get the VSYNC heartbeat from D3DKMTGetScanLine() while keeping OpenGL in VSYNC OFF mode
(this doesn't require a Direct3D context -- you can just grab it from the Windows desktop of the primary monitor, which has a permanent context)

2. (buggy) Two 3D threads (one hidden VSYNC ON thread to get the VSYNC heartbeat, and one visible VSYNC OFF thread to beamrace the tearlines). This can sort of work, possibly crossplatform, if the 3D API supports multiple windows concurrently.

You may need to:

1. Render in a separate lower-priority thread
2. Present the frame in a higher-priority thread (busywait+present is a critical section that needs to be the highest-priority thread of everything in the app)
3. Make sure the application is running at a higher process priority than anything else on the machine.
4. You can use high-precision timer events on some machines, but the best approach is to use a timer event until ~1ms prior, then busywait the rest of the way.

You will probably still need to burn a CPU core, but you can reduce the load a bit by doing timer-until-1ms-prior, then busywait-the-remaining-way. Make sure you turn off power management, since power management will completely cancel out your ability to stabilize the tearline. And be careful about multimonitor, since you may be beamracing the wrong monitor. On a multi-GPU system (Intel internal GPU ports + discrete GPU ports) with multiple monitors, you may have to enumerate the GPUs and enumerate the monitors (and index them together) if you want multimonitor-capable beamracing. So simplify things and only work with the primary monitor.

Some best practices, crossposted here:
Chief Blur Buster wrote:Programming General Practices for Stable PC-based VSYNC OFF Tearline Beam Racing
- Highest priority for presenter thread that controls tearline. At least 1 thread priority higher than any other threads in the app. If you don't want higher priority for main thread or rendering thread, split the rendering thread and the presenting threads, and give the presenting thread a higher priority.
- Process that contains critical presenter thread should be a higher process priority than any regular process on the same machine. At least 1 process priority higher.
- Use High Performance Mode (use API to turn off power management for improved accuracy)
- Use one core for a busysleep thread for near-microsecond accuracy of tearlines. Yes, not kosher to busysleep, but it improves beamrace accuracy.
- Optionally, use Flush() after Present() to improve beamracing sync accuracy. That kills performance, but massively improves precision of tearline beamracing.
- More GPU is used up in the first few scanlines at top (Windows compositing thread), so render in advance (e.g. 1st frameslice of next refresh cycle can be rendered while scanning-out bottom of previous frame)
- If you use D3DKMTGetScanLine() rather than a temporal offset between VSYNC timestamps, put tiny busysleeps between polls, as D3DKMTGetScanLine is an expensive API
- Remember that beam racing is a bit trickier on VRR displays with VRR=ON. You have to do the first Present() to begin the refresh cycle (force GPU to begin outputting scanline #1 immediately on the spot) THEN you beamrace that specific refresh cycle. This is because refresh cycles starts are software-triggered during VRR mode. Make sure you're using VRR+VSYNC OFF, for beamracing VRR refresh cycles.
- If you're trying to steer tearlines between refresh cycles (like RTSS Scanline Sync goal), then try using a large blanking interval for bigger tearline jitter margin between refresh cycles. To understand how to create a larger blanking interval see Quick Frame Transport thread. This can make it easier to hide jittery tearlines between refresh cycles, and/or reduce latency of beam raced tearlines, if you're hitting two birds with one stone with the Quick Frame Transport effect
P.S. Please give credit to Blur Busters if any of my advice has helped you improve your software.
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter

Image
Forum Rules wrote:  1. Rule #1: Be Nice. This is published forum rule #1. Even To Newbies & People You Disagree With!
  2. Please report rule violations If you see a post that violates forum rules, then report the post.
  3. ALWAYS respect indie testers here. See how indies are bootstrapping Blur Busters research!

User avatar
Chief Blur Buster
Site Admin
Posts: 11647
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada
Contact:

Re: Reliable and efficient busy-waiting?

Post by Chief Blur Buster » 15 Dec 2020, 03:44

...Oh, and what is your current use case? I'm curious.
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter


silikone
Posts: 57
Joined: 02 Aug 2014, 12:27

Re: Reliable and efficient busy-waiting?

Post by silikone » 20 Dec 2020, 06:07

Chief Blur Buster wrote:
15 Dec 2020, 03:44
...Oh, and what is your current use case? I'm curious.
Nothing concrete, just experimenting and figuring out what works and how well as a part of a learning experience.
4. You can use high-precision timer events on some machines, but best is to timer until ~1ms prior, then busywait remaining of way.
What would be an example of an optimal way to wait?

User avatar
Chief Blur Buster
Site Admin
Posts: 11647
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada
Contact:

Re: Reliable and efficient busy-waiting?

Post by Chief Blur Buster » 21 Dec 2020, 14:04

silikone wrote:
20 Dec 2020, 06:07
What would be an example of an optimal way to wait?
It varies from system to system, alas.

There are microsecond-accurate timer events on some computers (HPET-derived), but others are limited to roughly 0.5ms or 1.0ms granularity. There are pros and cons to the different timer architectures on different computers, and you can't do one-size-fits-all.

The best one-size-fits-all microsecond-timing method is a busywait loop, and you can use the best precision timer available to timer-event until the last moment, then busywait the rest of the way.
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter


Kaldaien
Posts: 21
Joined: 22 Jan 2020, 21:27

Re: Reliable and efficient busy-waiting?

Post by Kaldaien » 15 Aug 2021, 17:56

silikone wrote:
20 Dec 2020, 06:07
4. You can use high-precision timer events on some machines, but best is to timer until ~1ms prior, then busywait remaining of way.
What would be an example of an optimal way to wait?
On Windows, you can get wait handles from D3DKMT or DXGI for specific events, such as VBLANK interrupt or queued frames < n. Those represent the ideal time to submit an actual frame for presentation, since there will not be any back-pressure from a full render queue if you wait on these.

What you really want to do is flush the command queue prior to waiting, and then wait on one of these sync events to issue the final buffer swap / present. D3DKMT even has a method to get a wait event that is offset in 100 ns increments from the VBLANK, so you can have it signal you in advance of the VBLANK deadline. I would not trust that sort of wait to implement scanline sync without raising thread / process priority, but if you're not trying to re-invent VSYNC then this is hands down the most efficient way possible.

User avatar
Chief Blur Buster
Site Admin
Posts: 11647
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada
Contact:

Re: Reliable and efficient busy-waiting?

Post by Chief Blur Buster » 18 Aug 2021, 16:47

Kaldaien wrote:
15 Aug 2021, 17:56
On Windows, you can get wait handles from D3DKMT or DXGI for specific events, such as VBLANK interrupt or queued frames < n. Those represent the ideal time to submit an actual frame for presentation, since there will not be any back-pressure from a full render queue if you wait on these.

What you really want to do is flush the command queue prior to waiting, and then wait on one of these sync events to issue the final buffer swap / present. D3DKMT even has a method to get a wait event that is offset in 100 ns increments from the VBLANK, so you can have it signal you in advance of the VBLANK deadline. I would not trust that sort of wait to implement scanline sync without raising thread / process priority, but if you're not trying to re-invent VSYNC then this is hands down the most efficient way possible.
Reinventing a low-latency VSYNC ON via precision-timed VSYNC OFF techniques (ala the RTSS Scanline Sync technique) definitely needs this kind of priority handling.

I wrote Tearline Jedi in the pure C# programming language (raised thread priority but no raised process priority), so precision was okay until I began moving a window simultaneously with a raster Kefrens Bars animation (Alcatraz bars) being generated by 8000 frameslices per second of VSYNC OFF:

phpBB [video]


Look at how the rasters glitch when I'm doing other background stuff like moving a window! Although if you want to debug tearline jitter, putting the tearline in the middle of the screen makes this dirt easy:

phpBB [video]


Remember, this is just the plain old C# programming language -- far higher level than the C/C++ or ASM normally used for raster interrupts in the 8-bit days of yore (C64, NES, etc.) -- so things like garbage collection events will cause sudden tearline jitter of a few pixels (I can temporarily disable GC and force a collection during the VBI or some less critical screen region).

Tearline jitter is a great visual debugger of timing precision: putting tearlines temporarily in a visible screen region and watching them jitter gives you a great visual indication of microseconds.

1080p60 is 67KHz scanrate, so a 1/67000sec delay moves a tearline downwards 1 pixel. 1080p360 is 400 kilohertz scan rate, so a 1/400,000sec delay (2.5 microseconds!) moves a tearline down by 1 pixel. It's amazing how microseconds become human visible when you intentionally use visible tearlines.

Once you're confident of your tearline jitter, move the tearline into the VBI, offset before the beginning of the refresh cycle by your known tearline-jitter margin.

It is possible to self-monitor for tearline jitter by self-checking microsecond timing deviations (RDTSC, QueryPerformanceCounter, etc.) relative to a known raster position. (I use crossplatform time-offsets between VSYNCs -- but you can use platform-specific D3DKMTGetScanLine() instead. Remember that API call is incredibly expensive and can take over a hundred microseconds, so you only want to call it occasionally, like once per refresh cycle, to resynchronize your drifting time offsets.)

Interesting Note: If you have VRR enabled, your VTs might jitter slightly (e.g. VT1125 versus VT1126) -- there are ultra-tiny refreshtime jitters in the refresh rate clock whenever VRR is enabled in Windows, even at max Hz. So your refresh cycles may vary by roughly 0.001Hz, and www.testufo.com/refreshrate will stabilize to fewer digits when VRR is enabled than when VRR is disabled. For example, after 15 minutes, the refresh rate calculator may stabilize to 6 decimal digits with VRR disabled, but only 3 decimal digits with VRR enabled. So VRR's fixed-Hz compositor isn't as precise as VRR-disabled, by at least sub-microsecond timescales. So you may want to aim only at the ~1-to-10us error margins, rather than the sub-1us error margins. Also, GPU clocks drift relative to CPU clocks, and the drift varies faster/slower at different temperatures (computer warmup). I can even detect this with RDTSC!

Or even JavaScript's deliberately-degraded now() timer -- it's amazing I can even see this in JavaScript, despite its Meltdown/Spectre mitigations! Though I have to be smart about skipping missed-VSYNC outliers, to filter those from messing up the math of a refresh rate calculator.

Oh, and Battery Saver / Balanced power plans sometimes create errors in timer event precision, especially if HPET or HPET-like timers are automatically disabled in those modes. Even a 1ms sleep (by both CPU and GPU) can cause the next event to be majorly badly timed, so low-complexity capped framerates cause things like RTSS Scanline Sync to severely fail in precision. Revving up the CPU/GPU at least a few milliseconds prior to the precision event helps a huge amount, if necessary (more helpful in Balanced Mode, which is more tolerant of such shenanigans, while Battery Saver may stubbornly stay imprecise on some systems).

So sometimes you need to go to Performance Mode to bypass this. (If you detect you're running in Battery Saver, you may need compensatory algorithms, or to warn that the system is not precise enough for raster-controlled VSYNC algorithms. Also, I think there's an Admin-required API to temporarily switch power plans, IIRC -- I forget which.)

Image

So aim the Right Precision at the Right Job -- like time offsets relative to a clock-corrected beginning of VBI for the current refresh cycle, even if you have to use many refresh cycles to compute a precise cross-platform clock for cross-platform rasters. If sticking to the Windows platform, a once-a-refresh-cycle call (or a few) of D3DKMTGetScanLine() to get the scanline status can help you self-correct any clock-drift errors caused by anything (including Battery Saver).
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter


silikone
Posts: 57
Joined: 02 Aug 2014, 12:27

Re: Reliable and efficient busy-waiting?

Post by silikone » 10 Apr 2022, 19:35

Chief Blur Buster wrote:
18 Aug 2021, 16:47
It's been a while, so I am wondering if the demos have been published yet?

I just used pure C on a single thread to keep the loop as tight as possible, and I didn't think a few API calls could justify having multiple threads, but is it perhaps necessary in order to control priority properly? I'd love to have a good implementation as a reference for my own tinkering.

User avatar
Chief Blur Buster
Site Admin
Posts: 11647
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada
Contact:

Re: Reliable and efficient busy-waiting?

Post by Chief Blur Buster » 14 Apr 2022, 19:26

silikone wrote:
10 Apr 2022, 19:35
Chief Blur Buster wrote:
18 Aug 2021, 16:47
It's been a while, so I am wondering if the demos have been published yet?

I just used pure C on a single thread to keep the loop as tight as possible, and I didn't think a few API calls could justify having multiple threads, but is it perhaps necessary in order to control priority properly? I'd love to have a good implementation as a reference for my own tinkering.
I'm committed to opensourcing Tearline Jedi eventually -- keep tuned.

Blame the pandemic. I wanted to submit it to a demo competition, Assembly '20, in summer 2020. Although the event went online, I also had to refocus on other parts of the business when revenues dropped due to the pandemic.

So it's two years late in opensourcing -- it needs to hit a demo competition before it gets opensourced. Right now, the code is malfunctioning on a MacBook, and I want to make sure the crossplatform behavior still works correctly (world's first cross-platform rasterdemo) before I submit again.
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter


ad8e
Posts: 68
Joined: 18 Sep 2018, 00:29

Re: Reliable and efficient busy-waiting?

Post by ad8e » 10 Jun 2022, 23:57

If you're still working on that, I have a new version of the vblank finder. It's faster and no longer explodes on clock skew. It's too deeply integrated with my project to extract cleanly, but here is a copy:
vsync copy.7z
(2.13 MiB) Downloaded 209 times

Code: Select all

g++ spinsleep.cpp vsync.cpp -ob -std=c++20 -O2 -DNDEBUG -lwinmm -fno-exceptions -fno-rtti -s
Or if you're compiling with Visual Studio, the important bits are to compile spinsleep.cpp and vsync.cpp, and link in winmm.

See main() at the bottom of vsync.cpp for usage.

I think converting it to C# would be boring and pointless, but if you want to, it's much easier now, since the code shrunk to 1/3. A better option is probably to compile it in C++, as a library, and export vf::new_value like this: https://docs.microsoft.com/en-us/cpp/bu ... xecutables. Then you can call it from C#.

License is 0BSD (i.e. do what you want, no credit needed).

Controls are Ctrl+1, Ctrl+2, mouse, Escape.

I nearly lost my post to the forum again, just like both of us did long ago, but I remembered to copy it this time!
