Pre-rendered frames etc. (continued from G-Sync 101 article)

Post by **pegnose** » 29 Dec 2018, 19:20

jorimt wrote:
pegnose wrote:Before I read your article I never knew about the over-queuing of rendered frames. I myself had only worked with simple front/back buffer flips, one per unit. I would be particularly interested in
- whether this is an intended feature
As far as I know, it's exclusive to double buffer V-SYNC, as well as "faked" triple buffer V-SYNC (basically double buffer with a third buffer; no relation to true, traditional triple buffer). G-SYNC is also based on a double buffer, as it is only meant to work within the refresh rate that it adjusts to do its magic.

And no, over-queuing isn't intended; it's simply a limitation of syncing the GPU’s render rate to the fixed refresh rate of the display. Others here can probably break down the "why" for you in more detail, as my explanation would likely come across as more conceptual than technical.

That would be great. From a programmer's standpoint a buffer is an "area" in memory. For it to be used, it has to be allocated and a pointer has to be created. This is necessary for data to be stored there and retrieved as well. This does not happen accidentally. If there are multiple parallel "areas" in memory able to hold multiple successively created frame buffers, this was coded. And therefore it is likely intentional.

jorimt wrote:
pegnose wrote:Would it be valid to say that what pre-rendered frames are for the CPU, piling-up rendered frames are for the GPU? A somewhat hidden queue making sure there is always new data to work with or to present?

So the "safety buffer" on the CPU side - as opposed to the one on the GPU side - is able to always pick the most recent set of data?
As far as I currently understand it myself (anyone is free to chime in with corrections), think of the pre-rendered frames queue as less of a direct input lag modifier (which is actually a peripheral effect of its primary function), and more as a CPU-side throttle for regulating the average framerate.

The less powerful the CPU is in relation to the paired GPU, the larger the queue needs to be in order to keep a steady flow of information being handed off from the CPU to the GPU, which in most cases, equals a lower average framerate.

This is why in instances where the CPU is more of a match to the GPU, you'll find people reporting that a pre-rendered frame queue setting of "1" actually increases their average framerate (if only ever so slightly) when compared to higher queue values, as the higher queue values actually begin to throttle the average framerate (a.k.a slow CPU hand off to the GPU) unnecessarily. Whereas for the weaker systems (where the CPU is less of a match to the GPU), lower values may decrease performance and/or cause more frametime spikes (complete absence of frames for a frame or frames at a time).

So like my article already states, the effects of the "Maximum pre-rendered frames" setting truly "Depends" on the given system, the given game, the given queue setting, and the interaction between the three.

I think I have understood a bit more again. The setting is called _Max_ pre-rendered frames. This queue can be filled if we are not operating in the CPU limit, i.e. if the CPU has free resources. If we _are_ in the CPU limit, it won't get filled and can't protect against frame time spikes. Also, if we are constantly hitting our in-game or RTSS frame limiter, which basically functions by "pausing" the game's relevant CPU threads for a short while, we are effectively in the CPU limit. And therefore we are effectively at a max pre-rendered frames of "0".

Post by **pegnose** » 29 Dec 2018, 19:52

RealNC wrote: The reason this happens is rather simple. In the past, doing the flip would block. Meaning the API function that you call to do the flip would not return until the flip was performed. With vsync, it meant it would block until the next vblank.

Now, this function (usually a present() related function) does not block. It returns immediately and will schedule the submitted frame to be asynchronously presented "later." This creates so-called vsync backpressure, which adds latency. The asynchronous nature of today's graphics APIs allows the game to sample player input early. Too early. When the present() call doesn't block, the game will sample new input, prepare a couple frames to be rendered, then try to render, then try to present, at which point the call will actually block because push came to shove and blocking is the only thing to do, since the flip still hasn't occurred. But now the game is sitting on very old player input, and it's not getting any younger. By the time the frames make it to the screen, they're old and thus with lots of lag.

This is why an FPS cap gets rid of vsync lag. It blocks the game in the same way as flipping did in the past. Thus the game is always being forced to wait and as a result new frames are based on fresh player input.
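To make the mechanism concrete, here is a toy Python simulation. It is not taken from any driver or game; the function name, the 1 ms of work per frame and the queue capacity of 4 are assumptions picked purely for illustration. The game samples input, does its work, and submits the frame; submission only blocks once every slot is taken. The optional cap makes the loop wait before input is read, the way a blocking flip used to.

```python
from collections import deque


def average_input_age_ms(queue_capacity, cap_interval_ms=None,
                         refresh_ms=1000.0 / 60.0, work_ms=1.0, frames=300):
    """Average age of the player input in a frame at the vblank where it is flipped."""
    queue = deque()   # input-sample timestamps of frames waiting for a flip
    ages = []         # input age recorded at each flip
    now = 0.0
    next_vblank = refresh_ms
    next_sample_allowed = 0.0

    def advance_to(t):
        # Move time forward, flipping one queued frame at every vblank we pass.
        nonlocal now, next_vblank
        while next_vblank <= t:
            if queue:
                ages.append(next_vblank - queue.popleft())
            next_vblank += refresh_ms
        now = t

    while len(ages) < frames:
        if cap_interval_ms is not None:
            # Limiter: block *before* input is read.
            advance_to(max(now, next_sample_allowed))
            next_sample_allowed = now + cap_interval_ms
        sampled_at = now                      # game reads player input
        advance_to(now + work_ms)             # CPU and GPU work for this frame
        while len(queue) >= queue_capacity:   # no free slot: only now does it block
            advance_to(next_vblank)
        queue.append(sampled_at)

    steady = ages[len(ages) // 2:]            # ignore the warm-up while the queue fills
    return sum(steady) / len(steady)


uncapped = average_input_age_ms(queue_capacity=4)
capped = average_input_age_ms(queue_capacity=4, cap_interval_ms=1000.0 / 59.0)
print(f"uncapped: {uncapped:.1f} ms, capped just below refresh: {capped:.1f} ms")
```

In this model the uncapped loop fills every slot and then sits blocked on one more frame's worth of input, so input age settles at several refresh intervals. With the cap the queue never holds more than one frame.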

Thank you! This is something I'll have to digest. Let's see: I already came across asynchronous presentation, but I never thought of it in terms of gaming, i.e. continuous frame delivery. When I programmed visual experiments, it was nearly always about single frame presentation (still/static image stimuli).

So the graphics card prepares a frame, initiates the flip (buffer swap), but the function returns immediately after the flip and before the buffer is completely drawn onto the "screen" (actually, probably before the drawing has even begun). Then the graphics card immediately prepares the next frame with user input, position sampling and all. But it can't present this new one right away, because the previous drawing is still in progress. And this is where the lag comes from.

I have two questions:
- you said that after the first flip the graphics card will start preparing a _couple_ of new frames right away; why and how a _couple_? I would expect just one to be prepared now, the next one.
- from your description I can see how a lag of up to 2 frames is created from the very early sampling of user input until the complete new frame has been drawn and actually appeared on screen (after having waited until the previous frame was drawn completely); how is it possible that _2-6_ frames of latency are created as stated in the G-Sync 101 article?

Post by **RealNC** » 29 Dec 2018, 20:47

pegnose wrote:So the graphics card prepares a frame, initiates the flip (buffer swap), but the function returns immediately after the flip

From what I can tell, it returns even before the flip. The flip cannot happen before the vsync. It returns anyway, but will asynchronously flip once vsync occurs. If the game tries to render another frame on the GPU while the flip still hasn't occurred, something will block (because there's no free buffer to render into.) Not sure about the details on that one though. However, between the two calls (the present() that didn't block and the rendering attempt that blocked) the game did lots of stuff; it read input from the player and prepared frames for the pre-render queue (it did the CPU part of the rendering, the results of which are put in the pre-render queue.) So when the API finally decides to block the game's render thread when it tries to do the GPU part of the rendering, because both front and back buffers are in use, the game is forced to sit on pre-rendered frames based on old player input for a while, which causes the lag. Also, the game might have read yet another round of player input before being able to render anything, which will increase lag even further. Some games are not careful to avoid that. This can show up as a 6th frame of lag (see below) even though there's no actual frame anywhere yet; the game read input, tried to render but got blocked.

you said that after the first flip the graphics card will start preparing a _couple_ of new frames right away; why and how a _couple_? I would expect just one to be prepared now, the next one.

The GPU cannot render anything if both buffers are still unavailable (the front buffer is still being scanned out by the display, the back buffer is still waiting for the flip.) What happens is that the game does the CPU part of the rendering. The results of that are put in the pre-render queue for later submission to the GPU.

from your description I can see how a lag of up to 2 frames is created from the very early sampling of user input until the complete new frame has been drawn and actually appeared on screen (after having waited until the previous frame was drawn completely); how is it possible that _2-6_ frames of latency are created as stated in the G-Sync 101 article?

From what I can tell, it can be anywhere between 2 and 5. 1 or 2 rendered frames (1 when doing double buffer, 2 with three buffers) and up to 3 pre-rendered frames. The 6th one that sometimes shows up in latency tests is probably just the amount of time the game is blocked while sitting on frames based on old input, and because the game was not developed with care to not read input too soon. So it read input, didn't do anything with it, and now that input becomes older while it's being blocked by the API.

So with double buffer vsync and the pre-render queue changed to 1, you get between 2 and 3 frames of lag: 2 real ones and up to 1 apparent one when the game is forced to wait for the flip. A frame limiter will shave off a full frame plus part of a second. So about 1.5 frames less lag. At 60Hz this can be a rather big latency reduction (about 25ms.)

If you don't set pre-render to 1 but use the default (I believe it's 3 by default?) then a frame limiter will shave off about 3.5 frames of lag. At 60Hz that's a huge reduction (58ms.)
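The two figures above are plain frame-to-millisecond conversions. A small Python sketch of the arithmetic (the function name is made up for illustration):

```python
def frames_to_ms(frames, refresh_hz):
    """Convert a lag expressed in refresh cycles into milliseconds."""
    return frames * 1000.0 / refresh_hz


# Pre-render queue of 1: a limiter saves roughly 1.5 frames.
print(f"{frames_to_ms(1.5, 60):.1f} ms")  # 25.0 ms
# Pre-render queue of 3: roughly 3.5 frames.
print(f"{frames_to_ms(3.5, 60):.1f} ms")  # 58.3 ms
```

The same number of frames costs less time at higher refresh rates, e.g. `frames_to_ms(3.5, 144)` is about 24 ms.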

Post by **MonarchX** » 29 Dec 2018, 22:31

Therefore setting Max pre-rendered frames to 1 on a G-Sync + V-Sync + FPS limiter would only reduce FPS/performance, but not reduce input lag. Is that correct?

Post by **jorimt** » 29 Dec 2018, 22:53

pegnose wrote: From a programmer's standpoint a buffer is an "area" in memory. For it to be used, it has to be allocated and a pointer has to be created. This is necessary for data to be stored there and retrieved as well. This does not happen accidentally. If there are multiple parallel "areas" in memory able to hold multiple successively created frame buffers, this was coded. And therefore it is likely intentional.

Perhaps; I'm not a game dev/programmer so that's not something I can speak to directly.

What I can say, is IF the over-queuing is intentional with modern double buffer V-SYNC (for FPS above the refresh rate), I would assume it's to avoid the frame dropping method used by some forms of true triple buffer, and thus improve frame pacing...at the expense of heavily increased input lag, of course.

pegnose wrote: I think I have understood a bit more again. The setting is called _Max_ pre-rendered frames. This queue can be filled if we are not operating in the CPU limit, i.e. if the CPU has free resources. If we _are_ in the CPU limit, it won't get filled and can't protect against frame time spikes. Also, if we are constantly hitting our in-game or RTSS frame limiter, which basically functions by "pausing" the game's relevant CPU threads for a short while, we are effectively in the CPU limit. And therefore we are effectively at a max pre-rendered frames of "0".

While I follow the gist of it, the first part of your comment is honestly beyond my purview (and interest) currently, but "yes" to the last bold bit for RTSS, though I'm not certain about that in regards to an in-game limiter...

While RTSS can only limit the frametime, and (as far as I know) only after the CPU has done its work on a frame, most in-game limiters set an average FPS target and do so while the engine is calculating frames. Unlike RTSS, they let frametimes go lower than the FPS limit (much like how some forms of true triple buffer behave above the refresh rate), which is part of why they almost appear to create a "negative" reduction in input lag when compared to external limiters.

But I'm mostly conceptual in my knowledge (enthusiast consumer-side), and, again, mostly a G-SYNC guy, and I haven't tested or studied enough of this specific side of the subject to share much more than I already have at this time.

That said, I hope I cleared a few things up for you.

Post by **RealNC** » 30 Dec 2018, 00:05

MonarchX wrote: Therefore setting Max pre-rendered frames to 1 on a G-Sync + V-Sync + FPS limiter would only reduce FPS/performance, but not reduce input lag. Is that correct?

Performance cannot be reduced when you are reaching your FPS cap, because by definition, you are reaching your FPS cap. In that case, the MPRF setting does not matter and has no effect. The frame limiter is preventing the game from filling the pre-render queue, no matter what size it has been set to.

When the FPS target cannot be reached (the game cannot reach the FPS cap), then MPRF becomes a factor and input lag increases the higher it is set, if the FPS drop is caused by a GPU bottleneck. However, I have not seen a game here where performance is reduced with MPRF 1 in this case. So for me, MPRF 1 is always beneficial at no perf cost. I have seen other people report perf drops in some games, so I cannot say that there never is a perf drop. I just haven't encountered any game yet where that's the case. And it's easy to test anyway. If you set your FPS limiter to 100FPS and you're using MPRF 1, but the game drops to 80FPS in some specific area, then you can raise MPRF, restart the game, go to the same area and see if you still get 80FPS. If yes, then that means MPRF 1 doesn't hurt perf.

If the FPS cap cannot be reached because the CPU is the bottleneck rather than the GPU, then MPRF also doesn't seem to matter, since the GPU outruns the CPU in this case. The pre-render queue never gets full since the GPU processes it faster than the CPU can fill it.
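The three cases can be reproduced with a toy producer/consumer model in Python. This is only a sketch under simplifying assumptions: fixed CPU and GPU times per frame, and a queue slot that frees the moment the GPU picks a frame up. The function name is hypothetical and none of this reflects actual driver behaviour. A frame limiter that is being hit behaves like a slow producer, i.e. like the CPU-bound case.

```python
def queue_wait_ms(cpu_ms, gpu_ms, max_prerendered, frames=200):
    """Time the last simulated frame sits in the pre-render queue before the GPU starts it."""
    enqueue = []  # time frame i enters the pre-render queue
    start = []    # time the GPU picks frame i up
    for i in range(frames):
        ready = (enqueue[-1] if enqueue else 0.0) + cpu_ms
        if i >= max_prerendered:
            # Queue full: the CPU blocks until frame i - max_prerendered has left it.
            ready = max(ready, start[i - max_prerendered])
        gpu_free = start[-1] + gpu_ms if start else 0.0
        enqueue.append(ready)
        start.append(max(ready, gpu_free))
    return start[-1] - enqueue[-1]


print(queue_wait_ms(cpu_ms=2.0, gpu_ms=10.0, max_prerendered=3))  # GPU-bound, deep queue
print(queue_wait_ms(cpu_ms=2.0, gpu_ms=10.0, max_prerendered=1))  # GPU-bound, queue of 1
print(queue_wait_ms(cpu_ms=10.0, gpu_ms=2.0, max_prerendered=3))  # CPU-bound (or capped)
```

In the GPU-bound runs the wait grows with the queue size; in the CPU-bound run the queue stays empty, so the setting makes no difference.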

Post by **pegnose** » 02 Jan 2019, 17:46

@jorimt, @RealNC: Thank you again for your time, effort, and valuable information! Also I apologize for the response delay due to a couple of days vacation. I try to sum up my understanding taken from both of your replies:

- frames piling up does not actually mean rendered frames, but the CPU part of game logic/progression and rendering instructions to the GPU (sampled/created in quick succession after a present() if the CPU has the resources for it)
- this is inherent to double buffering with modern asynchronous frame scheduling; as such no additional frame buffers/GPU memory is needed
- it is actually a feature created by this (intended) asynchronous frame scheduling, partly in conjunction with the graphics pipeline allowing for pre-rendered frames (similarly intended)
- limiting pre-rendered frames can actually limit double-buffer v-sync related frame lag to some extent
- particularly, a setting of max "1" pre-rendered frame can limit the delay to 2-3 frames (1-2 rendered ones, 1 pre-rendered) depending on double vs. triple buffering (instead of 5-6 if NVCP's max pre-rendered frames is set loosely to "4", or if this setting is not respected or defined otherwise by the game)
- at least some external frame limiters (such as RTSS) achieve a state of 0 pre-rendered frames by effectively putting the system into the CPU limit (where pre-rendered frames can't develop due to the lack of resources)

I am not sure I understand the 6th frame correctly, which you mentioned RealNC, as I already arrived at 6 frames with triple buffering in my calculation. But maybe you meant this with respect to double buffering. I also don't understand the exact computation of lag reduction for frame limiters, yet, but I will re-read and -digest this part of your posting. Oh wait, the 0.5 frame shaved off is due to the sampling of game info starting in the following frame interval _and_ there as late as possible? Is this valid only for in-game limiters?

As far as I know, the maximum of pre-rendered frames is "4" and the default setting in NVCP is "Application-controlled".

Finally, maybe I haven't completely understood your remarks regarding the functioning of in-game limiters, jorimt. Or maybe I have (see above). What you mean is that RTSS blocks the relevant CPU threads after the sampling of the next but before the sampling of the next-but-1 "CPU frame", while in-game limiters can wait an appropriate fraction of frame time _before_ the sampling of the next frame data and thus operate similar to true triple buffering, as this data now is a little less old than if it was sampled directly at the beginning of a frame interval?

Thank you guys so much. When I started to dig deeper into this matter, I had not thought it would take so many of my brain cells to get all this stuff.

Post by **RealNC** » 02 Jan 2019, 19:52

pegnose wrote:- at least some external frame limiters (such as RTSS) achieve a state of 0 pre-rendered frames by effectively putting the system into the CPU limit (where pre-rendered frames can't develop due to the lack of resources)

It's just due to blocking. The game thread that does the present() call does not resume execution until the present() hook of RTSS returns. The CPU load you see in RTSS is simply because RTSS uses a busy loop in its present() hook to count time rather than waiting on a system timer (it does that for better frame pacing accuracy.) So even if there's plenty of CPU resources available, the game does not resume because it's blocked by RTSS (or whatever CPU-based limiter you're using), not because there's no CPU resources available.

You sound like you do programming (or did in the past,) so it's really just because RTSS will replace the default present() function that doesn't block, with an implementation that does block. To bring it down to basics, in this simple C code example:

/* 1. */ int c = getchar();
/* 2. */ some_other_function_call();

line 2 will not execute until getchar() has returned. It's a blocking function. present() however is not a blocking function. It returns immediately, so the game will go on to the next game loop iteration and render more frames. RTSS will replace the present() function with an implementation that does block. So the game is stuck there, regardless of whether resources are available or not.
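For a feel of what such a blocking replacement looks like, here is a minimal Python sketch. It is not RTSS's code: the class name is made up, `present` is any callable standing in for the real non-blocking call, and whether the wait happens before or after forwarding to the original is an arbitrary choice here. The one detail taken from the description above is that it spins on a clock rather than sleeping.

```python
import time


class BlockingPresentLimiter:
    """Toy wrapper that turns a non-blocking present() into a blocking one."""

    def __init__(self, present, fps_cap):
        self._present = present              # the original, non-blocking call
        self._interval = 1.0 / fps_cap       # minimum time between returns
        self._deadline = time.perf_counter()

    def __call__(self):
        self._present()                      # forward to the original call
        # Busy-wait: burns CPU, but avoids the coarse wake-up of a sleep timer.
        while time.perf_counter() < self._deadline:
            pass
        self._deadline = time.perf_counter() + self._interval


presented = []
limited_present = BlockingPresentLimiter(lambda: presented.append(time.perf_counter()), fps_cap=200)
for _ in range(5):       # stand-in for five iterations of a game loop
    limited_present()    # the loop cannot run ahead: this call does not return early
```

The caller is held inside the call no matter how idle the CPU is, which is the point made above: the game is blocked, not starved.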

I am not sure I understand the 6th frame correctly, which you mentioned RealNC, as I already arrived at 6 frames with triple buffering in my calculation. But maybe you meant this with respect to double buffering. I also don't understand the exact computation of lag reduction for frame limiters, yet, but I will re-read and -digest this part of your posting. Oh wait, the 0.5 frame shaved off is due to the sampling of game info starting in the following frame interval _and_ there as late as possible? Is this valid only for in-game limiters?

Assuming a frame buffer queue size of 3 (what games call "triple buffer vsync") and a pre-render queue of 3 (I think that's the default if the game doesn't use a lower value), then you end up with 2 rendered frames that are waiting and 3 that have been pre-rendered and are waiting to get sent to the GPU. That's 5. Note that this is in addition to the already existing input lag, which is up to 1 frame (it's the frame that is currently being scanned out.) That's the input lag you can't get rid of because the monitor isn't infinitely fast. At 60Hz it needs 16.7ms to scan out a frame. That means input lag at the top of the screen is 0ms and at the bottom it's 16.7ms. So on average, it's 8.3ms. And, if the game samples user input too early, add to that up to 1 more additional frame of latency. This last one depends on how fast the CPU is. The faster the CPU, the earlier it will sample input and thus the longer it will stay in the blocked state, waiting to start pre-rendering a frame from the sampled input. So let's say on average that adds another half frame (8.3ms) worth of input lag. So in this scenario, you'd see 6 frames total latency.
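Added up as a small Python sketch (the function and parameter names are invented for illustration; the half-frame averages for scanout and early input sampling are the rough estimates from the paragraph above, not measurements):

```python
def worst_case_vsync_lag(frame_buffers=3, pre_render_queue=3, refresh_hz=60.0,
                         scanout_frames=0.5, early_input_frames=0.5):
    """Return (frames, milliseconds) of estimated total input lag under vsync."""
    waiting_rendered = frame_buffers - 1   # all buffers except the one being scanned out
    total_frames = (waiting_rendered + pre_render_queue
                    + scanout_frames + early_input_frames)
    return total_frames, total_frames * 1000.0 / refresh_hz


print(worst_case_vsync_lag())                                    # (6.0, 100.0)
print(worst_case_vsync_lag(frame_buffers=2, pre_render_queue=1)) # (3.0, 50.0)
```

The second call is the double buffer, pre-render 1 case, which lands at the upper end of the 2 to 3 frames mentioned earlier in the thread.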

However, not all games are implemented in the same way. Some games might have a lower latency, because they are careful to not run ahead of the GPU too much. You can find some games with in-game options like "reduced buffering" or "synchronous rendering" or "1 frame thread lag." These games can reduce total lag by 1 or 2 frames.

On the other hand, some games can have even worse lag than "just" 6 frames. They might do internal buffering of pre-render commands (it's a form of "buffer bloat,") have sub-optimal threading with excessive waiting on threads, or have bugs. The Bethesda RPGs seem to be in that category.

And finally, multi-threading in general will usually also add some input lag, even with good multi-threaded code. Well implemented multi-threading will not add too much lag though. But still.

Anyway, 6 frames worth of total input lag is generally what you can expect to be the worst case scenario with vsync. With RTSS, you can shave off 1 frame buffer (not sure if it gets rid of the second frame buffer in triple buffer vsync,) and the 3 pre-render buffers. So you can go from 6 frames of lag to 2 frames of lag. An internal limiter can potentially also prevent the game from sampling the next user input too early, and also eliminate threading lag (an in-game limiter has access to the game's threads and can stop/resume them, RTSS can not.)

As far as I know, the maximum of pre-rendered frames is "4" and the default setting in NVCP is "Application-controlled".

The highest value you can use in NVidia Profile Inspector is 8. Maybe the driver allows even higher values, but there's no point in having an option for that. Even 8 is pretty much useless. Games become unplayable.

Not sure what the default ("app controlled") is when a game doesn't specify a pre-render queue size. I've heard it's 3, and that's decided by DirectX. But not sure. It could be 2. Or it might depend on the number of cores in the CPU. The whole pre-render buffer mechanism and the switch to asynchronous frame buffer flipping was done because mainstream CPUs started to have more than one core in the 2000s.

And GPUs also started supporting parallelism. Modern GPUs can actually work on multiple frames at once. Not all parts of the GPU are needed at the same time while rendering a frame. So if the GPU would only render one frame at a time, some parts of it would be sitting there doing nothing (this is known as the "cold silicon" problem.) So a GPU can render more than one frame at a time. However, I'm not familiar with how this works exactly. I assume it can grab a new to-be-rendered-frame from the pre-render queue before the current frame has been completely rendered, but when and how... no idea.

So parallelism does need pre-render queues and an asynchronous frame presentation mechanism. With vsync OFF, or VRR like g-sync/freesync, or non-queued triple buffering like fast sync/enhanced sync, this is fine. It's what allows the kind of performance we get today in games. But with vsync, you get buffer bloat, aka "vsync backpressure." You have all these buffers that need to go through this small bottleneck called vsync, so they just pile up, waiting to go through one by one. VRR really needed to happen at some point. Just plain old vsync isn't really suitable anymore like it was in the old days.

Finally, maybe I haven't completely understood your remarks regarding the functioning of in-game limiters, jorimt. Or maybe I have (see above). What you mean is that RTSS blocks the relevant CPU threads after the sampling of the next but before the sampling of the next-but-1 "CPU frame", while in-game limiters can wait an appropriate fraction of frame time _before_ the sampling of the next frame data and thus operate similar to true triple buffering, as this data now is a little less old than if it was sampled directly at the beginning of a frame interval?

Well, an in-game limiter is part of the game. It has access to everything the game does. It can thus choose to not read player input unless it knows that pre-rendering is possible. When using an external limiter, this isn't possible. The game will usually read player input as usual, and that input is then older by the time RTSS has returned from its present() hook. Note that this would also happen without RTSS; the input would have become older due to the vsync backpressure in that case. So it's not like RTSS added this source of input lag. It just didn't eliminate it. An in-game limiter eliminates it. (This is pretty much the reason you see people claim "RTSS adds input lag, in-game limiters do not." That's not the case. What's happening is that an in-game limiter can eliminate a source of input lag that RTSS can not. It doesn't mean that RTSS adds input lag.)

An in-game limiter can also do what jorimt said; wait a bit more than usual before reading input, depending on how long the previous frame took to render. Whether it does so or not, we don't know. I don't think that this kind of predictive frame limiting is common in games though. If done, it can provide input lag that is very close to vsync OFF. But I'm not seeing that in in-game limiters, except in some rare cases, like some open source implementations of Quake.
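As a rough idea of what such predictive limiting could look like, here is a Python sketch. It is purely hypothetical: the function name, the "slowest recent frame" predictor and the 1 ms safety margin are assumptions, not how any particular engine does it. The idea is to spend the spare part of the frame interval waiting before input is read rather than after.

```python
def pre_sample_delay_ms(target_frame_ms, recent_render_ms, safety_margin_ms=1.0):
    """How long to wait before reading input so the frame still finishes on time."""
    predicted = max(recent_render_ms)   # pessimistic guess: the slowest recent frame
    return max(0.0, target_frame_ms - predicted - safety_margin_ms)


# 16 ms frame budget, recent frames took 4 to 6 ms: wait 9 ms, then sample input.
print(pre_sample_delay_ms(16.0, [4.0, 5.0, 6.0]))   # 9.0
# If a recent frame already blew the budget, do not wait at all.
print(pre_sample_delay_ms(16.0, [4.0, 18.0]))       # 0.0
```

A bad prediction means a missed frame, which is presumably part of why this approach is rare.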

Post by **Chief Blur Buster** » 03 Jan 2019, 21:01

I've created the equivalent of "symbolic links" to this thread from both the Input Lag and Programming forums, because this thread spans multiple topics.

Great programmer talk here too.

Now I'm going to open the famous Blur Busters Latency Pandora Box.... latency gradients. Scanout latencies and how they're affected by the various sync technologies.

An additional factor to understand is a GPU output serializes a 2D framebuffer into a 1D transmission, in a raster fashion, left-to-right, top-to-bottom, as the standard scanout direction over the last literally ~100 years (whether a 1930s analog television broadcast or a 2020s DisplayPort cable).

Now visualize this as a scanout diagram, here's the famous Blur Busters time-based diagrams:

So the button-to-pixels input lag of a framebuffer is actually a latency gradient.

During VSYNC ON and GSYNC, the latency gradient is full framebuffer height at the scanout velocity (during VRR, scanout velocity is always max Hz, e.g. 40Hz on a 240Hz GSYNC is always 1/240sec scanout).
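The gradient itself is simple to put into numbers. A minimal Python sketch (function name invented; it ignores the blanking interval and any processing inside the monitor):

```python
def scanout_lag_ms(row_fraction, scanout_hz):
    """Extra lag of a pixel row relative to the top edge of the frame.

    row_fraction: 0.0 is the top row, 1.0 the bottom row.
    scanout_hz:   scanout velocity; on VRR this is the panel's maximum Hz,
                  not the current frame rate.
    """
    return row_fraction * 1000.0 / scanout_hz


print(f"{scanout_lag_ms(1.0, 60):.1f} ms")    # bottom of a 60Hz VSYNC ON frame: 16.7 ms
print(f"{scanout_lag_ms(0.5, 60):.1f} ms")    # screen centre at 60Hz: 8.3 ms
print(f"{scanout_lag_ms(1.0, 240):.1f} ms")   # bottom row on a 240Hz VRR panel, even at 40fps: 4.2 ms
```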

During VSYNC OFF, each frameslice is an independent latency gradient. Three frameslices per refresh cycle (e.g. 180fps at 60Hz) means each frameslice is roughly 1/3 of the screen height (taking into account the VBI size between refresh cycles).

VSYNC ON = Present() blocks for next refresh interval (if the frame queue, if any is used, is full)
VSYNC OFF = Present() nonblocking and splices realtime into scanout
VRR = Present() controls the timing of the refresh cycle (the monitor begins refreshing the moment the software Present()s)

Symbolic scanout diagrams help software developers understand how deterministic the latency from Present() to photons is.

Sure, other factors abound in the latency chain. There's often a fixed absolute lag, GtG lag, queued framebuffer lag, monitor processing lag, etc. (and some monitors are virtually lagless, e.g. many eSports TN monitors which display pixels essentially in realtime off the port), but we're omitting this and focusing on latency from the API to the graphics port. Still, these symbolic diagrams will be a big help to software developers who want to understand area-related latencies better.

Present() essentially "splices" into the existing scanout during VSYNC OFF
(Think of scanout = as the act of serializing a 2D frame buffer out of the 1D graphics port transmission)

240fps at 60Hz means all pixels output on the GPU port have no more than about 4ms of lag (maximum). The top edge of each frameslice has the least lag (being the first pixels to display), and the bottom edge of each frameslice has the most lag (being the last pixels to display).

Which means latency is more uniform for the whole screen plane during VSYNC OFF, unlike for VSYNC ON
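In numbers, as a small Python sketch (function name invented; it assumes perfectly even frame pacing and ignores the blanking interval):

```python
def vsync_off_frameslices(fps, refresh_hz):
    """Return (frameslices per refresh cycle, maximum lag within one slice in ms)."""
    slices_per_refresh = fps / refresh_hz
    max_slice_lag_ms = 1000.0 / fps   # a slice is on the wire for one frame time
    return slices_per_refresh, max_slice_lag_ms


slices, lag = vsync_off_frameslices(240, 60)
print(f"{slices:.0f} slices per refresh, at most {lag:.2f} ms per slice")  # 4 slices, 4.17 ms

slices, lag = vsync_off_frameslices(180, 60)
print(f"{slices:.0f} slices per refresh, at most {lag:.2f} ms per slice")  # 3 slices, 5.56 ms
```

Compare that with a single full-height gradient of 16.7 ms for VSYNC ON at the same 60Hz.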

Also, those who are familiar with the Leo Bodnar lag tester, top/center/bottom have increasing amounts of lag. Leo Bodnar is a 1080p 60Hz VSYNC ON lag tester. However, VSYNC OFF breaks the scanout latency barrier to make sub-frame scanout latencies possible, at the penalty of tearlines. The higher the framerate, the smaller the latency gradients become, and the aiming becomes more predictable/smoother (e.g. ultra-high-framerate CS:GO).

This is why game developers must pay attention to framepacing, and make sure that the gametimes are in sync with frametimes, to prevent stutters and erratic latencies. Sub-frame millisecond errors are still hugely visible as annoying stutter & annoying latency jittering. I've met some 60fps @ 60Hz games that felt like they had random internal latency jitter as big as 15 milliseconds (nearly one refresh cycle) with gametimes badly out of sync with frametimes. BAD, BAD! Don't do this, game developers, please.

The easiest method is to simply keep gametimes in sync with Present() times, though fluctuating frametimes can make this an imperfect algorithm. The gametime at the center of a frametime can sometimes be a more ideal metric (the exact middle of variable-height frameslices from variable frametimes) for perfect latency averaging, but this requires predicting rendertimes. So this "ideal latency approach" is almost never done, and gametimes are just synchronized to the top edges of frameslices, which is "Good Enough", especially for consistent framerates.

Present() triggers the refresh cycles on VRR monitors:

(VRR -- including G-SYNC and FreeSync -- is essentially variable-sized blanking intervals that temporally space out the dynamic/asynchronous refresh cycles, which are triggered by software timing. As long as the Present() interval is within VRR range, the display refresh cycle timings are always software-triggered on a VRR monitor.)

Now stutters can still show up in VRR if gametime intervals grossly diverge away from frame rendering times, e.g. very erratic frame rendering times, e.g. one frame is 1/40sec render and next frame is 1/200sec render. This is often because the refresh display time is based on the PREVIOUS frame render (the frame fully delivered to the monitor). If that PREVIOUS frame render was a very fast render, but the next frame render is very long, that "fast rendered frame" will be the displayed refresh cycle for a much longer duration (because the next frame -- a slow frame -- is still rendering). So your rendertime-displaytime is more out of sync. Stutters start showing through more again during VRR, as VRR is not perfect at eliminating every single stutter.

For VRR, best stutter elimination occurs when refreshtime (the time the photons are hitting eyeballs) is exactly in sync with the time taken to render that particular frame. But VRR makes the current refreshtime equal to the timing of the end of the previous frametime (off by one), so very bad stutters will still show through if frametimes grossly diverge from refreshtimes (because it's 1-frame-trailing).
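The one-frame-trailing effect can be shown with a few lines of Python. This is a simplified sketch with an invented function name: it assumes every frame is presented the moment it finishes rendering and that all intervals stay inside the VRR range.

```python
def vrr_display_times(render_ms):
    """Pair each frame's render time with how long it actually stays on screen.

    A frame is shown until the next one arrives, so its on-screen time equals
    the render time of the *following* frame.
    """
    shown_ms = render_ms[1:]
    return [(rendered, shown, shown - rendered)
            for rendered, shown in zip(render_ms, shown_ms)]


# Alternating 1/40 s and 1/200 s renders, as in the example above.
for rendered, shown, error in vrr_display_times([25.0, 5.0, 25.0, 5.0]):
    print(f"rendered in {rendered:4.1f} ms, shown for {shown:4.1f} ms, mismatch {error:+.1f} ms")
```

With steady frame times the mismatch column is all zeros, which is why VRR looks smooth as long as consecutive frames take similar time to render.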

The moral of the story is very accurate gametimes are a must in modern engine programming where the render is running off the gametimes, in the era of erratic intervals between renders, and the need to not contribute additional microstuttering (= same thing as latency jittering) where unnecessary.

The worse the microstutter from errors, the more latency jitter there is, and it's harder to do twitch aims during bad latency jitter (Even at the same framerate). Because of bad programming (microstutter = latency jittering), I've seen worse aiming even at 80fps than a well-optimized engine running at 50fps.

This concludes another part of the Blur Busters Latency Pandora Box series. I do apologize for opening this Pandora's Box, but this topic is rather interesting!

Enjoy.

pegnose · Post by **pegnose** » 06 Jan 2019, 16:16

RealNC wrote: It's just due to blocking. The game thread that does the present() call does not resume execution until the present() hook of RTSS returns. The CPU load you see in RTSS is simply because RTSS uses a busy loop in its present() hook to count time rather than waiting on a system timer (it does that for better frame pacing accuracy.) So even if there's plenty of CPU resources available, the game does not resume because it's blocked by RTSS (or whatever CPU-based limiter you're using), not because there's no CPU resources available.

You sound like you do programming (or did in the past,) so it's really just because RTSS will replace the default present() function that doesn't block, with an implementation that does block. To bring it down to basics, in this simple C code example:
Code:
/* 1. */ int c = getchar();
/* 2. */ some_other_function_call();
line 2 will not execute until getchar() has returned. It's a blocking function. present() however is not a blocking function. It returns immediately, so the game will go on to the next game loop iteration and render more frames. RTSS will replace the present() function with an implementation that does block. So the game is stuck there, regardless of whether resources are available or not.
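The same idea can be sketched in Python. This is only an illustration of a blocking wrapper around a non-blocking call, not RTSS's actual code: `original_present`, `make_limited_present` and `run_game_loop` are hypothetical names, and the busy loop stands in for the one described above.

```python
import time


def original_present():
    """Stand-in for the game's real present(); here it returns immediately."""
    return None


def make_limited_present(present, target_frame_time_s):
    """Wrap present() so that it blocks until the frame-time budget is used up."""
    last = [time.perf_counter()]

    def limited_present():
        deadline = last[0] + target_frame_time_s
        # Busy loop: burns CPU time, but avoids the coarse wake-up
        # granularity of a sleep() on a system timer.
        while time.perf_counter() < deadline:
            pass
        last[0] = time.perf_counter()
        return present()

    return limited_present


def run_game_loop(present, frames):
    """Call present() `frames` times and return the elapsed time in seconds."""
    start = time.perf_counter()
    for _ in range(frames):
        # (sampling input, simulating and issuing draw calls would go here)
        present()
    return time.perf_counter() - start


unlimited = run_game_loop(original_present, 20)
limited = run_game_loop(make_limited_present(original_present, 0.005), 20)
print(f"unlimited: {unlimited * 1000:.2f} ms, limited: {limited * 1000:.2f} ms")
```

The game loop itself is unchanged in both runs. It simply cannot start the next iteration until the wrapped call returns, no matter how idle the CPU is otherwise.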

Yes, I am programming. I did a little C++, a lot of Python, and a lot of Matlab with the PsychToolbox for good timing in visual scientific experiments (presenting stimuli for a defined number of frames, with a defined inter-stimulus interval, deriving accurate time stamps for post-experimental analyses and so on; also checking on timing and presentation lag/marker timing by means of photo LED or high-speed camera and such stuff). Using blocking vs. non-blocking functions was an important part of that work.

So RTSS replaces the present() function with one that is able to wait until the given frame time requirement is fulfilled. Effectively the system is CPU limited (hence 0 pre-rendered frames), even if actually not much work is done.

RealNC wrote: Assuming a frame buffer queue size of 3 (what games call "triple buffer vsync") and a pre-render queue of 3 (I think that's the default if the game doesn't use a lower value), then you end up with 2 rendered frames that are waiting and 3 that have been pre-rendered and are waiting to get sent to the GPU. That's 5. Note that this is in addition to the already existing input lag, which is up to 1 frame (it's the frame that is currently being scanned out.) That's the input lag you can't get rid of because the monitor isn't infinitely fast. At 60Hz it needs 16.7ms to scan out a frame. That means input lag at the top of the screen is 0ms and at the bottom it's 16.7ms. So on average, it's 8.3ms. And, if the game samples user input too early, add to that up to 1 more additional frame of latency. This last one depends on how fast the CPU is. The faster the CPU, the earlier it will sample input and thus the longer it will stay in the blocked state, waiting to start pre-rendering a frame from the sampled input. So let's say on average that adds another half frame (8.3ms) worth of input lag. So in this scenario, you'd see 6 frames total latency.
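The arithmetic in that paragraph can be written out as a small Python sketch. The function and parameter names are made up here; the numbers are the ones from the post (2 waiting rendered frames, 3 pre-rendered frames, half a frame of scanout on average, half a frame of early input sampling on average).

```python
def frame_time_ms(refresh_hz):
    """Duration of one refresh cycle in milliseconds."""
    return 1000.0 / refresh_hz


def worst_case_vsync_lag_frames(waiting_rendered=2, pre_rendered=3,
                                scanout_avg=0.5, early_input_avg=0.5):
    """Sum the latency contributions, each expressed in frames."""
    return waiting_rendered + pre_rendered + scanout_avg + early_input_avg


frames = worst_case_vsync_lag_frames()
print(frames)                                # 6.0
print(round(frames * frame_time_ms(60), 1))  # 100.0 (ms at 60Hz)
```

Setting `pre_rendered=0` and `waiting_rendered=1` in the same function shows what remains once the queues are drained.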

I see, thank you!

RealNC wrote: However, not all games are implemented in the same way. Some games might have a lower latency, because they are careful to not run ahead of the GPU too much. You can find some games with in-game options like "reduced buffering" or "synchronous rendering" or "1 frame thread lag." These games can reduce total lag by 1 or 2 frames.

Right, like Overwatch, e.g.

RealNC wrote: On the other hand, some games can have even worse lag than "just" 6 frames. They might do internal buffering of pre-render commands (it's a form of "buffer bloat,") have sub-optimal threading with excessive waiting on threads, or have bugs. The Bethesda RPGs seem to be in that category.

Ah. And they are horrible in many other ways (although often fun to play). At least they managed to get rid of movement speed being coupled to FPS a short time after the release of Fallout 76.

RealNC wrote: And finally, multi-threading in general will usually also add some input lag, even with good multi-threaded code. Well implemented multi-threading will not add too much lag though. But still.

Interesting, never thought of that.

RealNC wrote: Anyway, 6 frames worth of total input lag is generally what you can expect to be the worst case scenario with vsync. With RTSS, you can shave off 1 frame buffer (not sure if it gets rid of the second frame buffer in triple buffer vsync,) and the 3 pre-render buffers. So you can go from 6 frames of lag to 2 frames of lag. An internal limiter can potentially also prevent the game from sampling the next user input too early, and also eliminate threading lag (an in-game limiter has access to the game's threads and can stop/resume them, RTSS can not.)

Nice. 6 Frames would be the literal gaming hell for me, particularly in first-person titles. Adding my display lag of ~20 ms to that I would be at ~60 ms even if operating near the upper limit of my 166Hz monitor.

RealNC wrote: The highest value you can use in NVidia Profile Inspector is 8. Maybe the driver allows even higher values, but there's no point in having an option for that. Even 8 is pretty much useless. Games become unplayable.

Not sure what the default ("app controlled") is when a game doesn't specify a pre-render queue size. I've heard it's 3, and that's decided by DirectX. But not sure. It could be 2. Or it might depend on the number of cores in the CPU. The whole pre-render buffer mechanism and the switch to asynchronous frame buffer flipping was done because mainstream CPUs started to have more than one core in the 2000s.

Makes sense. Wow, 8!

RealNC wrote: And GPUs also started supporting parallelism. Modern GPUs can actually work on multiple frames at once. Not all parts of the GPU are needed at the same time while rendering a frame. So if the GPU would only render one frame at a time, some parts of it would be sitting there doing nothing (this is known as the "cold silicon" problem.) So a GPU can render more than one frame at a time. However, I'm not familiar with how this works exactly. I assume it can grab a new to-be-rendered-frame from the pre-render queue before the current frame has been completely rendered, but when and how... no idea.

So parallelism does need pre-render queues and an asynchronous frame presentation mechanism. With vsync OFF, or VRR like g-sync/freesync, or non-queued triple buffering like fast sync/enhanced sync, this is fine. It's what allows the kind of performance we get today in games. With vsync, though, you get buffer bloat, aka "vsync backpressure." You have all these buffers that need to go through this small bottleneck called vsync, so they just pile up, waiting to go through one by one. VRR really needed to happen at some point. Just plain old vsync isn't really suitable anymore like it was in the old days.
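The pile-up can be shown with a toy queue simulation in Python. This is a deliberately simplified model with assumed names and parameters (`simulate_vsync_queue`, `frames_per_refresh`), not how any driver is implemented: the game can produce several frames per refresh cycle until the queue is full, and vsync lets exactly one frame through per refresh.

```python
from collections import deque


def simulate_vsync_queue(queue_depth, refreshes, frames_per_refresh=3):
    """Return, per displayed frame, how many refresh cycles it waited in the queue."""
    queue = deque()
    waits = []
    for refresh in range(refreshes):
        # The game runs ahead, but only until the queue is full (then it blocks).
        for _ in range(frames_per_refresh):
            if len(queue) < queue_depth:
                queue.append(refresh)  # remember when this frame was produced
        # Vsync lets exactly one frame through per refresh cycle.
        produced_at = queue.popleft()
        waits.append(refresh - produced_at)
    return waits


print(simulate_vsync_queue(queue_depth=5, refreshes=8))  # [0, 1, 2, 2, 3, 4, 4, 4]
print(simulate_vsync_queue(queue_depth=1, refreshes=8))  # [0, 0, 0, 0, 0, 0, 0, 0]
```

In this model the steady-state wait settles at `queue_depth - 1` refresh cycles, which is why draining the queues (via a frame limiter or a smaller pre-render setting) cuts the lag.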

I feel truly honored that you take the time to explain all this to me. Seriously, thank you! I learn so much new stuff, it's amazing. I never heard of this form of parallelism!

From the previous discussion I understood that VRR with V-Sync enabled in NVCP has the same issues as mere V-Sync, correct? Only if you disable V-Sync with VRR, and accept some minor tearing here and there, can you get rid of all the downsides?

RealNC wrote: Well, an in-game limiter is part of the game. It has access to everything the game does. It can thus choose to not read player input unless it knows that pre-rendering is possible. When using an external limiter, this isn't possible. The game will usually read player input as usual, and that input is then older by the time RTSS has returned from its present() hook. Note that this would also happen without RTSS; the input would have become older due to the vsync backpressure in that case. So it's not like RTSS added this source of input lag. It just didn't eliminate it. An in-game limiter eliminates it. (This is pretty much the reason you see people claim "RTSS adds input lag, in-game limiters do not." That's not the case. What's happening is that an in-game limiter can eliminate a source of input lag that RTSS can not. It doesn't mean that RTSS adds input lag.)

Yes, that is clear to me now. So you could say that in-game limiters achieve "-0.5" pre-rendered frames?

RealNC wrote: An in-game limiter can also do what jorimt said; wait a bit more than usual before reading input, depending on how long the previous frame took to render. Whether it does so or not, we don't know. I don't think that this kind of predictive frame limiting is common in games though. If done, it can provide input lag that is very close to vsync OFF. But I'm not seeing that in in-game limiters, except in some rare cases, like some open source implementations of Quake.
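A rough Python sketch of what such predictive limiting could look like follows. It is an assumption-laden illustration, not taken from any actual game: it uses the previous frame's render time as the prediction for the next one, and `input_sample_delay_ms` and `safety_margin_ms` are hypothetical names.

```python
def input_sample_delay_ms(frame_budget_ms, previous_render_ms, safety_margin_ms=1.0):
    """How long to wait at the start of a frame before reading input.

    The previous render time is used as the prediction for the next one.
    Waiting that long leaves just enough time (plus a margin) to render
    before the frame is due, so the sampled input is as fresh as possible.
    """
    delay = frame_budget_ms - previous_render_ms - safety_margin_ms
    return max(0.0, delay)  # never negative: a slow frame means no extra waiting


print(input_sample_delay_ms(16.0, 4.0))   # 11.0
print(input_sample_delay_ms(16.0, 20.0))  # 0.0
```

The safety margin is the trade-off: too small and a slightly slower frame misses its deadline, too large and part of the latency benefit is given away.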

Hm, now I thought what you described in this last paragraph was what you already described in the one before. Obviously not.
