Blur Busters Forums

Posted: **16 Mar 2021, 12:11**

So nvidia is giving us 3 tools to get rid of queued frames:

Low latency mode in NVCPL
Reflex in game options
Maximum Pre-rendered frames in NVinspector

Cool, I tested all of them and none work. Maybe my method is flawed, tell me what I'm doing wrong please.

I'm running the FurMark stress test to get GPU usage to 99-100%. Then I start a game (Kovaak) with various settings (Reflex on/off, Low Latency Mode Ultra, etc.).
I benchmark with Frameview for a few minutes, then check the log for "Render Queue Depth".
Low latency mode on/off/ultra all give maximum render queue depth of 3+
Reflex even managed 5+

Again, maybe my method is bad, maybe render queue depth is not what I think it is. LMK if you have any info on that.

Posted: **16 Mar 2021, 12:38**

I saw people post that setting max pre-rendered frames to 1 gives them a huge noticeable advantage, but tbh I don't see any difference between 8 and 1.

Posted: **16 Mar 2021, 21:00**

chenifa wrote: ↑
16 Mar 2021, 12:11
So nvidia is giving us 3 tools to get rid of queued frames:

Low latency mode in NVCPL
Reflex in game options
Maximum Pre-rendered frames in NVinspector

Cool, I tested all of them and none work. Maybe my method is flawed, tell me what I'm doing wrong please.

I'm running the FurMark stress test to get GPU usage to 99-100%. Then I start a game (Kovaak) with various settings (Reflex on/off, Low Latency Mode Ultra, etc.).
I benchmark with Frameview for a few minutes, then check the log for "Render Queue Depth".
Low latency mode on/off/ultra all give maximum render queue depth of 3+
Reflex even managed 5+

Again, maybe my method is bad, maybe render queue depth is not what I think it is. LMK if you have any info on that.

Maybe a glitch in reporting. Can't you tell any difference by feel?

First, Maximum Pre-rendered Frames is the same thing as Low Latency Mode in NVCP. It was renamed to the latter.

- off should be like 3-4 (no idea, don't use it except in SP)
- on should be 1 pre-rendered frame (it is best in 90% of scenarios)
- Ultra won't actually lower pre-rendered frames any further, because the GPU needs at least 1 pre-rendered frame from the CPU to work on. It will reduce driver latency, however. Don't ask me the specifics of how this works! But in practice it only works if your GPU usage is at 99% (this was tested with a 1000fps camera); otherwise it will give you higher input lag than On, because the GPU has nothing to do! I tried this in a game where my GPU usage is low and Ultra felt worse than On! Maybe Ultra would work well at something like 80-99% (dunno, didn't test it)!

To add: each pre-rendered frame will increase input lag by about 1 frame time. At 60 fps the frame time is 16.67ms, so 2 pre-rendered frames would add about 33ms of input lag!
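A rough sketch of that arithmetic, assuming the simplification used here that every queued frame costs one full frame time (the function name is only for illustration):

```python
def queue_latency_ms(fps: float, queued_frames: int) -> float:
    """Approximate added input lag if each queued frame costs one frame time."""
    frame_time_ms = 1000.0 / fps
    return queued_frames * frame_time_ms


# Print the estimate for a few queue depths at 60 fps.
for frames in (1, 2, 3):
    print(f"{frames} queued at 60 fps: ~{queue_latency_ms(60, frames):.1f} ms")
```

Real pipelines will not match this exactly, since a queued frame is not always waited on for a full frame time, but it gives the order of magnitude.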

You should be able to tell between these 3 options simply by feel! But some people who don't play that much aren't so sensitive! It also depends on DPI. I've played on 800 DPI the longest, so I can tell differences in input lag quite easily. On 400 DPI/1000Hz I was able to tell even a 6ms difference in input lag in the Blur Busters AB test!

Don't know about Reflex, I don't currently play any games that support it! But I heard it reduces input lag by as much as 33%, and even if it reduces it by less, you should be able to tell whether it works or not, as it's a pretty major technology! Reflex is probably not a scam. Ultra was more about marketing; NVIDIA just tried to push out something similar to what AMD had... The game has to support Reflex so that it can sync with the driver to reduce latency. Up to now, games and drivers haven't been optimized for latency but for pushing the most fps... People lower details to the lowest settings because otherwise there is noticeable, gigantic latency! I played CS:GO at 1024x768 lowest, otherwise I couldn't kill anything. Even 1280x1024 felt laggy to me!

Posted: **16 Mar 2021, 21:12**

RTSS Scanline Sync is another method to get rid of pre-rendered frames too.

_____

Now my commentary:

Also, if you're running multiple 3D apps, like Furmark AND Kovaak simultaneously, utilities such as Frameview may be counting the aggregate total of prerendered frames in both software packages. Also, some games will always generate prerendered frames internally regardless of the software setting. There are some major GPU pipelining inefficiencies that emerge when running two separate heavy 3D rendering apps on the same GPU where they are now forced to share GPU memory, and may trigger new latency behaviors not normally done by a single GPU.

The best way to benchmark is one application at a time, never two simultaneously, and to max out the GPU via VSYNC OFF and low-CPU settings. For example, use VSYNC OFF, maximum resolution, and maximum graphics detail to put as much load as possible on the GPU instead of the CPU.

Now, which games are you playing that reach 100% GPU in real world non-synthetic gameplay? If you have real world games that never max out 100% GPU, then one asks oneself: Is there a purpose to benchmarking a synthetic situation? If you're not getting latency of a 100%-maxed-out-GPU, isn't that good? Then why does one need to worry about the game's own latency during a 100%-maxed-out GPU situation that does not happen with games such as CS:GO anyway? The unused GPU % headroom is very healthy for latency anyway -- even 5% is good when latency is numero uno (esports). CS:GO is one of those older-engine games that is currently CPU-limited, so the GPU tends not to hit 100% in that game on modern GPUs.

There are legitimate needs to do synthetic benchmarks, but I would like to know the rationale in this specific situation -- like a specific latency-important game that is at 100% GPU in your real world play cases? Usually when a GPU is hitting 100%, it's during frame rate dips due to super complex scenery (think Cyberpunk 2077 league graphics) rather than other things like network accesses or disk accesses. Such frame rate dips aren't bottlenecked by sync technology or display technology (little VSYNC waiting), so most GPU latency is render latency and not frame queue latency -- in fact during VSYNC OFF, the frame will usually splice rather quickly (within less than a millisecond) at the current raster position of the display scanout (creating a tearline on that spot). In that situation, any software-reported frame queue numbers are usually synthetic/artificial and not representative of real-world button-to-pixel latencies at that particular moment for that particular tearline. They may just be a preallocation of extra buffers "just in case" that aren't actually used, so a frame queue of 3 may actually be a 0ms latency penalty most of the time... Which means during the 100% GPU surges / frame rate dips, the render queue is least likely to be used! It's a preallocated queue which isn't necessarily used. If this is what is happening, then the only scam is "being prepared" -- nothing wrong with that.

Now, you know how horrendously big the latency chain is...

Now, you can skip the black box by measuring the left and right ends: button to photons. How do you do that, you ask?

You need a photodiode oscilloscope to bypass all the FUD -- or a purpose built device similar to NVIDIA's LDAT. (We have an in-house device too, but that's mainly used for Blur Busters Approved and consulting services at the moment).

The proof is at the stopwatch endpoints -- the button press (mouse button or mouse move) is the stopwatch start. The light emitting from pixels on a screen is the stopwatch end. This becomes latency ground truth, but one also needs to bear in mind that not all pixels on a display refresh at the same time, and that GtG pixel response can vary for different colors (creating an error margin for latency measurements since some colors will begin emitting a millisecond sooner than others).

What a Latency Pandora's Box, eh? But latency connoisseurs know to bypass the FUD and get a photodiode device: Arduino homebuilt, vendor built (NVIDIA), or some third party device (of which there are a few). We might even sell our device too (...we're still deciding...)

Definitely, as researchers have seen situations where single milliseconds can affect displays (when tested under the right scientific variables) -- we are Milliseconds Matters people here at Blur Busters, (see The Amazing Human Visible Feats Of The Millisecond) but we are pragmatic about latency noise that may be misunderstood. We see lots of false blame (like blaming "X" when the latency problem is caused by "Y") in the industry, so be careful not to fall in the trap in this thread.

We are big fans of surgically troubleshooting the right/real problems, so secondary verification is helpful (e.g. parallel testing with a photodiode oscilloscope device, and/or a 1000fps high speed camera).

Posted: **17 Mar 2021, 10:48**

Chief Blur Buster wrote: ↑
16 Mar 2021, 21:12
Also, if you're running multiple 3D apps, like Furmark AND Kovaak simultaneously, utilities such as Frameview may be counting the aggregate total of prerendered frames in both software packages.

I doubt it, the logfile gives application name for every frame rendered.

If you have real world games that never max out 100% GPU, then one asks oneself: Is there a purpose to benchmarking a synthetic situation? If you're not getting latency of a 100%-maxed-out-GPU, isn't that good? Then why does one need to worry about the game's own latency during a 100%-maxed-out GPU situation that does not happen with games such as CS:GO anyway?

There is a purpose, which is to examine what those features actually do. The info shared from nvidia is lacking - 5 mentions on their entire website, none of them explaining the feature in more depth than 2 sentences.
And the little info we get is inaccurate. Low latency mode does NOT limit the frame queue. It might dump frames and never use the older ones but the cpu does prerender more than 1 frame and can get up to 5 at least.
Next issue is that there is a clear difference between low latency mode set to ultra vs off. It's very easy to tell by feel. And that's on a regular aim trainer not using much GPU at all. Running FrameView also reveals that frame queue depth is the same in that scenario (for me it's usually around 0.5 and never above 1!).
So if there is never more than 1 prerendered frame, why can you easily tell the difference with different llm settings?
My theory is that ULL goes beyond setting pre-rendered frames to 1 (ignoring the gsync stuff it does, bc I don't use gsync).
Point is if nvidia doesn't give info we need to test stuff ourselves. The synthetic scenario was chosen to test their proposition and it failed.

Thanks for the input chief, appreciate it!

I did some more testing and looked closer at the log files. Noticed a couple things:
1) Average frame queue depth is well below 1 for all settings (LLM off/on/ultra; Reflex on/off; max pre-render 1/8) at 100% GPU usage
2) All settings will have a few instances of framequeue going above 1 and 2, sometimes even 3
3) This is the most interesting: Usually after a high framequeue, let's say 3, the next reported frame queue depth is very low like 0.1.
If the queue was really being used you would expect the queue depth not to drop below 2 on the next frame.
My first thought is that the gpu takes the most recent prerender and dumps the rest.
4) This dumping happens even on LLM off with max pre-rendered set to 8. But I found instances where there is a high frame queue over consecutive frames. However that's not proof that the render queue is not being dumped either.
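For anyone who wants to repeat this kind of check, here is a minimal sketch of the summary described in points 1 to 3: the average depth, how many frames exceed each threshold, and where a deep queue is followed by a near-empty one. The thresholds, the definition of a "collapse", and the sample numbers are assumptions for illustration, not values from a real log.

```python
from statistics import mean


def summarize_queue_depth(depths, thresholds=(1, 2, 3)):
    """Summarize a list of per-frame render queue depth values."""
    # Count how many frames report a depth above each threshold.
    above = {t: sum(1 for d in depths if d > t) for t in thresholds}
    # A "collapse" here means a depth above 2 followed directly by one below 1.
    collapses = [
        (i, depths[i], depths[i + 1])
        for i in range(len(depths) - 1)
        if depths[i] > 2 and depths[i + 1] < 1
    ]
    return {"mean": mean(depths), "above": above, "collapses": collapses}


# Made-up sample values, only to show the shape of the output.
sample = [0.4, 0.6, 3.1, 0.1, 0.5, 2.2, 2.4, 0.3]
print(summarize_queue_depth(sample))
```

A collapse in this summary only shows that the reported number dropped sharply between two log rows; by itself it does not say whether the frames were discarded or presented in quick succession.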

TLDR: Testing with their own benchmarking tool, it's really hard to find a difference between all the LLM settings and max pre-render settings even when the GPU is maxed out. This is in stark contrast to the obvious change in mouse control when switching these settings (especially LLM ultra vs off).

Posted: **17 Mar 2021, 16:41**

chenifa wrote: ↑
17 Mar 2021, 10:48
My first thought is that the gpu takes the most recent prerender and dumps the rest.

Are you using VSYNC ON or NVIDIA Fast Sync?

If you're using VSYNC OFF, frame dumping never happens (unless the game intentionally does it -- but that's rare during VSYNC OFF). You can have variable-thickness frameslices, since the distance between tearlines is dependent on frame time, and you can have multiple frame slices per refresh cycle:

In multithreaded rendering, VSYNC OFF occasionally creates a temporary frame queue because of a blocked frame presentation (i.e. GPU too busy doing something else, like a maxed 100% GPU that is now blocking everything else), but then the frames spew out suddenly (it can take less than 1ms to present 3 frames during VSYNC OFF -- and even then they still appear as thin frame slices). Most of the time, GPUs pipeline differently than this, but situations may occur where present latency suddenly spikes (blocked call) while a different thread continues rendering (in multithread rendering), so you may have sudden brief spikes in frame queues, especially if you're pushing GPU limits.

But during VSYNC OFF, the frame queue can flush out very fast -- yet still all have frames visible -- see my diagrams below. So frames aren't necessarily "dumped" unseen during VSYNC OFF -- once you understand the relationship of VSYNC OFF versus scanout.

It's useful to know that a tearline follows the math of the horizontal scan rate. 1080p at 240Hz is usually 270,000 pixel rows per second, so a 0.5ms interval between two Present() calls creates a frameslice approximately 270000 * 0.0005 = 135 pixels tall (i.e. the sudden emptying of a frame queue in two consecutive Present()'s if Present() latency is 0.5ms). The horizontal scan rate is also known as the horizontal refresh rate (horizontal = number of pixel rows per second, including offscreen pixel rows in the blanking interval, which is usually less than 5% of visible resolution).
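As a quick sketch of that math (the function name is made up for illustration, and the scan rate should be read from your own display timings, e.g. in ToastyX CRU, rather than assumed):

```python
def frameslice_height_rows(horizontal_scan_rate_hz: float,
                           present_interval_s: float) -> float:
    """Pixel rows scanned out between two consecutive Present() calls."""
    return horizontal_scan_rate_hz * present_interval_s


# 270,000 rows/sec is the 1080p 240Hz figure quoted above.
SCAN_RATE_1080P240 = 270_000

for interval_ms in (0.5, 1.0, 2.0):
    rows = frameslice_height_rows(SCAN_RATE_1080P240, interval_ms / 1000.0)
    print(f"{interval_ms} ms between presents: ~{rows:.0f} rows tall")
```

Rows in the blanking interval are counted too, so a slice that straddles the blanking interval will look shorter on screen than the number suggests.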

I'm an expert on the black box between Present() and photons, including the pixels coming out of the GPU output. You can see my Tearline Jedi VSYNC OFF Raster Beam Racing Experiments, which is how Guru3D was inspired to invent RTSS Scanline Sync.

VSYNC OFF 432fps at 144 Hz

VSYNC OFF 1000fps at 144 Hz

Notice the thinner/thicker slices. That's from frametime variances. Some frametimes may take 0.5ms and other frame times may take 2ms. Those 2ms frametimes have 4x taller frameslices than 0.5ms frametimes.

And, to show mastery of tearline control in software programming:

[embedded video]

If you need to pick up your jaw from the floor, there's more YouTubes of Tearline Jedi mastery...

At 240Hz at 270KHz scanrate, I needed to program microsecond busywaits in 1/270000sec increments before Present()'ing the frame to move tearline downwards by 1 pixel. It all jitters because of timing imprecisions, but it demonstrates tearlines are just rasters based off the Horizontal Refresh Rate (the number visible in ToastyX Custom Resolution Utility).

To answer some questions: some of this stuff is measurable via tearline behavior too. I can know the latency of my Present()'s and frame queues if I stare at the ground truth of raster tearline positions (they're almost atomic-clock-like in precision). You can even use a high speed camera and use the tearline position as part of latency calculations, and the numbers actually end up matching photodiode measurements depending on whether the photodiode is put immediately above or immediately below the tearline. Latency is highest at the bottom of frameslices and lowest at the top edge of frameslices, and it's a continuous latency gradient along the vertical dimension of the frameslices (between two tearlines).

Frame queue latency can sometimes be sub-millisecond and still produce visible frame slices. So say, you present 3 consecutive frames (during VSYNC OFF) to quickly empty a frame queue (3 down to 1) in less than a millisecond -- that doesn't mean the frames are discarded, if you're using VSYNC OFF.

Example: If you present 3 frames quickly (present call unblocks, and the frame queue unblocks suddenly and flows) -- you might notice, say, 3 frames that are thin -- like, only 100 pixel tall frame slices (seen in high speed camera if you're trying to catch briefly visible tearlines). Like no tearlines for a while, then suddenly 3 tearlines in the same refresh cycle. (Hard to see unless you have a high speed camera). To turn tearline distances into driverside frame presentation latencies, you can even load ToastyX, get the horizontal scan rate, then calculate the time interval between frames simply by the distance between tearlines! You can even estimate: if it's 1/10th of the screen height between each pair of tearlines and there were three tearlines, each gap is 1/10th of a refresh time. So at 144Hz, you had 1/1440sec (0.7ms) between tearlines, during a sudden frame-queue-emptying that took maybe about 1.4ms total (2 x 0.7ms), measured from high speed footage by measuring distances between tearlines and converting them to time numbers that way. Clever method of tearline-position-as-clock (if done properly). It's a bit of overkill work, but valid if one was scientifically inclined to try to measure a black box this way via its outputted tearlines.
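Both conversions can be sketched in a few lines. The function names are hypothetical, and the rough estimate treats the visible screen height as one full refresh, i.e. it ignores the blanking interval:

```python
def interval_from_rows_ms(distance_rows: float,
                          horizontal_scan_rate_hz: float) -> float:
    """Time between two presents, from the pixel-row distance between tearlines."""
    return distance_rows / horizontal_scan_rate_hz * 1000.0


def interval_from_fraction_ms(screen_fraction: float, refresh_hz: float) -> float:
    """Rough estimate from the fraction of screen height between tearlines."""
    return screen_fraction / refresh_hz * 1000.0


# Three tearlines spaced 1/10th of the screen apart at 144Hz means two gaps.
gap_ms = interval_from_fraction_ms(0.10, 144)
print(f"per gap: ~{gap_ms:.2f} ms, total: ~{2 * gap_ms:.2f} ms")
```

For measured footage, the first function with the scan rate taken from ToastyX is the more precise of the two, since it accounts for the offscreen rows.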

So there are multiple ways to measure behaviours. I would like to close by noting that, when you add a photodiode oscilloscope to this, Present()-to-photons is essentially almost lagless at the first pixel row underneath the tearline: about 2-3ms (the latency of the DisplayPort transceiver + scaler/TCON + pixel GtG response).

The point being, if you're using VSYNC OFF, frames are typically not thrown out. Most of the time the queue stays at a depth of 1 because frame presentation (aka Present()) is almost non-blocking during VSYNC OFF. In rare cases it may block for whatever reason and the frame queue piles up, but the queue may then spew out as rapid sub-millisecond frameslices. So even if you deliver 3 frames in one millisecond, you may still have visible parts of all 3 frames on the screen anyway, thanks to the way VSYNC OFF splices frames into the scanout.

Scanout is the process of serializing a 2D image into a 1D stream for a 1D wire to a display. A GPU output spews out 1 pixel row every 1/270000sec for 1080p240 (just look at the "Refresh" under "Horizontal" in ToastyX to know how fast a GPU outputs one pixel row), and most gaming monitors just stream that pixel row in real time straight onto the panel in their existing top-to-bottom scanout sweep (there's a small rolling window of a few pixel rows for processing/GtG/etc, but it's effectively sub-refresh latency for cable:panel refresh).

What this really means, is you can present 3 frames very fast (less than a millisecond apart), and still have visible frameslices from ALL of them, despite less than 1ms to present 3 frames.

Of course, this assumes you're testing using VSYNC OFF rather than certain VSYNC ON / Fast Sync workflows (where frames CAN be discarded completely unseen)

P.S. This talk made RTSS Scanline Sync possible, when I communicated all this to Guru3D...

Posted: **18 Mar 2021, 07:22**

@Chief Blur Buster
I'm uploading the log file so you know what I'm talking about
(would have uploaded here, but the forum doesn't allow spreadsheets)
https://easyupload.io/rldt1q

Open the spreadsheet and separate by comma so the format is right.
Go to the column that says "Render Queue Depth", highlight the whole column and ctrl-f search for 3.946 (or 3,946).
This is the highest frame queue measured in this test round.
Right below you see the next frame, where render queue depth is now 0.138

How can the render queue drop from 3.946 to 0.138 with only 1 frame presented in between?
For me this indicates that a big chunk of the render queue got dumped.
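The same lookup can be scripted instead of searched by hand. This is only a sketch: the "Render Queue Depth" header is the column name mentioned above, while the "Application" column, the `example.exe` name, and the tiny inline log are stand-ins rather than real FrameView output.

```python
import csv
import io


def parse_depth(text: str) -> float:
    """Accept either '3.946' or '3,946' as a decimal number."""
    return float(text.strip().replace(",", "."))


def peak_and_next(rows, column="Render Queue Depth"):
    """Return (index, peak depth, depth of the following frame or None)."""
    # Skip rows where the column is missing or empty.
    depths = [parse_depth(r[column]) for r in rows if (r.get(column) or "").strip()]
    peak_index = max(range(len(depths)), key=depths.__getitem__)
    following = depths[peak_index + 1] if peak_index + 1 < len(depths) else None
    return peak_index, depths[peak_index], following


# Hypothetical stand-in for a log file; swap in open(path, newline="") for a real one.
sample_log = (
    "Application,Render Queue Depth\n"
    "example.exe,0.512\n"
    "example.exe,3.946\n"
    "example.exe,0.138\n"
)
rows = list(csv.DictReader(io.StringIO(sample_log)))
print(peak_and_next(rows))
```

If the log holds frames from more than one application, filter the rows by application name first so the peak and the following frame belong to the same process.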

I don't use v-sync/g-sync/freesync/fastsync. Max pre-render is set to 1.

Posted: **18 Mar 2021, 16:01**

chenifa wrote: ↑
18 Mar 2021, 07:22
@Chief Blur Buster
I'm uploading the log file so you know what I'm talking about
(would have uploaded here, but the forum doesn't allow spreadsheets)
https://easyupload.io/rldt1q

I can't access that outside USA... (Error 1020 Access Denied).

Might want to use a different file hosting service.

Posted: **19 Mar 2021, 02:04**

Chief Blur Busters, do most monitors wait for 1 entire frame of data before starting the rasterization process across each line?

I remember reading about a LG/Samsung TV that started the Rasterization process the moment they got enough data for one line.

This was supposed to help with gaming latency.

Posted: **19 Mar 2021, 07:02**

Chief Blur Buster wrote: ↑
18 Mar 2021, 16:01

chenifa wrote: ↑
18 Mar 2021, 07:22
@Chief Blur Buster
I'm uploading the log file so you know what I'm talking about
(would have uploaded here, but the forum doesn't allow spreadsheets)
https://easyupload.io/rldt1q
I can't access that outside USA... (Error 1020 Access Denied).

Might want to use a different file hosting service.

Maybe this one works
https://ufile.io/plu0qygp

Pre-Rendered Frames (scam!?)
