Special K can drastically reduce latency

Kaldaien
Posts: 21
Joined: 22 Jan 2020, 21:27

Re: Special K can drastically reduce latency

Post by Kaldaien » 08 Oct 2020, 03:23

Just discovered this thread ;) I feel weird not being part of the discussion on my own software...

Let me start by saying that the reason I have not rushed to make any bold claims or publish anything is that I want the sweet end-to-end validation that NVIDIA's LDAT tool can offer, instead of doing it all in software. Also, rather hilariously, latency never mattered to me at all until NVIDIA's Reflex PR.

I have been tuning my framerate limiter over the years for the sole purpose of scheduling frames at a constant rate so that awful console ports with physics tied to framerate work correctly on a wide range of PC hardware. There are some truly stupid games that cannot finish cutscenes if sampled time intervals are not aligned on perfect boundaries (i.e. NPC 0 cannot move from point A to point B without clipping an object in the scene and NPC 1 only starts doing something when NPC 0 reaches its final position (B)). Making those games work at arbitrary framerates was the reason SK's framerate limiter was created and up until a month ago, I was happy to leave it there :P

Simply placing the delay on the correct side of the swapchain Present (...) / GDI SwapBuffers (...) call was adequate to prevent my limiter from _introducing_ latency; that much was determined years and years ago. To be honest, I figured reducing latency could not be watered down to a two- or three-button process for the end-user, and I was content to leave my framerate limiter at "it does not unreasonably increase latency" and consider the design complete.

----

To my surprise, consistent timing + reduced latency is possible and I have spent the last month studying ways to minimize render queue latency without requiring the end-user to make 5 or 6 swapchain adjustments in Special K's control panel.

I have gone so far as to integrate Presentation Statistics (Fullscreen Exclusive / DXGI Flip Model) into my framepacing graph as a histogram showing present delay; it has proved incredibly useful to watch this data in real-time rather than rely on PresentMon to collect the information and analyze it later.

Even if achieving lower latency does not eventually boil down to a simple 2- or 3-click procedure, a power-user can watch the histogram and quickly begin to hunt down sources of latency (e.g. the Xbox Game Bar adds +1 frame of latency whenever it is visible, and other multi-plane overlays behave the same). That's tremendously powerful, and not something that has been done before.

The following video was captured for the purpose of testing NV's new HDR vidcap (works like a dream, BTW) when Special K's HDR functions are activated, but the framepacing widget (top-left) shows what I am discussing.

[embedded video]


Sadly, some of the tuning knobs that DXGI allows for lowering render queue latency are not practical in Direct3D 12 / Vulkan, because the driver does not implicitly manage resource allocation for the command queue(s) tied to each swapchain backbuffer in the low-level APIs. No doubt that is the same reason "Ultra Low Latency" is inapplicable in Vk/D3D12.

Latency and pacing in D3D12 are what they are; the best I have been able to do to improve on them is insert a WaitFor...Objects (...) call to delay frame-based game logic until there's an uncontested swapchain backbuffer to draw into -- most D3D12 engines should already be doing this (Horizon: Zero Dawn was not, and a screenshot earlier in this thread illustrates why engines should).
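For readers unfamiliar with the frames-in-flight pattern being described, here is a minimal, platform-independent sketch of the idea. This is a toy model, not Special K's actual code: `gpu_completed` stands in for what `ID3D12Fence::GetCompletedValue` would report, and the wait is modeled as simply advancing that counter.

```cpp
#include <algorithm>
#include <cstdint>

constexpr int kBackBuffers = 3;

// Toy single-threaded model of the D3D12 frames-in-flight pattern:
// before the CPU starts frame n it waits until the GPU has signalled the
// fence for frame n - kBackBuffers, i.e. until a backbuffer is uncontested.
struct FrameRing {
    uint64_t cpu_frame = 0;      // next frame the CPU will record
    uint64_t gpu_completed = 0;  // frames the GPU has finished
    int max_queued = 0;          // worst-case CPU-ahead-of-GPU depth

    void begin_frame() {
        // A real engine would WaitForSingleObject on a fence event here.
        if (cpu_frame >= kBackBuffers) {
            uint64_t wait_for = cpu_frame - kBackBuffers + 1;
            gpu_completed = std::max(gpu_completed, wait_for);
        }
        int queued = static_cast<int>(cpu_frame - gpu_completed) + 1;
        max_queued = std::max(max_queued, queued);
    }

    void end_frame() { ++cpu_frame; }
};

inline int simulate(int frames) {
    FrameRing ring;
    for (int i = 0; i < frames; ++i) { ring.begin_frame(); ring.end_frame(); }
    return ring.max_queued;
}
```

The point of the wait is that the CPU's lead over the GPU stays bounded by the backbuffer count, so game logic never samples input for a frame that has nowhere to go.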


-----
Tl;Dr:

Why wasn't I invited to this party? :P

I am very pleased to hear that other individuals have requisitioned LDAT hardware from NVIDIA.

Distributing those tools to popular content creators, without a formal application process for actual developers or even outreach to well-known developers (such as Unwinder), is baffling. GamersNexus and Digital Foundry can undoubtedly put the tools to good use, but I think that's where the list ends; any other YouTuber who received one was given a fancy tool they will never use :(

Kaldaien
Posts: 21
Joined: 22 Jan 2020, 21:27

Re: Special K can drastically reduce latency

Post by Kaldaien » 08 Oct 2020, 03:50

RealNC wrote:
28 Sep 2020, 08:16
axaro1 wrote:
28 Sep 2020, 06:32
Special K seems incredibly consistent, the frametime deviation is 10x lower than RTSS or Nvidia's built in fps limiter
The frametime graph with RTSS is perfectly flat. So how can SK have 10x lower deviation if it's already 0 with RTSS?
Don't trust your own statistics if you're writing software that both measures and aims to improve them ;P

Use a third-party data collector or you are just giving margin of error more significance than it deserves.

---
That said, I'm familiar with that particular claim re: frametime std. deviation, and it is down to RTSS applying the limit on the wrong side of the API call that presents finished frames.

RTSS imposes its limit after Present returns, which really sucks if the application has VSYNC enabled, since a full swapchain plus n-many queued undisplayed frames causes the calling thread to be re-scheduled. The DWM uses a signal-based (VBLANK) wait when it reschedules threads; RTSS just uses Sleep (...) with some non-zero value. Waiting after Present, without a semaphore to transition the thread back to ready-to-run, causes an additional round-trip through the thread scheduler, and you are guaranteed poor timing results if VSYNC is enabled (which it is for the majority of users playing games).

SK can be configured to apply the limit on either side of the queue submission function.

You will get significantly more stable frame throughput if you allow the engine to batch all commands for a frame and then force it to wait on the swapchain (see DXGI 1.3 Latency Waitable SwapChains), at the expense of added input latency. My opinion is that the very slim latency reduction possible by applying the framerate limit _after_ Present (...) should be reserved for specialized applications (e.g. G-SYNC), and the default behavior should be to delay before present.
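The trade-off can be illustrated with simulated timestamps (this is not real DXGI code, and the interval and render cost are assumed numbers):

```cpp
// Simulated-time sketch contrasting where a frame limiter sleeps relative
// to Present(). All times in milliseconds; render_ms is the assumed CPU
// frame cost, interval_ms the limiter's target frame interval.
struct LatencyResult { double input_to_present; };

inline LatencyResult simulate_limiter(bool wait_before_present,
                                      double interval_ms, double render_ms) {
    double t = 0.0;
    double input_time = t;            // input sampled at frame start
    t += render_ms;                   // engine batches the frame's commands
    if (wait_before_present) {
        t = interval_ms;              // limiter waits, THEN presents
    }
    double present_time = t;          // Present() submits the frame
    // (wait-after-present mode would instead sleep until interval_ms before
    //  sampling input for the NEXT frame; it does not delay this frame)
    return {present_time - input_time};
}
```

With a 16 ms target and a 4 ms frame cost, the wait-before-present mode reports 16 ms of input-to-present latency versus 4 ms for wait-after: exactly the latency-for-stability trade being described.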

User avatar
Chief Blur Buster
Site Admin
Posts: 11653
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada
Contact:

Re: Special K can drastically reduce latency

Post by Chief Blur Buster » 08 Oct 2020, 03:56

Kaldaien wrote:
08 Oct 2020, 03:23
Just discovered this thread ;) I feel weird not being part of the discussion on my own software...

Let me start by saying that, the reason I have not rushed to make any bold claims or publish anything is because I want the sweet end-to-end validation that NVIDIA's LDAT tool can offer instead of doing it all in software. Also, rather hilariously, latency never mattered to me at all until NVIDIA's Reflex PR.
Blur Busters loves everything temporal (Hz/fps/GtG/MPRT/lag/framepacing/etc), whether at www.blurbusters.com/category/area51-display-research or in discussions here on these forums!

Recently, I became one of the few people in the world to have an "8000 Hz mouse + 360 Hz monitor" combo, and it's a next-level upgrade already!

Are you aware I'm the one who helped Guru3D add the RTSS Scanline Sync feature, based on my raster-based tearline steering research (Tearline Jedi)? I think it can be made WAY more user-friendly, while introducing an inputdelay feature, with some automation algorithms that I've dreamt up but that Guru3D doesn't have time to implement.

Though waitable swapchains with inputdelay may actually come really darn close to RTSS Scanline Sync. On top of that, a custom resolution with large blanking intervals can create a Quick Frame Transport enhancement that reduces input lag even further for synced technologies (VSYNC ON, Scanline Sync, or other non-VSYNC-OFF tech). A custom 60Hz fixed-Hz mode can then behave, lag-wise, like 60fps@144Hz VRR or 60fps@240Hz VRR: perfect for low-latency strobing use cases, where strobing looks best with framerate=Hz motion.

(Hopefully you stay around to participate more in this discussion! BTW, after a few posts, all new users no longer need the Moderation Queue, just so you know)
Head of Blur Busters - BlurBusters.com | TestUFO.com | Follow @BlurBusters on Twitter

Forum Rules wrote:  1. Rule #1: Be Nice. This is published forum rule #1. Even To Newbies & People You Disagree With!
  2. Please report rule violations If you see a post that violates forum rules, then report the post.
  3. ALWAYS respect indie testers here. See how indies are bootstrapping Blur Busters research!

Unwinder
Posts: 10
Joined: 08 Oct 2020, 06:12

Re: Special K can drastically reduce latency

Post by Unwinder » 08 Oct 2020, 06:44

Kaldaien wrote:
08 Oct 2020, 03:50
RTSS imposes its limit after Present returns, which really sucks if the application has VSYNC enabled since a full swapchain + n-many queued undisplayed frames causes the calling thread to be re-scheduled. The DWM uses a signal-based (VBLANK) wait when it reschedules threads, RTSS just uses Sleep (...) with some non-zero value. Waiting after-Present, and without a semaphore to transition the thread back to ready-to-run, causes an additional round-trip through the thread scheduler and you are guaranteed
It is a question of comparing apples to oranges. It depends on your limiter's priorities and the target you want to achieve with it. Synchronizing the limiter to the front edge of the Present call (i.e. waiting in the hook immediately before present) allows you to present frames at synchronous moments in time and get a flatter-looking frametime graph in CapFrameX, or in anything else based on PresentMon (where frametime measurements are based on differences between DXGI present timestamps). But it also guarantees an increase in input latency, because you are always waiting after finishing CPU rendering but before actually presenting the frame. And no, you cannot see it in CapFrameX.
In such a limiting/wait mode the CPU will never start rendering at synchronous moments in time, so input sampling and frame rendering start points will jitter. So that's a case of flat frametime if it is measured at DXGI present points, and jittering frametime if it is measured at CPU rendering start points.
If you want to prioritize and synchronize each new frame's rendering start timestamp, and avoid increasing input latency with a limiter, you go the other way and synchronize the limiter to the back edge of the Present call (i.e. implement the wait after present). That's the case of flat frametime when it is measured at CPU rendering start points (which is where RTSS measures it), and jittering frametime if it is measured at DXGI present points.
RTSS provides sync both ways: front edge for SSYNC mode, back edge for regular limiter mode.
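This measurement-point argument can be reproduced numerically. The sketch below (hypothetical render times, not RTSS code) computes frametime deviation for the same limiter when frametimes are taken at present timestamps versus CPU-start timestamps:

```cpp
#include <cmath>
#include <vector>

// T is the limiter interval in ms; render[] are assumed per-frame CPU
// render costs. Front-edge limiting pins presents to a fixed grid; back-
// edge limiting pins CPU starts to a fixed grid. Deviation of the frame-
// to-frame deltas then depends entirely on where you measure.
struct Deviation { double at_present; double at_cpu_start; };

inline double stddev(const std::vector<double>& v) {
    double mean = 0, var = 0;
    for (double x : v) mean += x;
    mean /= v.size();
    for (double x : v) var += (x - mean) * (x - mean);
    return std::sqrt(var / v.size());
}

inline Deviation measure(bool front_edge, double T,
                         const std::vector<double>& render) {
    std::vector<double> present, cpu_start;
    for (std::size_t k = 0; k < render.size(); ++k) {
        if (front_edge) {                        // wait just before Present
            present.push_back(k * T);            // presents on a fixed grid
            cpu_start.push_back(k * T - render[k]); // CPU start jitters
        } else {                                 // wait after Present
            cpu_start.push_back(k * T);          // CPU starts on a fixed grid
            present.push_back(k * T + render[k]);   // presents jitter
        }
    }
    auto diffs = [](const std::vector<double>& t) {
        std::vector<double> d;
        for (std::size_t i = 1; i < t.size(); ++i) d.push_back(t[i] - t[i - 1]);
        return d;
    };
    return {stddev(diffs(present)), stddev(diffs(cpu_start))};
}
```

Front-edge limiting yields zero deviation at present points and jitter at CPU-start points; back-edge limiting yields the mirror image, which is why PresentMon-based tools and RTSS's own statistics can disagree about the same limiter.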

Kaldaien
Posts: 21
Joined: 22 Jan 2020, 21:27

Re: Special K can drastically reduce latency

Post by Kaldaien » 08 Oct 2020, 08:36

Chief Blur Buster wrote:
08 Oct 2020, 03:56
Recently, I became one of the few people in the world to have an "8000 Hz mouse + 360 Hz monitor" combo, and it's a next-level upgrade already!
Oh, that is awesome. NVIDIA's driver goes insane (massive hitches whenever there is rapid movement) at high mouse polling rates with G-SYNC enabled if the underlying game uses the Windows cursor rather than drawing its own cursor, so that could very well be problematic in some games.
Chief Blur Buster wrote:
08 Oct 2020, 03:56
Are you aware I'm the one who helped Guru3D add the RTSS Scanline Sync feature, based on my raster-based tearline steering research (Tearline Jedi)? I think it can be made WAY more user friendly, while introducing inputdelay feature, with some automation algorithms that I've dreamt up but that Guru3D doesn't have the time to do.
Since I prefer windowed mode, I've never had much opportunity to play with Scanline Sync. I wonder whether that's even possible without removing the DWM from the equation?

Kaldaien
Posts: 21
Joined: 22 Jan 2020, 21:27

Re: Special K can drastically reduce latency

Post by Kaldaien » 08 Oct 2020, 08:56

Unwinder wrote:
08 Oct 2020, 06:44
Kaldaien wrote:
08 Oct 2020, 03:50
RTSS imposes its limit after Present returns, which really sucks if the application has VSYNC enabled since a full swapchain + n-many queued undisplayed frames causes the calling thread to be re-scheduled. The DWM uses a signal-based (VBLANK) wait when it reschedules threads, RTSS just uses Sleep (...) with some non-zero value. Waiting after-Present, and without a semaphore to transition the thread back to ready-to-run, causes an additional round-trip through the thread scheduler and you are guaranteed
It is a question of comparing apples to oranges. It depends on your limiter's priorities and the target you want to achieve with it. Synchronizing the limiter to the front edge of the Present call (i.e. waiting in the hook immediately before present) allows you to present frames at synchronous moments in time and get a flatter-looking frametime graph in CapFrameX, or in anything else based on PresentMon (where frametime measurements are based on differences between DXGI present timestamps). But it also guarantees an increase in input latency, because you are always waiting after finishing CPU rendering but before actually presenting the frame. And no, you cannot see it in CapFrameX.
In such a limiting/wait mode the CPU will never start rendering at synchronous moments in time, so input sampling and frame rendering start points will jitter. So that's a case of flat frametime if it is measured at DXGI present points, and jittering frametime if it is measured at CPU rendering start points.
If you want to prioritize and synchronize each new frame's rendering start timestamp, and avoid increasing input latency with a limiter, you go the other way and synchronize the limiter to the back edge of the Present call (i.e. implement the wait after present). That's the case of flat frametime when it is measured at CPU rendering start points (which is where RTSS measures it), and jittering frametime if it is measured at DXGI present points.
RTSS provides sync both ways: front edge for SSYNC mode, back edge for regular limiter mode.
Thank you, that makes sense. But I disagree that waiting before presenting _always_ increases latency.

If the thing you are waiting on is a signal from the DWM indicating that Present (...) will return without blocking, you can stage all commands and then immediately begin the next frame. This turns what would be two flips (application back/front, then DWM back/front using app's front buffer) into a single flip from application back to DWM front, and removes an entire frame of latency. That only works if the wait happens before calling Present.
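As back-of-envelope accounting for the two-flip vs one-flip argument (idealized: each extra flip costs one full refresh; the numbers are assumptions, not measurements):

```cpp
// Idealized model: a finished frame waits one refresh boundary per flip
// before reaching the screen. Composited path = 2 flips (app back/front,
// then DWM); the wait-on-DWM-signal path collapses this to 1 flip.
inline double present_to_screen_ms(int flips, double refresh_ms) {
    return flips * refresh_ms;
}
```

At 60 Hz (16.7 ms refresh), collapsing two flips into one saves one full refresh, i.e. roughly 16.7 ms, matching the "entire frame of latency" claim above.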

I think a lot of this is down to the fact that I don't use fullscreen exclusive, so wisdom that holds for traditional fullscreen is slightly different.

Unwinder
Posts: 10
Joined: 08 Oct 2020, 06:12

Re: Special K can drastically reduce latency

Post by Unwinder » 08 Oct 2020, 09:10

Kaldaien wrote:
08 Oct 2020, 08:56
Thank you, that makes sense. But I disagree that waiting before presenting _always_ increases latency.
It depends on the definition of latency. If we're talking about input latency, then a wait before present always increases it. If we're talking about render latency, then a wait before present can decrease it when we steer Present close to VBLANK. (That's exactly what SSYNC is doing: it tries to steer the Present call to the desired rasterizer scanline position, VBLANK - offset, so we can expect the frame to be presented close to or inside VBLANK, and we can steer the tearline position if the GPU is powerful enough to provide minimal render latency.) If we're talking about end-to-end latency, the result can be mixed (increased input latency vs. potentially decreased render latency, with no clear winner).

User avatar
Chief Blur Buster
Site Admin
Posts: 11653
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada
Contact:

Re: Special K can drastically reduce latency

Post by Chief Blur Buster » 08 Oct 2020, 12:28

Different games do this differently (single-threaded, multithreaded, etc.).
If it's a single thread, then it's normally this loop:

[...-inputread-render-Present()-inputread-render-Present()-inputread-render-Present()-...]

For a VSYNC ON technology, the frame delivered by Present() will not be displayed until the next fixed refresh cycle. That adds VSYNC ON latency. You can decrease VSYNC ON latency in this workflow by monitoring how long rendertimes take, while knowing the display refresh rate, and then predictively adding an inputdelay AFTER present (plus some safety margin so we don't have a VSYNC miss).

[...-inputread-render-Present()-inputdelay-inputread-render-Present()-inputdelay-inputread-render-Present()-inputdelay-...]

This technique pushes the inputread closer to the next frame presentation. Present() will (almost always) block for a much shorter time period, and the frame is displayed sooner on screen relative to inputread.
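The predictive calculation being described can be sketched as follows (the function names, numbers, and the assumption that the frame always makes its target refresh are all illustrative; real code would measure rendertimes and the refresh period):

```cpp
#include <algorithm>
#include <cmath>

// With VSYNC ON, delay the input read so the frame completes just before
// the next refresh. margin_ms guards against a VSYNC miss (a missed
// refresh would cost a whole extra refresh of latency).
inline double input_delay_ms(double refresh_ms, double predicted_render_ms,
                             double margin_ms) {
    return std::max(0.0, refresh_ms - predicted_render_ms - margin_ms);
}

// Input-to-scanout latency in ms, assuming the input is read at t = delay,
// the frame renders for render_ms, and scan-out happens at the next
// refresh boundary after Present.
inline double latency_ms(double refresh_ms, double render_ms, double delay_ms) {
    double present_time = delay_ms + render_ms;
    double scanout = refresh_ms * std::ceil(present_time / refresh_ms);
    return scanout - delay_ms;
}
```

With a 16 ms refresh, a predicted 4 ms rendertime, and a 2 ms safety margin, the delay comes out to 10 ms, cutting input-to-scanout latency from 16 ms to 6 ms while still making the same refresh.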

With this classical workflow, an inputdelay before Present() could increase lag, as in this particular case:

[...-inputread-render-inputdelay-Present()-inputread-render-inputdelay-Present()-inputread-render-inputdelay-Present()-...]

Now... I've seen far more complex workflows, though (multicore / multithreaded rendering, etc.), where the location of the inputdelay can create different effects. There may be certain cases where rendering in a different thread means that an inputdelay before present doesn't cause a problem. However, if you're hooking Present() and doing an inputdelay before the real Present(), there will be latency. On the other hand, if you're hooking differently, whatever code you have for Present() may grab the freshest frame from the game's rendering thread, so that inputdelay-before-present may not have added latency. In other words, the inputdelay-before-present didn't affect latency in that particular multithreaded workflow.

Configurability will tend to win the day; that's why framecapping requires so many checkboxes to tick.

User avatar
Chief Blur Buster
Site Admin
Posts: 11653
Joined: 05 Dec 2013, 15:44
Location: Toronto / Hamilton, Ontario, Canada
Contact:

Re: Special K can drastically reduce latency

Post by Chief Blur Buster » 08 Oct 2020, 12:39

Kaldaien wrote:
08 Oct 2020, 08:56
If the thing you are waiting on is a signal from the DWM indicating that Present (...) will return without blocking, you can stage all commands and then immediately begin the next frame. This turns what would be two flips (application back/front, then DWM back/front using app's front buffer) into a single flip from application back to DWM front, and removes an entire frame of latency. That only works if the wait happens before calling Present.

I think a lot of this is down to the fact that I don't use fullscreen exclusive, so wisdom that holds for traditional fullscreen is slightly different.
1. RTSS Scanline Sync works with the DWM if you configure for the hidden tearline first. (Use VSYNC OFF first, steer the tearline until it is visible above the bottom edge of the screen, THEN go DWM / triple buffer.) The distance of the tearline from the bottom IS the inputdelay before present :D

2. Inputdelay to reduce VSYNC ON blocking delay is already a form of software-based beamracing (time offsets between two VSYNCs can be used to estimate a raster scan line number), so inputdelay is sort of a guesstimate-equivalent of RTSS Scanline Sync.
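The guesstimate in point 2 can be sketched like this (the vertical total of 1125 lines is an assumed example; a real implementation would query the actual timing from the driver or a timing calculator):

```cpp
#include <cmath>

// Software beamrace estimate: given the timestamp of the last VBLANK and
// the refresh period, guess which scanline the rasterizer is on right
// now. total_lines includes the blanking lines.
inline int estimate_scanline(double now_ms, double last_vblank_ms,
                             double refresh_ms, int total_lines) {
    double phase = std::fmod(now_ms - last_vblank_ms, refresh_ms) / refresh_ms;
    return static_cast<int>(phase * total_lines);
}
```

Halfway through a 16 ms refresh on a display with 1125 total lines, the estimate lands around line 562, which is all an inputdelay-based scheme needs to aim a wait at a rough raster position.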

RTSS Scanline Sync actually has low lag numbers..... But it's a de facto inputdelay-before-present, since it's waiting for a scan line number. That can add lag, but the technique still eliminates frame queueing, so you reduce lag more than you add: no more backlog of frames.

The rave reviews of both are justified because they reached the "same" end goal differently.

Two different approaches to solving the same problem.

Aren’t happy accidents great?

Unwinder
Posts: 10
Joined: 08 Oct 2020, 06:12

Re: Special K can drastically reduce latency

Post by Unwinder » 08 Oct 2020, 12:53

Chief Blur Buster wrote:
08 Oct 2020, 12:28
However, if you're hooking Present() and doing an inputdelay before the real Present(), there will be latency.
That's the only case in the context of this thread, Marc. Any third-party limiter (and even NV's own framerate limiter) is technically a Present() hook from the application's point of view, with the delay added either before (in the case of SpecialK, or RTSS working in SSYNC mode) or after (in the case of the NV driver limiter, or RTSS working in regular limiter mode) the real Present() call.
Configurability wins, no doubt, and I see no problem adding a GUI option for waiting before the real Present() in RTSS in traditional framerate limiter mode (it is already inside, but active for SSYNC only). But I hardly see a practical use for it besides a flatter frametime graph in CapFrameX. Probably with the only exception of using it for some hybrid Scanline Sync mode, where ssync would be performed just once to steer the tearline to the initial desired rasterizer position, then a regular framerate limiter with the wait performed before invoking the real Present() would do the rest of the synchronization and tearline stabilization job.

Post Reply