Thanks to my own suggestion to their team --
RTSS now has a new automatic Low-Lag VSYNC ON mode.
It supports beam-raced page flips now (VBI racing), producing what may be the lowest possible VSYNC ON input lag -- without additional programmer work to reduce input lag further in the gaming software.
(Original ChangeLog: https://forums.guru3d.com/threads/rtss- ... st-5549072 ...)
New feature highlighted in blue.
Unwinder (post: 5550748, member: 30019) wrote:
RTSS 7.2.0 beta 1 is online:
http://www.guru3d.com/files-details/rts ... nload.html
· Added On-Screen Display performance profiler. Power users may enable it to measure and visualize CPU and GPU performance overhead added by On-Screen Display rendering. Two performance profiling modes are available:
o Compact mode provides basic and the most important CPU prepare (On-Screen Display hypertext formatting, parsing and tessellation), CPU rendering and total CPU times, as well as GPU rendering time (currently supported for Direct3D9+ and OpenGL applications only)
o Full mode provides additional and more detailed per-stage CPU times
· Improved built-in framerate limiter:
o Added power user oriented profile setting, allowing you to specify the limit directly as a target frametime with 1 microsecond precision
o Added power user oriented profile setting, allowing you to adjust throttle time. Throttle time adjustment is aimed to reduce input lag when framerate is below the target limit or without limiting the framerate
o Added power user oriented profile setting, allowing you to synchronize framerate to up to two independent scanline indices per refresh interval. Combined with the user-configurable scanline wait timeout, those settings give experienced users low input lag adaptive VSync or FastSync functionality on any hardware
· Various On-Screen Display optimizations and improvements:
o Added adjustable minimum refresh period for On-Screen Display renderer. The period is set to 10 milliseconds by default, so now the On-Screen Display is not allowed to be refreshed more frequently than 100 times per second. Such implementation allows keeping smooth animation when On-Screen Display contents are being updated on each frame (e.g. when displaying realtime frametime graph) without wasting too much CPU time on it
o Added alternate GPU copy based Vector2D On-Screen Display rendering mode implementation for Direct3D1x applications. New mode provides up to 5x Vector2D performance improvement on NVIDIA graphics cards, however it is disabled on AMD hardware due to slow implementation of CopySubresourceRegion in AMD display drivers
o Vector2D rendering mode is now forcibly disabled in Vulkan applications on AMD graphics cards due to insanely slow implementation of vkCmdClearAttachments in AMD display drivers
o Revamped geometry batching and vertex buffer usage strategy in pure Direct3D12 On-Screen Display renderer (currently used in Halo Wars 2 only)
o Added Vector2D rendering mode support to pure Direct3D12 On-Screen Display renderer
o Optimized On-Screen Display hypertext parsing and tessellation implementation
o Optimized state changes in OpenGL On-Screen Display rendering implementation
o Optimized state changes in Direct3D1x On-Screen Display rendering implementation
o Solid rectangles and line primitives in Direct3D8 and Direct3D9 On-Screen Display rendering implementations are now rendered from vertex buffer instead of user memory
o Improved OpenGL framebuffer dimensions detection when framebuffer coordinate space is selected
· Fixed On-Screen Display rendering in wrong colors when Vector2D mode is selected and Direct3D1x applications use 10-bit framebuffer
· Fixed Vulkan fence synchronization issue, which could cause GPU-limited Vulkan applications to hang due to attempt to reuse busy command buffer
· Active busy-wait loop in the framerate limiter module is now forcibly interrupted during unloading the hooks library to minimize the risk of deadlocking 3D application when dynamically closing RivaTuner Statistics Server during 3D application runtime
· Improved synchronization in 32-bit hook uninstallation routines
· Updated profiles list
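To make the microsecond numbers in the changelog concrete, here is a small Python sketch of the framerate/frametime arithmetic. The function names are made up for illustration and are not RTSS settings; the changelog does not name the profile field, so none is shown.

```python
def fps_to_frametime_us(fps: float) -> int:
    """Convert a target framerate to a frametime in whole microseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return round(1_000_000 / fps)


def frametime_us_to_fps(frametime_us: int) -> float:
    """Convert a frametime in microseconds back to frames per second."""
    if frametime_us <= 0:
        raise ValueError("frametime must be positive")
    return 1_000_000 / frametime_us


if __name__ == "__main__":
    # A 60 FPS limit expressed as a target frametime.
    print(fps_to_frametime_us(60))
    # The 10 ms minimum OSD refresh period caps OSD updates per second.
    print(frametime_us_to_fps(10_000))
```

A frametime target is handy for limits that are not whole numbers of frames per second, e.g. 59.94 FPS.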
A few notes about new toys for power users:
New performance profiler
Performance profiler can be enabled by setting PerformanceProfiler field in [OSD] section to 1 (basic mode) or 2 (detailed mode). "Show own statistics" must be enabled in RTSS to see the profiler. The following performance counters are available for detailed mode:
CPU acquire – CPU time spent on acquiring access to the 3D API. This CPU time depends on the 3D API used by the application, and in most cases it is zero. For D3D12 applications displaying the OSD in D3D11on12 mode, it is the CPU time spent on acquiring the D3D11on12 wrapper for rendering. In Vulkan applications asynchronously presenting frames from the compute queue (e.g. DOOM or Wolfenstein II on AMD cards), it is the CPU time spent on synchronizing the graphics and compute queues. For OpenGL applications it can be nonzero if the application forcibly flushes the pipeline at the end of each frame with glFlush. The CPU acquire stage is executed on each frame.
CPU prepare – CPU time spent on preparing OSD contents for rendering. This CPU time doesn’t depend on the 3D API used by the application; it depends entirely on the amount of text and graphs you’re displaying in the OSD. CPU prepare time is divided into the following substages: init, parse and tessellate. Init is the CPU time spent on formatting the OSD contents that RTSS itself provides (i.e. formatting its own framerate counters, scanning hypertext and replacing framerate macros with real formatted framerate values, formatting performance counters, benchmark statistics etc). Parse is the CPU time spent on parsing the resulting OSD hypertext (including the hypertext supplied by OSD clients like MSI AB or HwInfo), processing hypertext formatting tags and converting the OSD contents into a collection of text with attributes to be tessellated in the next stage. Tessellate is the CPU time spent on converting parsed OSD text and attributes to renderable form (a collection of vector rects for each symbol in the vector 2D/3D OSD rendering modes, or a collection of textured quads for each symbol in raster 3D mode). The CPU prepare stage is executed only on frames where the OSD contents are refreshed, i.e. if you’re displaying an OSD with a framerate counter and the default refresh period in RTSS properties (500 ms), then the OSD is refreshed and this stage is executed just twice per second.
CPU render – CPU time spent on rendering the OSD. This CPU time depends on the 3D API used by the application and on the OSD rendering mode selected in RTSS (Vector2D, Vector3D or Raster3D). CPU render time is divided into the following substages: save, submit and restore. Save is the CPU time spent on saving the 3D rendering pipeline state before rendering the OSD. This substage depends entirely on the 3D API used by the application; for example, state changes are most expensive for Direct3D9 applications (especially pure Direct3D9 ones). Low-level 3D APIs (pure Direct3D12 or Vulkan) do not require saving pipeline state, so this CPU time is zero. Vector2D OSD rendering mode also doesn’t require saving and restoring the rendering pipeline state, so it is zero in this case too. Submit is the CPU time spent on filling vertex buffers with previously tessellated OSD geometry and submitting it to the 3D API. Restore is the CPU time spent on restoring the previously saved 3D rendering pipeline state after drawing the OSD. The CPU render stage is executed on each frame.
CPU capture – CPU time spent on capturing framebuffer contents. This stage is executed, and its time is nonzero, only during video capture.
CPU flush – CPU time spent on the final stage of flushing the OSD renderer and returning control to the application’s 3D API. For D3D12 applications displaying the OSD in D3D11on12 mode, this is the D3D11on12 wrapper flushing time. For applications using other 3D APIs it is zero. This stage is executed on each frame.
CPU total – total CPU time including all stages listed above.
GPU render – GPU time spent on rendering the OSD. This performance counter is currently collected only for Direct3D9, Direct3D10 and Direct3D11 applications, Direct3D12 applications displaying the overlay in D3D11on12 mode, and OpenGL applications. GPU render time profiling is currently not supported for Vulkan and pure Direct3D12 applications.
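Since CPU prepare only runs on frames where the OSD is refreshed while the other stages run on each frame, the average per-frame cost is lower than the worst-case frame. Here is a rough Python sketch of that arithmetic. The class, field names and sample timings are invented for illustration and are not RTSS output; it also ignores the separate 10 ms minimum refresh period.

```python
from dataclasses import dataclass


@dataclass
class OsdCpuTimes:
    """Hypothetical per-stage CPU times in microseconds (illustrative only)."""
    acquire_us: float
    prepare_us: float   # init + parse + tessellate, only on OSD refresh frames
    render_us: float    # save + submit + restore
    capture_us: float   # nonzero only during video capture
    flush_us: float

    def regular_frame_us(self) -> float:
        """CPU time on a frame where the OSD contents are not refreshed."""
        return self.acquire_us + self.render_us + self.capture_us + self.flush_us

    def refresh_frame_us(self) -> float:
        """CPU time on a frame where the OSD contents are refreshed."""
        return self.regular_frame_us() + self.prepare_us


def average_cpu_us_per_frame(times: OsdCpuTimes, fps: float,
                             osd_refresh_period_ms: float = 500.0) -> float:
    """Average OSD CPU time per frame, amortizing the prepare stage."""
    if fps <= 0 or osd_refresh_period_ms <= 0:
        raise ValueError("fps and refresh period must be positive")
    refreshes_per_second = 1000.0 / osd_refresh_period_ms
    # At most every frame can be a refresh frame.
    refresh_fraction = min(1.0, refreshes_per_second / fps)
    return times.regular_frame_us() + times.prepare_us * refresh_fraction


if __name__ == "__main__":
    sample = OsdCpuTimes(acquire_us=0, prepare_us=200, render_us=50,
                         capture_us=0, flush_us=0)
    print(average_cpu_us_per_frame(sample, fps=100))
```

With the default 500 ms refresh period and 100 FPS, only 2 of every 100 frames pay the prepare cost.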
New scanline sync based framerate limiter
Before you start experimenting with the new sync mode, it is recommended to enable diagnostic scanline sync info in the OSD by setting the SyncInfo field in the [OSD] section to 1. "Show own statistics" must also be enabled in RTSS to see it. The new scanline sync based framerate limiter is controlled by the following values:
SyncDisplay – name of logical display device to be synchronized with. Currently it is a primary display name.
SyncScanline0 – index of the first scanline for framerate synchronization. No synchronization is performed when it is set to zero; otherwise it is treated as a scanline index counted from the top of the frame. E.g. SyncScanline0=1 means that each frame will be synchronized with the top of the frame (more precisely with the second scanline, because indices are zero-based), and SyncScanline0=1000 means that each frame will be synchronized with scanline 1000 (which is located in the bottom part of the screen if we use 1080p mode with 1125 scanlines total).
SyncScanline1 – index of the second scanline for framerate synchronization. Defining two independent synchronization points per refresh gives us the functionality of NVIDIA's FastSync, where we get a smooth framerate of 2x the refresh rate. No synchronization is performed when it is set to zero; otherwise it is treated as an index counted from the middle of the frame. E.g. with 1125 scanlines total, SyncScanline1=1 means that each frame will be synchronized with scanline 562 (1125/2) + 1 = 563, and SyncScanline1=400 means that each frame will be synchronized with scanline 562 (1125/2) + 400 = 962 (which is located in the bottom part of the screen if we use 1080p mode with 1125 scanlines total).
SyncTimeout – adjusts the timeout for scanline synchronization. The timeout provides functionality similar to NVIDIA’s Adaptive VSync, meaning that you can forcibly disable synchronization when the framerate drops below the refresh rate. The timeout can be specified either explicitly in microseconds (e.g. SyncTimeout=16667 for a 60Hz refresh rate), or you can let RTSS benchmark and calibrate it automatically: when SyncTimeout=N is in the [1,8] range, the timeout is set to 1/N of the refresh time.
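The index rules above can be sketched in a few lines of Python. This is only my reading of the description, not RTSS code. The function names are made up. Treating SyncTimeout=0 as "no timeout" is an assumption, and so is using the nominal refresh rate in place of the value RTSS benchmarks.

```python
from typing import Optional


def scanline0_target(sync_scanline0: int) -> Optional[int]:
    """Absolute scanline for SyncScanline0, counted from the top of the frame."""
    if sync_scanline0 < 0:
        raise ValueError("scanline index must not be negative")
    if sync_scanline0 == 0:
        return None  # zero means this sync point is disabled
    return sync_scanline0


def scanline1_target(sync_scanline1: int, total_scanlines: int) -> Optional[int]:
    """Absolute scanline for SyncScanline1, counted from the middle of the frame."""
    if sync_scanline1 < 0:
        raise ValueError("scanline index must not be negative")
    if sync_scanline1 == 0:
        return None  # zero means this sync point is disabled
    return total_scanlines // 2 + sync_scanline1


def timeout_us(sync_timeout: int, refresh_hz: float) -> Optional[int]:
    """Effective scanline wait timeout in microseconds."""
    if sync_timeout < 0:
        raise ValueError("timeout must not be negative")
    if sync_timeout == 0:
        return None  # assumption: zero means no timeout is applied
    if 1 <= sync_timeout <= 8:
        # Auto mode: 1/N of the refresh time. RTSS benchmarks the real
        # refresh time; the nominal rate is used here as a stand-in.
        return round(1_000_000 / refresh_hz / sync_timeout)
    return sync_timeout  # explicit value in microseconds


if __name__ == "__main__":
    print(scanline1_target(400, total_scanlines=1125))
    print(timeout_us(2, refresh_hz=60))
```

The total scanline count (1125 for the 1080p example) includes the blanking interval, so it is larger than the visible vertical resolution.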
Summarizing, you may start experiments with scanline sync with the following presets:
For traditional VSync with low input lag:
Code:
SyncScanline0=1
SyncScanline1=0
SyncTimeout=0
In this case the tearline position is fixed at the top of the frame, so you can move it down by tuning and increasing the SyncScanline0 value.
For adaptive VSync with low input lag on 60Hz refresh rate:
Code:
SyncScanline0=1
SyncScanline1=0
SyncTimeout=16667
or calibrate the timeout automatically:
Code:
SyncScanline0=1
SyncScanline1=0
SyncTimeout=1
For FastSync (i.e. 2x refresh rate framerate, 120FPS for 60Hz refresh rate):
Code:
SyncScanline0=1
SyncScanline1=1
SyncTimeout=0
In this case tearlines will be at the top and in the middle of the frame; you can move them down by increasing the SyncScanline0 and SyncScanline1 values together. To control the timeout in this case, use either an explicit value:
Code:
SyncScanline0=1
SyncScanline1=1
SyncTimeout=8333
or calibrate the timeout automatically:
Code:
SyncScanline0=1
SyncScanline1=1
SyncTimeout=2
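For refresh rates other than 60Hz, the same presets can be generated with a small Python helper. This is just a sketch: the function names and the "none"/"explicit"/"auto" labels are mine, not RTSS options, and it only prints the three values without assuming which profile section they belong to.

```python
def scanline_sync_preset(sync_points: int, timeout: str = "none",
                         refresh_hz: float = 60.0) -> dict:
    """Build one of the starter presets.

    sync_points: 1 for VSync-style (one tearline), 2 for FastSync-style.
    timeout: "none" (0), "explicit" (microseconds) or "auto" (1/N calibration).
    """
    if sync_points not in (1, 2):
        raise ValueError("sync_points must be 1 or 2")
    if refresh_hz <= 0:
        raise ValueError("refresh_hz must be positive")
    if timeout == "none":
        sync_timeout = 0
    elif timeout == "explicit":
        # One sync interval in microseconds: the refresh time divided by
        # the number of sync points per refresh.
        sync_timeout = round(1_000_000 / (refresh_hz * sync_points))
    elif timeout == "auto":
        # N in [1,8] asks RTSS to calibrate to 1/N of the refresh time.
        sync_timeout = sync_points
    else:
        raise ValueError("timeout must be 'none', 'explicit' or 'auto'")
    return {
        "SyncScanline0": 1,
        "SyncScanline1": 1 if sync_points == 2 else 0,
        "SyncTimeout": sync_timeout,
    }


def to_profile_lines(preset: dict) -> str:
    """Render a preset as key=value lines, one per line."""
    return "\n".join(f"{key}={value}" for key, value in preset.items())


if __name__ == "__main__":
    # FastSync-style preset with an explicit timeout for a 144Hz display.
    print(to_profile_lines(scanline_sync_preset(2, "explicit", refresh_hz=144)))
```

With the default 60Hz this reproduces the values listed above, e.g. 16667 for adaptive VSync and 8333 for FastSync with an explicit timeout.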