
Re: Emulator Developers: Lagless VSYNC ON Algorithm

Posted: 28 Jun 2018, 11:52
by Chief Blur Buster
I'm discussing, on the RetroArch forums, a potential development path for adding frameslice beam racing to RetroArch.

The thread is here: https://forums.libretro.com/t/an-input- ... 07/724#724
Lemme see how we can accelerate things. I’ll spend a couple hours to write this proposal post to see if this is a good plan:

Some considerations:
  • The demo testing will help vet out problems/bugs.
  • The code is in C# which is not the same language as LibRetro.
  • Porting the core libraries to C or C++ was going to be a later project, or someone to volunteer.
So, need to figure out which is faster:
  • Wait for me to release source code (or gain private access to my git repo)
  • Use it as a private sandbox to learn frameslice beamracing
  • Write LibRetro frameslice beamracing from scratch
Or
  • Bootstrap by learning from WinUAE codebase which already has frameslice beamracing
  • Use it as an educational sandbox
  • Write LibRetro frameslice beamracing from scratch
Or
  • Wait (longer) for the C/C++ ports of the raster calculator modules
  • Use it directly within LibRetro frameslice beamracing.
With any of these approaches, some elements need to be written from scratch, since some LibRetro prerequisites are independent of all the above. Let me see if I can suggest a blueprint of how to proceed…

Recommended Hook
  • Add the per-raster callback function called “retro_set_raster_poll”
  • The arguments are identical to “retro_set_video_refresh”
  • Do it to one emulator module at a time (begin with the easiest one).
The emulator module calls the raster poll after every emulator scan line plotted. The incomplete contents of the emulator framebuffer (complete up to the most recently plotted emulator scanline) are provided. This allows centralization of frameslice beamracing in the quickest and simplest way.
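To make the proposed hook concrete, here is a minimal sketch. The callback argument list is copied from libretro's existing retro_video_refresh_t; the names retro_raster_poll_t and retro_set_raster_poll are only the proposal from this post (they do not exist in libretro today), and run_one_emulated_frame is a hypothetical stand-in for a real emulator module. It assumes the full frame height is passed on every call and that the frontend counts calls to know how many rows are complete.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same argument list as libretro's retro_video_refresh_t.
// HYPOTHETICAL: retro_raster_poll_t is the proposed hook, not an existing API.
typedef void (*retro_raster_poll_t)(const void *data, unsigned width,
                                    unsigned height, std::size_t pitch);

static retro_raster_poll_t g_raster_poll = nullptr;

// HYPOTHETICAL registration function, mirroring retro_set_video_refresh().
void retro_set_raster_poll(retro_raster_poll_t cb) { g_raster_poll = cb; }

// Toy emulator module: plots one scanline at a time into an offscreen
// framebuffer and calls the raster poll after each one.
// Returns the number of scanlines plotted.
unsigned run_one_emulated_frame(std::vector<std::uint16_t> &fb,
                                unsigned width, unsigned height) {
    assert(fb.size() >= static_cast<std::size_t>(width) * height);
    for (unsigned y = 0; y < height; ++y) {
        for (unsigned x = 0; x < width; ++x) {
            // Stand-in for real video emulation of this scanline.
            fb[static_cast<std::size_t>(y) * width + x] =
                static_cast<std::uint16_t>(y);
        }
        // The framebuffer is complete up to and including row y.
        if (g_raster_poll)
            g_raster_poll(fb.data(), width, height,
                          width * sizeof(std::uint16_t));
    }
    return height;
}
```

A frontend would register one callback and do all of the frameslice counting and presenting inside it, so each emulator module only needs the one extra call per scanline.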

Getting the VSYNC timestamps

This technique is only needed for the register-less method, to listen for VSYNC timestamps while in VSYNC OFF mode, and to poll the raster line:
  • Get your primary display adapter's device name, such as \\.\DISPLAY1. For me in C#, I use Screen.PrimaryScreen.DeviceName to get this, but in C/C++ you can use EnumDisplayDevices()
  • Next, call D3DKMTOpenAdapterFromHdc() with this info (it takes a device context, which you can create for that device name) to open the hAdaptor handle
  • For listening to VSYNC timestamps, run a thread with D3DKMTWaitForVerticalBlankEvent() on this hAdaptor handle. Then immediately record the timestamp. This timestamp represents the end of a refresh cycle and beginning of VBI. That’s your VSYNC callback signal.
Other platforms have various methods of getting a VSYNC event hook (e.g. Mac CVDisplayLinkOutputCallback), which roughly corresponds to the Mac’s blanking interval. If you are using the registerless method and generic precision clocks (e.g. RDTSC wrappers), these can potentially be the only #ifdefs in your cross-platform beam racing: just the various methods of getting VSYNC timestamps. The rest is not platform-specific.

Getting the current raster scan line number

For raster calculation you can do one of the two:

(A) Raster-register-less method: Use QueryPerformanceCounter to profile the times between refresh cycles. You can use the known fractional refresh rate (from QueryDisplayConfig) to bootstrap this “best-estimate” refresh rate calculation, and refine it in realtime. Calculating raster position is simply a relative time between two VSYNC timestamps, allowing 5% for VBI (meaning 95% of 1/60sec for 60Hz would be the display scanning out). NOTE: Optionally, to improve accuracy, you can dejitter. Use a trailing 1-second interval average to dejitter any inaccuracies (they calm to 1-scanline-or-less raster jitter), and ignore all outliers (e.g. missed VSYNC timestamps caused by computer freezes). Alternatively, just use the jittermargin technique to hide VSYNC timestamp inaccuracies.

(B) Raster-register method: Use D3DKMTGetScanLine to get your GPU’s current scanline on the graphics output. Wait at least 1 scanline between polls (e.g. sleep 10 microseconds between polls), since this is an expensive API call that can stress a GPU if busylooping on this register.

NOTE: If you need to retrieve the “hAdaptor” parameter for D3DKMTGetScanLine, get your adapter's device name such as \\.\DISPLAY1 via EnumDisplayDevices(), then call D3DKMTOpenAdapterFromHdc() for that adapter to open the hAdaptor handle, which you can finally pass to D3DKMTGetScanLine. It works with Vulkan/OpenGL/D3D9/10/11/12+… D3DKMT is simply a hook into the hAdaptor that is being used for your Windows desktop, which exists as a D3D surface regardless of what API your game is using, and all you need to know is the scanline number. So who gives a hoot about the “D3DKMT” prefix; it works fine for beamracing with OpenGL or Vulkan API calls. (KMT stands for Kernel Mode Thunk, but you don’t need Admin privileges to do this specific API call from userspace.)
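For method (A), the raster guess is just arithmetic on the time since the last VSYNC timestamp. A minimal sketch, assuming (as described above) that the timestamp marks the beginning of VBI and that VBI takes about 5% of the refresh period; the function name estimate_scanline and its parameters are made up for illustration:

```cpp
#include <cassert>
#include <cmath>

// Estimates the real raster position from the time elapsed since the most
// recent VSYNC timestamp (which is assumed to mark the START of VBI).
// vbi_fraction is the assumed share of the refresh period spent in VBI.
// Returns a scanline in [0, visible_lines), or -1 while still inside VBI.
int estimate_scanline(double seconds_since_vsync, double refresh_period,
                      int visible_lines, double vbi_fraction = 0.05) {
    assert(refresh_period > 0.0 && visible_lines > 0);
    assert(vbi_fraction >= 0.0 && vbi_fraction < 1.0);

    // Wrap around in case one or more VSYNC timestamps were missed.
    double phase = std::fmod(seconds_since_vsync, refresh_period) / refresh_period;
    if (phase < 0.0) phase += 1.0;

    if (phase < vbi_fraction) return -1;  // blanking: nothing is being scanned out

    // Remaining share of the period is the active scanout, top to bottom.
    double active = (phase - vbi_fraction) / (1.0 - vbi_fraction);
    int line = static_cast<int>(active * visible_lines);
    return line < visible_lines ? line : visible_lines - 1;
}
```

With a real VBI size (see the next section) in place of the 5% guess, the same function gets more accurate without any other change.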

Improved VBI size monitoring

You don’t need raster-exact precision for basic frameslice beamracing, but knowing the VBI size makes frameslice beamracing more accurate, since VBI size varies so much from platform to platform and resolution to resolution. Often it just varies a few percent, and most sub-millisecond inaccuracies are easily hidden within the jittermargin technique.

But, if you’ve programmed with retro platforms, you are probably familiar with the VBI (blanking interval) – essentially the overscan space between refresh cycles. This can vary from 1% to 5% of a refresh cycle, though extreme timings tweaking can make VBI more than 300% the size of the active image (e.g. Quick Frame Transport tricks – fast scan refresh cycles with long VBIs in between). For cross platform frameslice beamracing it’s OK to assume ~5% being the VBI, but there are many tricks to know the VBI size.
  • QueryDisplayConfig() on Windows will tell you the Vertical Total. (easiest)
  • Or monitoring the ratio of .INVBlank = true versus .INVBlank = false (via D3DKMTGetScanLine) by monitoring the flag changes (wait a few microseconds between polls, or 1 scanline delay – D3DKMTGetScanLine is an ‘expensive’ API call)
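Both tricks boil down to a ratio. A small sketch of each, with hypothetical helper names; the second one assumes the in-vblank flag was sampled at evenly spaced times over whole refresh cycles:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// VBI share of a refresh cycle from signal timings, e.g. a Vertical Total of
// 1125 lines carrying 1080 visible lines.
double vbi_fraction_from_timings(int vertical_total, int visible_lines) {
    assert(vertical_total > 0 && visible_lines > 0);
    assert(visible_lines <= vertical_total);
    return static_cast<double>(vertical_total - visible_lines) / vertical_total;
}

// VBI share estimated from polled in-vblank flags. ASSUMPTION: the samples
// are evenly spaced in time and cover whole refresh cycles.
double vbi_fraction_from_samples(const std::vector<bool> &in_vblank_samples) {
    assert(!in_vblank_samples.empty());
    std::size_t in_vbi = 0;
    for (bool flag : in_vblank_samples)
        if (flag) ++in_vbi;
    return static_cast<double>(in_vbi) / in_vblank_samples.size();
}
```

For a 1125-line signal with 1080 visible lines this gives 45/1125 = 4%, close to the ~5% rule of thumb.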

Turning The Above Data into Real Frameslice Beamracing

For simplicity, begin with emu Hz = real Hz (e.g. 60Hz)
  • Have a configuration parameter of number of frameslices (e.g. 10 frameslices per refresh cycle)
  • Let’s assume 10 frameslices for this exercise.
  • Actual screen 1080p means 108 real pixel rows per frameslice.
  • Emulator screen 240p means 24 emulator pixel rows per frameslice.
  • Your emulator module calls the centralized raster poll (retro_set_raster_poll) right after every emulator scan line. The centralized code counts the number of emulator pixel rows completed to fill a frameslice. The central code will do either (5a) or (5b):
    (5a) Return immediately to the emulator module if a full new frameslice has not yet been appended to the existing offscreen emulator framebuffer (don’t do anything to the partially completed framebuffer). Update a counter, do nothing else, return immediately.
    (5b) However, once a full frameslice worth has built up since the last frameslice presented, it’s time to present the next frameslice. Don’t return right away. Instead, immediately do an intentional CPU busyloop until the realraster reaches roughly 2 frameslice-heights above your emulator raster (relative screen-height wise). So if your emulator framebuffer is filled up to the bottom edge of where frameslice #4 is, then busyloop until the realraster hits the top edge of frameslice #3. Then immediately Present() or glutSwapBuffers() upon completing the busyloop. Then Flush() right away.
    NOTE: The tearline (invisible if the graphics at the raster are unchanged) will sometimes be a few pixels below the scan line number (the amount of time for a memory blit, which is memory bandwidth dependent). You can compensate for it, or you can just hide any inaccuracy in the jittermargin.
    NOTE2: This is simply the recommended beamrace margin to begin experimenting with: a 2 frameslice beamracing margin is very jitter-margin friendly.
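The (5a)/(5b) decision can be sketched without any graphics API at all. The struct below is a hypothetical helper, not code from WinUAE or Tearline Jedi: it only answers "is a frameslice due, and which real scanline should I busyloop until before presenting?". Clamping the target to scanline 0 for the first slices of a refresh cycle is my assumption; the waiting and the Present()/Flush() calls are left to the caller.

```cpp
#include <cassert>

// Per-scanline frameslice bookkeeping for the centralized raster poll.
struct FramesliceRacer {
    int emu_rows;       // emulator pixel rows per refresh cycle, e.g. 240
    int real_rows;      // real screen pixel rows, e.g. 1080
    int slices;         // frameslices per refresh cycle, e.g. 10
    int margin_slices;  // beamrace margin in frameslices (2 suggested above)
    int rows_done;      // emulator rows plotted so far in this refresh cycle

    FramesliceRacer(int emu, int real, int n, int margin = 2)
        : emu_rows(emu), real_rows(real), slices(n),
          margin_slices(margin), rows_done(0) {
        assert(n > 0 && emu >= n && real >= n && margin >= 0);
    }

    int emu_rows_per_slice() const { return emu_rows / slices; }
    int real_rows_per_slice() const { return real_rows / slices; }

    // Call once per plotted emulator scanline.
    // Returns -1 for case (5a): no full new frameslice yet, return at once.
    // Otherwise case (5b): returns the real scanline to wait for before
    // presenting the (still incomplete) emulator framebuffer.
    int on_scanline() {
        ++rows_done;
        const int per_slice = emu_rows_per_slice();
        const bool end_of_frame = rows_done >= emu_rows;
        if (rows_done % per_slice != 0 && !end_of_frame) return -1;  // (5a)

        // Number of frameslices filled so far (rounded up for a short last slice).
        const int completed = (rows_done + per_slice - 1) / per_slice;
        if (end_of_frame) rows_done = 0;  // next call starts a new refresh cycle

        // Top edge of the slice that sits margin_slices above the emu raster.
        int target_slice = completed - margin_slices;
        if (target_slice < 0) target_slice = 0;  // ASSUMPTION: clamp to screen top
        return target_slice * real_rows_per_slice();  // (5b)
    }
};
```

With 240 emulator rows, 1080 real rows and 10 frameslices, the call that completes frameslice #4 (row 96) returns 216, which is the top edge of frameslice #3 as in the example above.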
Image

Note: This 120Hz scanout diagram is from a different post of mine. Substitute an emulator refresh rate matching the real refresh rate, i.e. a monitor set to 60 Hz instead. This diagram is to help raster veterans conceptualize how modern-day tearlines relate to raster position as a time-based offset from VBI.

Image

Bottom line: As long as you keep repeatedly Present()-ing your incompletely-rasterplotted (but progressively more complete) emulator framebuffer ahead of the realraster, the incompleteness of the emulator framebuffer never shows glitches or tearlines. The display never has a chance to display the incompleteness of your emulator framebuffer, because the display’s realraster is showing only the latest completed portions of your emulator’s framebuffer. You’re simply appending new emulator scanlines to the existing emulator framebuffer, and presenting that incomplete emulator framebuffer always ahead of the real raster. No tearlines show up because the already-refreshed part is duplicate (unchanged) where the realraster is. It thus looks identical to VSYNC ON.

Precision Assumptions:
  • Scaling doesn’t have to be exact.
  • The two frameslice offset gives you a one-frameslice-ahead jitter margin
  • You can vary the height of consecutive frameslices if you want, slightly, or lots, or for rounding errors.
  • No artifacts show because the frameslice seams are well into the jitter margin.
Special Note On HLSL-Style Filters: You can use HLSL/fuzzyline style shaders with frameslices. WinUAE just does a full-screen redo on the incomplete emu framebuffer, but one could do it selectively (from just above the realraster all the way to just below the emuraster) as a GPU performance-efficiency optimization.

Adverse Conditions To Detect To Automatically Disable Beamracing

Optional, but for user-friendly ease of use, you can automatically enter/exit beamracing on the fly if desired. You can verify common conditions, making sure all of the following are met:
  • Rotation matches (scan direction same) = true
  • Supported refresh rate = true
  • Module has a supported raster hook = true
  • Emulator performance is sufficient = true
Exiting beamracing can be simply switching to “racing the VBI” (doing a Present() between refresh cycles), so you’re just simulating traditional VSYNC ON via VSYNC OFF via that manual VSYNC’ing. This is like 1-frameslice beamracing (next frame response). This provides a quick way to enter/exit beamracing on the fly when conditions change dynamically: a Surface tablet gets rotated, a module gets switched, the refresh rate gets changed mid-game, etc…
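The on-the-fly switch can be as small as one check per refresh cycle. A sketch with hypothetical names (BeamraceConditions, PresentMode); how each flag is actually detected is platform-specific and not shown:

```cpp
// The four conditions listed above, re-evaluated whenever something changes.
struct BeamraceConditions {
    bool rotation_matches;        // emulator scan direction == real scan direction
    bool refresh_rate_supported;  // e.g. real Hz matches (or is a multiple of) emu Hz
    bool module_has_raster_hook;  // emulator module calls the raster poll
    bool performance_sufficient;  // emuraster can stay ahead of realraster
};

enum class PresentMode {
    FramesliceBeamrace,  // multiple Present() calls per refresh cycle
    RaceTheVbi           // one Present() between refresh cycles (looks like VSYNC ON)
};

// Picks the presentation mode for the next refresh cycle.
PresentMode choose_present_mode(const BeamraceConditions &c) {
    const bool ok = c.rotation_matches && c.refresh_rate_supported &&
                    c.module_has_raster_hook && c.performance_sufficient;
    return ok ? PresentMode::FramesliceBeamrace : PresentMode::RaceTheVbi;
}
```

Calling this once per refresh cycle means a rotated tablet or a refresh rate change takes effect on the very next frame.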

General Best Practices

Debugging raster problems can be frustrating, so here’s knowledge collected by myself, Calamity, Toni Wilen, Unwinder, etc. These are big timesaver tips:
  1. Raster error manifests itself as tearline jitter.
  2. If jitter stays within the raster jitter margin, no tearing or artifacts show up.
  3. It’s an amazing performance profiling tool; tearline jitter makes your performance fluctuations very visible. In debug mode, use color-coded tints for your frameslices, to help make normally-hidden raster jitter more visible (WinUAE uses this technique).
  4. Raster error is more severe at the top edge than the bottom edge. This is because the GPU is more busy during this region (e.g. scheduled Windows compositing thread, stuff that runs every VSYNC event in the Windows kernel, etc). It’s minor, but it means you need to make sure your beam racing margin accommodates this.
  5. GPU power management. If your emulator is very light on a powerful GPU, your GPU’s fluctuating power management will amplify raster error, which may mean that having too few frameslices will amplify tearline jitter. Fixes include (A) configuring more frameslices, or (B) simply detecting when the GPU is too lightly loaded and making it busy one way or another (e.g. automatically using more frameslices). The rule of thumb is don’t let the GPU idle for more than a millisecond if you want scanline-exact rasters. Or you can simply use a bigger jittermargin to hide raster jitter.
  6. If you’re using D3DKMTGetScanLine… do not busyloop on it, because it stresses the GPU. Do a CPU busyloop of a few microseconds before polling the raster register again.
  7. Do a Flush() before your busyloop before your precision-timed Present(). This massively increases accuracy of frameslice beamracing. But it can decrease performance.
  8. Thread-switching on some older CPUs can cause RDTSC or QueryPerformanceCounter to unexpectedly tick backwards. So keep QueryPerformanceCounter polls on the same CPU thread with a BeginThreadAffinity. You probably already know this from elsewhere in the emulator, but it is mentioned here as being relevant to beamracing.
  9. Instead of rasterplotting emulator scanlines into a blank framebuffer, rasterplot on top of a copy of the emulator’s previous refresh cycle’s framebuffer. That way, there’s no blank/black area underneath the emulator raster. This will greatly reduce visibility of glitches during beamrace fails (falling outside of jitter margin – too far behind / too far ahead) – no tearing will appear unless within 1 frameslice of realraster, or 1 refresh cycle behind. That is a humongous jitter margin of almost one full refresh cycle. And this plot-on-old-refresh technique makes coarser frameslices practical – e.g. 2-frameslice beamracing (bottom-half screen Present() while still scanning out top half, and top-half screen Present() while scanning out bottom half). When out-of-bounds happens, the artifact is simply brief instantaneous tearing only for that specific refresh cycle. Typically, on most systems, the emulator can run artifactless, looking identical to VSYNC ON, for many minutes before you might see a brief instantaneous tearline from a momentary computer freeze, which instantly disappears when the beamrace gets back in sync.
  10. Some platforms support microsecond-accurate sleeping, which you can use instead of busylooping. Some platforms can also set the granularity of the sleep (there’s an undocumented Windows API call for this). As a compromise, some of us just do a normal thread sleep until a millisecond prior, then do a busyloop to align to the raster.
  11. Don’t worry about mid-scanline splits (e.g. HSYNC timings). We don’t have to worry about such sheer accuracy. The GPU transceiver reads full pixel rows at a time. Being late for a HSYNC simply means the tearline moves down by 1 pixel. Still within your raster jitter margin. We can jitter quite badly when using a forgiving jitter margin (e.g. 100 pixels amplitude raster jitter will never look different from VSYNC ON). The precision requirement for scanline-exact tearlines is the horizontal scanrate (e.g. 67kHz means 1/67000sec precision), which is way overkill for 10-frameslice beamracing, which only needs 1/600sec precision at 60Hz.
  12. Use multimonitor. Debugging is way easier with 2 monitors. Use your primary in exclusive full screen mode, with the IDE on a 2nd monitor. (Not all 3D frameworks behave well with that, but if you’re already debugging emulators, you’ve probably made this debugging workflow compatible already anyway). You can do things like write debug data to a console window (e.g. raster scanline numbers) when debugging pesky raster issues.
  13. Some digital display outputs exhibit micropacketization behavior (DisplayPort at lower resolutions especially, where multiple rows of pixels seem to squeeze into the same packet – my suspicion). So your raster jitter might vibrate in 2 or 4 scan line multiples rather than single-scanline multiples. This may or may not happen more often with interleaved data (DisplayPort cable handling 2 displays or other PCI-X data), but they are still pretty raster-accurate otherwise; the raster inaccuracies are sub-millisecond, and fall far within jitter margin. Advanced algorithms such as DSC (Display Stream Compression of new DisplayPort implementations) can amplify raster jitter a bit. But don’t worry; all known micro-packetization inaccuracies fall well within the jittermargin technique, so no problem. I only mention this in case you find raster-jitter differences between different video outputs.
  14. Become more familiar with how the jitter-margin technique saves your ass. If you do Best-Practice #9, you gain a full wraparound jittermargin (you see, step #9 allows you to Present() the previous refresh cycle on the bottom half of the screen, while still rendering the top half…). If you use 10 frameslices at 1080p, your jitter safety margin becomes (1080 - 108) = 972 scanlines before any tearing artifacts show up! No matter where the real raster is, your jitter margin is full wraparound to the previous refresh cycle. The two bounds are pageflip too late (more than 1 refresh cycle ago) and pageflip too soon (into the same frameslice still not completed scanning-out onto the display). Between these two bounds is one full refresh cycle minus one frameslice! So don’t worry about even a 25 or 50 scanline jitter inaccuracy (erratic beamracing where the margin between realraster and emuraster can randomly vary) in this case… It still looks perfectly like VSYNC ON until it goes out of that 972-scanline full-wraparound jitter margin. For minimum lag, you do want to keep the beam racing margin tight (you could make the beamrace margin adjustable as a config value, if desired – though I just recommend “aim the Present() at 2 frameslice margin” for simplicity), but you can fortunately surge ahead slightly or fall behind lots, and still recover with zero artifacts. The clever jittermargin technique that permanently hides tearlines in the jittermargin makes frameslice beam-racing very forgiving of transient background activity.
  15. Get familiar with how it scales up/down well to powerful and underpowered platforms. Yes, it works on Raspberry PI. Yes, it works on Android. While high-frameslice-rate beamracing requires a powerful GPU, especially with HLSL filters, low-frameslice beamracing makes it easier to run cycle-exact emulation at a very low latency on less powerful hardware - the emulator can merrily emulate at 1:1 speed (no surge execution needed), spending more time on power-consuming cycle-exactness or the ability to run on slower mobile GPUs. You’re simply early-presenting your existing incomplete offscreen emulator framebuffer (as it gets progressively more complete). Just adjust your frameslice count to an equilibrium for your specific platform. 4 is super easy on the latest Androids and Raspberry PI (basically 4 frameslice beam racing for 1/4th frame subrefresh input lag – still damn impressive for a PI or Android) while only adding about 10% overhead to the emulator.
  16. If you are on a platform with front buffer rendering (single buffer rendering), count yourself lucky. You can simply rasterplot new pixel rows directly into the front buffer instead of keeping the buffer offscreen (as you already are)! And plot on top of existing graphics (overwrite the previous refresh cycle) for a jitter margin of a full refresh cycle minus 1-2 pixel rows! Just provide a config parameter for the beamrace margin (vertical screen height percentage difference between emuraster and realraster), to adjust tightness of beamracing. You can support the frameslicing VSYNC OFF technique and the frontbuffer technique with the same suggested retro_set_raster_poll API – it makes it futureproof to future beamracing workflows.
  17. Yes, it works with curved scanlines in HLSL/filter type algorithms. Simply adjust your beamracing margin to prevent the horizontally straight realraster from touching the top parts of curved emurasters. Usually a few pixel rows will do the job. You can add a scanlines-offset-adjustment parameter or a frameslice-count-offset adjustment parameter.
Hopefully these best practices reduce the amount of hairpulling during frameslice beamracing.
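The sleep-then-busyloop compromise from tip #10 can be sketched portably with std::chrono. This is only an illustration with a made-up function name; the 1 ms spin window is the figure mentioned in the tip, and real sleep granularity depends on the platform, so treat the window as a tunable:

```cpp
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Waits for `deadline` by sleeping coarsely until `spin_window` before it,
// then busylooping for the remainder to land precisely.
// Returns the clock reading at which the wait ended.
Clock::time_point wait_until_precise(
        Clock::time_point deadline,
        std::chrono::microseconds spin_window = std::chrono::microseconds(1000)) {
    Clock::time_point now = Clock::now();

    // Coarse phase: a normal thread sleep, which may overshoot on some
    // platforms; the spin window is there to absorb that.
    if (deadline - now > spin_window)
        std::this_thread::sleep_until(deadline - spin_window);

    // Fine phase: intentional CPU busyloop until the deadline has passed.
    while ((now = Clock::now()) < deadline) {
        // spin
    }
    return now;
}
```

In a beamracer the deadline would be the predicted time at which the realraster reaches the target scanline, followed immediately by the Present().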

Special Notes
  • Special Note about Rotation: Emulator devices already should report their screen orientation (portrait, landscape), which generally also defines scan direction. QueryDisplayConfig() will tell you the real screen orientation. Default orientation is always top-to-bottom scan on all PC/Mac GPUs. A 90 degree counterclockwise display rotation changes scan direction into left-to-right. If emulating Galaxian, this is quite fine if you’re rotating your monitor (left-right scan) and emulating Galaxian (left-right scan) – then beamracing works.

  • Special Note about Unsupported Refresh Rates: Begin KISS and worry about 50Hz/60Hz only first. Start easy. Then iterate by adding support for other refresh rates, like multiples. 120Hz is simply cherrypicking every other refresh cycle to beam race. For the in-between refresh cycles, just leave the existing frame up (the already completed frame) until the refresh cycle that you want to beamrace is about to begin. In reality, there are very few unbeamraceable refresh rates – even beamracing 60fps onto 75Hz is simply beamracing cherrypicked refresh cycles (it’ll still stutter like 60fps@75Hz VSYNC ON though).

  • Advanced Note about VRR Beam Racing: Before beam racing variable refresh rate modes (e.g. enabling GSYNC or FreeSync and then beamracing that), wait until you’ve mastered all the above. So for now, disable VRR when implementing frameslice beamracing for the first time. Add VRR compatibility as a last step once you’ve gotten everything else working reasonably well. It’s easy to do once you understand it, but the concept of VRR beamracing is a bit tricky to grasp at first. VRR+VSYNC OFF supports beamracing on VRR refresh cycles. The main consideration is that the first Present() begins the manually-triggered refresh cycle (.INVBlank becomes false and ScanLine starts incrementing), and you can then frameslice beamrace that normally, like an individual fixed-Hz refresh cycle. Now, one additional very special, unusual consideration is the uncontrolled VRR repeat-refresh. One will need to do emergency catchup beamraces on VRR displays if a display decides to do an uncommanded refresh cycle (e.g. when a display+GPU decides to do a repeat-refresh cycle – this often happens when the framerate goes below the VRR range, e.g. under 30fps on a 30Hz-144Hz VRR display). Most VRR displays will repeat-refresh automatically until they have fully displayed an untorn refresh cycle. If this happens and you’ve already begun emulating a new emulator refresh cycle, you have to immediately start your beamrace early (rather than at the wanted precise time). So if you do a frameslice beamrace of a VRR refresh cycle, the GPU will immediately and automatically send a repeat-refresh to the display. There might be an API call to suppress this behavior, but we haven’t found one. This behavior is unwanted, and it makes beamraced 60fps onto a 75Hz FreeSync display difficult to do stutter-free. But it works fine for 144Hz VRR displays - we find it’s easy to be stutterfree when the VRR max is at least twice the emulator Hz, since we don’t care about those automatic-repeat-refresh cycles that aren’t colliding with the timing of the next beamrace.
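The refresh-cycle cherrypicking from the note on unsupported refresh rates can be sketched as a simple accumulator. This is my own illustration (the RefreshPicker name and the millihertz units are assumptions, chosen so the arithmetic stays in integers); it covers fixed-Hz modes only, not VRR:

```cpp
#include <cassert>

// Decides which real refresh cycles should be beamraced with a new emulator
// frame when the real refresh rate is at or above the emulator refresh rate.
// Rates are in millihertz (e.g. 59940 for 59.94Hz) to keep the math exact.
class RefreshPicker {
public:
    RefreshPicker(long real_mhz, long emu_mhz)
        : real_mhz_(real_mhz), emu_mhz_(emu_mhz), accumulator_(0) {
        assert(emu_mhz > 0 && real_mhz >= emu_mhz);
    }

    // Call once per real refresh cycle. Returns true if this refresh cycle
    // should be beamraced, false if the completed frame should be left up.
    bool next_refresh() {
        accumulator_ += emu_mhz_;
        if (accumulator_ >= real_mhz_) {
            accumulator_ -= real_mhz_;
            return true;
        }
        return false;
    }

private:
    long real_mhz_;
    long emu_mhz_;
    long accumulator_;
};
```

At 120Hz with a 60Hz emulator this picks every other refresh cycle; at 75Hz it picks 4 out of every 5, which is exactly the uneven cadence that stutters like 60fps@75Hz VSYNC ON.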
Is this a sufficient QuickStart for getting started with RetroArch frameslice beamracing?

At the very least, I hope the 2 hours I spent writing this post for you can help you achieve experimental 60Hz test beamracing within 2 or 3 days of programming.

(Details may take longer, e.g. debugging VRR beamrace support – but 60Hz frameslice beamracing is typically “easier-than-expected” to add according to two other emulator authors – Toni Wilen of WinUAE told me that.)

I’ll be able to provide more examples, suggestions, and snippets of source code (without violating demo rules – and besides, this way is probably faster and more C/C++ useful anyway) – here – or if you prefer email, contact me [email protected] … I got some C/C++ test code from Jerry of Duckware, the inventor of vsynctester.com, that has a working example of D3DKMTGetScanLine() in .cpp modules, if you’re still having difficulty with the instructions above.

Moving Forward

Before utilizing any existing code (e.g. WinUAE or Tearline Jedi or anything else) – I think the first priority is to blueprint it out and decide how to extend the RetroArch API. I propose adding retro_set_raster_poll as described… what do you think? Something with the least pain to add. The raster poll technique is probably a move we have to do regardless.

The hooking technique will have a huge impact on how we decide to frameslice-beamrace, and how flexible it can be made.

Did my post help? Need some code examples by email?

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Posted: 16 Jul 2018, 23:23
by Chief Blur Buster
Wow!

There is now a $1050 open-source bounty prize for adding beam racing to the RetroArch emulator:

GitHub: https://github.com/libretro/RetroArch/issues/6984

BountySource: https://www.bountysource.com/issues/608 ... untysource

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Posted: 06 Mar 2019, 16:29
by Tommy
Belated query, but just to check in on this since I'm almost finally in a position to have a play around in my little sandbox. On the Mac side of things, am I right to understand that the current best implementation revolves around:
  • a CVDisplayLink to obtain retrace notifications (and CVDisplayLinkGetActualOutputVideoRefreshPeriod to get the jitterless frame length if available?); and
  • approximating the current raster position based on the amount of time since vsync plus a guess about the retrace period, given the total number of visible lines on the display?
Apple's site has become incredibly difficult to search lately; I guess I'm asking whether there are line length or retrace period length calls that I just haven't yet uncovered.

Re: Emulator Developers: Lagless VSYNC ON Algorithm

Posted: 07 Mar 2019, 16:47
by Chief Blur Buster
Apple's platform is indeed one of the more difficult to get an accurate VBLANK timing from.

The easiest way before you touch your emulator code for synchronizing emu-raster to real-raster -- is to first write a "Hello World" type program to generate a stationary tearline at a specified position in the screen, much like this:

phpBB [video]

Tommy wrote:
  • a CVDisplayLink to obtain retrace notifications (and CVDisplayLinkGetActualOutputVideoRefreshPeriod to get the jitterless frame length if available?); and
That's correct. These retrace notifications occur roughly at the bottom edge of the refresh cycle, right between the bottom of the refresh cycle and the start of the retrace interval.
Tommy wrote:
  • approximating the current raster position based on the amount of time since vsync plus a guess about the retrace period, given the total number of visible lines on the display?
Yep. Use a standard figure of a 45 scanline retrace period. 480p uses 45 scanlines (525 signal size), and 1080p uses 45 scanlines (1125 signal size). Not all retrace periods are that size, but they will be close to that ballpark. Even if the guess is wrong, you'll only be roughly 1%-4% off, which is only a small fraction of a refresh cycle of error in a raster guess (without access to a raster register). That is easily accommodated within an adjustable beam-chase margin (e.g. the jitter margin technique -- like, around 1 to 2ms jitter margin).

First, for the notifications:

To verify the phase-offset of the notifications -- you can actually verify the phase timings of this via VSYNC ON frame presentation (e.g. Present() in Direct3D, glutSwapBuffers() in OpenGL). Max-out your framerate and your queue, and these calls begin blocking. These will unblock during the VBI.

Check the microsecond timestamps of this against the retrace notifications and see if they match up. If they match up, then you've confirmed that they're aligned to VBI.

A 1-2ms error is okay as long as you put roughly a 1-2ms padding margin in the "beam-chase" of the emuraster ahead of realraster, to absorb all that performance jitter before the realraster catches up to an accidentally slow emuraster (and starts artifacting with tearing). The easiest way to accommodate this is, indeed, simply an adjustable beam-chase margin (preferably in sub-millisecond increments, or scan-line increments). So a user can simply slide a slider until a horizontally-scrolling game (e.g. Super Mario) stops tearing -- that's your sweet spot. One can also use the color-coded slices technique that Calamity added to his experimental GroovyMAME patch.

To improve accuracy further, if there's a small phase offset (e.g. 5% of a refresh cycle), it's possible one of them is aligned to the beginning of VBI and the other one is aligned to the end of VBI. The lower value represents the beginning of VBI and the higher value represents the end of VBI, so you can compensate accordingly. But AFAIK, the numbers should be aligned (sub millisecond on faster systems).
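Comparing the two timestamp feeds comes down to a wrapped difference. A small sketch with a hypothetical helper name, assuming both timestamps are in seconds on the same clock and that the refresh period is already known:

```cpp
#include <cassert>
#include <cmath>

// Signed phase offset of timestamp `b` relative to timestamp `a`, wrapped
// into [-period/2, period/2). Both timestamps may come from different
// refresh cycles; only their position within the cycle is compared.
// Example inputs: a = retrace notification time, b = time a blocking
// VSYNC ON present call returned.
double phase_offset(double a, double b, double period) {
    assert(period > 0.0);
    double d = std::fmod(b - a, period);  // in (-period, period)
    if (d < -period / 2.0) d += period;
    else if (d >= period / 2.0) d -= period;
    return d;
}
```

An offset near zero means the two sources agree on where the VBI is; a steady offset of a few percent of the period suggests one is aligned to the start of VBI and the other to the end, as described above.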

I'm hoping that there's not much dynamic overheads to worry about (e.g. notifications being lower priority than other MacOS processing), but it will be prudent to use a high priority thread for the timestamping of the callback notifications, to minimize any errors caused by CPU/GPU usage fluctuations.

I'd love to see the results of your "Hello World" research, as the Mac compile of Tearline Jedi could use some accuracy improvements.

You'll want a very good VSYNC timestamp de-jittering routine.
Input = a feed of VSYNC timestamps with jitter + missing VSYNC timestamps
Output = an output of ultra-accurate corrected VSYNC timestamps with missing VSYNC timestamps filled-in.

I have that already in Tearline Jedi (C#) and another programmer on Blur Busters forums donated some open source code (C++) that does this. I could dig up this C++ code for you, to use in your VSYNC timestamp de-jittering routine.
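Until that code is dug up, here is one possible shape for such a routine. To be clear, this is not the Tearline Jedi code or the donated C++ code; it is a simple phase-locked filter I am substituting for illustration, and the class name, gains and outlier threshold are all assumptions. It takes raw jittery timestamps (with gaps), tracks a smoothed refresh period and phase, ignores outliers, and can extrapolate the most recent VSYNC time even when timestamps were missed.

```cpp
#include <cassert>
#include <cmath>

// ASSUMED tuning values, not taken from any existing implementation.
constexpr double kPhaseGain = 0.1;    // how fast the phase follows new data
constexpr double kPeriodGain = 0.01;  // how fast the period estimate adapts
constexpr double kOutlier = 0.1;      // reject errors above 10% of a period

class VsyncDejitter {
public:
    // nominal_period: best-guess refresh period in seconds (e.g. from the
    // display mode), used to bootstrap the estimate.
    explicit VsyncDejitter(double nominal_period)
        : period_(nominal_period), anchor_(0.0), locked_(false) {
        assert(nominal_period > 0.0);
    }

    // Feed one raw VSYNC timestamp in seconds. Gaps are fine.
    void add(double raw_timestamp) {
        if (!locked_) { anchor_ = raw_timestamp; locked_ = true; return; }

        // How many refresh cycles passed since the last corrected VSYNC
        // (more than 1 means timestamps were missed).
        long cycles = std::lround((raw_timestamp - anchor_) / period_);
        if (cycles < 1) return;  // duplicate or bogus timestamp

        double predicted = anchor_ + cycles * period_;
        double error = raw_timestamp - predicted;

        if (std::fabs(error) > kOutlier * period_) {
            anchor_ = predicted;  // outlier: keep free-running on the prediction
            return;
        }
        period_ += kPeriodGain * error / cycles;    // refine refresh period
        anchor_ = predicted + kPhaseGain * error;   // nudge phase toward data
    }

    double period() const { return period_; }

    // Corrected time of the most recent VSYNC at or before `now`,
    // filling in any VSYNCs that were never reported.
    double last_vsync_before(double now) const {
        assert(locked_);
        double cycles = std::floor((now - anchor_) / period_);
        return anchor_ + (cycles > 0.0 ? cycles : 0.0) * period_;
    }

private:
    double period_;
    double anchor_;
    bool locked_;
};
```

One known limitation of this sketch: if the real refresh period suddenly changes by more than the outlier threshold, it keeps free-running and never relocks, so a real implementation would want a reset path for mode changes.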

Now that this thread is bumped-up again....
As a reminder to late-arriving readers, accessing cross platform beam racing can be distilled down to simply these minimum system requirements:
1. Access to a VSYNC OFF mode (getting tearing)
2. Simultaneous access to timestamps of VSYNC ON (timestamps of refresh cycles)
3. Access to microsecond accurate clocks (e.g. RDTSC, QueryPerformanceCounter, std::chrono::high_resolution_clock::now(), etc)

Then that's enough information to do real "raster interrupts" in a cross platform way (taking advantage of the fact that VSYNC OFF tearlines are simply rasters). The fact that VSYNC OFF tearlines are generically platform-independent has provided the magic recipe for cross-platform "raster interrupts" (even if it's only an approximate scanline number), something formerly not thought to be practical.

The rest is simply mathematics and common sense
- Check screen rotation API. Default screen rotation will be assumed top-to-bottom scan sequence.
- Use surge execution if the math says the display is scanning faster than you need. e.g. 60fps on a 144Hz FreeSync display will scan-out the "60Hz" refresh cycle in 1/144sec, so your emulator will temporarily (in one jiffy 1/60sec) have to execute faster at a 144:60 ratio, to keep the emuraster ahead of the speedy realraster.
- You can follow additional rules to beam-race a GSYNC or FreeSync mode; GSYNC/FreeSync displays are still raster-scanned-out behind the scenes, you just need "GSYNC+VSYNC OFF" or "FreeSync+VSYNC OFF" and the rest is simply algorithmic, explained in an earlier post. You can realtime raster-sync (emuraster + realraster) and frameslice beam-race a 60fps refresh cycle on 144Hz and 240Hz VRR monitors.
- You can dynamically enable/disable beam raced frameslicing as needed (e.g. screen rotation mismatch, or performance suddenly slows down to unacceptable error margins). Basically switch to pretty much Presenting the whole frame during the VBI all at once -- it simply looks like VSYNC ON when you present a VSYNC OFF framebuffer with the tearline between refresh cycles. That's the same technique that the new RTSS Scan Line Sync uses. So simply switch to one-fullscreen-frameslice-mode whenever the emulator scan direction diverges from the real scan direction, e.g. user has rotated their computer monitor into portrait mode, or user has rotated a Microsoft Surface tablet into portrait mode or upsidedown, etc. So always check that screen orientation is beam-race friendly.
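The surge execution math from the list above is a one-liner, shown here as a sketch with a made-up function name. It assumes the display scans out each refresh cycle at its maximum rate (as in the 144Hz FreeSync example) and that the emulator never needs to run slower than real time:

```cpp
#include <cassert>

// Speed-up factor the emulator needs WHILE a refresh cycle is being scanned
// out, so the emuraster can stay ahead of the realraster.
// scanout_hz: how fast the display scans out one refresh cycle (e.g. 144).
// emu_hz: the emulated machine's refresh rate (e.g. 60).
double surge_ratio(double scanout_hz, double emu_hz) {
    assert(scanout_hz > 0.0 && emu_hz > 0.0);
    double ratio = scanout_hz / emu_hz;
    return ratio > 1.0 ? ratio : 1.0;  // never slower than 1:1
}
```

For 60fps on a 144Hz scanout this gives 144/60 = 2.4x during the 1/144sec scanout, after which the emulator idles until the next refresh cycle it wants to beamrace.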

Here's a pouet.net thread about Tearline Jedi that I've programmed.
http://www.pouet.net/topic.php?which=11422&page=1
I haven't released the app yet because I've been so busy, and I would love to refine the Mac implementation a bit more (I have less access to Macs at the moment) so I wouldn't mind pooling open source refinements, since I'm finding the Mac implementation in need of a bit more strengthening.