I found the most interesting part of the NIST outage post [1] to be NIST's special Time Over Fiber (TOF) program [2], which "provides high-precision time transfer by other service arrangements; some direct fiber-optic links were affected and users will be contacted separately."
I've never heard of this! Very cool service, presumably for … quant / HFT / finance firms (maybe for compliance with FINRA Rule 4590 [3])? Telecom providers synchronizing 5G clocks for time-division duplexing [4]? Google/hyperscalers as input to Spanner or other global databases?
Seriously fascinating to me -- who would be a commercial consumer of NIST TOF?

[1] https://groups.google.com/a/list.nist.gov/g/internet-time-se...
[2] https://www.nist.gov/pml/time-and-frequency-division/time-se...
[3] https://www.finra.org/rules-guidance/rulebooks/finra-rules/4...
[4] https://www.ericsson.com/en/blog/2019/8/what-you-need-to-kno...
I never saw a need for this in HFT. In my experience, GPS was used instead, but there was never any critical need for microsecond accuracy in live systems. Sub-microsecond latency, yes, but when that mattered it was in order to do something as soon as possible rather than as close as possible to Wall Clock Time X.
Still useful for post-trade analysis; perhaps you can determine that a competitor now has a faster connection than you.
The regulatory requirement you linked (and other typical requirements from regulators) allows a tolerance of one second, so it doesn't call for this kind of technology.
> I never saw a need for this in HFT. In my experience, GPS was used instead, but there was never any critical need for microsecond accuracy in live systems.
My guess would be scientific experiments where they need to correlate or sequence data over large regions. Things like correlating gravitational waves with radio signals and gamma ray bursts.
But they do not need absolute time, and internal rubidium clocks can keep the required accuracy for a few days. After that, sync can be transferred with a portable plug, which is completely viable in tactical/operational level EW systems.
Google doesn't use chrony specifically, just an algorithm that is somewhat chrony-like (but very different in other ways). It's called Google TrueTime.
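For anyone who hasn't run into it: the interesting part of TrueTime is that it exposes uncertainty explicitly. Instead of a single timestamp you get an interval the true time is guaranteed to fall in, and Spanner waits out that uncertainty before making a commit visible. A minimal sketch of that shape (the names, the 4 ms bound, and the polling loop are hypothetical illustrations, not Google's actual API):

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # lower bound on "true" time, seconds since the epoch
    latest: float    # upper bound on "true" time

def tt_now(uncertainty_s: float = 0.004) -> TTInterval:
    """Hypothetical TrueTime-style call: rather than one timestamp,
    return bounds the true time is known to lie within. The real
    system derives the bound from GPS and atomic references."""
    t = time.time()
    return TTInterval(t - uncertainty_s, t + uncertainty_s)

def commit_wait(commit_ts: float) -> None:
    """Spanner-style commit wait: don't expose a commit until even the
    earliest possible 'now' has passed the chosen commit timestamp, so
    no later reader can be handed an earlier time."""
    while tt_now().earliest < commit_ts:
        time.sleep(0.001)

ts = tt_now().latest   # pick a commit timestamp at the upper bound
commit_wait(ts)        # then wait out the uncertainty
print("safe to acknowledge commit at", ts)
```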
This page: https://tf.nist.gov/tf-cgi/servers.cgi shows that NIST has more than 16 NTP servers on IPv4; of those, 5 are in Boulder and were affected by the power failure. The rest were fine.
However, most entities should not be using these top-level servers anyway, so this should have been a problem for exactly nobody.
I believe time.nist.gov round-robins DNS requests, so there's a chance you'd have connected to a Boulder server. Some people would therefore have gotten NIST time that was 5 μs off.
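For the curious, you can watch the rotation yourself; a quick sketch (nothing assumed beyond the public hostname) that resolves time.nist.gov a few times and prints the addresses handed back:

```python
import socket

# Resolve time.nist.gov repeatedly; the A records rotate, so different
# lookups (and different clients) can land on different NIST sites.
for _ in range(3):
    infos = socket.getaddrinfo("time.nist.gov", 123, socket.AF_INET, socket.SOCK_DGRAM)
    print(sorted({info[4][0] for info in infos}))
```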
Who does use those top-level servers? Aren’t some of them propagating the error, or are all secondary-level servers configured to use dispersed top-level servers? And how do they decide who is right when they don’t match?
Is pool.ntp.org dispersed across possible interference and error correlation?
You can look at who the "Stratum 2" servers are, in the NTP.org pool and otherwise. Those are servers that sync from Stratum 1, like NIST.
Anyone can join the NTP.org pool so it's hard to make blanket statements about it. I believe there's some monitoring of servers in the pool but I don't know the details.
For example, Ubuntu systems point to their Stratum 2 timeservers by default, and I'd have to imagine that NIST is probably one of their upstreams.
An NTP server usually has multiple upstream sources and can steer its clock to minimize the error across them, as well as detect misbehaving servers and reject them ("falsetickers"). Different NTP server implementations may do this a bit differently.
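The classic selection step is an interval intersection in the spirit of Marzullo's algorithm: each source reports an offset plus an error bound, and sources whose intervals don't overlap where most of the others do are discarded as falsetickers. A rough sketch of the idea (deliberately simplified, not any particular daemon's code):

```python
def best_interval(sources):
    """sources: list of (name, offset_s, error_s), each claiming the
    true offset lies in [offset - error, offset + error].
    Marzullo-style sweep: find the point covered by the most intervals;
    sources whose intervals miss that point are the falsetickers."""
    edges = []
    for name, off, err in sources:
        edges.append((off - err, +1))   # interval opens
        edges.append((off + err, -1))   # interval closes
    edges.sort()
    best_count = count = 0
    best_point = None
    for point, delta in edges:
        count += delta
        if count > best_count:
            best_count, best_point = count, point
    return best_point, best_count

srcs = [("a", 0.000002, 0.000005),
        ("b", -0.000001, 0.000004),
        ("c", 0.250000, 0.000005)]   # wildly off relative to the others
point, agreeing = best_interval(srcs)
print(f"{agreeing} of {len(srcs)} sources agree around offset {point:+.6f} s")
```

The real NTP selection and clustering algorithms add refinements (weighting by stratum, jitter, root distance), but the core "believe the intervals that agree" idea is the same.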
From my own experience managing large numbers of routers, and troubleshooting issues, I will never use pool.ntp.org again. I’ve seen unresponsive servers as well as incorrect time by hours or days. It’s pure luck to get a good result.
Instead I’ll stick to a major operator like Google/Microsoft/Apple, which have NTP systems designed to handle the scale of all the devices they sell, and are well maintained.
Nitpick: UTC stands for Coordinated Universal Time. The ordering of the letters was chosen to not match the English or the French names so neither language got preference.
That doesn't quite match what the Wikipedia page says:
> The official abbreviation for Coordinated Universal Time is UTC. This abbreviation comes as a result of the International Telecommunication Union and the International Astronomical Union wanting to use the same abbreviation in all languages. The compromise that emerged was UTC, which conforms to the pattern for the abbreviations of the variants of Universal Time (UT0, UT1, UT2, UT1R, etc.).
> ... in English the abbreviation for coordinated universal time would be CUT, while in French the abbreviation for "temps universel coordonné" would be TUC. To avoid appearing to favor any particular language, the abbreviation UTC was selected.
Not exactly the topic of discussion, but also not not on topic: just wanted to sing the praises of chrony, which has performed better than the traditional OS-native NTP clients in our testing on a myriad of real and virtualized hardware.
I'm missing the nuance, or perhaps the difference, between the first scenario, where sending inaccurate time was worse than sending no time, and the present, where they are sending inaccurate time. Sorry if it's obvious.
The 5 µs inaccuracy is basically irrelevant to NTP users. From the second update to the Internet Time Service mailing list [1]:
> To put a deviation of a few microseconds in context, the NIST time scale usually performs about five thousand times better than this at the nanosecond scale by composing a special statistical average of many clocks. Such precision is important for scientific applications, telecommunications, critical infrastructure, and integrity monitoring of positioning systems. But this precision is not achievable with time transfer over the public Internet; uncertainties on the order of 1 millisecond (one thousandth of one second) are more typical due to asymmetry and fluctuations in packet delay.

[1] https://groups.google.com/a/list.nist.gov/g/internet-time-se...
> Such precision is important for scientific applications, telecommunications, critical infrastructure, and integrity monitoring of positioning systems. But this precision is not achievable with time transfer over the public Internet
How do those other applications obtain the precise value they need without encountering the Internet issue?
It's a good question, and I wondered the same. I don't know, but I'd postulate:
As it stands at the minute, the clocks are a mere 5 microseconds out and will slowly get better over time. That's well below the uncertainty of time transfer over the public Internet, so they know it's not going to have a major effect on anything.
When the event started and they lost power and access to the site, they also lost their management access to the clocks. At this point they don't know how wrong the clocks are, or how much more wrong they're going to get.
If someone restores power to the campus, the clocks are going to come online (all the switches and routers connecting them to the internet suddenly boot up) before they've had a chance to get admin control back. If something happened while they were offline and the clocks drifted significantly, then when they came back online half the world might decide to believe them and suddenly step their own clocks to follow. This could cause absolute havoc.
Potentially safer to scram something than have it come back online in an unknown state, especially if (lots of) other things are going to react to it.
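That's roughly what client-side sanity limits are for: ntpd, for instance, refuses to follow an apparent offset beyond its panic threshold (1000 s by default) and exits instead, and chrony only steps the clock when explicitly allowed via makestep. A toy sketch of that kind of gate (the thresholds are illustrative, loosely modelled on ntpd's defaults):

```python
# Toy illustration of the step/slew/panic decision an NTP client makes
# when an upstream source suddenly reports a very different time.
STEP_THRESHOLD_S = 0.128     # small offsets: just slew; above this, step
PANIC_THRESHOLD_S = 1000.0   # huge offsets: refuse and wake a human

def handle_offset(offset_s: float) -> str:
    magnitude = abs(offset_s)
    if magnitude >= PANIC_THRESHOLD_S:
        return "panic: refuse to follow this source; needs operator review"
    if magnitude >= STEP_THRESHOLD_S:
        return "step the clock once, then resume normal slewing"
    return "slew: nudge the clock rate until the offset disappears"

for off in (0.000005, 0.5, 86400.0):
    print(f"{off:>12.6f} s -> {handle_offset(off)}")
```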
In the last NIST post, someone linked to The Time Rift of 2100: How We lost the Future --- and Gained the Past. It's a short story that highlights some of the dangers of fractured time in a world that uses high precision timing to let things talk to each other: https://tech.slashdot.org/comments.pl?sid=7132077&cid=493082...
> […] where sending inaccurate time was worse than sending no time […]
When you ask a question, it is sometimes better to not get an answer—and know you have not gotten an answer—than to get the wrong answer. If you know that a 'bad' situation has arisen, you can start contingency measures to deal with it.
If you have a fire alarm: would you rather have it fail in such a way that it gives no answer, or fail in a way where it says "things are okay" even if it doesn't know?
I work at a particle accelerator. We use White Rabbit (https://white-rabbit.web.cern.ch/) to synchronize some very sensitive devices, mostly the RF power systems and related data acquisition systems, down to nanosecond accuracy.
As far as I'm aware they just timestamp the sample streams against a local GPS-backed atomic reference. Then, when they get the data/tapes together in one computing center, they can run a more sophisticated correlation entirely in software to smooth things out.
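For a sense of what that after-the-fact correlation looks like, here's a hedged sketch with synthetic data: two recordings of the same signal, offset by a known number of samples standing in for a residual clock offset, and a plain cross-correlation recovers it (numpy only; nothing here is specific to any real experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 1_000_000                          # pretend 1 MHz sample rate
signal = rng.standard_normal(5_000)

# Two stations record the same signal; station B's copy is shifted by a
# known number of samples, standing in for a residual clock offset.
true_lag = 37
a = signal[:-true_lag]
b = signal[true_lag:] + 0.1 * rng.standard_normal(len(signal) - true_lag)

# Cross-correlate and pick the lag where the two streams line up best.
corr = np.correlate(a, b, mode="full")
lag = int(np.argmax(corr)) - (len(b) - 1)
print(f"recovered offset: {lag} samples = {lag / fs * 1e6:.1f} µs")
```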
Maybe I missed something, but I don't quite understand the video title "NIST's NTP clock was microseconds from disaster". Is there some limit of drift before it's unrecoverable? Can't they just pull the correct time from the other campus if it gets too far off?
> The regulatory requirement you linked (and other typical requirements from regulators) allows a tolerance of one second, so it doesn't call for this kind of technology.
MiFID II (UK/EU) minimum is 1 µs granularity:
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=uriserv:...
You can get 50ns with this. Of course, you would verify at NIST.
Where does it say these are commercial consumers?
https://en.wikipedia.org/wiki/Schriever_Space_Force_Base#Rol...
> Building 400 at Schriever SFB is the main control point for the Global Positioning System (GPS).
To start with, probably for scientific stuff, à la:
* https://en.wikipedia.org/wiki/White_Rabbit_Project
But fibre-based time is important in case of GNSS time signal loss:
* https://www.gpsworld.com/china-finishing-high-precision-grou...
They're also the largest holder of IPv4 space, still. https://bgp.he.net/report/peers#_ipv4addresses
Think Google might have rolled their own clock sources and corrections.
Ex: Sundial, https://www.usenix.org/conference/osdi20/presentation/li-yul... / https://storage.googleapis.com/gweb-research2023-media/pubto... (pdf)
To say NIST was off is clickbait hyperbole.
> However, most entities should not be using these top-level servers anyway, so this should have been a problem for exactly nobody.
IMHO, most applications should use pool.ntp.org
(On Google TrueTime, see https://docs.cloud.google.com/spanner/docs/true-time-externa...)
I defer to the experts.
If (and it's hard to imagine how) GPS satellites were to get 5 µs out of whack, we would be back to Loran-C levels of accuracy for navigation.
... unless someone with real experience needing those tolerances chimes in and explains why it's true.
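For what it's worth, the back-of-the-envelope behind the GPS comparison above: an uncorrected satellite clock error turns into a pseudorange error at the speed of light, so 5 µs is on the order of 1.5 km of range error:

```python
C = 299_792_458          # speed of light, m/s
clock_error_s = 5e-6     # 5 µs of uncorrected satellite clock error
print(f"{C * clock_error_s:.0f} m of pseudorange error")   # ~1499 m
```

(Whether the whole constellation drifting together by the same amount would actually hurt a position fix is a separate question, since receivers solve for their own clock bias against a consistent system time; that may be part of the doubt being expressed here.)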