I wonder why timelines aren't implemented as a hybrid gather-scatter strategy chosen based on account popularity (a combination of fan-out to followers and a lazy fetch of popular followed accounts when a follower's timeline is served).
When you have a celebrity account, instead of fanning out every message to millions of followers' timelines, it would be cheaper to do nothing when the celebrity posts, and later, when serving each follower's timeline, fetch the celebrity's posts and merge them into the timeline. When millions of followers do that, it will be a cheap read-only fetch from a hot cache.
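Not how Bluesky actually implements it, but a minimal sketch of that hybrid push/pull idea; the threshold, dict names and in-memory stores are all made up for illustration:

    # Hypothetical hybrid fan-out: push posts from ordinary accounts to every
    # follower's stored timeline, but pull posts from "celebrity" accounts
    # (above a made-up follower threshold) only when a timeline is read.
    CELEB_THRESHOLD = 100_000   # assumption: tune to whatever "popular" means

    timelines = {}    # follower_id -> list of (ts, post_id), pre-materialized
    followers = {}    # author_id   -> set of follower_ids
    following = {}    # user_id     -> set of author_ids
    celeb_posts = {}  # author_id   -> list of (ts, post_id), hot-cache stand-in

    def on_post(author_id, ts, post_id):
        if len(followers.get(author_id, ())) >= CELEB_THRESHOLD:
            # Popular account: do nothing per-follower, just record the post once.
            celeb_posts.setdefault(author_id, []).append((ts, post_id))
        else:
            # Ordinary account: fan out to each follower's timeline at write time.
            for f in followers.get(author_id, ()):
                timelines.setdefault(f, []).append((ts, post_id))

    def read_timeline(user_id, limit=50):
        # Merge the pre-materialized timeline with a lazy fetch of celeb posts.
        merged = list(timelines.get(user_id, ()))
        for author in following.get(user_id, ()):
            if len(followers.get(author, ())) >= CELEB_THRESHOLD:
                merged.extend(celeb_posts.get(author, ()))
        merged.sort(reverse=True)   # newest first
        return merged[:limit]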
This is probably what we'll end up with in the long run. Things have been fast enough without it (aside from this issue), but there's a lot of low-hanging fruit for Timelines architecture updates. We're spread pretty thin from an engineering-hours standpoint atm, so there's a lot of intense prioritization going on.
Just to be clear, you are a Bluesky engineer, right?
Off-topic: how has it been dealing with the influx of new users in the aftermath of X's political/legal problems? Did you see an increase in toxicity around the network? And how has Bluesky moderation been dealing with it?
I've stood up machines for this before; I did not know it had a name. And I worked at the mouse company, and my parking spot was two over from J. Bieber's spot.
So now we have the Slashdot effect, the HN hug, and it's not Clarkson, it's... the Stephen Fry effect? Maybe it can be cross-discipline: there's a term for when lots of the UK turns their kettles on at the same time.
I should make a blog post to record all the ones I can remember.
Do you know the name of the problem or strategy used for solving the problem? I'd be interested in looking it up!
I own DDIA, but after a few chapters of how databases work behind the scenes, I begin to fall asleep. I have trouble understanding how to apply the knowledge to my work, but this seems like a useful thing with a clearer application.
> and later when serving each follower's timeline, fetch the celebrity's posts and merge them into the timeline
I think then you still have the 'weird user who follows hundreds of thousands of people' problem, just at read time instead of write time. It's unclear that this is _better_, though, yeah, caching might help. But if you follow every celeb on Bluesky (and I guarantee you this user exists) you'd be looking at fetching and merging _thousands_ of timelines (again, I suppose you could just throw up your hands and say "not doing that", and just skip most or all of the celebs for problem users).
Given the nature of the service, making reads predictably cheap and writes potentially expensive (which seems to be the way they've gone) seems like a defensible practice.
> I suppose you could just throw up your hands and say "not doing that", and just skip most or all of the celebs for problem users
Random sampling? It's not as though the user needs thousands of posts returned for a single fetch. Scrolling down and seeing some stuff that's not in chronological order seems like an acceptable tradeoff.
To serve a user timeline in single-digit milliseconds, it is not practical for the data store to load each item from a different place. Even with an index, the index itself can be contiguous on disk, but the payload is scattered all over the place if you keep it in a single large table.
Instead, you can drastically speed up performance if you are able to store data for each timeline somewhat contiguously on disk.
Think of it as pre-rendering. Between pre-rendering and JIT collecting, pre-rendering means more work, but it's async, and it means the timeline is ready whenever a user requests it, giving a fast user experience.
(Although I don't understand the "non-celebrity" part of your comment -- the timeline contains (pointers to) posts from whoever someone follows, and doesn't care who those people are.)
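As a toy illustration of the contiguity point (the key layout and the in-memory sorted list below are stand-ins for a real sorted key-value store, not anything Bluesky documents): if entries are keyed by (user, reverse timestamp), one user's timeline entries sit next to each other, so serving a page is one contiguous range scan instead of many random reads.

    # Toy stand-in for a sorted key-value store (LSM tree, clustered index, ...).
    # Keys are (user_id, -timestamp), so one user's timeline is a contiguous key
    # range, newest first.
    import bisect

    store = []  # sorted list of ((user_id, -ts), post_id)

    def write_entry(user_id, ts, post_id):
        bisect.insort(store, ((user_id, -ts), post_id))

    def read_page(user_id, limit=50):
        # Jump to the start of this user's key range, then scan forward.
        start = bisect.bisect_left(store, ((user_id, float("-inf")),))
        page = []
        for key, post_id in store[start:]:
            if key[0] != user_id or len(page) >= limit:
                break
            page.append(post_id)
        return page

    write_entry("alice", 100, "post-1")
    write_entry("bob",   101, "post-9")
    write_entry("alice", 102, "post-2")
    print(read_page("alice"))   # ['post-2', 'post-1'], newest first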
As a systems enthusiast I enjoy articles like this. It is really easy to get into the mindset of "this must be perfect".
In the Blekko search engine back end we built an index that was 'eventually consistent' which allowed updates to the index to be propagated to the user facing index more quickly, at the expense that two users doing the exact same query would get slightly different results. If they kept doing those same queries they would eventually get the exact same results.
Systems like this bring in a lot of control systems theory because they have the potential to oscillate if there is positive feedback (and in search engines that positive feedback comes from the ranker which is looking at which link you clicked and giving it a higher weight) and it is important that they not go crazy. Some of the most interesting, and most subtle, algorithm work was done keeping that system "critically damped" so that it would converge quickly.
Reading this description of how users' timelines are sharded, with the same sorts of feedback loops (in this case 'likes' or 'reposts'), it sounds like a pretty interesting problem space to explore.
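This isn't Blekko's actual ranker, just a generic damped-update sketch of the kind of thing being described: move a result's click weight toward the observed click-through rate with a small gain, so the click-to-rank feedback loop converges instead of overshooting and oscillating.

    GAIN = 0.1   # assumption: small gain ~ heavy damping; 1.0 would jump straight to the observation

    def update_weight(current_weight, observed_ctr):
        # Exponentially-weighted move toward the observation; with 0 < GAIN <= 1
        # a single update can never overshoot the target.
        return current_weight + GAIN * (observed_ctr - current_weight)

    w = 0.5
    for ctr in [0.9, 0.9, 0.9, 0.2, 0.2]:   # simulated click-through observations
        w = update_weight(w, ctr)
        print(round(w, 3))                  # converges smoothly, no oscillation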
I guess I hadn’t considered that search engines could be reranking pages on the fly as I click them. I’ve been seeing my DuckDuckGo results shuffle around for a while now thinking it’s an awful bug.
Like I click one page, don’t find what I want, and go back thinking “no, I want that other result that was below” and it’s an entirely different page with shuffled results, missing the one that I think might have been good.
That's connected with a basic usability complaint about current web interfaces: ads and recommended content aren't stable. You very well might want to engage with an ad after you are done engaging with what you wanted to engage with, but you might never see it again. Similarly, you might see two or three videos that you want to click on on the side of a YouTube video you're watching, but you can only click on one (though if you are thinking ahead you can open these in another tab).
On top of that immediate frustration, the YouTube style interface here
https://marvelpresentssalo.com/wp-content/uploads/2015/09/id...
collects terrible data for recommendations because, even though it gives them information that you liked the thumbnail for a video, they can't come to any conclusion about whether or not you liked any of the other videos. TikTok, by focusing on one video at a time, collects much better information.
I don't use DDG, but in my (very limited, just now) testing it doesn't seem to shuffle results unless you reload the page in some way. Is it possible your browser is reloading the page when you go back? If so, setting DDG to open links in new tabs might fix this problem.
This behavior started happening for me in the last few months. If I click on a result, then go back, I have different search results.
I've found a workaround, though – click back into the DDG search box at the top of the page and hit enter. This then returns the original search results.
Hi - I work on search at DuckDuckGo. Do you mind sharing a bit more detail about this issue? What steps would allow us to reproduce what you're seeing?
Similar to how Google Images loads lower-quality blurred thumbnails towards the bottom of the window at first, so that the user thinks they loaded faster.
This is less a question of perfection and more one of trade-offs. Laws of physics put a limit on how efficiently you can keep data in NYC and London in perfect sync, so you choose CAP-style trade-offs. There are also $/SLO trade-offs. Each 9 costs more money.
I like your example; it is very interesting. If I get to work on such interesting problems (or even just hear that someone on my team is working on one), I get happy.
Interesting problems are rare because, like a house, you might talk about brick vs. timber frame once, but you'll talk about cleaning the house every week!
Ok, I'm curious: since this strategy sacrifices consistency, has anyone thought about something that is not full fan-out on reads or on writes?
Let's imagine something like this: instead of writing to every user's timeline, a post is written once for each shard containing at least one follower. This caps the fan-out at write time to hundreds of shards. At read time, getting the content for a given user reads that hot slice and filters it down to the accounts they actually follow. It definitely has more load, but:
- the read is still colocated inside the shard, so latency remains low
- for mega-followers the page will not see older entries anyway
There are of course other considerations, but I'm curious what the load for something like that would look like (and I don't have the data or the infrastructure to test it). A rough sketch of the idea follows.
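Here is roughly what I mean, as a toy sketch; the shard count, hashing and data layout are all invented for illustration:

    NUM_SHARDS = 256                # assumption

    shard_slices = {}               # shard_id  -> list of (ts, author_id, post_id)
    followers = {}                  # author_id -> set of follower_ids
    following = {}                  # user_id   -> set of author_ids

    def shard_of(user_id):
        return hash(user_id) % NUM_SHARDS

    def on_post(author_id, ts, post_id):
        # Write fan-out is bounded by the shard count, not the follower count:
        # the post is appended once per shard that holds at least one follower.
        for shard_id in {shard_of(f) for f in followers.get(author_id, ())}:
            shard_slices.setdefault(shard_id, []).append((ts, author_id, post_id))

    def read_timeline(user_id, limit=50):
        # The read stays inside the reader's own shard; filter the shared slice
        # down to the authors this user actually follows.
        wanted = following.get(user_id, set())
        slice_ = shard_slices.get(shard_of(user_id), ())
        out = [(ts, pid) for ts, author, pid in slice_ if author in wanted]
        out.sort(reverse=True)
        return out[:limit]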
The lossy timeline solution basically means you skip updating the feed for some people who follow more than a reasonable number of accounts. I get that.
Seeing them get 96% improvements is insane. Does that mean they have a ton of users following an unreasonable number of people, or do they just have a very low limit for a reasonable number of follows? I doubt it's the latter, since that would mean a lot of people would be missing updates.
How is it possible to get such massive improvements when you're only skipping a presumably small % of people per new post?
EDIT: nvm, I rethought it. The issue is that a single user with millions of follows will constantly be written to, which slows down the fan-out service when a celebrity makes a post, since you're going through many DB pages.
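For anyone else trying to picture the mechanism, here is a rough sketch of lossy fan-out as described in the article. The drop formula below is an assumption, not necessarily Bluesky's exact one, and the 2,000 limit is just the example figure used in this thread; it does line up with the "follow 4k, miss ~50%" numbers discussed below.

    import random

    REASONABLE_LIMIT = 2_000   # example figure from this thread, not the tuned value

    def drop_probability(num_follows):
        # Assumed shape: no loss under the limit, and the drop chance grows the
        # further past the limit a follower is (4,000 follows -> ~50% dropped).
        if num_follows <= REASONABLE_LIMIT:
            return 0.0
        return 1.0 - REASONABLE_LIMIT / num_follows

    def fan_out(post_id, follower_follow_counts, write_to_timeline):
        for follower_id, num_follows in follower_follow_counts.items():
            if random.random() < drop_probability(num_follows):
                continue   # lossy: skip the hot-timeline write for this follower
            write_to_timeline(follower_id, post_id)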
When a system gets "overloaded", it typically enters a state of exponential performance degradation, i.e. it performs a self-DDoS.
> Seeing them get 96% improvements is insane
TFA is talking about P99 tail latencies. It does not sound too insane to reduce tail latencies by extraordinary margins. Remember, it's just a reshaping of the latency distribution; in this case the pathological cases get dropped.
> does that mean they have a ton of users following an unreasonable number of people
Look at the accounts of OnlyFans models, crypto influencers, etc. They follow thousands or even tens of thousands of accounts in the hope that we will follow them in return.
I don't see that accommodating this behavior is prosocial or technically desirable.
Can you think of a use case?
All sorts of bots want this sort of access, but whether there are legitimate reasons to grant it to them on a non-sharded basis is another question since a lot of these queries do not scale resources with O(n) even on a centralized server architecture.
Hmm. Twitter/X appears to do this at quite a low number, as the "Following" tab is incredibly lossy (some users are permanently missing) at only 1,200 followed people.
It's insanely frustrating.
Hopefully you're adjusting the lossy-ness weighting and cut-off by whether a user is active at any particular time? Because, otherwise, applying this rule, if the cap is set too low, is a very bad UX in my experience x_x
1200 people is really nothing, especially if you have a job tangentially related to social media (for example journalists). It's really simple: you are not the same type of user. You have 50 "acquaintances", they have 1200 "sources".
The article is talking about people who have following/follower counts in the millions. Those are dozens of writes per second in one feed and a fan-out of potentially millions. Someone following 1,200 people, if everyone actually posts once a day (most people do not), gets... a rate of about 0.014 writes per second.
They should be background noise, irrelevant to the discussion. That level of work is within reasonable expectation. What they're pointing out is that Twitter is aggressively anti-perfectionist for no good technical reason - so there must be a business reason for it.
I can come up with 100 people I'd want to follow on Twitter, and I don't even have an account. Don't dismiss other people's use-cases if you don't have or understand them.
> Additionally, beyond this point, it is reasonable for us to not necessarily have a perfect chronology of everything posted by the many thousands of users they follow, but provide enough content that the Timeline always has something new.
While I'm fine with the solution, the wording of this sentence led me to believe that the solution was going to be imperfect chronology, not dropped posts in your feed.
So, let's say I follow 4k people in the example and have a 50% drop rate. It seems a bit weird that if (4k - 1) of the accounts I follow post nothing in a day, I STILL have a 50% chance that I won't see the one account that does post. It seems to me that the algorithm should consider my feed's age (or the post freshness of the accounts I follow). Am I overthinking?
The "reasonable limit" is likely set based on experimentation, and thus on how much people post on average and the load it generates (so the real number is unlikely to be exactly "2000", IMHO).
If you follow a lot of people, how likely is it that their posting pattern is so different from the average? The more people you follow, the less likely that is.
So while you can end up in such situation in theory, it would need to be a very unusual (and rare) case.
I think the 'law of large numbers' says that it's very unlikely for you to follow 4k and have _none_ of them posting. You could artificially construct a counter-example by finding 4k open but silent accounts, but that's silly.
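For a rough sense of scale (the per-account posting probability is an illustrative assumption, not a measurement):

    # Back-of-the-envelope: even if each followed account posts on a given day
    # with only 1% probability, the chance that none of 4,000 accounts posts
    # at all is vanishingly small.
    p_post = 0.01
    n = 4_000
    p_silent_day = (1 - p_post) ** n
    print(f"P(no posts at all today) = {p_silent_day:.1e}")   # ~3.5e-18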
The other workaround is: follow everyone. Write some code to get what you want out of the jetstream event feed. https://docs.bsky.app/blog/jetstream
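Something like the following, as a starting point. The endpoint, query parameter and event field names are taken from the Jetstream docs linked above, so verify them there before relying on this sketch; the DID is a placeholder.

    # Minimal Jetstream consumer: subscribe to post commits and filter locally
    # for the accounts you care about (the "follow everyone yourself" workaround).
    import asyncio, json
    import websockets   # pip install websockets

    JETSTREAM = ("wss://jetstream2.us-east.bsky.network/subscribe"
                 "?wantedCollections=app.bsky.feed.post")
    WATCHED_DIDS = {"did:plc:example"}   # hypothetical: the accounts you "follow"

    async def main():
        async with websockets.connect(JETSTREAM) as ws:
            async for raw in ws:
                event = json.loads(raw)
                if event.get("kind") == "commit" and event.get("did") in WATCHED_DIDS:
                    record = event.get("commit", {}).get("record", {})
                    print(event["did"], record.get("text", ""))

    asyncio.run(main())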
Yeah, this seems concerning to me. Maybe now, while the platform is new, this isn't much of an issue. But as accounts go inactive, people will naturally collect "dead" accounts that they are still following. On Facebook it isn't uncommon for the old accounts of sociable people to have naturally collected thousands of friends.
It seems that what they are trying to measure is "busy timelines", and it seems like they could probably measure that more directly. For example, what is the number of posts in the timeline over the last 24h? It seems that it should be fairly easy to use this as the metric for calculating the drop rate.
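A sketch of what that alternative metric could look like; the threshold and data structures are made up purely for illustration:

    # Base the drop rate on how busy a timeline actually is (writes landing in
    # it over the last 24h) rather than on raw follow count.
    import time

    DAY = 24 * 3600
    BUSY_THRESHOLD = 2_000      # assumption: writes/day at which dropping starts
    recent_writes = {}          # follower_id -> list of write timestamps

    def record_write(follower_id, now=None):
        now = time.time() if now is None else now
        hits = [t for t in recent_writes.get(follower_id, []) if now - t < DAY]
        hits.append(now)
        recent_writes[follower_id] = hits

    def drop_probability(follower_id, now=None):
        now = time.time() if now is None else now
        rate = sum(1 for t in recent_writes.get(follower_id, ()) if now - t < DAY)
        return 0.0 if rate <= BUSY_THRESHOLD else 1.0 - BUSY_THRESHOLD / rate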
Anyone following hundreds of thousands of users is obviously a bot account scraping content. I'd ban them and call it a day.
However, I do love reading about the technical challenge. I think Twitter has a special architecture for celebrities with millions of followers. Given Bluesky is a quasi-clone, I wonder why they did not follow in these footsteps.
You don't need to follow anyone (or even have an account) to scrape content… Someone following a huge number of accounts usually just wants to get a lot of followers quickly through follow-backs.
Maybe not hundreds of thousands but I'd follow anybody that looks remotely interesting and then primarily use customized feeds. E.g. if I wanna hear about union news, my personal irl network, etc I check that feed
This does assume that scrapers are smart, and often they're really not. They have infrastructure for scraping HTML from webpages at scale and that is the hammer they use for all nails. (e.g. Wikipedia has to fight off scraper traffic despite full archives being available as torrents, etc.)
In this case I agree though, they're all spammers and/or "clout farmers", or trying to make an account seem more authentic for future scams. They want to generate follow notifications in the hope that some will follow them back (and if they don't, they unfollow again after some interval).
Bluesky has starter packs that allow you to mass-follow at the click of a button. Join 10 starter packs in one day and you are following over 1,000 people. Sometimes following others is the only way to get people to engage with your content.
No matter how high you set a maximum limit for interactions on social media (followers, friends, posts, etc), someone will reach the limit and complain about it. I can see why Bluesky would prefer a "soft limit", where going above the limit will degrade the experience. It gives more flexibility to adjust things later, and prevents obnoxious complaints from power users with outsized influence.
[1] - https://www.themarysue.com/twitter-justin-bieber-servers/
@bluesky devs, don't feel ashamed for doing this. It's exactly how to scale these kinds of extreme cases.
Hot shards were definitely an issue, though.
Thanks a lot for sharing this link.
Looking back at my early work with microservices I'm wondering how much time I would have saved by just manually setting a tongue weight.
— https://en.wikipedia.org/wiki/Blekko
Perhaps GP has a more interesting answer though.
They do; there are groups of users on Bluesky who follow inordinate numbers of other accounts to try and get follows back.
> at only 1,200 followed people.
I follow like, 50 people on bluesky. Who is following 1,200 people? What kind of value do you even get out of your feed?
There are only six users with over a million followers, and none with two million yet.
I'm sure they'll get there.
the only reason to mass-follow is for spam purposes.