Starlevel004 · 16 days ago
> I believe what is happening is that those images are being drawn by some script-kiddies. If I understand correctly, the website limited everyone to 1 pixel per 30 seconds, so I guess everyone was just scripting Puppeteer/Chromium to start a new browser, click a pixel, and close the browser, possibly with IP address rotation, but maybe that wasn't even needed.

I think you perhaps underestimate just how big of a thing this became basically overnight. I mentioned a drawing over my house to a few people and literally everyone instantly knew what I meant without even saying the website. People love /r/place style things every few years, and this having such a big canvas and being on a world map means that there is a lot of space for everyone to draw literally where they live.

indrora · 15 days ago
It's also way more than 1px/30s -- it's more like 20px/30s, and you have a "tank" of them, which you can expand to however big you want.

Placing pixels gives you points, which you can turn into more pixels or a bigger bag of pixels over time. I've seen people who have done enough pixel pushing that they get 3-4K pixels at a time.

Aurornis · 15 days ago
> I think you perhaps underestimate just how big of a thing this became basically overnight.

They don't need to estimate, because in the article they talked to the site and got its traffic numbers: an estimated 2 million users.

That’s 1500 requests per user, which implies a lot of scripting is going on.

zahlman · 15 days ago
> I think you perhaps underestimate just how big of a thing this became basically overnight. I mentioned a drawing over my house to a few people and literally everyone instantly knew what I meant without even saying the website.

On the other hand, this is the first I've heard of this thing.

johnisgood · 15 days ago
I've known about this kind of pixel drawing, but it was on an empty canvas.
yifanl · 15 days ago
They have the user count from the dev: 2 million daily users shouldn't be generating billions of requests unless a good portion of them are botting.
zamadatix · 15 days ago
Why not? These are tile requests, right, not login requests or something, so shouldn't a single user be expected to consume a few thousand while zipping around the map looking at the drawings overlaid on it?

I'm sure there is some botting, it's basically guaranteed, but I wouldn't be surprised if nearly half the traffic was "legitimate". The bots don't normally need to reload (or even load) the map tiles anyways.

LoganDark · 16 days ago
> I believe what is happening is that those images are being drawn by some script-kiddies.

Oh absolutely not. I've seen so many autistic people literally just nolifing and also collaborating on huge arts on wplace. It is absolutely not just script kiddies.

> 3 billion requests / 2 million users is an average of 1,500 req/user. A normal user might make 10-20 requests when loading a map, so these are extremely high, scripted use cases.

I don't know about that either. Users don't just load a map; they look all around the place to search for and see a bunch of the art others have made. I don't know how many requests are typical for "exploring a map for hours on end", but I imagine a lot of people are doing just that.

I wouldn't completely discount automation, but these usage patterns are far from impossible. Especially since wplace didn't expect this sudden popularity, they may not have optimized their traffic patterns as much as they could have.

Karliss · 16 days ago
Just scrolled around a little bit for 2-3 minutes with the network monitor open. That already resulted in 500 requests and 5 MB transferred (after filtering for vector tile data). Not sure how many of those were served from the browser cache with no actual request, revalidated by the browser exchanging only headers, or cached by Cloudflare. I'm guessing the typical 10-20 requests/user case is for an embedded map fragment like those commonly found on a contact page, where most users don't scroll at all, or at most zoom out slightly to better see the rest of the city.
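
Extrapolating that naively (a rough sketch that ignores browser and Cloudflare caching entirely):

    # ~500 requests in ~2.5 minutes of panning, per the measurement above
    reqs, minutes = 500, 2.5
    per_hour = reqs / minutes * 60           # ~12,000 requests per hour of active scrolling
    mins_for_avg = 1500 / (reqs / minutes)   # ~7.5 minutes to reach the 1,500 req/user average
    print(per_hour, mins_for_avg)
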
nemomarx · 16 days ago
There are some user scripts to overlay templates on the map and coordinate working together, but I can't imagine that increases the load much. What might increase it is that wplace has been struggling under the load, so you have to refresh to see your placed pixels or any changes, and that could be causing more calls per hour.
andai · 16 days ago
From the screenshot I wanted to say: couldn't this be done on a single VPS? It seemed over-engineered to me. Then I realized the silly pixels are on top of a map of the entire earth. Dang!

I'm curious what the peak req/s is like. I think it might be just barely within the range supported by benchmark-friendly web servers.

Unless there's some kind of order-of-magnitude slowdown due to the nature of the application.

Edit: Looks like about 64 pixels per km (4096 per km^2). At full color, uncompressed, that's about 8 TB to cover the entire earth (thinking long-term!). A 10 TB box is €20/month from Hetzner. You'd definitely want some caching though ;)
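
Napkin math for that (my own assumptions: ~510 million km^2 of surface area and 4 bytes per pixel, uncompressed RGBA):

    # back-of-envelope storage estimate for 64 px/km over the whole Earth
    earth_km2 = 510_000_000        # total surface area, oceans included
    px_per_km2 = 64 * 64           # 64 pixels per km -> 4096 per km^2
    bytes_per_px = 4               # full color, uncompressed
    total_tb = earth_km2 * px_per_km2 * bytes_per_px / 1e12
    print(total_tb)                # ~8.4 TB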

Edit 2: wplace uses 1000x1000 px PNGs for the drawing layer. The drawings load instantly, while the map itself is currently very laggy, and some chunks are permanently missing.

TylerE · 16 days ago
"€20/month from Hetzner" is great until you actually need it to be up and working when you need it.
motorest · 15 days ago
> "€20/month from Hetzner" is great until you actually need it to be up and working when you need it.

I've managed a few Hetzner cloud instances, and some report perfect uptime for over a year. For the ones that don't, I was the root cause.

What exactly leads you to make this sort of claim? Do you actually have any data or are you just running your mouth off?

immibis · 15 days ago
IME Hetzner's not unreliable. I don't think you could serve 100k requests per second on a single VPS though. (And with a dedicated server, you're on the hook for redundancy yourself, same as with any dedicated box.)
colinbartlett · 16 days ago
Thank you for this breakdown and for this level of transparency. We have been thinking of moving from MapTiler to OpenFreeMap for StatusGator's outage maps.
hyperknot · 16 days ago
Feel free to migrate. If you ever worry about High Availability, self-hosting is always an option. But I'm working hard on making the public instance as reliable as possible.
charcircuit · 16 days ago
>Nice idea, interesting project, next time please contact me before.

It's impossible to predict that one's project may go viral.

>As a single user, you broke the service for everyone.

Or you did, by not having a high enough fd limit. Blaming sites for using a service too much when you advertise that there is no limit is not cool. It's not like wplace themselves were maliciously hammering the API.

columb · 16 days ago
You are so entitled... Because of people like you, most nice things come with "no limits, but...". It's not cool to stress-test someone's infrastructure. Not cool. The author of this post is more than understanding: they tried to fix it and offered a solution even after blocking them. On a free service.

Show us what you have done.

charcircuit · 16 days ago
>You are so entitled

That's how agreements work. If someone says they will sell a hamburger for $5, and another person pays $5 for a hamburger, then they are entitled to a hamburger.

>On a free service.

It's up to the owner to price the service. Being overwhelmed by traffic when there are no limits is not a problem limited only to free services.

010101010101 · 16 days ago
Do you expect him just to let the service remain broken or to scale up to infinite cost to himself on this volunteer project? He worked with the project author to find a solution that works for both and does not degrade service for every other user, under literally no obligation to do anything at all. This isn’t Anthropic deciding to throttle users paying hundreds of dollars a month for a subscription. Constructive criticism is one thing, but entitlement to something run by an individual volunteer for free is absurd.
toast0 · 16 days ago
The project page kind of suggests he might scale up to infinite cost...

> Financially, the plan is to keep renting servers until they cover the bandwidth. I believe it can be self-sustainable if enough people subscribe to the support plans.

Especially since he said Cloudflare is providing the CDN for free... Yes, running the origins costs money, but in most cases default fd limits are low, and you can push them a lot higher. At some point you'll run into I/O limits, but the I/O at the origin seems pretty manageable if my napkin math was right.

If the files are all tiny and the fd limit is the actual bottleneck, there are ways to make that work better too. IMHO, it doesn't make sense to accept an inbound connection if you can't get an fd to read a file for it, so it's better to limit the concurrent connections, let connections sit in the listen queue, and use a short keepalive timeout to make sure you're not wasting your fds on idle connections. With no other knowledge, I'd put the connection limit at half the fd limit, assuming the origin server is dedicated to this and serves static files exclusively. But, to be honest, if I set up something like this, I probably wouldn't have thought about fd limits until they got hit, so no big deal... hopefully whatever I used for monitoring would include available fds by default and I'd have noticed, but it's not a default output everywhere.
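
As a rough nginx sketch of that rule of thumb (the numbers are illustrative, not tuned for this setup):

    # keep concurrent connections at ~half the fd budget: one fd for the
    # connection, one for the file it reads
    worker_rlimit_nofile 1000000;

    events {
        worker_connections 500000;
    }

    http {
        keepalive_timeout 5s;    # short timeout so idle connections free their fds quickly
    }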

charcircuit · 16 days ago
We are talking about hosting a fixed set of static files. This should be a solved problem. This is nothing like running large AI models for people.
rikafurude21 · 16 days ago
The funny part is that his service didn't break -- Cloudflare's cache caught 99% of the requests. He just wanted to feel powerful and break the latest viral trend.
ivanjermakov · 15 days ago
> Nice idea, interesting project, next time please contact me before.

I understand that my popular service might bring your less popular one to a halt, but please configure it on your end so I know _programmatically_ what its capabilities are.

I host no API without rate-limiting. Additionally, clearly listing usage limits might be a good idea.

Aurornis · 15 days ago
> I understand that my popular service might bring your less popular one to a halt, but please configure it on your end so I know _programmatically_ what its capabilities are.

Quite entitled expectations for someone using a free and open service to underpin their project.

The requests were coming from distributed clients, not a central API gateway that could respond to rate limiting requests.

> I host no API without rate-limiting. Additionally, clearly listing usage limits might be a good idea.

Again, this wasn’t a central, well-behaved client hitting the API from a couple of IPs or with a known API key.

They calculate that for every single user of the wplace.live website, they were getting 1,500 requests. This implies a lot of botting and scripting.

This is basically load testing the web site at DDoS scale.

zamadatix · 15 days ago
> The requests were coming from distributed clients, not a central API gateway that could respond to rate limiting requests

The block was done based on URL origin rather than client/token, so why wouldn't a rate-limiter solution consider the same? For this case (a site which uses the API) it would work perfectly fine. Especially since the bots don't even care about the information from this API, non-site-based bots aren't even going to bother to pull the OpenFreeMap tiles.
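
A sketch of what that might look like in nginx, keyed on the Origin header (the zone size, rate, and tile path are made-up values, not a recommendation):

    # hypothetical per-Origin rate limit instead of a hard block
    limit_req_zone $http_origin zone=per_origin:10m rate=500r/s;

    server {
        location /tiles/ {    # assumed tile path, for illustration only
            limit_req zone=per_origin burst=2000 nodelay;
        }
    }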

ivanjermakov · 15 days ago
Ugh, then I agree. This way it's indistinguishable from a DDoS attack.
Aeolun · 15 days ago
I think it’s reasonable to assume that a free service is not going to deal gracefully with your 100k rps hug of death. The fact that it actually did is an exception, not the rule.

If you are hitting anything free with more than 10 rps (even temporarily), you are taking advantage, in my opinion.

ch33zer · 16 days ago
Since the limit you ran into was the number of open files, could you just raise that limit? I get blocking the spammy traffic, but theoretically could you have handled more if that limit was upped?
hyperknot · 16 days ago
I've just written my question to the nginx community forum, after a lengthy debugging session with multiple LLMs. Right now, I believe it was the combination of multi_accept + open_file_cache > worker_rlimit_nofile.

https://community.nginx.org/t/too-many-open-files-at-1000-re...

Also, the servers were doing 200 Mbps, so I couldn't have kept up _much_ longer, no matter the limits.

toast0 · 16 days ago
I'm pretty sure your open file cache is way too large. If you're doing 1k/sec and you cache file descriptors for 60 minutes, assuming those are all unique, that's asking for over 3 million FDs to be cached when you've only got 1 million available. I've never used nginx or open_file_cache [1], but I would tune it way down and see if you even notice a difference in performance in normal operation. Maybe 10k files, 60s timeout.
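
In nginx terms, something like (the numbers are just the guesses above, not benchmarked):

    open_file_cache        max=10000 inactive=60s;   # cap cached fds well below worker_rlimit_nofile
    open_file_cache_valid  60s;
    open_file_cache_errors on;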

> Also, the servers were doing 200 Mbps, so I couldn't have kept up _much_ longer, no matter the limits.

For cost reasons or system overload?

If system overload: what kind of storage? Are you monitoring disk I/O? What kind of CPU do you have in your system? I used to push almost 10 GBps with HTTPS on a dual E5-2690 [2], but that was a larger file. 2690s were high end, but something more modern will have much better AES acceleration and should do better than 200 Mbps almost regardless of what it is.

[1] To be honest, I'm not sure I understand the intent of open_file_cache... Opening files is usually not that expensive; maybe it matters at hundreds of thousands of rps, or if you have a very complex filesystem. PS: don't put tens of thousands of files in a directory. Everything works better if you take your ten thousand files and put one hundred of them into each of one hundred directories. You can experiment to see what works best with your load, but a tree where you've got N layers of M directories and the last layer has M files is a good plan, with 64 <= M <= 256 (rough sketch at the end of this comment). The goal is keeping the directories compact so searching and editing them stays effective.

[2] https://www.intel.com/content/www/us/en/products/sku/64596/i...
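
A minimal sketch of the fan-out described in [1], using a hash purely for illustration (a tile tree could just as well shard on z/x/y):

    import hashlib, os

    def sharded_path(root, name):
        # two levels of 256 directories each, so no single directory gets huge
        h = hashlib.md5(name.encode()).hexdigest()
        return os.path.join(root, h[:2], h[2:4], name)   # e.g. root/3f/a9/name

    def store(root, name, data):
        path = sharded_path(root, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)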

ndriscoll · 16 days ago
One thing that might work for you is to actually make the empty tile file and hard-link it everywhere it needs to be. Then you don't need to special-case it at runtime; you handle it at generation time instead.
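
A minimal sketch of that, with hypothetical paths and assuming tiles are written out as PNG files at generation time:

    import os

    EMPTY_TILE = "tiles/empty.png"       # one canonical blank tile, created once

    def write_tile(path, png_bytes):
        # assumes path does not already exist
        os.makedirs(os.path.dirname(path), exist_ok=True)
        if png_bytes is None:            # nothing drawn in this tile
            os.link(EMPTY_TILE, path)    # hard link: all blank tiles share one inode
        else:
            with open(path, "wb") as f:
                f.write(png_bytes)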

NVMe disks are incredibly fast, and 1k rps is not a lot (IIRC my N100 seems capable of ~40k if not for the 1 Gbit NIC bottlenecking). I'd try benchmarking without the tuning options you've got. Like, do you actually get 40k concurrent connections from Cloudflare? If you have the connections to your upstream kept alive (so no constant slow starts), ideally you have numCores workers that each do one thing at a time, and that's enough to max out your NIC. You only add concurrency if latency prevents you from maxing out bandwidth.

justinclift · 15 days ago
> so I couldn't have kept up _much_ longer, no matter the limits.

Why would that kind of rate cause a problem over time?

Ericson2314 · 15 days ago
Oh wow, TIL there is finally a simple way to actually view OpenStreetMap! Gosh, that's overdue. Glad it's done though!
bspammer · 15 days ago
What was wrong with the main site? Genuine question https://www.openstreetmap.org
Ericson2314 · 14 days ago
Oh.... last I checked they didn't have that?
drewda · 15 days ago
The OSM Foundation has been serving raster tiles for years and years (that's what's visible by default on the slippy map at www.openstreetmap.org): https://wiki.openstreetmap.org/wiki/OpenStreetMap_Carto

After on and off experimentation by various contributors, OSMF just released vector tiles as well: https://operations.osmfoundation.org/policies/vector/

Ericson2314 · 14 days ago
Thanks, I'm just out of date
bravesoul2 · 15 days ago
... And then it became 1M rps!