Readit News
jd20 commented on How we blocked TikTok's Bytespider bot and cut our bandwidth by 80%   nerdcrawler.com/blog/how-... · Posted by u/chptung
jd20 · 2 years ago
I didn't see it mentioned, but why not just use robots.txt? Does Bytespider ignore it?
jd20 commented on Interviews with makers of high-end mechanical keyboards   endgame.fyi... · Posted by u/hellweaver666
smabie · 5 years ago
I would love to use a custom built keyboard, but my RSI makes it so I can only type on a Kinesis Advantage for extended periods of time. I'm surprised more people don't use them, the traditional keyboard layout makes very little sense. It's crazy how much inertia many sub-optimal things have.
jd20 · 5 years ago
I'd really love to see more data on whether ergonomic keyboards actually work; from what I've read, the results are mixed. I kind of want to try a split keyboard like the ErgoDox or Kinesis, but I tend to cross over a fair amount when typing, and I wonder if a split keyboard would be less efficient.

I also overthink the placement of frequently used keys like Cmd/Ctrl/Alt (on a Mac, for instance) and what the optimal arrangement would be, and there seems to be very little data on this topic.

jd20 commented on Interviews with makers of high-end mechanical keyboards   endgame.fyi... · Posted by u/hellweaver666
janwillemb · 5 years ago
None of these keyboards have a numeric keypad; these are for fancy design and not for heavy use.
jd20 · 5 years ago
One does (it's a southpaw, with the numeric keypad on the left). A few are TKL (a full-size keyboard without the numpad).

Actually, as a programmer, I pretty much never use the numeric keypad. But when I start seeing smaller layouts with no arrow keys, Fn keys, or even number keys, I tend to agree: there's a definite trade-off between function and aesthetics. The beauty of custom keyboards is that people get to decide those trade-offs themselves.

jd20 commented on Interviews with makers of high-end mechanical keyboards   endgame.fyi... · Posted by u/hellweaver666
jd20 · 5 years ago
A clarification: these are interviews with people who assemble custom keyboards. I was expecting chats with the people who actually design and produce custom keyboards (like yuktsi, Rama, Wilba, ZealPC, etc.).

Still, very cool to see what people are building. I've just recently fallen down the rabbit hole of custom keyboards, after my Apple Keyboard stopped working. As someone who spends almost half my life at a keyboard, I'm surprised it took this long for me to look into improving the tool I interact with most every day.

jd20 commented on Applebot   support.apple.com/en-us/H... · Posted by u/jonbaer
tylfin · 5 years ago
Yeah, I've never had to implement my own DNS cache for a language before...

If you're on a system with cgo available, you can use `GODEBUG=netdns=cgo` to avoid making direct DNS requests.

This is the default on macOS, so if it was running on four Mac Pros I wouldn't expect it to be the root cause.

jd20 · 5 years ago
It's possible that wasn't the default setting on Macs back then. I don't know that cgo would be a good choice either if you're resolving a ton of domains at once: early versions of Go would create a new thread whenever a goroutine made a cgo call and no existing thread was available. I remember this required us to throttle concurrent dial calls; otherwise we'd end up with thousands of threads and eventually bring the crawler to a halt.
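That throttling pattern can be sketched with a buffered channel used as a counting semaphore (a minimal illustration, not Applebot's actual code; the limit of 512 and the stub dial function are made up for the example):

```go
package main

import (
	"fmt"
	"sync"
)

// dialSem caps the number of concurrent dial/resolve calls, so that
// cgo-backed lookups can't each pin a fresh OS thread without bound.
var dialSem = make(chan struct{}, 512) // limit is illustrative

// throttledDial acquires a slot before dialing and releases it after.
func throttledDial(addr string, dial func(string) error) error {
	dialSem <- struct{}{}        // blocks while 512 dials are in flight
	defer func() { <-dialSem }() // free the slot when done
	return dial(addr)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			throttledDial(fmt.Sprintf("host%d:80", i), func(string) error {
				return nil // stand-in for a real net.Dial
			})
		}(i)
	}
	wg.Wait()
	fmt.Println("all dials completed")
}
```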

To make DNS resolution really scale, we ended up moving all the DNS caching and resolution directly into Go. Not sure that's how you'd do it today; I'm sure Go has changed a lot. Building your own DNS resolver is actually not so hard in Go, and the following were really useful:

https://idea.popcount.org/2013-11-28-how-to-resolve-a-millio...

https://github.com/miekg/dns

jd20 commented on Applebot   support.apple.com/en-us/H... · Posted by u/jonbaer
ksec · 5 years ago
>- Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

Considering the timeline, are those the Trash Can Mac Pros? Or was it the old Cheese Grater?

jd20 · 5 years ago
Trash cans :)
jd20 commented on Applebot   support.apple.com/en-us/H... · Posted by u/jonbaer
throwaway4good · 5 years ago
I am particularly curious about data storage.

Does it use a traditional relational database or another existing database-like product? Or is it built from scratch, sitting directly on top of a file system?

jd20 · 5 years ago
Nope, you don't really need a database. What you need for fast, scalable web crawling is more like key-value storage: a really fast layer (something like RocksDB on SSD) for metadata about URLs, and another layer that can be very slow for storing crawled pages (like Hadoop or Cassandra). In reality, writing directly to Hadoop/Cassandra was too slow (because it was in a remote data center), so it was easier to just write to RAID arrays over Thunderbolt and sync the data periodically as a separate step.
jd20 commented on Applebot   support.apple.com/en-us/H... · Posted by u/jonbaer
72deluxe · 5 years ago
Out of curiosity, why would C or C++ not be good for web services?
jd20 · 5 years ago
Some Apple services were written in C/C++. One downside is that it's very hard to source engineers across the company who can then work on that code, or for those engineers to move on to other teams.
jd20 commented on Applebot   support.apple.com/en-us/H... · Posted by u/jonbaer
tinco · 5 years ago
Say the average web page is 100 KB, and assume a gigabit connection in the office; that's about a thousand pages per second. If the office switch is on 10 Gbit, that would work out to 4000 p/s, counting naively. But we're in the same order of magnitude for the speed even on gigabit, and we're not accounting for gzip, and the actual average page size might be a bit lower too.
jd20 · 5 years ago
Everything was on 10GigE. The average page size was around 17 KB gzipped. Everything's a careful balance between CPU, memory, storage, and message throughput between machines.

Apple's corporate network also had incredible bandwidth to the Internet at large. Not sure why, but I assumed it was because their earliest data centers actually ran in office buildings in the vicinity of 1 Infinite Loop.
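As a sanity check on the back-of-the-envelope numbers in this thread (a toy calculation that ignores HTTP/TCP and compression overhead):

```go
package main

import "fmt"

// pagesPerSecond estimates raw crawl throughput from link bandwidth
// (bits/s) and average page size (bytes), ignoring protocol overhead.
func pagesPerSecond(bandwidthBits, pageBytes float64) float64 {
	return bandwidthBits / 8 / pageBytes
}

func main() {
	// Roughly 1,200 pages/s on gigabit at 100 KB/page (the estimate above).
	fmt.Printf("%.0f\n", pagesPerSecond(1e9, 100*1024))
	// Roughly 72,000 pages/s on 10GigE at 17 KB gzipped.
	fmt.Printf("%.0f\n", pagesPerSecond(10e9, 17*1024))
}
```

At roughly 72,000 pages/s of raw 10GigE budget, the 1B pages/day quoted earlier in the thread (about 11,600 pages/s averaged over a day) leaves ample network headroom, consistent with CPU, memory, and storage being the balancing constraints.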

jd20 commented on Applebot   support.apple.com/en-us/H... · Posted by u/jonbaer
doh · 5 years ago
Can you talk more about the specific? What kind of parsers did you guys use? How about storage? How often did you update pages?
jd20 · 5 years ago
You should check out Manning's "Introduction to Information Retrieval"; it has far more detail about web crawler architecture than I can write in a post, and it served as a blueprint for many of Applebot's early design decisions.
