Building an Internet Scale Meme Search Engine

This title really undersells the absolute insanity of the described solution. This is a beautiful example of "if it's stupid, but it works, it's not stupid." The justification is very convincing.

One thing I'm curious about: how did you build your corpus of meme images and videos?

leokennis · 3 years ago

It reminds me a little of https://www.beeper.com/. It allows you to read iMessages on Android and other platforms. To make this work, they will ship you some old iPhone to act as a server "bridge": https://twitter.com/ericmigi/status/1351934418961661959

If it works, it works. But it also speaks volumes about Apple's disregard for, or inexperience with, exposing their stuff via the web - https://www.icloud.com/ being the prime example: half the stuff the phone apps can do is not available (you cannot create a reminder with a due date...) and the things that are there are slow and buggy.

dewey · 3 years ago

I think I've seen a post from https://texts.com about that. I don't think they ship you the iPhone though, they host it themselves.

aemreunal · 3 years ago

(Not the author) Maybe they leveraged https://knowyourmeme.com/? But that can't possibly have all the random memes, can it?

mandatory · 3 years ago

Author here: KnowYourMeme is one of many sites that memes are continually ingested from (any site that has memes I try to ingest regularly) :)

code_duck · 3 years ago

OP’s meme site lists where each image comes from. Looking through it I mainly see ifunny and 9gag.

solarkraft · 3 years ago

Do you crawl telegram channels?

bryanrasmussen · 3 years ago

Yeah, but lots of things that work are still stupid, because there are many other solutions that work better. The greatness of this crazy solution is that it really seems like the best solution given the price requirements.

jychang · 3 years ago

I feel like I’m taking crazy pills in this thread. Am I the only one who talks to Gen Z kids who explore around their iPhone apps? This definitely isn’t the best option given price requirements. It’s not even the most convenient option.

I’m around age 30, not 13, so similar to the article, my first instinct was also to create a database and OCR the image. But by total coincidence, yesterday I had a conversation with my 14 year old cousin on the topic of saving memes. Her response was along the lines of “yeah, everyone nowadays just saves the image to your iPhone photos, and then just search for it later from the photos app”.

Yeah. This whole article is literally already built into iOS UI, not just a hidden API. And kids all seem to know about this, apparently.

This article uses an example meme with the text “Sorry young man But the armband (red) stays on during solo raids”. I saved it in my iPhone photos app… and found it again through the search function in the photos app.

https://imgur.com/a/BPICjOz

https://imgur.com/a/55el9uQ

This is a solved problem already, by teenager standards.

I felt extremely old yesterday when I was talking to my cousin. And I felt extremely old today, reading this article. This is because, looking back, the past few decades of CS cultural intuition have established that text is text, and images are images. Strings and bitmaps don’t mix.

This seems sort of obvious to anyone in tech, but I realized that from a clueless grandma perspective, not being able to search up text in photos wasn’t really obvious. Well, the roles are reversed now. Ordinary people now have access to software that treats text as a first class citizen in photos by default.

Although it's not as elegant a solution as this, I've also tried my hand at indexing and categorizing memes. I wanted to save a very specific type of meme, though, since there are, in my mind, 2 main categories of memes. The first category is what I call "story" memes: they are standalone and typically what you see being shared on Facebook. They usually have text, are able to tell a story on their own with no additional context, and can be presented as a single post, story, etc. (think 4 panel comics). The second type is reaction memes. These are used to respond to people and usually convey a feeling towards a post or tweet. They can also be standalone, so they should probably be considered a subset of the "story" memes. I've gravitated towards the reaction memes as I see more utility in them and they can be used in a more universal way. My site, if anyone is interested (it's still a work in progress):

https://www.memeatlas.com

operator-name · 3 years ago

These different approaches really complement each other - most of the memes you've categorised are used in a variety of situations and are therefore not suited to text searching. Meanwhile, if you're looking for a specific meme that you've seen, text search is the way to go.

Ideally there would be a best of both worlds where you could search memes by "characters" or "formats" in addition to text.

As feedback, it would be nice to search all memes from the homepage. The search on https://www.memeatlas.com/meme-templates.html also seems to be broken.

suave_dude · 3 years ago

What are you using as the back end to host your website?

CobrastanJorji · 3 years ago

merpkz · 3 years ago

Nice project. I wanted to build a meme search engine myself one day, but figured Tesseract would fail at most of the memes because of how stylized they have become. So I narrowed my meme source down to only /r/bertstrips, as those contain sane-looking text, and it's working quite alright - the project has no frontend yet, I search from the CLI and click links.

> Initial testing with the Postgres Full Text Search indexing functionality proved unusably slow at the scale of anything over a million images, even when allocated the appropriate hardware resources.

I can guarantee you that a correctly set up PostgreSQL text search will be faster than ES while needing much, much less hardware; it's just a matter of correctly creating a tsvector column and a GIN index on it (and ofc asking the right queries so the index is actually used). I can help you out with setting up the postgres schema and debugging queries if you are interested, for testing purposes at least.
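For anyone who wants to try the tsvector + GIN setup described above, here is a minimal sketch in Python. It is not the article author's schema: the table and column names (`memes`, `image_url`, `ocr_text`, `ocr_tsv`) and the `search` helper are hypothetical, and it assumes PostgreSQL 12 or newer (for generated columns) plus a DB-API driver that uses `%s` placeholders, such as psycopg2.

```python
"""Sketch: PostgreSQL full-text search over OCR'd meme text.

All table and column names here are hypothetical, chosen for illustration.
"""

# The generated column keeps the tsvector in sync with ocr_text automatically
# (generated columns need PostgreSQL 12+). The GIN index is what makes the
# @@ match operator fast on large tables.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS memes (
    id        bigserial PRIMARY KEY,
    image_url text NOT NULL,
    ocr_text  text NOT NULL,
    ocr_tsv   tsvector GENERATED ALWAYS AS (to_tsvector('english', ocr_text)) STORED
);
CREATE INDEX IF NOT EXISTS memes_ocr_tsv_idx ON memes USING GIN (ocr_tsv);
"""

# Querying the indexed column directly with @@ lets the planner use the index.
SEARCH_SQL = """
SELECT id, image_url
FROM memes
WHERE ocr_tsv @@ plainto_tsquery('english', %s)
LIMIT %s;
"""


def search(conn, text, limit=20):
    """Return up to `limit` (id, image_url) rows whose OCR text matches `text`.

    `conn` is any DB-API connection using %s placeholders (e.g. psycopg2).
    """
    with conn.cursor() as cur:
        cur.execute(SEARCH_SQL, (text, limit))
        return cur.fetchall()
```

Running `EXPLAIN` on the search query is the usual way to confirm the index is being used rather than a sequential scan.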

tmzt · 3 years ago

I recently worked on a project using lnx.rs. Simple to set up and use, and fast at the scale I was using it. Built on Tantivy with a custom fast fuzzy search feature.

If you want to go beyond meme sites and possibly detect memes in the wild, common crawl might be something to start with.

iamflimflam1 · 3 years ago

One issue I've had with postgres full text search is that when you want to rank using ts_rank, you end up with a full table scan.
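One workaround that is sometimes suggested, assuming the slowness comes from computing ts_rank for every row, is to let an indexed match pick a capped set of candidates and rank only those. The sketch below assumes a hypothetical `memes` table with an `image_url` column and a tsvector column `ocr_tsv` that has a GIN index; the `ranked_search` helper and the `max_candidates` cap are also made up for illustration, and the named `%(name)s` placeholders assume a driver such as psycopg2.

```python
"""Sketch: rank only a capped set of full-text matches.

Table, column, and function names are hypothetical.
"""

# The inner query uses the indexed @@ match and stops after `candidates` rows;
# ts_rank is then computed only for those rows instead of for the whole table.
RANKED_SEARCH_SQL = """
SELECT id, image_url,
       ts_rank(ocr_tsv, plainto_tsquery('english', %(q)s)) AS rank
FROM (
    SELECT id, image_url, ocr_tsv
    FROM memes
    WHERE ocr_tsv @@ plainto_tsquery('english', %(q)s)
    LIMIT %(candidates)s
) AS matched
ORDER BY rank DESC
LIMIT %(limit)s;
"""


def ranked_search(conn, text, limit=20, max_candidates=1000):
    """Return the best-ranked rows among at most `max_candidates` matches."""
    params = {"q": text, "candidates": max_candidates, "limit": limit}
    with conn.cursor() as cur:
        cur.execute(RANKED_SEARCH_SQL, params)
        return cur.fetchall()
```

The trade-off is that matches beyond the cap are never ranked, so for very common words the results are approximate rather than the true top hits.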

CiceroCiceronis · 3 years ago

This is really brilliant to see, and I've been surprised for quite a long time that nothing similar exists. I think it's a real shame that few people with interest in memes have interest in building solutions like this that help us engage with them.

People in the 21st century know a lot about the mistakes of the past century that led to much popular culture of the time being lost (especially terminally online people who've watched lots of Youtube documentaries about lost Dr. Who episodes and so on), so it surprises me how little we try and avoid those same mistakes with today's ephemeral pop culture in the form of memes. People like yourself who want to help make the internet's huge corpus of memes tractable are part of the solution in terms of meme archival and cultural memory.

(There's a good meme metadiscussion group on Discord, "The Philosopher's Meme," which you might be interested in joining. People there would be very keen to discuss what you've made.)

spiffytech · 3 years ago

I've always been surprised that Reddit hasn't built meme search into its site search.

Memes are a core part of the Reddit experience, yet it's difficult to find something I know I saw before.

Not familiar with Discord, do you have a link to that group?

https://discord.com/invite/8MVFRMa

Love the hackiness of this - however, the vision framework is available on Desktop macs as well - https://developer.apple.com/documentation/vision

and specifically:

https://developer.apple.com/documentation/vision/vnrecognize...

cassiogo · 3 years ago

> My preliminary speed tests were fairly slow on my Macbook. However, once I deployed the app to an actual iPhone the speed of OCR was extremely promising (possibly due to the Vision framework using the GPU). I was then able to perform extremely accurate OCR on thousands of images in no time at all, even on the budget iPhone models like the 2nd gen SE.

He does mention running it on a macbook

msdrigg · 3 years ago

I would guess that "tests" in this sentence refers to tests of the iOS app on the simulator.

Which would be slow, as expected.

dblitt · 3 years ago

I would assume he's using an intel macbook and wouldn't have the gpu acceleration (and subsequent Vision framework integration) of the m1

pronoiac · 3 years ago

There's ocrit, a CLI utility using Apple's Vision framework for OCR: https://github.com/insidegui/ocrit

price456987 · 3 years ago

What's the cost of building and running a cluster of iPhones vs Mac Minis?

Bad_CRC · 3 years ago

In the article, $40 second-hand, IMEI-banned, broken-screen iPhones are being used, so...

yakubin · 3 years ago

The photo under Upgrading the iPhone OCR Service Into An OCR Cluster. In the future, data centres are going to host racks of iPhones.

sankha93 · 3 years ago

It is already here to be honest. I know BrowserStack and other mobile testing platforms (at Facebook and Amazon) do host real devices, both Android and iPhones, in server farms like this. Meta wrote a blog post about it: https://engineering.fb.com/2016/07/13/android/the-mobile-dev...

At one of my previous workplaces, we discussed running the Z3 theorem prover on an iPhone cluster, because it runs so much faster on an A-series processor than on a desktop Intel machine.

formerly_proven · 3 years ago

Reminds me of imgix, who built their product on Apple libraries so ended up having racks of macs to run their service.

oefrha · 3 years ago

Modern app click farms already have walls of iPhones.

To be fair...insane performance in a tiny cool package...iPhones would make great servers if you could order them without the screen/camera etc. :-)

defrost · 3 years ago

There's your startup right there - washing line racks of discarded iPhones, near bleeding edge, busted screens, still functional o/wise.

Low entry cost, recycling, eco-friendly, . . . a deck that writes itself.

aabajian · 3 years ago

I had a friend in med school who wrote a very early note-taking app for the iPad. Turns out that there was no way to render PowerPoint files when the iPad first came out. He realized that the iOS/Mac OS "quick preview" function could be used to take screenshots of each PowerPoint slide. For a brief time, his was the only app that could display PowerPoints (albeit, they were just screenshots!). There's a lot of hidden utility in Apple libraries.

ksdme9 · 3 years ago

Love the inventiveness.

My question is about the image distribution costs. All the memes on the site seem to be coming straight off object storage; all that bandwidth consumption has got to add up(?). Some sort of CDN might help, depending on the search patterns.

memeatlas · 3 years ago