furiousteabag (u/furiousteabag)

furiousteabag commented on Show HN: Simtown: A 2D Role-Playing Game Where Characters Talk, Move, and Act app.simtown.ai/... · Posted by u/syzmony

furiousteabag · 9 months ago

Another thing we are trying to understand is whether the 2D element adds value to the simulation. A simpler option would be a pure text/chat interface. Still, the hypothesis here is that it is easier to comprehend what's going on in the environment with an actual 2D world and characters, and it might be more immersive compared to just a text interface.

furiousteabag commented on Launch HN: Silurian (YC S24) – Simulate the Earth · Posted by u/rejuvyesh

brunosan · a year ago

howdy! Clay makers here. Can you share more? Did you try Clay v1 or v0.2 What image size embeddings from what instrument?

We did try to relate OSM tags to Clay embeddings, but it didn't scale well. We did not give up, but we are re-considering ( https://github.com/Clay-foundation/earth-text ). I think SatClip plus OSM is a better approach. or LLM embeddings mapped to Clay embeddings...

furiousteabag · a year ago

Hey hey! We tried Clay v1 with 768 embeddings size using your tutorials. We then split NAIP SF to chips and indexed them. Afterwards, we performed image-to-image similarity search like in your explorer.

We tried to search for bridges, beaches, tennis courts, etc. It worked, but it didn't work well. The top of the ranking was filled with unrelated objects. We found that similarity scores are stacked together too much (similarity values are between 0.91 and 0.92 with 4 digit difference, ~200k tiles), so the encoder made very little difference between objects.

I believe that Clay can be used with additional fine-tuning for classification and segmentation, but standalone embeddings are pretty poor.

Check this: https://github.com/wangzhecheng/SkyScript. It is a dataset of OSM tags and satellite images. CLIP fine-tuned on that gives good embeddings for text-to-image search as well as image-to-image.

furiousteabag commented on Launch HN: Silurian (YC S24) – Simulate the Earth · Posted by u/rejuvyesh

furiousteabag · a year ago

Curious to see what other things you will simulate in the future!

Shameless plug: recently we've built a demo that allows you to search for objects in San Francisco using natural language. You can look for things like Tesla cars, dry patches, boats, and more. Link: https://demo.bluesight.ai/

We've tried using Clay embeddings but we quickly found out that they perform poorly for similarity search compared to embeddings produced by CLIP fine tuned on OSM captions (SkyScript).

furiousteabag commented on Show HN: Search San Francisco using natural language demo.bluesight.ai/... · Posted by u/furiousteabag

skeptrune · a year ago

Reminds me a ton of the all text in brooklyn post from a couple weeks ago - https://hn.trieve.ai/?q=all+text+brooklyn . I am pretty surprised that I found it easier to come up with cool queries for alltext.nyc though. I wouldn't have expected that with just descriptions of the two search engines.

Some notes from experimenting:

High level, I would expect that using a LLM to caption the images post SAM would do better than CLIP by itself.

I wish that it put my search query and mode as a URL param so sharing would be a bit easier.

Main use-case I could come up with would be spotting areas lacking accessibility features or maybe homeless encampments? Idk, I definitely couldn't think of any commercialization ideas off the jump.

Another thought I had was why you guys chose aerial satellite instead of street view data? I imagine something like "palace of fine arts" would have worked with that approach.

- "skate park" has some classic dense vector failures where it will find an outdoor playground-style area more similar than an actual skate park. Also, big mode is really cool and works much better for these kinds of queries. "chess board" will rank checker patterns over actual chess boards. Bunch of examples will follow this pattern. I wish there was an additional search mode for LLM generated descriptions of the segmented images, but there are probably cost constraints there.

- "USF" also didn't work well. I guess not surprising given it's CLIP, but still kind of interesting. I wonder what it would take to make the multi-modal models better at OCR without actually doing OCR.

- "beach" didn't work great which surprised me

- "picnic tables" and "lots of people" also didn't work | no idea why

furiousteabag · a year ago

Thanks for sharing Brooklyn text demo. Haven't seen it!

Captioning images using VLM would definitely help as an additional conditional feature. Maybe it even would be enough to use only embeddings of captions to do search!

We chose aerial satellite instead of street view because we plan to apply the same technologies where street view is not available, e.g. crop fields or forests. Another thing is that we plan to monitor areas that change frequently and street view data is not enough to keep up. But the idea is great! Although your query "palace of fine arts" is not extremely exciting because it is searchable via Google Maps :D

"USF" by itself doesn't work, "USF word" pointed me where needed xD

"beach" and "picnic tables" indeed doesn't work in object mode, but works great in "big" mode, probably because they needs some context around themselves

"lots of people" didn't work, "a crowd of people" seems to work. Interesting, that almost the same (semantically) queries produce very different results!