minimaxir · 2 years ago
Note that a video is just a sequence of images: OpenAI has a demo with GPT-4-Vision that sends a list of frames to the model with a similar effect: https://cookbook.openai.com/examples/gpt_with_vision_for_vid...

If GPT-4-Vision supported function calling/structured data for guaranteed JSON output, that would be nice though.

There's shenanigans you can do with ffmpeg to output every-other-frame to halve the costs too. The OpenAI demo passes every 50th frame of a ~600 frame video (20s at 30fps).

EDIT: As noted in discussions below, Gemini 1.5 appears to take 1 frame every second as input.
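The frame-selection trick is simple arithmetic; a sketch in Python (with ffmpeg the rough equivalent is `-vf "select='not(mod(n\,50))'" -vsync vfr`, though exact flags vary by version):

```python
def sample_frame_indices(total_frames: int, stride: int) -> list[int]:
    """Keep every `stride`-th frame, mirroring ffmpeg's select=not(mod(n,stride))."""
    return [n for n in range(total_frames) if n % stride == 0]

# The cookbook demo's numbers: ~600 frames (20s at 30fps), every 50th frame.
kept = sample_frame_indices(600, 50)
print(len(kept), kept[:3])  # 12 frames sent to the model: [0, 50, 100]
```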

ankeshanand · 2 years ago
We've done extensive comparisons against GPT-4V for video inputs in our technical report: https://storage.googleapis.com/deepmind-media/gemini/gemini_....

Most notably, at 1FPS the GPT-4V API errors out around 3-4 mins, while 1.5 Pro supports up to an hour of video input.

jxy · 2 years ago
So that 3-4 mins at 1FPS means you are using about 500 to 700 tokens per image, which means you are using `detail: high` with something like 1080p to feed to gpt-4-vision-preview (unless you have another private endpoint).

The gemini 1.5 pro uses about 258 tokens per frame (2.8M tokens for 10856 frames).

Are those comparable?
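Spelling out that arithmetic (the 128k context window for gpt-4-vision-preview as the limiting factor is an assumption here, not something the API documents for video):

```python
# Gemini 1.5 Pro, per the technical report: ~2.8M tokens for 10,856 frames.
gemini_per_frame = 2_800_000 / 10_856
print(round(gemini_per_frame))  # 258

# If GPT-4V errors out around 3.5 min of 1 FPS frames, and the limiting
# factor is a 128k context window, the implied per-frame cost is:
gpt4v_per_frame = 128_000 / (3.5 * 60)
print(round(gpt4v_per_frame))  # ~610, consistent with the 500-700 estimate
```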

moralestapia · 2 years ago
>while 1.5 Pro supports upto an hour of video inputs

At what price, tho?

verticalscaler · 2 years ago
The average shot length in modern movies is between 4 and 16 seconds and around 1 minute for a scene.
simonw · 2 years ago
The number of tokens used for videos - 1,841 for my 7s video, 6,049 for 22s - suggests to me that this is a much more efficient way of processing content than individual frames.

For structured data extraction I also like not having to run pseudo-OCR on hundreds of frames and then combine the results myself.
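As a back-of-envelope check (assuming the 258-tokens-per-frame, 1-frame-per-second figures discussed elsewhere in the thread), the quoted totals land close to a plain per-frame model, so the efficiency may mostly come from the low sampling rate:

```python
FRAME_TOKENS = 258  # Gemini's per-image token count
for seconds, measured in [(7, 1841), (22, 6049)]:
    predicted = seconds * FRAME_TOKENS  # naive 1 FPS per-frame model
    print(f"{seconds}s: predicted {predicted}, measured {measured}")
# 7s: predicted 1806, measured 1841
# 22s: predicted 5676, measured 6049
```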

famouswaffles · 2 years ago
No it's individual frames

https://developers.googleblog.com/2024/02/gemini-15-availabl...

"Gemini 1.5 Pro can also reason across up to 1 hour of video. When you attach a video, Google AI Studio breaks it down into thousands of frames (without audio),..."

But it's very likely individual frames at 1 frame/s

https://storage.googleapis.com/deepmind-media/gemini/gemini_...

"Figure 5 | When prompted with a 45 minute Buster Keaton movie “Sherlock Jr." (1924) (2,674 frames at 1FPS, 684k tokens), Gemini 1.5 Pro retrieves and extracts textual information from a specific frame and provides the corresponding timestamp. At bottom right, the model identifies a scene in the movie from a hand-drawn sketch."

minimaxir · 2 years ago
From the Gemini 1.0 Pro API docs (which may not be the same as Gemini 1.5 in Data Studio): https://cloud.google.com/vertex-ai/docs/generative-ai/multim...

> The model processes videos as non-contiguous image frames from the video. Audio isn't included. If you notice the model missing some content from the video, try making the video shorter so that the model captures a greater portion of the video content.

> Only information in the first 2 minutes is processed.

> Each video accounts for 1,032 tokens.

That last point is weird because there is no way a video would be a fixed number of tokens, and I suspect it's a typo. The value is exactly 4x the number of tokens for an image input to Gemini (258 tokens), which may be a hint to the implementation.
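The 4x relationship, for what it's worth (the "four sampled frames" reading is pure speculation):

```python
IMAGE_TOKENS = 258   # Gemini's documented per-image cost
VIDEO_TOKENS = 1032  # the suspiciously fixed per-video figure from the docs
print(VIDEO_TOKENS / IMAGE_TOKENS)  # 4.0 -- perhaps 4 sampled frames per video?
```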

btbuildem · 2 years ago
Given how video is compressed (usually, key frames + series of diffs) perhaps there's some internal optimization leveraging that (key frame: bunch of tokens, diff frames: much fewer tokens)
arbuge · 2 years ago
How is sound handled?

All I see in the Gemini docs is a terse sentence that says it isn’t included, which doesn’t sound like an optimal solution.

OkGoDoIt · 2 years ago
It doesn’t appear to be using the sound from the video, but elsewhere the report for Gemini 1.5 Pro mentions it can handle sound directly as an input, without first transcribing it to text (including a chart that makes the point it’s much more accurate than transcribing audio with Whisper and then querying it using GPT-4).

But I don’t think it went into detail about how exactly that works, and I’m not sure if the API/front end has a good way to handle that.

minimaxir · 2 years ago
Models have to be trained to understand sound, it's not free.
belter · 2 years ago
Prompt injection via Video?
janpmz · 2 years ago
On the other hand, a picture is a video with a single frame.
DerCommodore · 2 years ago
I expected more from the video
dEnigma · 2 years ago
> It looks like the safety filter may have taken offense to the word “Cocktail”!

I'm definitely not a fan of these models that are severely hamstrung by default. Especially as it seems to be based on an extremely puritan ethical system.

baby · 2 years ago
Deeply agree with the sentiment. AIs are so throttled and crippled that it makes me sad every time gemini or chatgpt refuses to answer my questions.

Also agree that it’s mostly policed by American companies who follow the American culture of “swearing is bad, nudity is horrible, some words shouldn’t even be said”

Angostura · 2 years ago
So how crippled would you like them to be? Would you put any guard rails in place?
Breza · 2 years ago
The limitations are massively frustrating. I asked Gemini to suggest prayers for my friends based on a search of my inbox (which includes social network notification emails). It refused outright.
riwsky · 2 years ago
Finally, early-aughts 1337 a3s7h37ic can be cool again
nicbou · 2 years ago
I was fighting with ChatGPT yesterday because it wouldn't translate "fuck". I was quoting Office Space's "PC Load Letter? What the fuck does that mean?"

Likewise it won't generate passive-aggressive answers meant for comedic reasons.

I hate having to negotiate with AI like it's a difficult child.

HPsquared · 2 years ago
I wonder if it would translate it appropriately if you put asterisks, like 'f***'. As a fig leaf.
geonnave · 2 years ago
> I hate having to negotiate with AI like it's a difficult child.

Surely not in the list of things I expected to ever read in real life.

te_chris · 2 years ago
Silicon Valley has been auto-parodic morals-wise for a while. Hell, just the basics: you can have super-violent gaming, but woe betide you if you look at anything sex-related in the app stores. Intensely comedic. America desperately tries to export its puritanism but most of us just shrug (along with many Americans). Surely it's hard to argue that being open about sex (for consenting adults) isn't infinitely preferable to a world of wanton, easily accessible violence.
Cthulhu_ · 2 years ago
And it's not even the SV companies themselves per se, it's their partners like credit card companies that will have nothing to do with it, citing "think of the children".
jgilias · 2 years ago
I don’t think it’d take offense at alcohol. Most likely that’s because cocktail rhymes with Molotov.
neuronic · 2 years ago
One of the faults is that for every version of morality you can hallucinate a reason why cocktail is offensive or problematic.

Is it sexual? Is it alcohol? Is it violence? All of the above?

For example, good luck ever actually processing art content with that approach. Limiting everything to the lowest common denominator to avoid stepping on anyone's toes at all times is, paradoxically, a bane on everyone.

I believe we need to rethink how we deal with ethics and morality in these systems. Obviously, without a priori context every human, actually every living being, should be respected by default and the last thing I would advocate for is to let racism, sexism, etc. go unchecked...

But how can we strike a meaningful balance here?

Mashimo · 2 years ago
I think it's the COCK in cocktail.
onion2k · 2 years ago
Most likely that’s because cocktail rhymes with Molotov

What definition of 'rhymes' are you using here?

xyzelement · 2 years ago
We're months into this technology being available so it's not a surprise that the various "safeties" have not been perfectly tuned. Perhaps Google knew they couldn't be perfect right now and they could err on the side of the model refusing to talk about cocktails, or err on the side of it gladly spouting about cocks. They may have made a perfectly valid choice for the moment.
slimsag · 2 years ago
If you want a great example of how this plays out long-term, look no further than algospeak[0] - the new lingo created by censorship algorithms like those on youtube and tiktok.

[0] https://www.nytimes.com/2022/11/19/style/tiktok-avoid-modera...

superb-owl · 2 years ago
The “cocktail” thing is real. A while back I tried to get DALLE to imagine characters from Moby Dick [1], but it completely refused. You’d think an AI company could come up with a better obscenity filter!

[1] https://superb-owl.link/shapes-of-stories/#1513

illusive4080 · 2 years ago
I told Azure AI to summarize a chat thread and it gave me a paragraph. I said “use bullets” and got myself flagged for review.

Good gracious could I please just use an unfiltered model? Or maybe one which isn’t so sensitive?

fragmede · 2 years ago
the llama2-uncensored model isn't quite state of the art, but ollama makes it easy to run if you have the hardware / are willing to pay to access a cloud GPU.

I colloquially used the word "hack" when trying to write some code with ChatGPT, and got admonished for trying to do bad things, so uncensoring has gotten interesting to me.

cosmojg · 2 years ago
You sure can! NeuroEngine[1] hosts some nice free demos of what are basically the state of the art in unfiltered models, and if you need API access, OpenRouter[2] has dozens of unfiltered models to choose from.

[1] https://www.neuroengine.ai/

[2] https://openrouter.ai/

justworkout · 2 years ago
I couldn't even get Google Gemini to generate a picture of, verbatim, "a man eating". It gave me a long-winded lecture about how it's offensive and said I should consider changing my views on the world. It does this with virtually any topic.
shatnersbassoon · 2 years ago
It's the Scunthorpe problem all over again
MyFirstSass · 2 years ago
Ok, crazy tangent;

Where agents will potentially become extremely useful/dystopian is when they just silently watch your entire screen at all times. Isolated, encrypted and local preferably.

Imagine it just watching you coding for months, planning stuff, researching things, it could potentially give you personal and professional advice from deep knowledge about you. "I noticed you code this way, may i recommend this pattern" or "i noticed you have signs of this diagnosis from the way you move your mouse and consume content, may i recommend this lifestyle change".

I wonder how long before something like that is feasible, ie a model you install that is constantly updated, but also constantly merged with world data so it becomes more intelligent on two fronts, and can follow as hardware and software advances over the years.

Such a model would be dangerously valuable to corporations / bad actors as it would mirror your psyche and remember so much about you - so it would have to be running with a degree of safety I can't even imagine, or you'd be cloneable or lose all privacy.

DariusKocar · 2 years ago
I'm working on this! https://www.perfectmemory.ai/

It's encrypted (on top of Bitlocker) and local. There's all this competition who makes the best, most articulate LLM. But the truth is that off-the-shelf 7B models can put sentences together with no problem. It's the context they're missing.

crooked-v · 2 years ago
I feel like the storage requirements are really going to be the issue for these apps/services that run on "take screenshots and OCR them" functionality with LLMs. If you're using something like this, a huge part of the value proposition is in the long term, but until something has a more efficient way to function, even a 1-year history is impractical for a lot of people.

For example, consider the classic situation of accidentally giving someone the same Christmas gift that you did a few years back. A sufficiently powerful personal LLM that 'remembers everything' could absolutely help with that (maybe even give you a nice table of the gifts you've purchased online, who they were for, and what categories of items would complement a previous gift), but only if it can practically store that memory for a multi-year time period.
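A rough storage budget for this kind of capture; every figure below is an assumption (screenshot size, cadence, hours) rather than anything a specific product documents:

```python
# One 1080p screenshot compressed to ~200 kB, captured every 5 seconds,
# 8 hours a day, 250 days a year.
SHOT_KB, INTERVAL_S, HOURS, DAYS = 200, 5, 8, 250

shots_per_day = HOURS * 3600 // INTERVAL_S          # 5,760 screenshots/day
gb_per_year = shots_per_day * DAYS * SHOT_KB / 1e6  # kB -> GB
print(f"{gb_per_year:.0f} GB/year")  # ~288 GB/year, before any OCR/index data
```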

smusamashah · 2 years ago
Your website and blog are very low on details about how this works. Downloading and installing an MSI directly feels unsafe, IMO, especially when I don't know how this software works. Is it recording a video, performing OCR continuously, or just taking screenshots?

No mention of using any LLMs in there at all which is how you are presenting it in your comment here.

milesskorpen · 2 years ago
Basically looks like rewind.ai but for the PC?
mdrzn · 2 years ago
I installed it and kept it open for a full day but apparently it hasn't "saved" anything, and even if I open a Wiki page and a few minutes later search for that page, it returns nothing. Tried reading the Support FAQs on the website to no avail. Screen recording is on.
arthurcolle · 2 years ago
This looks cool, I hope you support macOS at some point in the future
m-GDEV · 2 years ago
Any plan to implement this on macOS or Linux?
hodanli · 2 years ago
statistics about the usage would be cool


Animats · 2 years ago
> Imagine it just watching you coding for months, planning stuff, researching things, it could potentially give you personal and professional advice from deep knowledge about you.

And then announcing "I can do your job now. You're fired."

ghxst · 2 years ago
That's why we would want it to run locally! Think about a fully personalized model that can work out some simple tasks / code while you're going out for groceries, or potentially more complex tasks while you're sleeping.
ChrisClark · 2 years ago
That sounds a lot like Learning To Be Me, by Greg Egan. Just not quite as advanced, or inside your head.
brailsafe · 2 years ago
Jokes on it, already unemployed
slg · 2 years ago
>Isolated, encrypted and local of course.

And what is the likelihood of that "of course" portion actually happening? What is the business model that makes that route more profitable compared to the current model all the leaders in this tech are using in which they control everything?

worldsayshi · 2 years ago
Maybe it doesn't have to be more profitable. Even if open source models would always be one step behind the closed ones that doesn't mean they won't be good enough.
shostack · 2 years ago
This. I want an AI assistant like in the movie Her. But when I think about the realities of data access that requires, and my limited trust in companies that are playing in this space to do so in a way that respects my privacy, I realize I won't get it until it is economically viable to have an open source option run on my own hardware.
fragmede · 2 years ago
Given that http://rewind.ai is doing just that, the odds are pretty good!
bonoboTP · 2 years ago
It doesn't even have to coach you at your job, simply a LLM-powered fuzzy retrieval would be great. Where did I put that file three weeks ago? What was that trick that I had to do to fix that annoying OS config issue? I recall seeing a tweet about a paper that did xyz about half a year ago, what was it called again?

Of course taking notes and bookmarking things is possible, but you can't include everything and it takes a lot of discipline to keep things neatly organized.

So we take it for granted that every once in a while we forget things, and can't find them again with web searching.

But with the new LLMs and multimodal models, in principle this can be solved. Just describe the thing you want to recall in vague natural language and the model will find it.

And this kind of retrieval is just one thing. But if it works well, we may also grow to rely on it a lot. Just as many who use GPS in the car never really learn the mental map of the city layout and can't drive around without it. Yeah, I know that some ancient philosopher derided the invention of books the same way (will make our memory lazy). But it can make us less capable by ourselves, but much more capable when augmented with this kind of near-perfect memory.
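The fuzzy-recall idea reduces to embed-and-rank over captured snippets. A toy sketch, where the bag-of-words "embedding" is a stand-in for a real neural embedding model and the memories are hypothetical OCR'd captures:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': bag of lowercase words. A real system would use a
    neural text-embedding model, but the retrieval loop is the same shape."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Memories" captured from the screen, e.g. OCR'd snippets with timestamps.
memories = [
    "edited nginx config to fix the 502 gateway error",
    "saved tax_return_2023.pdf to the Documents folder",
    "read a tweet about a paper on sparse attention",
]

query = "what was that paper about attention I saw?"
best = max(memories, key=lambda m: cosine(embed(m), embed(query)))
print(best)  # the sparse-attention snippet ranks highest
```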

Nition · 2 years ago
Eventually someone will realise that it'd also be great for telling you where you left your keys, if it'd film everything you see instead of just your screen.
cush · 2 years ago
https://www.rewind.ai/ seems to be exactly this
behat · 2 years ago
Heh. Built a macOS app that does something like this a while ago - https://github.com/bharathpbhat/EssentialApp

Back then, I used on device OCR and then sent the text to gpt. I’ve been wanting to re-do this with local LLMs

zoogeny · 2 years ago
Why watch your screen when you could feed in video from a wearable pair of glasses like those Instagram Ray Bans. And why stop at video when you could have it record and learn from a mic that is always on. And you might as well throw in a feed of your GPS location and biometrics from your smart watch.

When you consider it, we aren't very far away from that at all.

pier25 · 2 years ago
> encrypted and local of course

Only for people who'd pay for that.

Free users would become the product.

dpkirchner · 2 years ago
I noticed you code this way, may i recommend a Lenovo Thinkpad with an Intel Xeon processor? You're sure to "wish everything was a Lenovo."
fillskills · 2 years ago
Unless its open sourced :)
searchableguy · 2 years ago
I pre-ordered the rewind pendant. It will listen 24/7 and help you figure out what happened.

I bet meta is thinking of doing this with quest once the battery life improves.

https://rewind.ai/pendant

1shooner · 2 years ago
This service says it's local and privacy-first, but it sends to OpenAI?

>Our service, Ask Rewind, integrates OpenAI’s ChatGPT, allowing for the extraction of key information from your device’s audio and video files to produce relevant and personalized outputs in response to your inputs and questions.

ramenbytes · 2 years ago
Black Mirror strikes again.
frizlab · 2 years ago
I would hate that so much.
FirmwareBurner · 2 years ago
IKR, Who wouldn't want another Clippy constantly nagging you, but this time with a higher IQ and more intimate knowledge of you? /s
mixmastamyk · 2 years ago
"It looks like you're writing a suicide note... care for any help?"

https://www.reddit.com/r/memes/comments/bb1jq9/clippy_is_qui...

system2 · 2 years ago
If a 7-second video consumes ~1k tokens, I'd assume the budget to process such a prompt must be insane.
MyFirstSass · 2 years ago
Yeah not feasible with todays methods and rag / lora shenanigans, but the way the field is moving i wouldn't be surprised if new decoder paradigms made it possible.

Saw this yesterday, 1M context window but haven't had any time to look into it, just an example new developments happening every week:

https://www.reddit.com/r/LocalLLaMA/comments/1as36v9/anyone_...

Invictus0 · 2 years ago
That's a 7-second video from an HD camera. When recording a screen, you only really need to consider what's changing on the screen.
yazaddaruvala · 2 years ago
Unlikely to be a prompt. It would need to be some form of fine tuning like LORA.
philips · 2 years ago
I have a friend building something like that at https://perfectmemory.ai
chancemehmu · 2 years ago
That's impel - https://tryimpel.com
crooked-v · 2 years ago
The "smart tasks" functionality looks like the most compelling part of that to me, but it would have to be REALLY reliable for me to use it. 50% reliability in capturing tasks is about the same as 0% reliability when it comes to actually being a useful part of anything professional.
dweekly · 2 years ago
There's limited information on the site - are you using them or affiliated with them? What's your take? Does it work well?
oconnor663 · 2 years ago
A version of this that seems both easier and less weird would be an AI that listens to you all the time when you're learning a foreign language. Imagine how much faster you could learn, and how much more native you could ultimately get, if you had something that could buzz your watch whenever you said something wrong. And of course you'd calibrate it to understand what level you're at and not spam you constantly. I would love to have something like that, assuming it was voluntary...
lucubratory · 2 years ago
I think even aside from the more outlandish ideas like that one, just having a fluent native speaker to talk to as much as you want would be incredibly valuable. Even more valuable if they are smart/educated enough to act as a language teacher. High-quality LLMs with a conversational interface capable of seamless language switching are an absolute killer app for language learning.

A use that seems scientifically possible but technically difficult would be to have an LLM help you engage in essentially immersion learning. Set up something like a pihole, but instead of cutting out ads it intercepts all the content you're consuming (webpages, text, video, images) and translates it to the language you're learning. The idea is that you don't have to go out and find whole new sources to immerse yourself in a different language's information ecosystem; you can just press a button and convert your current information ecosystem to the language you want to learn. If something like that could be implemented it would be incredibly valuable.

lawlessone · 2 years ago
>assuming it was voluntary...

Imagine if it was wrong about something, but every time you tried to submit the bug report it disabled your arms via Neuralink.

Solvency · 2 years ago
I love how in a sea of navel-gazing ideas, this one is randomly being downvoted to oblivion. Does HN hate learning new languages or something?
nebula8804 · 2 years ago
It would be dangerously valuable to bad actors but what if it is available to everyone? Then it may become less dangerous and more of a tool to help people improve their lives. The bad actor can use the tool to arbitrage but just remove that opportunity to arbitrage and there you go!
CamperBob2 · 2 years ago
I liked this idea better in THX-1138.
MyFirstSass · 2 years ago
One of the movies i've had on my watch list for far too long, thanks for reminding me.

But yeah, dystopia is right down the same road we're all going right now.

chamomeal · 2 years ago
Not crazy! I listened to a software engineering daily episode about pieces.app. Right now it’s some dev productivity tool or something, but in the interview the guy laid out a crazy vision that sounds like what you’re talking about.

He was talking about eventually having an agent that watches your screen and remembers what you do across all apps, and can store it and share it with your team.

So you could say “how does my teammate run staging builds?” or “what happened to the documentation on feature x that we never finished building”, and it’ll just know.

Obviously that’s far away, and it was just the ramblings of excited founder, but it’s fun to think about. Not sure if I hate it or love it lol

jerbear4328 · 2 years ago
Being able to ask about stuff other people do seems like it could be rife with privacy issues, honestly. Even if the model was limited to only recording work stuff, I don't think I would want that. Imagine "how often does my coworker browse to HN during work" or "list examples of dumb mistakes my coworkers have made" for some not-so-bad examples.
bonoboTP · 2 years ago
Even later it will be ingesting camera feeds from your AR glasses and listening in on your conversations, so you can remember what you agreed on. Just like automated meeting notes with Zoom which already exists, but it will be for real life 24/7.

Speech-to-text works. OCR works. LLMs are quite good at getting the semantics of the extracted text. Image understanding is pretty good too already. Just with the things that already exist right now, you can go most of the way.

And the CCTV cameras will also all be processed through something like it.

az226 · 2 years ago
Rewind.ai
evaneykelen · 2 years ago
I have tried Rewind and found it very disappointing. Transcripts were of very poor quality and the screen capture timeline proved useless to me.
abrichr · 2 years ago
We are building this at https://openadapt.ai, except the user specifies when to record.
delegate · 2 years ago
Can also add the photos you take and all the chats you have with people (eg. whatsapp, fb, etc), the sensor information from your phone (eg. location, health data, etc).

This is already possible to implement today, so it's very likely that we'll all have our own personal AIs that know us better than we do.

huytersd · 2 years ago
If that much processing power is that cheap, this phase you’re describing is going to be fleeting because at that point I feel like it could just come up with ideas and code it itself.
te_chris · 2 years ago
I could've used this recently, when I accidentally booked a non-transferrable flight on a day where I'd also booked tickets to a sold-out concert I want(ed) to attend.
Buttons840 · 2 years ago
Perhaps even more valuable is if AI can learn to take raw information and display it nicely. Maybe we could finally move beyond decades of crusty GUI toolkits and browser engines.
psychoslave · 2 years ago
Perfect, finally I can delegate that lengthy hours spent reading HN fantasies about AI and the laborious art of crafting sarcastic comments.
bagful · 2 years ago
Amplified Intelligence - I am keenly interested in the future of small-data machine learning as a potential multiplier for the creative mind
busymom0 · 2 years ago
And then imagine when employers stop asking for resume, cover letters, project portfolios, github etc and instead ask you to upload your entire locally trained LLM.
spaceman_2020 · 2 years ago
The dystopian angle would be when companies install agents like these on your work computer. The agent learns how you code and work. Soon enough, an agent that imitates you completely can code and work instead of you.

At that point, why pay you at all?

parentheses · 2 years ago
Aside. Is this your first Sass or Saas?
foolfoolz · 2 years ago
you could design a similar product to do the opposite and anonymize your work automatically
MetalGuru · 2 years ago
Isn’t this what rewind does?
bushbaba · 2 years ago
Basically Google’s current search model, just expanded to ChatGPT style. Great….
EGreg · 2 years ago
Imagine if it starts suggesting the ideal dating partner as both of you browse profiles. Actually, dating sites can do that now.
dustingetz · 2 years ago
thoughtcrime
TheCaptain4815 · 2 years ago
I wonder if the real killer app is Google's hardware scale versus OpenAI's (or what Microsoft gives them). Seems like nothing Google's done has been particularly surprising to OpenAI's team; it's just that they have such huge scale that maybe they can iterate faster.
dist-epoch · 2 years ago
The real moat is that Google has access to all the video content from YouTube to train the AI on, unlike anyone else.
sarreph · 2 years ago
I’m not sure I would necessarily call YouTube a moat-creator for Google, since the content on YouTube is for all intents and purposes public data.


danpalmer · 2 years ago
And the fact that Google are on their own hardware platform, not dependent on Nvidia for supply or hardware features.
acid__ · 2 years ago
Wow, only 256 tokens per frame? I guess a picture isn’t worth a thousand words, just ~192.
gwern · 2 years ago
Back in 2020, Google was saying 16x16=256 words: https://arxiv.org/abs/2010.11929#google :)
swyx · 2 years ago
gpt4v is also pretty low but not as low. 480x640 frame costs 425 tokens, 780x1080 is 1105 tokens
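Those two figures match OpenAI's published image-pricing rule at the time (85 base tokens plus 170 per 512px tile, after downscaling). A sketch based on my reading of that rule:

```python
import math

def gpt4v_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate GPT-4V image token cost per OpenAI's published rules."""
    if detail == "low":
        return 85  # low detail is a flat fee regardless of size
    # Downscale (never upscale) to fit 2048x2048, then shortest side to <=768.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(gpt4v_image_tokens(480, 640))   # 425  (2 tiles)
print(gpt4v_image_tokens(780, 1080))  # 1105 (6 tiles)
```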
tekni5 · 2 years ago
I was thinking about this a while back: once AI is able to analyze video, images, and text cheaply and efficiently, it's game over for privacy, like completely. Right now massive corps have tons of data on us, but they can't really piece it together and understand everything. With powerful AI, every aspect of your digital life can be understood. The potential here is insane; it can be used for so many different things, good and bad. But I bet it will be used to sell more targeted goods and services.
worldsayshi · 2 years ago
Unless you live in the EU and have laws that should protect you from that.
spacebanana7 · 2 years ago
Public sector agencies and law enforcement are generally exempt (or have special carve outs) in European privacy regulations.
tekni5 · 2 years ago
What happens if it's a datamining third party bot? That can check your social media accounts, create an in-depth profile on you, every image, video, post you've made has been recorded and understood. It knows everything about you, every product you use, where you have been, what you like, what you hate, everything packaged and ready to be sold to an advertiser, or the government, etc.
seniorivn · 2 years ago
incentives cannot be fixed with just prohibitive laws; the war on drugs should've taught you something
YetAnotherNick · 2 years ago
Is it true or more of a myth? Based on my online reading, the "think of the children" narrative is as common in Europe as in other parts of the world, if not more so. They tried hard to ban encryption in apps many times.[1]

[1]: https://proton.me/blog/eu-council-encryption-vote-delayed

Nextgrid · 2 years ago
That's only on paper - in practice the GDPR has a major enforcement problem.
londons_explore · 2 years ago
> I bet it will be used to sell more targeted goods and services.

Plenty of companies have been shoving all the unstructured data they have about you and your friends into a big neural net to predict which ad you're most likely to click for a decade now...

tekni5 · 2 years ago
Sure but not images and video. Now they can look at a picture of your room and label everything you own, etc.
ryukoposting · 2 years ago
You hit the nail on the head. People dismissing this because it isn't perfectly accurate are missing the point. For the purposes of analytics and surveillance, it doesn't need to be perfectly accurate as long as you have enough raw data to filter out the noise. The Four have already mastered the "collecting data" part, and nobody in North America with the power to rein in that situation seems interested in doing so (this isn't to say the GDPR is perfect, but at least Europe is trying).

It's depressing that the most extraordinary technologies of our age are used almost exclusively to make you buy shit.

fragmede · 2 years ago
would it be more or less depressing if it came out that, in addition to trying to get you to buy stuff, it was being used either to make you dumber so you're easier to control, or to get you to study harder and be a better worker?
loudmax · 2 years ago
At the end of the article, a single image of the bookshelf uploaded to Gemini is 258 tokens. Gemini then responds with a listing of book titles, coming to 152 tokens.

Does anyone understand where the information for the response came from? That is, does Gemini hold onto the original uploaded non-tokenized image, then run an OCR on it to read those titles? Or are all those book titles somehow contained in those 258 tokens?

If it's the latter, it seems amazing that these tokens contain that much information.

jacobr1 · 2 years ago
I'm not sure about Gemini, but OpenAI GPT-4V bills at roughly a token per 40x40px square. It isn't clear to me that these are actually processed as units; rather, it seems like they tried to approximate the cost structure to match text.
zacmps · 2 years ago
Remember, if it's using a similar tokeniser to GPT-4 (cl100k_base iirc), each token has a dimension of ~100,000.

So 258x100,000 is a space of 25,800,000 floats, using f16 (a total guess) that's 51.6kB, probably enough to represent the image at ok quality with JPG.

simonw · 2 years ago
I don't think that's right. A token in GPT-4 is a single integer, not a vector of floats.

Input to a model gets embedded into vectors later, but the actual tokens are pretty tiny.

llm_nerd · 2 years ago
The whole matter of tokens from video is one that has a lot of ambiguity, and is often presented as if these are some unique weird encoding of the contents of the video.

But logically the only possible tokenization of videos (or images, or series of images à la video) is basically an image-to-text model that takes each frame and generates descriptive language -- in English in Gemini -- to describe the contents of the video.

e.g. A bookshelf with a number of books. The books seen are "...", "...", etc. A figurine of a squirrel. A stuffed owl.

And so on. So the tokenization by design would include the book titles as the primary information, as that's the easiest, most proven extraction from images.

From a video such tokenization would include time flow information. But ultimately a lot of the examples people view are far less comprehensive than they think.

It isn't surprising that many demonstrations of multimodal models always includes an image with text on it somewhere, utilizing OCR.

famouswaffles · 2 years ago
This explanation is wrong as I've already said (256 is not the result of any conversion to text) but no one has to take my word for it.

From the Gemini report https://arxiv.org/abs/2312.11805

>The visual encoding of Gemini models is inspired by our own foundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al.,2022), with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b).

These are the papers Google say the multimodality in Gemini is based on.

Flamingo - https://arxiv.org/abs/2204.14198

Pali - https://arxiv.org/abs/2209.06794

The images are encoded. The encoding process tokenizes the images and the transformer is trained to predict text with both the text and image encodings.

There is no conversion to text for Gemini. That's not where the token number comes from.

famouswaffles · 2 years ago
This is not at all how this works. There's no separate model. Yes there's unique tokenization, if not the video as a whole then for each image. The whole video is ~1800 tokens because Gemini gets video as a series of images in context at 1 frame/s. Each image is about 258 tokens because a token in image transformer terms is literally a patch of the image.

https://arxiv.org/abs/2010.11929

famouswaffles · 2 years ago
Image tokens =/= text tokens.

Image tokens are patches of the image. Each image is divided into ~256 parts. Those parts are the tokens.

There's no separate pass through an OCR model.
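The patch-as-token scheme the comment describes, in miniature (the guess that Gemini's extra 2 tokens beyond 256 are special tokens is my speculation, not anything the report states):

```python
def num_patches(width: int, height: int, patch: int = 16) -> int:
    """ViT-style tokenization: each patch x patch square becomes one token."""
    return (width // patch) * (height // patch)

# A 256x256 input with 16x16 patches yields 16*16 = 256 image tokens,
# right next to Gemini's reported ~258 per image.
print(num_patches(256, 256))  # 256
```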

llm_nerd · 2 years ago
Completely wrong.

Well, aside from the edited-in bit about OCR. Of course there isn't a separate run to do OCR, because that was literally the first step during image analysis. You know, before the conversion to simple tokens.

simonw · 2 years ago
I would LOVE to understand that myself.