I'm curious as to why you chose to break out specific headers in the schema.
For example, you have recipients, subject, and sender as JSON fields, when you could have just a headers field with all of them, and even add the rest of the headers in the message.
If it's performance-related, you can still have headers as a single JSON blob and then use generated columns for the specific fields.
For example:
CREATE TABLE IF NOT EXISTS "messages" (
  "id" INTEGER NOT NULL PRIMARY KEY, -- internal id
  "message_id" TEXT NOT NULL, -- Gmail message id
  "thread_id" TEXT NOT NULL, -- Gmail thread id
  "headers" JSON NOT NULL, -- JSON object of { "header": value }
  "subject" TEXT GENERATED ALWAYS AS (json_extract("headers", '$.Subject')) VIRTUAL NOT NULL
  -- ...
);
CREATE INDEX subjectidx ON messages(subject);
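To illustrate with hypothetical sample values: inserting a row populates the virtual column automatically, and an equality lookup on subject can then use the index.
-- Hypothetical ids/headers, just for illustration
INSERT INTO messages (message_id, thread_id, headers)
VALUES ('msg-001', 'thr-001', json('{"Subject": "Hello", "From": "alice@example.com"}'));
SELECT id, message_id FROM messages WHERE subject = 'Hello'; -- can use subjectidx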
I've found this model really powerful, as it allows users to just ALTER TABLE to add indexed generated columns as they need for their specific queries. For example, if I wanted to query DKIM status, it's as simple as:
ALTER TABLE messages ADD COLUMN dkim TEXT GENERATED ALWAYS AS (json_extract("headers", '$."Dkim-Signature"')) VIRTUAL;
CREATE INDEX dkimidx ON messages(dkim);
SELECT dkim, COUNT(0) FROM messages GROUP BY dkim;
Adding indexes to support a one-off query seems like bad practice?
In general I prefer to break out columns that I expect to have/use consistently, especially for something as stable as email headers. Maybe schema changes are a bit easier with a headers column, but IMO it's just trading pain on write for pain on read (while leaving the door open to stuff failing silently).
I reach for a similar pattern a lot with Postgres as I'm building up a system: start by thinking about the fields I know I want and create the tables with them, then store all the metadata I have lying around in a JSON column. Then, two months later, when I realize which fields I actually need, populate them from the JSON and make my API keep them up to date, or make a view, or whatever.
I've found it really helpful for avoiding the growing pains that come with "just shove it all in mongo" or "just put it on the file system", at not much cost.
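A minimal sketch of that progression (the events table and its fields are hypothetical):
-- Start out: the fields I know I want as columns, everything else in JSONB
CREATE TABLE events (
  id BIGSERIAL PRIMARY KEY,
  kind TEXT NOT NULL,
  metadata JSONB NOT NULL DEFAULT '{}'
);
-- Two months later: promote a field that turned out to matter
ALTER TABLE events ADD COLUMN user_id TEXT;
UPDATE events SET user_id = metadata->>'user_id';
CREATE INDEX events_user_id_idx ON events (user_id);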
Hey this is really neat! It's like those disk usage visualizers, except that it seems to focus on the total volume of the mail rather than the disk usage.
Is there a size option too? To see which senders are using most of my storage.
(Also your website's SSL certificate has expired.)
No, currently not. It would be easy to add, though. I haven't updated the tool in a while (after using it to clean up my Gmail inbox). Thanks for pointing out the certificate!
Funnily enough, the gmvault.org domain _that_ page points to is simply a parked GoDaddy placeholder. It's also not been updated in 10+ years except for two non-source files.
This looks interesting. I've DIY'd something similar with qdirstat before, but you need to arrange your emails a certain way (like dated folders) and can't re-slice with different criteria.
On the other hand, qdirstat "cache" files are really easy to generate, so they can be used for visualizing a bunch of file-like things.
I really lament that you cannot sign in even with an application-specific password any more and need to get an OAuth client and go through an OAuth flow. It's my email, but Google takes away an open standard even for me to access it myself.
Given the amount of spam I receive on my free Gmail addresses (compared to my paid-for freelance one), and the amount of spam I receive from Gmail servers on my non-Gmail email accounts, I get more and more inclined towards degoogling myself.
Especially as I receive more and more reports that my freelance email is put into spam by recipient systems.
Not sure how to get rid of my Google ecosystem routines, though. Feels daunting.
I set up Postfix to catch *@immibis.com. I use it for some less important things - first mailing list subscriptions, then I even used it to buy festival tickets. These are lower-risk things. If they don't work then it's not a big deal. In the latter case I'd be out $200 and not be able to go to that festival (which did actually happen, but not because of my email server, but because they tried to invent a hidden fee after I already paid, and I'll have to go to small claims court to get a refund). Now that I know it works, I use it by default for new less-important account signups. (And nobody's questioned me yet why the local-part of my email address is the name of their business)
I still wouldn't use *@immibis.com for my bank account. I'd use gmail for that. The bank is a corporation. If there's a problem between them and my email server, they'll tell me to suck it up, then delete my money. If there's a problem between them and Google's server they'll be forced to fix it. If there's a problem with my Google account, I can go to the bank office and say "Google banned me from Google, so I need to link to a different account" and they'll have a procedure for that. They won't have a procedure for "your mail server sends LF when it should be CRLF" or whatever weird issue could occur between them and a self-hosted mail server. But if my bank account was the last thing remaining on Google, in practice, it would still be a successful email de-googling. 99% is a pretty good success rate. The bank app runs on Android, anyway. Could switch banks and only do banking in person.
I find Youtube a good source of entertaining and informative content (certainly way better than something like Instagram) and I haven't replaced that yet.
After Mozilla jumped the shark and declared they hate privacy, I've been gradually moving things over to Zen Browser, which is based on Firefox. (I don't care that Zen isn't significantly more private than Firefox; I care there's someone in between me and Mozilla and that isn't Google)
POP3/IMAP work with any client that supports those protocols.
OAuth really doesn't. Every OAuth integration I've ever built always feels like it needs a tiny bit of custom development.
Also, the OAuth flow is usually absolutely horrible when you're trying to get a token for accessing your own data. I've had to spin up a temporary web app to handle a bunch of redirects just to get my own token!
Probably because if you get API access to someone's email account, it is game over. And people are stupid, so some of them are going to click yes to some scammy app. And then they will blame Google for not protecting them.
This isn't exactly what you're asking for, but Google offers a service called Takeout that lets you request and download backups of all your data from their services, including Gmail.
I have a reminder to trigger this every few months and update my local backup. If I recall correctly, it comes as a gzipped mbox file.
You can also use an IMAP client and set it to offline/download mode so it downloads everything and saves it locally. I think "offline mode" is what it's called in Evolution--not sure what Thunderbird or other clients call it.
FWIW, for several years I tried backing up my Gmail account with IMAP (including some tools made specifically for Gmail): it never succeeded. The best syncer ran for a month, and then it hit some mails that it simply couldn't retrieve. I guess they were in storage that was too cold and the requests timed out? I don't know.
So I can understand why using Google's proprietary API might work better (or not, I don't know).
Anyway, as a sibling comment says, nowadays Google Takeout includes mbox and works properly (and is pretty fast, like half a day), but doesn't allow continuous updates.
And I migrated to another mail provider (Infomaniak), and I've thanked myself for having set up my own mail domain name years earlier.
I had the same problem when I switched off Google. I didn't have a ton of data, and I just wanted content for past search purposes, so I didn't dig into how the data would be transformed but I can at least offer my scuffed solution.
I installed a third-party client (Thunderbird, but I imagine any would work) on a local box, signed in with both emails, and just copied the mail over from one to the other. Low-tech, but it worked quite well. I may have forced some local cache/download for the original email, but I can't recall. I'll check later if it preserves headers and the like. I assume it would, but it wasn't that important to me.
I actually thought about writing at some point about the process of getting off gmail and all the funny things I ran across.
It got a lot worse recently when they added bad AI to it. Now it does dumb synonym matching. Like, I search for "doctorate" and it starts highlighting every instance of "d" alone, like in the word "he'd". (Presumably trying to pick up Ph.D.?) For a while, searches for "A" would have it highlighting "the", too.
It's the least bad search--better than Yahoo, and Thunderbird desktop in my direct experience. However, I don't download the full message into Thunderbird out of fear of blasting through gmail bandwidth limits.
I find it useful to create indexes like this, then create VIEWs using these expressions instead of ALTERing the main table with generated columns.
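A sketch of that approach (names are made up; SQLite has supported indexes on expressions since 3.9, and a query can use such an index when its expression matches exactly):
-- Index the raw JSON expression instead of adding a generated column
CREATE INDEX dkim_expr_idx ON messages (json_extract("headers", '$."Dkim-Signature"'));
-- Expose the extracted field through a view; the view's expression matches the index
CREATE VIEW messages_dkim AS
SELECT id, message_id, thread_id,
  json_extract("headers", '$."Dkim-Signature"') AS dkim
FROM messages;
SELECT dkim, COUNT(*) FROM messages_dkim GROUP BY dkim;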
https://github.com/terhechte/postsack
You don't get it done by moping about it, but by doing.
With an app password you have full IMAP access.
App passwords no longer exist on Google.
Gmail to SQLite describes 6 steps to get credentials working, but that wasn't true for me. After the 6 steps:
- Google said that my app was not published, so I published it
- Google said the app cannot be internal, because I am not a Workspace user
- for external apps, it then said I cannot use the app until it is verified
- in verification, they wanted to know my domain, address, and other details
- they wanted my justification for the scopes
- they wanted a video explaining how the app is going to be used
- they will take some time to verify the data I provided them
It all looks like a maze of settings, where requiring any of your users to jump through the hoops Google demands is simply too much.
Links:
[0] https://github.com/rumca-js/Django-link-archive
Does anybody have insight as to why it’s so bad?
When those inevitably get used for nefarious purposes, Google's image suffers as a result.
Don't jump through their hoops.
GYB (Got Your Back):
- Open source
- Resume support (so backups/restores will eventually complete)
Honorable mention: https://www.mailstore.com/en/products/mailstore-home/
- Not open source
- GUI with index: nice for searching mail locally
- Resume only for backup (so large restores generally fail)
I've used https://github.com/gaubert/gmvault for a long time and it's worked great. But it seems like GYB is actively maintained, so maybe I should switch.
IMAP is much harder, and much slower, and is bound by Google's bandwidth limits.
https://imapsync.lamiral.info/
https://marcoapp.io