I'm curious as to why you chose to break out specific headers in the schema.
For example, you have recipients, subject, and sender as JSON fields, when you could have just a headers field with all of them, and even add the rest of the headers in the message.
If it's performance-related, you can still have headers as a single JSON blob and then use generated columns for the specific fields.
For example:
CREATE TABLE IF NOT EXISTS "messages" (
  "id" INTEGER NOT NULL PRIMARY KEY, -- internal id
  "message_id" TEXT NOT NULL, -- Gmail message id
  "thread_id" TEXT NOT NULL, -- Gmail thread id
  "headers" JSON NOT NULL, -- JSON object of { "header": value }
  "subject" TEXT GENERATED ALWAYS AS (json_extract("headers", '$.Subject')) VIRTUAL NOT NULL
  -- ...
);
CREATE INDEX subjectidx ON messages(subject);
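To illustrate with hypothetical sample values: inserting a row populates the virtual column automatically, and an equality lookup on subject can then use the index.
-- Hypothetical ids/headers, just for illustration
INSERT INTO messages (message_id, thread_id, headers)
VALUES ('msg-001', 'thr-001', json('{"Subject": "Hello", "From": "alice@example.com"}'));
SELECT id, message_id FROM messages WHERE subject = 'Hello'; -- can use subjectidx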
I've found this model really powerful, as it allows users to just ALTER TABLE to add indexed generated columns as they need for their specific queries. For example, if I wanted to query DKIM status, it's as simple as:
ALTER TABLE messages ADD COLUMN dkim TEXT GENERATED ALWAYS AS (json_extract("headers", '$."Dkim-Signature"')) VIRTUAL;
CREATE INDEX dkimidx ON messages(dkim);
SELECT dkim, COUNT(0) FROM messages GROUP BY dkim;
Adding indexes to support a one-off query seems like bad practice?
In general I prefer to break out columns that I expect to have/use consistently, especially for something as stable as email headers. Maybe schema changes are a bit easier with a headers column, but IMO it's just trading pain on write for pain on read (while leaving the door open to stuff failing silently).
I reach for a similar pattern a lot with Postgres as I'm building up a system: start by thinking about the fields I know I want and create the tables with them, then store all the metadata I have lying around in a JSON column. Then, two months later, when I realize which fields I actually need, populate them from the JSON and make my API keep them up to date, or make a view, or whatever.
I've found it really helpful for avoiding the growing pains that come with "just shove it all in mongo" or "just put it on the file system", at not much cost.
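A minimal sketch of that progression (the events table and its fields are hypothetical):
-- Start out: the fields I know I want as columns, everything else in JSONB
CREATE TABLE events (
  id BIGSERIAL PRIMARY KEY,
  kind TEXT NOT NULL,
  metadata JSONB NOT NULL DEFAULT '{}'
);
-- Two months later: promote a field that turned out to matter
ALTER TABLE events ADD COLUMN user_id TEXT;
UPDATE events SET user_id = metadata->>'user_id';
CREATE INDEX events_user_id_idx ON events (user_id);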
Hey this is really neat! It's like those disk usage visualizers, except that it seems to focus on the total volume of the mail rather than the disk usage.
Is there a size option too? To see which senders are using most of my storage.
(Also your website's SSL certificate has expired.)
No, currently not. It would be easy to add, though. I haven't updated the tool in a while (after using it to clean up my Gmail inbox). Thanks for pointing out the certificate!
Funnily enough, the gmvault.org domain _that_ page points to is simply a parked GoDaddy placeholder. It's also not been updated in 10+ years except for two non-source files.
This looks interesting. I've DIY'd something similar with qdirstat before, but you need to arrange your emails a certain way (like dated folders) and can't re-slice with different criteria.
On the other hand, qdirstat "cache" files are really easy to generate, so they can be used for visualizing a bunch of file-like things.
I really lament that you cannot sign in even with an application-specific password any more and need to get an OAuth client and go through an OAuth flow. It's my email, but Google takes away an open standard even for me to access it myself.
Given the amount of spam I receive on my free Gmail addresses (compared to my paid-for freelance one), and the amount of spam I receive from Gmail servers on my non-Gmail email accounts, I get more and more inclined towards degoogling myself.
Especially as I receive more and more reports that my freelance email is put into spam by recipient systems.
Not sure how to get rid of my Google ecosystem routines, though. Feels daunting.
I set up Postfix to catch *@immibis.com. I use it for some less important things - first mailing list subscriptions, then I even used it to buy festival tickets. These are lower-risk things. If they don't work then it's not a big deal. In the latter case I'd be out $200 and not be able to go to that festival (which did actually happen, but not because of my email server, but because they tried to invent a hidden fee after I already paid, and I'll have to go to small claims court to get a refund). Now that I know it works, I use it by default for new less-important account signups. (And nobody's questioned me yet why the local-part of my email address is the name of their business)
I still wouldn't use *@immibis.com for my bank account. I'd use gmail for that. The bank is a corporation. If there's a problem between them and my email server, they'll tell me to suck it up, then delete my money. If there's a problem between them and Google's server they'll be forced to fix it. If there's a problem with my Google account, I can go to the bank office and say "Google banned me from Google, so I need to link to a different account" and they'll have a procedure for that. They won't have a procedure for "your mail server sends LF when it should be CRLF" or whatever weird issue could occur between them and a self-hosted mail server. But if my bank account was the last thing remaining on Google, in practice, it would still be a successful email de-googling. 99% is a pretty good success rate. The bank app runs on Android, anyway. Could switch banks and only do banking in person.
I find Youtube a good source of entertaining and informative content (certainly way better than something like Instagram) and I haven't replaced that yet.
After Mozilla jumped the shark and declared they hate privacy, I've been gradually moving things over to Zen Browser, which is based on Firefox. (I don't care that Zen isn't significantly more private than Firefox; I care there's someone in between me and Mozilla and that isn't Google)
POP3/IMAP work with any client that supports those protocols.
OAuth really doesn't. Every OAuth integration I've ever built always feels like it needs a tiny bit of custom development.
Also, the OAuth flow is usually absolutely horrible when you're trying to get a token for accessing your own data. I've had to spin up a temporary web app to handle a bunch of redirects just to get my own token!
Probably because if you get API access to someone's email account, it is game over. And people are stupid, so some of them are going to click yes to some scammy app. And then they will blame Google for not protecting them.
This isn't exactly what you're asking for, but Google offers a service called Takeout that lets you request and download backups of all your data from their services, including Gmail.
I have a reminder to trigger this every few months and update my local backup. If I recall correctly, it comes as a gzipped mbox file.
You can also use an IMAP client and set it to offline/download mode so it downloads everything and saves it locally. I think "offline mode" is what it's called in Evolution--not sure what Thunderbird or other clients call it.
FWIW, for several years I tried backing up my Gmail account with IMAP (including some tools made specifically for Gmail): it never succeeded. The best syncer ran for a month, and then it hit some mails that it simply couldn't retrieve. I guess they were in storage that was too cold and the requests timed out? I don't know.
So I can understand why using Google's proprietary API might work better (or not, I don't know).
Anyway, as a sibling comment says, nowadays Google Takeout includes mbox and works properly (and is pretty fast, like half a day), but doesn't allow continuous updates.
And I migrated to another mail provider (Infomaniak), and I've thanked myself for having set up my own mail domain name years earlier.
I had the same problem when I switched off Google. I didn't have a ton of data, and I just wanted content for past search purposes, so I didn't dig into how the data would be transformed but I can at least offer my scuffed solution.
I installed a third-party client (Thunderbird, but I imagine any would work) on a local box, signed in with both emails, and just copied the mail over from one to the other. Low-tech, but it worked quite well. I may have forced some local cache/download for the original email, but I can't recall. I'll check later if it preserves headers and the like. I assume it would, but it wasn't that important to me.
I actually thought about writing at some point about the process of getting off gmail and all the funny things I ran across.
It got a lot worse recently when they added bad AI to it. Now it does dumb synonym matching. Like, I search for "doctorate" and it starts highlighting every instance of "d" alone, like in the word "he'd". (Presumably trying to pick up Ph.D.?) For a while, searches for "A" would have it highlighting "the", too.
It's the least bad search--better than Yahoo, and Thunderbird desktop in my direct experience. However, I don't download the full message into Thunderbird out of fear of blasting through gmail bandwidth limits.
I find it useful to create indexes like this, then create VIEWs using these expressions instead of ALTERing the main table with generated columns.
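A sketch of that approach (names are made up; SQLite has supported indexes on expressions since 3.9, and a query can use such an index when its expression matches exactly):
-- Index the raw JSON expression instead of adding a generated column
CREATE INDEX dkim_expr_idx ON messages (json_extract("headers", '$."Dkim-Signature"'));
-- Expose the extracted field through a view; the view's expression matches the index
CREATE VIEW messages_dkim AS
SELECT id, message_id, thread_id,
  json_extract("headers", '$."Dkim-Signature"') AS dkim
FROM messages;
SELECT dkim, COUNT(*) FROM messages_dkim GROUP BY dkim;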
https://github.com/terhechte/postsack
You don't get it done by moping about it, but by doing.
With an app password you have full IMAP access.
App passwords no longer exist on Google.
Gmail to SQLite describes 6 steps to get credentials working, but that wasn't true for me. After the 6 steps:
- Google said that my app was not published, so I published it
- Google said the app cannot be internal, because I am not a Workspace user
- for external apps, it then said I cannot use the app until it is verified
- in verification, they wanted to know my domain, address, and other details
- they wanted my justification for the scopes
- they wanted a video explaining how the app is going to be used
- they will take some time to verify the data I provided them
It all looks like a maze of settings, where requiring any of your users to jump through the hoops Google demands is simply too much.
Links:
[0] https://github.com/rumca-js/Django-link-archive
Does anybody have insight as to why it’s so bad?
When those inevitably get used for nefarious purposes, Google's image suffers as a result.
Don't jump through their hoops.
GYB (Got Your Back):
- Open source
- Resume support (so backups/restores will eventually complete)
Honorable mention: https://www.mailstore.com/en/products/mailstore-home/
- Not open source
- GUI with index: nice for searching mail locally
- Resume only for backup (so large restores generally fail)
I've used https://github.com/gaubert/gmvault for a long time and it's worked great. But it seems like GYB is actively maintained, so maybe I should switch.
IMAP is much harder, and much slower, and is bound by Google's bandwidth limits.
https://imapsync.lamiral.info/
https://marcoapp.io