Readit News
httparchive commented on Warning: $14k BigQuery charge in 2 hours   discuss.httparchive.org/t... · Posted by u/nothttparchive
httparchive · 2 years ago
Update: Google has been helping me out now, thankfully. Hopefully we can make sure this doesn't happen to others.
httparchive commented on Warning: $14k BigQuery charge in 2 hours   discuss.httparchive.org/t... · Posted by u/nothttparchive
mulmen · 2 years ago
I wouldn’t expect either of those filters to utilize a partition key if one exists. So yeah, you probably did a full table scan every time. Is the partitioning documented somewhere?
httparchive · 2 years ago
Yeah, 'LIKE' ops usually give you a full table scan, which is brutal. If it was my own data I'd chop the fields up and index them properly. That's the issue here: it's not your data, so you don't get a say in the indexes, but you still pay for every byte scanned even though you can't apply an index of your own.
nothttparchive commented on Warning: $14k BigQuery charge in 2 hours   discuss.httparchive.org/t... · Posted by u/nothttparchive
jeffparsons · 2 years ago
IANAL, but if this happened to me I would be gathering as many examples as I could of this having happened to other people. The angle being: Google knows this is a huge issue. Effectively, they know that they have (presumably accidentally) created a really dangerous trap for small players, and have chosen to do nothing about it.

In some jurisdictions I think that reduces the legitimacy of their claim that you actually owe them money.

EDIT: Even better, focus on the examples where Google "forgave" the debt; you could argue that those examples prove that Google knows it's at least partly their fault.

nothttparchive · 2 years ago
The FTC is already investigating: https://www.ftc.gov/policy/advocacy-research/tech-at-ftc/202...

I think we (the developer community) need to start pushing back against this abuse, it's getting out of control.

The thing that bothers me the most is I caught this $14k charge b/c I'm a small fry and that money matters to me. How many big accounts just wouldn't notice that? I can't help but think a very non-trivial % of all cloud revenue is just obscure fees that nobody notices - engineers doing the engineering, accounts receivable pays the bills, and the cloud providers get fat.

httparchive commented on Warning: $14k BigQuery charge in 2 hours   discuss.httparchive.org/t... · Posted by u/nothttparchive
mulmen · 2 years ago
Can you be more specific? What filtering did you apply? How many columns did you select?
httparchive · 2 years ago
SELECT page, url, payload FROM `{table}` WHERE page like '%{site_domain}/%' AND url like '%[EXAMPLE.COM]%'
httparchive commented on Warning: $14k BigQuery charge in 2 hours   discuss.httparchive.org/t... · Posted by u/nothttparchive
johnnyo · 2 years ago
Can you post what a $14,000 SQL query looks like?

If nothing else, it can be an example in my SQL 101 course.

httparchive · 2 years ago
Here you go!

SELECT page, url, payload FROM `{table}` WHERE page like '%{site_domain}/%' AND url like '%[EXAMPLE.COM]%'

---

There's no LIMIT on it b/c I actually need all the results.

httparchive commented on Warning: $14k BigQuery charge in 2 hours   discuss.httparchive.org/t... · Posted by u/nothttparchive
buremba · 2 years ago
This happens because Google hides the query cost behind its abstracted "TBs scanned" (for their data format, not even open-source so it's hard to estimate in advance) or even worse "slots" mechanism. Only a fraction of people try to understand how much these slots cost and most of them are the people who got an unexpected bill after using BigQuery and became more aware of how the product works.

If GCP returned the query cost in the API and showed it directly in the console when you run a query, it would be much easier for their users, but unfortunately it's not in Google's interest, for obvious reasons.

httparchive · 2 years ago
Exactly, even after seeing the issue I can't make heads or tails of what the hell a "TBs scanned" is relative to row counts, etc. Likewise, it seems to place a lot of assumptions on knowing what tables include - and on a dataset you didn't build yourself how can you know the tables are optimized to lower your costs? Hell, how can you even know what the costs are?
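
For readers trying to map "TBs scanned" to dollars, here is a minimal sketch of the arithmetic, assuming billing is a flat rate per unit of data scanned after a free allowance. The function name, the $5.00 per TiB price, and the 2.5 TiB scan size are placeholders chosen for illustration, not figures from this thread; the 1 TB monthly free allowance is taken from the documentation quoted elsewhere in the thread and treated as 1 TiB here. Check the current price list before relying on any number.

```python
# Rough on-demand cost estimate from "bytes scanned".
# ASSUMPTIONS: the price per TiB and the example scan size are placeholders,
# not real quotes. The free allowance is approximated as 1 TiB.

TIB = 1024 ** 4  # bytes in one tebibyte


def estimate_cost_usd(bytes_scanned: int,
                      price_per_tib_usd: float,
                      free_bytes_remaining: int = 0) -> float:
    """Return the estimated charge for one query billed by bytes scanned."""
    if bytes_scanned < 0 or free_bytes_remaining < 0:
        raise ValueError("byte counts must be non-negative")
    # Only the bytes beyond the remaining free allowance are billable.
    billable = max(0, bytes_scanned - free_bytes_remaining)
    return billable / TIB * price_per_tib_usd


if __name__ == "__main__":
    # Hypothetical query: scans 2.5 TiB at an assumed $5.00 per TiB,
    # with the full free allowance still unused.
    cost = estimate_cost_usd(int(2.5 * TIB), 5.00, free_bytes_remaining=TIB)
    print(f"estimated cost: ${cost:.2f}")
```

The point of the sketch is that row counts never enter the calculation: only the size of the data read matters, which is why a query returning a handful of rows can still be expensive.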
httparchive commented on Warning: $14k BigQuery charge in 2 hours   discuss.httparchive.org/t... · Posted by u/nothttparchive
romeros · 2 years ago
This is just a single data point, but I had a surprise bill with Google. I talked to support and got it waived.

I used Amazon EC2 instances for years and I always felt in control. There were never any surprises. I knew that even in the worst-case situation I would be okay because I had faith in Amazon support. With Google I felt insecure. I have never played with any Google Cloud services since then.

Amazon's customer-first policy is really true. They try their absolute best to make sure there are no surprises. Even the UI is very intuitive.

httparchive · 2 years ago
Yeah, I have spent much more than $14k to date and would have spent much more over time; losing my business isn't rational. I think it's just another "Google can't do customer support to literally save their life" example.
httparchive commented on Warning: $14k BigQuery charge in 2 hours   discuss.httparchive.org/t... · Posted by u/nothttparchive
epanchin · 2 years ago
To be brutally honest, it’s badly considered queries like yours that mean these services cannot be free.
httparchive · 2 years ago
I've forgotten more SQL than most people ever learn. Time is also valuable and I make trade-offs. Should I spend hours (e.g. $$$) to optimize, or run a non-optimized query in the background for a different cost? Well, I didn't think the time/benefit/cost equation favored tuning; if I had known otherwise, I'd have spent time on tuning. If you offer something for "free" and then change the cost, and don't have any alerting mechanisms for inefficient queries, it's impossible to evaluate trade-offs.
httparchive commented on Warning: $14k BigQuery charge in 2 hours   discuss.httparchive.org/t... · Posted by u/nothttparchive
mulmen · 2 years ago
What was the query you ran?
httparchive · 2 years ago
I was doing historical evaluation for a few sites, so I was running a query for each month going back to 2016 for each site. I've done this before with no real issues, and if I knew the charges were rapidly exploding I'd have halted the script immediately - but instead it ran for 2 hours and the first notice I got was the CC charge.
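
A batch script like the one described (one query per month per site) can guard itself by estimating each query before running it and halting once a budget would be exceeded. Below is a minimal sketch under stated assumptions: `run_with_budget`, `estimate_bytes`, and `run_query` are hypothetical names, and the two callables are stand-ins you would wire to your own client. As far as I know, the BigQuery Python client can report the bytes a query would process via a dry run and can cap a job with a maximum-bytes-billed setting, but neither is used here so the sketch stays self-contained.

```python
from typing import Callable, Iterable, List, Tuple

TIB = 1024 ** 4  # bytes in one tebibyte


class BudgetExceeded(RuntimeError):
    """Raised before a query that would push estimated spend past the budget."""


def run_with_budget(queries: Iterable[str],
                    estimate_bytes: Callable[[str], int],
                    run_query: Callable[[str], None],
                    budget_usd: float,
                    price_per_tib_usd: float) -> Tuple[List[str], float]:
    """Run queries in order; stop before the one that would exceed budget_usd.

    estimate_bytes and run_query are hypothetical hooks supplied by the caller.
    """
    spent = 0.0
    done: List[str] = []
    for sql in queries:
        # Estimate first, so an expensive query is never actually submitted.
        cost = estimate_bytes(sql) / TIB * price_per_tib_usd
        if spent + cost > budget_usd:
            raise BudgetExceeded(
                f"stopping after {len(done)} queries: next query ~${cost:.2f}, "
                f"already spent ~${spent:.2f} of ${budget_usd:.2f}")
        run_query(sql)
        spent += cost
        done.append(sql)
    return done, spent


if __name__ == "__main__":
    # Hypothetical monthly queries; each is pretended to scan 1 TiB.
    monthly = [f"SELECT 1 -- month {m}" for m in range(1, 7)]
    try:
        run_with_budget(monthly, lambda sql: TIB, print,
                        budget_usd=20.0, price_per_tib_usd=5.0)
    except BudgetExceeded as exc:
        print(exc)
```

The estimate is only as good as the `estimate_bytes` hook, so this complements provider-side billing alerts rather than replacing them.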
httparchive commented on Warning: $14k BigQuery charge in 2 hours   discuss.httparchive.org/t... · Posted by u/nothttparchive
darth_avocado · 2 years ago
I am sorry, but this seems to be more of a “too long; didn’t read” situation. The HTTP Archive clearly mentions that the data is available for offline processing or for querying online on BQ. And the “Getting started” section of the instructions mentions multiple times how BQ will charge you. And even if it wasn’t mentioned anywhere, it’s a little presumptuous to assume a tool for processing data will not charge you money for literally processing TBs of data again and again.

> Note: BigQuery has a free tier that you can use to get started without enabling billing. At the time of this writing, the free tier allows 10GB of storage and 1TB of data processing per month. Google also provides a $300 credit for new accounts.

> Note: The size of the tables you query are important because BigQuery is billed based on the number of processed data. There is 1TB of processed data included in the free tier, so running a full scan query on one of the larger tables can easily eat up your quota. This is where it becomes important to design queries that process only the data you wish to explore

> When we look at the results of this, you can see how much data was processed during this query. Writing efficient queries limits the number of bytes processed - which is helpful since that's how BigQuery is billed. Note: There is 1TB free per month

https://github.com/HTTPArchive/httparchive.org/blob/main/doc...

httparchive · 2 years ago
Yes, sure, there's stuff I could have done better, and I could have stayed up all night looking at the fine print. But that's not the point - this is a *warning* to other people who see the Internet Archive logo, the words "public", and for some dumb reason also trust Google. I'm hoping this doesn't happen to others; I learned a costly lesson.

u/httparchive

Karma: 125 · Cake day: February 20, 2024