I have been doing something similar for New Zealand since the start of the year with Playwright/TypeScript, dumping parquet files to cloud storage. I've just been collecting the data; I haven't displayed it yet. Most of the work is getting around the reverse proxy services like Akamai and Cloudflare.
At the time I wrote it I thought nobody else was doing this, but now I know of at least 3 startups doing the same in NZ. It seems inflation really stoked a lot of innovation here. The patterns are about what you'd expect. Supermarkets are up to the usual tricks of arbitrarily making pricing as complicated as possible, using 'sawtooth' methods to segment time-poor people from poor people. Often they'll segment on brand loyalty vs price-sensitive people; there might be 3 popular brands of chocolate, and every week only one of them will be sold at a fair price.
Can anyone comment how supermarkets exploit customer segmentation by updating prices? How do the time-poor and poor-poor people generally respond?
“Often they'll segment on brand loyalty vs price-sensitive people; there might be 3 popular brands of chocolate, and every week only one of them will be sold at a fair price.”
Let's say there are three brands of some item. Each week one of the brands is rotated to $1 while the others are $2. And let's also suppose that the supermarket pays 80c per item.
The smart shopper might only buy in bulk once every three weeks, when his favourite brand is at a lower price, or switch to the cheapest brand every week. A hurried or lazy shopper might pick their favourite brand every week. If they buy one item a week, the lazy shopper will have spent $5 after three weeks, while the smart shopper has only spent $3.
They've made 60c off the smart shopper and $2.60 off the lazy shopper. By segmenting out the lazy shoppers they've made an extra $2. The whole idea of rotating the prices has nothing to do with the cost of goods sold; it's all about making shopping a pain in the ass for busy people and catching them out.
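As a quick sanity check, the arithmetic above in a few lines (numbers are the example's, not real data):

```python
# Sawtooth pricing over a 3-week cycle: three brands, one rotated to $1
# each week while the others sit at $2; the store pays $0.80 per item.
COST, WEEKS = 0.80, 3

smart_spend = 1.00 * WEEKS               # always buys whichever brand is on special
lazy_spend = 1.00 + 2.00 * (WEEKS - 1)   # always buys the favourite brand

smart_margin = smart_spend - COST * WEEKS
lazy_margin = lazy_spend - COST * WEEKS

print(smart_spend, lazy_spend)                        # 3.0 5.0
print(round(smart_margin, 2), round(lazy_margin, 2))  # 0.6 2.6
```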
I think with the current climate wrt the big supermarkets in AU, now would be the time to push your luck. The court of public opinion will definitely not be on the supermarkets' side, and the government may even step in.
Agreed. Hopefully the government's price-gouging mitigation strategy includes free flow of information (allowing scraping for price comparison).
I’ve been interested in price comparison for Australia for a while, am a Product designer/manager with a concept prototype design, looking for others interested to work on it. My email is on my profile if you are.
Aussie here. I hadn't heard that price scraping is only quasi-legal here and that scrapers get shut down by the big supermarkets - but then again I'm not surprised.
I'm thinking of starting a little price comparison site, mainly to compare select products at Colesworths vs Aldi (I've just started doing more regular grocery shopping at Aldi myself). But as far as I know, Aldi don't have any prices / catalogues online, so my plan is to manually enter the data myself in the short term, and to appeal to crowdsourcing the data in the long term. The plan is to make it a simple SSG site (e.g. Hugo powered), with all data in simple markdown / JSON files, sourced via GitHub pull requests.
Feel free to get in touch if you'd like to help out, or if you know of anything similar that already exists: greenash dot net dot au slash contact
I built one called https://bbdeals.in/ for India. I mostly use it just to buy fruits, and it's saved me about 20% of spending, which is not bad in these hard times.
Building the crawlers and infra to support it took no more than 20 hours.
Those who order grocery delivery online would benefit from price comparisons, because they can order from multiple stores at the same time. In addition, there's only one marketplace that has all the prices from different stores.
>Those who order grocery delivery online would benefit from price comparisons, because they can order from multiple stores at the same time.
Not really, since the delivery fees/tips that you have to pay would eat up all the savings, unless maybe if you're buying for a family of 5 or something.
I think the fees they tack on for online orders would ruin ordering different products from different stores. It mostly makes sense with staples that don't perish.
With fresh produce I find Pak n Save a lot more variable with quality, making online orders more risky despite the lower cost.
I was planning on doing the same in NZ. I would be keen to chat to you about it (email in HN profile). I am a data scientist.
Did you notice anything pre and post the Whittaker's price increase(s)? They must have a brilliant PR firm on retainer for every major news outlet to more or less push the line that increased prices are a good thing for the consumer. I've noticed more aggressive "sales" recently, but I'm unsure if I'm just paying more attention.
My prediction is that they will decrease the size of the bars soon.
I think Whittaker's changed their recipe some time in the last year. Whittaker's was what Cadbury used to be (good), but now I think they have both followed the same course: markedly lower quality. This is the 200g blocks, fwiw; not sure about the wee 50g peanut slab.
Nice writeup. I've been through similar problems that you have with my contact lens price comparison website https://lenspricer.com/ that I run in ~30 countries. I have found, like you, that websites changing their HTML is a pain.
One of my biggest hurdles initially was matching products across 100+ websites. Even though you think a product has a unique name, everyone puts their own twist on it. Most can be handled with regexes, but I had to manually map many of these (I used AI for some of it, but had to manually verify all of it).
I've found that building the scrapers and infrastructure is somewhat the easy part. The hard part is maintaining all of the scrapers and figuring out, when a product disappears from a site, whether that's because my scraper has an error, my scraper is being blocked, the site made a change, the site was randomly down for maintenance when I scraped it, etc.
A fun project, but challenging at times, and annoying problems to fix.
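For what it's worth, a minimal sketch of the kind of name normalization that covers the regex-able cases (the product names here are made up; the genuinely ambiguous ones still need the manual mapping described above):

```python
import re

def normalize(name: str) -> str:
    """Lowercase, strip punctuation and collapse whitespace so that the
    same product named slightly differently on two sites can match."""
    name = re.sub(r"[^a-z0-9]+", " ", name.lower())
    return re.sub(r"\s+", " ", name).strip()

# Two hypothetical vendors' spellings of the same product collapse to one key:
assert normalize("ACUVUE Oasys, 1-Day (30 pk)") == "acuvue oasys 1 day 30 pk"
# ...but variants that differ in substance stay distinct:
assert normalize("Acuvue Oasys 1-Day 30pk") != normalize("Acuvue Oasys 30pk")
```

Note this only folds away formatting noise; it cannot catch a site that mislabels one product as another, which is where the manual verification comes in.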
Doing the work we need. Every year I get fucked by my insurance company when buying a basic thing - contacts. Pricing is all over the place, and coverage is usually 30%, done by mail-in reimbursement.
Thanks!
I'm curious, can you wear contact lenses while working? I notice my eyes get tired when I look at a monitor for too long. Have you found any solutions for that?
I use contact lenses basically every day, and I have had no problems working in front of screens. There's a huge difference between the different brands. Mine is one of the more expensive ones (Acuvue Oasys 1-Day), so that might be part of it, but each eye is compatible with different lenses.
If I were you I would go to an optometrist and talk about this. They can also often give you free trials for different contacts and you can find one that works for you.
My eye doctor recommended wearing “screen glasses”. They are a small prescription (maybe 0.25 or 0.5) with blue blocking. It's a small difference, but it does help; I wear normal glasses at night (so my eyes can rest) and contacts + screen glasses during the day, and they are really close.
For Germany, below the prices it says "some links may be sponsored", but it does not mark which ones. Is that even legal? Also there seem to be very few shops, are maybe all the links sponsored? Also idealo.de finds lower prices.
When I decided to put the text like that, I looked at maybe 10-20 of the biggest price comparison websites across different countries, because I of course want to make sure I respect all the regulations there are. I found that many of them don't even write anywhere that the links may be sponsored; you have to go to the "about" page or similar to find this. I think I actually go further than most of them in making it known that some links may be sponsored.
Now that you mention idealo, there seems to be no mention at all on a product page that they are paid by the stores; you have to click the "rank" link in the footer to be brought to a page https://www.idealo.de/aktion/ranking where they write this.
> One of my biggest hurdles initially was matching products across 100+ websites. Even though you think a product has a unique name, everyone puts their own twist on it. Most can be handled with regexes, but I had to manually map many of these (I used AI for some of it, but had to manually verify all of it)
In the U.S. at least, big retailers will have product suppliers build slightly different SKUs for them to make price comparisons tricky. Costco is somewhat notorious for this: almost all electronics (and many other products) sold in their stores are custom SKUs -- often with slightly different product configurations.
Costco does this for sure, but Costco also creates their own products. For instance there are some variations of a package set that can only be bought at Costco, so you aren't getting the exact same box and items as anywhere else.
Yeah, it is to some degree. I tried to use it as much as possible, but there are always those annoying edge cases that make me not trust the results, so I have to check everything; it ended up being faster to just build a simple UI where I can easily classify the name myself.
Part of the problem is simply bad data from the websites. Just as an example: there's a 2-week contact lens called "Acuvue Oasys", and there's a completely different 1-day contact lens called "Acuvue Oasys 1-Day". Some sites have been bad at writing this properly, so both variants may be called "Acuvue Oasys" (or close to it), and the way to distinguish them is to look at the image to see which actual lens they mean, look at the price, etc.
It's true that this could probably also be handled by AI, but in the end, classifying the lenses takes like 1-2% of the time it takes to make a scraper for a website so I found it was not worth trying to build a very good LLM classifier for this.
I created a similar website which got lots of interest in my city. I scrape both app and website data using a single Linode server with 2GB of RAM, 5 IPv4 addresses and 1000 IPv6 addresses (which are free). Every single product is scraped at most every 40 minutes, never more than that, with an average interval of 25 minutes. I use curl-impersonate and scrape JSON as much as possible, because 90% of markets provide prices via Ajax calls; for the other 10% I use regex to easily parse the HTML. You can check it at https://www.economizafloripa.com.br
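For the HTML minority, the regex approach can be as simple as this sketch (the endpoint shape and the `data-price` attribute are hypothetical; every chain's payload and markup differ):

```python
import json
import re
import urllib.request

def fetch_prices_json(url: str) -> list[dict]:
    """Hit the same Ajax endpoint the store's own frontend calls.
    The {"products": [...]} shape is an assumption for illustration."""
    with urllib.request.urlopen(url) as resp:
        payload = json.loads(resp.read())
    return payload.get("products", [])

# Fallback for sites that render prices straight into the HTML.
PRICE_RE = re.compile(r'data-price="(\d+\.\d{2})"')

def prices_from_html(html: str) -> list[float]:
    return [float(p) for p in PRICE_RE.findall(html)]
```

In practice each site gets its own small regex or JSON path; the point is that a full DOM parser is often unnecessary.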
And then try to sell it back to businesses, even suggesting they use the data to train AI: https://www.economizafloripa.com.br/?q=parceria-comercial
You also make it sound like there’s a team manually doing all the work.
That whole page makes my view of the project go from “helpful tool for the people, to wrestle back control from corporations selling basic necessities” to “just another attempt to make money”. Which is your prerogative, I was just expecting something different and more ethically driven when I read the homepage.
Where does this lack ethics? It seems that they are providing a useful service, that they created with their hard work. People are allowed to make money with their work.
> The second kind is nastier.
>
> They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products.
I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
Most of the scraping in my project is done with simple HTTP calls to JSON APIs. For some websites, a Playwright instance is used to get a valid session cookie and circumvent bot protection and captchas. The rest of the crawler/scraper, parsers and APIs are built using Haskell and run on AWS ECS. The website is NextJS.
> I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
Yes, that's exactly what I've been doing and it saved me more times than I'd care to admit!
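A minimal sketch of that scrape/parse split, with a hypothetical on-disk layout (`raw/<date>/<store>.json`):

```python
import datetime
import json
import pathlib

RAW_DIR = pathlib.Path("raw")  # hypothetical layout: raw/<date>/<store>.json

def save_raw(store: str, payload: str) -> pathlib.Path:
    """Stage 1: persist the raw response untouched, keyed by date."""
    path = RAW_DIR / datetime.date.today().isoformat() / f"{store}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(payload)
    return path

def parse_products(path: pathlib.Path) -> list[dict]:
    """Stage 2: parse separately, so parser fixes can be re-run over
    every old snapshot without re-scraping anything."""
    data = json.loads(path.read_text())
    return [{"name": p["name"], "price": p["price"]} for p in data["products"]]
```

When a site changes its payload, only `parse_products` needs fixing, and the corrected parser can be replayed against the whole archive.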
This is interesting because I believe the two major supermarkets in Australia can create a duopoly in anti-competitive pricing by just employing price analysis AI algorithms on each side and the algorithms will likely end up cooperating to maximise profit. This can probably be done legally through publicly obtained prices and illegally by sharing supply cost or profit per product data. The result is likely to be similar. Two trained AIs will maximise profit in weird ways using (super)multidimensional regression analysis (which is all AI is), and the consumer will pay for maximised profits to ostensible competitors. If the pricing data can be obtained like this, not much more is needed to implement a duopoly-focused pair of machine learning implementations.
The rationale is that if all prices are out there in the open, consumers will end up paying a higher price, as the actors (supermarkets) will end up pricing their stuff equally, at a point where everyone makes a maximum profit.
For years said supermarkets have employed "price hunters", which are just people that go to competitor stores and register the prices of everything.
Here in Norway you will oftentimes notice that supermarket A will have sale/rebates on certain items one week, then the next week or after supermarket B will have something similar, to attract customers.
> They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products. However the way they write the prices has changed and now a bag of chips doesn't cost €1.99 but €199. To catch these changes I rely on my transformation step being as strict as possible with its inputs.
You could probably add some automated checks to not sync changes to prices/products if a sanity check fails e.g. each price shouldn't change by more than 100%, and the number of active products shouldn't change by more than 20%.
Yeah, I thought about that, but I've seen cases where a product jumped more than 100%.
I used this kind of heuristic to check if a scrape was successful: the number of products scraped today should be within ~10% of the average of the last 7 days or so.
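Both checks fit in a few lines; the thresholds are the ones mentioned in this thread, and since genuine >100% jumps do happen, a flagged change is better routed to manual review than silently dropped:

```python
def price_sane(old: float, new: float, max_jump: float = 1.0) -> bool:
    """Flag (don't auto-reject) a price that moved by more than 100%."""
    return abs(new - old) <= max_jump * old

def count_sane(today: int, recent: list[int], tolerance: float = 0.10) -> bool:
    """Accept a scrape only if today's product count is within ~10%
    of the recent (e.g. 7-day) average."""
    avg = sum(recent) / len(recent)
    return abs(today - avg) <= tolerance * avg

assert price_sane(1.99, 2.49)
assert not price_sane(1.99, 199.0)   # the decimal-point regression case
assert count_sane(980, [1000] * 7)
assert not count_sane(1300, [1000] * 7)
```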
The hard thing is not scraping, but getting around the increasingly sophisticated blockers.
You'll need to constantly rotate (highly rated) residential proxies and make sure not to exhibit data-scraping patterns. Some supermarkets don't show the network requests in the network tab, so you cannot just grab that API response.
Even then, MITM'ing the mobile app (to see the network requests and data) will also get blocked without decent cover-ups.
I tried, but realised it isn't worth it due to the costs and constant dev work required. In fact, some of the supermarket price comparison services just have (cheap-labour) people scrape it manually.
I wonder if we could get some legislation in place to require that they publish pricing data via an API so we don't have to tangle with the blockers at all.
I'd prefer that governments enact legislation that prevents discriminating against IP addresses, perhaps under net neutrality laws.
For anyone with some clout/money who would like to stop corporations like Akamai and Cloudflare from unilaterally blocking IP addresses, the way that works is you file a lawsuit against the corporations and get an injunction to stop a practice (like IP blacklisting) during the legal proceedings. IANAL, so please forgive me if my terminology isn't quite right here:
https://pro.bloomberglaw.com/insights/litigation/how-to-file...
https://www.law.cornell.edu/wex/injunctive_relief
Injunctions have been used with great success for a century or more to stop corporations from polluting or destroying ecosystems. The idea is that since anyone can file an injunction, that creates an incentive for corporations to follow the law or risk having their work halted for months or years as the case proceeds.
I'd argue that unilaterally blocking IP addresses on a wide scale pollutes the ecosystem of the internet, so can't be allowed to continue.
Of course, corporations have thought of all of this, so have gone to great lengths to lobby governments and use regulatory capture to install politicians and judges who rule in their favor to pay back campaign contributions they've received from those same corporations:
https://www.crowell.com/en/insights/client-alerts/supreme-co...
https://www.mcneeslaw.com/nlrb-injunction/
So now the pressures that corporations have applied on the legal system to protect their own interests at the cost of employees, taxpayers and the environment have started to affect other industries like ours in tech.
You'll tend to hear that disruptive ideas like I've discussed are bad for business from the mainstream media and corporate PR departments, since they're protecting their own interests. That's why I feel that the heart of hacker culture is in disrupting the status quo.
Would be nice to have price transparency for goods. It would make prices like this much easier to track by store and region.
For example, compare the price of oat milk at different zip codes and grocery stores. Additionally track “shrinkflation” (same price but smaller portion).
On that note, it seems you are tracking price but are you also checking the cost per gram (or ounce)? Manufacturer or store could keep price the same but offer less to the consumer. Wonder if your tool would catch this.
I do track the price per unit (kg, L, etc.), and I was a bit on the fence about whether I should show and graph that number instead of the price someone would pay at the checkout, but I opted for the latter to keep it more "familiar" with the prices people see.
Having said that, that's definitely something I could add, and it would show when shrinkflation occurred, if any.
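For reference, the shrinkflation signal is just the unit price moving while the shelf price stays flat; a toy example with made-up numbers:

```python
def unit_price(shelf_price: float, size_kg: float) -> float:
    """Price per kg. Shrinkflation shows up as a jump here even when
    the shelf price is unchanged."""
    return shelf_price / size_kg

# A 500 g bag at 2.50 quietly shrunk to 450 g at the same shelf price:
assert unit_price(2.50, 0.500) == 5.0
assert unit_price(2.50, 0.450) > unit_price(2.50, 0.500)
```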
In my experience, grocers always do include unit prices…at least in the USA. I’ve lived in Florida, Indiana, California, and New York, and in 35 years of life, I can’t remember ever not seeing the price per oz, per pound, per fl oz, etc. right next to the total price for food/drink and most home goods.
There may be some exceptions, but I’m struggling to think of any except things where weight/volume aren’t really relevant to the value — e.g., a sponge.
There are so many scrapers that come and go doing this in AU but are usually shut down by the big supermarkets.
It's a cycle of usefulness and "why doesn't this exist", except it has existed many times before.
With the corresponding repo too: https://github.com/Javex/hotprices-au
You might be breaking the site's terms and conditions, but that does not mean it's illegal.
Dan Murphy's does something similar; they have their own price-checking algorithm.
When I was in my 20s, this was absolutely not a problem.
When I hit my 30s, I started wearing glasses instead of contacts basically all the time, and it wasn't a problem.
Now that I'm in my 40s, I'm having to take my glasses off to read a monitor and most things that are closer than my arm's reach.
I can use an E-Ink device all day without my eyes getting tired.
I have built a similar system and website for the Netherlands, as part of my master's project: https://www.superprijsvergelijker.nl/
The main challenge I have been working on is linking products from different supermarkets, so that you can list prices in a single view. See for example: https://www.superprijsvergelijker.nl/supermarkt-aanbieding/6...
It works for the most part, as long as at least one correct barcode number is provided for a product.
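A sketch of that barcode-based linking, with hypothetical field names (`ean`) and made-up catalogs:

```python
def link_by_barcode(catalogs: dict[str, list[dict]]) -> dict[str, dict[str, dict]]:
    """Index each store's products by EAN barcode so the same item can be
    shown side by side across stores."""
    linked: dict[str, dict[str, dict]] = {}
    for store, products in catalogs.items():
        for product in products:
            ean = product.get("ean")
            if ean:  # products without a barcode can't be linked this way
                linked.setdefault(ean, {})[store] = product
    return linked

catalogs = {
    "store_a": [{"ean": "8710400011", "price": 1.99}],
    "store_b": [{"ean": "8710400011", "price": 1.79}, {"price": 0.99}],
}
linked = link_by_barcode(catalogs)
assert set(linked["8710400011"]) == {"store_a", "store_b"}
```

The weak spot is exactly the one noted above: the linking only works when at least one correct barcode is present in each feed.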
Since this is just a side project, if it starts demanding too much of my time too often I'll just stop it and open both the code and the data.
BTW, how could the network request not appear in the network tab?
For me the hardest part is to correlate and compare products across supermarkets
See Next.js server-side data fetching: requests made on the server never appear in the browser's network tab. I believe they mention that as a security thing in their docs.
In terms of comparison, most names tend to be the same, so a similarity search, restricted to the same category, matches well enough.