I have been doing something similar for New Zealand since the start of the year with Playwright/TypeScript, dumping parquet files to cloud storage. I've just been collecting the data; I haven't displayed it yet. Most of the work is getting around the reverse proxy services like Akamai and Cloudflare.
At the time I wrote it I thought nobody else was doing this, but now I know of at least 3 startups doing the same in NZ. It seems inflation really stoked a lot of innovation here. The patterns are about what you'd expect. Supermarkets are up to the usual tricks of arbitrarily making pricing as complicated as possible, using 'sawtooth' methods to segment time-poor people from poor people. Often they'll segment on brand loyalty vs price-sensitive people; there might be 3 popular brands of chocolate, and every week only one of them will be sold at a fair price.
Can anyone comment how supermarkets exploit customer segmentation by updating prices? How do the time-poor and poor-poor people generally respond?
“Often they'll segment on brand loyalty vs price-sensitive people; there might be 3 popular brands of chocolate, and every week only one of them will be sold at a fair price.”
Let's say there are three brands of some item. Each week one of the brands is rotated to $1 while the others are $2. And let's also suppose that the supermarket pays 80c per item.
The smart shopper might only buy in bulk once every three weeks, when his favourite brand is at a lower price, or switch to the cheapest brand every week. A hurried or lazy shopper might pick their favourite brand every week. If they buy one item a week, the lazy shopper will have spent $5 after three weeks, while the smart shopper has only spent $3.
They've made 60c off the smart shopper and $2.60 off the lazy shopper. By segmenting out the lazy shoppers they've made an extra $2. The whole idea of rotating the prices has nothing to do with the cost of goods sold; it's all about making shopping a pain in the ass for busy people and catching them out.
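As a quick sanity check, the arithmetic above in a few lines (numbers are the example's, not real data):

```python
# Sawtooth pricing over a 3-week cycle: three brands, one rotated to $1
# each week while the others sit at $2; the store pays $0.80 per item.
COST, WEEKS = 0.80, 3

smart_spend = 1.00 * WEEKS               # always buys whichever brand is on special
lazy_spend = 1.00 + 2.00 * (WEEKS - 1)   # always buys the favourite brand

smart_margin = smart_spend - COST * WEEKS
lazy_margin = lazy_spend - COST * WEEKS

print(smart_spend, lazy_spend)                        # 3.0 5.0
print(round(smart_margin, 2), round(lazy_margin, 2))  # 0.6 2.6
```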
I think with the current climate wrt the big supermarkets in AU, now would be the time to push your luck. The court of public opinion will definitely not be on the supermarkets' side, and the government may even step in.
Agreed. Hopefully the government's price-gouging mitigation strategy includes free flow of information (allowing scraping for price comparison).
I’ve been interested in price comparison for Australia for a while, am a Product designer/manager with a concept prototype design, looking for others interested to work on it. My email is on my profile if you are.
Aussie here. I hadn't heard that price scraping is only quasi-legal here and that scrapers get shut down by the big supermarkets - but then again I'm not surprised.
I'm thinking of starting a little price comparison site, mainly to compare select products at Colesworths vs Aldi (I've just started doing more regular grocery shopping at Aldi myself). But as far as I know, Aldi don't have any prices / catalogues online, so my plan is to manually enter the data myself in the short term, and to appeal to crowdsourcing the data in the long term. The plan is to make it a simple SSG site (e.g. Hugo powered), with all data in simple markdown / JSON files, sourced via GitHub pull requests.
Feel free to get in touch if you'd like to help out, or if you know of anything similar that already exists: greenash dot net dot au slash contact
I built one called https://bbdeals.in/ for India. I mostly use it just to buy fruits, and it's saved me about 20% of spending, which is not bad in these hard times.
Building the crawlers and infra to support it took no more than 20 hours.
Those who order grocery delivery online would benefit from price comparisons, because they can order from multiple stores at the same time. In addition, there's only one marketplace that has all the prices from different stores.
>Those who order grocery delivery online would benefit from price comparisons, because they can order from multiple stores at the same time.
Not really, since the delivery fees/tips that you have to pay would eat up all the savings, unless maybe if you're buying for a family of 5 or something.
I think the fees they tack on for online orders would ruin ordering different products from different stores. It mostly makes sense with staples that don't perish.
With fresh produce I find Pak n Save a lot more variable with quality, making online orders more risky despite the lower cost.
I was planning on doing the same in NZ. I would be keen to chat to you about it (email in HN profile). I am a data scientist.
Did you notice anything pre and post the Whittaker's price increase(s)? They must have a brilliant PR firm on retainer for every major news outlet to more or less push the line that increased prices are a good thing for the consumer. I've noticed more aggressive "sales" recently, but I'm unsure if I'm just paying more attention.
My prediction is that they will decrease the size of the bars soon.
I think Whittaker's changed their recipe some time in the last year. Whittaker's was what Cadbury used to be (good), but now I think they have both followed the same course: markedly lower quality. This is the 200g blocks, fwiw; not sure about the wee 50g peanut slab.
Nice writeup. I've been through similar problems that you have with my contact lens price comparison website https://lenspricer.com/ that I run in ~30 countries. I have found, like you, that websites changing their HTML is a pain.
One of my biggest hurdles initially was matching products across 100+ websites. Even though you think a product has a unique name, everyone puts their own twist on it. Most can be handled with regexes, but I had to manually map many of these (I used AI for some of it, but had to manually verify all of it).
I've found that building the scrapers and infrastructure is somewhat the easy part. The hard part is maintaining all of the scrapers and figuring out, when a product disappears from a site, whether that's because my scraper has an error, my scraper is being blocked, the site made a change, the site was randomly down for maintenance when I scraped it, etc.
A fun project, but challenging at times, and annoying problems to fix.
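For what it's worth, a minimal sketch of the kind of name normalization that covers the regex-able cases (the product names here are made up; the genuinely ambiguous ones still need the manual mapping described above):

```python
import re

def normalize(name: str) -> str:
    """Lowercase, strip punctuation and collapse whitespace so that the
    same product named slightly differently on two sites can match."""
    name = re.sub(r"[^a-z0-9]+", " ", name.lower())
    return re.sub(r"\s+", " ", name).strip()

# Two hypothetical vendors' spellings of the same product collapse to one key:
assert normalize("ACUVUE Oasys, 1-Day (30 pk)") == "acuvue oasys 1 day 30 pk"
# ...but variants that differ in substance stay distinct:
assert normalize("Acuvue Oasys 1-Day 30pk") != normalize("Acuvue Oasys 30pk")
```

Note this only folds away formatting noise; it cannot catch a site that mislabels one product as another, which is where the manual verification comes in.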
Doing the work we need. Every year I get fucked by my insurance company when buying a basic thing - contacts. Pricing is all over the place, and coverage is usually 30%, done by mail-in reimbursement.
Thanks!
I'm curious, can you wear contact lenses while working? I notice my eyes get tired when I look at a monitor for too long. Have you found any solutions for that?
I use contact lenses basically every day, and I have had no problems working in front of screens. There's a huge difference between the different brands. Mine is one of the more expensive ones (Acuvue Oasys 1-Day), so that might be part of it, but each eye is compatible with different lenses.
If I were you I would go to an optometrist and talk about this. They can also often give you free trials for different contacts and you can find one that works for you.
My eye doctor recommended wearing “screen glasses”. They are a small prescription (maybe 0.25 or 0.5) with blue blocking. It's a small difference, but it does help; I wear normal glasses at night (so my eyes can rest) and contacts + screen glasses during the day, and they are really close.
For Germany, below the prices it says "some links may be sponsored", but it does not mark which ones. Is that even legal? Also there seem to be very few shops, are maybe all the links sponsored? Also idealo.de finds lower prices.
When I decided to put the text like that, I looked at maybe 10-20 of the biggest price comparison websites across different countries, because I of course want to make sure I respect all the regulations there are. I found that many of them don't even write anywhere that the links may be sponsored; you have to go to the "about" page or similar to find this. I think I actually go further than most of them in making it known that some links may be sponsored.
Now that you mention idealo, there seems to be no mention at all on a product page that they are paid by the stores; you have to click the "rank" link in the footer to be brought to a page https://www.idealo.de/aktion/ranking where they write this.
> One of my biggest hurdles initially was matching products across 100+ websites. Even though you think a product has a unique name, everyone puts their own twist on it. Most can be handled with regexes, but I had to manually map many of these (I used AI for some of it, but had to manually verify all of it)
In the U.S. at least, big retailers will have product suppliers build slightly different SKUs for them to make price comparisons tricky. Costco is somewhat notorious for this: almost all electronics (and many other products) sold in their stores are custom SKUs -- often with slightly different product configurations.
Costco does this for sure, but Costco also creates their own products. For instance there are some variations of a package set that can only be bought at Costco, so you aren't getting the exact same box and items as anywhere else.
Yeah, it is to some degree. I tried to use it as much as possible, but there are always those annoying edge cases that make me not trust the results, so I have to check everything; it ended up being faster to just build a simple UI where I can easily classify the name myself.
Part of the problem is simply bad data from the websites. Just as an example: there's a 2-week contact lens called "Acuvue Oasys", and there's a completely different 1-day contact lens called "Acuvue Oasys 1-Day". Some sites have been bad at writing this properly, so both variants may be called "Acuvue Oasys" (or close to it), and the way to distinguish them is to look at the image to see which actual lens they mean, look at the price, etc.
It's true that this could probably also be handled by AI, but in the end, classifying the lenses takes like 1-2% of the time it takes to make a scraper for a website so I found it was not worth trying to build a very good LLM classifier for this.
I created a similar website which got lots of interest in my city. I scrape both app and website data using a single Linode server with 2GB of RAM, 5 IPv4 addresses and 1000 IPv6 addresses (which are free). Every single product is scraped at most every 40 minutes, never more than that, with an average interval of 25 minutes. I use curl-impersonate and scrape JSON as much as possible, because 90% of markets provide prices via Ajax calls; for the other 10% I use regex to easily parse the HTML. You can check it at https://www.economizafloripa.com.br
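For the HTML minority, the regex approach can be as simple as this sketch (the endpoint shape and the `data-price` attribute are hypothetical; every chain's payload and markup differ):

```python
import json
import re
import urllib.request

def fetch_prices_json(url: str) -> list[dict]:
    """Hit the same Ajax endpoint the store's own frontend calls.
    The {"products": [...]} shape is an assumption for illustration."""
    with urllib.request.urlopen(url) as resp:
        payload = json.loads(resp.read())
    return payload.get("products", [])

# Fallback for sites that render prices straight into the HTML.
PRICE_RE = re.compile(r'data-price="(\d+\.\d{2})"')

def prices_from_html(html: str) -> list[float]:
    return [float(p) for p in PRICE_RE.findall(html)]
```

In practice each site gets its own small regex or JSON path; the point is that a full DOM parser is often unnecessary.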
And then try to sell it back to businesses, even suggesting they use the data to train AI: https://www.economizafloripa.com.br/?q=parceria-comercial
You also make it sound like there’s a team manually doing all the work.
That whole page makes my view of the project go from “helpful tool for the people, to wrestle back control from corporations selling basic necessities” to “just another attempt to make money”. Which is your prerogative, I was just expecting something different and more ethically driven when I read the homepage.
Where does this lack ethics? It seems that they are providing a useful service, that they created with their hard work. People are allowed to make money with their work.
> The second kind is nastier.
>
> They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products.
I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
Most of the scraping in my project is done with simple HTTP calls to JSON APIs. For some websites, a Playwright instance is used to get a valid session cookie and circumvent bot protection and captchas. The rest of the crawler/scraper, parsers and APIs are built using Haskell and run on AWS ECS. The website is NextJS.
> I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
Yes, that's exactly what I've been doing and it saved me more times than I'd care to admit!
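A minimal sketch of that scrape/parse split, with a hypothetical on-disk layout (`raw/<date>/<store>.json`):

```python
import datetime
import json
import pathlib

RAW_DIR = pathlib.Path("raw")  # hypothetical layout: raw/<date>/<store>.json

def save_raw(store: str, payload: str) -> pathlib.Path:
    """Stage 1: persist the raw response untouched, keyed by date."""
    path = RAW_DIR / datetime.date.today().isoformat() / f"{store}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(payload)
    return path

def parse_products(path: pathlib.Path) -> list[dict]:
    """Stage 2: parse separately, so parser fixes can be re-run over
    every old snapshot without re-scraping anything."""
    data = json.loads(path.read_text())
    return [{"name": p["name"], "price": p["price"]} for p in data["products"]]
```

When a site changes its payload, only `parse_products` needs fixing, and the corrected parser can be replayed against the whole archive.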
This is interesting because I believe the two major supermarkets in Australia can create a duopoly in anti-competitive pricing by just employing price analysis AI algorithms on each side and the algorithms will likely end up cooperating to maximise profit. This can probably be done legally through publicly obtained prices and illegally by sharing supply cost or profit per product data. The result is likely to be similar. Two trained AIs will maximise profit in weird ways using (super)multidimensional regression analysis (which is all AI is), and the consumer will pay for maximised profits to ostensible competitors. If the pricing data can be obtained like this, not much more is needed to implement a duopoly-focused pair of machine learning implementations.
The rationale is that if all prices are out there in the open, consumers will end up paying a higher price, as the actors (supermarkets) will end up pricing their stuff equally, at a point where everyone makes a maximum profit.
For years said supermarkets have employed "price hunters", which are just people that go to competitor stores and register the prices of everything.
Here in Norway you will oftentimes notice that supermarket A will have sale/rebates on certain items one week, then the next week or after supermarket B will have something similar, to attract customers.
> They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products. However the way they write the prices has changed and now a bag of chips doesn't cost €1.99 but €199. To catch these changes I rely on my transformation step being as strict as possible with its inputs.
You could probably add some automated checks to not sync changes to prices/products if a sanity check fails e.g. each price shouldn't change by more than 100%, and the number of active products shouldn't change by more than 20%.
Yeah, I thought about that, but I've seen cases where a product jumped more than 100%.
I used this kind of heuristic to check if a scrape was successful: the number of products scraped today should be within ~10% of the average of the last 7 days or so.
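Both checks fit in a few lines; the thresholds are the ones mentioned in this thread, and since genuine >100% jumps do happen, a flagged change is better routed to manual review than silently dropped:

```python
def price_sane(old: float, new: float, max_jump: float = 1.0) -> bool:
    """Flag (don't auto-reject) a price that moved by more than 100%."""
    return abs(new - old) <= max_jump * old

def count_sane(today: int, recent: list[int], tolerance: float = 0.10) -> bool:
    """Accept a scrape only if today's product count is within ~10%
    of the recent (e.g. 7-day) average."""
    avg = sum(recent) / len(recent)
    return abs(today - avg) <= tolerance * avg

assert price_sane(1.99, 2.49)
assert not price_sane(1.99, 199.0)   # the decimal-point regression case
assert count_sane(980, [1000] * 7)
assert not count_sane(1300, [1000] * 7)
```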
The hard thing is not scraping, but getting around the increasingly sophisticated blockers.
You'll need to constantly rotate (highly rated) residential proxies and make sure not to exhibit data-scraping patterns. Some supermarkets don't show the network requests in the network tab, so you cannot just grab that API response.
Even then, MITM'ing the mobile app (to see the network requests and data) will also get blocked without decent cover-ups.
I tried, but realised it isn't worth it due to the costs and constant dev work required. In fact, some of the supermarket price comparison services just have (cheap-labour) people scrape it manually.
I wonder if we could get some legislation in place to require that they publish pricing data via an API so we don't have to tangle with the blockers at all.
I'd prefer that governments enact legislation that prevents discriminating against IP addresses, perhaps under net neutrality laws.
For anyone with some clout/money who would like to stop corporations like Akamai and Cloudflare from unilaterally blocking IP addresses, the way that works is you file a lawsuit against the corporations and get an injunction to stop a practice (like IP blacklisting) during the legal proceedings. IANAL, so please forgive me if my terminology isn't quite right here:
https://pro.bloomberglaw.com/insights/litigation/how-to-file...
https://www.law.cornell.edu/wex/injunctive_relief
Injunctions have been used with great success for a century or more to stop corporations from polluting or destroying ecosystems. The idea is that since anyone can file an injunction, that creates an incentive for corporations to follow the law or risk having their work halted for months or years as the case proceeds.
I'd argue that unilaterally blocking IP addresses on a wide scale pollutes the ecosystem of the internet, so can't be allowed to continue.
Of course, corporations have thought of all of this, so have gone to great lengths to lobby governments and use regulatory capture to install politicians and judges who rule in their favor to pay back campaign contributions they've received from those same corporations:
https://www.crowell.com/en/insights/client-alerts/supreme-co...
https://www.mcneeslaw.com/nlrb-injunction/
So now the pressures that corporations have applied on the legal system to protect their own interests at the cost of employees, taxpayers and the environment have started to affect other industries like ours in tech.
You'll tend to hear that disruptive ideas like I've discussed are bad for business from the mainstream media and corporate PR departments, since they're protecting their own interests. That's why I feel that the heart of hacker culture is in disrupting the status quo.
Would be nice to have price transparency for goods. It would make prices like this much easier to track by store and region.
For example, compare the price of oat milk at different zip codes and grocery stores. Additionally track “shrinkflation” (same price but smaller portion).
On that note, it seems you are tracking price but are you also checking the cost per gram (or ounce)? Manufacturer or store could keep price the same but offer less to the consumer. Wonder if your tool would catch this.
I do track the price per unit (kg, L, etc.), and I was a bit on the fence about whether I should show and graph that number instead of the price someone would pay at the checkout, but I opted for the latter to keep it more "familiar" with the prices people see.
Having said that, that's definitely something I could add, and it would show when shrinkflation occurred, if any.
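For reference, the shrinkflation signal is just the unit price moving while the shelf price stays flat; a toy example with made-up numbers:

```python
def unit_price(shelf_price: float, size_kg: float) -> float:
    """Price per kg. Shrinkflation shows up as a jump here even when
    the shelf price is unchanged."""
    return shelf_price / size_kg

# A 500 g bag at 2.50 quietly shrunk to 450 g at the same shelf price:
assert unit_price(2.50, 0.500) == 5.0
assert unit_price(2.50, 0.450) > unit_price(2.50, 0.500)
```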
In my experience, grocers always do include unit prices…at least in the USA. I’ve lived in Florida, Indiana, California, and New York, and in 35 years of life, I can’t remember ever not seeing the price per oz, per pound, per fl oz, etc. right next to the total price for food/drink and most home goods.
There may be some exceptions, but I’m struggling to think of any except things where weight/volume aren’t really relevant to the value — e.g., a sponge.
There are so many scrapers that come and go doing this in AU but are usually shut down by the big supermarkets.
It's a cycle of usefulness and "why doesn't this exist", except it has existed many times before.
With the corresponding repo too: https://github.com/Javex/hotprices-au
You might be breaking the site's terms and conditions, but that does not mean it's illegal.
Dan Murphy's does something similar; they have their own price-checking algorithm.
When I was in my 20s, this was absolutely not a problem.
When I hit my 30s, I started wearing glasses instead of contacts basically all the time, and it wasn't a problem.
Now that I'm in my 40s, I'm having to take my glasses off to read a monitor and most things that are closer than my arm's reach.
I can use an E-Ink device all day without my eyes getting tired.
I have built a similar system and website for the Netherlands, as part of my master's project: https://www.superprijsvergelijker.nl/
The main challenge I have been working on is linking products from different supermarkets, so that you can list prices in a single view. See for example: https://www.superprijsvergelijker.nl/supermarkt-aanbieding/6...
It works for the most part, as long as at least one correct barcode number is provided for a product.
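A sketch of that barcode-based linking, with hypothetical field names (`ean`) and made-up catalogs:

```python
def link_by_barcode(catalogs: dict[str, list[dict]]) -> dict[str, dict[str, dict]]:
    """Index each store's products by EAN barcode so the same item can be
    shown side by side across stores."""
    linked: dict[str, dict[str, dict]] = {}
    for store, products in catalogs.items():
        for product in products:
            ean = product.get("ean")
            if ean:  # products without a barcode can't be linked this way
                linked.setdefault(ean, {})[store] = product
    return linked

catalogs = {
    "store_a": [{"ean": "8710400011", "price": 1.99}],
    "store_b": [{"ean": "8710400011", "price": 1.79}, {"price": 0.99}],
}
linked = link_by_barcode(catalogs)
assert set(linked["8710400011"]) == {"store_a", "store_b"}
```

The weak spot is exactly the one noted above: the linking only works when at least one correct barcode is present in each feed.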
Since this is just a side project, if it starts demanding too much of my time too often I'll just stop it and open both the code and the data.
BTW, how could the network request not appear in the network tab?
For me the hardest part is to correlate and compare products across supermarkets
See Next.js server-side data fetching: requests made on the server never appear in the browser's network tab. I believe they mention that as a security thing in their docs.
In terms of comparison, most names tend to be the same, so a similarity search, restricted to the same category, matches well enough.