thalissonvs (u/thalissonvs)

thalissonvs commented on Show HN: PyDoll – Async Python scraping engine with native CAPTCHA bypass github.com/autoscrape-lab... · Posted by u/thalissonvs

mrweasel · 2 months ago

You know that the captcha is there to prevent you from doing e.g. automated data mining, depending on the site obviously. In any case, you are actively seeking to bypass a feature the website put there to prevent you from doing what you're doing, and I think you know that. Does that not give you any moral concerns?

If you really want/need the data, why not contact the site owner and make some sort of arrangement? We hosted a number of product images, many of which we took ourselves, something that other sites wanted. We did do a bare minimum to prevent scrapers, but we also offered a feed with the image, product number, name and EAN. We charged a small fee, but you then got either an XML feed or a CSV and you could just pick out the new additions and download those.

thalissonvs · 2 months ago

I'm not actually bypassing the captcha with reverse engineering or anything like that, much less integrating with external services. I just made the library look like a real user by eliminating some of the things that Selenium, Puppeteer, and other libraries do that make them easily detectable. Site owners can still block in other ways, such as blocking by IP address, rate limiting, or using a captcha that requires a challenge, such as reCAPTCHA v2.
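
To illustrate the rate-limiting option from the site owner's side, here is a minimal sketch of a per-IP sliding-window limiter. It is not part of Pydoll or of any particular server; the class name `SlidingWindowLimiter`, its parameters, and the injectable `clock` are all made up for this example.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each client IP.

    Hypothetical helper for illustration only.
    """

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # ip -> timestamps of recent allowed requests

    def allow(self, ip):
        now = self.clock()
        recent = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # over the limit: the caller would reject or delay the request
        recent.append(now)
        return True
```

A server would call `allow(client_ip)` once per incoming request and return an error response when it comes back `False`. A browser that looks like a real user does nothing to get around a check like this, since it only counts requests.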

thalissonvs commented on Show HN: PyDoll – Async Python scraping engine with native CAPTCHA bypass github.com/autoscrape-lab... · Posted by u/thalissonvs

mfrye0 · 2 months ago

Checking it out and I see you're using CDP.

It's been a bit, but I'm pretty sure use of CDP can be detected. Has anything changed on that front, or are you aware and you're just bypassing with automated captcha handling?

thalissonvs · 2 months ago

CDP itself is not detectable. It turns out that other libraries like Puppeteer and Playwright often leave obvious traces, such as creating contexts with common prefixes or defining attributes on the navigator property.

I did a clean implementation on top of CDP, without many signals for tracking. I added realistic interactions, among other measures.

thalissonvs commented on Show HN: PyDoll – Async Python scraping engine with native CAPTCHA bypass github.com/autoscrape-lab... · Posted by u/thalissonvs

bobbyraduloff · 2 months ago

Is there a write up on how you deal with the captchas?

thalissonvs · 2 months ago

you can check the official documentation, there's a section 'Deep Dive'

thalissonvs commented on Show HN: PyDoll – Async Python scraping engine with native CAPTCHA bypass github.com/autoscrape-lab... · Posted by u/thalissonvs

renegat0x0 · 2 months ago

I think I will add this to my AIO package. My project lets you crawl pages. It provides a barebones page, and scraping results are passed as JSON.

This has been very useful for me, since I don't have to set up Selenium for the nth time. I just use one crawling server for my projects.

Link:

https://github.com/rumca-js/crawler-buddy

thalissonvs · 2 months ago

cool, left a star :)

thalissonvs commented on Show HN: PyDoll – Async Python scraping engine with native CAPTCHA bypass github.com/autoscrape-lab... · Posted by u/thalissonvs

hk1337 · 2 months ago

> Say goodbye to webdriver compatibility nightmares

That's cool, but Chrome is the only browser I have had these issues with. We have a cron process that uses Selenium, initially with Chrome, and every time there was a Chrome browser update we had to update the web driver. I switched it to Firefox and haven't had to update the web driver since.

I like the async portion of this but this seems like MechanicalSoup?

*EDIT* MechanicalSoup doesn't necessarily have async, AFAIK.

thalissonvs · 2 months ago

I don't think it's similar. The library has many other features that Selenium doesn't have. It has few dependencies, which makes installation faster, allows scraping multiple tabs simultaneously because it’s async, and has a much simpler syntax and element searching, without all the verbosity of Selenium. Even for cases that don’t involve captchas, I still believe it’s definitely worth using.
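
As a rough idea of what "multiple tabs simultaneously" means with async code, here is a minimal sketch using only the standard library's `asyncio`. It does not use Pydoll's actual API: `scrape_tab` and `scrape_all` are hypothetical stand-ins, and the sleep simulates the time spent waiting on a page.

```python
import asyncio


async def scrape_tab(url):
    # Hypothetical stand-in for opening a tab and extracting data.
    # The sleep simulates waiting on network and page load.
    await asyncio.sleep(0.1)
    return {"url": url, "title": f"title of {url}"}


async def scrape_all(urls):
    # Start one coroutine per URL and wait for all of them together.
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(scrape_tab(u) for u in urls))


if __name__ == "__main__":
    pages = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    for result in asyncio.run(scrape_all(pages)):
        print(result)
```

Because the waits overlap, the three simulated tabs finish in roughly the time of one instead of three in a row. That is the benefit async brings when most of the time is spent waiting on pages.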

thalissonvs commented on Show HN: PyDoll – Async Python scraping engine with native CAPTCHA bypass github.com/autoscrape-lab... · Posted by u/thalissonvs

jdnier · 2 months ago

Hi, just wondering what you're thinking about how your tool might be abused.

thalissonvs · 2 months ago

Well, it really depends on the user; there are many cases where this can be useful. Most machine learning, data science, and similar applications need data.

thalissonvs commented on Pydoll: Async Web Automation in Python github.com/thalissonvs/py... · Posted by u/thalissonvs

thalissonvs · 5 months ago

Pydoll is an innovative Python library that's redefining Chromium browser automation! Unlike other solutions, Pydoll completely eliminates the need for webdrivers, providing a much more fluid and reliable automation experience.

Zero Webdrivers! Say goodbye to webdriver compatibility and configuration headaches

Native Captcha Bypass! Naturally passes through Cloudflare Turnstile and reCAPTCHA v3 *

Performance thanks to native asynchronous programming

Realistic Interactions that simulate human behavior

Advanced Event System for complex and reactive automations