Readit News
annowiki commented on Web Scraping in Python – The Complete Guide   proxiesapi.com/articles/w... · Posted by u/anticlickwise
caesil · 2 years ago
Not only is scraping not dead but it has won the arms race. There are ways around every defense, and this will only accelerate as AI advances.

The CAPTCHAs and walls are more of a desperate, doomed retreat.

annowiki · 2 years ago
How do you get around 403s and 401s from WSJ, Reuters, and Axios? I've tried User-Agent manipulation, and it seems like I'd have to use Selenium with a headless browser to deal with them.
annowiki commented on In Defense of Simple Architectures (2022)   danluu.com/simple-archite... · Posted by u/Brajeshwar
annowiki · 2 years ago
> Another area is with software we’ve had to build (instead of buy). When we started out, we strongly preferred buying software over building it because a team of only a few engineers can’t afford the time cost of building everything. That was the right choice at the time even though the “buy” option generally gives you tools that don’t work. In cases where vendors can’t be convinced to fix showstopping bugs that are critical blockers for us, it does make sense to build more of our own tools and maintain in-house expertise in more areas, in contradiction to the standard advice that a company should only choose to “build” in its core competency. Much of that complexity is complexity that we don’t want to take on, but in some product categories, even after fairly extensive research we haven’t found any vendor that seems likely to provide a product that works for us. To be fair to our vendors, the problem they’d need to solve to deliver a working solution to us is much more complex than the problem we need to solve since our vendors are taking on the complexity of solving a problem for every customer, whereas we only need to solve the problem for one customer, ourselves.

This is more and more my philosophy. I've been working on a data science project with headline scraping (I want to do topic modeling on headlines during the course of the election) and kept preferring roll-your-own solutions to off-the-shelf ones.

For instance, instead of using Flask (as I did in a previous iteration of this project a few years ago), I went with Jinja2 and rolled my own static site generator. For scraping, I used Scrapy on my last project; on this one I wrote my own queue and scraper class. It works fantastically.
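A minimal sketch of what a hand-rolled queue and scraper class could look like, using only the standard library. This is not the commenter's code: the names `HeadlineParser`, `Scraper`, and `fetch_html` are hypothetical, and treating `<h1>` to `<h3>` tags as headlines is an assumption made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class HeadlineParser(HTMLParser):
    """Collects the text inside <h1>, <h2>, and <h3> tags (an assumed heuristic)."""

    HEADLINE_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._depth = 0    # > 0 while inside a headline tag
        self._chunks = []  # text fragments of the current headline

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADLINE_TAGS:
            self._depth += 1

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADLINE_TAGS and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self._chunks).split())
                if text:
                    self.headlines.append(text)
                self._chunks = []


def fetch_html(url):
    """Default fetcher: a plain GET with a browser-like User-Agent."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class Scraper:
    """FIFO queue of URLs; each URL is queued at most once."""

    def __init__(self, fetch=fetch_html):
        self.fetch = fetch  # injectable so the class can be tested offline
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def run(self):
        """Drain the queue and return {url: [headline, ...]}."""
        results = {}
        while self.queue:
            url = self.queue.popleft()
            try:
                html = self.fetch(url)
            except OSError:
                continue  # skip pages that fail to download
            parser = HeadlineParser()
            parser.feed(html)
            results[url] = parser.headlines
        return results
```

Passing the fetcher in as a parameter keeps the queue logic independent of the network, so it can be exercised with canned HTML.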

annowiki commented on Give AI curiosity, and it will watch TV forever (2018)   qz.com/1366484/give-ai-cu... · Posted by u/rzk
throwaway4aday · 2 years ago
Using prediction error as the definition of curiosity rings hollow for me. Curiosity in my mind is more about mapping out an unexplored thing and not about being surprised.
annowiki · 2 years ago
Actually it seems pretty accurate. Novelty-seeking is a well-known phenomenon in curious individuals. https://en.wikipedia.org/wiki/Novelty_seeking

Literally getting dopamine rewards for seeing something new is what keeps people glued to TikTok feeds and Twitter.

I tend to get bored halfway through a book if it is predictable.
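The prediction-error notion of curiosity debated in this thread can be sketched in a few lines. This is a toy illustration, not the method from the linked article: the running-mean `MeanPredictor` is a hypothetical stand-in for a learned forward model, and the two observation streams are made up. It shows why a source of pure noise keeps paying out "curiosity" reward while a predictable scene stops being rewarding.

```python
import random


class MeanPredictor:
    """Toy forward model: predicts the next observation as the running mean."""

    def __init__(self):
        self.mean = 0.0
        self.count = 0

    def predict(self):
        return self.mean

    def update(self, observation):
        self.count += 1
        self.mean += (observation - self.mean) / self.count


def curiosity_rewards(observations):
    """Intrinsic reward at each step = squared prediction error."""
    model = MeanPredictor()
    rewards = []
    for obs in observations:
        rewards.append((obs - model.predict()) ** 2)
        model.update(obs)
    return rewards


rng = random.Random(0)
predictable = [1.0] * 500                      # a static, learnable scene
noisy_tv = [rng.random() for _ in range(500)]  # unpredictable static

# Average reward over the last 100 steps, after the model has had time to learn.
late_predictable = sum(curiosity_rewards(predictable)[-100:]) / 100
late_noisy = sum(curiosity_rewards(noisy_tv)[-100:]) / 100
```

Here `late_predictable` drops to zero once the constant is learned, while `late_noisy` stays positive because no model can predict the noise.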

annowiki commented on Why Aren't We Sieve-Ing?   brooker.co.za/blog/2023/1... · Posted by u/xyzzy_plugh
annowiki · 2 years ago
This is not really "sieve-ing" per the article, but what prevents me from running another process that periodically queries the data in a cache? For example, a Celery queue in Python that continually checks the cache for out-of-date entries and updates them. Is there a word for this? Is it a common technique?
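The pattern described above is often called refresh-ahead or background refresh. A minimal sketch, assuming an in-process dictionary cache and a sweep that could run in a thread (a Celery periodic task could play the same role); `RefreshingCache`, `loader`, and `max_age` are hypothetical names, not from the article.

```python
import threading
import time


class RefreshingCache:
    """Cache whose entries are reloaded by a background sweep, not on read."""

    def __init__(self, loader, max_age, clock=time.monotonic):
        self.loader = loader    # hypothetical callable: key -> fresh value
        self.max_age = max_age  # seconds before an entry counts as stale
        self.clock = clock      # injectable so staleness can be tested
        self._data = {}         # key -> (value, loaded_at)
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            entry = self._data.get(key)
        if entry is None:
            return self._load(key)  # first access loads synchronously
        return entry[0]             # may be slightly stale, never waits on a reload

    def _load(self, key):
        value = self.loader(key)
        with self._lock:
            self._data[key] = (value, self.clock())
        return value

    def refresh_stale(self):
        """Reload every entry older than max_age; return the refreshed keys."""
        now = self.clock()
        with self._lock:
            stale = [k for k, (_, t) in self._data.items() if now - t >= self.max_age]
        for key in stale:
            self._load(key)
        return stale

    def run_forever(self, interval, stop_event):
        """Sweep loop meant for a daemon thread; stops when stop_event is set."""
        while not stop_event.wait(interval):
            self.refresh_stale()
```

The trade-off is that readers never pay reload latency after the first access, but the sweep keeps reloading entries nobody reads unless some eviction policy is added on top.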

annowiki commented on Show HN: While painting this, I had nothing in mind   krickelkrackel.com/en... · Posted by u/hemmert
hemmert · 3 years ago
Thank you!

Actually, the longest part of building this was finding the right wording. Here are some previous attempts:

- When painting this, I had no artistic intent

- When painting this, there was nothing I wanted to express with the picture

- When painting this, I wanted to express nothing

In German, there's a nice ambiguity: "Bei diesem Bild habe ich mir nichts gedacht" is somewhere between 'I was not thinking' and 'I wanted to express nothing'. It's also the reason why I went with a stamp in German (my mother tongue) and not an English one.

It is SO complicated to say that I wasn't up to anything with these pictures.

annowiki · 3 years ago
There's a piece of graffiti from the May 1968 Paris riots that goes "I have something to say but I don't know what." Reminds me of this. I can't find the original French but I think it's something like "J'ai quelque chose à dire mais je ne sais quoi."

Found it: https://www.dicocitations.com/citations/citation-67489.php

They use the pas

annowiki commented on A different approach to building C++ projects   rachelbythebay.com/bb/... · Posted by u/kimmk
bogwog · 3 years ago
Conan + (any build system) = problem solved

Conan has a learning curve, but it’s totally worth it. Anyone making their own build system should get some experience with a state of the art package manager before writing a single line of code, because chances are that it already solves whatever problem is motivating you.

annowiki · 3 years ago
I started as a Python programmer and was very used to package managers. I believed in them, I championed them. When I switched to C++ for work I was very disheartened that there wasn't a standard.

Conan obviously has promise, but I haven't spent much time with it; most of my experience with C++ package managers is with NuGet and vcpkg. However, my attitude toward package managers is changing.

I increasingly like _not_ using package managers because it makes me (and my company) way, way, way less likely to bloat our software with unnecessary third-party dependencies.

I wrote this in another thread: I never believed you should write something yourself if you can find a package for it. My boss told me I should write it all myself, since I could probably write it to be faster. I encountered a case where I needed to compare version numbers in Python. For the heck of it I wrote the simplest, quickest, most naive solution I could come up with and then timed it against the most recommended version comparison package in Python. I beat it by 20x in throughput.

I don't believe in package managers anymore. Obviously I'll keep using pip and SQLAlchemy in Python, but I'll happily spend the 20-30 minutes it takes adding something like nlohmann-json or md4c to my project over worrying about maintaining a package manager for C++ these days. Precisely because it makes me think twice about adding another dependency.
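For a sense of how small the naive version comparison mentioned above can be, here is one possible sketch. It is not the commenter's actual code, and it assumes purely numeric, dot-separated versions such as "1.10.0"; pre-release tags, epochs, and local version labels, which a full packaging library handles, are deliberately ignored. The function names are hypothetical.

```python
def parse_version(version):
    """Turn "1.10.0" into (1, 10, 0). Assumes numeric, dot-separated parts only."""
    return tuple(int(part) for part in version.split("."))


def compare_versions(a, b):
    """Return -1, 0, or 1 if version a is older than, equal to, or newer than b."""
    pa, pb = parse_version(a), parse_version(b)
    # Pad the shorter tuple with zeros so that "1.2" equals "1.2.0".
    width = max(len(pa), len(pb))
    pa += (0,) * (width - len(pa))
    pb += (0,) * (width - len(pb))
    return (pa > pb) - (pa < pb)
```

Comparing integer tuples avoids the classic string-comparison bug where "1.10" sorts before "1.9". Skipping the edge cases is presumably where much of the speed comes from, so the trade is throughput for generality.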

annowiki commented on Ask HN: Math books that made you significantly better at math?    · Posted by u/optbuild
annowiki · 3 years ago
A Programmer's Introduction to Mathematics https://pimbook.org/

It introduces math from a mathematician's point of view (complete with proofs, etc.) rather than rote memorization and exercises, but it does so from the perspective of a programmer.

annowiki commented on Ask HN: How would you design an alternative Twitter    · Posted by u/dustedcodes
matt_s · 3 years ago
I think you're jumping into the technical bits right away without thinking through requirements/features. Maybe we (the royal 'we' as in all of humanity) shouldn't have a public town square? When you build something to have marginalized voices be heard you are also including all marginalized voices. The MarginalizedVoice super object has EqualRightsForSquirrels as well as HatefulRacistUncle child objects. There are very valid points, that people don't like to hear, about how the concept of someone/group choosing what MarginalizedVoice gets heard and what doesn't isn't fair. If the basis of a platform is "public town square" you're going to have to deal with all MarginalizedVoices.

Content moderation (incl comments) doesn't scale so don't build something with public town squares. That's only a feature platform builders want in order to sell advertisements. If the thought of not having an ad-driven platform leads you to "users won't pay for it" then maybe think of a platform users would pay for or some other way to have it be sustainable.

annowiki · 3 years ago
You have me thinking about kind of a cool board idea: 150-person Twitter boards. Cap it at 150. People in that group can all vote on their own moderation. They can't interact with groups in other boards through quote tweeting or voting, though obviously they can copy-paste.

You might get racist boards, but then it's easy to get rid of all of them at once.

150 being Dunbar's number (https://en.wikipedia.org/wiki/Dunbar%27s_number), of course.

I have no way to distribute anything. I tried to do my own annotations board on literature but no one joined. I just think it sounds cool to be in a personable board like that.

u/annowiki

Karma: 165 · Cake day: September 6, 2019