Readit News
hdvr commented on The value of hitting the HN front page   mooreds.com/wordpress/arc... · Posted by u/mooreds
PaulHoule · 18 days ago
To go into more detail.

My best model was developed about two years ago and hasn't been updated. It uses bag-of-words features as input to logistic regression. I tried a lot of things, like BERT+pooling, and they didn't help. A model that only considers the domain is not as good as the bag-of-words one.
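For concreteness, a minimal sketch of that kind of setup, assuming scikit-learn; the placeholder data and vectorizer settings are illustrative guesses, not PaulHoule's actual pipeline:

    # Bag-of-words title features feeding logistic regression (hypothetical).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    titles = ["Richard Stallman is dead",        # placeholder training data
              "Show HN: my weekend todo app",
              "New LLM beats GPT-4 on benchmarks",
              "My blog has a new theme"]
    hit = [1, 0, 1, 0]                           # 1 = story got >10 votes

    model = make_pipeline(
        CountVectorizer(),                       # bag-of-words features
        LogisticRegression(max_iter=1000),
    )
    model.fit(titles, hit)
    print(model.predict_proba(["Richard Stallman is dead"])[0, 1])  # P(>10 votes)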

This kind of model reaches a plateau once it has seen about 10,000-20,000 samples, so for any domain (e.g. nytimes.com, phys.org) that has more than a few thousand submissions it would make sense to train a model just for that domain.
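One way that per-domain split could be organized (a sketch under my own assumptions about the threshold and data layout, not the author's code):

    from collections import Counter
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_per_domain(titles, hits, domains, min_samples=2000):
        """Fit a global title model plus specialists for high-volume domains."""
        def fit(ts, ys):
            m = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
            return m.fit(ts, ys)
        global_model = fit(titles, hits)
        specialists = {}
        for dom, n in Counter(domains).items():
            if n >= min_samples:                 # "more than a few thousand"
                idx = [i for i, d in enumerate(domains) if d == dom]
                specialists[dom] = fit([titles[i] for i in idx],
                                       [hits[i] for i in idx])
        return global_model, specialists

    # At prediction time: specialists.get(domain, global_model)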

YOShiNoN and I have also submitted so many articles in the last two years that it would be worth it for me personally to train a model on our own submissions, because ultimately I'm drawing them from a different probability distribution. (I have no idea to what extent submissions behave differently depending on whether or not I submit them; I know I have both fans and haters.)

I see recommendation problems as involving two questions: "is the topic relevant?" and "is the article good quality?" The title is good for the first but very limited for the second. The domain is probably more indicative of the second, but my own concept of quality is nuanced and has a bit of "dose makes the poison" thinking in it. For instance, I think phys.org articles draw out a conclusion in a scientific paper that you might not get from a superficial read (good), but they also have obnoxious ads (bad). So I feel like I only want to post a certain fraction of them.

So far as regression goes, this is what bothers me. An article that has the potential to get 800 votes might get submitted 10 times and get

1,50,4,800,1,200,1,35,105,8

votes, or something like that. The ultimate predictor would show me the probability distribution, but maybe that's asking too much, and all I can really expect is the mean, which is about 120 in that case. That's not a bad estimate on some level, but if I were using the L2 norm I'd get a very high loss on every outcome except the one where it got 105. The loss is going to be high no matter what prediction I make, so it's not that a better model can cut my loss in half; rather, a better model might reduce my loss by 0.1%, which doesn't seem like too great a victory -- though on some level it is an honest account of the fact that it's a crap shoot, the real uncertainty of the problem that will never go away. On the other hand, the logistic regression model gives a probability, which is a very direct expression of that uncertainty.
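To put numbers on that point, using the hypothetical vote sequence above:

    import numpy as np

    votes = np.array([1, 50, 4, 800, 1, 200, 1, 35, 105, 8])
    mean = votes.mean()                    # 120.5, the L2-optimal prediction
    mse = ((votes - mean) ** 2).mean()     # ~55,000, i.e. RMSE of ~235 votes
    # No prediction does much better: the loss is dominated by the spread
    # across resubmissions, not by model error.

A common workaround (not something the comment commits to) is to regress on log1p(votes), or to use a count likelihood such as Poisson or negative binomial, which shrinks the penalty on the 800-vote outlier; the classifier's probability output expresses the same irreducible uncertainty more directly.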

hdvr · 18 days ago
It's an interesting problem. If most of the votes concentrate on the first submission, I wouldn't bother including subsequent submissions in the model. However, if that's not the case (as in your example), you could actually include the past voting sequence, submission times, and domain as predictors. In your example, the 800 votes might then (ideally) correspond to a better time slot and source/domain than the first single vote.
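One illustrative encoding of those predictors per submission attempt (the feature layout is my assumption, not something from the thread):

    import numpy as np

    def features(past_votes, hour_utc, weekday, domain, domain_index):
        """Feature vector for one resubmission attempt (hypothetical layout)."""
        return np.concatenate([
            [len(past_votes),                        # prior submission count
             np.log1p(sum(past_votes)),              # votes gathered so far
             np.log1p(max(past_votes, default=0))],  # best prior outcome
            np.eye(24)[hour_utc],                    # time-of-day slot
            np.eye(7)[weekday],                      # day of week
            np.eye(len(domain_index))[domain_index[domain]],  # domain one-hot
        ])

    # e.g. features([1, 50, 4], hour_utc=14, weekday=2, domain="phys.org",
    #               domain_index={"phys.org": 0, "nytimes.com": 1})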
hdvr commented on The value of hitting the HN front page   mooreds.com/wordpress/arc... · Posted by u/mooreds
PaulHoule · 18 days ago
Re: “HN is very fickle”

I have a model that, given a headline, predicts whether the story will get >10 votes. It's a terrible model, for a few reasons. The most fundamental is that if the same article is submitted 10 times it could get wildly different scores; that's the way it goes. The tail end of the model [1] is logistic regression because it deals gracefully with this kind of situation. I wish I knew how to treat this as a regression problem (predict the score); there is probably a better loss function than the one I use, but when I treat it as a regression problem I get an even worse model.

The highest score this model ever gives is 70%, for something like “Richard Stallman is dead”.

I have another model that predicts whether the comment/score ratio is > 0.5, which is about the average for the site. This is a much better model, close to the first recommender models I made. Trained on articles with score > 10, the input is less noisy, for one thing. It's how I learned y'all like to talk about cars.
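My reading of that target, as a tiny sketch (the 0.5 threshold and the score > 10 filter come from the comment; everything else here is assumed):

    # Build labels for the "is this a chatty story?" model.
    titles   = ["A", "B", "C"]                       # placeholder data
    comments = [40, 5, 12]
    scores   = [50, 30, 11]

    kept = [(t, c, s) for t, c, s in zip(titles, comments, scores) if s > 10]
    X = [t for t, c, s in kept]                      # headline, as before
    y = [int(c / s > 0.5) for t, c, s in kept]       # 1 = above-average ratio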

[1] what attention folks call the “head”

hdvr · 18 days ago
It seems predicting the score directly (regression) is almost impossible without considering the associated domain. E.g., headlines with the letters GPT in them from openai.com get an order of magnitude more votes than similar headlines from other sites.
hdvr commented on The value of hitting the HN front page   mooreds.com/wordpress/arc... · Posted by u/mooreds
hdvr · 18 days ago
A few years ago, on my birthday, I quickly checked the visitor stats for a little side project I had started (r-exercises.com). Instead of the usual 2 or 3 live visitors, there were hundreds. It looked odd to me—more like a glitch—so I quickly returned to the party, serving food and drinks to my guests.

Later, while cleaning up after the party, I remembered the unusual spike in visitors and decided to check again. To my surprise, there were still hundreds of live visitors. The total visitor count for the day was around 10,000. After tracking down the source, I discovered that a really kind person had shared the directory/landing page I had created just a few days earlier—right here on Hacker News. It had made it to the front page, with around 200 upvotes and 40 comments: https://news.ycombinator.com/item?id=12153811

For me, the value of hitting the HN front page was twofold. First, it felt like validation for my little side project, and it encouraged me to take it more seriously (despite having a busy daily schedule as a freelance data scientist). But perhaps more importantly, it broadened my horizons and introduced me to a whole new world of information, ideas, and discussions here on HN.

Thank you HN for this wonderful birthday gift!

u/hdvr

Karma: 158 · Cake day: September 18, 2024
About
https://www.linkedin.com/in/gezondheidszorg/