Readit News
Posted by u/tamaharbor 4 years ago
Ask HN: Why are there so many duplicate articles?
Should we be scanning recent articles in order to cut down on duplicate submissions?
gabrielsroka · 4 years ago
lproven · 4 years ago
Oh, nicely done! :-)
dang · 4 years ago
We treat a submission as a duplicate if the story has had significant attention in the last year or so. This is in the FAQ: https://news.ycombinator.com/newsfaq.html.

If a story hasn't had significant attention in the last year or so, then we don't treat it as a dupe, because it's important for good articles to get multiple chances at getting attention. Otherwise the randomness of what gets noticed on /newest would be even more dominant than it already is.
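
The rule described above could be sketched roughly as follows. This is only an illustration, not HN's actual code: the `Story` type, the exact-URL comparison, and the numeric thresholds for "significant attention" are all hypothetical, since the comment gives no numbers beyond "the last year or so".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Story:
    url: str
    points: int
    comments: int
    submitted_at: datetime


# Hypothetical values: the thread only says "significant attention"
# and "the last year or so", without giving numbers.
ATTENTION_POINTS = 20
ATTENTION_COMMENTS = 10
DUPE_WINDOW = timedelta(days=365)


def had_significant_attention(story: Story) -> bool:
    """Assumed definition of 'significant attention' for this sketch."""
    return story.points >= ATTENTION_POINTS or story.comments >= ATTENTION_COMMENTS


def is_dupe(url: str, now: datetime, earlier: list[Story]) -> bool:
    """True only if an earlier submission of the same URL got attention recently."""
    return any(
        s.url == url
        and now - s.submitted_at <= DUPE_WINDOW
        and had_significant_attention(s)
        for s in earlier
    )
```

Under these assumptions, a repost of a story that was ignored last month, or one that was popular three years ago, is not treated as a dupe.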

snet0 · 4 years ago
In favour of duplicates: not everyone is on HN frequently, and people can miss out on some excellent posts if they're not reposted.
giancarlostoro · 4 years ago
There's a way to look at daily historical versions of the front page:

https://news.ycombinator.com/front

I kind of wish you could do hourly for the past day or so.

beardyw · 4 years ago
It does spot duplicates. I have submitted something and been immediately directed to an existing submission of the same story.

Presumably an exact URL match, and maybe within a timescale?

Rerarom · 4 years ago
Yeah, that's the issue, the autospotter is not perfect. Maybe someone from HN could enlighten us.
pvg · 4 years ago
There's a novel's worth of moderator commentary on dupes; the HN search thing will dig it up for you.
anthropodie · 4 years ago
I like it the way it is currently. Now suppose you add some smartness to the website: on what basis do you decide which duplicate to remove? HN is the last website where I want AI added to the recommendation system. We are already being fed so much on every other platform.

Let the community moderate itself. It's old school and maybe dumb, but not everything needs smart-ass AI.

lproven · 4 years ago
You are right.

I propose automatic de-dupe: whatever the title says, if the exact same URL has already been submitted, just count it as an upvote on the existing story... optional: if the submitted caption is different, then add a comment that says "also submitted by X with the caption `Y`."
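
A minimal sketch of this proposal, assuming an in-memory store keyed by exact URL. The `Story` type and the `submit` function are hypothetical names used for illustration and do not reflect how HN is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Story:
    url: str
    title: str
    submitter: str
    upvoters: set = field(default_factory=set)
    comments: list = field(default_factory=list)


def submit(stories: dict, url: str, title: str, user: str) -> Story:
    """Create a story, or fold a repeat submission into the existing one."""
    existing = stories.get(url)
    if existing is None:
        # First submission of this exact URL: the submitter counts as an upvote.
        story = Story(url, title, user, {user})
        stories[url] = story
        return story
    # Exact same URL already submitted: count it as an upvote instead.
    existing.upvoters.add(user)
    if title != existing.title:
        # Optional part of the proposal: record the alternative caption.
        existing.comments.append(
            f"also submitted by {user} with the caption `{title}`."
        )
    return existing
```

Because the match is on the exact URL string, variants of the same link (tracking parameters, trailing slashes) would still slip through as separate stories.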

lproven · 4 years ago
That does not match what I see. You did see the addendum after the ellipsis, right?

I have seen my own articles submitted more than once.

smugma · 4 years ago
This is existing behavior
bitxbitxbitcoin · 4 years ago
People want upvotes so they submit it. Setting up scanning for recent articles is probably harder than it seems. Can’t think of any other forum that does it.
krapp · 4 years ago
>Setting up scanning for recent articles is probably harder than it seems. Can’t think of any other forum that does it.

The vast majority of dupes posted here are the same domain, site and title. Catching that would be as easy as a call to the Algolia API. I'm using the Hacker News enhancement suite add-on for Firefox and it does that to generate a list of prior threads. That leaves edge cases, but human curation should be enough for that.

Then again, HN checks for canonical URLs, it could check for canonical title, author and excerpt text in metadata as well. But just a bare title match would catch almost everything, especially during high velocity periods. Then again again, I think I mentioned this in the past and dang said they tried to find more robust solutions in software but false negatives and edge cases made it infeasible.

HN could prompt submitters to reply to an existing thread instead of posting a duplicate, if an open thread exists, while allowing the option to post. That should remove the possibility of the software rejecting a legit post due to a bad match. Having the process be entirely automated would probably be a bad idea.
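
As a rough sketch of the title lookup and prompt described above: the code below builds a query against the public HN Search (Algolia) endpoint and filters the returned hits for a bare title match. The helper names are hypothetical, the `objectID` and `title` fields are assumed from that API's JSON responses, and the normalization (case and whitespace only) is an illustrative choice rather than anything HN or the add-on is known to do.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_ENDPOINT = "https://hn.algolia.com/api/v1/search"


def build_search_url(title: str) -> str:
    """Query URL restricted to stories (not comments)."""
    return SEARCH_ENDPOINT + "?" + urlencode({"query": title, "tags": "story"})


def normalize_title(title: str) -> str:
    """Ignore case and runs of whitespace when comparing titles."""
    return " ".join(title.casefold().split())


def prior_threads(title: str, hits: list) -> list:
    """Links to earlier threads whose title matches after normalization."""
    wanted = normalize_title(title)
    return [
        f"https://news.ycombinator.com/item?id={h['objectID']}"
        for h in hits
        if normalize_title(h.get("title") or "") == wanted
    ]


def fetch_hits(title: str) -> list:
    """Network call; kept separate so the matching logic can be tested offline."""
    with urlopen(build_search_url(title), timeout=10) as resp:
        return json.load(resp).get("hits", [])
```

A submission form could call `prior_threads(title, fetch_hits(title))` and, if the list is non-empty, show the links and ask the submitter whether to post anyway, which keeps a human in the loop for false matches.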

lysergia · 4 years ago
Some sites have multiple versions of the same page. Like there would be a GitHub repo, and then a GitHub page residing on site.github.io that is usually the same thing. Also some projects eventually get an entire dedicated domain after residing on GitHub for a number of years.
PaulHoule · 4 years ago
It's a tough problem, but if people want to make "the next thing", something much better than what we've had before, solving it would be one way to do it.
gerikson · 4 years ago
Lobste.rs does it. If a submission is a duplicate within six months it's not accepted.
mellavora · 4 years ago
Great, thanks. As if I didn't already waste too much time on HN, now I need to keep an eye on Lobsters.

At least now I know what I'm going to do this weekend.

taubek · 4 years ago
Who marks something as a duplicate? Moderators or is there some kind of a bot?
WithinReason · 4 years ago
AFAIK when there is a single comment under an article linking to HN it's recognized as a dupe of that link.