Readit News
robert1er commented on Using NLP on raw HTML from scrapers More efficient than training them directly?    · Posted by u/robert1er
PaulHoule · 3 years ago
I found these papers:

https://arxiv.org/abs/2210.03945

https://arxiv.org/abs/2202.00217

https://arxiv.org/abs/2201.10608

My take is that fine-tuning a language model to parse HTML shouldn't be terribly difficult, but you probably do need a computer with a good GPU. The one problem current LLMs have is that they all have a limited attention window: BERT has a 512-token window and ChatGPT has a 4096-token window, where a token is typically less than a word. There are models with much longer windows (Reformer, for example), but they don't work as well as the state-of-the-art models, at least not yet.

Practically, that means you can't feed a huge HTML document into the model without splitting it up first. If you do split it up, the model loses the ability to see the document as a whole (for instance, to match up the <div> and </div> tags).
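To make the splitting problem concrete, here is a minimal sketch of overlapping-window chunking. It is only an illustration: the function name `split_into_windows`, the 64-token overlap, and the whitespace "tokenizer" are all assumptions made for this example (a real model would use its own subword tokenizer, which usually produces more tokens than words).

```python
def split_into_windows(tokens, window=512, overlap=64):
    """Split a token list into chunks of at most `window` tokens.

    Consecutive chunks share `overlap` tokens, so a little context carries
    across each boundary. Hypothetical helper, not part of any library.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        # Stop once this chunk has reached the end of the document.
        if start + window >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    # Stand-in document: one <div> wrapping 1000 words.
    # Whitespace splitting is an assumed substitute for a real tokenizer.
    html = "<div> " + "word " * 1000 + "</div>"
    tokens = html.split()

    chunks = split_into_windows(tokens, window=512, overlap=64)
    for i, chunk in enumerate(chunks):
        print(i, len(chunk), "<div>" in chunk, "</div>" in chunk)
```

In this toy document the opening <div> lands in the first chunk and the closing </div> in the last one, so no single window sees both tags. Overlap softens the boundaries but does not restore a whole-document view.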

robert1er · 3 years ago
Thank you! This is really helpful. As someone who knows a bit about NLP but isn't from that world, I don't know if I'd have found these papers or known where to look. I appreciate the help.
robert1er commented on Using NLP on raw HTML from scrapers More efficient than training them directly?    · Posted by u/robert1er
PaulHoule · 3 years ago
It’s worth trying.
robert1er · 3 years ago
I personally don’t have the tools to do this, and I don’t have enough of a budget to run it as an experiment. Where would you go to find the right people to evaluate this concept? I’ve looked around Stack Overflow, Reddit, etc., but can’t seem to find anyone talking about this.
robert1er commented on Twilio, Asana to List on Long Term Stock Exchange   wsj.com/articles/twilio-a... · Posted by u/rayshan
eries · 5 years ago
Founder/CEO of LTSE here. Happy to answer questions if anyone wants to know more, AMA
robert1er · 5 years ago
How correlated can we expect the LTSE performance and the stock performance on other exchanges to be?
robert1er commented on I Built a Bot to Apply to Thousands of Jobs at Once   fastcompany.com/3069166/i... · Posted by u/miraj
robert1er · 9 years ago
I'm the guy who wrote the article. I just came across this conversation and I have to say that you folks have touched on a ton of really interesting elements that I didn't have the space to write. Overall, I agree with a ton of what I'm reading here and the critiques are also mostly fair. I just joined so that I could thank you all but I'd be happy to answer specific questions if anyone would like. Honestly, this is one of the most thoughtful discussions about the article I've come across. Thanks!

u/robert1er

Karma: 3 · Cake day: April 1, 2017