tomohelix · 2 years ago
No offense, but isn't the NLP field effectively solved by the creation of LLMs, at least for the majority of tasks you would expect from an NLP application?

I am sure you can find some special areas or niches where traditional NLP approaches would outcompete a black box like LLMs. But with LLMs now becoming efficient enough after quantization that you can run them locally, I think there is a good argument for saying simple NLP is basically solved.

cantdutchthis · 2 years ago
> I think there is a good argument in saying simple NLP is basically solved

In my experience LLMs can get about 70-80% accuracy on a bunch of NER and text classification tasks if you give them a reasonable prompt. That's not nothing, and it's something you can get started with super quickly. But you'll have slow responses and typically a 3rd party running the inference.

Annotating about 2000-3000 examples yourself, on the datasets that I ran my benchmarks on, may get you closer to 80-90%. You'll typically also get fast inference that you can run on your own hardware no problem. By annotating the data myself I also like to think that I understand the problem much better as a consequence.
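
To make that concrete, here's roughly what turning your own annotations into spaCy training data looks like (file names, texts, labels and offsets below are made-up placeholders, not from my benchmarks):

    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")
    db = DocBin()

    # A couple of toy annotations standing in for the 2000-3000 you'd collect.
    annotated = [
        ("Apple acquired the startup in 2021.", [(0, 5, "ORG")]),
        ("The conference was held in Berlin.", [(27, 33, "GPE")]),
    ]
    for text, spans in annotated:
        doc = nlp.make_doc(text)
        doc.ents = [doc.char_span(start, end, label=label) for start, end, label in spans]
        db.add(doc)

    db.to_disk("./train.spacy")
    # Then train a small task-specific model:
    #   python -m spacy init config config.cfg --pipeline ner
    #   python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy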

Don't get me wrong. LLMs are cool and interesting ... but they don't seem to replace old-school methods just yet.

2sk21 · 2 years ago
Indeed, one of the last big projects I worked on before I retired used an LSTM model for named entity recognition. It was many times faster than anything you can do with an LLM.
mxkopy · 2 years ago
I doubt LLMs will ever replace old-school methods precisely for this reason:

> By annotating the data myself I also like to think that I understand the problem much better as a consequence.

I’m sure that once we peer into the black box, we’ll find just more refined old-school methods.

pletnes · 2 years ago
Good points. This means you can use LLMs to create an 80% accurate dataset, then manually correct the last 20% to get all the way.
rhdunn · 2 years ago
LLMs are useful for things like predicting/generating text, and summarizing text. They are not useful if you want to do other NLP tasks that include things like:

1. Identifying (and highlighting/extracting) spans of text written in a different language from the surrounding text (e.g. a French phrase in an English document).

2. Text search and highlighting, where you need to do things like perform word stemming or lemmatization on the search input, find matching documents, and highlight spans.

Also, LLMs have a different tokenization model, so you get tokens like "don"/"'t!" -- this makes them difficult to work with in downstream NLP applications where having words, numbers, punctuation, and symbols as distinct tokens is more useful.
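
A quick way to see the difference (using tiktoken's GPT-2 encoding as a stand-in for LLM tokenization; the exact subword pieces depend on the tokenizer):

    import spacy
    import tiktoken

    text = "don't!"

    # Subword tokens from a BPE tokenizer, e.g. pieces like "don", "'t", "!"
    enc = tiktoken.get_encoding("gpt2")
    print([enc.decode([t]) for t in enc.encode(text)])

    # Linguistic tokens from spaCy's rule-based tokenizer: ['do', "n't", '!']
    nlp = spacy.blank("en")
    print([t.text for t in nlp(text)])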

NLP is not solved either: things like part-of-speech tagging (identifying nouns, adjectives, etc.) are not 100% accurate, and sentence-level accuracy tends to be lower than word-level accuracy.

yorwba · 2 years ago
1. Prompt:

"""
The following English text contains several French phrases. List all of them.

Text: Gabonese President Ali Bongo Odimba was deposed in a coup d'etat spearheaded by his father's former aide-de-camp Brice Oligui Nguema, shortly after the announcement that Bongo had won the 2023 election.

List of French phrases from the text:
-
"""

Response:

"""
coup d'etat
- aide-de-camp
"""

2. Prompt:

"""
Turn all words in the following text into their lemma form.

Text: Gabonese President Ali Bongo Odimba was deposed in a coup d'etat spearheaded by his father's former aide-de-camp Brice Oligui Nguema, shortly after the announcement that Bongo had won the 2023 election.

Lemmatized text:
"""

Response:

"""
Gabonese President Ali Bongo Odimba be depose in a coup d'etat spearhead by his father's former aide-de-camp Brice Oligui Nguema, shortly after the announcement that Bongo have win the 2023 election.
"""

Maybe too slow or too inaccurate or too expensive compared to bespoke solutions, but certainly not completely useless.

If you want a different tokenization, just turn the LLM response back into text and retokenize it.

billythemaniam · 2 years ago
I agree LLMs alone aren't good at search, but their embeddings replace the need for stemming, manual synonym lists, etc. in most cases. LLMs can also be used for query understanding, which can improve the keywords submitted to the engine, and for extracting the best snippet for a highlight. LLMs + search are better than either alone. However, LLMs still have an inference performance/cost issue which may make them unsuitable for some search use cases.
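
As a minimal sketch of what "embeddings instead of stemming/synonym lists" can look like (a small sentence-embedding model stands in for LLM-derived embeddings here; the model name is just a common default, not a recommendation):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    docs = [
        "How to reset a forgotten password",
        "Shipping times for international orders",
        "Updating your billing information",
    ]
    query = "I can't log in to my account"

    # No stemming or synonym lists: the query matches on meaning, not shared tokens.
    doc_emb = model.encode(docs, convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_emb)[0]

    best = int(scores.argmax())
    print(docs[best], float(scores[best]))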
rmbyrro · 2 years ago
Spacy is a great library and still way better than LLMs in a narrow set of tasks, for a few reasons:

- Efficiency: get the same job done in less time, with less memory

- Consistency: always gives you the same result, no prompt engineering required, no hallucinations, no creativity

- Security: parse untrusted text without worrying about prompt injection

I work on a project that uses complex NLP pipelines. LLMs are replacing or enhancing parts of it, but not replacing spacy at all.

Al-Khwarizmi · 2 years ago
This is a common opinion, but when I speak to small companies that want to use NLP (e.g. in the medical domain) and I give an account of the advantages and disadvantages of "classic" NLP vs. generative LLMs for information extraction applications, they tend to prefer classic NLP more often than not.

The possibility of models making things up combined with zero explainability, together with high costs (or using third-party services and having to upload sensitive data who knows where) are red flags for many.

This may change in a few years if the weaknesses of generative LLMs are successfully addressed, but for the moment I think "classic" NLP still has its place.

roenxi · 2 years ago
1) It is extremely rare for a field to ever 'be solved'. There is still active research into how to multiply 2 numbers together. NLP is not anywhere close to solved.

2) LLMs have different trade-offs to fundamental techniques. Linear regression still gets lots of use despite there usually being a theoretically better method for any specific application. There will be parallels to that in NLP.

3) Isn't the article talking about training things like LLMs? It is right there - "Chapter 4: Training a neural network model".

axpy906 · 2 years ago
"Large language model", in my opinion, means a model with billions of parameters that possibly cannot fit on a single machine.

Not sure what Spacy is doing under the hood these days, but I always thought of its "neural net" as a word2vec-type model, which won't meet the above definition.

visarga · 2 years ago
Funny thing, you can get LLM+Code Interpreter to do your linear regression with sklearn. So LLM+toys "can do" linear regression, or any other library algorithm.

And I think I read in some paper that pure LLMs can do regression tasks in few-shot mode. Maybe not as good as sklearn, but a decent result.
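
Roughly what the code-interpreter route amounts to: the LLM writes a few lines of sklearn and the library does the actual fitting (toy data here, purely for illustration):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([2.1, 4.2, 5.9, 8.1])

    reg = LinearRegression().fit(X, y)
    print(reg.coef_, reg.intercept_, reg.predict([[5.0]]))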

bob1029 · 2 years ago
The #1 thing for me right now is determinism and traceability. I am in a "serious" business domain and non-determinism is a big no-no. We have to be able to justify everything in a traditional sense at the end of the day. Explaining to a regulator that we declined a customer as a consequence of a vast, unthinkable sea of weights and biases is not going to fly.

For each predicted output token, I want to know exactly which source document(s) were utilized including indices from those documents and relevant statistics.

fogx · 2 years ago
> For each predicted output token, I want to know exactly which source document(s) were utilized including indices from those documents and relevant statistics.

You don't do this for any other kind of decision or tool, why do you need it for LLMs?

I think in truth you need a source that convinces you (or the regulator) that your choice is acceptable, so that you can pass off the responsibility. If the LLM were to give you an answer backed by relevant (cited) source documents of regulations, plus a good explanation, it would make no difference compared to a human worker doing the same -> this is already possible

gattilorenz · 2 years ago
I don’t understand what it means to be “solved”. It’s like saying that “architecture is now solved”, “physics is solved”, or “programming is solved”.

It’s a field of science and/or engineering; it’s not like we will ever run out of things to try/build/investigate. LLMs work… to a certain extent, with limitations and tradeoffs, and for some things. Would you spend days, money and CO2 to split a huge text corpus into sentences with an LLM, when a simpler program can do it just as well in three hours, with no need to find "the perfect prompt" (and no hoping that the LLM doesn't forget a sentence, add something in between, etc.)?
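
For sentence splitting specifically, the "simpler program" can be as small as this (a sketch using spaCy's rule-based sentencizer; the text is made up):

    import spacy

    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")

    text = "The coup took place in August. The announcement came shortly after. Analysts were surprised."
    for sent in nlp(text).sents:
        print(sent.text)

    # For a large corpus, stream it instead of calling nlp() per document:
    # for doc in nlp.pipe(corpus_texts, batch_size=1000): ...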

markisus · 2 years ago
I take it to mean that there is an effective generally accepted solution or methodology for problems in the field. Bridge building has been largely solved by methods of mathematical and computational structural analysis, manufacturing, and government regulation. We know how to build a bridge. Before the solution was known, designers would just go by intuition and we wouldn’t have any actual assurance that the bridge would hold. We can probably never solve larger domains like physics, programming, or architecture.
syllogism · 2 years ago
(Original author of spaCy and Explosion CTO here)

Okay so, first some terminology. LLMs can mean a bunch of different things, people call models the size of BERT LLMs sometimes. So let's talk specifically about in-context learning (ICL) with either zero or a few examples. So we'll say LLM ICL, and contrast that with techniques where you annotate enough data to train with, which might only be something like 10-40 hours of annotation. The something you do with that data is probably training a task-specific classification model initialised with weights from a language modelling objective. This is sometimes called "fine-tuning", but fine-tuning can also mean taking an LLM and adapting its ICL. So we'll just call it "training", and the fact you use transfer learning like BERT or even word vectors is just tactics.

Okay. So, here's something that might surprise you: ICL actually sucks at most predictive tasks currently. Let's take NER. Performance on NER on some datasets is _below 2003_. Here's some recent discussion: https://twitter.com/mayhewsw/status/1700139745769046409

The discussion focusses on how bad the CoNLL 2003 dataset is, and indeed it's a crap dataset. But experiments have also been done on other datasets, e.g. check out the comparison of ICL and training in this paper from Microsoft: https://universal-ner.github.io/ . When GPT4 is used this one paper reports it slightly better on some tasks: https://arxiv.org/abs/2308.10092 . Frustratingly they don't do enough GPT4 experiments. This other paper also does a huge number of experiments, but not with GPT4: https://arxiv.org/pdf/2303.10420.pdf

The findings across the literature are really clear. ICL is generally much worse than training a model in accuracy, and you generally don't need much training data to surpass ICL in accuracy.

For tasks like text classification, ICL sometimes does okay. But you need to pick the problem characteristics carefully. Most text classification tasks people actually want to do have something like 20 labels, the texts are kind of long, and the labels don't capture the category especially well. Applying ICL to such tasks is very non-trivial. Your prompt balloons up if you have lots of classes to predict between, and providing the examples is hard if your texts are even a few hundred words.

Let's say you want to do something ultra simple: classify articles into categories for some news site or blog. This is the type of problem text classifiers have been eating for breakfast for 20 years. This is not a difficult problem -- a unigram bag of words does fine, and the method of estimating the weights can be almost anything, like just averaged perceptron will be totally okay.
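
As a sketch of how little machinery that takes (sklearn's plain Perceptron standing in for an averaged perceptron; the articles and labels are invented):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import Perceptron
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "The striker scored twice in the final minutes",
        "Midfielder transfers to rival club for record fee",
        "The central bank raised interest rates again",
        "Quarterly earnings beat analyst expectations",
    ]
    train_labels = ["sports", "sports", "finance", "finance"]

    # Unigram bag of words + linear classifier: the 20-year-old baseline.
    clf = make_pipeline(CountVectorizer(), Perceptron())
    clf.fit(train_texts, train_labels)
    print(clf.predict(["Analysts expect interest rates to rise"]))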

But how will an LLM be able to do this? Probably your topic categories include several different types of article under them. If you know what those types of article are you can separate them out and make sure they're represented in the prompt. But now we're back at needing a tonne of domain knowledge about your problem -- that's like having to write custom features to make your model work. We all switched to deep learning so we wouldn't have to do that.

LLMs build a much more sophisticated representation of the meaning of the data to classify. But then you give them very few examples of the problem. So they can only build a shallow function from this deep representation. If you have lots of examples, you can learn a complex function from shallower features. For a great many classification tasks, this is better. The rest of your system usually needs the classification module to have some sort of consistent behaviours anyway. To do that, you basically have to make an annotation manual, and then you want to annotate evaluation documents. Once you're there the effort to make training data and train a model is minimal.

The other elephant in the room is the expense of the LLM solutions. The papers are missing results on GPT4 not because they're lazy, but because it's so expensive to use GPT4 as a classification solution that they want to get the rest of their results out the door.

The world cannot migrate all its current NLP models for text classification and NER to ICL. There are nowhere near enough GPUs in the world for that to happen. And I don't know about you, but I expect the number of text classification and NER models to grow, not shrink. So, the idea that we'll stop training small models for these tasks is just implausible. The OpenAI models that support batching are almost viable for prediction, but models like GPT4 don't support it (perhaps due to the mixture of experts?), so it's super slow.

The other thing is, many aspects of language that are useful as annotations are consistent linguistic features. The English language codes for proper names and numeric entities. They behave differently in the grammar. So some sort of named entity annotation can be done once, and then the model trained and reused. This is what spaCy does. We do this for a variety of other useful annotations across languages. We actually need to do much more: we need to collect new annotations for these models to keep them up to date, and we need to do this for more tasks, such as semantic role labelling. But it's definitely a good way to reuse work. We can do this annotation once, train the models, and users can reuse the models.

The strength of ICL is that you can get started very easily, without doing the work of annotation and training. There's lots of research on making ICL few-shot learning less bad on arbitrary text classification, NER and other tasks. We're working hard to take these results from the literature and build best-practice prompts and parsers you can use as a drop-in annotation module in spaCy: https://github.com/explosion/spacy-llm . Our annotation tool Prodigy also supports initializing the annotations from an LLM, and just correcting the output: https://prodigy.ai . The idea is to let you start with an LLM, and then transition to a model you train yourself, which can be run much faster.
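
For the curious, spacy-llm is configured like other spaCy components; a rough sketch (the exact registered task/model names depend on the spacy-llm version you have installed, so treat the ones below as placeholders):

    from spacy_llm.util import assemble

    # config.cfg (task/model names like "spacy.NER.v2" / "spacy.GPT-3-5.v1" are version-dependent):
    #
    # [nlp]
    # lang = "en"
    # pipeline = ["llm"]
    #
    # [components]
    #
    # [components.llm]
    # factory = "llm"
    #
    # [components.llm.task]
    # @llm_tasks = "spacy.NER.v2"
    # labels = ["PERSON", "ORG", "LOCATION"]
    #
    # [components.llm.model]
    # @llm_models = "spacy.GPT-3-5.v1"

    nlp = assemble("config.cfg")
    doc = nlp("Brice Oligui Nguema seized power in Gabon.")
    print([(ent.text, ent.label_) for ent in doc.ents])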

akgfab · 2 years ago
Also enjoyed your blog post on similar issues: https://explosion.ai/blog/against-llm-maximalism

Excited to see how curated transformers works as an alternative to hf!

ianbicking · 2 years ago
"If you know what those types of article are you can separate them out and make sure they're represented in the prompt. But now we're back at needing a tonne of domain knowledge about your problem -- that's like having to write custom features to make your model work"

I think this is where the different perspectives come into play.

If you're an NLP practitioner you are thinking, oh no! I need to know a lot about audience intention and how the articles are represented navigationally and the kind of variety people are looking for and how articles might fit multiple categories, etc, etc. And you have to think about these things on a meta level ("prompt engineering"), because you have to instruct the model on how to act in an abstract way.

If you're someone who wants to run a news site then you already are thinking a ton about these things, and probably have a dozen things you'd like to change and adjust, new ways of presenting content, etc. You _wish_ you could be thinking about these domain-specific topics.

What feels like a bug to the NLP practitioner – needing a deep understanding of the domain – is a feature to... everyone else. It's a feature to the people who care most about the results.

The other big perspective difference here, I believe, is how you think about goals. How many tasks ARE categorization? My intuition is that it's a quite small number. There are many tasks that can be implemented with one step as categorization, but that is seldom the task. To the NLP practitioner categorization might seem very prominent – that's when someone calls you up or hands over the work. But with an LLM you might be able to do a much larger portion of the real task, with or without categorization.

Even with a categorization task, when I'm working with an LLM I usually don't produce just a "category", but produce other information at the same time, often using natural language as a first-class data type because it can be fed back into an LLM. In my experience the results are often (usually?) much higher quality because I'm not breaking things down into steps where the inaccuracies propagate between steps, but instead going right for the result, and using a model that can basically "check" itself against general knowledge, scrubbing out nonsensical results during inference. (As a result the remaining inaccuracies often appear plausible and are labeled "hallucinations"... it can make things more challenging, but what we don't see are the multitude of obvious inaccuracies that a more traditional NLP system would create, and which in a sense exist momentarily during the LLM inference.)

og_kalu · 2 years ago
>ICL is generally much worse than training a model in accuracy, and you generally don't need much training data to surpass ICL in accuracy.

"For the same model" is a huge asterisk you seem to be missing here. Fine-tuned GPT-4 is better than ICL GPT-4 and so on, but there's no guarantee that fine-tuned GPT-3 will be better than ICL GPT-4 - like how GPT-4 beats the fine-tuned Med-PaLM on medical domain tests.

You can train bespoke models with worse accuracy.

esafak · 2 years ago
At the same time, I expect the industry to consolidate NLP on LLMs. Standardization has benefits, and compute only gets cheaper.
ATMLOTTOBEER · 2 years ago
/thread. Really complete explanation
janalsncm · 2 years ago
Usually businesses want fast, accurate, and cheap.

LLMs are somewhat fast, somewhat accurate, and not cheap.

Basic NLP techniques can be faster, more accurate, and cheaper.

jerrygenser · 2 years ago
LLMs are not somewhat fast when measured against CPU models in spacy that are specialized for a task. They are multiple orders of magnitude slower for e.g. token classification/NER, text classification, etc.

That's not to mention the plumbing that spacy provides, which is mostly bindings to C code for tokenization, lemmatization, etc. Things like that are more algorithms problems than machine learning problems.
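
For comparison, the specialized-CPU-model path is a couple of lines (assuming the small English model is installed via: python -m spacy download en_core_web_sm):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Reuters reported that Brice Oligui Nguema led the coup in Gabon.")

    # NER, POS tags and lemmas in a single pass, CPU only.
    print([(ent.text, ent.label_) for ent in doc.ents])
    print([(tok.text, tok.pos_, tok.lemma_) for tok in doc])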

graboid · 2 years ago
I think for a lot of "simple NLP" stuff like constituency parsing, dependency parsing, POS tagging, etc. there have been fast and accurate models/algorithms for quite some time, and putting even a pruned, quantized, hyper-optimized LLM at them would feel like shooting a shotgun at a mosquito.
hnhg · 2 years ago
I have wondered this, but Spacy seems to be about delivering additional value on top of the LLMs. Take a look here: https://spacy.io/usage/large-language-models
pharmakom · 2 years ago
We still use regex despite there being LLMs that can approximate this functionality

They have different trade offs in the solution space.

I have no doubt that prompt engineering will eat into a bunch of work that was previously done using NLP though - particularly for prototyping.

jakderrida · 2 years ago
Quite honestly, getting LLMs to generate regex solutions seems most feasible to me. Regex can run through a character vector full of millions of strings in impressive time. Plugging all of them into an LLM and awaiting responses that take 20 seconds each might be impractical for most use cases. But asking an LLM to pack a regex statement full of synonyms and | operators just might be the better solution, especially if you can give the LLM some samples of what you're looking for.
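
Something like this, hypothetically: ask the LLM once for a synonym-packed pattern, then let plain re do the millions of matches:

    import re

    # A pattern of the kind an LLM might draft for "payment failure" mentions.
    pattern = re.compile(
        r"\b(payment|charge|transaction)s?\b.{0,20}?\b(failed|declined|bounced|rejected)\b",
        re.IGNORECASE,
    )

    docs = [
        "My payment failed twice yesterday.",
        "The charge was declined by the bank.",
        "Shipping was slow but otherwise fine.",
    ]
    print([bool(pattern.search(d)) for d in docs])  # [True, True, False]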
globular-toast · 2 years ago
Remember when neural nets were invented then forgotten about for decades because machine learning was solved by support vector machines etc?
abhgh · 2 years ago
"NLP is basically solved" --- somewhat but not the entirely yet. I work on a variety of specialized NLP use-cases for clients and there are different strengths these approaches have. The big thing with LLMs is the ability to deploy something quickly if (a) if you can craft an appropriate prompt, (b) put up enough guardrails to stop surfacing hallucinated responses to users. For assistant or co-pilot kind of systems (b) is somewhat easy to deal with, since the user is expected to curate or edit the LLM's responses, hence their proliferation in such use-cases. Note though: if the LLM sounds authoritative enough it might bias the user into believing it is right - not a big problem when the user is relying on it for generating good prose, but this is a problem when presenting facts (esp. based on a specialized knowledge base).

The downsides are (a) latency, (b) cost, and (c) the need for specialized training. In some applications you need near real-time responses, and smaller models are still better here. The cost angle is a little tricky - it depends both on the volume of calls you want to make and on whether that cost translates into revenue for the possibly incremental benefit you derive from an LLM. As an example, let's say you have a chatbot that does NER/slot-filling using spacy or stanza today, and let's also assume ChatGPT can do better - does the incremental accuracy, which comes at a cost since you're paying OpenAI, translate into incremental revenue (or profit)? I am not sure what the answer is - it's probably a no right now, but there is a positive deferred benefit: as your chatbot solution becomes better in many small and large ways, it might sell better in the future. The specialized training part is when you can gather a use-case-specific dataset that gets fairly good accuracy (comparable to or greater than an LLM), esp. considering (a) and (b). Note that these concerns largely hold even for self-hosted LLMs like llama - just that the precise breakeven point changes.

Rant423 · 2 years ago
Sometimes you want deterministic, rule-based results (e.g. https://spacy.io/usage/rule-based-matching) and not "fuzzy" ones (LLMs)
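
E.g. a token-pattern match that behaves the same way on every run (the pattern and label here are just an invented example):

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.blank("en")
    matcher = Matcher(nlp.vocab)

    # A number followed by a dosage unit, e.g. "10 mg" or "250 ml".
    matcher.add("DOSAGE", [[{"LIKE_NUM": True}, {"LOWER": {"IN": ["mg", "ml", "kg"]}}]])

    doc = nlp("Take 10 mg in the morning and 250 mg at night.")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)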
jgalt212 · 2 years ago
I wouldn't trust an LLM in any application where you need accuracy in the high 90s.
politelemon · 2 years ago
At least for those of us unfamiliar with the field, LLMs are an easy way of getting the task done. The only thing worth noting I suppose is that the most effective ones are behind paywalls.

In some cases, though, you may want the NLP task to run locally - you want it to be free, and it should not require excessive resources - and for those cases libraries like Spacy and NLTK make sense. Yes, there are projects like llama.cpp and friends, but it's a fast-moving field, they aren't suited for production applications yet, and even then they require high-end hardware setups which not everyone will necessarily be running.

jakderrida · 2 years ago
>At least for those of us unfamiliar with the field, LLMs are an easy way of getting the task done.

This is actually an excellent point. You don't really need to know, or even give a damn, how LLMs work in order to make use of them. Find me a C++ library where I can be 100% clueless as to what it does while also integrating it into my code.

hzay · 2 years ago
Regression is unsolved, for a start. Re LLMs, they are expensive to finetune and their inference times & memory footprints are poor compared to smaller models.
rolisz · 2 years ago
Spacy has integration with LLMs, so it can leverage them.

In general, though, you can often get by with models that are 10-100 times smaller, if not as general as an LLM.

huksley · 2 years ago
Anyone actually using SpaCy as part of a production API / queue processing? Like 100 RPS or more? I found it unstable for serving API requests unless you dedicate an unreasonably overpowered VM instance to it.

How do people solve sizing issues for Python-based APIs that process multiple simultaneous requests?
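
E.g. is batching documents through nlp.pipe, roughly like the sketch below (model name, batch size and process count are placeholders), the right direction, or is the answer just more gunicorn/uvicorn worker processes, each holding its own nlp object?

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def process_batch(texts, batch_size=64, n_process=1):
        """Run a batch of texts through the pipeline in one go instead of calling nlp() per request."""
        results = []
        for doc in nlp.pipe(texts, batch_size=batch_size, n_process=n_process):
            results.append([(ent.text, ent.label_) for ent in doc.ents])
        return results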

dchia · 2 years ago
old but gold