The best part of this article is perhaps the following critique of ngrams and by extension their popular use in modern algorithms:
> The text of Etymonline is built entirely from print sources, and is done entirely by human beings. Ngrams are not. They are unreliable, a sloppy product of an ignorant technology, one made to sell and distract, one never taught the difference between "influence" and "inform."
> Why are they on the site at all? Because now, online, pictures win and words lose. The war is over; they won.
One never taught the difference between "influence" and "inform". What a scathing rebuke of our modern world and the social media that is part of it: algorithms that attempt to quantify human speech and interaction, and get it wrong most of the time in their quest to maximize their owners' profits.
This somber warning is especially poignant in an age more and more ruled by generative AI, which I'm told is essentially an ngram predictor.
> The text of Etymonline is built entirely from print sources, and is done entirely by human beings. Ngrams are not.
I'm confused about this part, actually. I assume "entirely from print sources" means it does not include digital sources? That doesn't sound very relevant to the issues mentioned in the article, though: unless it uses the "complete" set of all print sources, it could totally have the same skewed-dataset issues; and humans can make the same mistakes as OCR does.
Etymonline compiles the information on etymology and historical usage from printed books (eg the Oxford English Dictionary). That is what is being referred to here. They are not having humans tally up different words from books. That data is entirely from ngrams.
Influence and inform are two sides of the same moral coin, where we claim others' ideas aren't their own, whereas we are the virtuous, informed ones who draw our own conclusions.
The low-pass filter of the mind only allows in what fits somewhere inside the existing framework. If you don't reject something, then being informed by it and being influenced by it are the same thing. In that framework, people who claim to be informed come off as high and mighty and a little lacking in self-consciousness.
Disagree, influencing someone and informing someone are orthogonal.
Influencing someone just means changing their behavior and/or beliefs. This can be done with either the truth or lies, or even just opinion (green is better than blue - neither true nor false).
Informing someone specifically means giving them true information, which may or may not influence them.
This is the fundamental problem of data analysis: your analysis is only as good as your data.
This is not an easy problem.
It's hard in general to evaluate data quality: How do we know when our data is good? Are we sure? How do we measure that and report on it?
If we do have some qualitative or quantitative assessment of data quality, how do we present it in a way that is integrated with the results of our analysis?
And if we want to quantitatively adjust our results for data quality, how do we do that?
There are answers to the above, but they lie beyond the realm of a simple line chart, and they tend to require a fair amount of custom effort for each project.
For example, in the Google Ngrams case, one could present the data quality information on a chart showing the composition of data sources over time, broken out into broad categories like "academic" and "news". But then you have to assign categories to all those documents, which might be easy or hard depending on how they were obtained. And then you also have to post a link to that chart somewhere very prominent, so that people actually look at it, and maybe include some explanatory disclaimer text. That would help, but it's not going to prevent the intuitive reaction when a human looks at a time series of word usage declining.
Maybe a better option is to try to quantify the uncertainty in the word usage time series and overlay that on the chart. There are well-established visualization techniques for doing this. But how do we quantify uncertainty in word usage? In this case, our count of usages is exact: the only uncertainty is related to sampling. To quantify it, we must estimate how much our sample of documents deviates from all documents written at that time. That might be doable, but it doesn't sound easy. And once we have it, will people actually interpret the uncertainty overlay correctly? Or will they just look at the line going down and ignore the rest?
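One concrete way to put a number on the sampling part (and only the sampling part) is a binomial confidence interval on a word's relative frequency. A minimal sketch, assuming tokens can be treated as independent draws, which is a strong simplification for natural text, and using made-up counts:

```python
import math

def frequency_interval(count, total_words, z=1.96):
    """Wilson score interval for a word's relative frequency,
    treating each token as an independent draw (a strong
    simplifying assumption for natural language)."""
    p = count / total_words
    denom = 1 + z**2 / total_words
    center = (p + z**2 / (2 * total_words)) / denom
    margin = (z / denom) * math.sqrt(
        p * (1 - p) / total_words + z**2 / (4 * total_words**2)
    )
    return center - margin, center + margin

# Hypothetical: 5,000 hits for "said" in a 1,000,000-word sample.
lo, hi = frequency_interval(5_000, 1_000_000)
```

With corpus-sized counts this interval comes out tiny, which is exactly the trap: the dominant uncertainty is which documents got sampled in the first place, and that bias never shows up in a binomial interval.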
Your analysis is only as good as your data. This has been a fundamental problem for as long as we have been trying to analyze data, and it's never going to go away. We would do well to remember this as we move into the "AI age".
It also says something about us: throughout our lives, we learn from data. We observe and consider and form opinions. How good is the data that we have observed? Are our conclusions valid?
The authors assert that the ngram statistics for "said" are wrong, and imply that they have evidence to the contrary, but they don't provide the evidence. Looking at their own website, all they provide is Google Ngram statistics: https://www.etymonline.com/word/said#etymonline_v_25922.
This, coupled with the huge failing of not displaying zero on the y-axis of their graph, and even interpreting the bad graph wrong, makes me not believe them at all. A very low-quality article.
A decline to half the usage of "said" within six decades, followed by a recovery to the previous level within two decades? Show me evidence that the English language changed that fast in that way. It's an extraordinary claim, and you'd have to bring something convincing. Otherwise I believe their hypothesis and their conclusion that ngrams are bunk.
Yeah, they interpreted the "toast" graph wrong. They should be more careful when reading shitty graphs that cut off at the low point.
It depends entirely on what the data set is, and to conclude that it's "wrong" you'd have to consider the underlying data too. Google Ngrams makes no claim to be a consistent benchmark-type data set. Over time, the content it's based on shifts, which can cause effects like this.
To make any sort of claim like "this word's usage changes over time" in an academic sense, you'd need to include a discussion of the data sources you used and why those are representative of word usage over time. The fact that they'd even try to use Google Ngrams in this way shows how little they actually researched the topic.
Google ngrams is a cute data set that can sometimes show rough trends, but it's not some "authoritative source on usage over time" and it doesn't claim to be.
The authors, on the other hand, are claiming to be authoritative and thus the burden of evidence on their claims is far far far higher. I didn't even get into their completely unobjective and vague accusations of "AI" somehow doing something bad. Ngrams don't involve AI, it's simple word counting.
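For reference, "simple word counting" is barely an exaggeration. A toy sketch of the tallying an ngram corpus is built from, with the tokenization details invented for illustration:

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Tally n-grams by sliding a window over the token stream.
    No model, no inference: just counting."""
    tokens = text.lower().split()
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

counts = ngram_counts("she said yes and he said no", n=2)
# counts[("she", "said")] == 1, counts[("he", "said")] == 1
```

The hard part, and the part the article is really about, is not this loop: it's which texts get fed into it.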
It's possible (though I think unlikely) that it's partly due to changing word choice rather than the English language changing completely (which clearly didn't happen).
I.e., maybe instead of lots of books having direct text like "David said" or "Dora said", over time there was a trend toward more varied/descriptive phrasing, e.g. "David replied" or "Dora retorted"?
It’s hard to present evidence because there’s only one source. So the article basically calls out flaws in the methodology of Google Books/Ngram.
I think this is reasonable. Otherwise we end up accepting things merely because they exist, flawed as they are. Just because something exists and is easy to use doesn’t mean it’s right.
Just like “the most tweeted thing is X, therefore it is most popular and important” does not require a separate study to refute. It’s acceptable just to say “this is a stupid methodology; don’t accept it just because that’s what Twitter says.”
I think what you want is for someone (yourself, me, the author) to review newspapers or some similar source and determine how the frequency percent changes over time for the word "said".
This is a reasonable request, but I also think it's fine for the author to state, _as an expert_, that newspapers continued using "said" at a similar frequency. The story they tell is plausible, and I don't really think the burden of proof is on them.
A low-effort comment. That "said" hasn't declined and risen the way shown isn't what needs evidence.
It's the extraordinary claim that it has that does.
That claim is Google's, and before accusing the author of the blog, maybe ask how representative Google's unseen dataset is. Should we take statistics with no knowledge of their input set at face value, because "trust Google"?
Google isn't claiming any such thing. It's merely providing fun statistics based on its data set. With that context, when I read a headline claiming that the statistics are "wrong," I expect that the counts are somehow off, maybe due to a bug in the algorithm or the like.
Instead, we get a strawman where they misrepresent what the data set is, make up things that it's "claiming," fail to investigate the underlying data sources and look into why they see the trend they see, and also fail to provide any alternative data.
It's cheap and snobby grandstanding, ironically complete with faulty interpretations of the little data they DO present.
EtymOnline isn't in the business of tracking shifts in the popularity of words over time, they set out to track shifts in meaning. So it's understandable that they don't have any specific contrary evidence in their listing for "said".
As for why they don't include the evidence in TFA, as others have noted, it's the extraordinary claim that "said" dropped to nearly 1/3 of its peak usage that needs extraordinary evidence backing it up. It's plenty sufficient for them to say "this doesn't make any sense at all on its face, and is most likely due to a major shift in the genre makeup of Google's dataset".
> Ngram says toast almost vanishes from the English language by 1980, and then it pops back up.
The Ngram plot does not say that. It shows usage dropping ~40% (since 1800). It’s indeed a problem that the graph’s Y axis doesn’t go to zero, as others have pointed out. But did the Etymonline authors really not notice this before declaring incorrectly what it says? I would find that hard to believe (especially considering the subsequent “see, no dip” example that has a zero Y and a small but visible plateau around 1980), and it’s ironic considering the hyperbolic and accusatory title and opening sentence.
The graph axis isn't the only problem. The word "toast" did not drop in usage by 40%, Google's dataset shifted dramatically towards a different genre than it was composed of previously. I've been in conversations with people trying to explain those drops in the 70s, and no one (myself included) realized that it was such a dramatic flaw in the data.
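A bit of made-up arithmetic shows how a composition shift alone can produce a drop of that size, with neither genre changing its actual usage. All rates and shares below are hypothetical:

```python
# Hypothetical per-genre rates for "said" (occurrences per 100 words).
fiction_rate, academic_rate = 1.0, 0.2

def blended_rate(fiction_share):
    """Aggregate frequency when the corpus mixes the two genres."""
    return fiction_share * fiction_rate + (1 - fiction_share) * academic_rate

rate_early = blended_rate(0.7)  # corpus mostly fiction: 0.76
rate_late = blended_rate(0.3)   # corpus mostly academic: 0.44
```

That's an apparent decline of roughly 42%, driven entirely by what kinds of books were scanned in each period.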
That’s fair, the article has a very valid point, which would be made even stronger without the misreading of the plots they’re critiquing, whether it was accidental or intentional. I always thought Ngrams were weird too, I remember in the past thinking some of the dramatic shifts it shows were unlikely.
When it comes to results like this it is more “lusting for clickbait,” or the scientific equivalent thereof. (E.g. papers in Science and Nature aren’t really particularly likely to be right, but they are especially likely to be outrageous, particularly in fields like physics that aren’t their center.)
On the other hand, “Real Clear Politics” always had a toxic-sounding name to me, since there is nothing “Real” or “Clear” about politics. I think the best book about politics is Hunter S. Thompson’s Fear and Loathing on the Campaign Trail ’72, which is a druggie’s personal experience following the candidates around, picking up hitchhikers on the road at 3am, getting strung out on the train, and having moments of jarring sobriety, like the time when he understood the parliamentary maneuvering that won McGovern the nomination while more conventional journalists were at a loss.
What I do know is 20 years from now an impeccably researched book will come out that makes a strong case that what we believed about political events today was all wrong and really it was something different. In the meantime different people are going to have radically different perspectives and… that’s the way it is. Adjectives like “real” and “clear” are an attempt to shut down most of those perspectives and pretend one of those viewpoints is privileged. Makes me think of Baudrillard’s thorough shitting on the word “real” in Simulacra and Simulation, which ought to completely convince you that people peddling the fake will be heralded by the word “real”.
(Or for that matter, that Scientology calls itself the “science of certainty.”)
> 20 years from now an impeccably researched book will come out that makes a strong case that what we believed about political events today was all wrong and really it was something different
The one good thing about politics is that the motives are crystal clear: politicians want to stay in power first, and only secondarily want to improve things.
Once you know this, everything makes sense, even if we never find out what "really" happened.
> politicians want to stay in power first, and only secondarily want to improve things.
The politicians who want to be in power first, and only secondarily want to improve things, tend to be the politicians in power.
Politicians who want to improve things first do exist, but they tend not to achieve power, because power is not their goal, and they are out-maneuvered by the first type.
Notably, politicians who want to improve things are easily side-tracked by suggesting that their proposed policy is not the best way to improve things, and that some other way would be better. This explains to some degree a lot of infighting on the left, because many do want to genuinely help, but it's never 100% clear what the best way to help is. It also explains why the right can put aside major differences of opinion (2A is important to fight the government who can't be trusted, but support the troops and arm the police!) to achieve power, because acquiring and maintaining power is more important than exactly what you plan to do with it.
politicians want to stay in power first, and only secondarily want to improve things.
In all honesty, many don't even want to improve things. Most people with power love power. It's contrary to their nature to change a system that confers power on them. That's not just in your own nation: in any nation, the people in power will be resistant to change.
That’s as close as you will get to a master narrative but it isn’t all of it.
Politicians aren’t always sure what will win for them, often face a menu of unappetizing choices, and have other motivations too. (Quite a few of the better Republicans have quit in disgust in the last decade: I watched the pope speak in front of Congress flanked by Joe Biden, then VP, and John Boehner, then House Speaker, when the pope obliquely said they should start behaving like adults; Boehner quit a few days later and got into the cannabis business.)
I was an elected member of the state committee of the Green Party of New York and found myself arguing against a course of action that I emotionally agreed with, thought was a tactical mistake, and that my constituents were (it turns out fatally) divided about. It was a strategic disaster in the end.
You can never construct a representative image of the past. You are operating with a limited number of sources which have survived in one form or another. They are not evenly distributed across time and space. There is an inherent “data loss” problem when a person dies: gone are all the impressions, unwritten experiences, familiar smells. Even a living person’s memory may become unreliable at some point.
Wikipedia is not meant to be an archive of all information. It's meant to be an encyclopedia of things that are notable [1], which is probably where the confusion comes from.
As you can imagine, the question of what counts as notability has been discussed at length since Wikipedia's inception [2].
Sure, what you pay attention to will impact what you remember, but this experiment goes further and shows how your attention can be manipulated to make you blind to plotted events.
It seems to me that Google Ngram isn't wrong. It's reporting statistics on the words it correctly identified in the corpus. The problem is the context of the statistics. You may somewhat confidently say the word "said" dips in usage at such and such time in the Google Books corpus. You can more confidently say it dips at such and such time for the subset of the corpus for which OCR correctly identified every instance of the word.
But you can't make claims in a broader context like "this word dipped in usage at such and such time" without having sufficient data.
And this is why sampling methodology is vastly more important than sample size when drawing inferential population statistics.
Sample 1 million books from an academic corpus, and you'll turn up a very different linguistic corpus than selecting the ten best-selling books for each decade of the 20th century.
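To make that concrete, here's a toy simulation with invented numbers: a large sample drawn from a single genre misses the population's rate badly, while a far smaller representative sample lands close to it.

```python
import random

random.seed(0)

# Hypothetical population: 80% fiction (high "said" rate),
# 20% academic (low rate). The true mean rate is 0.84.
population = [("fiction", 1.0)] * 8000 + [("academic", 0.2)] * 2000

def mean_rate(sample):
    return sum(rate for _, rate in sample) / len(sample)

# Large but biased: every academic document, nothing else (2,000 docs).
biased = [doc for doc in population if doc[0] == "academic"]

# Small but representative: 100 documents drawn uniformly.
representative = random.sample(population, 100)
```

mean_rate(biased) stays at 0.2 no matter how many academic documents you add, while the 100-document uniform sample lands near the true 0.84.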
Electronics are like a devouring spirit, they don't produce, they eat.
In Dictionopolis they do! Any Phantom Tollbooth peeps here?
https://en.wikipedia.org/wiki/The_Phantom_Tollbooth
[1] Notability according to Wikipedia https://en.wikipedia.org/wiki/Wikipedia:Notability
[2] Oldest Wikipedia talk comments I could find on Notability https://en.m.wikipedia.org/w/index.php?title=Special:History...
One example to test for yourself: https://youtu.be/vJG698U2Mvo?si=16fwk8wG8Yyhim5t
Are you supposed to not see the gorilla? I assumed it's the trap and there's some slightly less obvious catch in there.
Until you've solved the grand unified theory, you can never be fully confident in the completeness of your data or statistical inferences.
What's wrong is misleading the public away from this understanding.