mmmmpancakes (u/mmmmpancakes)

mmmmpancakes commented on Understanding Stein's Paradox (2021) joe-antognini.github.io/m... · Posted by u/robertvc

gweinberg · 2 years ago

Right. The question is when (if ever) you would actually want to be minimizing the rms of the vector error. For most of us, the answer is "never".

I remember back in 7th or 8th grade I asked my math teacher why we want to minimize the rms error rather than the sum of the absolute values of the errors. She couldn't give me a good answer, but the book All of Statistics does answer why (and under what circumstances) that is the right thing to do.

mmmmpancakes · 2 years ago

So this is just showing a bit of your ignorance of stats.

The general notion of compound risk is not specific to MSE loss. You can formulate it for any loss function, including L1 loss which you seem to prefer.

Steins paradox and James Stein estimator is just a special case for normal random variables and MSE loss of the more general theory of compound estimation, which is trying to find an estimator which can leverage all the data to reduce overall error.

This idea, compound estimation and James-Stein, is by now out-dated. Later came the invention of empirical Bayes estimation and the more modern bayesian hierarchical modelling eventually once we had compute for that.

One thing you can recover from EB is the James-Stein estimator, as a special case, in fact, you can design much better families of estimators that are optimal with respect to Bayes risk in compound estimation settings.

This is broadly useful in pretty much any situation where you have a large scale experiment where many small samples are drawn and similar stats are computed in parallel, or when the data has a natural hierarchical structure. For examples, biostats, but also various internet data applications.

so yeah, suggest to be a bit more open to ideas you dont know anything about. @zeroonetwothree is not agreeing with you here, they're pointing out that you cooked up an irrelevant "example" and then claim the technique doesnt make sense there. Of course, it doesnt, but thats not because the idea of JS isnt broadly useful.

----

Another thing is that JS estimator can be viewed as an example of improving overall bias-variance by regularization, although the connection to regularization as most people in ML use it is maybe less obvious. If you think regularization isn't broadly applicable and very important... i've got some news for you.

mmmmpancakes commented on Understanding Stein's Paradox (2021) joe-antognini.github.io/m... · Posted by u/robertvc

credit_guy · 2 years ago

Stein's paradox is bogus. Somebody needs to say that.

Here's one wikipedia example:

  > Suppose we are to estimate three unrelated parameters, such as the US wheat yield for 1993, the number of spectators at the Wimbledon tennis tournament in 2001, and the weight of a randomly chosen candy bar from the supermarket. Suppose we have independent Gaussian measurements of each of these quantities. Stein's example now tells us that we can get a better estimate (on average) for the vector of three parameters by simultaneously using the three unrelated measurements.

Here's what's bogus about this: the "better estimate (on average)" is mathematically true ... for a certain definition of "better estimate". But whatever that definition is, it is irrelevant to the real world. If you believe you get a better estimate of the US wheat yield by estimating also the number of Wimbledon spectators and the weight of a candy bar in a shop, then you probably believe in telepathy and astrology too.

mmmmpancakes · 2 years ago

No it is not bogus, you just don't know much stats apparently.

mmmmpancakes commented on A bird's eye view of Polars pola.rs/posts/polars_bird... · Posted by u/rbanffy

hermitcrab · 2 years ago

CSV is a terrible format. But it is extensively used. See also:

Why isn’t there a decent file format for tabular data? https://news.ycombinator.com/item?id=31220841

mmmmpancakes · 2 years ago

parquet is perfectly fine

> There is Parquet. It is very efficient with it’s columnar storage and compression. But it is binary, so can’t be viewed or edited with standard tools, which is a pain.

I can open parquet in excel

mmmmpancakes commented on A bird's eye view of Polars pola.rs/posts/polars_bird... · Posted by u/rbanffy

dajt · 2 years ago

LOL.

I have been programming for over 30 years on all sorts of systems and the Pandas DataFrame API is completely beyond me. Just trying to get a cell value seems way more difficult than it should be.

Same with xarray datasets.

I just loaded the same CSV into Pandas and Polars and Polars did a much better job of it.

mmmmpancakes · 2 years ago

To me, Polars feels like almost exactly how I would want to redesign Pandas interfaces for small - medium sized data processing, given my previous experience with Pandas and PySpark. Throw out all the custom multi index nonsense, throw out numpy and handle types properly, memory map arrow, focus on method chaining interface, do standard stuff like groupby and window functions in the standard way, and implement all the query optimizations under the hood that we know make stuff way faster.

To be fair, Polars has the benefit of hindsight and designing their interfaces and syntax from scratch. The poor choices in Pandas were made long ago, and its adoption and evolution into the most popular dataframe library for python feels like mostly about timing the market than having the best software product.

mmmmpancakes commented on A bird's eye view of Polars pola.rs/posts/polars_bird... · Posted by u/rbanffy

Vaslo · 2 years ago

I’ve been moving over to use polars more often for my data work. It’s much faster than pandas at things like imputing millions of lines. It’s also a little more intuitive and you don’t waste time pissing around with indexes everytime you transform a dataframe.

Polars biggest downfall is that pandas/matplotlib are so ubiquitous in data science and polars just plays so differently than pandas including using hvplot as its default plotting package, etc. It really is trying to do much of its ecosystem exactly how it wants to maximize productivity, speed, etc. This may slow down the adoption of it, but hopefully it will push others in the better direction.

mmmmpancakes · 2 years ago

Suggest the following pattern:

1. load and process / aggregate in polars to get the smaller dataset that goes into your plot. 2. df.to_pandas() 3. apply your favourite vis library that works with pandas.

There's no use case i can think of where building a data viz interface more specific to polars than this is beneficial or necessary.

mmmmpancakes commented on A bird's eye view of Polars pola.rs/posts/polars_bird... · Posted by u/rbanffy

vundercind · 2 years ago

“Fancy array-of-associative-arrays with their own set of non-Python-native field data types and a lot of helper methods for sorting and import and export and such, but weirdly always missing any helpers that would be both hard to write and very useful”

“SQLite but only for Python and worse. Kind of.”

“That annoying transitional step you have no direct need for but that you have to do anyway, because all data-wrangling tools in Python assume a dataframe”

(I use this stuff daily)

[edit] “Imagine if you could read in from a database or CSV and then work with the returned rows directly and as if they were still a table/spreadsheet. Except by ‘directly’ I mean ‘after turning it into a dataframe and discarding/fucking-up all your data type information’. And then with another fucking-everything-up step if you want to write it out so you can do real stuff with it elsewhere.”

mmmmpancakes · 2 years ago

i mean, if you are reading and writing csv then yeah, you've already fucked up.

mmmmpancakes commented on Star neuroscientist may have manipulated data to support a major stroke trial science.org/content/artic... · Posted by u/EndXA

mmmmpancakes · 2 years ago

Scientific fraud makes me sad and angry.

mmmmpancakes commented on Algebra feynmanlectures.caltech.e... · Posted by u/luu

melagonster · 2 years ago

but government hired professors to design content of mathematics textbooks for students. there are some problem...

mmmmpancakes · 2 years ago

Disagree. The purpose of a textbook and a lecture is very different. A good textbook can be a helpful resource for teaching and lecturing, but it is not sufficient to guarantee high quality math education. Conversely, a good educator who deeply understands the material can deliver fantastic education without a good textbook. Claiming that profs writing bad textbooks is the cause of poor quality in class math instruction is absurd.

mmmmpancakes commented on Algebra feynmanlectures.caltech.e... · Posted by u/luu

maroonblazer · 2 years ago

Every time I read such lucid explanations of math like this I'm filled with resentment for the math instruction I received in junior high and high school in the U.S. There's a real sense of 'play' that these explanations evoke and that make thinking about numbers and their relationships genuinely fun and interesting. That sense of play was entirely absent from my early math education. It was all "These are the rules. Memorize them for the quiz tomorrow."

I can't help feeling like my math upbringing was akin to a child being raised by parents who speak their own made-up language. Integrating with the rest of the normal-language-speaking world is anxiety inducing and filled with challenges that may never be completely overcome.

mmmmpancakes · 2 years ago

The ability to communicate math this way is honestly rare. It comes from a combination of deep understanding, long experience in communicating math, and a certain level of "culturing" that is specific to the academic experience.

Among the best, Feynman was singular in his ability to communicate math and physics.

In other words, don't be so hard on the teachers who were disappointing in comparison to the stellar examples you see from top mathematical communicators. What you're reading is quite rare and, while education quality could certainly improve, its not fair to expect this of a 5th grade teacher who covers 5 topics in a day. Even for the best, developing this type of material takes time and thought that a school teacher probably does not have.