The only reason you need: PCA is not a feature selection algorithm
Is the author misunderstanding something very basic or are they deliberately writing this way for clicks and attention? I can see that they have great credentials so probably the latter? It's a weird article.
It's very clear if you read the article that what the author is calling "feature selection" might be better termed "feature generation". He explicitly calls out what he means in the post:
> When used for feature selection, data scientists typically regard z^p := (z_1, …, z_p) as a feature vector that contains fewer and richer representations than the original input x for predicting a target y.
I don't even think this is necessarily incorrect terminology, especially given the author's background of working primarily for Google and the like. It's the difference between considering feature selection as "choosing from a list of the provided features" vs "choosing from the set of all possible features". The author's term makes perfect sense given the latter.
PCA is used for this all the time in the field. There have been an astounding number of presentations I've seen where people start with PCA/SVD as the first round of feature transformation. I always ask "why are you doing that?" and the answer is always mumbling with shoulder shrugging.
This is a solid post and I find it odd that you try to dismiss it as either ignorant or click bait, when a quick skim of it dismisses both of these options.
The eigenvalues do give information as to energy, so if your selection criterion is simply to pick the linear component with the most energy, you can use PCA to select that feature and extract it with the corresponding eigenvector. The MUSIC algorithm is a classic example.
Edit: having now actually read the article, the case I mention falls under the author's test of "do linear combinations of my features make sense, and have as much of a relationship to the target as the features themselves?"
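A minimal numpy sketch of that energy-based selection (not MUSIC itself, just the pick-the-largest-eigenvalue step), on synthetic data with made-up names:

```python
# Pick the linear component with the most "energy" (largest eigenvalue of the
# covariance matrix) and extract it with the corresponding eigenvector.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))            # 500 samples, 6 features
X[:, 0] *= 5.0                           # give one direction most of the energy

Xc = X - X.mean(axis=0)                  # center before computing the covariance
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending eigenvalues

top = eigvecs[:, np.argmax(eigvals)]     # eigenvector with the largest eigenvalue
energy_feature = Xc @ top                # the extracted max-energy component
print(eigvals.round(2), energy_feature.shape)
```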
so what's a good way/algorithm/strategy/method/technique to select features? (obviously asking for a friend who's not that familiar with this, whereas I've all the black belts in feature engineering!)
So you have an example? I've never seen this and would like to see what people are doing here. I teach a lot of newbies data mining, so I'm very interested in how people get it wrong
Practically speaking I have never had a reason to do this with PCA, but I find this reaction very weird. I have encountered the concept of using PCA to reduce the dimensionality of your data before training many times, including in university. Cursory searches show plenty of discussion, questions on Cross Validated, etc.
PCA is a form of dimensionality reduction. It can be used to reduce your data to only a few features while preserving some properties, like the predictive quality of your model. Simple example: you give PCA 10 input features, it gives back the principal components, you pick the first 2 principal components and use them in your regression model (you still choose how many components to use yourself; PCA doesn't do that for you).
Feature selection is well... selecting features. You start with your original 10 features and you pick which ones to use. You can do this based on domain knowledge, based on analysing variance, based on correlation between features, based on feature importance in some model, by adding/removing/shuffling features and seeing how your model performs etc.
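A minimal scikit-learn sketch of the 10-features-to-2-components workflow described above, on synthetic data (you still choose n_components yourself):

```python
# 10 input features -> PCA -> keep the first 2 principal components -> regression.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.1, size=200)

model = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
model.fit(X, y)
print(model.score(X, y))   # R^2 using only 2 components; may be low, since PCA ignored y
```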
Now that we've all said "feature selection is not dimensionality reduction" to our hearts' content, could we return to the point of the article?
Regardless of whether you're doing feature selection or dimensionality reduction, the point remains that, if you're doing supervised learning, PCA is just compressing your X space, without any regard to your y. It could be that the last principal component of X, containing only 0.1% of the variance, contains 100% of the correlation between X and y.
Using PCA for dimensionality reduction in a supervised learning context means throwing out an unknown amount of signal, which could be up to 100% of the signal.
Now for unsupervised, exploratory analysis, PCA is definitely a candidate, but there are plenty of often-better alternatives there too.
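A toy sketch of that failure mode, assuming scikit-learn and a synthetic X where the lowest-variance direction carries all of the relationship with y:

```python
# Construct X so that the least-variance direction is the only one related to y,
# then watch a "keep 99.99% of the variance" PCA throw it away.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
noise = rng.normal(scale=10.0, size=(n, 4))   # high variance, unrelated to y
signal = rng.normal(scale=0.1, size=(n, 1))   # low variance, drives y
X = np.hstack([noise, signal])
y = signal[:, 0]

Z = PCA(n_components=4).fit_transform(X)      # retains >99.99% of the variance

print(LinearRegression().fit(Z, y).score(Z, y))   # ~0: the signal was discarded
print(LinearRegression().fit(X, y).score(X, y))   # ~1: it was there in the raw features
```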
You've never had to do any kind of factor analysis in your work, or done any searching for latent variables that map to a customer/stakeholder question? Given the number of people I've worked with who are interested in modeling "engagement", I find this hard to believe.
PCA is an incredibly valuable tool that I've used in most jobs I've had. It's just a terrible idea as a default part of a feature engineering pipeline (which is what the author is talking about in terms of "feature selection"), for reasons outlined in this article.
I suggest you not be quite so quick to dismiss important concepts in this area, and before criticizing this post, at least read through it (I noticed your comment misunderstanding what the author means by "feature selection" is the top comment here).
PCA can be construed as a lossy compression of the data matrix. In fact, the Eckart-Young theorem shows that this is an optimal compression (optimal w.r.t. a given rank, i.e. the space needed to hold the values). In the language of the OP, it achieves the minimal energy loss for a given space constraint.
The key word is "lossy". It may well be that the discarded part carried the signal needed for classification further down the pipeline. Or maybe not. It depends on the case.
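A small numpy sketch of that compression view, on synthetic data; the Eckart-Young optimality shows up as the reconstruction error equaling the energy in the dropped singular values:

```python
# Rank-k truncated SVD as the best rank-k approximation in Frobenius norm.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]          # best rank-k approximation of A

err = np.linalg.norm(A - A_k, "fro")
print(err, np.sqrt(np.sum(s[k:] ** 2)))       # equal: the error is the dropped "energy"
```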
There was a recent discussion on PCA for classification that I had walked into; however, everyone had left the building by the time I joined. Since I run into this conceptual misunderstanding of PCA's relevance to classification often, let me repeat what I said there.
The problem with using PCA as a preprocessing step for linear classification is that this dimensionality reduction step is being done without paying any heed to the end goal -- better linear separation of the classes. One can get lucky and get a low-d projection that separates well but that is pure luck.
Let me describe an example. Picture the '+' points of one class strung out along the X axis at one height, and the '-' points of the other class strung out along the X axis at another height. In this example the PCA direction will be along the X axis, which would be the worst axis to project onto to separate the classes. The best in this case would have been the Y axis.
A far better approach would be to use a dimensionality reduction technique that is aware of the end goal. One such example is Fisher discriminant analysis and its kernelized variant.
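A rough scikit-learn sketch of the picture described above; the data layout (classes spread along x, separated only along y) is made up to match the description:

```python
# PCA picks the max-variance direction (x); Fisher/LDA picks the separating one (y).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 500
x_spread = rng.normal(scale=10.0, size=2 * n)          # wide spread along x for both classes
y_sep = np.r_[np.full(n, 1.0), np.full(n, -1.0)]       # classes separated along y
X = np.c_[x_spread, y_sep + rng.normal(scale=0.2, size=2 * n)]
labels = np.r_[np.zeros(n), np.ones(n)]

print(PCA(n_components=1).fit(X).components_[0])       # points along x: the worst axis here
lda = LinearDiscriminantAnalysis().fit(X, labels)
print(lda.coef_.ravel())                               # dominated by y: the separating axis
```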
> 1) Does linearly combining my features make any sense?
> 2) Can I think of an explanation for why the linearly combined features could have as simple a relationship to the target as the original features?
The article provides a negative example where PCA does not fit, but doesn't provide an example where it does, or what PCA is actually used for. I come away from this article thinking PCA is useless.
What would be an example where 2) is true?
I cannot answer 2) without already having experience of what explanations there could possibly be. (2) is almost begging the question, at least pedagogically: PCA is good when the features are good for PCA.)
When does linearly combining my features "make sense"? Again, an example is not provided.
I mean, that's the point. What kind of real-world complexities are actually linearly explainable? Almost none. It goes all the way back to why classical statistics failed to provide real-world value after a certain point, and why the trend has been toward non-linear black boxes for the last decade.
The article is silly because PCA cannot select features. It is all about dimensionality reduction. You should think of PCA as the equivalent of VAEs from the neural network world. The idea would be something like this: you have big images (let's say 4K) and they are too expensive to train with/store forever. So you collect a training set, train a PCA on these images, and then you can convert your 4K images to 720p, or even to 10 numbers, which you then use to predict/train whatever you want. Of course, we have algorithms that scale images, but maybe all your images are of cats and there is a specific linear transformation that retains more information from the 4K image than simply scaling does. The implicit thing here is that you are still collecting 4K images but just immediately compress them down using your trained PCA transformation.
So, although you have fewer numbers than before, you still need to collect the original data. A real feature selection process would be able to do something like: "the proximity of the closest Applebee's is not important for predicting house prices, you should probably stop wasting your time calculating this number". As others have mentioned, L1-regularized regression or some statistical procedure to identify useless features is typically how this is done. I would also add that domain knowledge is probably your #1 feature selection tool, because we have to restrict the variables we input in the first place, and which data we prioritize is inherently selecting the features.
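A small sketch of that L1 approach, assuming scikit-learn; the Applebee's-distance feature and the housing numbers are made up for illustration:

```python
# Fit a Lasso and see which coefficients it drives to (or near) zero.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
sqft = rng.normal(2000, 400, n)
beds = rng.integers(1, 6, n).astype(float)
applebees_km = rng.normal(5, 2, n)                 # irrelevant to price by construction
price = 150 * sqft + 10_000 * beds + rng.normal(0, 20_000, n)

X = StandardScaler().fit_transform(np.c_[sqft, beds, applebees_km])
lasso = LassoCV(cv=5).fit(X, price)
for name, coef in zip(["sqft", "beds", "applebees_km"], lasso.coef_):
    print(name, round(coef, 1))                    # applebees_km ends up at or near 0
```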
Dimensionality reduction is compressing data in a way that retains the most important information for the task.
Feature selection is removing unimportant information (keeping/collecting, or selecting, only the important parts)
Both cut down on the amount of data you end up with, but one does it by finding a representation that is smaller, the other does it by discarding unnecessary data (or, rather, telling you which data is necessary, so you can stop collecting the unnecessary data).
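A minimal sketch of that distinction with scikit-learn on synthetic data: a selector reports which original columns to keep (so the rest need not be collected), while PCA still needs every column at transform time:

```python
# Feature selection returns column indices; PCA returns loadings over all columns.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = 3 * X[:, 2] - 2 * X[:, 7] + rng.normal(scale=0.5, size=300)

selector = SelectKBest(f_regression, k=2).fit(X, y)
print(selector.get_support(indices=True))   # e.g. [2 7]: stop collecting the other 8 columns

pca = PCA(n_components=2).fit(X)
print(pca.components_.shape)                # (2, 10): every original column is still required
```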
What would you recommend for feature selection in, say, single-cell RNA-seq studies? A typical dataset is ~10,000 x ~30,000 (cells x genes), with >90% of the table filled with 0s (which could be due to biological or technical noise).
PCA and UMAP are, yes, dimensionality reduction methods, but they are often seen as tools for feature selection. See slide 61 here: https://physiology.med.cornell.edu/faculty/skrabanek/lab/ang...
For anyone that isn't aware, the role of PCA is to create new (synthetic) features that represent the original features.
It does not tell you which features of the original set are good for feature selection purposes.
Yeah. Here's a few
https://scribe.citizen4.eu/feature-selection-how-to-throw-aw...
If you want to do dimension reduction and only have X, use PCA.
If you want to do dimension reduction and also have a "y", consider PLS or CCA (canonical correlation analysis).
Some background: https://www.cs.cmu.edu/~tom/10701_sp11/slides/CCA_tutorial.p...
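A minimal usage sketch of those supervised alternatives, assuming scikit-learn's cross_decomposition module and synthetic data:

```python
# PLS and CCA pick directions in X for their covariance/correlation with y,
# rather than for X's variance alone.
import numpy as np
from sklearn.cross_decomposition import CCA, PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X[:, 0] - 2 * X[:, 5] + rng.normal(scale=0.3, size=500)

pls = PLSRegression(n_components=2).fit(X, y)
Z = pls.transform(X)                            # scores chosen with y in mind
print(np.corrcoef(Z[:, 0], y)[0, 1])            # first PLS score correlates strongly with y

cca = CCA(n_components=1).fit(X, y.reshape(-1, 1))
Xc, yc = cca.transform(X, y.reshape(-1, 1))
print(np.corrcoef(Xc[:, 0], yc[:, 0])[0, 1])    # canonical correlation, close to 1 here
```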
Also, "after a certain point" is doing a lot of heavy lifting in that sentance.
If your model fits well (and doesn't overfit) after PCA, then go for it. If not, revisit.
PCA has its place, and as the other commenter said, sure, it's not a feature selection algorithm. Or you can just feature select manually.