Dimension Reduction: SIR vs. PCA

By: Tanmay Kenjale


Background

Principal Component Analysis (PCA) and Sliced Inverse Regression (SIR) are both dimension reduction techniques: each seeks to reduce the number of predictor variables in a dataset while preserving the information the data contains. Dimension reduction can reduce computation time and improve model performance.

PCA and SIR conduct dimension reduction in similar yet different ways. Both are feature extraction methods rather than feature elimination methods: they do not discard any of the original predictor variables. Instead, they create new predictors by taking linear combinations of the original ones. The major difference between PCA and SIR is that PCA is an unsupervised technique and SIR is a supervised technique. PCA considers only the predictor variables when conducting dimension reduction, while SIR also takes the response variable into account.

This notebook uses simulated data to highlight the advantages of Sliced Inverse Regression (SIR) over Principal Component Analysis (PCA) in a regression problem. We will also show that SIR can fail in certain situations. Feel free to interact with the visualizations!

Import Libraries and Define Plotting Functions
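
The original import cell is not preserved in this export, so below is a minimal sketch of the setup the rest of the notebook assumes. The seed, the helper name `scatter_2d`, and the specific libraries are illustrative assumptions, not the author's original code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # fixed seed so the simulations are reproducible

def scatter_2d(x, y, xlabel, ylabel):
    """Labeled scatter plot helper used throughout the notebook."""
    plt.scatter(x, y, s=10, alpha=0.6)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show()
```
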
Part 1: Performing PCA and SIR on Linear Data

We will first conduct PCA and SIR on a simulated dataset that contains 10 randomly generated predictor variables $X_{1...10}$ and 1 response variable $Y$.
The response variable was generated by the following formula: $Y = X_{1} + X_{2} + \epsilon$, where $\epsilon$ is an error term.
Out of the 10 predictors, only 2 combine to form a linear relationship with $Y$. We will attempt to use PCA and SIR to find a single new predictor that captures this linear relationship.
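
A sketch of this simulation is below; the sample size and noise scale are illustrative assumptions, not values from the original notebook.

```python
n = 500                              # illustrative sample size
X = rng.normal(size=(n, 10))         # 10 independent standard-normal predictors
eps = rng.normal(scale=0.5, size=n)  # error term
y = X[:, 0] + X[:, 1] + eps          # Y = X1 + X2 + epsilon
```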

Plot $X_1$ vs. $X_2$
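
Using the plotting helper sketched earlier:

```python
scatter_2d(X[:, 0], X[:, 1], "$X_1$", "$X_2$")
```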

We can see that there is no apparent relationship between $X_1$ and $X_2$. This is an indicator that PCA will not perform well here, because PCA relies entirely on the relationships among the predictor variables.

Plot $X_1$ vs. $X_2$ vs. $Y$
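
One way to render this with matplotlib's built-in 3D axes (the original interactive figures may have used a different plotting library):

```python
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], y, s=10, alpha=0.6)
ax.set_xlabel("$X_1$")
ax.set_ylabel("$X_2$")
ax.set_zlabel("$Y$")
plt.show()
```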

The relationship between the two predictor variables and the response variable is linear in nature. SIR should be able to capture this relationship with just one variable.

Principal Component Analysis (PCA)

We will test whether PCA can produce a single new predictor that captures the linear relationship with $Y$.
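
A minimal PCA fit with scikit-learn, keeping only the first component:

```python
pca = PCA(n_components=1)
pc1 = pca.fit_transform(X).ravel()  # scores on the first principal component
```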

Plot the First Principal Component Against $Y$
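
Plotting the component scores against the response:

```python
scatter_2d(pc1, y, "$PC_1$", "$Y$")
```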

As we can see, the first Principal Component fails to capture the linear relationship with $Y$: $PC_1$ shows essentially no relationship with $Y$.

Sliced Inverse Regression (SIR)

We will test whether SIR can produce a single new predictor that captures the linear relationship with $Y$.

SIR sorts the data by $Y$ in ascending order. It then slices the data into multiple partitions, each of which contains data points corresponding to a range of $Y$.
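
A sketch of this sorting-and-slicing step; the number of slices is an illustrative choice:

```python
n_slices = 10                             # illustrative; results are not very sensitive to this
order = np.argsort(y)                     # sort observations by Y, ascending
slices = np.array_split(order, n_slices)  # roughly equal-sized groups of row indices
```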

Plot of the Data Sliced by $Y$

Average the $X$ Values Within Each Slice
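
Each slice is collapsed into a single 10-dimensional mean vector:

```python
slice_means = np.array([X[idx].mean(axis=0) for idx in slices])
slice_means.shape  # (n_slices, 10)
```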

Plot the Average Values of $X_1$ and $X_2$
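
Plotting the first two coordinates of the slice means:

```python
scatter_2d(slice_means[:, 0], slice_means[:, 1], "mean of $X_1$", "mean of $X_2$")
```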

Perform SIR

SIR performs a Principal Component Analysis on the slice means of the standardized predictors; the leading components form the new predictor variables.
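
A from-scratch sketch of this step, following Li's (1991) formulation: whiten the predictors, eigendecompose the weighted covariance of the slice means, and take the leading eigenvector as the first SIR direction. The original notebook may instead have used a library implementation such as the `sliced` package.

```python
# Whiten X so its covariance is the identity matrix.
L = np.linalg.cholesky(np.cov(X, rowvar=False))
Z = (X - X.mean(axis=0)) @ np.linalg.inv(L).T

# Weighted covariance of the slice means of the whitened data.
M = np.zeros((X.shape[1], X.shape[1]))
for idx in slices:
    m = Z[idx].mean(axis=0)
    M += (len(idx) / len(y)) * np.outer(m, m)

# The PCA step: the top eigenvector of M gives the first SIR direction.
eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
sir1 = Z @ eigvecs[:, -1]             # scores on the first SIR component
```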

Plot the First SIR Component Against $Y$
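
Plotting the first SIR component against the response:

```python
scatter_2d(sir1, y, "$SIR_1$", "$Y$")
```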

SIR is able to capture the linear relationship with $Y$ with just one variable. We have effectively reduced the dimensionality of $X$ while maintaining regression performance.


Part 2: The Limitations of SIR

SIR depends on the slice means to perform dimension reduction. What happens when most of the slice means are zero or close to zero? In that scenario, SIR cannot recover the true direction. This occurs when the relationship between the predictors and $Y$ is symmetric: observations with opposite predictor values produce similar values of $Y$, so they land in the same slice and cancel each other out in the slice mean.

Symmetrical Data

We will now conduct PCA and SIR on a simulated dataset that contains 10 randomly generated predictor variables $X_{1...10}$ and 1 response variable $Y$.
The response variable was generated by the following formula: $Y = (X_{1} + X_{2})^2 + \epsilon$, where $\epsilon$ is an error term.
Out of the 10 predictors, only 2 combine to form a quadratic relationship with $Y$. We will attempt to use PCA and SIR to find a single new predictor that captures this quadratic relationship.
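
The simulation sketch is the same as before but with a squared link; it overwrites the Part 1 variables so the slicing, averaging, PCA, and SIR cells above can simply be re-run on the new data.

```python
X = rng.normal(size=(n, 10))        # fresh 10-dimensional predictors
eps = rng.normal(scale=0.5, size=n)
y = (X[:, 0] + X[:, 1]) ** 2 + eps  # Y = (X1 + X2)^2 + epsilon
```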

Plot $X_1$ vs. $X_2$

As in the first example, our first two predictors are uncorrelated.

Plot $X_1$ vs. $X_2$ vs. $Y$

The relationship between the two predictor variables and the response variable is quadratic in nature. This is a symmetric relationship.

Principal Component Analysis (PCA)

Plot the First Principal Component Against $Y$

As expected, the first Principal Component fails to capture the quadratic relationship with $Y$: $PC_1$ shows essentially no relationship with $Y$.

Sliced Inverse Regression (SIR)

Plot of the Data Sliced by $Y$

Average the $X$ Values Within Each Slice

Plot the Average Values of $X_1$ and $X_2$

These averages are very close to zero and therefore do not provide much information about the relationship with $Y$.

Perform SIR

Plot the First SIR Component Against $Y$

The first SIR component does not capture the quadratic relationship with $Y$: SIR will not work with highly symmetric data. Consider other options such as Sliced Average Variance Estimation (SAVE) or Kernel Sliced Inverse Regression (KSIR), which use information beyond the slice means of the original predictors.
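
As a pointer, here is a minimal from-scratch sketch of SAVE under the same assumptions as the SIR code above. It replaces the slice means with slice covariances, which do not cancel under symmetry; the function name and defaults are illustrative, not a reference implementation.

```python
def save_first_component(X, y, n_slices=10):
    """Scores on the first SAVE direction (Cook & Weisberg, 1991), sketched."""
    n, p = X.shape
    L = np.linalg.cholesky(np.cov(X, rowvar=False))
    Z = (X - X.mean(axis=0)) @ np.linalg.inv(L).T  # whitened predictors
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        V = np.eye(p) - np.cov(Z[idx], rowvar=False)
        M += (len(idx) / n) * V @ V                # slice covariances, not means
    _, eigvecs = np.linalg.eigh(M)
    return Z @ eigvecs[:, -1]                      # scores on the leading direction

save1 = save_first_component(X, y)
scatter_2d(save1, y, "$SAVE_1$", "$Y$")
```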