# Lasso, Ridge, and Robust Regression

## 1. Regression

Regression is a predictive modeling technique in machine learning that predicts continuous outcomes by investigating the relationship between the independent (input) variables and a dependent (output) variable.

Figure 1. Regression

Linear regression finds the line (or hyperplane) that best describes the linear relationship between the input variables (X) and the target variable (y). Robust, Lasso, and Ridge regression are part of the Linear Regression family, where the inputs and the output are assumed to have a linear relationship.

#### Linear Regression is a good choice in the below scenarios.

• Linear Regression performs well when the relationship between the input and output variables is approximately linear. It is used to find the nature of that relationship.
• Linear Regression is easy to implement and interpret, and very efficient to train.
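
As a concrete sketch, a one-feature least-squares fit can be written in a few lines of plain Python (the data below is made up for illustration):

```python
# Fit y = a*x + b by ordinary least squares for a single input feature.

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx          # intercept from the means
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]    # exactly y = 2x + 1
a, b = fit_line(xs, ys)
print(a, b)                  # 2.0, 1.0
```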

#### Problems with Linear Regression.

• Linear Regression is limited to datasets having Linear Relationships.
• Linear Regression is Sensitive to Outliers.
• Linear Regression is sensitive to noise and prone to overfitting.
• Linear Regression assumes that the input variables are independent of each other, hence any multicollinearity must be removed.

##### Let’s look at the first problem of Linear Regression: Outliers.

Real-world data often contains outliers, which can bias the model fit. Robust regression aims to overcome this problem.

## 2. Robust Regression

Robust regression is an alternative approach to ordinary linear regression when the data contains outliers. It provides much better regression coefficient estimates when outliers are present in the data.

Let’s recall the loss function of Linear Regression, Mean Squared Error (MSE), i.e., the squared Norm 2. It increases sharply with the size of the residual.
Residual: the difference between the actual value and the predicted value.

$$J(θ) = \frac{1}{n} \displaystyle\sum_{i=1}^n(y_i - ŷ_i)^2$$

Figure 2. Norm 2


The problem with MSE is that it is highly sensitive to large errors (i.e., outliers) compared to small ones. So, the alternative is to use the sum of the absolute values of the residuals as the loss function instead of squaring them, i.e., Norm 1. This achieves robustness. However, it is hard to work with in practice because the absolute value function is not differentiable at 0.

$$J(θ) = \frac{1}{n}\displaystyle\sum_{i=1}^n|y_i - ŷ_i|$$

Figure 3. Norm 1
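
To see this sensitivity numerically, here is a minimal pure-Python sketch; the residual values are hypothetical:

```python
# Compare how MSE (Norm 2) and MAE (Norm 1) react to a single outlier.

def mse(residuals):
    """Mean of squared residuals (Norm 2 loss)."""
    return sum(r ** 2 for r in residuals) / len(residuals)

def mae(residuals):
    """Mean of absolute residuals (Norm 1 loss)."""
    return sum(abs(r) for r in residuals) / len(residuals)

clean = [1.0, -2.0, 1.5, -0.5]
with_outlier = clean + [20.0]     # one large residual (an outlier)

print(mse(clean), mse(with_outlier))  # MSE explodes: 1.875 -> 81.5
print(mae(clean), mae(with_outlier))  # MAE grows gently: 1.25 -> 5.0
```

A single outlier multiplies the MSE by more than 40x, while the MAE only quadruples — which is exactly why the squared loss lets outliers dominate the fit.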

Huber loss solves this problem. It preserves differentiability like Norm 2 and is insensitive to outliers like Norm 1. Huber loss is a combination of the L2 and L1 functions.
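
The standard Huber loss, for a residual $r = y - ŷ$ and a threshold hyperparameter $δ$, is defined piecewise:

$$L_δ(r) = \begin{cases} \frac{1}{2}r^2, & |r| \le δ \\ δ\left(|r| - \frac{1}{2}δ\right), & |r| > δ \end{cases}$$

It is quadratic (like Norm 2) for small residuals and linear (like Norm 1) for large ones, and the two pieces match in both value and slope at $|r| = δ$, so the loss stays differentiable everywhere.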

##### L2/Ridge regularization is used when,

• The number of predictor variables in a given set exceeds the number of observations.
• The data set has multicollinearity (correlations between predictor variables).

##### Using Ridge regularization,

• It penalizes the magnitude of the coefficients of features.
• It minimizes the error between the actual and predicted observations.

It constrains the weights to lie within the Norm 2 ball, as shown below.

Figure 12. Ridge Regression Shape

Let’s try to minimize the loss function of Ridge Regression. The amount of shrinkage is controlled by λ, which multiplies the ridge penalty. The effect of λ on the ridge coefficients is:

• When λ = 0, the objective becomes similar to simple linear regression. So we get the same coefficients as simple linear regression.
• As λ increases, the coefficients shrink closer and closer to zero because of the growing weight on the square of the coefficients. In theory, λ = infinity shrinks all coefficients to zero.
• If 0 < λ < ∞, the magnitude of λ decides the weightage given to the different parts of the objective.
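
These effects can be checked numerically with the one-feature, no-intercept closed form $θ = \sum x_iy_i / (\sum x_i^2 + λ)$ (this mirrors the single-point derivation below; the toy data here is made up):

```python
# One-feature ridge regression without an intercept:
#   theta = sum(x*y) / (sum(x^2) + lam)

def ridge_coef(xs, ys, lam):
    """Closed-form ridge coefficient for a single feature, no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]    # exactly y = 2x

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, ridge_coef(xs, ys, lam))
# lam = 0 recovers ordinary least squares (theta = 2.0);
# larger lam shrinks theta toward, but never exactly to, zero.
```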

Let’s try to solve this mathematically. For a single data point, consider

\begin{aligned} h_θ(x) \space & = \space xθ\\ L(θ) \space & = \space (y-xθ)^2 + λθ^2 \\ & \space = \space y^2 - 2xyθ + x^2θ^2 + λθ^2 \\ \end{aligned}

Apply the first-order derivative w.r.t. θ and set it to zero to find the minimum,

\begin{aligned} \frac{∂L}{∂θ} \space & = \space - 2xy + 2x^2θ + 2λθ \space = \space 0\\ θ^* \space & = \space \frac{xy}{x^2+λ} \\ \end{aligned}

The optimal θ, i.e. $θ^*$, becomes 0 only when λ = ∞, so it is clear that in ridge regression, coefficients shrink toward zero but never become exactly zero for any finite λ.
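
A quick numerical sanity check of this closed form (the data point values are arbitrary):

```python
# Verify that theta* = x*y / (x^2 + lam) minimizes
#   L(theta) = (y - x*theta)^2 + lam*theta^2
# for a single data point.

def loss(theta, x, y, lam):
    return (y - x * theta) ** 2 + lam * theta ** 2

x, y, lam = 3.0, 6.0, 2.0
theta_star = x * y / (x * x + lam)   # closed-form minimizer from above

# The loss at theta* is no larger than at nearby points,
# since L is a convex quadratic in theta.
eps = 1e-3
assert loss(theta_star, x, y, lam) <= loss(theta_star + eps, x, y, lam)
assert loss(theta_star, x, y, lam) <= loss(theta_star - eps, x, y, lam)
print(theta_star)    # 18/11, nonzero for finite lam
```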

L2 and its derivative look like below.

Figure 13. L2 Derivative

##### Unequal importance given to different features during shrinkage

Let’s see how Ridge Regression works. Consider the intercept $θ_0$ and the weights/coefficients corresponding to 5 input features, as below.
This is just an example.

$θ = [θ_0, θ_1 , θ_2 , θ_3 , θ_4 , θ_5] \\ θ = [5, 10, 8, 6, 4, 2]$

\begin{aligned} Ridge \space / \space L2 \space penalty & = \space θ_0^2+θ_1^2+θ_2^2+θ_3^2+θ_4^2+θ_5^2 \\ & =\space 25 + 100 + 64 + 36 + 16 + 4 = 245 \\ \end{aligned}

If we shrink each parameter by 1, one at a time, the penalty looks like the below.
\begin{aligned} θ_0 & \space => \space 16 + 100 + 64 + 36 + 16 + 4 = 236 \\ Age \space of \space the \space house & \space => \space 25 + 81 + 64 + 36 + 16 + 4 = 226 \\ Sq.ft & \space => \space 25 + 100 + 49 + 36 + 16 + 4 = 230\\ No. \space of \space Rooms & \space => \space 25 + 100 + 64 + 25 + 16 + 4 = 234\\ Neighborhood & \space => \space 25 + 100 + 64 + 36 + 9 + 4 = 238\\ Avg \space Temp & \space => \space 25 + 100 + 64 + 36 + 16 + 1 = 242\\ \end{aligned}

• By shrinking $θ_0$ from 5 to 4, the ridge penalty is reduced by 9, from 245 to 236.
• By shrinking the Age of the house parameter from 10 to 9, the ridge penalty is reduced by 19, from 245 to 226.
• By shrinking the Sq.ft parameter from 8 to 7, the ridge penalty is reduced by 15, from 245 to 230.
• By shrinking the No. of Rooms parameter from 6 to 5, the ridge penalty is reduced by 11, from 245 to 234.
• By shrinking the Neighborhood parameter from 4 to 3, the ridge penalty is reduced by 7, from 245 to 238.
• By shrinking the Avg Temp parameter from 2 to 1, the ridge penalty is reduced by 3, from 245 to 242.

The reduction in penalty is not the same for all features. Since the ridge penalty squares each model parameter, large values count much more heavily than small ones. This means that our ridge regression model will prioritize shrinking large parameters over small ones. This is usually a good thing: if parameters are already small, they don’t need to be reduced even further.
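
Recomputing the reductions directly makes the pattern explicit: shrinking a weight $w$ by 1 reduces the squared penalty by $w^2 - (w-1)^2 = 2w - 1$, so bigger weights give bigger reductions.

```python
# Reproduce the ridge-penalty reductions from the worked example above.

weights = [5, 10, 8, 6, 4, 2]       # theta_0 .. theta_5 from the example

def l2_penalty(ws):
    """Sum of squared weights (the ridge penalty, lambda omitted)."""
    return sum(w * w for w in ws)

base = l2_penalty(weights)           # 245
for i, w in enumerate(weights):
    shrunk = weights.copy()
    shrunk[i] = w - 1                # shrink one parameter by 1
    reduction = base - l2_penalty(shrunk)
    print(i, reduction)              # equals 2*w - 1: 9, 19, 15, 11, 7, 3
```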

To summarize, in this post we have learned the definition of regularization, the different types of regularization, and the differences between Lasso, Ridge, and Robust regression, with examples.

Yayyy, wasn’t it great learning 😎 See you in the next post 👋 👋