Lasso vs Ridge
Goldman Sachs
Explain the difference between Lasso and Ridge regression
Answer
Recall first that classical linear regression models the observed variable $y$ as $y = X \beta + \epsilon$ and estimates the coefficients $\beta$ by minimising the squared difference between observed and predicted values, $RSS = \sum_i (y_i - \hat{y}_i)^2$. The problem with linear regression, and the reason we introduce lasso and ridge, is that when there are many explanatory variables, or the variables are highly correlated, linear regression overfits the data and its estimates become unstable (high variance). In other words, linear regression gives undue importance to explanatory variables that have no real predictive power, and so misses the big-picture trend while trying to fit unimportant details.
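As a quick illustration of this instability, here is a minimal sketch (assuming NumPy and a made-up data set with two almost perfectly correlated predictors) that fits ordinary least squares to two independent noisy samples of the same underlying relationship:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)        # almost a copy of x1: highly correlated
X = np.column_stack([x1, x2])

# fit the same true relationship y = 2*x1 + noise on two independent noise draws
for seed in (1, 2):
    y = 2 * x1 + 0.5 * np.random.default_rng(seed).normal(size=n)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(np.round(beta, 2))               # coefficients typically swing wildly between fits
```

Both fits predict $y$ about equally well, yet the individual coefficient estimates change dramatically from one sample to the next; this is the high variance that ridge and lasso are designed to tame.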
Ridge and lasso regression solve this by also penalising the size of the coefficients, thus shrinking the unimportant ones. Ridge regression penalises the squares of the coefficients, while lasso penalises their absolute values. Concretely, ridge regression finds coefficients by minimising: $$RSS + \lambda \sum_i \beta_i^2$$while lasso minimises: $$RSS + \lambda \sum_i \left | \beta_i \right |$$The difference is that ridge makes unimportant coefficients small, whereas lasso makes them exactly 0. To understand why, note that each penalised problem is equivalent to minimising $RSS$ subject to a constraint on the size of the coefficients, and consider the shape of the two constraint regions: for some constant $k$ in the two-coefficient case, the regions $\sum_i \beta_i^2 \le k$ for ridge and $\sum_i \left | \beta_i \right | \le k$ for lasso are plotted below.
You can see that the lasso region will often produce optimum points at the corners of the rhombus, which lie on the axes, so one of the coefficients is shrunk to exactly 0. The circular ridge region, on the other hand, will often shrink unimportant coefficients to small values close to the axes, but not exactly 0.
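A short sketch makes the contrast concrete (the data set is made up; scikit-learn's `alpha` plays the role of $\lambda$, up to the library's own scaling conventions):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# only the first three of ten features actually drive y
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + 0.5 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge:", np.round(ridge.coef_, 3))  # unimportant coefficients: small but non-zero
print("lasso:", np.round(lasso.coef_, 3))  # unimportant coefficients: exactly 0
```

With a suitable penalty strength, lasso sets the seven irrelevant coefficients to exactly 0 (performing variable selection), while ridge merely shrinks them towards 0.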