DOC add FAQ entry for the many linear model classes #19861


Merged: 10 commits, Apr 16, 2021
doc/faq.rst: 41 additions & 0 deletions
@@ -396,3 +396,44 @@
and not at test time, for resampling and similar uses,
as in `imbalanced-learn`.
In general, these use cases can be solved
with a custom meta-estimator rather than a Pipeline.

Why are there so many different estimators for linear models?
-------------------------------------------------------------
Usually, there is one classifier and one regressor per model type, e.g.
:class:`~ensemble.GradientBoostingClassifier` and
:class:`~ensemble.GradientBoostingRegressor`. Both have similar options and
both have the parameter `loss`, which is especially useful in the regression
case as it enables the estimation of the conditional mean as well as of
conditional quantiles.
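
For example, a minimal sketch on synthetic data (values are arbitrary): the
same regressor class estimates the conditional mean by default, or a
conditional quantile via `loss`::

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.RandomState(0)
    X = rng.rand(200, 1)
    y = 3 * X.ravel() + rng.randn(200)

    # Default loss: estimates the conditional mean.
    mean_model = GradientBoostingRegressor().fit(X, y)
    # loss='quantile' with alpha=0.9: estimates the conditional 90% quantile.
    q90_model = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X, y)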

For linear models, there are many estimator classes which are very close to
each other. Let us have a look at the following (a short code sketch fitting
all of them follows the list):

- :class:`~linear_model.LinearRegression`, no penalty

Member:
Should we say here that it is there only for teaching purposes? IMHO, it should never be used, and I would love to retire this class (though I do know that not everybody agrees with this desire).

Member Author:
I think it is the wrong place to mention this. And I belong to the mentioned small group...

Member:
Suggested change
- :class:`~linear_model.LinearRegression`, no penalty
- :class:`~linear_model.LinearRegression`, no penalty (for teaching purposes, prefer Ridge with a small penalty)

Member:
This can be read as "prefer Ridge with a small penalty for teaching purposes", so maybe:

Suggested change
- :class:`~linear_model.LinearRegression`, no penalty
- :class:`~linear_model.LinearRegression`, no penalty. This estimator exists mostly for teaching purpose. In general Ridge with a small penalty is preferable.

Member Author:
Again, I'm not sure that this is the right place.

Member:
I agree with @lorentzenchr. I think the message to convey boils down to a listing of the linear models for regression. We can make it explicit beforehand that we are not stating which models to use but rather making a catalogue of the models.

Member:
Now, considering that this is only an entry in the FAQ, I think it is even more relevant not to add usage information.

- :class:`~linear_model.Ridge`, L2 penalty

Member:
Do you think that it would make sense to list RidgeCV, LassoCV, and ElasticNetCV (maybe together, after SGD), mentioning that they are penalized models where the optimum penalty is found via CV?

Member Author:
I thought about that while I was writing this PR and decided against it. I'd say that they form their own group: LinearRegression, Ridge, Lasso and ElasticNet all do one and the same thing, while RidgeCV, LassoCV and ElasticNetCV together do another.
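
A minimal sketch of the two groups, on made-up data: the plain classes take
the penalty as given, while the `*CV` classes select it via cross-validation:

    import numpy as np
    from sklearn.linear_model import Ridge, RidgeCV

    rng = np.random.RandomState(0)
    X, y = rng.rand(100, 3), rng.rand(100)

    Ridge(alpha=1.0).fit(X, y)                        # penalty fixed by the user
    reg = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)  # penalty chosen by CV
    print(reg.alpha_)                                 # the selected penalty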

Member:
Not a big deal on my side, we can keep it as it is.

- :class:`~linear_model.Lasso`, L1 penalty (sparse models)
- :class:`~linear_model.ElasticNet`, L1 + L2 penalty (less sparse models)
- :class:`~linear_model.SGDRegressor` with `loss='squared_loss'`
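
A short sketch, with synthetic data and arbitrary penalty strengths, fitting
all of these estimators on the same problem::

    import numpy as np
    from sklearn.linear_model import (ElasticNet, Lasso, LinearRegression,
                                      Ridge, SGDRegressor)

    rng = np.random.RandomState(0)
    X = rng.rand(100, 5)
    y = X @ np.array([1.0, 2.0, 0.0, 0.0, 3.0]) + 0.01 * rng.randn(100)

    models = [
        LinearRegression(),                     # no penalty
        Ridge(alpha=0.1),                       # L2 penalty
        Lasso(alpha=0.01),                      # L1 penalty, sparse coef_
        ElasticNet(alpha=0.01, l1_ratio=0.5),   # L1 + L2 penalty
        SGDRegressor(penalty="l2", alpha=0.1),  # squared loss is the default
    ]
    for model in models:
        print(type(model).__name__, model.fit(X, y).coef_.round(2))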

**Maintainer perspective:**
In principle, they all do the same thing and differ only in the penalty they
impose. This, however, has a large impact on the way the underlying
optimization problem is solved. In the end, it amounts to using different
methods and tricks from linear algebra. A special case is `SGDRegressor`,
which comprises all 4 previous models and differs in its optimization
procedure. A further side effect is that the different estimators favor
different data layouts (`X` C-contiguous or F-contiguous, sparse CSR or CSC).
This complexity of the seemingly simple linear models is the reason for
having different estimator classes for different penalties.
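
As an illustrative sketch of the layout point (not a performance
recommendation): the coordinate-descent solvers access `X` column by column,
so a C-contiguous input is internally copied to Fortran order, while the fit
itself is unaffected::

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    X_c = rng.rand(1000, 20)        # C-contiguous (the NumPy default)
    X_f = np.asfortranarray(X_c)    # the same data, F-contiguous
    y = X_c @ rng.rand(20)

    # Identical results; only the internal data handling differs.
    assert np.allclose(Lasso(alpha=0.1).fit(X_c, y).coef_,
                       Lasso(alpha=0.1).fit(X_f, y).coef_)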

**User perspective:**
First, the current design is inspired by the scientific literature, where
linear regression models with different regularization/penalty terms were
given different names, e.g. *ridge regression*. Having different model
classes with corresponding names makes it easier for users to find those
regression models.
Secondly, if all 5 of the above-mentioned linear models were unified into a
single class, there would be parameters with a lot of options, like the
``solver`` parameter. On top of that, there would be a lot of mutually
exclusive interactions between different parameters. For example, the
possible options of the parameters ``solver``, ``precompute`` and
``selection`` would depend on the chosen values of the penalty parameters
``alpha`` and ``l1_ratio``.
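
To illustrate the kind of interaction meant here, a small sketch using the
real parameters of the current classes (the unified class is hypothetical)::

    from sklearn.linear_model import Lasso, Ridge

    # Each penalty brings its own solver machinery and options:
    Ridge(alpha=1.0, solver="cholesky")   # `solver` is specific to Ridge
    Lasso(alpha=1.0, precompute=True,     # `precompute` and `selection` belong
          selection="random")             # to the coordinate-descent solver
    # In one merged class, the valid values of `solver`, `precompute` and
    # `selection` would all depend on the chosen `alpha` and `l1_ratio`.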