DOC add FAQ entry for the many linear model classes #19861


Merged: 10 commits, Apr 16, 2021
doc/faq.rst: 41 additions & 0 deletions
@@ -396,3 +396,44 @@
and not at test time, for resampling and similar uses,
as in `imbalanced-learn`.
In general, these use cases can be solved
with a custom meta-estimator rather than a Pipeline.

Why are there so many different estimators for linear models?
-------------------------------------------------------------
Usually, there is one classifier and one regressor per model type, e.g.
:class:`~ensemble.GradientBoostingClassifier` and
:class:`~ensemble.GradientBoostingRegressor`. Both have similar options and
both have the parameter `loss`, which is especially useful in the regression
case as it enables the estimation of the conditional mean as well as of
conditional quantiles.
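
For example, a minimal sketch on synthetic data (values are arbitrary): the
same regressor class estimates the conditional mean by default, or a
conditional quantile via `loss`::

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.RandomState(0)
    X = rng.rand(200, 1)
    y = 3 * X.ravel() + rng.randn(200)

    # Default loss: estimates the conditional mean.
    mean_model = GradientBoostingRegressor().fit(X, y)
    # loss='quantile' with alpha=0.9: estimates the conditional 90% quantile.
    q90_model = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X, y)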

For linear models, there are many estimator classes which are very close to
each other. Let us have a look at the following (a short code sketch fitting
all of them follows the list):

- :class:`~linear_model.LinearRegression`, no penalty

Member:
Should we say here that it is there only for teaching purposes? IMHO, it should never be used, and I would love to retire this class (though I do know that not everybody agrees with this desire).

Member Author:
I think it is the wrong place to mention this. And I belong to the mentioned small group...

Member:
Suggested change
- :class:`~linear_model.LinearRegression`, no penalty
- :class:`~linear_model.LinearRegression`, no penalty (for teaching purposes, prefer Ridge with a small penalty)

Member:
This can be read as "prefer Ridge with a small penalty for teaching purposes", so maybe:

Suggested change
- :class:`~linear_model.LinearRegression`, no penalty
- :class:`~linear_model.LinearRegression`, no penalty. This estimator exists mostly for teaching purpose. In general Ridge with a small penalty is preferable.

Member Author:
Again, I'm not sure that this is the right place.

Member:
I agree with @lorentzenchr. I think the message to convey boils down to a listing of the linear models for regression. We can make it explicit beforehand that we are not stating which models to use but rather making a catalogue of the models.

Member:
Now, considering that this is only an entry in the FAQ, I think it is even more relevant not to add usage information.

- :class:`~linear_model.Ridge`, L2 penalty

Member:
Do you think that it would make sense to list RidgeCV, LassoCV, and ElasticNetCV (maybe together, after SGD), mentioning that they are penalized models where the optimum penalty is found via CV?

Member Author:
I thought about that while I was writing this PR and decided against it. I'd say that they form their own group: LinearRegression, Ridge, Lasso and ElasticNet all do one and the same thing, while RidgeCV, LassoCV and ElasticNetCV together do another.
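
A minimal sketch of the two groups, on made-up data: the plain classes take
the penalty as given, while the `*CV` classes select it via cross-validation:

    import numpy as np
    from sklearn.linear_model import Ridge, RidgeCV

    rng = np.random.RandomState(0)
    X, y = rng.rand(100, 3), rng.rand(100)

    Ridge(alpha=1.0).fit(X, y)                        # penalty fixed by the user
    reg = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)  # penalty chosen by CV
    print(reg.alpha_)                                 # the selected penalty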

Member:
Not a big deal on my side, we can keep it as it is.

- :class:`~linear_model.Lasso`, L1 penalty (sparse models)
- :class:`~linear_model.ElasticNet`, L1 + L2 penalty (less sparse models)
- :class:`~linear_model.SGDRegressor` with `loss='squared_loss'`
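
A short sketch, with synthetic data and arbitrary penalty strengths, fitting
all of these estimators on the same problem::

    import numpy as np
    from sklearn.linear_model import (ElasticNet, Lasso, LinearRegression,
                                      Ridge, SGDRegressor)

    rng = np.random.RandomState(0)
    X = rng.rand(100, 5)
    y = X @ np.array([1.0, 2.0, 0.0, 0.0, 3.0]) + 0.01 * rng.randn(100)

    models = [
        LinearRegression(),                     # no penalty
        Ridge(alpha=0.1),                       # L2 penalty
        Lasso(alpha=0.01),                      # L1 penalty, sparse coef_
        ElasticNet(alpha=0.01, l1_ratio=0.5),   # L1 + L2 penalty
        SGDRegressor(penalty="l2", alpha=0.1),  # squared loss is the default
    ]
    for model in models:
        print(type(model).__name__, model.fit(X, y).coef_.round(2))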

**Maintainer perspective:**
In principle, they all do the same thing and differ only in the penalty they
impose. This, however, has a large impact on the way the underlying
optimization problem is solved. In the end, it amounts to using different
methods and tricks from linear algebra. A special case is `SGDRegressor`,
which comprises all 4 previous models and differs in its optimization
procedure. A further side effect is that the different estimators favor
different data layouts (`X` C-contiguous or F-contiguous, sparse CSR or CSC).
This complexity of the seemingly simple linear models is the reason for
having different estimator classes for different penalties.
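
As an illustrative sketch of the layout point (not a performance
recommendation): the coordinate-descent solvers access `X` column by column,
so a C-contiguous input is internally copied to Fortran order, while the fit
itself is unaffected::

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    X_c = rng.rand(1000, 20)        # C-contiguous (the NumPy default)
    X_f = np.asfortranarray(X_c)    # the same data, F-contiguous
    y = X_c @ rng.rand(20)

    # Identical results; only the internal data handling differs.
    assert np.allclose(Lasso(alpha=0.1).fit(X_c, y).coef_,
                       Lasso(alpha=0.1).fit(X_f, y).coef_)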

**User perspective:**
First, the current design is inspired by the scientific literature, where
linear regression models with different regularization/penalty terms were
given different names, e.g. *ridge regression*. Having different model
classes with corresponding names makes it easier for users to find those
regression models.
Secondly, if all 5 of the above-mentioned linear models were unified into a
single class, there would be parameters with a lot of options, like the
``solver`` parameter. On top of that, there would be a lot of mutually
exclusive interactions between different parameters. For example, the
possible options of the parameters ``solver``, ``precompute`` and
``selection`` would depend on the chosen values of the penalty parameters
``alpha`` and ``l1_ratio``.
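
To illustrate the kind of interaction meant here, a small sketch using the
real parameters of the current classes (the unified class is hypothetical)::

    from sklearn.linear_model import Lasso, Ridge

    # Each penalty brings its own solver machinery and options:
    Ridge(alpha=1.0, solver="cholesky")   # `solver` is specific to Ridge
    Lasso(alpha=1.0, precompute=True,     # `precompute` and `selection` belong
          selection="random")             # to the coordinate-descent solver
    # In one merged class, the valid values of `solver`, `precompute` and
    # `selection` would all depend on the chosen `alpha` and `l1_ratio`.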