Skip to content

Restructure Keras Scikit-Learn wrappers to better implement Scikit-Learn API #37201

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

adriangb
Copy link
Contributor

@adriangb adriangb commented Mar 1, 2020

This is a modification of #32533. I am opening a new PR because that one seems stalled and I made a lot of changes/improvements (but keeping the same idea).

A quick summary:
The existing scikit-learn wrappers for Keras models are not compatible with many scikit-learn functions. Additionally, they require that dataset dimensions be determined before calling the fit method, which is unlike the scikit-learn estimators and makes it hard to build dynamically adaptable pipelines.

What this PR does:
This PR does not change any API.
By moving the storage of parameters from self.sk_params to self.__dict__, compatibility with a lot of the scikit-learn functionalities are improved. Additionally, I gave the model building function the ability to request the data that will be fitted (to determine dimensions) as well as any other attributes of the wrapper instance. Finally, I enabled copying/pickling of wrapped models as well as the ability to wrap instances of Model, which should allow for greater flexibility in incorporating into an exsiting Keras workflow.

How is it tested:
All existing tests are working and were unchanged. This confirms that there were no API changes. New tests were added for all of the new functionality as well as some of the most common scikit-learn operations that were previously broken.
I would like to credit @daviddiazvico , the author of the original PR: I borrowed a lot of the tests he had written as well as the original idea for fixing these issues. I will make him a co-author of the final commit if this gets approved.

@tensorflow-bot tensorflow-bot bot added the size:XL CL Change Size:Extra Large label Mar 1, 2020
@daviddiazvico
Copy link

Thanks for your effort @adriangb ! I'm sorry I couldn't write back before.

Of course, I don't have any problem with you updating the other PR, but I'm not sure if I have to do something to allow you to edit it. I've added you as a collaborator to my tf fork.

In any case, this is a nice improvement for tf.keras in my opinion, but I don't know if the development team is interested. My PR has been open for a while and the same changes were proposed in the keras repo about a year ago now.

@adriangb
Copy link
Contributor Author

adriangb commented Mar 1, 2020

I'm hoping that maybe the reviewers were just a bit busy?

I feel like the wrappers are used quite often, especially in beginner tutorials. I think it's important that the initial experience be seamless.

In addition to your PR, this resolves several issues: #33204, #36074, #34689 and #36137. There are/will be more, like the comment I posted on your PR.

I'm hoping we can at least get some feedback from @fchollet or @pavithrasv regarding interest in these changes.

@adriangb adriangb changed the title Restructure Scikit-Learn wrappers to better implement Scikit-Learn API Restructure Keras Scikit-Learn wrappers to better implement Scikit-Learn API Mar 1, 2020
@gbaned gbaned self-assigned this Mar 2, 2020
@gbaned gbaned added the comp:keras Keras related issues label Mar 2, 2020
@gbaned gbaned requested a review from fchollet March 2, 2020 04:16
@gbaned gbaned added the awaiting review Pull request awaiting review label Mar 5, 2020
@adriangb
Copy link
Contributor Author

adriangb commented Mar 5, 2020

I was able to add built-in support for all of the multi-output modes that Scikit-Learn supports, as well as a framework to easily support multi-input models. This means this would close #34689 as well.

Because of the great modularity of the Functional API, only a limited number of multi-output cases can be automatically supported (those that scikit-learn itself supports with 1-1 mapping of model outputs to y columns) and no multi-input models can be supported out of the box. If a user wants to use a more complex model with the wrappers, they would have to subclass and manually define the desired behavior. As I show in the tests, all that is needed to implement multi-input is to define a behvior for _pre_process_X (for example, columns 0-3 are one input, 4 is another input). All of this was done with no changes to the API.

Although the original problem statement ('fix compatibility') was quite large, I do feel that this PR has grown very large. I would personally prefer to split it into smaller PRs (even if that means more work for me), but I will leave that up to the reviewers.

@gbaned , is there a timeline for this review process? It would be nice to at least get tests running so that I can see if there are issues.

@adriangb
Copy link
Contributor Author

adriangb commented Mar 7, 2020

I realized that we actually need to re-implement not only an R^2 score, but also a classifier accuracy score: Keras does not use the sample_weight parameter for metrics, but Scikit-Learn does use it for scoring. That and the fact that Keras metrics don't support the Scikit-Learn style multi-output concept, so all of those scores are already being handled manually.

This prompted me to think: does it make sense to make scikit-learn an optional dependency that is only imported within this module? I can see pros and cons, just playing devil's advocate here.

@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Mar 9, 2020
k-w-w
k-w-w previously approved these changes Mar 9, 2020
Copy link
Contributor

@k-w-w k-w-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to run the presubmits.

Can you add tests to make sure that saving and loading works with the wrapper?

@tensorflow-bot tensorflow-bot bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 9, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 9, 2020
@adriangb
Copy link
Contributor Author

adriangb commented Mar 9, 2020

Approving to run the presubmits.

Can you add tests to make sure that saving and loading works with the wrapper?

Thank you for kicking off those tests!

There are several tests for pickling/unpickling of Functional API models with and without Callbacks, etc. I think the only thing that is missing is a test for subclassed models. I'll add that in the next few days.

@adriangb
Copy link
Contributor Author

adriangb commented Mar 10, 2020

Quite a few errors:

  • Pylint: my fault, I was using 4 spaces. Fixing.
  • Scipy import error: for some CI builds, the scikit-learn version was pinned to a 2016 version that now has some broken functionality that these new changes were relying on. I bumped the version.
  • Python 2 errors: I'm ignoring these.
  • API changes: it's complaining about changes to internal parameters that should never have been public in the first place. I now appended an _ to them to fix this for the future. It also doesn't like that I capitalized X to match scikit-learn.
  • Windows builds: I have no idea why these are failing. Any advice would be much appreciated.

Also, I added the test suggested by @k-w-w (it's called SerializeCustomLayers fyi).

I guess another approval is needed for tests to run again, it'd be nice to fix the windows build errors before that though.

@tensorflow-bot tensorflow-bot bot removed the ready to pull PR ready for merge process label Mar 10, 2020
@k-w-w k-w-w added the kokoro:force-run Tests on submitted change label Mar 10, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 10, 2020
@k-w-w
Copy link
Contributor

k-w-w commented Mar 10, 2020

Thanks for fixing the bugs! I'm pretty sure the windows tests are unrelated. Running the tests again

@gbaned gbaned requested a review from k-w-w March 16, 2020 12:06
@k-w-w k-w-w added the kokoro:force-run Tests on submitted change label Mar 17, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 17, 2020
Copy link
Contributor

@fchollet fchollet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello Adrian,

Thank you for the PR. In general, the scikit-learn API wrapper for Keras is not well-maintained at this time. Because we don't have the resources to maintain it (or even to review a very involved PR like yours), we are considering deprecating it.

Since you are a user of this feature and you've already spent a lot of time developing improvements, I would recommend starting a new repository & pip package hosting an up-to-date version of the API wrapper (effectively, do a fork). We could then redirect users of tf.keras.wrappers.scikit_learn to your repository & pip package, while we deprecate this functionality in tf.keras.

What do you think?

Sequential.evaluate,
Sequential.fit,
Sequential.predict,
Sequential.predict_classes,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

predict_classses is now deprecated.

@adriangb
Copy link
Contributor Author

adriangb commented Apr 10, 2020

@fchollet, thank you for looking at this!

I think that could work really well. But I'd like to list out pros and cons to consider

Pros:

  • More flexibility. scikit-learn can become required for the wrapper, which eliminates a lot of code duplication for this wrapper and removes the need for scikit-learn from TF CI.
  • Relieves maintenance burden from Keras/TF team.

Cons:

  • Improvements in Keras might take a long time to trickle over.
  • When issues do crop up in the future, wrapper maintainers (i.e. me) might have to open an issue with Keras/TF to get support in fixing things.

Overall, I think the pros outweigh the cons.

I will start working on getting a separate repo with CI/publishing working. In the meantime, if you could do a brief review of this PR as it currently stands, that would be super useful to make sure the initial release is as good as possible.

adriangb added a commit to adriangb/scikeras that referenced this pull request Apr 11, 2020
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Apr 12, 2020
@adriangb
Copy link
Contributor Author

Are there any preferences as far as:

  • Naming of the repo/package or any project this should live under. I came up with sklearn-keras-wrap
  • Documentation style or content. I planned on copying the existing docs but just keeping everying in the README.md for simplicity.
  • Python versions: I think it would be easier to make this package >=Python3.5
  • Linting: I would switch to flake8/black
  • Testing: I would switch to pytest.

I played around with a packaging a bit, everything seems to work as far as CI/testing/releasing. It would have scikit-learn>=0.21.0 and tensorflow>=2.1.0 as dependencies.

@gbaned gbaned added the awaiting review Pull request awaiting review label Apr 13, 2020
@gbaned gbaned requested a review from fchollet April 24, 2020 08:00
adriangb added a commit to adriangb/scikeras that referenced this pull request May 19, 2020
adriangb added a commit to adriangb/scikeras that referenced this pull request May 19, 2020
@adriangb
Copy link
Contributor Author

A quick update: the package is now fully operational. I settled on the name SciKeras.
PyPi: https://pypi.org/project/scikeras/
GitHub: https://github.com/adriangb/scikeras

Some important updates since this PR:

  1. Inherited BaseWrapper from sklearn.base.BaseEstimator.
  2. Implemented tags interface.
  3. Use OneHotEncoder/LabelEncoder instead of manual numpy work.
  4. Fix model serialization bugs.
  5. A lot of cleanup.

With all of this, estimators created with these wrappers now pass all of scikit-learn's estimator checks, except those that require setting a random state. As far as I understand, it is not possible to easily set a random seed in tf.

@adriangb
Copy link
Contributor Author

adriangb commented Jul 6, 2020

Hi @gbaned, just checking if there are any updates on this proposal/PR? Thanks!

@gbaned
Copy link
Contributor

gbaned commented Jul 9, 2020

Hi @gbaned, just checking if there are any updates on this proposal/PR? Thanks!

Hi @adriangb, It is waiting for approval. Thanks!

@fchollet
Copy link
Contributor

Thanks for the update. It's great to see that you've already released the new package. We can recommend that people start using it instead of keras.wrappers.scikit_learn.

With all of this, estimators created with these wrappers now pass all of scikit-learn's estimator checks, except those that require setting a random state. As far as I understand, it is not possible to easily set a random seed in tf.

This should be fixable: https://www.tensorflow.org/api_docs/python/tf/random/set_seed

What do you want us to do with the current PR? Should we close it?

@adriangb
Copy link
Contributor Author

We can recommend that people start using it instead of keras.wrappers.scikit_learn.

That sounds good.

This should be fixable: https://www.tensorflow.org/api_docs/python/tf/random/set_seed

Will take a look, thank you.

What do you want us to do with the current PR? Should we close it?

I think let's keep it open for a bit longer. Dask is looking to adopt SciKeras as a wrapper (here), so as they do their testing I expect there to be a couple of issues that crop up in the next couple of weeks that I may need input from the TF team on. Unless the TF team is willing to check the SciKeras repo if they are tagged.

@fchollet
Copy link
Contributor

fchollet commented Jul 16, 2020

I think let's keep it open for a bit longer. Dask is looking to adopt SciKeras as a wrapper (here), so as they do their testing I expect there to be a couple of issues that crop up in the next couple of weeks that I may need input from the TF team on. Unless the TF team is willing to check the SciKeras repo if they are tagged.

Ok, sounds good! Please reach out if you need anything from us (over email preferably, so we don't miss it). We'll start recommending your library as soon as it starts getting traction then 👍

@gbaned gbaned removed the awaiting review Pull request awaiting review label Jul 17, 2020
@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Jul 24, 2020
@gbaned
Copy link
Contributor

gbaned commented Jul 29, 2020

@adriangb Any update on this PR? Please. Thanks!

@gbaned
Copy link
Contributor

gbaned commented Aug 6, 2020

@adriangb Any update on this PR? Please. Thanks!

@adriangb
Copy link
Contributor Author

adriangb commented Aug 6, 2020

Hi @gbaned, as per François' comment above, the plan is to not merge this PR and instead move this part of tf.keras to an external package. I had asked to keep this PR open for communication and help with bringup of the external package, but I've since established communication with François directly and the external package is making good progress, so I think we can close this PR 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes comp:keras Keras related issues size:XL CL Change Size:Extra Large stat:awaiting response Status - Awaiting response from author
Projects
None yet
Development

Successfully merging this pull request may close these issues.