Skip to content

Multi processing bug fix #1026

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
May 19, 2022

Conversation

Ernaldis
Copy link
Contributor

This PR proposes a fix to the bug described in #1013. In brief, when a user tries to populate a table which has already been populated while also setting processes to be greater than 1, the population should be skipped, just like it is when processes is equal to 1. Instead, a value error is thrown.

This PR also alters the return value to also return the number of keys processed. This can be useful in automating workflows.

Closes #1013

@Ernaldis Ernaldis added the bug Indicates an unexpected problem or unintended behavior label May 18, 2022
@Ernaldis Ernaldis self-assigned this May 18, 2022
@Ernaldis
Copy link
Contributor Author

Note to self: add unit tests which check return value of populate runs

@@ -252,8 +256,7 @@ def handler(signum, frame):
if reserve_jobs:
signal.signal(signal.SIGTERM, old_handler)

if suppress_errors:
return error_list
return error_list, nkeys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does populate need to return nkeys now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be useful in automating workflows, since it allows us to determine how many keys were found to be processed.
For example, this features is used in the neurophotonics project in check_db. This allows the EC2 instance to continue working until the user stops inserting keys into the database, for those times that keys are inserted after the instance started, but before the initial workload is finished.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current users expect populate to return the error list. This change may break existing code. Changing the API should be done carefully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a valid concern, but I think it is also useful to be able to tell how many keys were processed.

I notice that there is nothing returned if suppress_errors == False, the default value. What if the return was left as just error_list in the case suppress_errors == True, and returned nkeys only if suppress_erros == False?

if processes < 0:
raise Exception("processes must not be negative")
elif processes == 0:
return error_list, nkeys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here error_list is empty. Why does this need to be return? What's the use case for processes == 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, error_list will indeed always be an empty list, but I think it is best to return it anyway. If we do it like this, then autopopulate will always return a list and an integer, regardless of which case this happens to be. Since we can always expect the same return type, we will never have to later worry about which case we will encounter.

As for the use case of processes == 0, this is the case which caused the error the PR is meant to solve. Consider the case in which you are using multiprocessing (that is, when you call this function, you have set processes to be greater than one), but the table you are populating has no keys remaining to process. This case occurred when automating the workflow of the neurophotonics project, since populate was being called on every table.

In this case, we would like for the function to do nothing and move on to the next task. Instead, it throws a value error. This is because of lines (215-216) in the autopopulate function.

if processes > 1:
            processes = min(processes, nkeys, mp.cpu_count())

Since processes has been set to a value greater than 1, the second line here runs. Since the table has already been populated in this case, nkeys = 0. Since there is no reason for any of these three values to be negative, the minimum is 0. Thus processes=0 at this point.

Previously, if processes was not equal to 1, we would try to spawn multiple processes equal to processes. Since 0 != 1, we try to call multiprocessing with a pool of 0 processes, which created a value error.

This line prevents this exception from being thrown by handling the case in which processes == 0. Now, the function simply reports that there were no errors and no keys to process, and moves on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if nkeys <= 0, it should just quietly exit. That's consistent with previous behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case nkeys == 0, it does quietly exit. It returns ([], 0), but it doesn't mutate any values, initialize new variables or print anything to the screen. Users don't have to use the return value, but they can if they want to.

The case nkeys < 0 should never occur. My understanding is that this value represents the number of keys which need to be processed. If this value is negative, an error has occurred.

Co-authored-by: Dimitri Yatsenko <dimitri@datajoint.com>
@dimitri-yatsenko
Copy link
Member

@Ernaldis would you set the release date and actually make the release?

@dimitri-yatsenko dimitri-yatsenko self-requested a review May 18, 2022 22:02
Copy link
Collaborator

@guzman-raphael guzman-raphael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ernaldis Thanks for patching this! 🤝

Only have a small issue with the updated documentation since it doesn't seem correct. Although, should be straightforward to include.

@@ -173,8 +173,7 @@ def populate(
:param limit: if not None, check at most this many keys
:param max_calls: if not None, populate at most this many keys
:param display_progress: if True, report progress_bar
:param processes: number of processes to use. When set to a large number, then
uses as many as CPU cores
:param processes: number of processes to use. Set to None to use all cores
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great idea to improve how this can be specified but doesn't seem to be true based on what is here.

If you do in fact set process=None the following would happen:

  • [216]: Should throw an error like: TypeError: '>' not supported between instances of 'NoneType' and 'int'

You could probably just replace that particular if-block with something like this:

min(nkeys, *([processes, mp.cpu_count()] if processes else [mp.cpu_count()]))

OR

min(*(_ for _ in (processes, nkeys, mp.cpu_count()) if _))

These approaches would leverage argument expansion to our benefit to expand only the relevant values into the min function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops good point.

@Ernaldis Ernaldis force-pushed the multi-processing-bug-fix branch from 146f07e to 91c8db4 Compare May 19, 2022 16:06
@Ernaldis Ernaldis requested a review from guzman-raphael May 19, 2022 16:09
Copy link
Collaborator

@guzman-raphael guzman-raphael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ernaldis This will still raise a TypeError. If you update this test with processes=None you will see what I am talking about. That test is meant for testing multiprocessing anyway. Probably you can just get rid of the if since it seems unnecessary. The default is already 1 which should always be the min.

@guzman-raphael guzman-raphael merged commit 0ff34f2 into datajoint:master May 19, 2022

if processes > 1:
processes = min(processes, nkeys, mp.cpu_count())
processes = min(*(_ for _ in (processes, nkeys, mp.cpu_count()) if _))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this, processes=0 will have the same effect as processes=None.

@Ernaldis Ernaldis deleted the multi-processing-bug-fix branch May 19, 2022 18:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ValueError thrown when multi-processing runs into previously computed tables
4 participants