trying to debug the hanging problem #951
base: master
Pull Request Overview
This PR adds comprehensive debug logging to the MFC scheduler to help diagnose hanging thread issues, along with increased walltime for CI jobs. The changes focus on adding visibility into thread lifecycle, GPU device assignment, and potential deadlock scenarios.
- Adds extensive debug logging throughout the scheduler with a new --sched-debug flag
- Increases walltime for Phoenix CI jobs from 2-3 hours to 3-4 hours
- Enhances error messages with GPU device context and timeout handling
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
toolchain/mfc/sched.py | Adds comprehensive debug logging system and enhanced error handling for thread management |
toolchain/mfc/args.py | Adds new --sched-debug command-line argument |
.github/workflows/phoenix/test.sh | Enables scheduler debug logging and fixes typo in flag name |
.github/workflows/phoenix/submit.sh | Increases walltime from 3 to 4 hours |
.github/workflows/phoenix/submit-bench.sh | Increases walltime from 2 to 3 hours |
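For illustration, a boolean flag like the one listed for args.py, and the logging it gates, could be wired up roughly as follows. This is a minimal sketch, not the code in this PR: only the flag name `--sched-debug` comes from the summary above, while `build_parser`, `make_debug_logger`, and the message format are hypothetical.

```python
import argparse
import sys
import threading
import time


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the flag name --sched-debug is taken from the PR.
    parser = argparse.ArgumentParser(prog="sched-demo")
    parser.add_argument(
        "--sched-debug",
        action="store_true",
        default=False,
        help="Enable detailed scheduler debug logging.",
    )
    return parser


def make_debug_logger(enabled: bool, stream=sys.stderr):
    # Returns a logging function that does nothing unless debugging is enabled.
    def log(message: str) -> None:
        if not enabled:
            return
        stamp = time.strftime("%H:%M:%S")
        thread_name = threading.current_thread().name
        # Flush on every message so output survives a job that later hangs.
        print(f"[SCHED DEBUG {stamp} {thread_name}] {message}",
              file=stream, flush=True)

    return log


if __name__ == "__main__":
    args = build_parser().parse_args()
    debug = make_debug_logger(args.sched_debug)
    debug("scheduler starting")
```

argparse exposes `--sched-debug` as `args.sched_debug`. Flushing each message matters when diagnosing hangs, since buffered output can be lost if the CI job is killed at the walltime limit.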
Comments suppressed due to low confidence (1)
toolchain/mfc/sched.py:178
- The variable 'device_idx' is not used within the loop body. Consider using an underscore '_' instead to indicate it's intentionally unused.
for device_idx in range(task.ppn):
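The convention this suppressed comment refers to can be shown with a small sketch. The functions below are hypothetical and unrelated to the actual body of the loop in sched.py. They only contrast a loop whose index is never read with one where it is.

```python
def count_slots(ppn: int) -> int:
    # The loop index is never read, so "_" signals that it is intentionally unused.
    slots = 0
    for _ in range(ppn):
        slots += 1
    return slots


def assign_devices(ppn: int, available: list) -> list:
    # Here the index is used to pick a device, so it keeps a descriptive name.
    return [available[device_idx % len(available)] for device_idx in range(ppn)]
```

If the real loop body does read `device_idx`, for example in a debug message, the suggestion does not apply.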
User description
debug hanging + add some walltime
PR Type
Bug fix, Enhancement
Description
Add scheduler debug logging to diagnose hanging issues
Increase walltime limits for CI jobs
Improve thread joining with timeouts and error handling
Fix typo in test script scheduler debug flag
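Joining threads with a timeout and surfacing worker errors, as the description mentions, can be sketched with the standard `threading` module. This is an assumption-laden illustration rather than the scheduler's implementation: `CapturingThread` and `join_outcome` are hypothetical names, and the three outcome strings are made up for the example.

```python
import threading


class CapturingThread(threading.Thread):
    """Hypothetical worker thread that records an exception instead of losing it."""

    def __init__(self, target):
        # daemon=True so a stuck worker cannot keep the interpreter alive.
        super().__init__(daemon=True)
        self._fn = target
        self.error = None

    def run(self):
        try:
            self._fn()
        except Exception as exc:  # stored so the joining thread can report it
            self.error = exc


def join_outcome(thread: CapturingThread, timeout_s: float) -> str:
    # Thread.join(timeout=...) always returns None, so is_alive() is the
    # only way to tell whether the join timed out.
    thread.join(timeout=timeout_s)
    if thread.is_alive():
        return "timeout"
    if thread.error is not None:
        return "failed"
    return "ok"
```

A caller can then log or raise on "timeout" instead of blocking forever in a plain `join()`, which is one way a hang becomes visible in the CI output.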
File Walkthrough
args.py
Add scheduler debug command line option
toolchain/mfc/args.py
Adds the --sched-debug command line argument for enabling detailed scheduler debug logging
sched.py
Enhanced scheduler with debug logging and robust thread handling
toolchain/mfc/sched.py
GPU)
test.sh
Enable scheduler debug in test script
.github/workflows/phoenix/test.sh
--schedul-debug flag to test command (contains typo)
submit-bench.sh
Increase benchmark job walltime limit
.github/workflows/phoenix/submit-bench.sh
submit.sh
Increase regular job walltime limit
.github/workflows/phoenix/submit.sh