
trying to debug the hanging problem #951


Draft · wants to merge 3 commits into base: master

Conversation

@sbryngelson (Member) commented Jul 18, 2025

User description

debug hanging + add some walltime


PR Type

Bug fix, Enhancement


Description

  • Add scheduler debug logging to diagnose hanging issues

  • Increase walltime limits for CI jobs

  • Improve thread joining with timeouts and error handling (see the sketch below)

  • Fix typo in test script scheduler debug flag

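For context on the thread-joining changes, here is a minimal sketch of an exception-capturing worker thread of the kind the scheduler excerpts later on this page rely on. The attribute names exc and completed_successfully mirror those excerpts from toolchain/mfc/sched.py; the actual class may differ (notably, the real code evidently allows a thread to count as completed even when an exception was recorded, e.g. a test failure handled by handle_case):

import threading

class WorkerThread(threading.Thread):
    """Thread that records any exception from its target so the scheduler
    can inspect it after join() instead of silently losing it."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.exc = None                      # exception raised by the target, if any
        self.completed_successfully = False  # True only if the target returned normally

    def run(self):
        try:
            if self._target is not None:
                self._target(*self._args, **self._kwargs)
            self.completed_successfully = True
        except Exception as e:               # stored for the joining thread to re-raise
            self.exc = e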

Diagram Walkthrough

flowchart LR
  A["Add --sched-debug flag"] --> B["Enhanced scheduler logging"]
  B --> C["Improved thread joining"]
  C --> D["Better error handling"]
  E["Increase walltime"] --> F["Prevent CI timeouts"]

File Walkthrough

Relevant files
Enhancement
args.py
Add scheduler debug command-line option

toolchain/mfc/args.py

  • Add --sched-debug command-line argument for enabling detailed
    scheduler debug logging
+1/-0     
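
As a rough illustration (not the verbatim args.py diff), a flag like this is typically declared with argparse along the following lines; the parser object and help text here are hypothetical:

import argparse

# Hypothetical sketch; the real parser in toolchain/mfc/args.py attaches
# the option to MFC's existing subcommands rather than a fresh parser.
parser = argparse.ArgumentParser()
parser.add_argument("--sched-debug", action="store_true", default=False,
                    help="Enable detailed scheduler debug logging.")

args = parser.parse_args(["--sched-debug"])
assert args.sched_debug  # argparse turns hyphens into underscores
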
Bug fix
sched.py
Enhanced scheduler with debug logging and robust thread handling

toolchain/mfc/sched.py

  • Add comprehensive debug logging throughout scheduler execution
  • Implement thread joining with configurable timeouts (30s CPU, 120s
    GPU)
  • Enhance error handling with detailed context information
  • Add GPU device load tracking debug output
+72/-6   
test.sh
Enable scheduler debug in test script                                       

.github/workflows/phoenix/test.sh

  • Add --schedul-debug flag to test command (contains typo)
+1/-1     
Configuration changes
submit-bench.sh
Increase benchmark job walltime limit                                       

.github/workflows/phoenix/submit-bench.sh

  • Increase walltime from 2 hours to 3 hours for benchmark jobs
+1/-1     
submit.sh
Increase regular job walltime limit                                           

.github/workflows/phoenix/submit.sh

  • Increase walltime from 3 hours to 4 hours for regular jobs
+1/-1     
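
For reference, a walltime bump like this is usually a one-line batch-directive change. The snippet below is a hypothetical before/after assuming Slurm-style #SBATCH headers; the actual directive syntax in submit.sh may differ:

# Hypothetical diff, assuming a Slurm-style submit script:
-#SBATCH -t 03:00:00   # regular jobs: 3 hours
+#SBATCH -t 04:00:00   # regular jobs: 4 hours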

Copilot AI review requested due to automatic review settings July 18, 2025 15:01
@sbryngelson requested a review from a team as a code owner July 18, 2025 15:01
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds comprehensive debug logging to the MFC scheduler to help diagnose hanging thread issues, along with increased walltime for CI jobs. The changes focus on adding visibility into thread lifecycle, GPU device assignment, and potential deadlock scenarios.

  • Adds extensive debug logging throughout the scheduler with a new --sched-debug flag
  • Increases walltime for Phoenix CI jobs from 2-3 hours to 3-4 hours
  • Enhances error messages with GPU device context and timeout handling

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • toolchain/mfc/sched.py: Adds a comprehensive debug logging system and enhanced error handling for thread management
  • toolchain/mfc/args.py: Adds the new --sched-debug command-line argument
  • .github/workflows/phoenix/test.sh: Enables scheduler debug logging (the flag name contains a typo; see below)
  • .github/workflows/phoenix/submit.sh: Increases walltime from 3 to 4 hours
  • .github/workflows/phoenix/submit-bench.sh: Increases walltime from 2 to 3 hours
Comments suppressed due to low confidence (1)

toolchain/mfc/sched.py:178

  • The variable 'device_idx' is not used within the loop body. Consider using an underscore '_' instead to indicate it's intentionally unused.
                for device_idx in range(task.ppn):


PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Typo Bug

The flag --schedul-debug contains a typo and should be --sched-debug to match the argument definition in args.py. This will cause the debug logging feature to not work as intended.

./mfc.sh test --max-attempts 3 -a --schedul-debug -j $n_test_threads $device_opts -- -c phoenix
Performance Impact

Extensive debug logging is added throughout the scheduler with multiple debug_log calls per task and thread operation. This could significantly impact performance when debug mode is enabled, especially for large test suites.

def debug_log(msg):
    if debug_enabled:
        cons.print(msg)

if debug_enabled:
    debug_log(f"[SCHED DEBUG] Starting scheduler: {len(tasks)} tasks, {nThreads} threads, GPU mode: {gpu_mode}")
    if gpu_mode:
        debug_log(f"[SCHED DEBUG] GPU devices: {devices}")

def join_first_dead_thread(progress, complete_tracker) -> None:
    nonlocal threads, nAvailable

    debug_log(f"[SCHED DEBUG] Checking {len(threads)} active threads for completion")

    for threadID, threadHolder in enumerate(threads):
        # Check if thread is not alive OR if it's been running for too long
        thread_not_alive = not threadHolder.thread.is_alive()

        debug_log(f"[SCHED DEBUG] Thread {threadID}: alive={threadHolder.thread.is_alive()}, devices={threadHolder.devices}")

        if thread_not_alive:
            debug_log(f"[SCHED DEBUG] Thread {threadID} detected as dead, attempting to join...")

            # Properly join the thread with timeout to prevent infinite hangs
            join_start_time = time.time()
            timeout_duration = 120.0 if gpu_mode else 30.0  # Longer timeout for GPU

            debug_log(f"[SCHED DEBUG] Joining thread {threadID} with {timeout_duration}s timeout...")

            try:
                threadHolder.thread.join(timeout=timeout_duration)
                join_end_time = time.time()
                join_duration = join_end_time - join_start_time

                debug_log(f"[SCHED DEBUG] Thread {threadID} join completed in {join_duration:.2f}s")

                # Double-check that thread actually finished joining
                if threadHolder.thread.is_alive():
                    # Thread didn't finish within timeout - this is a serious issue
                    debug_log(f"[SCHED DEBUG] ERROR: Thread {threadID} still alive after {timeout_duration}s timeout!")
                    debug_log(f"[SCHED DEBUG] Thread {threadID} devices: {threadHolder.devices}")
                    debug_log(f"[SCHED DEBUG] Thread {threadID} exception: {threadHolder.thread.exc}")
                    raise RuntimeError(f"Thread {threadID} failed to join within {timeout_duration} seconds timeout. "
                                     f"Thread may be hung or in an inconsistent state. "
                                     f"GPU devices: {threadHolder.devices}")

            except Exception as join_exc:
                # Handle join-specific exceptions with more context
                debug_log(f"[SCHED DEBUG] Exception during thread {threadID} join: {join_exc}")
                raise RuntimeError(f"Failed to join thread {threadID}: {join_exc}. "
                                 f"This may indicate a system threading issue or hung test case. "
                                 f"GPU devices: {threadHolder.devices}") from join_exc

            debug_log(f"[SCHED DEBUG] Thread {threadID} successfully joined")

            # Check for and propagate any exceptions that occurred in the worker thread
            # But only if the worker function didn't complete successfully
            # (This allows test failures to be handled gracefully by handle_case)
            if threadHolder.thread.exc is not None:
                debug_log(f"[SCHED DEBUG] Thread {threadID} had exception: {threadHolder.thread.exc}")
                debug_log(f"[SCHED DEBUG] Thread {threadID} completed successfully: {threadHolder.thread.completed_successfully}")

                if threadHolder.thread.completed_successfully:
                    # Test framework handled the exception gracefully (e.g., test failure)
                    # Don't re-raise - this is expected behavior
                    debug_log(f"[SCHED DEBUG] Thread {threadID} exception was handled gracefully by test framework")
                    pass
                # Unhandled exception - this indicates a real problem
                elif hasattr(threadHolder.thread, 'exc_info') and threadHolder.thread.exc_info:
                    error_msg = f"Worker thread {threadID} failed with unhandled exception:\n{threadHolder.thread.exc_info}"
                    debug_log(f"[SCHED DEBUG] Thread {threadID} had unhandled exception!")
                    raise RuntimeError(error_msg) from threadHolder.thread.exc
                else:
                    debug_log(f"[SCHED DEBUG] Thread {threadID} had unhandled exception without details")
                    raise threadHolder.thread.exc

            # Update scheduler state
            nAvailable += threadHolder.ppn
            for device in threadHolder.devices or set():
                old_load = sched.LOAD[device]
                sched.LOAD[device] -= threadHolder.load / threadHolder.ppn
                debug_log(f"[SCHED DEBUG] Device {device} load: {old_load:.3f} -> {sched.LOAD[device]:.3f}")

            progress.advance(complete_tracker)

            debug_log(f"[SCHED DEBUG] Thread {threadID} cleanup complete, removing from active threads")
            del threads[threadID]

            break

    debug_log(f"[SCHED DEBUG] join_first_dead_thread completed, {len(threads)} threads remaining")

with rich.progress.Progress(console=cons.raw, transient=True) as progress:
    queue_tracker    = progress.add_task("Queued   ", total=len(tasks))
    complete_tracker = progress.add_task("Completed", total=len(tasks))

    debug_log(f"[SCHED DEBUG] Starting task queue processing...")

    # Queue Tests
    for task_idx, task in enumerate(tasks):
        debug_log(f"[SCHED DEBUG] Processing task {task_idx+1}/{len(tasks)}: ppn={task.ppn}, load={task.load}")

        # Wait until there are threads available
        while nAvailable < task.ppn:
            debug_log(f"[SCHED DEBUG] Waiting for resources: need {task.ppn}, have {nAvailable}")

            # This is important if "-j 1" is used (the default) since there
            # are test cases that require test.ppn=2
            if task.ppn > nThreads and nAvailable > 0:
                debug_log(f"[SCHED DEBUG] Task requires more threads ({task.ppn}) than available ({nThreads}), but some are free ({nAvailable})")
                break

            # Keep track of threads that are done
            join_first_dead_thread(progress, complete_tracker)

            # Do not overwhelm this core with this loop
            time.sleep(0.05)

        # Launch Thread
        progress.advance(queue_tracker)

        use_devices = None
        # Use the least loaded devices
        if devices is not None:
            use_devices = set()
            debug_log(f"[SCHED DEBUG] Assigning GPU devices for task {task_idx+1}")
            for device_idx in range(task.ppn):
                device = min(sched.LOAD.items(), key=lambda x: x[1])[0]
                sched.LOAD[device] += task.load / task.ppn
                use_devices.add(device)
                debug_log(f"[SCHED DEBUG] Assigned device {device} (load now: {sched.LOAD[device]:.3f})")

        nAvailable -= task.ppn

        debug_log(f"[SCHED DEBUG] Starting thread for task {task_idx+1}, devices: {use_devices}")
        thread = WorkerThread(target=task.func, args=tuple(task.args) + (use_devices,))
        thread.start()

        threads.append(WorkerThreadHolder(thread, task.ppn, task.load, use_devices))
        debug_log(f"[SCHED DEBUG] Thread started for task {task_idx+1}, {len(threads)} active threads")

    debug_log(f"[SCHED DEBUG] All tasks queued, waiting for completion...")

    # Wait for the last tests to complete (MOVED INSIDE CONTEXT)
    while len(threads) != 0:
        debug_log(f"[SCHED DEBUG] Waiting for {len(threads)} threads to complete...")

        # Keep track of threads that are done
        join_first_dead_thread(progress, complete_tracker)

        # Do not overwhelm this core with this loop
        time.sleep(0.05)

    debug_log(f"[SCHED DEBUG] Scheduler completed successfully!")
Logic Error

The device assignment loop uses device_idx as its loop variable, but the variable is never used in the body, and the same device can be selected on multiple iterations (duplicates then collapse in the use_devices set). The original code used _ to signal that the loop variable was unused; renaming it to device_idx suggests a per-device selection that does not actually happen.

for device_idx in range(task.ppn):
    device = min(sched.LOAD.items(), key=lambda x: x[1])[0]
    sched.LOAD[device] += task.load / task.ppn
    use_devices.add(device)
    debug_log(f"[SCHED DEBUG] Assigned device {device} (load now: {sched.LOAD[device]:.3f})")
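
To make the quoted behavior concrete, here is a self-contained sketch of that greedy selection, assuming sched.LOAD is a plain dict mapping device id to its current load (the values, ppn, and load below are made up). Because LOAD is updated inside the loop, min() can move to a different device between iterations, but repeat picks of the same device collapse in the set:

# Standalone sketch of the greedy least-loaded selection quoted above.
# LOAD stands in for sched.LOAD; ppn and load are hypothetical task values.
LOAD = {0: 0.0, 1: 0.0}
ppn, load = 4, 1.0

use_devices = set()
for _ in range(ppn):  # the loop index itself is unused, hence _
    device = min(LOAD.items(), key=lambda x: x[1])[0]
    LOAD[device] += load / ppn
    use_devices.add(device)

print(use_devices)  # {0, 1}: four picks, but only two distinct devices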


qodo-merge-pro bot commented Jul 18, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category / Suggestion / Impact
Possible issue
Fix typo in debug flag
Suggestion Impact: The suggestion identified a typo in the debug flag, and the follow-up commit fixed it, though to --schedule-debug rather than the suggested --sched-debug

code diff:

-./mfc.sh test --max-attempts 3 -a --schedul-debug -j $n_test_threads $device_opts -- -c phoenix
+./mfc.sh test --max-attempts 3 -a --schedule-debug -j $n_test_threads $device_opts -- -c phoenix

The flag --schedul-debug contains a typo and should be --sched-debug to match
the argument definition in args.py. This typo will cause the debug flag to be
ignored or potentially cause an error.

.github/workflows/phoenix/test.sh [25]

-./mfc.sh test --max-attempts 3 -a --schedul-debug -j $n_test_threads $device_opts -- -c phoenix
+./mfc.sh test --max-attempts 3 -a --sched-debug -j $n_test_threads $device_opts -- -c phoenix
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a typo (--schedul-debug) that would prevent the newly added debug logging feature from being enabled in the CI script.

High
General
Verify argument name consistency

The CLI flag is defined as --sched-debug in args.py, and Python's argparse converts hyphens to underscores in the parsed namespace, so code should look the value up as sched_debug (as the current code already does).

toolchain/mfc/sched.py [50]

+debug_enabled = ARG("sched_debug", False)  # Check for --sched-debug flag
Suggestion importance[1-10]: 2


Why: The suggestion correctly observes that sched_debug is the right way to access the --sched-debug argument, but it proposes no change as the code is already correct, making it of very low impact.

Low

@sbryngelson marked this pull request as draft July 19, 2025 13:56