
trying to debug the hanging problem #951


Draft · wants to merge 3 commits into base: master

Conversation

@sbryngelson (Member) commented Jul 18, 2025

User description

debug hanging + add some walltime


PR Type

Bug fix, Enhancement


Description

  • Add scheduler debug logging to diagnose hanging issues

  • Increase walltime limits for CI jobs

  • Improve thread joining with timeouts and error handling (see the sketch below)

  • Fix typo in test script scheduler debug flag

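For context on the thread-joining changes, here is a minimal sketch of an exception-capturing worker thread of the kind the scheduler excerpts later on this page rely on. The attribute names exc and completed_successfully mirror those excerpts from toolchain/mfc/sched.py; the actual class may differ (notably, the real code evidently allows a thread to count as completed even when an exception was recorded, e.g. a test failure handled by handle_case):

import threading

class WorkerThread(threading.Thread):
    """Thread that records any exception from its target so the scheduler
    can inspect it after join() instead of silently losing it."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.exc = None                      # exception raised by the target, if any
        self.completed_successfully = False  # True only if the target returned normally

    def run(self):
        try:
            if self._target is not None:
                self._target(*self._args, **self._kwargs)
            self.completed_successfully = True
        except Exception as e:               # stored for the joining thread to re-raise
            self.exc = e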

Diagram Walkthrough

flowchart LR
  A["Add --sched-debug flag"] --> B["Enhanced scheduler logging"]
  B --> C["Improved thread joining"]
  C --> D["Better error handling"]
  E["Increase walltime"] --> F["Prevent CI timeouts"]

File Walkthrough

Relevant files
Enhancement
args.py
Add scheduler debug command-line option

toolchain/mfc/args.py

  • Add --sched-debug command-line argument for enabling detailed
    scheduler debug logging
+1/-0     
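
As a rough illustration (not the verbatim args.py diff), a flag like this is typically declared with argparse along the following lines; the parser object and help text here are hypothetical:

import argparse

# Hypothetical sketch; the real parser in toolchain/mfc/args.py attaches
# the option to MFC's existing subcommands rather than a fresh parser.
parser = argparse.ArgumentParser()
parser.add_argument("--sched-debug", action="store_true", default=False,
                    help="Enable detailed scheduler debug logging.")

args = parser.parse_args(["--sched-debug"])
assert args.sched_debug  # argparse turns hyphens into underscores
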
Bug fix
sched.py
Enhanced scheduler with debug logging and robust thread handling

toolchain/mfc/sched.py

  • Add comprehensive debug logging throughout scheduler execution
  • Implement thread joining with configurable timeouts (30s CPU, 120s
    GPU)
  • Enhance error handling with detailed context information
  • Add GPU device load tracking debug output
+72/-6   
test.sh
Enable scheduler debug in test script                                       

.github/workflows/phoenix/test.sh

  • Add --schedul-debug flag to test command (contains typo)
+1/-1     
Configuration changes
submit-bench.sh
Increase benchmark job walltime limit                                       

.github/workflows/phoenix/submit-bench.sh

  • Increase walltime from 2 hours to 3 hours for benchmark jobs
+1/-1     
submit.sh
Increase regular job walltime limit                                           

.github/workflows/phoenix/submit.sh

  • Increase walltime from 3 hours to 4 hours for regular jobs
+1/-1     
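
For reference, a walltime bump like this is usually a one-line batch-directive change. The snippet below is a hypothetical before/after assuming Slurm-style #SBATCH headers; the actual directive syntax in submit.sh may differ:

# Hypothetical diff, assuming a Slurm-style submit script:
-#SBATCH -t 03:00:00   # regular jobs: 3 hours
+#SBATCH -t 04:00:00   # regular jobs: 4 hours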

Copilot AI review requested due to automatic review settings July 18, 2025 15:01
@sbryngelson requested a review from a team as a code owner July 18, 2025 15:01
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds comprehensive debug logging to the MFC scheduler to help diagnose hanging thread issues, along with increased walltime for CI jobs. The changes focus on adding visibility into thread lifecycle, GPU device assignment, and potential deadlock scenarios.

  • Adds extensive debug logging throughout the scheduler with a new --sched-debug flag
  • Increases walltime for Phoenix CI jobs from 2-3 hours to 3-4 hours
  • Enhances error messages with GPU device context and timeout handling

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • toolchain/mfc/sched.py: Adds a comprehensive debug logging system and enhanced error handling for thread management
  • toolchain/mfc/args.py: Adds the new --sched-debug command-line argument
  • .github/workflows/phoenix/test.sh: Enables scheduler debug logging (the flag name contains a typo; see below)
  • .github/workflows/phoenix/submit.sh: Increases walltime from 3 to 4 hours
  • .github/workflows/phoenix/submit-bench.sh: Increases walltime from 2 to 3 hours
Comments suppressed due to low confidence (1)

toolchain/mfc/sched.py:178

  • The variable 'device_idx' is not used within the loop body. Consider using an underscore '_' instead to indicate it's intentionally unused.
                for device_idx in range(task.ppn):


PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Typo Bug

The flag --schedul-debug contains a typo and should be --sched-debug to match the argument definition in args.py. This will cause the debug logging feature to not work as intended.

./mfc.sh test --max-attempts 3 -a --schedul-debug -j $n_test_threads $device_opts -- -c phoenix
Performance Impact

Extensive debug logging is added throughout the scheduler with multiple debug_log calls per task and thread operation. This could significantly impact performance when debug mode is enabled, especially for large test suites.

def debug_log(msg):
    if debug_enabled:
        cons.print(msg)

if debug_enabled:
    debug_log(f"[SCHED DEBUG] Starting scheduler: {len(tasks)} tasks, {nThreads} threads, GPU mode: {gpu_mode}")
    if gpu_mode:
        debug_log(f"[SCHED DEBUG] GPU devices: {devices}")

def join_first_dead_thread(progress, complete_tracker) -> None:
    nonlocal threads, nAvailable

    debug_log(f"[SCHED DEBUG] Checking {len(threads)} active threads for completion")

    for threadID, threadHolder in enumerate(threads):
        # Check if thread is not alive OR if it's been running for too long
        thread_not_alive = not threadHolder.thread.is_alive()

        debug_log(f"[SCHED DEBUG] Thread {threadID}: alive={threadHolder.thread.is_alive()}, devices={threadHolder.devices}")

        if thread_not_alive:
            debug_log(f"[SCHED DEBUG] Thread {threadID} detected as dead, attempting to join...")

            # Properly join the thread with timeout to prevent infinite hangs
            join_start_time = time.time()
            timeout_duration = 120.0 if gpu_mode else 30.0  # Longer timeout for GPU

            debug_log(f"[SCHED DEBUG] Joining thread {threadID} with {timeout_duration}s timeout...")

            try:
                threadHolder.thread.join(timeout=timeout_duration)
                join_end_time = time.time()
                join_duration = join_end_time - join_start_time

                debug_log(f"[SCHED DEBUG] Thread {threadID} join completed in {join_duration:.2f}s")

                # Double-check that thread actually finished joining
                if threadHolder.thread.is_alive():
                    # Thread didn't finish within timeout - this is a serious issue
                    debug_log(f"[SCHED DEBUG] ERROR: Thread {threadID} still alive after {timeout_duration}s timeout!")
                    debug_log(f"[SCHED DEBUG] Thread {threadID} devices: {threadHolder.devices}")
                    debug_log(f"[SCHED DEBUG] Thread {threadID} exception: {threadHolder.thread.exc}")
                    raise RuntimeError(f"Thread {threadID} failed to join within {timeout_duration} seconds timeout. "
                                     f"Thread may be hung or in an inconsistent state. "
                                     f"GPU devices: {threadHolder.devices}")

            except Exception as join_exc:
                # Handle join-specific exceptions with more context
                debug_log(f"[SCHED DEBUG] Exception during thread {threadID} join: {join_exc}")
                raise RuntimeError(f"Failed to join thread {threadID}: {join_exc}. "
                                 f"This may indicate a system threading issue or hung test case. "
                                 f"GPU devices: {threadHolder.devices}") from join_exc

            debug_log(f"[SCHED DEBUG] Thread {threadID} successfully joined")

            # Check for and propagate any exceptions that occurred in the worker thread
            # But only if the worker function didn't complete successfully
            # (This allows test failures to be handled gracefully by handle_case)
            if threadHolder.thread.exc is not None:
                debug_log(f"[SCHED DEBUG] Thread {threadID} had exception: {threadHolder.thread.exc}")
                debug_log(f"[SCHED DEBUG] Thread {threadID} completed successfully: {threadHolder.thread.completed_successfully}")

                if threadHolder.thread.completed_successfully:
                    # Test framework handled the exception gracefully (e.g., test failure)
                    # Don't re-raise - this is expected behavior
                    debug_log(f"[SCHED DEBUG] Thread {threadID} exception was handled gracefully by test framework")
                    pass
                # Unhandled exception - this indicates a real problem
                elif hasattr(threadHolder.thread, 'exc_info') and threadHolder.thread.exc_info:
                    error_msg = f"Worker thread {threadID} failed with unhandled exception:\n{threadHolder.thread.exc_info}"
                    debug_log(f"[SCHED DEBUG] Thread {threadID} had unhandled exception!")
                    raise RuntimeError(error_msg) from threadHolder.thread.exc
                else:
                    debug_log(f"[SCHED DEBUG] Thread {threadID} had unhandled exception without details")
                    raise threadHolder.thread.exc

            # Update scheduler state
            nAvailable += threadHolder.ppn
            for device in threadHolder.devices or set():
                old_load = sched.LOAD[device]
                sched.LOAD[device] -= threadHolder.load / threadHolder.ppn
                debug_log(f"[SCHED DEBUG] Device {device} load: {old_load:.3f} -> {sched.LOAD[device]:.3f}")

            progress.advance(complete_tracker)

            debug_log(f"[SCHED DEBUG] Thread {threadID} cleanup complete, removing from active threads")
            del threads[threadID]

            break

    debug_log(f"[SCHED DEBUG] join_first_dead_thread completed, {len(threads)} threads remaining")

with rich.progress.Progress(console=cons.raw, transient=True) as progress:
    queue_tracker    = progress.add_task("Queued   ", total=len(tasks))
    complete_tracker = progress.add_task("Completed", total=len(tasks))

    debug_log(f"[SCHED DEBUG] Starting task queue processing...")

    # Queue Tests
    for task_idx, task in enumerate(tasks):
        debug_log(f"[SCHED DEBUG] Processing task {task_idx+1}/{len(tasks)}: ppn={task.ppn}, load={task.load}")

        # Wait until there are threads available
        while nAvailable < task.ppn:
            debug_log(f"[SCHED DEBUG] Waiting for resources: need {task.ppn}, have {nAvailable}")

            # This is important if "-j 1" is used (the default) since there
            # are test cases that require test.ppn=2
            if task.ppn > nThreads and nAvailable > 0:
                debug_log(f"[SCHED DEBUG] Task requires more threads ({task.ppn}) than available ({nThreads}), but some are free ({nAvailable})")
                break

            # Keep track of threads that are done
            join_first_dead_thread(progress, complete_tracker)

            # Do not overwhelm this core with this loop
            time.sleep(0.05)

        # Launch Thread
        progress.advance(queue_tracker)

        use_devices = None
        # Use the least loaded devices
        if devices is not None:
            use_devices = set()
            debug_log(f"[SCHED DEBUG] Assigning GPU devices for task {task_idx+1}")
            for device_idx in range(task.ppn):
                device = min(sched.LOAD.items(), key=lambda x: x[1])[0]
                sched.LOAD[device] += task.load / task.ppn
                use_devices.add(device)
                debug_log(f"[SCHED DEBUG] Assigned device {device} (load now: {sched.LOAD[device]:.3f})")

        nAvailable -= task.ppn

        debug_log(f"[SCHED DEBUG] Starting thread for task {task_idx+1}, devices: {use_devices}")
        thread = WorkerThread(target=task.func, args=tuple(task.args) + (use_devices,))
        thread.start()

        threads.append(WorkerThreadHolder(thread, task.ppn, task.load, use_devices))
        debug_log(f"[SCHED DEBUG] Thread started for task {task_idx+1}, {len(threads)} active threads")

    debug_log(f"[SCHED DEBUG] All tasks queued, waiting for completion...")

    # Wait for the last tests to complete (MOVED INSIDE CONTEXT)
    while len(threads) != 0:
        debug_log(f"[SCHED DEBUG] Waiting for {len(threads)} threads to complete...")

        # Keep track of threads that are done
        join_first_dead_thread(progress, complete_tracker)

        # Do not overwhelm this core with this loop
        time.sleep(0.05)

    debug_log(f"[SCHED DEBUG] Scheduler completed successfully!")
Logic Error

The device assignment loop uses device_idx as its loop variable, but the variable is never used in the body, and the same device can be selected on multiple iterations (duplicates then collapse in the use_devices set). The original code used _ to signal that the loop variable was unused; renaming it to device_idx suggests a per-device selection that does not actually happen.

for device_idx in range(task.ppn):
    device = min(sched.LOAD.items(), key=lambda x: x[1])[0]
    sched.LOAD[device] += task.load / task.ppn
    use_devices.add(device)
    debug_log(f"[SCHED DEBUG] Assigned device {device} (load now: {sched.LOAD[device]:.3f})")
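
To make the quoted behavior concrete, here is a self-contained sketch of that greedy selection, assuming sched.LOAD is a plain dict mapping device id to its current load (the values, ppn, and load below are made up). Because LOAD is updated inside the loop, min() can move to a different device between iterations, but repeat picks of the same device collapse in the set:

# Standalone sketch of the greedy least-loaded selection quoted above.
# LOAD stands in for sched.LOAD; ppn and load are hypothetical task values.
LOAD = {0: 0.0, 1: 0.0}
ppn, load = 4, 1.0

use_devices = set()
for _ in range(ppn):  # the loop index itself is unused, hence _
    device = min(LOAD.items(), key=lambda x: x[1])[0]
    LOAD[device] += load / ppn
    use_devices.add(device)

print(use_devices)  # {0, 1}: four picks, but only two distinct devices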


qodo-merge-pro bot commented Jul 18, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category / Suggestion / Impact
Possible issue
Fix typo in debug flag
Suggestion Impact: The suggestion identified a typo in the debug flag, and the follow-up commit fixed it, though to --schedule-debug rather than the suggested --sched-debug

code diff:

-./mfc.sh test --max-attempts 3 -a --schedul-debug -j $n_test_threads $device_opts -- -c phoenix
+./mfc.sh test --max-attempts 3 -a --schedule-debug -j $n_test_threads $device_opts -- -c phoenix

The flag --schedul-debug contains a typo and should be --sched-debug to match
the argument definition in args.py. This typo will cause the debug flag to be
ignored or potentially cause an error.

.github/workflows/phoenix/test.sh [25]

-./mfc.sh test --max-attempts 3 -a --schedul-debug -j $n_test_threads $device_opts -- -c phoenix
+./mfc.sh test --max-attempts 3 -a --sched-debug -j $n_test_threads $device_opts -- -c phoenix
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a typo (--schedul-debug) that would prevent the newly added debug logging feature from being enabled in the CI script.

High
General
Verify argument name consistency

The CLI flag is defined as --sched-debug in args.py, and Python's argparse converts hyphens to underscores in the parsed namespace, so code should look the value up as sched_debug (as the current code already does).

toolchain/mfc/sched.py [50]

+debug_enabled = ARG("sched_debug", False)  # Check for --sched-debug flag
Suggestion importance[1-10]: 2


Why: The suggestion correctly observes that sched_debug is the right way to access the --sched-debug argument, but it proposes no change as the code is already correct, making it of very low impact.

Low

@sbryngelson marked this pull request as draft July 19, 2025 13:56