Generating MFC Images and Testing Them on OSPool #935

Draft
wants to merge 68 commits into
base: master
from

Conversation

Malmahrouqi3
Collaborator

@Malmahrouqi3 Malmahrouqi3 commented Jul 11, 2025

User description

Description

Concerning #654:
This PR generates four images: CPU, CPU_Benchmark, GPU, and GPU_Benchmark. All MFC builds run on a GitHub runner, while testing and storage of the latest images take place on OSPool. The images contain a pre-built MFC with its packages pre-installed, so CI can retrieve them and use them with simple commands.

Debugging info:
To generate an image locally: apptainer build mfc_cpu.sif Singularity.cpu
To start a shell instance: apptainer shell --fakeroot --writable-tmpfs mfc_cpu.sif
To execute specific commands directly: apptainer exec --fakeroot --writable-tmpfs mfc_cpu.sif /bin/bash -c 'cd /opt/MFC && ./mfc.sh test -a'
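
The local build command above can be repeated for all four definition files added in this PR. This is a minimal sketch, not the workflow's actual script: only `mfc_cpu.sif` is named in the description, so the other output names are assumed to follow the same pattern, and `DRY_RUN` is a hypothetical convenience variable that prints each command instead of running it.

```shell
#!/usr/bin/env bash
# Sketch: build one image per definition file added in this PR.
# Assumption: output names other than mfc_cpu.sif follow the mfc_<variant>.sif pattern.
build_images() {
    local variant
    for variant in cpu cpu_bench gpu gpu_bench; do
        # With DRY_RUN set (hypothetical helper variable), echo the command instead of executing it.
        ${DRY_RUN:+echo} apptainer build "mfc_${variant}.sif" "Singularity.${variant}" || return 1
    done
}
```

Running `DRY_RUN=1 build_images` from the directory that holds the definition files lists the four commands without building anything; plain `build_images` stops at the first failed build.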

To-dos:

  • Proper packages and base container for each recipe.
  • HTCondor test script that requests specific allocations per image.
  • Sanity-check the images on various resources/clusters.
  • Maintainer-triggered builds if needed; otherwise, only the most recent images will be hosted.

Note to self: the current secrets are hosted in the fork; before merging, new dedicated ones should be added to the base repo. To do so, request an access point under the "GATech_Bryngelson" project, then upload a public SSH key to https://registry.cilogon.org/. Afterwards, update the secrets, which include the private SSH key and user@host.

References
NVIDIA Container


PR Type

Other


Description

  • Remove existing CI workflows and testing infrastructure

  • Add Singularity container image building workflow

  • Create four container definitions for CPU/GPU variants

  • Implement automated image building and testing on OSPool


Changes diagram

flowchart LR
  A["Old CI Workflows"] -- "removed" --> B["Deleted Files"]
  C["New Container Workflow"] -- "builds" --> D["Singularity Images"]
  D -- "stores on" --> E["OSPool"]
  F["Container Definitions"] -- "defines" --> G["CPU/GPU Variants"]

Changes walkthrough 📝

Relevant files

Miscellaneous (17 files)
  • build.sh: Remove Frontier build script (+0/-9)
  • submit.sh: Remove Frontier job submission script (+0/-56)
  • test.sh: Remove Frontier test script (+0/-10)
  • bench.sh: Remove Phoenix benchmark script (+0/-20)
  • submit-bench.sh: Remove Phoenix benchmark submission script (+0/-64)
  • submit.sh: Remove Phoenix job submission script (+0/-64)
  • test.sh: Remove Phoenix test script (+0/-21)
  • bench.yml: Remove benchmark workflow (+0/-68)
  • cleanliness.yml: Remove code cleanliness workflow (+0/-127)
  • coverage.yml: Remove coverage check workflow (+0/-48)
  • docs.yml: Remove documentation workflow (+0/-76)
  • formatting.yml: Remove formatting check workflow (+0/-19)
  • line-count.yml: Remove line count workflow (+0/-54)
  • lint-source.yml: Remove source linting workflow (+0/-51)
  • lint-toolchain.yml: Remove toolchain linting workflow (+0/-17)
  • spelling.yml: Remove spell check workflow (+0/-17)
  • test.yml: Remove main test suite workflow (+0/-131)

Enhancement (5 files)
  • container-image.yml: Add Singularity image building workflow (+63/-0)
  • Singularity.cpu: Add CPU container definition (+24/-0)
  • Singularity.cpu_bench: Add CPU benchmark container definition (+27/-0)
  • Singularity.gpu: Add GPU container definition (+34/-0)
  • Singularity.gpu_bench: Add GPU benchmark container definition (+32/-0)

  • @Malmahrouqi3
    Collaborator Author

    Malmahrouqi3 commented Jul 11, 2025

    As of right now, I rely solely on ./mfc.sh test -a in all HTCondor jobs to verify that the images work, and I will look into potential failure modes. The CPU_Benchmark and GPU_Benchmark images are identical to the standard versions until I receive further instructions on their dedicated use. I was also thinking of running them on Phoenix as another testbed, since retrieving the images takes only a few moments and avoids the overhead of a full MFC build and compile.

    The requested resources are quite excessive for now; I will reduce them later so that jobs do not get stuck in the queue.

    @sbryngelson
    Member

    Grab the new workflow files from master and you can start doing CI again. You may need to merge in any changes you made. lmk if you have questions.

    @Malmahrouqi3
    Collaborator Author

    Malmahrouqi3 commented Jul 13, 2025

    Status update: I ran into SSH connectivity problems, whether using SSH keys (public/private) or credentials (ssh user@host). The access point intermittently denies access without any apparent cause. If the issue persists, I will contact OSPool support. In the meantime, I will improve the batch job requirements.

    Host key verification failed.
    scp: Connection closed
    Process completed with exit code 255.
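
On a fresh CI runner, one common cause of `Host key verification failed` is that the remote host's key is missing from `known_hosts`. Whether that is what happens here is an assumption, not something this thread confirms. Under that assumption, a minimal sketch of recording the key before the first `scp` call could look like this. The function name `trust_host` and the `KEYSCAN` override are hypothetical; `ssh-keyscan -H` is the standard OpenSSH tool.

```shell
# Sketch: record a host's public keys before the first scp/ssh call.
# Assumption: the failure comes from a missing known_hosts entry on the runner.
trust_host() {
    local host="$1"
    local known_hosts="${2:-$HOME/.ssh/known_hosts}"
    local keys

    mkdir -p "$(dirname "$known_hosts")"

    # KEYSCAN (hypothetical override) allows offline testing; it defaults to ssh-keyscan.
    keys="$("${KEYSCAN:-ssh-keyscan}" -H "$host" 2>/dev/null)"
    if [ -z "$keys" ]; then
        echo "no host keys returned for $host" >&2
        return 1
    fi

    # Append the hashed host entries so later ssh/scp calls can verify the host.
    printf '%s\n' "$keys" >> "$known_hosts"
}
```

Scanning keys at run time trusts whatever host answers first. Storing the access point's known public host key alongside the other secrets and writing it to `known_hosts` directly is the stricter alternative.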
    

    Edit 1: I am going to ask how to ensure that each job instance runs on a distinct cluster, i.e., 5-10 instances of a single job would run on 5-10 unique clusters, which increases the chance of exposing failures.

    Edit 2: The batch job specs below more or less prevent job instances from running concurrently on the same machine/cluster.

    requirements = (HAS_SINGULARITY =?= TRUE) && (Arch == "X86_64")
    rank = -SlotID - 100 * JobsOnMachine
    concurrency_limits = mfc_distributed:1
    +ConcurrencyLimits = "mfc_distributed"
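
For context, a per-image submit file built around the `requirements` line above could be generated as sketched here. Everything except that `requirements` expression is a placeholder rather than a value from this PR: the resource requests, the `.sub` file name, the instance count, and the `run_test.sh` wrapper (assumed to run `./mfc.sh test -a` inside the image) are all hypothetical.

```shell
# Sketch: write an HTCondor submit file for one image variant.
# Resource requests, file names, and the run_test.sh wrapper are placeholders, not values from this PR.
write_submit_file() {
    local variant="$1" instances="${2:-5}"
    local gpu_line=""

    # Only GPU variants ask for a GPU slot.
    case "$variant" in
        gpu*) gpu_line="request_gpus   = 1" ;;
    esac

    # \$(Cluster) and \$(Process) are escaped so HTCondor, not the shell, expands them.
    cat > "mfc_${variant}.sub" <<EOF
universe       = vanilla
executable     = run_test.sh
arguments      = mfc_${variant}.sif
transfer_input_files = mfc_${variant}.sif
requirements   = (HAS_SINGULARITY =?= TRUE) && (Arch == "X86_64")
request_cpus   = 4
request_memory = 8GB
request_disk   = 10GB
${gpu_line}
log            = mfc_${variant}_\$(Cluster)_\$(Process).log
output         = mfc_${variant}_\$(Cluster)_\$(Process).out
error          = mfc_${variant}_\$(Cluster)_\$(Process).err
queue ${instances}
EOF
}
```

For example, `write_submit_file gpu 3` writes `mfc_gpu.sub` with three queued instances, which could then be passed to `condor_submit`.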
    

    Edit 3: Use grep "GLIDEIN_ResourceName" mfc_cpu_*.log to print all allocated machines.
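
To see at a glance how many instances landed on each resource, the same grep can be piped through `sort` and `uniq`. This is a small sketch; it assumes the matching line is identical for every job that ran on the same resource, which depends on the log format. The function name `count_resources` is hypothetical.

```shell
# Sketch: count how many matching log lines name each distinct resource.
# Assumption: jobs on the same resource produce identical GLIDEIN_ResourceName lines.
count_resources() {
    # -h drops the file-name prefix so identical lines from different logs collapse together.
    grep -h "GLIDEIN_ResourceName" "$@" | sort | uniq -c | sort -rn
}
```

Calling `count_resources mfc_cpu_*.log` prints one row per distinct line, most frequent first; a single row means every instance ran on the same resource.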

    Edit 4: Requesting a distinct host OS for CPU cases and distinct GPU compatibility for GPU cases would ensure that each job instance lands on a unique cluster, but the queue time would be too long.

    @sbryngelson sbryngelson marked this pull request as draft July 14, 2025 07:04