Training a TFNO with Navier-Stokes, with FLOPs count #583

Open
wants to merge 3 commits into
base: main


@ML4SC ML4SC commented Apr 19, 2025

I've been working with TFNO models and recently developed a script that demonstrates model performance along with FLOPs analysis for both forward and backward passes.

I'd like to contribute to the NeuralOperator project by developing a training example and accompanying documentation that:

  • Demonstrates TFNO performance on the 2D Navier-Stokes equations
  • Includes FLOPs profiling for model introspection and optimization
  • Trains TFNO on multiple GPUs, with ongoing work to optimize communication loops between GPUs
  • Discusses strategies for efficient CPU–GPU communication during training

Please let me know if this would be a valuable addition to the project — I've opened a PR and would greatly appreciate any feedback as I iterate.

@JeanKossaifi
Member

JeanKossaifi commented Apr 20, 2025 via email

@ML4SC
Author

ML4SC commented May 2, 2025

Hi Jean,

Thank you for your quick reply. I’ve used torch.profiler to record memory usage and kernel activity on a per-epoch basis. For CPU–GPU transfers, I’ve enabled pin_memory=True and non_blocking=True and set up asynchronous data loading to handle larger batch volumes.
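
For readers following along, the setup described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from this PR: the synthetic tensors and the single `Conv2d` layer are stand-ins for the Navier-Stokes data and the TFNO, and pinned memory is only enabled when a CUDA device is present so the snippet also runs on CPU.

```python
import torch
from torch.profiler import ProfilerActivity, profile
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Synthetic stand-in data (assumption): 32 samples of a 1-channel 16x16 field.
dataset = TensorDataset(torch.randn(32, 1, 16, 16), torch.randn(32, 1, 16, 16))

# Pinned host memory is what allows non_blocking copies to overlap with compute.
loader = DataLoader(dataset, batch_size=8, pin_memory=use_cuda)

# Stand-in model (assumption): one convolution instead of a TFNO.
model = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

activities = [ProfilerActivity.CPU]
if use_cuda:
    activities.append(ProfilerActivity.CUDA)

# Profile one pass over the data, recording memory usage alongside op timings.
with profile(activities=activities, profile_memory=True) as prof:
    for x, y in loader:
        # Asynchronous host-to-device copy (a plain copy when running on CPU).
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

summary = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(summary)
```

Wrapping each epoch in its own `profile` context would give the per-epoch breakdown mentioned above.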

When working with very high-resolution data, I’m exploring a distributed streaming approach, but I haven’t yet found any existing functionality for that in the NeuralOperator codebase. If I’ve overlooked something, could you point me to the relevant module or function? Otherwise, any guidance on where to start implementing distributed data streaming would be greatly appreciated.

Thanks again for your help!

Best,
Natalie

@ML4SC
Author

ML4SC commented May 2, 2025

Meanwhile, I’d be grateful for any feedback or suggestions you have on my TFNO example using the Navier–Stokes dataset.

@JeanKossaifi
Member

Thank you @ML4SC - the example looks good. Did you get to try building the docs and checking the result?
The training script should probably be in scripts, though I'm not sure whether it is needed given the existing training script. What do you think, @dhpitt?


2 participants