Syndeo: Secure, Portable Ray Clusters
Published 2024-09-01
High-performance computing (HPC) remains a critical enabler of modern AI research, yet the divergence between cloud-based and on-premises infrastructure has created a challenge for researchers and engineers. On one hand, cloud computing offers scalability and convenience, but at a significantly higher cost. On the other hand, on-premises clusters, often managed with Slurm, provide cost efficiency and control but lack seamless compatibility with modern AI-centric cloud tools like Ray. This is where Syndeo comes in.
Syndeo is an open-source software framework that allows developers to run Ray clusters on Slurm and other HPC environments with secure containerization. The goal is simple: write code once, deploy anywhere. By bridging the gap between cloud and on-premises compute, Syndeo enables researchers to leverage the best of both worlds.
- Read the paper: Syndeo: Portable Ray Clusters with Secure Containerization (IEEE HPEC 2024)
- Check the code: GitHub Repository
- Explore the documentation: Syndeo Docs
The Problem: Cloud vs. On-Premises HPC
Organizations often choose between two computing models:
- Cloud-based compute: Scalable, easy to deploy, but often prohibitively expensive for long-term use.
- On-premises HPC clusters: More cost-effective in the long run, but they require substantial maintenance and lack built-in support for modern cloud-native schedulers.
A core issue in this divide is scheduler incompatibility. Slurm, the dominant scheduler in academic and scientific HPC clusters, is fundamentally different from Ray, a cloud-native scheduler optimized for distributed AI workloads. Slurm was designed for multi-tenant, on-premises HPC, while Ray was built for single-tenant, cloud-based compute. Without a unified approach, moving workloads between the two is tedious and error-prone.
Syndeo: A Cross-Compatible, Secure Solution
Syndeo addresses three fundamental challenges in HPC computing: portability, scalability, and security.
- Portability: Syndeo containerizes Ray clusters using Apptainer (formerly Singularity), so workloads can be deployed unchanged across environments, whether on AWS, Azure, Google Cloud, or an on-premises cluster (a short sketch of what this looks like in user code follows this list).
- Scalability: It lets researchers deploy a Ray cluster inside a Slurm cluster, effectively running a scheduler within a scheduler. This configuration supports dynamic resource allocation while leveraging Slurm's efficient job scheduling.
- Security: Unlike Docker, whose daemon typically runs with root privileges and poses risks in multi-tenant environments, Syndeo uses Apptainer/Singularity containers, which run with the invoking user's privileges and enforce strict privilege separation.
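To make the portability point concrete, here is a minimal sketch of what user code looks like once a cluster is up, whether Syndeo provisioned it inside a Slurm allocation or it is running on a cloud provider. The script itself is illustrative rather than part of Syndeo; the only environment-specific detail is the address resolved by `ray.init`.

```python
import ray

# Connect to whatever Ray cluster is reachable from this node; "auto" resolves
# the head address from the environment, so the script is identical whether the
# cluster lives in the cloud or inside a Syndeo-provisioned Slurm allocation.
ray.init(address="auto")

@ray.remote
def square(x: int) -> int:
    return x * x

# The same task graph executes unchanged on either backend.
print(ray.get([square.remote(i) for i in range(8)]))
```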
How It Works
Syndeo enables Ray clusters to function within a Slurm-managed HPC environment through a four-phase deployment:
- Container Creation: Users build a containerized Ray environment using Apptainer/Singularity.
- Ray Head Setup: A Ray head node is launched inside Slurm, with networking configured to allow worker nodes to join.
- Ray Worker Addition: Worker nodes join the cluster dynamically, forming a self-contained Ray cluster.
- Execution & Management: Jobs are submitted to Ray, allowing users to leverage Ray’s advanced scheduling while still benefiting from Slurm’s HPC resources (see the sketch after this list).
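For the execution phase, the standard Ray Job Submission API is one way to hand work to the cluster once the head is up. The sketch below assumes the head node’s dashboard port (8265 by default) is reachable from the submitting machine; the host name and `train.py` entrypoint are placeholders, not values provided by Syndeo.

```python
from ray.job_submission import JobSubmissionClient

# The host below is a placeholder for whichever node Slurm assigned to the
# Ray head; 8265 is Ray's default dashboard/job-server port.
client = JobSubmissionClient("http://head-node.example:8265")

job_id = client.submit_job(
    entrypoint="python train.py",       # hypothetical training script
    runtime_env={"working_dir": "./"},  # ship the local project to the cluster
)
print(client.get_job_status(job_id))
```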
Once established, this setup allows cross-platform scheduling with minimal code changes. Researchers can run distributed AI training workloads, reinforcement learning simulations, and parallelized data analysis while maintaining performance, security, and cost efficiency.
Performance Benchmarks
Our experiments deploying Syndeo on an on-premises, Slurm-managed cluster demonstrated near-linear scaling across several reinforcement learning tasks. As we increased the number of CPU workers in a Ray cluster, we observed substantial throughput gains:
- A 420-worker Ray cluster showed 9x–12x throughput improvement compared to a 28-worker baseline.
- Certain workloads, like MuJoCo-based physics simulations, achieved near-ideal parallel efficiency, validating Syndeo’s ability to scale effectively.
However, as expected, increased communication costs in some environments (e.g., Humanoid simulations) led to diminishing returns beyond roughly 400 workers. This underscores the need for intelligent workload balancing, which Syndeo facilitates through Ray’s global object store and task dependency management.
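The object-store pattern referenced above is plain Ray rather than anything Syndeo-specific, but it is worth illustrating: placing a large read-only object in the store once and passing its reference to many tasks keeps per-task communication cost low. A minimal sketch, assuming a cluster is already reachable:

```python
import numpy as np
import ray

ray.init(address="auto")

# Store a large, read-only dataset in Ray's distributed object store once.
dataset_ref = ray.put(np.random.rand(10_000, 512))

@ray.remote
def evaluate(batch_id: int, data: np.ndarray) -> float:
    # Ray resolves the ObjectRef before the task runs, moving the data to a
    # node at most once instead of re-serializing it for every task.
    return float(data[batch_id % len(data)].sum())

# Pass the reference, not the array, to each of the many tasks.
results = ray.get([evaluate.remote(i, dataset_ref) for i in range(64)])
print(len(results))
```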
Why This Matters
The advent of AI-driven scientific computing requires solutions that unify cloud-native flexibility with on-premises efficiency. Syndeo is a critical step toward achieving this goal by allowing HPC users to seamlessly leverage modern AI scheduling frameworks without sacrificing security or portability.
For researchers, this means being able to develop and test AI models in the cloud, then deploy them at scale on cost-effective HPC clusters—without rewriting code. For industry practitioners, this means being able to integrate Slurm-based compute resources with cloud-based AI pipelines, ensuring maximum infrastructure utilization.