Cost-Efficient and Scalable GPU Scheduling Strategies in Multi-Tenant Cloud Environments for AI Workloads
Keywords:
GPU Scheduling, AI Workloads, Cloud Computing, Resource Allocation, Workload Prediction, Cost Optimization, Multi-Tenant Systems, Refactoring

Abstract
The rapid growth of artificial intelligence (AI) workloads has intensified contention for GPU resources in cloud computing environments, lowering utilization efficiency and raising operational costs. Multi-tenant platforms, where heterogeneous workloads share the same hardware, are particularly vulnerable to these resource bottlenecks. In this paper we present a versatile GPU scheduling framework that balances cost-effectiveness, performance isolation, and fairness across heterogeneous AI workloads. The framework dynamically optimizes GPU allocation using a multi-objective scheduling algorithm guided by machine learning-based workload prediction. To reduce GPU fragmentation, we combine automatic memory management, temporal multiplexing of training tasks, and spatial partitioning of inference tasks. In large-scale experiments with real-world workloads spanning computer vision, NLP, and scientific computing, the framework improves GPU utilization by 65 percent and reduces average job run time by 40 percent relative to FIFO baselines. It also cuts cloud infrastructure costs by 45 percent through predictive scaling, preemptible instances, and smart provisioning. Additional contributions include a fairness-aware scheduling policy, mixed-precision workload support, and new GPU defragmentation strategies. The results offer a practical path to scalable, cost-efficient, and tenant-aware GPU scheduling in cloud-native AI platforms, laying the groundwork for next-generation high-performance AI infrastructure accessible to a broad range of users and organizations.
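The abstract does not spell out the paper's actual algorithm, but the general shape of a multi-objective, fairness-aware placement score can be sketched as follows. All names, weights, and data structures below are illustrative assumptions, not the authors' implementation: each job carries a predicted resource demand (standing in for the paper's ML-based workload prediction), and each candidate GPU is scored on cost, memory-fit tightness (a proxy for fragmentation), and the tenant's current share of the cluster.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    predicted_gpu_hours: float  # assumed output of a workload-prediction model
    gpu_mem_gb: float           # requested GPU memory
    tenant: str

@dataclass
class Gpu:
    gpu_id: int
    free_mem_gb: float
    cost_per_hour: float        # e.g. cheaper preemptible vs. on-demand capacity

def score(job, gpu, tenant_share, w_cost=0.4, w_util=0.4, w_fair=0.2):
    """Multi-objective placement score (illustrative weights):
    lower projected cost, tighter memory fit, and a smaller current
    tenant share all raise the score."""
    if job.gpu_mem_gb > gpu.free_mem_gb:
        return float("-inf")                    # job does not fit on this GPU
    cost = job.predicted_gpu_hours * gpu.cost_per_hour
    fit = job.gpu_mem_gb / gpu.free_mem_gb      # near 1.0 means little leftover fragment
    fairness = 1.0 - tenant_share               # under-served tenants get priority
    return -w_cost * cost + w_util * fit + w_fair * fairness

def schedule(jobs, gpus, tenant_usage):
    """Greedy one-shot placement: each job goes to its best-scoring GPU."""
    total = sum(tenant_usage.values()) or 1.0
    placement = {}
    for job in jobs:
        share = tenant_usage.get(job.tenant, 0.0) / total
        best = max(gpus, key=lambda g: score(job, g, share))
        if score(job, best, share) > float("-inf"):
            placement[job.name] = best.gpu_id
            best.free_mem_gb -= job.gpu_mem_gb  # reserve memory for later jobs
    return placement
```

In this toy setting, a large training job is forced onto the only GPU with enough free memory, while a small inference job prefers the cheaper device; a production scheduler would of course add preemption, temporal multiplexing, and re-evaluation over time.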
License
Copyright (c) 2025 Gopi Kathiresan (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.