Cost-Efficient and Scalable GPU Scheduling Strategies in Multi-Tenant Cloud Environments for AI Workloads
Keywords:
GPU Scheduling, AI Workloads, Cloud Computing, Resource Allocation, Workload Prediction, Cost Optimization, Multi-Tenant Systems, Refactoring

Abstract
The rapid growth of artificial intelligence (AI) workloads has intensified contention for GPU resources in cloud computing environments, lowering utilization efficiency and raising operational costs. Multi-tenant platforms, where heterogeneous workloads share the same hardware, are particularly vulnerable to these resource bottlenecks. In this paper we present a versatile GPU scheduling framework that balances cost-effectiveness, performance isolation, and fairness across heterogeneous AI workloads. The framework dynamically optimizes GPU allocation using a multi-objective scheduling algorithm guided by machine learning-based workload prediction. To reduce GPU fragmentation, we combine automatic memory management, temporal multiplexing of training tasks, and spatial partitioning of inference tasks. In large-scale experiments with real-world workloads spanning computer vision, NLP, and scientific computing, the framework improves GPU utilization by 65 percent and reduces average job run time by 40 percent relative to FIFO baselines. It also cuts cloud infrastructure costs by 45 percent through predictive scaling, preemptible instances, and smart provisioning. Additional contributions include a fairness-aware scheduling policy, mixed-precision workload support, and new GPU defragmentation strategies. The results offer a practical path to scalable, cost-efficient, and tenant-aware GPU scheduling in cloud-native AI platforms, laying the groundwork for next-generation high-performance AI infrastructure accessible to a broad range of users and organizations.
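The abstract does not spell out the paper's actual algorithm, but the general shape of a multi-objective, fairness-aware placement score can be sketched as follows. All names, weights, and data structures below are illustrative assumptions, not the authors' implementation: each job carries a predicted resource demand (standing in for the paper's ML-based workload prediction), and each candidate GPU is scored on cost, memory-fit tightness (a proxy for fragmentation), and the tenant's current share of the cluster.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    predicted_gpu_hours: float  # assumed output of a workload-prediction model
    gpu_mem_gb: float           # requested GPU memory
    tenant: str

@dataclass
class Gpu:
    gpu_id: int
    free_mem_gb: float
    cost_per_hour: float        # e.g. cheaper preemptible vs. on-demand capacity

def score(job, gpu, tenant_share, w_cost=0.4, w_util=0.4, w_fair=0.2):
    """Multi-objective placement score (illustrative weights):
    lower projected cost, tighter memory fit, and a smaller current
    tenant share all raise the score."""
    if job.gpu_mem_gb > gpu.free_mem_gb:
        return float("-inf")                    # job does not fit on this GPU
    cost = job.predicted_gpu_hours * gpu.cost_per_hour
    fit = job.gpu_mem_gb / gpu.free_mem_gb      # near 1.0 means little leftover fragment
    fairness = 1.0 - tenant_share               # under-served tenants get priority
    return -w_cost * cost + w_util * fit + w_fair * fairness

def schedule(jobs, gpus, tenant_usage):
    """Greedy one-shot placement: each job goes to its best-scoring GPU."""
    total = sum(tenant_usage.values()) or 1.0
    placement = {}
    for job in jobs:
        share = tenant_usage.get(job.tenant, 0.0) / total
        best = max(gpus, key=lambda g: score(job, g, share))
        if score(job, best, share) > float("-inf"):
            placement[job.name] = best.gpu_id
            best.free_mem_gb -= job.gpu_mem_gb  # reserve memory for later jobs
    return placement
```

In this toy setting, a large training job is forced onto the only GPU with enough free memory, while a small inference job prefers the cheaper device; a production scheduler would of course add preemption, temporal multiplexing, and re-evaluation over time.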
License
Copyright (c) 2025 Gopi Kathiresan (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.