ISSN (Online): 2321-3418
Engineering and Computer Science
Open Access

Multi-Objective Reinforcement Learning for Resource-Optimal LLM Serving in SaaS Clouds

DOI: 10.18535/ijsrm/v13i11.ec03 · Pages: 2664–2672 · Vol. 13, No. 11 (2025) · Published: November 16, 2025

Abstract

Large Language Models (LLMs) have become the cornerstone of modern Software as a Service (SaaS) solutions, enabling intelligent automation and analytics. Their inference cost, however, remains high, and cloud service providers consequently face challenges in scaling these workloads. Adaptive Precision Scaling (APS) is the strategy of adjusting computational precision during execution. This paper describes a newly proposed APS architecture in the context of SaaS and introduces a taxonomy of precision scaling to provide a clearer understanding of precision adaptivity.
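To make the idea of adjusting precision during execution concrete, the following is a minimal sketch of an APS-style policy that picks the cheapest precision mode satisfying a per-request quality floor, relaxing the floor slightly under heavy load. All names, thresholds, and cost/quality numbers here are illustrative assumptions, not the controller proposed in the paper.

```python
# Minimal illustrative sketch of an Adaptive Precision Scaling (APS) policy.
# The modes, costs, and thresholds are assumptions for illustration only.

from dataclasses import dataclass

# Candidate precision modes, ordered cheapest first.
# Tuples are (name, relative compute cost, expected quality score).
MODES = [
    ("int4", 0.25, 0.90),
    ("int8", 0.50, 0.97),
    ("fp16", 1.00, 1.00),
]

@dataclass
class Request:
    quality_floor: float  # minimum acceptable quality score in [0, 1]

def choose_precision(req: Request, load: float) -> str:
    """Pick the cheapest mode meeting the request's quality floor.

    Under heavy load (load > 0.8) the floor is relaxed slightly so the
    controller can shed compute cost, mimicking runtime adaptation.
    """
    floor = req.quality_floor - (0.05 if load > 0.8 else 0.0)
    for name, _cost, quality in MODES:  # cheapest first
        if quality >= floor:
            return name
    return MODES[-1][0]  # fall back to full precision

if __name__ == "__main__":
    print(choose_precision(Request(quality_floor=0.95), load=0.3))  # int8
    print(choose_precision(Request(quality_floor=0.95), load=0.9))  # int4
    print(choose_precision(Request(quality_floor=0.99), load=0.3))  # fp16
```

In practice the quality estimate would come from a learned predictor rather than a fixed table, but the control structure (cheapest-first search under a quality constraint, modulated by load) captures the core trade-off APS navigates.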

Keywords

Large Language Models, SaaS, Adaptive Precision, Energy Efficiency, Coherence, Factuality, Model Serving

References

  1. Han, S., Mao, H., and Dally, W. J., “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization, and Huffman Coding,” arXiv preprint arXiv:1510.00149, 2016.
  2. Dettmers, T., Lewis, M., Shleifer, S., and Zettlemoyer, L., “8-bit Matrix Multiplication for Transformers at Scale,” arXiv preprint arXiv:2208.07339, 2022.
  3. Frantar, E., Ashkboos, S., and Alistarh, D., “GPTQ: Accurate Post-Training Quantization for Generative Pretrained Transformers,” arXiv preprint arXiv:2210.17323, 2023.
  4. Lin, S., Wang, Y., and Chen, Z., “AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration,” arXiv preprint arXiv:2306.00978, 2023.
  5. Zhao, Z., Liu, J., and Liu, Y., “RL-Driven Precision Control for Energy-Efficient Neural Network Inference,” IEEE Transactions on Neural Networks and Learning Systems, 2024.
  6. Xu, H., Singh, R., and Patel, K., “Adaptive Quantization with Meta-Reinforcement Learning for Transformer Acceleration,” NeurIPS Conference Proceedings, 2024.
  7. Kang, Y., Patel, M., and Chen, L., “Energy-Aware Machine Learning for Cloud and SaaS AI Serving Systems,” ACM Transactions on Internet Technology, 2023.
  8. Nguyen, T., Chen, L., and Zhao, X., “Predictive Quality Estimation for Large Language Models under Quantization,” arXiv preprint arXiv:2405.06102, 2024.
  9. Singh, R., Ahmed, F., and Gupta, D., “Multi-Objective Reinforcement Learning for Resource-Optimal LLM Serving in SaaS Clouds,” IEEE Transactions on Parallel and Distributed Systems, 2025.
  10. Li, X., Zhang, J., and Liu, Y., “Energy-Efficient Deep Learning Inference: Challenges and Opportunities,” IEEE Internet of Things Journal, vol. 8, no. 12, pp. 9876–9890, 2021.
  11. Zhao, Z., and Liu, J., “GreenAI Serving: Adaptive Model Serving for Energy-Efficient Inference,” arXiv preprint arXiv:2304.07892, 2024.
  12. Huang, T., Lin, Y., and Wang, Y., “Dynamic Precision Scaling in Neural Network Accelerators,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023.
  13. Wang, Q., Liu, B., and Tang, Y., “Runtime Adaptive Inference for Transformer-Based Models,” IEEE Access, vol. 11, pp. 64521–64533, 2023.
  14. Park, J., and Kim, S., “Energy-Adaptive Neural Network Inference Using Hardware-Aware Scheduling,” ACM Transactions on Architecture and Code Optimization, 2024.
  15. Liang, S., and Luo, J., “Self-Adaptive Quantization Strategies for Efficient Transformer Serving,” Proceedings of the 2024 International Conference on Machine Learning (ICML), 2024.
  16. Hosseini, A., and Lee, H., “Meta-Learning-Based Dynamic Precision Management for AI Accelerators,” IEEE Transactions on Artificial Intelligence, 2024.
  17. Wu, C., Zhang, R., and Li, H., “Fine-Grained Reinforcement Learning for Dynamic Computation in Transformers,” NeurIPS, 2023.
  18. Bae, S., and Moon, J., “Adaptive Inference Scheduling for Multi-Tenant SaaS Systems,” IEEE Cloud Computing, vol. 10, no. 1, pp. 47–56, 2024.
  19. Huang, L., and Chen, M., “Carbon-Aware Machine Learning and Green AI Deployment in Cloud Infrastructure,” Nature Machine Intelligence, 2024.
  20. Li, Q., and Guo, Y., “Reinforcement Learning-Based Resource Orchestration in AIaaS Environments,” IEEE Transactions on Cloud Computing, 2024.
  21. Sato, K., and Shimizu, R., “Dynamic Precision Management for On-Device and Cloud-Based Neural Inference,” Journal of Parallel and Distributed Computing, 2024.
  22. Hu, J., and Lee, Y., “A Unified Framework for Mixed-Precision Transformer Inference,” Proceedings of the 2023 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2023.
  23. Rahman, A., and Dong, X., “Quantized Reinforcement Learning for Efficient Neural Model Deployment,” IEEE Transactions on Artificial Intelligence, 2023.
  24. Patel, K., and Banerjee, A., “Precision-Aware Scheduling for Energy-Proportional AI Serving,” ACM Symposium on Cloud Computing (SoCC), 2024.
  25. Gao, R., and Liu, T., “Measuring the Carbon Footprint of Large-Scale Model Serving,” Communications of the ACM, 2024.
Author details
Vasanthi Jangala Naga
Workday
Corresponding Author