Title
Scanflow-K8s: Agent-based Framework for Autonomic Management and Supervision of ML Workflows in Kubernetes Clusters
Date Issued
01 January 2022
Access level
metadata only access
Resource Type
conference paper
Author(s)
Liu P.
Guitart J.
Dholakia A.
Ellison D.
Hodak M.
Universitat Politècnica de Catalunya
Publisher(s)
Institute of Electrical and Electronics Engineers Inc.
Abstract
Machine Learning (ML) projects are currently heavily based on workflows composed of reproducible steps and executed as containerized pipelines to build or deploy ML models efficiently, because of the flexibility, portability, and fast delivery they provide to the ML life-cycle. However, deployed models need to be continuously monitored, managed, supervised, and debugged to guarantee their availability, validity, and robustness in unexpected situations. Therefore, containerized ML workflows would benefit from flexible and diverse autonomic capabilities. This work presents an architecture for autonomic ML workflows with multi-layered control capabilities. It is based on an agent-based approach that enables autonomic management and supervision of ML workflows at the application layer and at the infrastructure layer (by collaborating with the orchestrator). We redesign the Scanflow ML framework to support such a multi-agent approach by using triggers, primitives, and strategies. We also implement a practical platform, called Scanflow-K8s, that enables autonomic ML workflows on Kubernetes clusters based on the Scanflow agents. MNIST image classification and MLPerf ImageNet classification benchmarks are used as case studies to show the capabilities of Scanflow-K8s under different scenarios. The experimental results demonstrate the feasibility and effectiveness of our proposed agent-based approach and of the Scanflow-K8s platform for the autonomic management of ML workflows in Kubernetes clusters at multiple layers.
Start page
376
End page
385
Language
English
OECD Knowledge area
Electrical engineering, Electronic engineering, Computer sciences
Scopus EID
2-s2.0-85135763990
ISBN
9781665499569
Resource of which it is part
Proceedings - 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2022
ISBN of the container
978-1-6654-9956-9
Funding source
Generalitat de Catalunya
Sources of information: Directorio de Producción Científica Scopus