Dan Sun, Bloomberg
Theofilos Papapanagiotou, AWS
The State and Future of
Cloud Native ML Model Serving
Productionizing AI Models Is Challenging
“Launching AI application pilots is deceptively easy, but
deploying them into production is notoriously challenging.”
[Diagram: an inference request arrives through a REST/gRPC load balancer, goes through pre-processing (feature extraction from a feature store, image/text preprocessing) to form the model input, and the model output goes through post-processing to produce the inference response. Surrounding production concerns: scalability, security, model store, reproducibility/portability, observability.]
Why Is Kubernetes a Great Platform for Serving Models?
● Microservices: Kubernetes handles container orchestration and facilitates
the deployment of microservices with resource management and load
balancing.
● Reproducibility/Portability: ML model deployments can be written
once, reproduced, and run with declarative YAML in a cloud-agnostic
way.
● Scalability: Kubernetes provides horizontal scaling for both CPU and
GPU workloads.
● Fault tolerance: Kubernetes detects and recovers from container failures,
making deployments more resilient to outages and minimizing downtime.
What’s KServe?
● KServe is a highly scalable and standards-based cloud-native
model inference platform on Kubernetes for Trusted AI that
encapsulates the complexity of deploying models to production.
● KServe can be deployed standalone or as an add-on component with
Kubeflow, in cloud or on-premises environments.
KServe History
[Timeline, 2019–2022: NVIDIA contributed the Inference Protocol; IBM contributed ModelMesh.]
The Current State
● Features supported as of the KServe 0.10 release

Core Inference
● Transformer/Predictor
● Serving Runtimes
● Custom Runtime SDK
● Open Inference Protocol
● Serverless Autoscaling
● Cloud/PVC Storage

Advanced Inference
● ModelMesh for Multi-Model Serving
● Inference Graph
● Payload Logging
● Request Batching
● Canary Rollout

Model Explainability & Monitoring
● Text, Image, Tabular Explainer
● Bias Detector
● Adversarial Detector
● Outlier Detector
● Drift Detector
Serving Runtimes

InferenceService (for user)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-sklearn-isvc
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
        version: 1.0
      storageUri: s3://bucket/sklearn/mnist.joblib
      runtime: kserve-sklearnserver # optional

Serving Runtime (for KServe admin)

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-sklearnserver
spec:
  supportedModelFormats:
    - name: sklearn
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: kserve/sklearnserver:latest
      args:
        - --model_name={{.Name}}
        - --model_dir=/mnt/models
        - --http_port=8080
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
● The ModelSpec specifies the model format and version of the trained model.
● KServe automatically selects a serving runtime that supports the given
model format and instantiates the deployment from it.
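For completeness, a minimal sketch of creating the same InferenceService programmatically with the KServe Python SDK (assuming the kserve package and its generated v1beta1 client models; the bucket and names are the placeholders from the YAML above):

from kserve import (KServeClient, V1beta1InferenceService,
                    V1beta1InferenceServiceSpec, V1beta1ModelFormat,
                    V1beta1ModelSpec, V1beta1PredictorSpec)
from kubernetes.client import V1ObjectMeta

# Equivalent of the YAML above: a sklearn predictor whose serving runtime
# is auto-selected from the declared model format.
isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=V1ObjectMeta(name="example-sklearn-isvc", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            model=V1beta1ModelSpec(
                model_format=V1beta1ModelFormat(name="sklearn"),
                storage_uri="s3://bucket/sklearn/mnist.joblib",
            )
        )
    ),
)

KServeClient().create(isvc)  # submits the custom resource to the cluster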
Serving Runtime Support Matrix
[Table: serving runtimes — MLServer (open), Triton (open), TorchServe (v1, open), KServe Runtime (v1, open), TFServing (v1), where "open" denotes the Open Inference Protocol and "v1" the KServe v1 protocol — against the model formats scikit-learn, xgboost, lightgbm, TensorFlow, PyTorch, TorchScript, ONNX, MLFlow, and Custom.]
Open Inference Protocol
● The Open Inference Protocol enables a standardized, high-performance
inference data plane.
● It allows interoperability between serving runtimes.
● It allows building clients and benchmarking tools that work with a wide
range of serving runtimes.
● It is implemented by KServe, MLServer, Triton, TorchServe, OpenVINO,
and the AMD Inference Server.
https://github.com/kserve/open-inference-protocol
Open Inference Protocol
● REST vs. gRPC: ease of use vs. high performance
REST gRPC
GET v2/health/live rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse)
GET v2/health/ready rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse)
GET v2/models/{model_name}/ready rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse)
GET v2/models/{model_name} rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse)
POST v2/models/{model_name}/infer rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse)
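For illustration, a minimal REST client against these endpoints (a sketch assuming Python's requests package; the host, model name, and input name are hypothetical placeholders):

import requests

BASE = "http://example-sklearn-isvc.default.example.com"  # hypothetical host
MODEL = "example-sklearn-isvc"

# Health and readiness endpoints from the v2 data plane.
assert requests.get(f"{BASE}/v2/health/live").ok
assert requests.get(f"{BASE}/v2/models/{MODEL}/ready").ok

# Standardized inference request: named tensors with shape/datatype/data.
payload = {
    "inputs": [
        {"name": "input-0", "shape": [1, 4], "datatype": "FP32",
         "data": [[6.8, 2.8, 4.8, 1.4]]}
    ]
}
resp = requests.post(f"{BASE}/v2/models/{MODEL}/infer", json=payload)
print(resp.json()["outputs"])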
Open Inference Protocol Codec
● Single input vs. multiple inputs with the numpy/pandas codecs

Numpy codec: a purely numeric set of features (f1=1.0, f2=2.0, f3=3.0, i.e. array rows such as [1.0, 2.0, 3.0] and [4.0, 5.0, 6.0]) is encoded as a single tensor input:

{
  "inputs": [
    {
      "name": "tensor",
      "shape": [1, 3],
      "datatype": "FP32",
      "data": [[1.0, 2.0, 3.0]]
    }
  ]
}

Pandas codec: a DataFrame with mixed-type columns (f1 float, f2 int, f3 string; rows such as [1.0, 2, "foo"] and [2.0, 4, "bar"]) is encoded as one named input per column:

{
  "inputs": [
    {
      "name": "f1",
      "shape": [1, 2],
      "datatype": "FP32",
      "data": [[1.0, 2.0]]
    },
    {
      "name": "f3",
      "shape": [1, 2],
      "datatype": "BYTES",
      "data": [["foo", "bar"]]
    }
  ]
}
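A hand-rolled sketch of the pandas-codec idea (illustrative only, not MLServer's actual codec implementation; the shape convention of one column vector per input is an assumption):

import pandas as pd

def dataframe_to_inputs(df: pd.DataFrame) -> dict:
    """Encode each DataFrame column as one named v2-protocol input."""
    inputs = []
    for col in df.columns:
        if df[col].dtype == object:        # strings -> BYTES
            datatype = "BYTES"
        elif df[col].dtype.kind == "f":    # floats -> FP32
            datatype = "FP32"
        else:                              # integers -> INT64
            datatype = "INT64"
        inputs.append({
            "name": col,
            "shape": [len(df), 1],
            "datatype": datatype,
            "data": df[col].tolist(),
        })
    return {"inputs": inputs}

df = pd.DataFrame({"f1": [1.0, 2.0], "f2": [2, 4], "f3": ["foo", "bar"]})
print(dataframe_to_inputs(df))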
Inference Service Deployment

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-feast-transformer"
spec:
  transformer:
    containers:
      - image: kserve/driver-transformer:latest
        name: kserve-container
        command:
          - "python"
          - "-m"
          - "driver_transformer"
        args:
          - --entity_ids
          - driver_id
          - --feature_refs
          - driver_hourly_stats:acc_rate
          - driver_hourly_stats:avg_daily_trips
          - driver_hourly_stats:conv_rate
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://pv/driver"

https://kserve.github.io/website/0.10/modelserving/v1beta1/transformer/feast/
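A stripped-down sketch of a custom transformer like the driver-transformer above, written with the KServe Python SDK (assumptions: KServe 0.10 method signatures; the Feast feature lookup is elided and the instances are passed through unchanged):

import argparse
import kserve

class DriverTransformer(kserve.Model):
    """Enrich the request before forwarding it to the predictor."""

    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        self.predictor_host = predictor_host  # predictor to forward to
        self.ready = True

    def preprocess(self, inputs: dict, headers: dict = None) -> dict:
        # The real example fetches driver_hourly_stats features from Feast
        # for each entity id; here we simply pass the instances through.
        return {"instances": inputs["instances"]}

    def postprocess(self, response: dict, headers: dict = None) -> dict:
        return response

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--predictor_host", required=True)
    parser.add_argument("--model_name", default="sklearn-feast-transformer")
    args, _ = parser.parse_known_args()
    model = DriverTransformer(args.model_name, args.predictor_host)
    kserve.ModelServer().start([model])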
Model Storage Patterns
● The storage initializer runs as an init container to download the model
before the KServe container starts (model mounted at /mnt/models).
● The model puller runs as a sidecar to pull models on demand in ModelMesh
(models laid out under /models, e.g. model1/model1.bin, model2/model2.bin,
model3/model3.bin).
Multi-Model Serving
● A single model per container usually under-utilizes the resources.
● You can share container resources by deploying multiple models together,
especially for GPU sharing.
● When scaling up, the same set of models needs to be scaled out together.
[Diagram: two InferenceService replicas, each packing model1, model2, and model3 onto one GPU.]
ModelMesh
● Models can be scaled independently while sharing container resources.
● ModelMesh is a scale-out layer for model servers that can load and serve
multiple models concurrently, using the same KServe InferenceService API.
● Designed for high-volume, high-density, high-churn production serving, it
has powered IBM production AI services for a number of years and manages
thousands of models.
● Just as Kubernetes schedules pods onto nodes, ModelMesh schedules models
onto containers.
Inference Graph
● Inference Graph is built for more complex multi-stage inference pipelines.
● Inference Graph is deployed declaratively and is highly scalable.
● Inference Graph supports Sequence, Switch, Ensemble, and Splitter nodes.
● Inference Graph is highly composable: it is made up of a list of routing nodes, and each node
consists of a set of routing steps, each of which routes to either an InferenceService or another node.

apiVersion: "serving.kserve.io/v1alpha1"
kind: "InferenceGraph"
metadata:
  name: "dog-breed-pipeline"
spec:
  nodes:
    root:
      routerType: Sequence
      steps:
        - serviceName: cat-dog-classifier
          name: cat_dog_classifier # step name
        - serviceName: dog-breed-classifier
          name: dog_breed_classifier
          data: $request
          condition: '[@this].#(predictions.0=="dog")'
KServe Cloud Native Ecosystem
● KServe integrates seamlessly with many CNCF projects to give production
model deployments security, observability, and serverless capabilities.
Serverless Inference
● Autoscaling based on incoming request RPS or concurrency.
● Unified autoscaler for both CPU and GPU workloads.
[Diagram: with container concurrency = 1 and 2 concurrent requests, the queue-proxy buffers the extra request in front of the kserve-container.]
KnativeCon talk 2022
ServiceMesh: Secure Inference Service
● The Istio control plane mounts client certificates into the sidecar proxies,
enabling automatic mutual authentication between the transformer and the
predictor and encrypting service traffic.
● Authorization can be implemented with Istio Authorization Policy or Open Policy Agent.
[Diagram: the Istio Ingress Gateway performs the authz check; traffic then flows over http/grpc to the KServe Transformer (queue-proxy + kserve-container) and over mTLS to the KServe Predictor (queue-proxy + kserve-container).]
SPIFFE and SPIRE: Strong Identity
● SPIFFE is a secure identity framework that can be federated across heterogeneous environments.
● SPIRE distinguishes itself from other identity providers, such as API keys or secrets, by running
an attestation process before issuing the credential.
[Diagram: a SPIRE Server issues JWT-SVIDs through SPIRE Agents running on both a VM and Kubernetes; the agent attests the KServe Predictor workload (queue-proxy + kserve-container).]
KServe Observability
● OpenTelemetry can be used to instrument, collect, and export metrics, logs,
and traces that you can analyze to understand production InferenceService
performance.
[Diagram: traces span the gateway, queue-proxy, and KServe container.]
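A minimal sketch of instrumenting a predict handler directly with OpenTelemetry in Python (assumptions: the opentelemetry-sdk package, a console exporter, and a placeholder model; a real deployment would export to an OTLP collector instead):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("inference-service")

def predict(instances):
    # Each request gets a span with attributes useful for later analysis.
    with tracer.start_as_current_span("predict") as span:
        span.set_attribute("inference.batch_size", len(instances))
        return [x * 2 for x in instances]  # placeholder model

print(predict([1.0, 2.0, 3.0]))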
Cloud Native Buildpacks
● Cloud Native Buildpacks can turn your custom transformer or predictor into
a container image without a Dockerfile, and the image runs anywhere.
● This helps meet security and compliance requirements.
Looking Forward
KServe 1.0 Roadmap
● Graduate KServe core inference capabilities to stable/GA
● Open Inference Protocol coverage for all supported serving runtimes
● Graduate KServe SDK to 1.0
● Graduate ModelMesh and Inference Graph
● Large Language Model (LLM) Support
● KServe Observability and Security improvements
● Update KServe 1.0 documentation
https://github.com/kserve/kserve/blob/master/ROADMAP.md
Large Language Models
[Chart: growth of large language model sizes over time, with the KServe release marked. Source: https://huggingface.co/blog/large-language-models]
Distributed Inference for LLMs
● Large transformer-based models with
hundreds of billions of parameters are
computationally expensive.
● NVIDIA FasterTransformer implements
an accelerated engine for inference of
transformer-based models, spanning many
GPUs and nodes in a distributed manner.
● Hugging Face Accelerate allows
distributed inference with sharded
checkpoints using less memory.
● Tensor parallelism allows running
inference on multiple GPUs (see the
sketch below).
● Pipeline parallelism allows running
inference across multi-GPU, multi-node
environments.
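A minimal NumPy sketch of the tensor-parallelism idea (illustrative only, not FasterTransformer's implementation): the weight matrix is sharded column-wise across two "devices", each computes a partial output, and the partial outputs are gathered:

import numpy as np

x = np.random.randn(1, 1024).astype(np.float32)     # activation
W = np.random.randn(1024, 4096).astype(np.float32)  # full weight matrix

# Shard W column-wise: each "device" holds only half the weights in memory.
W0, W1 = np.split(W, 2, axis=1)
y0 = x @ W0                             # computed on device 0
y1 = x @ W1                             # computed on device 1
y = np.concatenate([y0, y1], axis=1)    # all-gather of the partial outputs

assert np.allclose(y, x @ W, atol=1e-3)  # matches the single-device result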
Large Language Model Challenges
● Inference hardware requirements
○ Cost: requires significant computational resources, with high-end GPUs and large
amounts of memory
○ Latency: long response times, up to tens of seconds
● Model blob/file sizes in the hundreds of GBs (BLOOM):
○ 176B parameters = 360GB
○ 72 splits of 5GB each, which need to be mapped onto multiple GPU devices
● Model loading time
○ From network (S3, MinIO) to instance disk
○ From instance disk to CPU RAM
○ From CPU RAM to GPU RAM
● Model serving runtime: FasterTransformer on Triton (32GB)
● Model data can be sensitive and must remain private during inference
LLM (BLOOM 176B) Inference Demo
curl https://fastertransformer.default.kubecon.theofpa.people.aws.dev/v1/models/fastertransformer:predict -d '{
"inputs":
[
{
"input":"Kubernetes is the best platform to serve your models because",
"output_len":"40"
}
]
}'
[["Kubernetes is the best platform to serve your models because it is a cloud-based
service that is built on top of the Kubernetes API. It is a great platform for developers
who want to build their own Kubernetes application"]]
https://github.com/kserve/kserve/tree/master/docs/samples/v1beta1/triton/fastertransformer
LLM Download and Loading Time
$ git lfs clone https://huggingface.co/bigscience/bloom-560m
$ ls -lh bloom-560m
total 3.2G
-rw-r--r-- 1 theofpa domain^users 693 Apr 13 11:16 config.json
-rw-r--r-- 1 theofpa domain^users 1.1G Apr 13 11:22 flax_model.msgpack
-rw-r--r-- 1 theofpa domain^users 16K Apr 13 11:16 LICENSE
-rw-r--r-- 1 theofpa domain^users 1.1G Apr 13 11:22 model.safetensors
-rw-r--r-- 1 theofpa domain^users 1.1G Apr 13 11:22 pytorch_model.bin
-rw-r--r-- 1 theofpa domain^users 21K Apr 13 11:16 README.md
-rw-r--r-- 1 theofpa domain^users 85 Apr 13 11:16 special_tokens_map.json
-rw-r--r-- 1 theofpa domain^users 222 Apr 13 11:16 tokenizer_config.json
-rw-r--r-- 1 theofpa domain^users 14M Apr 13 11:16 tokenizer.json
Hugging Face transformer to FasterTransformer conversion, based on tensor parallelism and target precision:
python3 FasterTransformer/examples/pytorch/gpt/utils/huggingface_bloom_convert.py
-o fastertransformer/1 -i ./bloom-560m/
$ ls -lhS fastertransformer/1 | head
total 2.1G
-rw-r--r-- 1 theofpa domain^users 980M Apr 13 11:28 model.wte.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.0.mlp.dense_4h_to_h.weight.0.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.0.mlp.dense_h_to_4h.weight.0.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.10.mlp.dense_4h_to_h.weight.0.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.10.mlp.dense_h_to_4h.weight.0.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.11.mlp.dense_4h_to_h.weight.0.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.11.mlp.dense_h_to_4h.weight.0.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.12.mlp.dense_4h_to_h.weight.0.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.12.mlp.dense_h_to_4h.weight.0.bin
Image source: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/general/multigpu.html#Sharding
storage-initializer:
2023-04-13 19:28:34.151 1 root INFO [downlo
Copying contents of s3://kubecon-models/mod
2023-04-13 19:29:28.922 1 root INFO [downlo
Successfully copied s3://kubecon-models/mod
/mnt/models
predictor:
I0413 19:29:34.205155 1 libfastertransforme
TRITONBACKEND_ModelInitialize: fastertransf 1)
I0413 19:29:45.448307 1 model_lifecycle.cc:
successfully loaded 'fastertransformer' ver
BloombergGPT - Powered by KServe
● Trained for approximately 53 days on 64 servers, each with 8 A100 GPUs,
on AWS SageMaker.
● HF checkpoints are converted into FasterTransformer format, setting the
target precision to BF16, tensor parallelism to 2, and pipeline parallelism to 1.
● Deployed on KServe on-prem for internal product development with the
NVIDIA Triton FasterTransformer serving runtime, on 2 A100 GPUs (80GB
memory) per replica.
https://www.bloomberg.com/company/press/bloomberggpt-50-billion-parameter-llm-tuned-finance/
BloombergGPT Paper: https://arxiv.org/abs/2303.17564
BloombergGPT - KServe Deployment
[Diagram: text input → preprocessing (tokenizer) predictor → input tokens sent via gRPC → Triton FasterTransformer predictor spanning GPU 1 and GPU 2 → generated tokens → text output.]
KServe Community
● KServe Website
● KServe on GitHub
● KServe Community
Session QR Codes will be
sent via email before the event
Please scan the QR Code above
to leave feedback on this session