Serverless Machine Learning

Jim Dowling
jdowling@kth.se
2022-11-04


Enterprise Al Value Chain

Operational ML
with real-time data

Operational ML
with historical data

o
a
c
>
n
D
o
&
a
3
a

Analytical ML

Bi: Al:
DESCRIPTIVE & DIAGNOSTIC PREDICTIVE & PRESCRIPTIVE
ANALYTICS ANALYTICS


Modern Enterprise Data and ML Infrastructure

Operational DBs Analytical Data Stores ML Infrastructure

Data Warehouse

sine Fo cost |

‘Snowflake / Databricks / BQ

Online
App

Data Lake

Predictions

KV Store
Objec/GraphoB
Elastic


Monolithic ML Pipeline

@ Apipeline is a program that takes and input and produces an output

@ End-to-end ML Pipelines are a single pipeline that transforms raw data into features
and trains and scores the model in one single program

Feature
Engineering

Model

/ Train | / Evaluate


Problems with Monolithic ML Pipelines

>

They are often not modular - their components are not modular and cannot be
independently scaled or deployed on different hardware (e.g., CPUs for feature engi-
neering, GPUs for model training).

v

They are difficult to test - production software needs automated tests to ensure
features and models are of high quality.

v

They tightly couple the execution of feature engineering, model training, and infer-
ence steps - running them in the same pipeline program at the same time.

> They do not promote reuse of features/models/code. The code for computing fea-
tures (feature logic) cannot be easily disentangled from its pipeline jungle.


Modular water pipes in a Google Datacenter. Instead of one giant water pipe (our
monolithic notebook), separate water pipes reduce the blast radius if one fails. Color
coding makes it easier to debug problems in a damaged water pipe.


Pipelines as Modular Programs

>» Modularity involves structuring your code such that its functionality is separated into
independent classes and/or functions that can be more easily reused and tested.

» Modules should be placed in accessible classes or functions, keeping them small and
easy to understand and document.

» Modules enable code to be more easily reused in different pipelines.

> Modules enable code to be more easily independently tested, enabling the easier and
earlier discovery of bugs.


Supervised ML Pipeline Stages

train(features, labels)— > model

model(features)— > predictions

Inference

es | --»Predictions
Pipeline

iy Feature reneable Feature | transformed

data Pipeline | features Store | features spe

Training
Pipeline


\
Prediction i
Service/UI '

features Logs
& model _— (predictions)

features model

Serverless
Feature & Model Store

=


ML Pipeline Stages - Data Sources

_——.
Inference | peeaieiony
Pipeline

Feature | reusable ,| Feature |rransformed  spodel

Pipeline | features Store | features a
Training
Pipeline

A


Connect to Data Sources and Read Raw Data

> Discover data sources, securely connect to heterogeneous data sources

» Manage dependencies such as connectors and drivers

> Manage connection information securely: network endpoint, database/table names,
authentication credentials such as API keys or credentials (username/password)


Heterogeneous Data Sources

i Type of Data
Tabular data

Unstructured data

: Free-text search dats
Documents / Objects
Graph data
Time-series data

Queued data

REST APIs

Web scraped

“Examples

Customer, transactions, marketing, sales, etc

images) sound: video:

appllcation/service ings

JSON

- Social network graphs

Performance metrics

Messages, events

Salesforce, Hubspot, etc

Electricity prices, air quality


File Formats for different Data Sources

Snowflake, Databricks, BQ,

Tabular data "csv, .parquet, .tfrecord, .-avro : Re dshift, 83, ADLS, ecs
Unstructured data i images, sound, video ' 83, GES ADLS, DES
" Free-text search —_ application/servce logs. _ el Elasticsearch, Solr
“Documents JSON —— MongoDB
Graph data i Sdcial RECN Neo4J

Time-se S data Performance n metrics i infuxDB, Prometheus
Queued alate iAvro : Kafka, Kinesis
REST APIs REST API with API key "Saas Platform

Web scraped N/A : Websites publishing data


Connect
to Data

Sources

Feature
Store

ML Pipeline Stages - Feature Pipelines

_
Inference
Pipeline
eee
hh,
Training
Pipeline

A

| -- Predictions

Feature Pipelines

feature-pipeline.py

Input Data. =———->| /—> Feature(s) -------- > Feature
Ly} Led Led Le] Se

VACE = Validate, Aggregate, Compress (dimensionality reduction), Extract (Binning, Crosses, etc)


Feature Pipelines

> A feature pipeline is a program that orchestrates the execution of feature engineering
steps on input data to create feature values.

Examples of feature engineering steps:
> Clean, validate, data

> Data de-duplication, pseudononymization, data wrangling

> Feature extraction, aggregations, dimensionality reduction, feature binning, feature
crosses


Tabular Data

Column (feature)

Table (feature group)

1111 2222 3333 4444

| |firsaazesaseaees| |stzs¢ | inde arsio |
111 2222 3333 4444| |$66.29 Stockholm
ba — [[stocknoim J

—,.. (feature) value

(vector)

Primary Key
(Entity ID)


Tabular Data as Features, Labels, Entity (or Primary) Keys,
Event Time

1111 2222 3333 4444 ‘2022-01-01 08:44

| |
)22 3333 4444 2022-01-01 19:44 Rio De Janeiro False

leray |
1111 2222 3333 4444 2022-01-01 20:44

= — bre


Tabular Data in Pandas

z

2022-01-01 08:44

(2022-01-01 19:44

“2022.01.01 20:44

2022-01-01 20:55


Exploratory Data Analysis in Pandas

Useful EDA Commands
df.head()

df.describe()
df[col].unique()
df[col].nunique()
df.isnull().sum()

df[col].value_counts()

sns.histplot(...)

Description

Returns the first few rows of df.

Returns descriptive statistics for df. Use with numerical features.
Returns all values unique for a column, col, in df.

Returns the number of unique values for a column, col, in df.
Returns the number of null values in all columns in df.

Returns the number of values for with different values. Use with both numerical
and categorical variables.

Plot a histogram for a DataFrame or selected columns using Seaborn.


Aggregations in Pandas

Aggregation
df.count()

df.first(), df.last()
df.mean(), df.median()
df.min(), df.max()
df.std(), df.var()
df.mad()

df.prod()

df.sum()

Description

Count the number of rows

First and last rows

Mean and median

Minimum and maximum
Standard deviation and variance
Mean absolute deviation
Product of all rows

Sum of all rows


Rolling Windows in Pandas

What is the 7 day rolling max/mean of the credit card transaction amounts?

# For rolling windows in Pandas, first set a DateTime column as index to the df

Credit-Card Transactions

t veld ox Lol ra re ee

Time
df.rolling('1D') .amount.max() ool
1-day
df.rolling('1W') .amount.mean() teal | | (a A |
1-week
wea red | on li li Lotte val 1 1 al

1-month

df.rolling('30D').amount.min()


Feature binning

Customer Age Groups


Feature Crosses

> A feature cross is a synthetic feature formed by multiplying (crossing) two or more
features. By multiplying features together, you encode nonlinearity in the feature
space.

> For example, imagine we are looking for credit card fraud activity within a geographic
region (e.g., a city district), how would we capture that as a feature?

> We could cross to a geographic area (binned latitude and binned longitude - a grid
identifying a city district) with the level of credit card activity within that geographic
area.

Binned Binned ‘cc spend_thr
Latitude Longitude _— Bist @blong® eovspend_ thr

xx yy 8000)

> 2000


Embeddings as Features

An embedding is a lower dimension representation of a sparse input that retains some
of the semantics of the input.

> An embedding store (vector database) stores semantically similar inputs close to-
gether in the embedding space. You can implement “similarity search” by finding
embeddings close in embedding space. You can even apply arithmetic on embeddings
to discover semantic relationships.

woman sit

|

*), oe faster slowest
daughter inn’
eo
England longer
™“/ ‘/, te fates
Paris Italy \ she here
aw J himself * lenge

iene herself

Image Embeddings enable Similarity Search


ML Pipeline Stages - Feature Store

C.

Inference

ei . }-- Predictions
Pipeline

Connect

to Data ri Model

Sources P zh,
Training
Pipeline

A


Store Features

There are two general ways people manage features and labels for both training and
serving:
> (1) Compute features on-demand as part of the model training or batch inference
pipeline.

> (2) Use a feature store to store the features so that they can be reused across
different models for both training and inference. For online models that require
features with either historical or contextual information, feature stores are typically
used.


ML Pipeline Stages - Training Pipelines

Inference
Pipeline

Connect
to Data
Sources

reusable Feature | transformed
features Store features

Feature
Pipeline


Feature Types

ee ee _ eee ousted ‘S20 Lineal iiaise:
1111 2222 3333 4444 (2022-01-01 19:44 $12.34 Rio De Janeiro _ False
1111 2222 3333 4444 2022-01-01 20:44 ($66.29 ‘Stockholm True

Stockholm

2022-01-01 20:55 $112.33

Reference: https: //www.hopsworks.ai/post/feature-types-for-machine-learning


Feature Types Taxonomy

: Categorical :

——

Feature Type

| Numerical ©

—

Ordinal

Nominal Embedding List Interval

Ratio


Feature

_ Untransformed
Store Input Features
T-HATE =

Model Training Pipelines

training pipeline py

Bid

Model
Registry

Transform features, Hyperparameter tuning, model Architecture, Train model (fit to data), Evaluate your model.


Model-Dependent Transformations

e Transformations for data compatibility
© Convert non-numeric features into numeric
o Resize inputs to a fixed size

e Transformations to improve model performance

co Many models perform badly if numerical
features do not follow a normal (Gaussian)
distribution

© Tokenization or lower-casing of text
features

o Allowing linear models to introduce
non-linearities into the feature space

log transformation function

a,

Exponential distribution

Normal distribution

0 500 1000 1500 2000
Target

Reference: https: //developers.google.com/machine-learning /data-

prep/transform/introduction


Transformations in Pandas

1 sns.histplot(df_norm)

o <AxesSubplot: ylabel='Count'>
Fo
©
# ET
bs ‘
» is
10 x»
‘ 10
> 2» © © © WO
carry 02 ny 06 08 10

+ 1 columns = [‘amount']
2. df_exp = pd.DataFrame(data = array, colunns = columns)

1 # Min-Max Normalization in Pandas
2 df_norm = (df_exp-df_exp.min())/(df_exp.max()-df_exp.min())
3. df_norm.head()


Different types of Transformations

Type of Transformation
Scaling to Minimum And Maximum values
Scaling To Median And Quantiles
Gaussian Transformation
Logarithmic Transformation
Reciprocal Transformation
Square Root Transformation
Exponential Transformation

Box Cox Transformation

ML Algorithms that may need Transformations
Linear regression

Logistic regression

K Nearest neighbours

Neural networks

Support vector machines with radial bias kernel
functions

Principal components analysis

Linear discriminant analysis


Model Training with Train and Test Sets

Training Data (df_features, df_labels)

|

Train set : Test set
-  (X_test, y_test)

X_train = training set features X_test is test set features
y_train

training set labels y_test is test set labels


Model Training with Train and Test Sets in Scikti-Learn

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import xgboost as xgb

: X_train,X_test,y_train,y_test = Get train and test data sets as

features (X) and labels (y)

Use XGBQ0s! as moseliing algorithm
Train supervised ML classifier with
features and labels from train set

i Generate predictions with model on
: y_pred = model.predict(X_test) . it ——$— seer fentiian (X test)

4 model.fit(X_train y_train)

: report_dict lassification_report( é
: y_test. y_pred, output_dict=True) :

Evaluate model performance by
comparing predictions (y_pred) and
labels (y_test) for the test set


Model Training is an Iterative Process

EDA an

Archit Model Evaluation


Model-Centric Iteration to Improve Model Performance

Possible steps to improve your model performance:

> Try outa different supervised ML learning algorithm (e.g., random forest, feedforward
deep neural network, Gradient-boosted decision tree)

> Try out new combinations of hyperparameters (e.g., number of training epochs, the
learning rate, number of layers in a deep neural network, adjust regularizations such
as Dropout or BatchNorm)

> Evaluate your model on a validation set (keeping a separate holdout test set for final
model performance evaluation)


Data-Centric Iteration to Improve Model Performance

Steps to improve your model

> Add or remove features to or from your model (feature selection)

> Add more training data

>» Remove poor quality training samples

> Improve the quality of existing training samples (e.g., using Cleanlab or Snorkel)

v

Rank the importance of the training samples (Active Learning)


Train, Validation, and Test Sets

>» Random splits of the training data when the data is not time-series data

> Time-series splits of the training data when the data is time-series data

Training Data

70%

Validation
Set


Model Training is an Iterative Process

Training Data (df_features, df_labels)
|

Trainset Validation set : Test set
| (X_train, y_train) © (X_val,y_val) | (X_test, y_test)

= — : :

eet Hiocarem, 9. wo) it = model. predict (X_val) y_pred = model. predict (x_test)

metrics = classification_report(y_val, y_pred) metrics = evaluate(y_test, y_pred)

= = ‘

evaluate


ML Pipeline Stages - Inference Pipelines

Connect

Feature

reusable Feature transformed Wadal
Pipeline ‘

features Store | features

to Data
Sources

Training
Pipeline


Batch Inference Pipeline

Untransformed
Input Features T P fe)

|» Predictions Prediction
> Results

Feature

Store

TPO = Transform features, Predict, Output.


Online Inference Pipeline

TPP = Transform the input request into features, Predict using input features and the
model, Post-process predictions, before output results.

Results


Serverless ML with Python

> Write Feature, Training, and Inference Pipelines in Python

> Orchestrate the execution of Pipelines using Serverless Compute Platforms

> Store features and models in a serverless feature/model store

> Run a User Interface (UI), written in Python, on serverless infrastructure


Serverless Compute Platforms

Serverless Python Functions Orchestration Platforms
e Modal e Modal
e GitHub Actions e GitHub Actions
e render. com e Astronomer (Airflow)
e pythonanywhere.com e Dagster
e  replit.com e Prefect
e deta.sh e Azure Data Factory
e  linode.com e Amazon Managed Workflows for
e hetzner.com Apache Airflow (MWAA)
e digitalocean.com e Google Cloud Composer
e AWS lambda functions e Databricks Workflows
e Google Cloud Functions

Serverless Feature Stores and Model Registry /Serving

Feature Stores
> Hopsworks
Model Registry and Serving
> Hopsworks
>» AWS Sagemaker
> Databricks

> Google Vertex

Serverless User Interfaces

> Hugging Faces Spaces

> Streamlit Cloud


Iris Flower Dataset

https: //github.com/ID2223KTH/id2223kth.github.io/tree/master/src/serverless-ml-
intro

> 4 input features: sepal length, sepal width, petal length, petal width
> label (target): Iris Flower Type (one of Setosa, Versicolor, Virginica)

> Only 150 samples in the dataset


Serverless Iris with Modal, Hopsworks, and Hugging Face

Modal

i i
iris-training-pi — > ! faa tris ul
pipeline.py i i
Iris Flower Data — sas a<$<——$— §« OEE
iris.csv i i i d
features features model features

Hopsworks
Feature & Model Store


Iris Flowers: Feature Pipeline with Modal and Hopsworks

stub = modal.Stub()
hopsworks_image = modal. Image.debian_slim() .pip_install(["hopsworks"] )

@stub.function( image=hopsworks_image, schedule=modal.Period(days=1), \

secret=modal .Secret . from_name ("jim-hopsworks-ai"))

def £():
ee ho} sworks |
Ss as pd
sie hopsworks. login()

af y=

fs = project.get_feature_store()

iris_df = pd.read_csv("https://repo.hops.works/master/hopsworks-tutorials/data/iris.csv")

iris_fg = fs.get_or_create_feature_group( name="iris_modal", version=1,
primary_key=["sepal_length","sepal_width","petal_length","petal_width"],
description="Iris flower dataset")

iris_fg.insert(iris_df)

Snateses—— yo meine:
with stub.run():
£0


Training Pipeline with Modal and Hopsworks

@stub.function(image=hopsworks_image, schedule=modal.Period(days=1) ,\
secret=modal.Secret . from_name("jim-hopsworks-ai") )
def £():
# lots of imports
project = hopsworks.1login()
fs = project.get_feature_store()
try:
feature_view = fs.get_feature_view(name="iris_modal", version=1)
except:
iris_fg = fs.get_feature_group(name="iris_modal", version=1)
query = iris_fg.select_all()
feature_view = fs.create_feature_view(name="iris_modal",
version=1,
description="Read from Iris flower dataset",
labels=["variety"],
query=query)
X_train, X_test, y_train, y_test = feature_view.train_test_split(0.2)
model = KNeighborsClassifier(n_neighbors=2)
model.fit(X_train, y_train.values.ravel())


Training Pipeline (ctd)

y_pred = model.predict (X_test)
metrics = classification_report(y_test, y_pred, output_dict=True)
results = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(results, [’True Setosa’, ’True Versicolor’, ’True Virginica’],
[’Pred Setosa’, ’Pred Versicolor’, ’Pred Virginica’])
cm = sns.heatmap(df_cm, annot=True)
fig = cm.get_figure()
joblib.dump(model, "iris_model/iris_model.pk1")
fig.savefig("iris_model/confusion_matrix.png")
input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema, output_schema)
mr = project.get_model_registry()
iris_model = mr.python.create_model(
name="iris_modal",
metrics={"accuracy" : metrics[’accuracy’]},
model_schema=model_schema,
description="Iris Flower Predictor")
iris_model.save("iris_model")


Interactive Inference Pipeline with Hugging Face/Hopsworks

model = mr.get_model("iris_modal", version=1)
model_dir = model.download()
model = joblib.load(model_dir + "/iris_model.pkl")

def iris(sepal_length, sepal_width, petal_length, petal_width):
input_list = []
input_list.append(sepal_length)
input_list.append(sepal_width)
input_list .append(petal_length)
input_list .append(petal_width)
res = model.predict (np.asarray(input_list).reshape(1, -1))
flower_url = "https://raw.githubusercontent.com/.../assets/" + res[0] + ".png"
return Image.open(requests.get(flower_url, stream=True) .raw)

demo = gr.Interface(
fn=iris, title="Iris Flower Predictive Analytics", allow_flagging="never",
description="Experiment with sepal/petal lengths/widths to predict which flower it is.",
inputs=[ gr.inputs.Number(default=1.0, label="sepal length (cm)"),
gr.inputs.Number(default=1.0, label="sepal width (cm)"),
gr.inputs.Number(default=1.0, label="petal length (cm)"),
gr.inputs.Number(default=1.0, label="petal width (cm)"),],
outputs=gr .Image(type="pil"))
demo. launch ()


Questions?

Acknowledgements

Some of the images are used with permission from Hopsworks AB.


