1. AI pipelines powered by Jupyter notebooks
Luciano Resende
Open Source AI Platform Architect
@lresende1975
2. About me - Luciano Resende
Open Source AI Platform Architect – IBM – CODAIT
• Senior Technical Staff Member at IBM, contributing to open source for over 10 years
• Currently contributing to: the Jupyter Notebook ecosystem, Apache Bahir, Apache Toree, and Apache Spark, among other projects related to AI/ML platforms
lresende@us.ibm.com
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende
3. IBM Open Source Participation

Learn
• The Open Source @ IBM program touches 78,000 IBMers annually

Consume
• Virtually all IBM products contain some open source
• 40,363 packages per year

Contribute
• >62K open source certs per year
• ~10K IBM commits per month

Connect
• >1,000 active IBM contributors working in key open source projects
4. IBM Open Source Participation

IBM-generated open source innovation
• 137 IBM Open Code projects with 1,000+ GitHub repositories
• Projects graduate into full open governance: Node-RED, OpenWhisk, SystemML, and Blockchain Fabric, among others
• developer.ibm.com/code/open/code/

Community
• IBM is focused on 18 strategic communities
• Drives open governance in "Centers of Gravity"
• IBM leaders drive key technologies and assure freedom of action

The IBM open source way is now open sourced
• Training, recognition, tooling
• Organization, consuming, contributing
5. Technology leaders do more than just consume OSS

"For more than 20 years, IBM and Red Hat have paved the way for open communities to power innovative IT solutions." – Red Hat

IBM has a long history of actively fostering balanced community participation.
[Timeline graphic: IBM open source participation, 1998 onward]
6. Center for Open Source Data and AI Technologies (CODAIT)

CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise.

A relaunch of the Spark Technology Center (STC) to reflect an expanded mission.

codait.org

codait (French) = coder/coded
https://m.interglot.com/fr/en/codait
7. IBM Data Asset eXchange (DAX)

• Curated free and open datasets under open data licenses
• Standardized dataset formats and metadata
• Ready for use in enterprise AI applications
• Complement to the Model Asset eXchange (MAX)

Data Asset eXchange: ibm.biz/data-asset-exchange
Model Asset eXchange: ibm.biz/model-exchange
8. AGENDA

Jupyter Notebooks

Analytic Workloads Pipelines
• IPython %run magic
• Jupyter NBConvert
• Papermill
• Apache Airflow

AI/Deep Learning Workloads Pipelines
• AI Platforms
• Kubeflow and Kubeflow Pipelines

Announcements

Resources
10. Jupyter Notebooks

Notebooks are interactive computational environments in which you can combine code execution, rich text, mathematics, plots, and rich media.
11. Jupyter Notebook

Simple, but Powerful
As simple as opening a web page, with the capabilities of a powerful, multilingual development environment.

Interactive widgets
Code can produce rich outputs such as images, videos, markdown, LaTeX, and JavaScript. Interactive widgets can be used to manipulate and visualize data in real time.

Language of choice
Jupyter Notebooks support over 50 programming languages, including those popular in data science, data engineering, and AI, such as Python, R, Julia, and Scala.

Big Data Integration
Leverage big data platforms such as Apache Spark from Python, R, and Scala. Explore the same data with pandas, scikit-learn, ggplot2, dplyr, etc.

Share Notebooks
Notebooks can be shared with others using e-mail, Dropbox, Google Drive, GitHub, etc.
12. Jupyter Notebook Platform Architecture

The Notebook UI runs in the browser.

The Notebook Server serves the notebooks.

Kernels interpret/execute cell contents:
– They are responsible for code execution
– They abstract the different languages
– 1:1 relationship with a notebook
– They run and consume resources as long as the notebook is running
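To make the kernel lifecycle concrete, here is a minimal sketch (not from the deck) using the jupyter_client library, the same machinery the Notebook Server builds on; it assumes a local python3 kernelspec is installed:

from jupyter_client import KernelManager

# Start a kernel process; in the notebook model there is one per notebook
km = KernelManager(kernel_name='python3')
km.start_kernel()

# Open the messaging channels a notebook front end would use
kc = km.client()
kc.start_channels()

# Ask the kernel to execute one "cell" worth of code
kc.execute("print('hello from the kernel')")

# The kernel keeps consuming resources until it is explicitly shut down
kc.stop_channels()
km.shutdown_kernel()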
14. Analytic Workloads

Large amounts of data, shared across the organization in data lakes.

Multiple workload types:
– Data cleansing
– Data warehousing
– Machine learning and insights
17. Notebook Pipelines using %run

%run built-in IPython magic
- Enables execution of notebooks or Python scripts

[Diagram: an orchestrator notebook chaining three notebooks via successive %run calls]
18. Notebook Pipelines using %run

%run built-in IPython magic
- Enables execution of notebooks or Python scripts

Limitations
- Available in the IPython kernel only
- Static
- No command-line integration
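As an illustration (not from the original slides), the cells of such an orchestrator notebook might look like this; the notebook and script names are hypothetical:

# Orchestrator notebook cell: chain steps with the %run magic.
# Each %run executes the target in the current kernel's namespace,
# so variables defined by one step are visible to the next.
%run ./01_load_data.ipynb
%run ./02_clean_data.ipynb
%run ./03_train_model.py    # plain Python scripts work too

Because everything runs in a single IPython kernel, state sharing is trivial, but the pipeline inherits the limitations above: it is static and offers no command-line integration.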
19. Notebook Pipelines using NBConvert

Jupyter NBConvert
https://nbconvert.readthedocs.io/en/latest/

Jupyter NBConvert enables executing and converting notebooks to different file formats.

[Diagram: an orchestrator runs NBConvert over input notebook(s), producing output files such as result_1.ipynb, result_2.ipynb, result_3.html, and result_4.pdf]
20. Notebook Pipelines using NBConvert

Jupyter NBConvert
https://nbconvert.readthedocs.io/en/latest/

Jupyter NBConvert enables executing and converting notebooks to different file formats.

$ pip install nbconvert
$ jupyter nbconvert --to html --execute overview_with_run.ipynb
[NbConvertApp] Converting notebook overview_with_run.ipynb to html
[NbConvertApp] Executing notebook with kernel: python3
[NbConvertApp] Writing 300558 bytes to overview_with_run.html
$ open overview_with_run.html

Advantages
– Supports notebook chaining
– Converts results to immutable formats

Limitations
– No support for parameters
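NBConvert can also be driven programmatically, which is useful when embedding notebook execution in a larger script. A minimal sketch (not from the deck; file names are hypothetical) using nbconvert's ExecutePreprocessor:

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Load the notebook, execute every cell with the python3 kernel,
# and write the executed copy back out as an immutable artifact
nb = nbformat.read('overview_with_run.ipynb', as_version=4)
ep = ExecutePreprocessor(timeout=600, kernel_name='python3')
ep.preprocess(nb, {'metadata': {'path': '.'}})
nbformat.write(nb, 'overview_with_run.executed.ipynb')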
22. Papermill

Papermill is an open source tool contributed by Netflix which enables parameterizing, executing, and analyzing Jupyter Notebooks.

Papermill lets you:
- Parameterize notebooks
- Execute notebooks

[Diagram: an orchestrator runs Papermill over an input notebook, producing output files such as result_1.ipynb, result_2.ipynb, result_3.html, and result_4.pdf]
23. Papermill

Papermill provides a programmatic interface so you can integrate it with your applications:

from pprint import pprint
import papermill as pm

pm.execute_notebook('input_nb.ipynb',
                    'outputs/20190402_run.ipynb')
...
# Each run can be placed in a unique / sortable path
# (files_in_directory is a placeholder listing helper)
pprint(files_in_directory('outputs'))

outputs/ ...
    20190401_run.ipynb
    20190402_run.ipynb
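Papermill's headline feature, parameterization, injects values into the input notebook's cell tagged "parameters". A minimal sketch (parameter names and values are hypothetical):

import papermill as pm

# Values passed here override the defaults defined in the
# input notebook's "parameters"-tagged cell
pm.execute_notebook(
    'input_nb.ipynb',
    'outputs/20190402_alpha06_run.ipynb',
    parameters=dict(alpha=0.6, ratio=0.1),
)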
24. Papermill

Papermill provides a CLI that enables easy integration with external tools and simple schedulers such as cron:

$ papermill input_notebook.ipynb outputs/{run_id}_out.ipynb

$ papermill input.ipynb report.ipynb -y '{"foo": "bar"}' && jupyter nbconvert --to html report.ipynb
26. Apache Airflow

Airflow is a platform to programmatically author, schedule, and monitor workflows. It's enterprise ready and used to build large and complex workload pipelines.

[Diagram: Python code defining a DAG (workflow)]
27. Apache Airflow

Airflow is a platform to programmatically author, schedule, and monitor workflows. It's enterprise ready and used to build large and complex workload pipelines.

The Airflow Papermill operator enables Jupyter Notebooks to be integrated into Airflow workflows/pipelines.

More information → https://airflow.readthedocs.io/en/latest/howto/operator/papermill.html
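To make this concrete, a minimal DAG sketch using the Papermill operator (not from the deck; the operator's import path varies across Airflow releases, and the notebook paths and parameters are hypothetical):

from datetime import datetime

from airflow import DAG
from airflow.operators.papermill_operator import PapermillOperator  # Airflow 1.10.x path

with DAG(
    dag_id='notebook_pipeline',
    schedule_interval='@daily',
    start_date=datetime(2019, 7, 1),
    catchup=False,
) as dag:
    # One task = one parameterized notebook execution via Papermill
    run_notebook = PapermillOperator(
        task_id='run_input_notebook',
        input_nb='/notebooks/input.ipynb',
        output_nb='/notebooks/out-{{ execution_date }}.ipynb',
        parameters={'execution_date': '{{ execution_date }}'},
    )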
30. Analytic Workloads Pipelines Summary

                                    %run      NBConvert  Papermill  Apache Airflow
Notebook Kernels                    IPython   Multiple   Multiple   Multiple
Static versus Dynamic               Static    Dynamic    Dynamic    Dynamic
Programmatic APIs                   –         –          Yes        Yes
Notebook Parameters                 –         –          Yes        Yes
Heterogeneous pipelines/workflows   –         –          –          Yes
32. AI / Deep Learning Workloads

Resource-intensive workloads that require expensive hardware (GPUs, TPUs).

Long-running training jobs:
– A simple MNIST model takes over one hour to train WITHOUT a decent GPU
– Training other deep learning models, even non-complex ones, can easily take over a day WITH GPUs
33. Training/Deploying Models requires a lot of DevOps

• Model serving
• Monitoring
• Resource management
• Configuration
• Hyperparameter optimization
• Reproducibility
34. AI / Deep Learning Workloads Challenges

• How to isolate training environments so that multiple jobs, based on different deep learning frameworks (and/or releases), can be submitted and trained at the same time
• Ability to allocate individual system-level resources, such as GPUs and TPUs, to different kernels for a period of time
• Ability to free up system-level resources, such as GPUs and TPUs, as they stop being used or when they have been idle for a period of time
35. AI / Deep Learning Workloads

Containers and the Kubernetes platform
- Containers simplify management of complicated and heterogeneous AI/deep learning infrastructure, providing the required isolation layer between pods running different deep learning frameworks
- Containers provide a flexible way to deploy applications, and they are here to stay
- Kubernetes enables easy management of containerized applications and resources, with the benefit of elasticity and quality of service

Source: https://github.com/Langhalsdino/Kubernetes-GPU-Guide
36. AI Platforms

AI/deep learning platforms aim to abstract the DevOps tasks away from the data scientist, providing a consistent way to develop AI models independent of the toolkit/framework being used.

Examples include FfDL (Fabric for Deep Learning), among others.
37. Kubeflow

• The ML toolkit for Kubernetes
• Open source and community driven
• Supports multiple ML frameworks
• End-to-end workflows that can be shared, scaled, and deployed
38. Kubeflow Pipelines

Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.

• End-to-end orchestration: enabling and simplifying the orchestration of machine learning pipelines
• Easy experimentation: making it easy for you to try numerous ideas and techniques and manage your various trials/experiments
• Easy re-use: enabling you to re-use components and pipelines to quickly create end-to-end solutions without having to rebuild each time
39. Kubeflow Pipelines

Two key takeaways: a pipeline and a pipeline component.

A pipeline is a description of a machine learning (ML) workflow, including all of the components of the workflow and how they work together.
40. Kubeflow Pipelines

A pipeline component is an implementation of a pipeline task. A component represents a step in the workflow.
41. Kubeflow Pipelines

Each pipeline component is a container that contains a program to perform the task required for that particular step of your workflow.
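As an illustration of the container-per-component model (not from the deck), a minimal pipeline sketch using the early kfp SDK's ContainerOp API; the image names, commands, and file paths are placeholders:

import kfp
from kfp import dsl

@dsl.pipeline(name='train-and-evaluate',
              description='Sketch of a two-step Kubeflow pipeline')
def train_pipeline():
    # Each ContainerOp is one component: a container image plus a command
    train = dsl.ContainerOp(
        name='train',
        image='myrepo/train:latest',
        command=['python', 'train.py'],
        file_outputs={'model': '/output/model.txt'},
    )
    # Consuming train's output creates the dependency between the steps
    evaluate = dsl.ContainerOp(
        name='evaluate',
        image='myrepo/evaluate:latest',
        command=['python', 'evaluate.py'],
        arguments=['--model', train.outputs['model']],
    )

# Compile into a package that can be uploaded to the Kubeflow Pipelines UI
kfp.compiler.Compiler().compile(train_pipeline, 'train_pipeline.tar.gz')

Kubeflow Pipelines then schedules each step as a pod on Kubernetes, wiring the declared inputs and outputs between the containers.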
44. Learn more about Kubeflow Pipelines

Building a secure and transparent ML pipeline using open source technologies
Animesh Singh (IBM), Svetlana Levitan (IBM), Tommy Li (IBM)
1:30pm–5:00pm Tuesday, July 16, 2019
Incorporating Artificial Intelligence
Location: C123-124
46. Community Resources
Jupyter.org
https://jupyter.org/
JupyterLab
https://jupyterlab.readthedocs.io/en/stable/
Papermill
https://github.com/nteract/papermill
Kubeflow
https://kubeflow.org
https://github.com/kubeflow/
48. Fabric for Deep Learning (FfDL)

FfDL provides a scalable, resilient, and fault-tolerant deep learning framework.

• Fabric for Deep Learning, or FfDL (pronounced "fiddle"), is an open source project that aims to make deep learning easily accessible to the people to whom it matters most: data scientists and AI developers.
• FfDL provides a consistent way to deploy, train, and visualize deep learning jobs across multiple frameworks such as TensorFlow, Caffe, PyTorch, and Keras.
• FfDL is being developed in close collaboration with IBM Research and IBM Watson. It forms the core of Watson's Deep Learning service, in open source.

FfDL GitHub page
https://github.com/IBM/FfDL

FfDL technical architecture blog
http://developer.ibm.com/code/2018/03/20/democratize-ai-with-fabric-for-deep-learning

Deep Learning as a Service within Watson Studio
https://www.ibm.com/cloud/deep-learning

Research paper: "Scalable Multi-Framework Management of Deep Learning Training Jobs"
http://learningsys.org/nips17/assets/papers/paper_29.pdf