1. AI pipelines powered by Jupyter notebooks
Luciano Resende
Open Source AI Platform Architect
@lresende1975
2. About me - Luciano Resende
Open Source AI Platform Architect – IBM – CODAIT
• Senior Technical Staff Member at IBM, contributing to open source for over 10 years
• Currently contributing to: the Jupyter Notebook ecosystem, Apache Bahir, Apache Toree, and Apache Spark, among other projects related to AI/ML platforms
lresende@us.ibm.com
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende
3. IBM Open Source Participation

Learn
• The Open Source @ IBM program touches 78,000 IBMers annually

Consume
• Virtually all IBM products contain some open source
• 40,363 packages per year

Contribute
• >62K open source certs per year
• ~10K IBM commits per month

Connect
• >1,000 active IBM contributors working in key open source projects
4. IBM Open Source Participation

IBM-generated open source innovation
• 137 IBM Open Code projects with 1,000+ GitHub repositories
• Projects graduate into full open governance: Node-RED, OpenWhisk, SystemML, and Blockchain Fabric, among others
• developer.ibm.com/code/open/code/

Community
• IBM is focused on 18 strategic communities
• Drives open governance in "Centers of Gravity"
• IBM leaders drive key technologies and assure freedom of action

The IBM open source way is now open sourced
• Training, recognition, tooling
• Organization, consuming, contributing
5. Technology leaders do more than just consume OSS

"For more than 20 years, IBM and Red Hat have paved the way for open communities to power innovative IT solutions." – Red Hat

IBM has a long history of actively fostering balanced community participation.
[Timeline graphic: IBM open source participation, 1998 onward]
6. Center for Open Source Data and AI Technologies (CODAIT)

CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise.

A relaunch of the Spark Technology Center (STC) to reflect an expanded mission.

codait.org

codait (French) = coder/coded
https://m.interglot.com/fr/en/codait
7. IBM Data Asset eXchange (DAX)

• Curated free and open datasets under open data licenses
• Standardized dataset formats and metadata
• Ready for use in enterprise AI applications
• Complement to the Model Asset eXchange (MAX)

Data Asset eXchange: ibm.biz/data-asset-exchange
Model Asset eXchange: ibm.biz/model-exchange
8. AGENDA

Jupyter Notebooks

Analytic Workloads Pipelines
• IPython %run magic
• Jupyter NBConvert
• Papermill
• Apache Airflow

AI/Deep Learning Workloads Pipelines
• AI Platforms
• Kubeflow and Kubeflow Pipelines

Announcements

Resources
10. Jupyter Notebooks

Notebooks are interactive computational environments in which you can combine code execution, rich text, mathematics, plots, and rich media.
11. Jupyter Notebook

Simple, but Powerful
As simple as opening a web page, with the capabilities of a powerful, multilingual development environment.

Interactive widgets
Code can produce rich outputs such as images, videos, markdown, LaTeX, and JavaScript. Interactive widgets can be used to manipulate and visualize data in real time.

Language of choice
Jupyter Notebooks support over 50 programming languages, including those popular in data science, data engineering, and AI, such as Python, R, Julia, and Scala.

Big Data Integration
Leverage big data platforms such as Apache Spark from Python, R, and Scala. Explore the same data with pandas, scikit-learn, ggplot2, dplyr, etc.

Share Notebooks
Notebooks can be shared with others using e-mail, Dropbox, Google Drive, GitHub, etc.
12. Jupyter Notebook Platform Architecture

The Notebook UI runs in the browser.

The Notebook Server serves the notebooks.

Kernels interpret/execute cell contents:
– They are responsible for code execution
– They abstract the different languages
– 1:1 relationship with a notebook
– They run and consume resources as long as the notebook is running
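To make the kernel lifecycle concrete, here is a minimal sketch (not from the deck) using the jupyter_client library, the same machinery the Notebook Server builds on; it assumes a local python3 kernelspec is installed:

from jupyter_client import KernelManager

# Start a kernel process; in the notebook model there is one per notebook
km = KernelManager(kernel_name='python3')
km.start_kernel()

# Open the messaging channels a notebook front end would use
kc = km.client()
kc.start_channels()

# Ask the kernel to execute one "cell" worth of code
kc.execute("print('hello from the kernel')")

# The kernel keeps consuming resources until it is explicitly shut down
kc.stop_channels()
km.shutdown_kernel()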
14. Analytic Workloads

Large amounts of data, shared across the organization in data lakes.

Multiple workload types:
– Data cleansing
– Data warehousing
– Machine learning and insights
17. Notebook Pipelines using %run

%run built-in IPython magic
- Enables execution of notebooks or Python scripts

[Diagram: an orchestrator notebook chaining three notebooks via successive %run calls]
18. Notebook Pipelines using %run

%run built-in IPython magic
- Enables execution of notebooks or Python scripts

Limitations
- Available in the IPython kernel only
- Static
- No command-line integration
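As an illustration (not from the original slides), the cells of such an orchestrator notebook might look like this; the notebook and script names are hypothetical:

# Orchestrator notebook cell: chain steps with the %run magic.
# Each %run executes the target in the current kernel's namespace,
# so variables defined by one step are visible to the next.
%run ./01_load_data.ipynb
%run ./02_clean_data.ipynb
%run ./03_train_model.py    # plain Python scripts work too

Because everything runs in a single IPython kernel, state sharing is trivial, but the pipeline inherits the limitations above: it is static and offers no command-line integration.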
19. Notebook Pipelines using NBConvert

Jupyter NBConvert
https://nbconvert.readthedocs.io/en/latest/

Jupyter NBConvert enables executing and converting notebooks to different file formats.

[Diagram: an orchestrator runs NBConvert over input notebook(s), producing output files such as result_1.ipynb, result_2.ipynb, result_3.html, and result_4.pdf]
20. Notebook Pipelines using NBConvert

Jupyter NBConvert
https://nbconvert.readthedocs.io/en/latest/

Jupyter NBConvert enables executing and converting notebooks to different file formats.

$ pip install nbconvert
$ jupyter nbconvert --to html --execute overview_with_run.ipynb
[NbConvertApp] Converting notebook overview_with_run.ipynb to html
[NbConvertApp] Executing notebook with kernel: python3
[NbConvertApp] Writing 300558 bytes to overview_with_run.html
$ open overview_with_run.html

Advantages
– Supports notebook chaining
– Converts results to immutable formats

Limitations
– No support for parameters
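NBConvert can also be driven programmatically, which is useful when embedding notebook execution in a larger script. A minimal sketch (not from the deck; file names are hypothetical) using nbconvert's ExecutePreprocessor:

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Load the notebook, execute every cell with the python3 kernel,
# and write the executed copy back out as an immutable artifact
nb = nbformat.read('overview_with_run.ipynb', as_version=4)
ep = ExecutePreprocessor(timeout=600, kernel_name='python3')
ep.preprocess(nb, {'metadata': {'path': '.'}})
nbformat.write(nb, 'overview_with_run.executed.ipynb')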
22. Papermill

Papermill is an open source tool contributed by Netflix which enables parameterizing, executing, and analyzing Jupyter Notebooks.

Papermill lets you:
- Parameterize notebooks
- Execute notebooks

[Diagram: an orchestrator runs Papermill over an input notebook, producing output files such as result_1.ipynb, result_2.ipynb, result_3.html, and result_4.pdf]
23. Papermill

Papermill provides a programmatic interface so you can integrate it with your applications:

from pprint import pprint
import papermill as pm

pm.execute_notebook('input_nb.ipynb',
                    'outputs/20190402_run.ipynb')
...
# Each run can be placed in a unique / sortable path
# (files_in_directory is a placeholder listing helper)
pprint(files_in_directory('outputs'))

outputs/ ...
    20190401_run.ipynb
    20190402_run.ipynb
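Papermill's headline feature, parameterization, injects values into the input notebook's cell tagged "parameters". A minimal sketch (parameter names and values are hypothetical):

import papermill as pm

# Values passed here override the defaults defined in the
# input notebook's "parameters"-tagged cell
pm.execute_notebook(
    'input_nb.ipynb',
    'outputs/20190402_alpha06_run.ipynb',
    parameters=dict(alpha=0.6, ratio=0.1),
)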
24. Papermill

Papermill provides a CLI that enables easy integration with external tools and simple schedulers such as cron:

$ papermill input_notebook.ipynb outputs/{run_id}_out.ipynb

$ papermill input.ipynb report.ipynb -y '{"foo": "bar"}' && jupyter nbconvert --to html report.ipynb
26. Apache Airflow

Airflow is a platform to programmatically author, schedule, and monitor workflows. It's enterprise ready and used to build large and complex workload pipelines.

[Diagram: Python code defining a DAG (workflow)]
27. Apache Airflow

Airflow is a platform to programmatically author, schedule, and monitor workflows. It's enterprise ready and used to build large and complex workload pipelines.

The Airflow Papermill operator enables Jupyter Notebooks to be integrated into Airflow workflows/pipelines.

More information → https://airflow.readthedocs.io/en/latest/howto/operator/papermill.html
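To make this concrete, a minimal DAG sketch using the Papermill operator (not from the deck; the operator's import path varies across Airflow releases, and the notebook paths and parameters are hypothetical):

from datetime import datetime

from airflow import DAG
from airflow.operators.papermill_operator import PapermillOperator  # Airflow 1.10.x path

with DAG(
    dag_id='notebook_pipeline',
    schedule_interval='@daily',
    start_date=datetime(2019, 7, 1),
    catchup=False,
) as dag:
    # One task = one parameterized notebook execution via Papermill
    run_notebook = PapermillOperator(
        task_id='run_input_notebook',
        input_nb='/notebooks/input.ipynb',
        output_nb='/notebooks/out-{{ execution_date }}.ipynb',
        parameters={'execution_date': '{{ execution_date }}'},
    )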
30. Analytic Workloads Pipelines Summary

                                    %run      NBConvert  Papermill  Apache Airflow
Notebook Kernels                    IPython   Multiple   Multiple   Multiple
Static versus Dynamic               Static    Dynamic    Dynamic    Dynamic
Programmatic APIs                   –         –          Yes        Yes
Notebook Parameters                 –         –          Yes        Yes
Heterogeneous pipelines/workflows   –         –          –          Yes
32. AI / Deep Learning Workloads

Resource-intensive workloads that require expensive hardware (GPUs, TPUs).

Long-running training jobs:
– A simple MNIST model takes over one hour to train WITHOUT a decent GPU
– Training other deep learning models, even non-complex ones, can easily take over a day WITH GPUs
33. Training/Deploying Models requires a lot of DevOps

• Model serving
• Monitoring
• Resource management
• Configuration
• Hyperparameter optimization
• Reproducibility
34. AI / Deep Learning Workloads Challenges

• How to isolate training environments so that multiple jobs, based on different deep learning frameworks (and/or releases), can be submitted and trained at the same time
• Ability to allocate individual system-level resources, such as GPUs and TPUs, to different kernels for a period of time
• Ability to free up system-level resources, such as GPUs and TPUs, as they stop being used or when they have been idle for a period of time
35. AI / Deep Learning Workloads

Containers and the Kubernetes platform
- Containers simplify management of complicated and heterogeneous AI/deep learning infrastructure, providing the required isolation layer between pods running different deep learning frameworks
- Containers provide a flexible way to deploy applications, and they are here to stay
- Kubernetes enables easy management of containerized applications and resources, with the benefit of elasticity and quality of service

Source: https://github.com/Langhalsdino/Kubernetes-GPU-Guide
36. AI Platforms

AI/deep learning platforms aim to abstract the DevOps tasks away from the data scientist, providing a consistent way to develop AI models independent of the toolkit/framework being used.

Examples include FfDL (Fabric for Deep Learning), among others.
37. Kubeflow

• The ML toolkit for Kubernetes
• Open source and community driven
• Supports multiple ML frameworks
• End-to-end workflows that can be shared, scaled, and deployed
38. Kubeflow Pipelines

Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.

• End-to-end orchestration: enabling and simplifying the orchestration of machine learning pipelines
• Easy experimentation: making it easy for you to try numerous ideas and techniques and manage your various trials/experiments
• Easy re-use: enabling you to re-use components and pipelines to quickly create end-to-end solutions without having to rebuild each time
39. Kubeflow Pipelines

Two key takeaways: a pipeline and a pipeline component.

A pipeline is a description of a machine learning (ML) workflow, including all of the components of the workflow and how they work together.
40. Kubeflow Pipelines

A pipeline component is an implementation of a pipeline task. A component represents a step in the workflow.
41. Kubeflow Pipelines

Each pipeline component is a container that contains a program to perform the task required for that particular step of your workflow.
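As an illustration of the container-per-component model (not from the deck), a minimal pipeline sketch using the early kfp SDK's ContainerOp API; the image names, commands, and file paths are placeholders:

import kfp
from kfp import dsl

@dsl.pipeline(name='train-and-evaluate',
              description='Sketch of a two-step Kubeflow pipeline')
def train_pipeline():
    # Each ContainerOp is one component: a container image plus a command
    train = dsl.ContainerOp(
        name='train',
        image='myrepo/train:latest',
        command=['python', 'train.py'],
        file_outputs={'model': '/output/model.txt'},
    )
    # Consuming train's output creates the dependency between the steps
    evaluate = dsl.ContainerOp(
        name='evaluate',
        image='myrepo/evaluate:latest',
        command=['python', 'evaluate.py'],
        arguments=['--model', train.outputs['model']],
    )

# Compile into a package that can be uploaded to the Kubeflow Pipelines UI
kfp.compiler.Compiler().compile(train_pipeline, 'train_pipeline.tar.gz')

Kubeflow Pipelines then schedules each step as a pod on Kubernetes, wiring the declared inputs and outputs between the containers.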
44. Learn more about Kubeflow Pipelines

Building a secure and transparent ML pipeline using open source technologies
Animesh Singh (IBM), Svetlana Levitan (IBM), Tommy Li (IBM)
1:30pm–5:00pm Tuesday, July 16, 2019
Incorporating Artificial Intelligence
Location: C123-124
46. Community Resources
Jupyter.org
https://jupyter.org/
JupyterLab
https://jupyterlab.readthedocs.io/en/stable/
Papermill
https://github.com/nteract/papermill
Kubeflow
https://kubeflow.org
https://github.com/kubeflow/
48. Fabric for Deep Learning (FfDL)

FfDL provides a scalable, resilient, and fault-tolerant deep learning framework.

• Fabric for Deep Learning, or FfDL (pronounced "fiddle"), is an open source project that aims to make deep learning easily accessible to the people to whom it matters most: data scientists and AI developers.
• FfDL provides a consistent way to deploy, train, and visualize deep learning jobs across multiple frameworks such as TensorFlow, Caffe, PyTorch, and Keras.
• FfDL is being developed in close collaboration with IBM Research and IBM Watson. It forms the core of Watson's Deep Learning service, in open source.

FfDL GitHub page
https://github.com/IBM/FfDL

FfDL technical architecture blog
http://developer.ibm.com/code/2018/03/20/democratize-ai-with-fabric-for-deep-learning

Deep Learning as a Service within Watson Studio
https://www.ibm.com/cloud/deep-learning

Research paper: "Scalable Multi-Framework Management of Deep Learning Training Jobs"
http://learningsys.org/nips17/assets/papers/paper_29.pdf