SlideShare a Scribd company logo
1 of 26
Dataiku Flow and dctc
Data pipelines made easy
 Berlin Buzzwords 2013
Clément Stenac <clement.stenac@dataiku.com>
@ClementStenac
 CTO @ Dataiku
 Head of product R&D @ Exalead
(Search Engine Technology)
About me
02/06/2013Dataiku Training – Hadoop for Data Science 2
 The hard life of a Data Scientist
 Dataiku Flow
 DCTC
 Lunch !
02/06/2013DIP – Introduction to Dataiku Flow 3
Follow the Flow
Dataiku - Pig, Hive and Cascading
Tracker Log
MongoDB
MySQL
MySQL
Syslog
Product
Catalog
Order
Apache Logs
Session
Product Transformation
Category Affinity
Category Targeting
Customer Profile
Product Recommender
S3
Search Logs (External) Search Engine
Optimization
(Internal) Search
Ranking
MongoDB
MySQL
Partner FTP
Sync In Sync Out
Pig
Pig
Hive
Hive
ElasticSearch
Zooming more
Dataiku - Pig, Hive and Cascading
Page Views
Orders
Catalog
Bots, Special Users
Filtered Page Views
User Affinity
Product Popularity
User Similarity (Per Category)
Recommendation Graph
Recommendation
Order Summary
User Similarity (Per Brand)
Machine Learning
 Many tasks and tools
 Dozens of stage, evolves daily
 Exceptional situations are the norm
 Many pains
◦ Shared schemas
◦ Efficient incremental synchronization
and computation
◦ Data is bad
Real-life data pipelines
02/06/2013DIP – Introduction to Dataiku Flow 6
 1970 Shell scripts
 1977 Makefile
 1980 Makedeps
 1999 SCons/CMake
 2001 Maven
 … Shell Scripts
 2008 HaMake
 2009 Oozie
 ETLS, …
 Next ?
Dataiku - Pig, Hive and Cascading
An evolution similar to build
 Better dependencies
 Higher-level tasks
 The hard life of a Data Scientist
 Dataiku Flow
 DCTC
 Lunch !
02/06/2013DIP – Introduction to Dataiku Flow 8
Dataiku Flow is a data-driven
orchestration framework for
complex data pipelines
Introduction to Flow
02/06/2013DIP – Introduction to Dataiku Flow 9
 Manage data, not steps and taks
 Simplify common maintainance situations
◦ Data rebuilds
◦ Processing steps update
 Handle real day-to-day pains
◦ Data validity checks
◦ Transfers between systems
 Like a table : contains records, with a schema
 Can be partitioned
◦ Time partitioning (by day, by hour, …)
◦ « Value » partitioning (by country, by partner, …)
 Various backends
◦ Filesystem
◦ HDFS
◦ ElasticSearch
Concepts: Dataset
02/06/2013DIP – Introduction to Dataiku Flow 10
◦ SQL
◦ NoSQL (MongoDB, …)
◦ Cloud Storages
 Has input datasets and output datasets
 Declares dependencies from input to output
 Built-in tasks with strong integration
◦ Pig
◦ Hive
 Customizable tasks
◦ Shell script, Java, …
Concepts: Task
02/06/2013DIP – Introduction to Dataiku Flow 11
Aggregate
Visits
Visits
Customers
Weekly
aggregation
Daily
aggregation
◦ Python Pandas & SciKit
◦ Data transfers
Introduction to Flow
A sample Flow
02/06/2013DIP – Introduction to Dataiku Flow 12
Visits
Shaker
« cleanlogs »
Clean
logs
Web
Tracker
logs
Pig
« aggr_visits »
CRM table
customers
Shaker
« enrich_cust »
Clean
custs.
Browser
s
Referent
.
Hive
« customer_visits »
Cust last
visits
Pig
« customer_last_
product »
CRM table
products
Cust last
products
Flow is data-oriented
 Don’t ask « Run task A and then task B »
 Don’t even ask « Run all tasks that depend from task A »
 Ask « Do what’s needed so that my aggregated customers
data for 2013/01/25 is up to date »
 Flow manages dependencies between datasets, through tasks
 You don’t execute tasks, you compute or refresh datasets
Data-oriented
02/06/2013DIP – Introduction to Dataiku Flow 13
Partition-level dependencies
02/06/2013Dataiku Training – Hadoop for Data Science 14
Shaker
cleantask1 cleanlogwtlogs
Pig
« aggr_visits » weekly_aggr
 "wtlogs" and "cleanlog" are day-partitioned
 "weekly_aggr" needs the previous 7 days of clean logs
 "sliding days" partition-level dependency
 "Compute weekly_aggr for 2012-01-25"
◦ Automatically computes the required 7 partitions
◦ For each partition, check if cleanlog is up-to-date wrt. the wtlogs partition
◦ Perform cleantask1 in parallel for all missing / stale days
◦ Perform aggr_visits with the 7 partitions as input
sliding_days(7)
Automatic parallelism
02/06/2013DIP – Introduction to Dataiku Flow 15
 Flow computes the global DAG of required activities
 Compute activities that can take place in parallel
 Previous example: 8 activities
◦ 7 can be parallelized
◦ 1 requires the other 7 first
 Manages running activities
 Starts new activities based on available resources
 Datasets have a schema, available in all tools
 Advanced verification of computed data
◦ "Check that output is not empty"
◦ "Check that this custom query returns between X and Y records"
◦ "Check that this specific record is found in output"
◦ "Check that number of computed records for day B is no more
than 40% different than day A"
 Automatic tests for data pipelines
Schema and data validity checks
02/06/2013DIP – Introduction to Dataiku Flow 16
 Native knowledge of Pig and Hive formats
 Schema-aware loaders and storages
 A great ecosystem, but not omnipotent
◦ Not everything requires Hadoop's strong points
 Hadoop = first-class citizen of Flow, but not the only one
 Native integration of SQL capabilities
 Automatic incremental synchronization to/from MongoDB,
Vertica, ElasticSearch, …
 Custom tasks
Integrated in Hadoop,
open beyond
02/06/2013DIP – Introduction to Dataiku Flow 17
What about Oozie and Hcatalog ?
02/06/2013DIP – Introduction to Dataiku Flow 18
 Engine and core tasks are working
 Under active development for first betas
 Get more info and stay informed
http://flowbeta.dataiku.com
And while you wait, another thing
Ever been annoyed by data transfers ?
Are we there yet ?
02/06/2013DIP – Introduction to Dataiku Flow 19
Feel the pain
02/06/2013DIP – Introduction to Dataiku Flow 20
 The hard life of a Data Scientist
 Dataiku Flow
 DCTC
 Lunch !
02/06/2013DIP – Introduction to Dataiku Flow 21
 Extract from the core of Flow
 Manipulate files across filesystems
DCTC : Cloud data manipulation
02/06/2013DIP – Introduction to Dataiku Flow 22
# List the files and folders in a S3 bucket
% dctc ls s3://my-bucket
# Synchronize incrementally from GCS to local folder
% dctc sync gs://my-bucket/my-path target-directory
# Copy from GCS to HDFS, compress to .gz on the fly
# (decompress handled too)
% dctc cp –R –c gs://my-bucket/my-path hdfs:///data/input
# Dispatch the lines of a file to 8 files on S3, gzip-compressed
% dctc dispatch input s3://bucket/target –f random –nf 8 -c
DCTC : More examples
02/06/2013DIP – Introduction to Dataiku Flow 23
# cat from anywhere
% dctc cat ftp://account@:/pub/data/data.csv
# Edit a remote file (with $EDITOR)
% dctc edit ssh://account@:myfile.txt
# Transparently unzip
% dctc
# Head / tail from the cloud
% dctc tail s3://bucket/huge-log.csv
# Multi-account aware
% dctc sync s3://account1@path s3://account2@other_path
http://dctc.io
 Self-contained binary for Linux, OS X, Windows
 Amazon S3
 Google Cloud Storage
 FTP
Try it now
02/06/2013DIP – Introduction to Dataiku Flow 24
 HTTP
 SSH
 HDFS (through local install)
Questions ?
Marc Batty
Chief Customer Officer
marc.batty@dataiku.com
+33 6 45 65 67 04
@battymarc
Florian Douetteau
Chief Executive Officer
florian.douetteau@dataiku.com
+33 6 70 56 88 97
@fdouetteau
Thomas Cabrol
Chief Data Scientist
thomas.cabrol@dataiku.com
+33 7 86 42 62 81
@ThomasCabrol
Clément Stenac
Chief Technical Officer
clement.stenac@dataiku.com
+33 6 28 06 79 04
@ClementStenac
Dataiku Flow and dctc - Berlin Buzzwords

More Related Content

What's hot

How to boost your datamanagement with Dremio ?
How to boost your datamanagement with Dremio ?How to boost your datamanagement with Dremio ?
How to boost your datamanagement with Dremio ?Vincent Terrasi
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages JaunesDataiku
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQueryDharmesh Vaya
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Big Data Spain
 
The DBA Is Dead (Again). Long Live the DBA !
The DBA Is Dead (Again). Long Live the DBA !The DBA Is Dead (Again). Long Live the DBA !
The DBA Is Dead (Again). Long Live the DBA !Christian Bilien
 
Big Data - HDInsight and Power BI
Big Data - HDInsight and Power BIBig Data - HDInsight and Power BI
Big Data - HDInsight and Power BIPrasad Prabhu (PP)
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Anna Shymchenko
 
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...CloudxLab
 
Scaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big DataScaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big DataTreasure Data, Inc.
 
Data Science with Hadoop: A Primer
Data Science with Hadoop: A PrimerData Science with Hadoop: A Primer
Data Science with Hadoop: A PrimerDataWorks Summit
 
Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Arohi Khandelwal
 
The of Operational Analytics Data Store
The of Operational Analytics Data StoreThe of Operational Analytics Data Store
The of Operational Analytics Data StoreRommel Garcia
 
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013Jen Stirrup
 
Real Time Analytics for Big Data a Twitter Case Study
Real Time Analytics for Big Data a Twitter Case StudyReal Time Analytics for Big Data a Twitter Case Study
Real Time Analytics for Big Data a Twitter Case StudyNati Shalom
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabeDataiku
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascienceAdam Muise
 

What's hot (20)

How to boost your datamanagement with Dremio ?
How to boost your datamanagement with Dremio ?How to boost your datamanagement with Dremio ?
How to boost your datamanagement with Dremio ?
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQuery
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
 
Big Data, Baby Steps
Big Data, Baby StepsBig Data, Baby Steps
Big Data, Baby Steps
 
The DBA Is Dead (Again). Long Live the DBA !
The DBA Is Dead (Again). Long Live the DBA !The DBA Is Dead (Again). Long Live the DBA !
The DBA Is Dead (Again). Long Live the DBA !
 
Big Data - HDInsight and Power BI
Big Data - HDInsight and Power BIBig Data - HDInsight and Power BI
Big Data - HDInsight and Power BI
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»
 
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
 
Big data 101
Big data 101Big data 101
Big data 101
 
Scaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big DataScaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big Data
 
Data Science with Hadoop: A Primer
Data Science with Hadoop: A PrimerData Science with Hadoop: A Primer
Data Science with Hadoop: A Primer
 
Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Introduction of Big data and Hadoop
Introduction of Big data and Hadoop
 
The of Operational Analytics Data Store
The of Operational Analytics Data StoreThe of Operational Analytics Data Store
The of Operational Analytics Data Store
 
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
 
Real Time Analytics for Big Data a Twitter Case Study
Real Time Analytics for Big Data a Twitter Case StudyReal Time Analytics for Big Data a Twitter Case Study
Real Time Analytics for Big Data a Twitter Case Study
 
BigData Hadoop
BigData Hadoop BigData Hadoop
BigData Hadoop
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabe
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 

Viewers also liked

Dataiku - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku  - for Data Geek Paris@Criteo - Close the Data CircleDataiku  - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku - for Data Geek Paris@Criteo - Close the Data CircleDataiku
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013Dataiku
 
Online Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunOnline Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunDataiku
 
Dataiku - Paris JUG 2013 - Hadoop is a batch
Dataiku - Paris JUG 2013 - Hadoop is a batch Dataiku - Paris JUG 2013 - Hadoop is a batch
Dataiku - Paris JUG 2013 - Hadoop is a batch Dataiku
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku
 
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku
 
Doctor Faustus by Micaela & Rocío
Doctor Faustus by Micaela & RocíoDoctor Faustus by Micaela & Rocío
Doctor Faustus by Micaela & RocíoAndrea Izzo
 
Getting Ahead of the Game: How a gaming studio gets customer insights from bi...
Getting Ahead of the Game: How a gaming studio gets customer insights from bi...Getting Ahead of the Game: How a gaming studio gets customer insights from bi...
Getting Ahead of the Game: How a gaming studio gets customer insights from bi...DataWorks Summit
 
Aggietarium slideshow final
Aggietarium slideshow finalAggietarium slideshow final
Aggietarium slideshow finalAna Monzon
 
Тематическое планирование 8 класс
Тематическое планирование 8 классТематическое планирование 8 класс
Тематическое планирование 8 классkoneqq
 
Dataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku
 
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Dataiku
 

Viewers also liked (17)

Dataiku - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku  - for Data Geek Paris@Criteo - Close the Data CircleDataiku  - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku - for Data Geek Paris@Criteo - Close the Data Circle
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013
 
Online Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunOnline Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for Fun
 
Dataiku - Paris JUG 2013 - Hadoop is a batch
Dataiku - Paris JUG 2013 - Hadoop is a batch Dataiku - Paris JUG 2013 - Hadoop is a batch
Dataiku - Paris JUG 2013 - Hadoop is a batch
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
 
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16th
 
Practical Placement
Practical PlacementPractical Placement
Practical Placement
 
Doctor Faustus by Micaela & Rocío
Doctor Faustus by Micaela & RocíoDoctor Faustus by Micaela & Rocío
Doctor Faustus by Micaela & Rocío
 
Oxleas_Review_2012_website_version_1
Oxleas_Review_2012_website_version_1Oxleas_Review_2012_website_version_1
Oxleas_Review_2012_website_version_1
 
Getting Ahead of the Game: How a gaming studio gets customer insights from bi...
Getting Ahead of the Game: How a gaming studio gets customer insights from bi...Getting Ahead of the Game: How a gaming studio gets customer insights from bi...
Getting Ahead of the Game: How a gaming studio gets customer insights from bi...
 
Aggietarium slideshow final
Aggietarium slideshow finalAggietarium slideshow final
Aggietarium slideshow final
 
Тематическое планирование 8 класс
Тематическое планирование 8 классТематическое планирование 8 класс
Тематическое планирование 8 класс
 
Dataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine Learning
 
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
 
Pay periods use show
Pay periods use showPay periods use show
Pay periods use show
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem
 

Similar to Dataiku Flow and dctc - Berlin Buzzwords

Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshopFang Mac
 
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQueryVoxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQueryMárton Kodok
 
GDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQuery
GDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQueryGDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQuery
GDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQueryMárton Kodok
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...Marcin Bielak
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization Denodo
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Denodo
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームMasayuki Matsushita
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryMárton Kodok
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopDatabricks
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 
Game Analytics at London Apache Druid Meetup
Game Analytics at London Apache Druid MeetupGame Analytics at London Apache Druid Meetup
Game Analytics at London Apache Druid MeetupJelena Zanko
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Anant Corporation
 
Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaJeffrey T. Pollock
 
A Tale of 2 BI Standards: One for Data Warehouses and One for Data Lakes
A Tale of 2 BI Standards: One for Data Warehouses and One for Data LakesA Tale of 2 BI Standards: One for Data Warehouses and One for Data Lakes
A Tale of 2 BI Standards: One for Data Warehouses and One for Data LakesArcadia Data
 
Thu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayThu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayAjay Shriwastava
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 

Similar to Dataiku Flow and dctc - Berlin Buzzwords (20)

Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshop
 
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQueryVoxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
 
GDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQuery
GDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQueryGDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQuery
GDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQuery
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
 
Oracle GoldenGate for Oracle DBAs
Oracle GoldenGate for Oracle DBAsOracle GoldenGate for Oracle DBAs
Oracle GoldenGate for Oracle DBAs
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks Presentation
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Game Analytics at London Apache Druid Meetup
Game Analytics at London Apache Druid MeetupGame Analytics at London Apache Druid Meetup
Game Analytics at London Apache Druid Meetup
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafka
 
A Tale of 2 BI Standards: One for Data Warehouses and One for Data Lakes
A Tale of 2 BI Standards: One for Data Warehouses and One for Data LakesA Tale of 2 BI Standards: One for Data Warehouses and One for Data Lakes
A Tale of 2 BI Standards: One for Data Warehouses and One for Data Lakes
 
Thu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayThu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjay
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 

More from Dataiku

Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Dataiku
 
Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Dataiku
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelDataiku
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...Dataiku
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) Dataiku
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products Dataiku
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare IndustryDataiku
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ? Dataiku
 
04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku 04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku Dataiku
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Dataiku
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Dataiku
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystemDataiku
 
Data Disruption for Insurance - Perspective from th
Data Disruption for Insurance - Perspective from thData Disruption for Insurance - Perspective from th
Data Disruption for Insurance - Perspective from thDataiku
 

More from Dataiku (13)

Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
 
Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML model
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare Industry
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ?
 
04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku 04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystem
 
Data Disruption for Insurance - Perspective from th
Data Disruption for Insurance - Perspective from thData Disruption for Insurance - Perspective from th
Data Disruption for Insurance - Perspective from th
 

Recently uploaded

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Recently uploaded (20)

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

Dataiku Flow and dctc - Berlin Buzzwords

  • 1. Dataiku Flow and dctc Data pipelines made easy  Berlin Buzzwords 2013
  • 2. Clément Stenac <clement.stenac@dataiku.com> @ClementStenac  CTO @ Dataiku  Head of product R&D @ Exalead (Search Engine Technology) About me 02/06/2013Dataiku Training – Hadoop for Data Science 2
  • 3.  The hard life of a Data Scientist  Dataiku Flow  DCTC  Lunch ! 02/06/2013DIP – Introduction to Dataiku Flow 3
  • 4. Follow the Flow Dataiku - Pig, Hive and Cascading Tracker Log MongoDB MySQL MySQL Syslog Product Catalog Order Apache Logs Session Product Transformation Category Affinity Category Targeting Customer Profile Product Recommender S3 Search Logs (External) Search Engine Optimization (Internal) Search Ranking MongoDB MySQL Partner FTP Sync In Sync Out Pig Pig Hive Hive ElasticSearch
  • 5. Zooming more Dataiku - Pig, Hive and Cascading Page Views Orders Catalog Bots, Special Users Filtered Page Views User Affinity Product Popularity User Similarity (Per Category) Recommendation Graph Recommendation Order Summary User Similarity (Per Brand) Machine Learning
  • 6.  Many tasks and tools  Dozens of stage, evolves daily  Exceptional situations are the norm  Many pains ◦ Shared schemas ◦ Efficient incremental synchronization and computation ◦ Data is bad Real-life data pipelines 02/06/2013DIP – Introduction to Dataiku Flow 6
  • 7.  1970 Shell scripts  1977 Makefile  1980 Makedeps  1999 SCons/CMake  2001 Maven  … Shell Scripts  2008 HaMake  2009 Oozie  ETLS, …  Next ? Dataiku - Pig, Hive and Cascading An evolution similar to build  Better dependencies  Higher-level tasks
  • 8.  The hard life of a Data Scientist  Dataiku Flow  DCTC  Lunch ! 02/06/2013DIP – Introduction to Dataiku Flow 8
  • 9. Dataiku Flow is a data-driven orchestration framework for complex data pipelines Introduction to Flow 02/06/2013DIP – Introduction to Dataiku Flow 9  Manage data, not steps and taks  Simplify common maintainance situations ◦ Data rebuilds ◦ Processing steps update  Handle real day-to-day pains ◦ Data validity checks ◦ Transfers between systems
  • 10.  Like a table : contains records, with a schema  Can be partitioned ◦ Time partitioning (by day, by hour, …) ◦ « Value » partitioning (by country, by partner, …)  Various backends ◦ Filesystem ◦ HDFS ◦ ElasticSearch Concepts: Dataset 02/06/2013DIP – Introduction to Dataiku Flow 10 ◦ SQL ◦ NoSQL (MongoDB, …) ◦ Cloud Storages
  • 11.  Has input datasets and output datasets  Declares dependencies from input to output  Built-in tasks with strong integration ◦ Pig ◦ Hive  Customizable tasks ◦ Shell script, Java, … Concepts: Task 02/06/2013DIP – Introduction to Dataiku Flow 11 Aggregate Visits Visits Customers Weekly aggregation Daily aggregation ◦ Python Pandas & SciKit ◦ Data transfers
  • 12. Introduction to Flow A sample Flow 02/06/2013DIP – Introduction to Dataiku Flow 12 Visits Shaker « cleanlogs » Clean logs Web Tracker logs Pig « aggr_visits » CRM table customers Shaker « enrich_cust » Clean custs. Browser s Referent . Hive « customer_visits » Cust last visits Pig « customer_last_ product » CRM table products Cust last products
  • 13. Flow is data-oriented  Don’t ask « Run task A and then task B »  Don’t even ask « Run all tasks that depend from task A »  Ask « Do what’s needed so that my aggregated customers data for 2013/01/25 is up to date »  Flow manages dependencies between datasets, through tasks  You don’t execute tasks, you compute or refresh datasets Data-oriented 02/06/2013DIP – Introduction to Dataiku Flow 13
  • 14. Partition-level dependencies 02/06/2013Dataiku Training – Hadoop for Data Science 14 Shaker cleantask1 cleanlogwtlogs Pig « aggr_visits » weekly_aggr  "wtlogs" and "cleanlog" are day-partitioned  "weekly_aggr" needs the previous 7 days of clean logs  "sliding days" partition-level dependency  "Compute weekly_aggr for 2012-01-25" ◦ Automatically computes the required 7 partitions ◦ For each partition, check if cleanlog is up-to-date wrt. the wtlogs partition ◦ Perform cleantask1 in parallel for all missing / stale days ◦ Perform aggr_visits with the 7 partitions as input sliding_days(7)
  • 15. Automatic parallelism 02/06/2013DIP – Introduction to Dataiku Flow 15  Flow computes the global DAG of required activities  Compute activities that can take place in parallel  Previous example: 8 activities ◦ 7 can be parallelized ◦ 1 requires the other 7 first  Manages running activities  Starts new activities based on available resources
  • 16.  Datasets have a schema, available in all tools  Advanced verification of computed data ◦ "Check that output is not empty" ◦ "Check that this custom query returns between X and Y records" ◦ "Check that this specific record is found in output" ◦ "Check that number of computed records for day B is no more than 40% different than day A"  Automatic tests for data pipelines Schema and data validity checks 02/06/2013DIP – Introduction to Dataiku Flow 16
  • 17.  Native knowledge of Pig and Hive formats  Schema-aware loaders and storages  A great ecosystem, but not omnipotent ◦ Not everything requires Hadoop's strong points  Hadoop = first-class citizen of Flow, but not the only one  Native integration of SQL capabilities  Automatic incremental synchronization to/from MongoDB, Vertica, ElasticSearch, …  Custom tasks Integrated in Hadoop, open beyond 02/06/2013DIP – Introduction to Dataiku Flow 17
  • 18. What about Oozie and Hcatalog ? 02/06/2013DIP – Introduction to Dataiku Flow 18
  • 19.  Engine and core tasks are working  Under active development for first betas  Get more info and stay informed http://flowbeta.dataiku.com And while you wait, another thing Ever been annoyed by data transfers ? Are we there yet ? 02/06/2013DIP – Introduction to Dataiku Flow 19
  • 20. Feel the pain 02/06/2013DIP – Introduction to Dataiku Flow 20
  • 21.  The hard life of a Data Scientist  Dataiku Flow  DCTC  Lunch ! 02/06/2013DIP – Introduction to Dataiku Flow 21
  • 22.  Extract from the core of Flow  Manipulate files across filesystems DCTC : Cloud data manipulation 02/06/2013DIP – Introduction to Dataiku Flow 22 # List the files and folders in a S3 bucket % dctc ls s3://my-bucket # Synchronize incrementally from GCS to local folder % dctc sync gs://my-bucket/my-path target-directory # Copy from GCS to HDFS, compress to .gz on the fly # (decompress handled too) % dctc cp –R –c gs://my-bucket/my-path hdfs:///data/input # Dispatch the lines of a file to 8 files on S3, gzip-compressed % dctc dispatch input s3://bucket/target –f random –nf 8 -c
  • 23. DCTC : More examples 02/06/2013DIP – Introduction to Dataiku Flow 23 # cat from anywhere % dctc cat ftp://account@:/pub/data/data.csv # Edit a remote file (with $EDITOR) % dctc edit ssh://account@:myfile.txt # Transparently unzip % dctc # Head / tail from the cloud % dctc tail s3://bucket/huge-log.csv # Multi-account aware % dctc sync s3://account1@path s3://account2@other_path
  • 24. http://dctc.io  Self-contained binary for Linux, OS X, Windows  Amazon S3  Google Cloud Storage  FTP Try it now 02/06/2013DIP – Introduction to Dataiku Flow 24  HTTP  SSH  HDFS (through local install)
  • 25. Questions ? Marc Batty Chief Customer Officer marc.batty@dataiku.com +33 6 45 65 67 04 @battymarc Florian Douetteau Chief Executive Officer florian.douetteau@dataiku.com +33 6 70 56 88 97 @fdouetteau Thomas Cabrol Chief Data Scientist thomas.cabrol@dataiku.com +33 7 86 42 62 81 @ThomasCabrol Clément Stenac Chief Technical Officer clement.stenac@dataiku.com +33 6 28 06 79 04 @ClementStenac