SlideShare a Scribd company logo
1 of 35
A Scalable Data Transformation Framework
@ Penton using the Hadoop Ecosystem
Raj Nair
Director–Data Platform
Kiru
Pakkirisamy CTO
AGENDA
• About Penton and Serendio Inc
• Data Processing at Penton
• PoC Use Case
• Functional Aspects of the Use Case
• Big Data Architecture, Design and Implementation
• Lessons Learned
• Conclusion
• Questions
About Penton
• Professional information services company
• Provide actionable information to five core markets
Agriculture Transportation Natural
Products
Infrastructure Industrial Design
& Manufacturing
Success Stories
EquipmentWatch.com Govalytics.com
Prices, Specs, Costs, Rental Analytics around Gov’t capital spending
down to county level
SourceESB NextTrend.com
Vertical Directory, electronic parts Identify new product trends in the natural
products industry
About Serendio
Serendio provides Big Data Science
Solutions & Services for
Data-Driven Enterprises.
www.serendio.com
Data Processing at Penton
What got us thinking?
• Business units process data in silos
• Heavy ETL
– Hours to process, in some cases days
• Not even using all the data we want
• Not logging what we needed to
• Can’t scale for future requirements
Data Processing Pipeline New
features
New
Insights
New
Products
Biz ValueAssembly Line
processing
Data Processing Pipeline
The Data Processing Pipeline
Penton examples
• Daily Inventory data, ingested throughout the day
(tens of thousands of parts)
• Auction and survey data gathered daily
• Aviation Fleet data, varying frequency
Ingest, store
Clean, validate
Apply Business Rules
Map
Analyze
Report
Distribute
Slow Extract, Transform and Load = Frustration + missed business SLAs
Won’t scale for future
Various data formats, mostly unstructured
What were our options?
Adopt Hadoop Ecosystem
- M/R: Ideal for Batch Processing
- Flexible for storage
- NoSQL: scale, usability and flexibility
Expand RDBMS options
- Expensive
- Complex
HBASE Oracle
SQL
Server
Drools
POC Use Case
Primary Use Case
• Daily model data – upload and map
– Ingest data, build buckets
– Map data (batch and interactive)
– Build Aggregates (dynamic)
Issue: Mapping time
Functional Aspects
Data Scrubbing
• Standardized names for fields/columns
• Example - Country
– Unites States of America -> USA
– United States -> USA
Data Mapping
• Converting Fields - > Ids
– Manufacturer - Caterpillar -> 25
– Model - Caterpillar/Front Loader -> 300
• Requires the use of lookup tables and partial/fuzzy
matching strings
Data Exporting
• Move scrubbed/mapped data to main RDBMS
Current Design
• Survey data loaded as CSV files
• Data needs to be scrubbed/mapped
• All CSV rows loaded into one table
• Once scrubbed/mapped data is loaded into main tables
• Not all rows are loaded, some may be used in the future
Key Pain Points
• CSV data table continues to grow
• Large size of the table impacts operations on rows in a single
file
• CSV data could grow rapidly in the future
Criteria for New Design
• Ability to store an individual file and manipulate it easily
– No join/relationships across CSV files
• Solution should have good integration with RDBMS
• Could possibly host the complete application in future
• Technology stack should possibly have advanced analytics
capabilities
NoSQL model would allow to quickly retrieve/address
individual file and manipulate it
Big Data Architecture
Solution Architecture
Existing Business
Applications
REST API
CSV and Rule Management Endpoints
HBASE
HADOOP HDFS
CSV
Files
Master database of
Products/ Parts
Current
Oracle
Schema
Push
Updates
Insert
Accepted
Data
RDB-> Data Upload UI
API Calls
MR Jobs
Launch
Survey
RESTDrools
Use HBase
as a store for
CSV files
Data manipulation
APIs exposed
through REST layer
Drools – for
rule based
data
scrubbing
Operations on
individual files in
UI through Hbase
Get/Put
Operations on
all/groups of files
using MR jobs
Hbase Schema Design
• One row per HBase row
• One file per HBase row
– One cell per column qualifier (simple and started the development
with this approach)
– One row per column qualifier (more performant approach)
Hbase Rowkey Design
• Row Key
– Composite
• Created Date (YYYYMMDD)
• User
• FileType
• GUID
• Salting for better region splitting
– One byte
Hbase Column Family Design
• Column Family
– Data separated from Metadata into two or more
column families
– One cf for mapping data (more later)
– One cf for analytics data (used by analytics
coprocessors)
M/R Jobs
• Jobs
– Scrubbing
– Mapping
– Export
• Schedule
– Manually from UI
– On schedule using Oozie
Sqoop Jobs
• One time
– FileDetailExport (current CSV)
– RuleImport (all current rules)
• Periodic
– Lookup Table Data import
• Manufacture
• Model
• State
• Country
• Currency
• Condition
• Participant
Application Integration - REST
• Hide HBase AP/Java APIs from rest of
application
• Language independence for PHP front-end
• REST APIs for
– CSV Management
– Drools Rule Management
Lessons Learned
Performance Benefits
• Mapping
– 20000 csv files, 20 million records
– Time taken – 1/3rd of RDBMS processing
• Metrics
– < 10 secs vs (Oracle Materialized View)
• Upload a file
– < 10 secs
• Delete a file
– < 10 secs
Hbase Tuning
• Heap Size for
– RegionServer
– MapReduce Tasks
• Table Compression
– SNAPPY for Column Family holding csv data
• Table data caching
– IN_MEMORY for lookup tables
Application Design Challenges
• Pagination – implemented using intermediate REST layer and
scan.setStartRow.
• Translating SQL queries
– Used Scan/Filter and Java (especially on coprocessor)
– No secondary indexes - used FuzzyRowFilter
– Maybe something like Phoenix would have helped
• Some issues in mixed mode. Want to move to 0.96.0 for
better/individual column family flushing but needed to 'port'
coprocessors (to protobuf)
Hbase Value Proposition
• Better response in UI for CSV file operations - Operations
within a file (map, import, reject etc) not dependent on the
db size
• Relieve load on RDBMS - no more CSV data tables
• Scale out batch processing performance on the cheap (vs
vertical RDBMS upgrade)
• Redundant store for CSV files
• Versioning to track data cleansing
Roadmap
• Benchmark with 0.96
• Retire Coprocessors in favor of Phoenix (?)
• Lookup Data tables are small. Need to find a better alternative
than HBase
• Design UI for a more Big Data appropriate model
– Search oriented paradigm, than exploratory/ paginative
– Add REST endpoints to support such UI
Wrap-Up
Conclusion
• PoC demonstrated
– value of the Hadoop ecosystem
– Co-existence of Big data technologies with current solutions
– Adoption can significantly improve scale
– New skill requirements
Thank You
Rajesh.Nair@Penton.com
Kiru@Serendio.com

More Related Content

What's hot

Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6Kim Hammar
 
Rich Data Graphs for MapReduce
Rich Data Graphs for MapReduceRich Data Graphs for MapReduce
Rich Data Graphs for MapReduceScott Cinnamond
 
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Amazon Web Services
 
ETL Practices for Better or Worse
ETL Practices for Better or WorseETL Practices for Better or Worse
ETL Practices for Better or WorseEric Sun
 
Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Joydeep Sen Sarma
 
Why Java Professionals Should Learn Hadoop
Why Java Professionals Should Learn HadoopWhy Java Professionals Should Learn Hadoop
Why Java Professionals Should Learn HadoopBigClasses Com
 
MySql to HBase in 5 Steps
MySql to HBase in 5 StepsMySql to HBase in 5 Steps
MySql to HBase in 5 StepsScott Cinnamond
 
Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012Joydeep Sen Sarma
 
Hadoop online trainings
Hadoop online trainingsHadoop online trainings
Hadoop online trainingsGeek Trainings
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetupRemus Rusanu
 
SAP BW Powered by SAP HANA
SAP BW Powered by  SAP HANASAP BW Powered by  SAP HANA
SAP BW Powered by SAP HANABigClasses Com
 
Building your data warehouse with Redshift
Building your data warehouse with RedshiftBuilding your data warehouse with Redshift
Building your data warehouse with RedshiftAmazon Web Services
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 
NoSQL Database: Classification, Characteristics and Comparison
NoSQL Database: Classification, Characteristics and ComparisonNoSQL Database: Classification, Characteristics and Comparison
NoSQL Database: Classification, Characteristics and ComparisonMayuree Srikulwong
 

What's hot (20)

Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
 
Rich Data Graphs for MapReduce
Rich Data Graphs for MapReduceRich Data Graphs for MapReduce
Rich Data Graphs for MapReduce
 
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
 
ETL Practices for Better or Worse
ETL Practices for Better or WorseETL Practices for Better or Worse
ETL Practices for Better or Worse
 
Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)
 
Why Java Professionals Should Learn Hadoop
Why Java Professionals Should Learn HadoopWhy Java Professionals Should Learn Hadoop
Why Java Professionals Should Learn Hadoop
 
Cutting edge Essbase
Cutting edge EssbaseCutting edge Essbase
Cutting edge Essbase
 
MySql to HBase in 5 Steps
MySql to HBase in 5 StepsMySql to HBase in 5 Steps
MySql to HBase in 5 Steps
 
Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012
 
Hadoop course content
Hadoop course contentHadoop course content
Hadoop course content
 
Spark + HBase
Spark + HBase Spark + HBase
Spark + HBase
 
Hadoop online trainings
Hadoop online trainingsHadoop online trainings
Hadoop online trainings
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetup
 
Cloud Optimized Big Data
Cloud Optimized Big DataCloud Optimized Big Data
Cloud Optimized Big Data
 
SAP BW Powered by SAP HANA
SAP BW Powered by  SAP HANASAP BW Powered by  SAP HANA
SAP BW Powered by SAP HANA
 
Building your data warehouse with Redshift
Building your data warehouse with RedshiftBuilding your data warehouse with Redshift
Building your data warehouse with Redshift
 
Sap bw on hana ramireddy ppt
Sap bw on hana ramireddy pptSap bw on hana ramireddy ppt
Sap bw on hana ramireddy ppt
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
NoSQL Database: Classification, Characteristics and Comparison
NoSQL Database: Classification, Characteristics and ComparisonNoSQL Database: Classification, Characteristics and Comparison
NoSQL Database: Classification, Characteristics and Comparison
 

Viewers also liked

Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Jboss jbpm and drools 1 introduction to drools architecture
Jboss jbpm and drools   1 introduction to drools architectureJboss jbpm and drools   1 introduction to drools architecture
Jboss jbpm and drools 1 introduction to drools architectureZoran Hristov
 
Hortonworks Oracle Big Data Integration
Hortonworks Oracle Big Data Integration Hortonworks Oracle Big Data Integration
Hortonworks Oracle Big Data Integration Hortonworks
 
JBoss Drools and Drools Fusion (CEP): Making Business Rules react to RTE
JBoss Drools and Drools Fusion (CEP): Making Business Rules react to RTEJBoss Drools and Drools Fusion (CEP): Making Business Rules react to RTE
JBoss Drools and Drools Fusion (CEP): Making Business Rules react to RTEtsurdilovic
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat SheetHortonworks
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 

Viewers also liked (8)

Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Jboss jbpm and drools 1 introduction to drools architecture
Jboss jbpm and drools   1 introduction to drools architectureJboss jbpm and drools   1 introduction to drools architecture
Jboss jbpm and drools 1 introduction to drools architecture
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Hortonworks Oracle Big Data Integration
Hortonworks Oracle Big Data Integration Hortonworks Oracle Big Data Integration
Hortonworks Oracle Big Data Integration
 
JBoss Drools and Drools Fusion (CEP): Making Business Rules react to RTE
JBoss Drools and Drools Fusion (CEP): Making Business Rules react to RTEJBoss Drools and Drools Fusion (CEP): Making Business Rules react to RTE
JBoss Drools and Drools Fusion (CEP): Making Business Rules react to RTE
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
Business Rules on Hadoop
Business Rules on HadoopBusiness Rules on Hadoop
Business Rules on Hadoop
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 

Similar to A Scalable Data Transformation Framework using Hadoop Ecosystem

Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010John Sichi
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
DC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion DaysDC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion DaysRahul Agarwal
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's includedJames Serra
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataAshnikbiz
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?Mark Rittman
 

Similar to A Scalable Data Transformation Framework using Hadoop Ecosystem (20)

SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Apache drill
Apache drillApache drill
Apache drill
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
DC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion DaysDC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion Days
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big Data
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

A Scalable Data Transformation Framework using Hadoop Ecosystem

  • 1. A Scalable Data Transformation Framework @ Penton using the Hadoop Ecosystem Raj Nair Director–Data Platform Kiru Pakkirisamy CTO
  • 2. AGENDA • About Penton and Serendio Inc • Data Processing at Penton • PoC Use Case • Functional Aspects of the Use Case • Big Data Architecture, Design and Implementation • Lessons Learned • Conclusion • Questions
  • 3. About Penton • Professional information services company • Provide actionable information to five core markets Agriculture Transportation Natural Products Infrastructure Industrial Design & Manufacturing Success Stories EquipmentWatch.com Govalytics.com Prices, Specs, Costs, Rental Analytics around Gov’t capital spending down to county level SourceESB NextTrend.com Vertical Directory, electronic parts Identify new product trends in the natural products industry
  • 4. About Serendio Serendio provides Big Data Science Solutions & Services for Data-Driven Enterprises. www.serendio.com
  • 6. What got us thinking? • Business units process data in silos • Heavy ETL – Hours to process, in some cases days • Not even using all the data we want • Not logging what we needed to • Can’t scale for future requirements
  • 7. Data Processing Pipeline New features New Insights New Products Biz ValueAssembly Line processing Data Processing Pipeline The Data Processing Pipeline
  • 8. Penton examples • Daily Inventory data, ingested throughout the day (tens of thousands of parts) • Auction and survey data gathered daily • Aviation Fleet data, varying frequency Ingest, store Clean, validate Apply Business Rules Map Analyze Report Distribute Slow Extract, Transform and Load = Frustration + missed business SLAs Won’t scale for future Various data formats, mostly unstructured
  • 9. What were our options? Adopt Hadoop Ecosystem - M/R: Ideal for Batch Processing - Flexible for storage - NoSQL: scale, usability and flexibility Expand RDBMS options - Expensive - Complex HBASE Oracle SQL Server Drools
  • 11. Primary Use Case • Daily model data – upload and map – Ingest data, build buckets – Map data (batch and interactive) – Build Aggregates (dynamic) Issue: Mapping time
  • 13. Data Scrubbing • Standardized names for fields/columns • Example - Country – Unites States of America -> USA – United States -> USA
  • 14. Data Mapping • Converting Fields - > Ids – Manufacturer - Caterpillar -> 25 – Model - Caterpillar/Front Loader -> 300 • Requires the use of lookup tables and partial/fuzzy matching strings
  • 15. Data Exporting • Move scrubbed/mapped data to main RDBMS
  • 16. Current Design • Survey data loaded as CSV files • Data needs to be scrubbed/mapped • All CSV rows loaded into one table • Once scrubbed/mapped data is loaded into main tables • Not all rows are loaded, some may be used in the future
  • 17. Key Pain Points • CSV data table continues to grow • Large size of the table impacts operations on rows in a single file • CSV data could grow rapidly in the future
  • 18. Criteria for New Design • Ability to store an individual file and manipulate it easily – No join/relationships across CSV files • Solution should have good integration with RDBMS • Could possibly host the complete application in future • Technology stack should possibly have advanced analytics capabilities NoSQL model would allow to quickly retrieve/address individual file and manipulate it
  • 20. Solution Architecture Existing Business Applications REST API CSV and Rule Management Endpoints HBASE HADOOP HDFS CSV Files Master database of Products/ Parts Current Oracle Schema Push Updates Insert Accepted Data RDB-> Data Upload UI API Calls MR Jobs Launch Survey RESTDrools Use HBase as a store for CSV files Data manipulation APIs exposed through REST layer Drools – for rule based data scrubbing Operations on individual files in UI through Hbase Get/Put Operations on all/groups of files using MR jobs
  • 21. Hbase Schema Design • One row per HBase row • One file per HBase row – One cell per column qualifier (simple and started the development with this approach) – One row per column qualifier (more performant approach)
  • 22. Hbase Rowkey Design • Row Key – Composite • Created Date (YYYYMMDD) • User • FileType • GUID • Salting for better region splitting – One byte
  • 23. Hbase Column Family Design • Column Family – Data separated from Metadata into two or more column families – One cf for mapping data (more later) – One cf for analytics data (used by analytics coprocessors)
  • 24. M/R Jobs • Jobs – Scrubbing – Mapping – Export • Schedule – Manually from UI – On schedule using Oozie
  • 25. Sqoop Jobs • One time – FileDetailExport (current CSV) – RuleImport (all current rules) • Periodic – Lookup Table Data import • Manufacture • Model • State • Country • Currency • Condition • Participant
  • 26. Application Integration - REST • Hide HBase AP/Java APIs from rest of application • Language independence for PHP front-end • REST APIs for – CSV Management – Drools Rule Management
  • 28. Performance Benefits • Mapping – 20000 csv files, 20 million records – Time taken – 1/3rd of RDBMS processing • Metrics – < 10 secs vs (Oracle Materialized View) • Upload a file – < 10 secs • Delete a file – < 10 secs
  • 29. Hbase Tuning • Heap Size for – RegionServer – MapReduce Tasks • Table Compression – SNAPPY for Column Family holding csv data • Table data caching – IN_MEMORY for lookup tables
  • 30. Application Design Challenges • Pagination – implemented using intermediate REST layer and scan.setStartRow. • Translating SQL queries – Used Scan/Filter and Java (especially on coprocessor) – No secondary indexes - used FuzzyRowFilter – Maybe something like Phoenix would have helped • Some issues in mixed mode. Want to move to 0.96.0 for better/individual column family flushing but needed to 'port' coprocessors (to protobuf)
  • 31. Hbase Value Proposition • Better response in UI for CSV file operations - Operations within a file (map, import, reject etc) not dependent on the db size • Relieve load on RDBMS - no more CSV data tables • Scale out batch processing performance on the cheap (vs vertical RDBMS upgrade) • Redundant store for CSV files • Versioning to track data cleansing
  • 32. Roadmap • Benchmark with 0.96 • Retire Coprocessors in favor of Phoenix (?) • Lookup Data tables are small. Need to find a better alternative than HBase • Design UI for a more Big Data appropriate model – Search oriented paradigm, than exploratory/ paginative – Add REST endpoints to support such UI
  • 34. Conclusion • PoC demonstrated – value of the Hadoop ecosystem – Co-existence of Big data technologies with current solutions – Adoption can significantly improve scale – New skill requirements