Watch full webinar here:
Data lakes have been both praised and loathed. They can be incredibly useful to an organization, but they can also be a source of major headaches. The ease of scaling storage at minimal cost has opened the door to many new solutions, but also to a proliferation of runaway objects that gave rise to the term "data swamp".
However, the addition of an MPP engine, based on Presto, to Denodo’s logical layer can change the way you think about the role of the data lake in your overall data strategy.
Watch this session on demand to learn:
- The new MPP capabilities that Denodo includes
- How to use them to your advantage to improve security and governance of your lake
- New scenarios and solutions where your data fabric strategy can evolve
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
1. DATA VIRTUALIZATION
Packed Lunch Webinar Series
Sessions Covering Key Data Integration
Challenges Solved with Data Virtualization
2. Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
Pablo Alvarez, Global Director of Product Management, Denodo
Alberto Pan, CTO, Denodo
3. The Rise and Fall of the Data Lake
• Data lakes were often the flagship initiatives of the Hadoop era
• However, few data lakes managed to fulfill initial expectations, and many failed to deliver results
• Those “data swamps” were often criticized for their lack of process, governance and security
• However, many of the technological advances of those data lakes lived on in newer technologies
4. The Advent of Object Storage
• Object storage is a form of storage for unstructured data (objects) that eliminates the scaling limitations of traditional storage options
• In other words, it is effectively limitless in terms of capacity
• It is rooted in the Big Data initiatives of the early 2010s, especially HDFS
• But it became popular with its adoption by cloud providers
• Nowadays, Amazon’s S3 (Simple Storage Service) and Azure’s ADLS (Azure Data Lake Storage) are the most popular
• There are also alternatives from other vendors (Google, Oracle, IBM, etc.) and open source options (like MinIO)
5. Object Storage is the Foundation of Cloud Data Systems
• Modern cloud data systems, like the cloud EDW, data lakes and the “lakehouse”, have evolved based on the premise of separating processing from storage
• Unlike in a traditional EDW, processing power is not tied to additional disk space
• Object storage technologies provide the limitless storage these systems need, in a more cost-efficient way, and are adapted to the cloud
• Open formats, like Parquet and Avro, specifically designed for interoperability in analytics, helped them grow and gain adoption
7. Common Usage Patterns for Modern Data Lakes
• Cheap storage for backup, old or rarely used data
• Ingest 3rd party data
• Move non-critical workloads to cheaper systems
• Data science playground
• New life for legacy Hadoop efforts
• And many others
8. Can you work with object storage alone?
• Object storage platforms provide limitless, cost-efficient storage space
• However, they are still filesystems
• Although some client applications can connect and use those files directly as if they were in a local filesystem, processing data that way is not efficient
• In addition, object storage platforms offer limited granularity in terms of security, and few options for governance
• Incorporating object storage into your data strategy will require additional pieces
9. What else do we need?
1. To process data in the object storage efficiently, we need a modern MPP engine that can work in parallel to process large data volumes
• Most new-generation cloud data systems, like Snowflake, Databricks, Presto, Redshift, etc., follow that design
2. But an MPP engine alone is not enough, as shown by the failures of previous incarnations of data lake projects!
3. We need to bring additional options for data management:
• Fine-grained security and access control
• Documentation, classification and search capabilities to bring cataloguing and governance into the process
• Data integration capabilities to ingest, massage, curate and expose information in the right format
4. Additionally, we need to keep in mind that data in the object storage is just a portion of the data in the organization. All data should be managed consistently, regardless of location
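To make "fine-grained security and access control" concrete, here is a minimal sketch of row filtering and column masking applied in a layer above the raw data. It is illustrative only and is not Denodo's API: the `Policy` class, the `apply_policy` function, and the sample `ORDERS` data are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    """Hypothetical per-role policy: columns to mask and a row-level filter."""
    masked_columns: set
    row_filter: Callable[[dict], bool]


def apply_policy(rows, policy):
    """Return only the permitted rows, with masked columns replaced by '***'."""
    visible = []
    for row in rows:
        if not policy.row_filter(row):
            continue  # row-level security: drop rows the role may not see
        visible.append({
            col: ("***" if col in policy.masked_columns else value)
            for col, value in row.items()
        })
    return visible


# Made-up sample data standing in for a file in object storage.
ORDERS = [
    {"customer": "Acme", "region": "EU", "card_number": "4111-0000", "total": 120},
    {"customer": "Globex", "region": "US", "card_number": "5500-0000", "total": 80},
]

# Hypothetical role: EU analysts see only EU rows and never see card numbers.
EU_ANALYST = Policy(
    masked_columns={"card_number"},
    row_filter=lambda row: row["region"] == "EU",
)

eu_view = apply_policy(ORDERS, EU_ANALYST)
```

The point of the sketch is that the policy is enforced per row and per column, which is finer than the bucket-level or file-level permissions an object store typically offers on its own.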
10. Adding an MPP engine to the Denodo Platform
[Diagram: the Logical Layer over Traditional DB & DW, Cloud, Excel, and the Lake filesystem (S3/ADLS); labels: Lake Engine, MPP Engine]
11. How does it work?
• Easy, efficient MPP access to content in the object storage
• No need for an additional external engine
• Integrated security and management
• Out-of-the-box MPP options for caching and query acceleration
[Diagram: Logical Layer with MPP Coordinator, four MPP workers, Object Storage]
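The coordinator and worker arrangement can be illustrated with a toy scatter-gather aggregation. This is a sketch of the general MPP pattern only, under the assumption that each worker reads its own partition of the data. It is not how Denodo or Presto is implemented, local threads stand in for worker nodes, and the names `partial_aggregate`, `coordinator_average` and `PARTITIONS` are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(partition):
    """Worker step: compute a (sum, count) pair for one partition."""
    return sum(partition), len(partition)


def coordinator_average(partitions, workers=4):
    """Coordinator step: fan partitions out to workers, then merge the partials."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_aggregate, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    # An average cannot be merged directly, so workers return sum and count.
    return total / count if count else None


# Made-up partitions, as if each were a separate file in object storage.
PARTITIONS = [[10, 20, 30], [40, 50], [60], [70, 80, 90, 100]]

average = coordinator_average(PARTITIONS)
```

Each worker touches only its own partition and returns a small partial result, so the coordinator merges a handful of pairs instead of scanning all the data itself.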
12. How does it work?
[Screenshots: Object Storage configuration; Object Storage browsing]
• Automated deployment using Kubernetes and Helm charts
• Integrated configuration
• Graphical browsing and introspection of object storage
14. Move Non-Critical Workloads to Cheaper Systems
• Separation of compute and storage means that the same data and queries can be computed with other engines with minimal changes
• Denodo includes the tools to move data and keep it updated when needed
• A logical layer means that the change is transparent for consumers
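Why a logical layer makes the move transparent can be shown with a small indirection sketch: consumers query a stable view name, and only the binding behind that name changes. Everything here is hypothetical (the `LogicalLayer` class, the `sales` view, and both backend functions) and is meant as an illustration of the idea rather than of Denodo's interfaces.

```python
class LogicalLayer:
    """Toy catalog mapping a stable view name to whichever backend serves it."""

    def __init__(self):
        self._bindings = {}

    def bind(self, view_name, fetch_rows):
        """Point a view at a backend; fetch_rows is a zero-argument callable."""
        self._bindings[view_name] = fetch_rows

    def query(self, view_name):
        """Consumers pass only the view name and never see the backend."""
        return self._bindings[view_name]()


def warehouse_sales():
    # Stand-in for rows served by the original, more expensive system.
    return [{"sku": "A1", "units": 3}, {"sku": "B2", "units": 5}]


def lake_sales():
    # Stand-in for the same rows, now served from files in object storage.
    return [{"sku": "A1", "units": 3}, {"sku": "B2", "units": 5}]


def total_units(layer):
    """Consumer code: identical before and after the migration."""
    return sum(row["units"] for row in layer.query("sales"))


layer = LogicalLayer()
layer.bind("sales", warehouse_sales)
before = total_units(layer)

layer.bind("sales", lake_sales)  # the workload moves; the consumer is untouched
after = total_units(layer)
```

The consumer function `total_units` is never edited, which is the sense in which the change is transparent.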
15. Cheap storage for backup, old or rarely used data
• Object storage is a great option for data that is rarely used but needs to be stored for backup or compliance reasons
• This data can be exported to Parquet and moved to the object storage
• Denodo can automatically map this data and make it accessible at no additional cost
16. Ingest 3rd party data
• An object store that our partners can access is a great way to bring third-party data into the organization
• Data can be in Parquet, but also in JSON, CSV or even Excel
• Denodo can automatically map it
• And it provides the right tools to massage the data and load it into the corporate systems on a periodic basis
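As a rough illustration of what mapping and massaging partner files involves, the sketch below reads one CSV file and one JSON file and normalizes both to a single schema. It uses only the Python standard library and does not represent Denodo's mapping feature. The partner field names, the `FIELD_MAP` table and the target columns `product_id` and `quantity` are assumptions invented for the example.

```python
import csv
import io
import json

# Hypothetical mapping from each partner's field names to the corporate schema.
FIELD_MAP = {
    "sku": "product_id",
    "ProductId": "product_id",
    "qty": "quantity",
    "Quantity": "quantity",
}


def rows_from_csv(text):
    """Parse CSV text with a header row into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def rows_from_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


def normalize(row):
    """Rename known fields and coerce quantity to int (CSV values are strings)."""
    out = {FIELD_MAP[key]: value for key, value in row.items() if key in FIELD_MAP}
    out["quantity"] = int(out["quantity"])
    return out


# Made-up file contents, as two partners might drop them into a shared bucket.
PARTNER_A_CSV = "sku,qty\nA1,3\nB2,5\n"
PARTNER_B_JSON = '[{"ProductId": "C3", "Quantity": 7}]'

combined = [
    normalize(row)
    for row in rows_from_csv(PARTNER_A_CSV) + rows_from_json(PARTNER_B_JSON)
]
```

After normalization, rows from both partners share the same column names and types and can be loaded into a corporate system with a single code path.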
17. Data Science Playground
• Denodo provides SQL access to any company data asset
• This data can be easily moved into the object storage, where the MPP engine can efficiently process it for deeper analysis
• Denodo offers native Python drivers and is compatible with popular data science toolkits (e.g. pandas) and tools (R, Dataiku, etc.)
• Additionally, a data scientist may prefer to export content to a Parquet file and connect directly to that file from a different platform, like Databricks
18. Conclusions
1. Object storage technologies, especially in the cloud (S3, ADLS, etc.), offer a very attractive and flexible way to store very large data volumes at low cost
2. New-gen MPP engines provide efficient processing capabilities for data stored in object storage, especially when formats like Parquet are used
3. A logical layer, like Denodo, provides the additional security, governance and data integration capabilities needed to safely introduce an object-storage-based data lake into your data strategy
19. Fireside Chat: Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
Pablo Alvarez, Global Director of Product Management, Denodo
Alberto Pan, CTO, Denodo
21. Next Steps
Access Denodo Platform in the Cloud.
Start your Free Trial today!
GET STARTED TODAY
www.denodo.com/free-trials
Logical Data Fabric
A Technical Whitepaper
DOWNLOAD WHITEPAPER