2. Our 40 Minutes
Today
• Drivers for Data Warehouse Modernization
• What is a Modern Data Warehouse
• Challenges for implementing a Modern Data
Warehouse
• Driving adoption and usage within the enterprise
• Measuring success factors and ROI
3. Data Warehouse Modernization – Drivers
Optimize Existing DW/BI Infrastructure or Create New
Capabilities
Handle Big Data and the 3 V’s
• Volume, Variety, Velocity
Integrate Multiple Data Silos
• ERP, CRM, HRM and others
Reduce Cost
• ETL process
• Analytical process
• Mainframe process
• Cloud feasibility for data analytics
Applying Science
• Unstructured data for enhancing analytics
• Data Science for advanced analytics
Reduce Time to Market by Faster Processing Analytics
4. Blueprint of a Modern Data Warehouse with Hadoop
The enterprise data warehouse (EDW) and Hadoop based warehouse would co-exist to allow the
enterprise to leverage the strengths of each architecture.
[Diagram labels]
• Data sources: Structured, Unstructured, External, Social, Machine, Geospatial, Time Series, Streaming
• Landing and ingestion
• Enterprise Data Lake
• Provisioning, Workflow, Monitoring and Security
• Applications: Real-Time applications, Predictive applications, Exploration & discovery, Enterprise applications
• Traditional data repositories: RDBMS, MPP
5. Key Challenges for Modernization
“Through 2018, 90% of modernized warehouses will be useless
as they are overwhelmed with information assets captured for
uncertain use cases”
6. Key Challenges for Modernization
“Visual data-discovery, an important enabler of end user self-service,
will grow 2.5x faster than the rest of the market, becoming by 2018 a
requirement for all enterprises.”
Making insights and data in the warehouse readily discoverable, accessible and
usable
7. Key Challenges for Modernization
Data in the Lake has to be Smart
Smart data is the opposite of “Dumb” data, which is
• hard to find
• hard to understand
• hard to combine
Rethink the information plumbing
• Supplement first, transform later
• Maximize ROI by protecting investments
Rethink ETL: lightweight data blending tools that can allow for data wrangling when the business cannot wait
8. Key Challenges for Modernization
“By 2017, most business users and analysts in organizations will
have access to self-service tools to prepare data for analysis”
“Managed BI Self-Service Will Continue to Close
the Business and Technology Gap.”
Self Service BI over Hadoop
9. Top 3 Tactics for Modernization
PRE-PROCESSING
• Using big data capabilities as a “landing zone” before determining what data should be moved to the data warehouse
OFFLOADING
• Moving infrequently accessed data from data warehouses into enterprise-grade Hadoop
• Moving associated workloads to be serviced from Hadoop
EXPLORATION
• Using big data capabilities to explore and discover new high value data from massive amounts of raw data
10. Bridging the Gap for Business User
• Barriers to adoption: Complex, slow, needs expertise
Kyvos Solution: Build a BI Consumption Layer on your Data Lake
• Enable business users to explore data visually and interactively
• No waiting for reports
• Self service – no learning curve
• No need to move data out of Hadoop
• Eliminate scalability restrictions for BI
• Drill down to lowest levels of granularity
12. BI Consumption Layer – Secure, Scalable Access for All Users
• Fine-grained access control
• Row and column level security
• Integration with Kerberos, LDAP,
Active Directory
• Integration with security frameworks
• Role based access control
• Support for third party encryption
tools
• Support for single sign-on
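Row and column level security, as listed above, can be illustrated with a minimal sketch. This is not the Kyvos implementation: the Policy class, the apply_policy function, and the sample trade records are all hypothetical names and data chosen for illustration. The idea is only that a role's policy decides which rows a user may see and which columns of those rows are returned.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass
class Policy:
    """Hypothetical access policy for one role (illustrative only)."""
    visible_columns: set[str]          # column-level security
    row_filter: Callable[[Row], bool]  # row-level security


def apply_policy(rows: list[Row], policy: Policy) -> list[Row]:
    """Return only the rows and columns that the policy allows."""
    return [
        {col: val for col, val in row.items() if col in policy.visible_columns}
        for row in rows
        if policy.row_filter(row)
    ]


# Made-up sample data, not taken from the presentation.
trades = [
    {"trade_id": 1, "desk": "rates", "notional": 5_000_000, "trader": "a.smith"},
    {"trade_id": 2, "desk": "fx", "notional": 1_200_000, "trader": "b.jones"},
]

# An analyst who may see only the rates desk, and never the trader column.
rates_analyst = Policy(
    visible_columns={"trade_id", "desk", "notional"},
    row_filter=lambda row: row["desk"] == "rates",
)

visible = apply_policy(trades, rates_analyst)
```

In a real deployment the policy would typically be resolved from the user's role in LDAP or Active Directory rather than constructed inline.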
13. Use Case: Investment Bank Risk Analysis
[Architecture diagram labels]
• Excel, Spotfire, ICE JAVA APP
• MDX Clients
• Hive, Impala, SQL Server/ SSAS, Spark
• Jacobian Transformation (Scala)
• Other Transformations (Java / Scala)
• HDFS
Business Need
• Evaluate risk across all asset classes
• Deliver interactive access at massive scale
• Interface with Spotfire and in-house apps
• Reduce time to market
Challenges
• DATA SILOS – Teradata, SQL Server, and HDFS
• BIG DATA
• Data too large to look at all asset classes across desired time period
• 700 M transactions per day
• WEEKS – time to get results
• SLOW – response time to queries
14. Use Case: Investment Bank Risk Analysis
[Architecture diagram labels]
• Excel, Spotfire, ICE JAVA APP
• MDX Clients
• KYVOS, Spark
• Jacobian Transformation (Scala)
• Other Transformations (Java / Scala)
• HDFS
Solution Highlights
• One OLAP / caching layer for all three UIs: Excel, Spotfire, In-house
• Consolidated view of all asset classes
• Drill down to trade level – never possible before
Results Obtained
• 20-day trend of risk – not achievable with previous Hive or Impala solutions
• Daily updates of cubes
• Reduced time to market: eliminated need to move data to SSAS
• Interactive response times for users, even at massive scale
• No learning curve: support for all business UIs
15. Evaluation Criteria
• Can it deal with the scalability and granularity needed?
• How does it perform with “cold” queries for ad-hoc analysis?
• How efficiently does it deal with “warm” or repeated queries?
• Can business users access data seamlessly with their BI tools?
• Can diverse data sets be transformed and combined with no coding?
• Can it deal with incremental data updates efficiently?
• Can it deal with concurrent access without significant degradation?
• Is it enterprise ready to support availability and security requirements?
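The “cold” versus “warm” query questions above can be checked with a small timing harness. The sketch below is one possible approach, not a benchmark from the presentation. The time_queries function is a hypothetical helper, and fake_engine is a stand-in that simulates a caching engine with a sleep; in practice run_query would call the BI layer under evaluation.

```python
import time
from typing import Callable


def time_queries(run_query: Callable[[str], object], query: str,
                 repeats: int = 3) -> dict[str, float]:
    """Time the first ("cold") run, then the mean of repeated ("warm") runs."""
    start = time.perf_counter()
    run_query(query)
    cold = time.perf_counter() - start

    warm_times = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(query)
        warm_times.append(time.perf_counter() - start)

    return {"cold_s": cold, "warm_mean_s": sum(warm_times) / len(warm_times)}


# Stand-in engine (assumption): the first call is slow, later calls hit a cache.
_cache: dict[str, int] = {}


def fake_engine(query: str) -> int:
    if query not in _cache:
        time.sleep(0.05)            # simulate scanning raw data
        _cache[query] = len(query)  # placeholder "result"
    return _cache[query]


timings = time_queries(fake_engine, "SELECT desk, SUM(notional) FROM trades GROUP BY desk")
```

Running the same harness with several concurrent callers would extend it to the concurrency question in the list.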
16. Measuring ROI
• Reduction in time to market
• Reduction in development time
• Increased business user productivity
• Reduced latency – reduced number of “hops” or diverse systems supported
• Reduced operational costs
• Top-line benefits of insights that were not possible before
17. Visit us at Booth 1105
ajay.anand@kyvosinsights.com
vineet.tyagi@impetus.com
Q&A
Editor's Notes
In a real world scenario, enterprise data warehouse (EDW) and Hadoop based warehouse would co-exist to allow the organization to leverage the strengths of each architecture to its advantage.
The Data Lake will have increasing amounts of data ingested at scale; if users don’t know what is available, it will be useless.
They need to find it by different means, such as searches, with full governance, not canned queries. Discovery can cover both the data and its context and services.
When they find something they are interested in, they should immediately be able to get it within the bounds of governance and work with it. Discoverability and accessibility have to go hand in hand: each builds on the other, and neither is usable without the other.
Hard to find: Dumb data requires that we know the exact location of a particular piece of information we’re interested in. We may need to know a specific part number that acts as a primary key in a database or in a Hadoop cluster, or we may need to know particular internal IDs used to identify the same employee in three different systems. To cope with this, we wrap dumb data with basic keyword search or with canned queries—solutions that help us retrieve known data but don’t help us ask new questions or uncover new information.
Hard to combine with other data: Dumb data is very provincial. It has identity and meaning within the confines of the particular silo in which it was created. Outside of that silo, however, dumb data is meaningless. An auto-incrementing integer key that uniquely identifies a customer within a CRM system is highly ambiguous when placed in the same context as data from a dozen other enterprise apps. A short text string such as “name” used to identify a particular data attribute within a document store such as MongoDB may collide with different attributes from other big data stores, databases, or spreadsheets when let loose in the wild.
Hard to understand: Even once we find relevant information, we’re limited in our ability to understand dumb data as it is generally not well-described. Dumb data is described by database, table, and column names or document and key identifiers that are often short, opaque, and ambiguous outside of the context of a specific data store. For decades, we’ve been dealing with this by building software that has hardcoded knowledge of what data is in which column in which database table. Hardcoding this knowledge into every software layer from the query layer through to the business logic and all the way up to the user interface makes software very complex. Complex software is prone to bugs and is expensive and time-consuming to change, compromising our ability to deliver the most up-to-date and relevant data to business decision makers in a timely manner.
Because most data is hard to find, combine with other data, and understand, its value ends up limited. The effort and cost needed to effectively use dumb data to drive business decisions is so high that we only use it for a few business problems, in particular those with static and predictable requirements. We might deploy a traditional BI tool to track and optimize widget sales by region, for instance, but we’re not able to apply the same analytic rigor to staffing client projects, understanding competitors’ strategies, providing proactive customer support, or any of hundreds of other day-to-day business activities that would benefit from a data-driven approach.
If data is dumb, then big data is very dumb. With Hadoop and other big data infrastructure, we now have the tools to collect data at will in volumes and varieties not previously seen. However, this only exacerbates the challenges of finding, combining, and understanding the data we need at any given time.
Add meaning
The first step is to add meaning to your data by richly describing both the entities within your data and the relationships between them. Equally important to describing the meaning of data is where the meaning is described. Dumb data often has its meaning recorded via data dictionaries, service interface documents, relational database catalogs, or other out-of-band mechanisms. To make data smarter, don’t rely on the meaning of the data to be hardcoded within software; instead, link the meaning of the data directly to the data itself.
There are several ways to describe data’s meaning, and the richer the description you choose, the smarter your data becomes. Data’s meaning can include:
• Controlled vocabularies that describe the acceptable values of an attribute
• Taxonomies that capture hierarchical relationships within the data
• Schemas that communicate data types, cardinalities, and min/max value ranges
• Ontologies that declaratively represent rich semantics of data such as transitive relationships, type intersections, or context-sensitive value restrictions
There are two benefits of adding meaning to your data. First, software can respond appropriately (for example, by performing data validation or automatically choosing the right UI widget) to the meaning of different data sets without having to be customized for each. Second, rich data descriptions attached to data can empower business experts to manipulate data themselves without relying on scarce IT or data science personnel for every new dashboard, visualization, or analysis.
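As a minimal sketch of the first benefit, the example below attaches a description, a data type, a controlled vocabulary, and min/max ranges to each attribute, so that one generic validator works for any data set. AttributeSpec, validate, and the customer attributes are hypothetical names chosen for illustration; a real deployment might express the same descriptions in a standard such as OWL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttributeSpec:
    """Machine-readable meaning for one attribute (illustrative only)."""
    description: str
    dtype: type
    allowed: Optional[frozenset] = None  # controlled vocabulary
    minimum: Optional[float] = None      # schema value range
    maximum: Optional[float] = None


def validate(record: dict, schema: dict[str, AttributeSpec]) -> list[str]:
    """Generic validation driven by the schema, not by hardcoded knowledge."""
    errors = []
    for name, spec in schema.items():
        if name not in record:
            errors.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, spec.dtype):
            errors.append(f"{name}: expected {spec.dtype.__name__}")
            continue
        if spec.allowed is not None and value not in spec.allowed:
            errors.append(f"{name}: not in controlled vocabulary")
        if spec.minimum is not None and value < spec.minimum:
            errors.append(f"{name}: below minimum {spec.minimum}")
        if spec.maximum is not None and value > spec.maximum:
            errors.append(f"{name}: above maximum {spec.maximum}")
    return errors


# Hypothetical schema for a customer record.
customer_schema = {
    "region": AttributeSpec("Sales region", str,
                            allowed=frozenset({"EMEA", "APAC", "AMER"})),
    "age": AttributeSpec("Customer age in years", int, minimum=0, maximum=130),
}

ok = validate({"region": "EMEA", "age": 42}, customer_schema)
bad = validate({"region": "Mars", "age": -1}, customer_schema)
```

Because the rules live with the data description, a new data set needs a new schema entry rather than new validation code.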
Add context
The Sisyphean pursuit of a generally unreachable “single version of the truth” belies the importance of context in making data smart enough to be discovered and understood by business decision makers. The lack of context makes data unreliable and hard to trust and decreases the chance that decision makers will rely on it.
Just as meaning is traditionally captured separately from data, so, too, is contextual metadata usually divorced from the data it describes. To make data smarter, you must treat metadata as data. This means directly capturing and maintaining simple metadata such as the author or creation time of a piece of data, but it also means linking data to its full lineage, including the source of the data (e.g., a particular enterprise database, document, or social media posting) and any transformations on the data. Context can also include probability, confidence, and reliability metadata. Finally, data’s context might involve domain-specific attributes that limit the scope in which a particular piece of data is true, such as the time period for which a company had a particular ticker symbol or the position a patient was in when a blood pressure reading was obtained.
By representing contextual metadata alongside the data itself, users can query, search, and visualize both at once. There’s no need to create separate, time-consuming data-load processes that select data from a particular time period or author. There’s no need to login to separate applications to verify the trustworthiness of data within a business intelligence dashboard.
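The ticker symbol example can be sketched as follows, assuming a simple in-memory list of facts. The Fact class, the value_as_of function, the company “ExampleCo”, and its ticker symbols are all made up for illustration. Each value carries its source, author, lineage, and validity period, so a single query filters on data and context together.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """A data value stored together with its contextual metadata."""
    subject: str
    attribute: str
    value: object
    source: str                       # where the value came from
    author: str                       # who or what recorded it
    valid_from: date                  # start of the period in which it is true
    valid_to: Optional[date] = None   # None means still valid
    lineage: tuple[str, ...] = ()     # transformations applied so far


def value_as_of(facts: list[Fact], subject: str, attribute: str,
                on: date) -> Optional[Fact]:
    """Return the fact that was true for the subject on the given date."""
    for fact in facts:
        if (fact.subject == subject and fact.attribute == attribute
                and fact.valid_from <= on
                and (fact.valid_to is None or on < fact.valid_to)):
            return fact
    return None


# Hypothetical company and ticker symbols.
facts = [
    Fact("ExampleCo", "ticker", "EXA", source="reference_db", author="nightly_load",
         valid_from=date(2010, 1, 1), valid_to=date(2015, 6, 1),
         lineage=("extracted from reference_db", "upper-cased")),
    Fact("ExampleCo", "ticker", "EXC", source="reference_db", author="nightly_load",
         valid_from=date(2015, 6, 1)),
]

hit = value_as_of(facts, "ExampleCo", "ticker", date(2012, 3, 1))
```

The returned Fact exposes hit.source and hit.lineage next to hit.value, which is the point of keeping metadata with the data.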
However, we work in an increasingly interconnected world, and we can’t ignore the edges of our big data clouds -- the points at which we exchange data with our supply chain partners, regulators, and customers.
Adopting standards is critical to enabling reuse of data on these edges and to avoiding the overhead of classic point-to-point data translations that traditionally consume substantial resources when exchanging data with third parties. Smart data standards come in two varieties:
• Industry standards such as the Financial Industry Business Ontology (FIBO), CDISC in pharma, or HL7 in healthcare. These standards capture the meaning of data across an industry and ensure mutual understanding between organizations.
• Technology standards such as the semantic Web standards RDF, OWL, and SPARQL. These standards provide an agreed-upon way to model and describe the flexible, context-rich, data graphs that form the foundation of smart data.
Self-Service Business Intelligence puts the power of analytics in the hands of end users to create their own reports and analysis of the data sets they want, on an as needed basis. The goal is to utilize data wrangling / blending and other capabilities to reduce IT’s involvement and expedite information to business users by delivering what Gartner refers to as “faster, more user-friendly and more relevant BI.”
It is an evolutionary paradigm that does not indulge the IT vs. business divide. IT gets to play the enabler here: reinforcing governance, structuring user autonomy, accounting for user differentiation, and transforming its role from serving the business to offering cross-functional support. It is important to realize that self-service BI should not be considered a replacement for traditional BI tools and warehousing. By utilizing a hybrid approach of centralized and decentralized models and restructuring the organization accordingly, self-service BI functions best as a supplement to the conventional methods, one in which data is accessed more expediently and put in the hands of those who need it most.
Pre-Processing
Using big data capabilities as a “landing zone” before determining what data should be moved to the data warehouse
Provide a landing zone for all data.
Persist the data to provide a queryable archive of cold data.
Offloading
Moving infrequently accessed data from data warehouses into enterprise-grade Hadoop
Moving associated workloads to be serviced from Hadoop
Leverage Hadoop’s large-scale batch processing efficiencies to preprocess and transform data for the warehouse.
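One way to pick offloading candidates is to rank warehouse partitions by how long ago they were last accessed. The sketch below assumes that last-access dates and sizes are available from the warehouse's own usage statistics. The Partition class, the offload_candidates function, the 180-day threshold, and the sample partitions are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Partition:
    table: str
    name: str
    last_accessed: date
    size_gb: float


def offload_candidates(partitions: list[Partition], today: date,
                       cold_after_days: int = 180) -> list[Partition]:
    """Return partitions not accessed within the threshold, largest first."""
    cutoff = today - timedelta(days=cold_after_days)
    cold = [p for p in partitions if p.last_accessed < cutoff]
    return sorted(cold, key=lambda p: p.size_gb, reverse=True)


# Made-up usage statistics for illustration.
partitions = [
    Partition("trades", "2014-01", date(2014, 1, 15), 120.0),
    Partition("trades", "2015-03", date(2015, 3, 10), 300.0),
    Partition("trades", "2016-05", date(2016, 5, 20), 80.0),
]

candidates = offload_candidates(partitions, today=date(2016, 6, 1))
```

Sorting by size puts the partitions that free the most warehouse capacity at the top of the list.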
Exploration
Using big data capabilities to explore and discover new high value data from massive amounts of raw data and free up the data warehouse for more structured, deep analytics
Enable an environment for ad hoc data discovery.