Theoretical assessment of the possibility of migration from a relational to a NoSQL database. This presentation is a short summary of the internship done during the master program.
Migration of a relational database to a NoSQL store
1. An assessment of the migration from a relational database to a NoSQL store
September 1st, 2017, presented by Markiyan RIZUN (mrizun@gmail.com)
2. Outline. Contents of the presentation
I – Internship context
II – Problem & Objective
III – Background
IV – The Migration
V – The Analysis & Results
VI – Conclusion
3. I – Internship Context
SOFTEAM R&D department + Modelio software
6. II – Problem & Objective
“To migrate, or not to migrate, that is the question”
– William Shakespeare, Hamlet
7. Problem definition
Migration
“What really matter are the business requirements and access/write pattern of the
applications.”
– Chongxin Li
If to migrate? How to migrate?
• data
• queries
• structure
8. Objective definition
1. Discover the “symptoms” of a relational database that indicate the need for migration to NoSQL.
2. Propose a guideline for the migration that allows generating a NoSQL database adapted to the usage of the source relational database.
9. III – Background
A brief overview of the relational and NoSQL DBMSs*
*DBMS – database management system
10. Relational approach. Overview
• data is stored in interconnected relations (tables)
• strictly defined database schema
• introduced by E. F. Codd in 1969
• one query language (SQL) for all DBMSs
• relies on principle of normalization
12. Relational approach. Normalization
+ no data duplication
+ consistent data
+ better flexibility
+ simplified design
+ saved space*
- slow read performance
- poor horizontal scalability
“Normalization is far from being a panacea.”
– Christopher J. Date
*according to http://www.mkomo.com/cost-per-gigabyte-update storage cost is dropping rapidly, so nowadays this benefit is almost irrelevant
13. Relational approach. Denormalization
- data duplication
- inconsistent data
- worse flexibility
- messy design
- requires more space
+ fast read performance
+ better horizontal scalability
Rapid querying and an ability to scale are critical for distributed systems.
14. NoSQL* approach. Overview
• schema-free
• non-relational
• distributed
• horizontally scalable
• no common querying language
*NoSQL – Not Only SQL
17. Relational & NoSQL. Conclusion
1. NoSQL is designed for modern large-scale distributed applications that work with big volumes of unstructured data.
2. NoSQL is not a replacement for the relational approach, but an alternative.
18. IV – The Migration
Review of the existing migration approaches
20. The migration. Methods
• manual control over denormalization of database – manual, without guidelines
• preservation of an equivalent normalized database – not adapted to NoSQL data model
• full denormalization of database – not adapted to specific database
• heuristic-based approach to creation of database – simplistic, manual
None of the methods considers actual database usage.
“What really matter are the business requirements and access/write pattern of the applications.”
– Chongxin Li
21. V – The Analysis & Results
Answering the migration questions: “If?” and “How?”
22. If to migrate? Discovering the “symptoms”
• denormalization may be a solution – fast querying / better horizontal scaling
• typically a relational database is normalized – slow querying / poor horizontal scaling
• denormalization is unnatural for relational design – slow writes / NULL values
Denormalization is an attempt to artificially approximate the structure and characteristics of a relational database to NoSQL.
Denormalization motivates and simplifies the migration to NoSQL.
23. How to migrate? Defining important information
“What really matter are the business requirements and access/write pattern of the applications.”
– Chongxin Li
• dynamic information – monitoring and logging of real database activity, analysis of logged information
• static information – considering indexes, views and procedures
24. How to migrate? Proposing guidelines
• frequency of usage, execution speed – help avoid unnecessary remodeling, highlight important access patterns
• join operations – signify real database access patterns
26. The migration. Summary
If to migrate?
• pre-denormalized relational database – a “symptom” that signifies necessity / possibility to migrate
How to migrate?
• static database information analysis – potential access patterns revealed by indexes and procedures
• dynamic database information analysis – real access patterns revealed by join operations, queries’ speed & frequency
27. The migration. Modelio implementation*
Meta model
• document-oriented NoSQL store meta model – for MongoDB and Elasticsearch
Features
• automatic model generation – generation of Modelio database model from source database
• automatic Java code generation – generation of Java source code from Modelio model
*video tutorial is available at: https://youtu.be/wPDxk0YeTmw
29. Foundation. What do we have so far?
Theoretical result
• “symptoms” of a database to migrate were discovered
• hypotheses for proper migration were proposed
Practical result
• meta model of NoSQL document-oriented store and related functionality*
*for details, see the report
30. The real work. What will we do in the future?
• further developing the theoretical analysis
• continuing to study the question of migration
• confirming our hypotheses in practice
• implementing the migration in practice
to be done during the PhD
The topic of my internship is “An assessment of the migration from a relational database to a NoSQL store”. In a nutshell, during the internship I analysed existing migration techniques and looked for ways to improve the migration process.
The presentation is structured as follows. First, I will briefly present the company and the tool that I used for the development. Then I will define the problem and our objective. Next, I will talk a bit about the background of my work, in particular about relational and NoSQL database management systems. After that, I will review the existing migration techniques. Finally, we will arrive at the key part of the work: the analysis and its results. At the end, I will draw a short conclusion and discuss future work.
A bit about where I worked and what I was doing.
I was working as an intern in the R&D department of the company called Softeam, which is located in the Parisian region.
Throughout the research I used Modelio, a modelling tool. For example, one can create a UML diagram, a Java project model or a model of a relational database. The models can be built manually or generated automatically from source elements, for example a database.
One of the most exciting events during the internship was our trip to Helsinki, Finland. Andrey SADOVYKH (my supervisor) and I went there for a meeting of the DataBio EU project, in which Softeam participates. There we discussed the contributions of the company to the project. Additionally, I presented my research work to representatives of different companies. Besides the work, we had free time to see the country, which was great!
In this section I will describe the problem that we are facing and define our objectives.
Nowadays, many companies wish to migrate their databases from the relational approach to NoSQL. I will explain these two approaches in a minute. Migration usually means transforming the database structure, transferring the data and modifying the queries. It is very costly and complex, and there are no rules on how it has to be executed. To address this, many migration methods have been proposed. However, none of them defines the conditions that would indicate the need to migrate. So the first problem is to understand “if to migrate?”. Also, most of the existing methods are simplistic: during the structure transformation they do not attempt to adapt the relational database to the principles of the chosen NoSQL store. Nor do they consider the actual usage of the database (e.g., queries, indexes, views), despite the fact that “What really matter are the business requirements and access/write pattern of the applications.” Therefore, the second issue is to comprehend “how to migrate?” so that the resulting database is well adapted.
Therefore, the objective of our research is to answer these two questions and, as a result:
• discover the “symptoms” of a relational database that might indicate the need for migration to NoSQL;
• propose a guideline for the migration that allows generating a NoSQL database adapted to the usage of the source relational database.
Since I am talking about migration from a relational to a NoSQL database, I will briefly discuss these two database management approaches, starting with the relational approach.
The relational approach was introduced almost half a century ago, in 1969, by Edgar Codd, so its principles are well established. Data is stored in tables that are connected by relationships. The structure of the tables, as well as their relationships, is strictly defined and therefore always known in advance. The relational approach is well known for its query language (SQL), which works with any relational system. Finally, the most interesting point for us in the context of the migration is the principle of normalization, on which relational database design relies heavily. Here are some examples of relational database management systems.
The idea of normalization is to reorganize the structure of a database in order to achieve full data consistency. Normalization involves decomposing tables into more tables and creating new relationships between these tables. A properly normalized database eliminates data redundancy, meaning that no data is duplicated. This avoids modification anomalies, such as having inconsistent data for the same entity in two different tables.
So, to sum up, normalization gives the following benefits…
Nevertheless, although normalization is the recommended approach for relational database design, it has two major flaws. As the information about a single entity is scattered across many tables, it causes 1) slow read queries and 2) poor scalability (that is, scalability over a distributed system). These two factors are very important for modern applications such as cloud applications.
Of course, for these reasons there exists an opposite principle called denormalization. Its disadvantages mirror the benefits of normalization, such as…
In contrast to normalization, the read performance is much faster and horizontal scalability is improved too. These two features are crucial for supporting distributed systems.
There are many different denormalization strategies that are discussed in the report, but we will not talk about them in this presentation.
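To make the trade-off concrete, here is a minimal sketch in Python using the standard sqlite3 module. The customer and purchase tables and their contents are hypothetical and are not taken from the report. It shows a normalized read that needs a join, a denormalized copy that is read without a join, and the inconsistency that duplication makes possible.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE purchase (id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(id),
                           item TEXT);
    INSERT INTO customer VALUES (1, 'Alice', 'Paris');
    INSERT INTO purchase VALUES (10, 1, 'book'), (11, 1, 'pen');
""")

# Normalized read: the data about one customer is scattered, so a join is needed.
normalized_rows = conn.execute("""
    SELECT c.name, c.city, p.item
    FROM customer AS c JOIN purchase AS p ON p.customer_id = c.id
    WHERE c.id = 1
    ORDER BY p.id
""").fetchall()

# Denormalized table: one row per purchase, customer columns duplicated.
conn.executescript("""
    CREATE TABLE purchase_flat AS
    SELECT p.id AS purchase_id, c.name, c.city, p.item
    FROM customer AS c JOIN purchase AS p ON p.customer_id = c.id;
""")

# Denormalized read: no join, but the name and city are stored twice.
flat_rows = conn.execute(
    "SELECT name, city, item FROM purchase_flat ORDER BY purchase_id"
).fetchall()

# Updating the city in only one duplicated row leaves inconsistent data.
conn.execute("UPDATE purchase_flat SET city = 'Lyon' WHERE purchase_id = 10")
cities = {row[0] for row in conn.execute("SELECT city FROM purchase_flat")}
```

Both reads return the same rows, but after the partial update the flat table holds two different cities for the same customer, which is the kind of anomaly normalization is meant to prevent.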
Normally, with quite a few exceptions, NoSQL (meaning Not Only SQL) systems satisfy the following list.
All of them are non-relational, and one could argue that this is the main difference.
There is no strict structure definition, which is also known as being schema-free. That is why NoSQL databases can work with unstructured data and are flexible in that sense, unlike relational databases, which work with highly structured data.
They are distributed, meaning that they work on clusters of machines.
Therefore, they should be horizontally scalable. This means that, without restructuring a database, one can easily add a new node to the cluster in order to handle a growing volume of data.
Usually, they are open source.
Unfortunately, there is no language like SQL that would be common to all systems.
I will not explain each type, its data model or its characteristics in detail. What is very important to understand is that the NoSQL types propose different approaches to handling data, and each of them has its own strengths and weaknesses. For example, the key-value type has rapid query execution because of its data model, where each key points to a value, the actual data. However, although querying is really fast, it is poor in terms of flexibility: there is no possibility to filter data, only to search by key. To sum up, the idea is that one has to choose a NoSQL type that suits the task and the data. Sometimes it may be difficult to choose which one is best for your case, and other times it is beneficial to use several of them in one system.
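The key-value trade-off can be illustrated with a small Python sketch. A plain dict stands in for a key-value store here, and the keys and values are hypothetical. A real store adds distribution and persistence, but the access pattern is the same.

```python
# A plain dict stands in for a key-value store; keys and values are hypothetical.
store = {
    "user:1": {"name": "Alice", "city": "Paris"},
    "user:2": {"name": "Bob", "city": "Lyon"},
    "user:3": {"name": "Carol", "city": "Paris"},
}


def get(key):
    """Lookup by key: a single hash access, independent of the store size."""
    return store.get(key)


def find_by_city(city):
    """Filtering by value has no native support: every entry is scanned."""
    return sorted(k for k, v in store.items() if v["city"] == city)


alice = get("user:1")
parisians = find_by_city("Paris")
```

The lookup by key is the only operation the data model supports directly, whereas the filter has to walk through all the entries on the application side.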
And here are some examples of NoSQL systems. As you can see, there are a lot of alternatives to choose from, and each one is different from the others. They vary in NoSQL type and also in their specific implementation features. This makes it even harder to select the right system for your needs.
The main idea is that NoSQL is meant for modern large-scale distributed applications that work with massive volumes of unstructured data.
And of course, NoSQL is not a replacement for the relational approach, but an alternative to it.
Ok, so now let us move to the migration and existing approaches.
As a reminder, the migration of a relational database to NoSQL means the transformation of the structure, the transfer of data and the modification of queries and application code.
We classify existing methods into four categories. The first category focuses on preserving the relational database schema and, therefore, the database access patterns in queries and applications. Unfortunately, such a technique does not adapt the structure to the principles of NoSQL. The methods of the second group take a similar approach, i.e., they keep the database structure unchanged, but they additionally offer the possibility to manually adjust the structure of the target database. However, they never propose any guidelines for doing so. The third category goes in the opposite direction: full denormalization of the relational database. Hence the resulting structure adheres to some of the NoSQL design principles, but does not consider the specific needs of a concrete database. The techniques of the last category aim to define guidelines for transforming a relational database into a NoSQL database. However, the resulting heuristic is simplistic, as it resembles denormalization ideas, and, more importantly, it is completely manual.
NoSQL offers the same benefits as a denormalized RDBMS, plus fast writes, no NULL values and many other features such as a dynamic schema, automatic replication, etc.
The access patterns of a relational database and its business rules are the decisive factors for generating the target NoSQL database schema. In each specific case of migration, the NoSQL database schema and its denormalization level have to be adapted to its planned usage.
Static information (indexes, views and procedures) reveals typical or, at least, potentially frequent database access patterns.
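As a rough illustration of collecting static information, the sketch below reads index and view definitions from the catalog of a database. It assumes SQLite (through the standard Python sqlite3 module) and a hypothetical schema. SQLite has no stored procedures, so only indexes and views are covered, and other DBMSs expose this information through different catalog tables.

```python
import sqlite3

# Hypothetical source database; table, index and view names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE purchase (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    CREATE INDEX idx_purchase_customer ON purchase (customer_id);
    CREATE VIEW customer_purchases AS
        SELECT c.name, p.item
        FROM customer AS c JOIN purchase AS p ON p.customer_id = c.id;
""")


def static_information(conn):
    """Collect indexed columns and view definitions from the SQLite catalog."""
    indexes = {}
    index_rows = conn.execute(
        "SELECT name, tbl_name FROM sqlite_master "
        "WHERE type = 'index' AND sql IS NOT NULL"  # skip automatic indexes
    ).fetchall()
    for index_name, table in index_rows:
        # PRAGMA index_info yields (seqno, cid, column name) per indexed column.
        columns = [row[2] for row in conn.execute(f'PRAGMA index_info("{index_name}")')]
        indexes[index_name] = (table, columns)
    views = dict(conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'view'"
    ).fetchall())
    return indexes, views


indexes, views = static_information(conn)
```

Here the index on purchase.customer_id and the view joining customer with purchase both hint that purchases are often read per customer, which is a potential access pattern to consider when designing the target schema.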
Dynamic information shows actual access patterns – the real usage of a database.
For the analysis step we highlight the following information from the log: execution speed, frequency of usage and join operations. We provide guidelines for analysing this information, which make it possible to assess how to adapt a document-oriented NoSQL database schema.
Execution speed and frequency of usage of a query are critical factors for deciding whether the database schema needs to be adjusted to this query. They point out the queries that are worth paying attention to. If a query is slow and frequent, then clearly the structure of the database should be modified according to its access pattern. Generally, this information makes it possible to avoid unnecessary remodelling and highlights important access patterns.
Join operations signify database access patterns and therefore have a critical impact on the decision-making process when remodelling the database schema.
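The following Python sketch shows one way these three criteria could be combined. It is an illustration under assumptions and not the algorithm from the report: the log format, the thresholds min_count and min_mean_ms, and the naive regular expression that extracts table names are all hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical log entries: (SQL text, execution time in milliseconds).
query_log = [
    ("SELECT * FROM customer c JOIN purchase p ON p.customer_id = c.id", 120.0),
    ("SELECT * FROM customer c JOIN purchase p ON p.customer_id = c.id", 140.0),
    ("SELECT * FROM customer c JOIN purchase p ON p.customer_id = c.id", 130.0),
    ("SELECT name FROM customer WHERE id = 1", 2.0),
    ("SELECT * FROM purchase p JOIN product pr ON pr.id = p.product_id", 300.0),
]

# Naive extraction of table names; a real tool would use an SQL parser.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+(\w+)", re.IGNORECASE)


def analyse(log, min_count=2, min_mean_ms=50.0):
    """Return the joined table pairs of queries that are both frequent and slow."""
    stats = defaultdict(lambda: [0, 0.0])  # SQL text -> [count, total time]
    for sql, elapsed_ms in log:
        stats[sql][0] += 1
        stats[sql][1] += elapsed_ms

    candidates = {}
    for sql, (count, total_ms) in stats.items():
        mean_ms = total_ms / count
        tables = sorted({t.lower() for t in TABLE_PATTERN.findall(sql)})
        if count >= min_count and mean_ms >= min_mean_ms and len(tables) > 1:
            for pair in combinations(tables, 2):
                candidates[pair] = (count, mean_ms)
    return candidates


embedding_candidates = analyse(query_log)
```

With this sample log only the customer and purchase join is reported: the single-table query is fast, and the join with product is slow but was logged only once, so neither of them justifies remodelling under the assumed thresholds.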
Algorithm…
It is highly likely that the queries of a relational database exploit completely different access patterns, in which case no single database structure can make all of the queries efficient. Therefore, one can create several copies of the same data, each using a structure suited to a particular group of queries.
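A minimal Python sketch of this idea, with hypothetical rows and field names: the same source data is stored twice as document collections, each grouped for a different access pattern.

```python
from collections import defaultdict

# Hypothetical source rows, as a join of customer and purchase would return them.
rows = [
    {"customer": "Alice", "item": "book"},
    {"customer": "Alice", "item": "pen"},
    {"customer": "Bob", "item": "book"},
]


def group_by(rows, key, value):
    """Build one document per distinct `key`, embedding the matching values."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[value])
    return [{key: k, value + "s": v} for k, v in grouped.items()]


# Copy 1 answers "which items did a customer buy?" with one document read.
purchases_by_customer = group_by(rows, "customer", "item")
# Copy 2 answers "who bought a given item?" with one document read.
customers_by_item = group_by(rows, "item", "customer")
```

The price of this design is that every write has to update both copies, which is the data duplication drawback of denormalization discussed earlier.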
To conclude, our approach to the migration from a relational database to a NoSQL database consists of the following recommendations:
The migration should rely on a pre-denormalized relational database, which is also a “symptom” that signifies the necessity to migrate.
Adjust the NoSQL schema according to static database information (potential access patterns, indexing).
Remodel, if needed, according to the actual usage of the database (queries’ speed and frequency, real access patterns).
The theoretical results presented in this work, along with the implementation of the document-oriented database meta model, will serve as a strong foundation for our future research pursuits:
…
To do this, we plan to obtain access to a production database that a company would like to migrate. Such an opportunity would enable us to either support or refute our hypotheses. Moreover, access to a real database would allow us to analyse its actual usage and assess how to properly migrate it. At the moment, we are negotiating with several companies that are willing to migrate their legacy relational systems to modern NoSQL databases. We are discussing the possibility of getting access to their databases within the framework of the DataBio EU project, in which SOFTEAM participates.