So if we take our examples from the previous slide: Healthcare & Retail is mostly a batch-oriented process, while Location-based is mostly a real-time service. Each has specific requirements around how it uses and processes the data. Depending on how you want to use and process the data, you need to choose the proper technology to store and acquire it.
Given those scenarios, here's how the data might be stored and managed. HDFS is a great distributed file system: parallel and highly scalable. However, it's tuned primarily for bulk sequential reads and writes of file blocks. There are no indices for fast access to specific data records, and it's not well suited to lots of small files or to updating files that have already been written. It's primarily a batch system: write lots of data, then read it all in parallel, over and over. A NoSQL DB is a distributed key-value database. It has indices. It's designed for high-volume reads and writes of simple data. It's not tuned for reading or writing huge files; use a file system for that.
Bottom line: NoSQL is about "data management scalability at cost" first and foremost. There are technical features that also matter, but they come second. With enough effort (hardware and software) you can solve most of the technical problems with RDBMS systems. However, the whole reason NoSQL was invented was that it's too expensive to manage Big Data using general-purpose RDBMS systems. Regarding CAP (http://en.wikipedia.org/wiki/CAP_theorem): the CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency (all nodes see the same data at the same time), Availability (every request receives a response about whether it succeeded or failed), and Partition tolerance (the system continues to operate despite arbitrary message loss). According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three. RDBMS products focus on CA, whereas NoSQL products focus on AP.
Cox Communications: a 128-node Hadoop cluster with home-grown distributed key-value storage built on Berkeley DB. They would have used Oracle NoSQL DB if it had been available 2-3 years ago.
This slide shows the master-slave architecture of Oracle NoSQL DB. The master receives the write and asynchronously replicates the data to the other replica nodes.
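The flow above can be sketched as a toy model: the master applies a write locally and acknowledges the client, while replication to the replicas happens asynchronously in the background. This is purely illustrative (class names and queue mechanics are mine, not the Oracle NoSQL DB implementation).

```python
# Toy sketch of asynchronous master-replica replication.
import queue
import threading

class Replica:
    def __init__(self):
        self.data = {}
    def apply(self, key, value):
        self.data[key] = value

class Master:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.log = queue.Queue()
        threading.Thread(target=self._replicate, daemon=True).start()

    def put(self, key, value):
        self.data[key] = value      # committed locally on the master
        self.log.put((key, value))  # replication is deferred to background
        return "ack"                # client ack does not wait on replicas

    def _replicate(self):
        # Background thread ships each logged write to every replica.
        while True:
            key, value = self.log.get()
            for r in self.replicas:
                r.apply(key, value)
            self.log.task_done()

replicas = [Replica(), Replica()]
master = Master(replicas)
master.put("user/1", "alice")
master.log.join()  # for the demo, wait until replication has drained
```

The key point the sketch shows is that the client ack and the replica updates are decoupled; the real product additionally lets you tighten this with acknowledgement policies, covered later.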
Oracle NoSQL DB uses simple, understandable key-value pairs, simple get/insert/update/delete operations, and ACID transactions. This is different from SQL in an RDBMS, but the model and behavior are very familiar to application developers. Think of keys as a directory structure: multiple parts, allowing you to traverse the hierarchy. The Major Key determines where the data is stored (which shard). Keys (Major + minor) are unique, with only one value per unique key. The Minor Key allows you to have multiple records for a given Major Key. Keys are simple strings. The value is a byte string; it's anything you want it to be, and the application knows the structure and content of the value. Support for a flexible data serialization format will be available in future releases (Apache Avro, http://en.wikipedia.org/wiki/Apache_Avro).
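The "Major Key determines the shard" idea can be illustrated with a small sketch: only the major path is hashed to pick a shard, so records that share a major key always land together. The hash function and shard count here are illustrative, not the product's actual placement algorithm.

```python
# Hypothetical sketch: shard placement driven by the major key only.
import hashlib

N_SHARDS = 3

def shard_for(major_path):
    # Hash only the major components; the minor path plays no role.
    h = int(hashlib.md5("/".join(major_path).encode()).hexdigest(), 16)
    return h % N_SHARDS

# Two records for the same major key ("users", "bob") with different
# minor keys always map to the same shard.
k1 = (("users", "bob"), ("profile",))
k2 = (("users", "bob"), ("orders", "2012"))
assert shard_for(k1[0]) == shard_for(k2[0])
```

This co-location is what makes the single-major-key transaction guarantee (discussed later) possible: all the affected records live on one shard.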
This is basically a summary slide, highlighting the features of Oracle NoSQL Database, especially those we think set it apart from some of the other products on the market. General Purpose: what we mean here is that Oracle NoSQL DB is built as a general-purpose, scalable, highly reliable NoSQL database. Several of the open-source NoSQL databases on the market were built specifically to solve the technical problems of a given company (Voldemort was built by LinkedIn, Dynamo by Amazon, Bigtable by Google), which can tend to affect the technical direction and design decisions of those products. That is not the case with Oracle NoSQL Database. Reliable: unlike most NoSQL databases out there, which are inventing both storage and distributed data management, Oracle NoSQL Database uses Berkeley DB Java Edition for key-value storage and replication on the storage nodes. BDB has been running large production applications for many years and is a proven, reliable, scalable storage system.
Keep the cluster investment at work; get the most bang for your buck. Today: training needed, multiple management tools.
- Rapid, automatic, or rule-based single-click provisioning of Big Data clusters
- Measure the boost provided by clusters/grids to your business data processing capabilities
- Change your choice of cluster software at any point when it is not sufficiently delivering to your needs
- Manage the big data solution from a single cluster-management software umbrella
IT & System Administrators want:
- Consistent, easy-to-use provisioning, management & monitoring tools
- Less disruption in the stack; reuse of technology investments
- Extensibility: keep the same tooling when adding new big data technologies to the stack
- Reduced outage times
- Reduced time to scale & production
Cluster Analytics:
- Cross-cluster analytics
- Optimizations
- Self-healing capabilities
- Fail-safe handling of false negatives/positives
Advanced Profiling:
- Capability to "certify" cluster performance
- Job profiling: weeds out badly written code
Value-Added Features:
- Testing framework for Map-Reduce jobs: certify builds for production
Experienced advisors: an accelerated consulting and services leader for Big Data, headquartered in San Jose with offices in India. Expertise through architects: pioneers in distributed software engineering with both vertical and functional expertise, plus dedicated Innovation Labs. Excellence delivered through technology advances: an open-source and innovation product portfolio. Founded 1991; 1,300 strong. Leading Big Data since 2008. Chicago, NYC, Atlanta, Indore, Noida, Bangalore. Impetus provides Big Data thought leadership and services, creating new ways of analyzing data to gain key business insights across enterprises. Impetus' experience extends across the big data ecosystem, including Hadoop, NoSQL, NewSQL, MPP databases, machine learning, and visualization. Impetus offers a Quick Start program, Architecture Advisory Services, Proof of Concept, and Implementation.
Oracle NoSQL Database allows you to relax/configure the Consistency and Durability policies for a given operation. Durability is controlled by defining the Write Policy and the HA Acknowledgement Policy; you can increase write-transaction performance by relaxing the Durability constraints. The default is write-to-memory with Majority acknowledgement. Consistency is controlled by defining the Read Guarantees that you require from the system; you can increase read-transaction performance by relaxing the Consistency constraints. The default is None.
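To make the acknowledgement side of the Durability trade-off concrete, here is a small sketch of how an ack policy might translate into the number of nodes a write must wait for. The function and policy names are illustrative, not the actual Oracle NoSQL DB API.

```python
# Sketch: nodes a write waits for under different ack policies,
# given the shard's replication factor (master counts as one node).
def required_acks(policy, replication_factor):
    if policy == "ALL":
        return replication_factor          # every node must confirm
    if policy == "MAJORITY":
        return replication_factor // 2 + 1 # a simple majority
    if policy == "NONE":
        return 1                           # only the master itself
    raise ValueError("unknown policy: " + policy)

# With a replication factor of 3, the default Majority policy waits
# for 2 of the 3 nodes; relaxing to NONE waits only on the master.
assert required_acks("MAJORITY", 3) == 2
assert required_acks("NONE", 3) == 1
```

Fewer required acks means lower write latency but a larger window in which a node failure could lose the most recent writes, which is exactly the relaxation being described.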
We heard you: we have ACID transactions in Oracle NoSQL Database. You can think of a transaction as a single auto-commit API call. That API call can cover a single record, multiple records, or multiple operations, as long as all of the records share the same Major Key. However many records or operations are in that API call, they are all committed atomically (all or nothing). Because they share the same Major Key, all of the affected data resides on a single storage node, so we can guarantee the transactional semantics of the commit. We replicate that transaction to the replicas (copies of the data) as part of the transaction. Of course, not all operations are created equal: in some cases you may want operations that are not completely ACID. One of the benefits of NoSQL is that it relaxes transactional guarantees in order to provide faster throughput. Oracle NoSQL Database allows you to override the default and relax the ACID properties on a per-operation basis, so the application can specify the transactional behavior that is most appropriate.
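The single-major-key constraint can be modeled in a few lines: a multi-operation commit is accepted only when every record shares one major key (so everything lives on one shard), and it then applies all-or-nothing. This is a toy model of the rule, not the product's API.

```python
# Toy model: multi-op commits are allowed only within one major key.
class Store:
    def __init__(self):
        self.data = {}

    def execute(self, ops):
        """ops is a list of (major_key, minor_key, value) writes."""
        majors = {major for (major, minor, value) in ops}
        if len(majors) != 1:
            # Records on different shards cannot share one commit.
            raise ValueError("all operations must share one major key")
        staged = {(major, minor): value for (major, minor, value) in ops}
        self.data.update(staged)  # applied together: all or nothing
        return len(staged)

store = Store()
store.execute([("users/bob", "email", "b@x.com"),
               ("users/bob", "phone", "555-0100")])
```

Trying to mix, say, `users/bob` and `users/amy` in one `execute` call raises an error, mirroring the restriction described above.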
Elasticity refers to dynamic/online changes in a deployed store configuration. New storage nodes are added to a store to increase performance, reliability, or both.

Increase data capacity: a company's Oracle NoSQL Database application is now obtaining its data from several unplanned new sources. The utilization of the existing configuration is more than adequate to meet requirements, with one exception: they anticipate running out of disk space later this year. The company would like to add the needed disks to existing slots on the existing servers, establish mount points, and have Oracle NoSQL Database fully utilize the new disks along with the disks already in place, all while the system is up and running. After installing the new disks, the administrator defines a new topology, using the Administrator, with the new mount points and capacity values such that new replication nodes can be created on the existing storage nodes. The administrator can review the plan for errors and, when ready, deploy the new topology while Oracle NoSQL Database stays online and continues to serve the running application with CRUD operations.

Increase throughput: as a result of an unplanned corporate merger, the live Oracle NoSQL Database will see a substantial increase in write operations. The read/write mix of transactions will go from 50/50 to 85/15, and the new workload will exceed the I/O capacity of the available storage nodes. The company would like to add new hardware and have it be utilized by the existing Oracle NoSQL Database (KVStore) currently in place; and of course the application needs to remain available while this upgrade is occurring. With the new elasticity capabilities and topology planning, the administrator can add the new hardware and define a new topology with the new storage nodes.
The administrator can then look at the resulting topology (storage nodes, replication nodes, shards, etc.) to confirm it meets their requirements. Once satisfied with the new topology, they can deploy it in the background while the existing application continues to operate. As partitions/chunks of data are moved, they are made available to the live system.

Increase replication factor: a new requirement has been placed on an existing Oracle NoSQL Database to increase its overall availability by raising the replication factor, using new storage nodes added in a second geographic location. This is accomplished by adding at least one replication node for every existing shard; the current configuration has a replication factor of 3. While the system is live, the administrator changes the topology to define the new storage nodes and the new replication factor. Again, the administrator can validate and review the topology before deploying; as a side point, the administrator could validate several changes to evaluate alternatives and then decide which topology to deploy. Just as in the other scenarios, the data is automatically moved and partitions are made available as they are moved, as a background activity. Meanwhile, the KVStore continues to service the existing workload, starting to use the new replicas as they become available. Once the topology is deployed, a new replication node has been created and populated for each shard. We have increased availability by raising the replication factor, with the new storage nodes in another geographic location, and we have increased read throughput capability with the new replication node in each shard. The replication factor is now 4.
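The "partitions become available as they are moved" behavior can be sketched as a routing map that is updated one partition at a time, so the store keeps serving requests throughout the migration. The data structures here are illustrative assumptions, not the product's internals.

```python
# Sketch of online partition migration: the routing map is updated
# per partition, so the store stays live between moves.
def migrate(routing, partitions_to_move, target_shard):
    for p in partitions_to_move:
        routing[p] = target_shard   # partition usable as soon as it moves
        yield dict(routing)         # snapshot: store serving at each step

# Three partitions spread over two shards; move partition 1 to a
# newly added shard while the others keep serving where they are.
routing = {0: "shard1", 1: "shard1", 2: "shard2"}
steps = list(migrate(routing, [1], "shard3"))
```

At every intermediate step the map is complete, which is the property that lets CRUD traffic continue during the background move.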
Rebalance a configuration: a storage node has failed and must be replaced (the KVStore continues to run). The new hardware is a much more powerful machine (9 cores, 64 GB of memory compared to 8 GB, multiple 400 GB solid-state drives), making the deployment a heterogeneous hardware mix. The new hardware replaces the failed storage node: the system administrator adds the new storage node to the pool of available storage nodes and then migrates the old (failed) storage node to the new one. After successful migration (the KVStore continues to run), the failed storage node is deleted and all storage nodes are active again. Continuing to monitor the performance of the system and the existing topology, the administrator notices that some of the older storage nodes have 2 replication nodes on them, with high CPU/IO utilization and high latency, while the new, much faster storage node is underutilized. By using the new physical topology planning support available in this release, Oracle NoSQL Database will rebalance the configuration and redistribute the data; in other words, it will make optimal use of heterogeneous storage nodes. The new storage nodes will likely host multiple replication nodes, while many of the older systems may go from 2 to 1. The replication nodes are moved automatically. Again, this can all happen while the system is online and at the convenience of the company.
Data movement is:
• Idempotent: it can be run multiple times with the same result.
• Interruptible: you can interrupt it at any time and the KVStore will continue running. A company with a daily peak workload period may want to interrupt the data movement (part of deploying the new topology) and restart it after the peak period.
• Restartable: after an interruption, the movement can be restarted and picks up where it left off.
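Those three properties can be sketched with a movement plan that records completed partitions: re-running skips finished work (idempotent), the run can stop mid-plan (interruptible), and a later run finishes the remainder (restartable). This is a toy model, not the actual plan executor.

```python
# Toy model of an idempotent, interruptible, restartable move plan.
def run_plan(partitions, done, budget):
    """Move up to `budget` not-yet-done partitions; record progress."""
    moved = 0
    for p in partitions:
        if p in done:
            continue            # idempotent: completed work is skipped
        if moved == budget:
            break               # interruptible: stop partway through
        done.add(p)             # "move" the partition, record it
        moved += 1
    return done

done = set()
run_plan([1, 2, 3, 4], done, budget=2)   # interrupted after 2 moves
run_plan([1, 2, 3, 4], done, budget=10)  # restarted: finishes the rest
run_plan([1, 2, 3, 4], done, budget=10)  # re-run: same end state
```

The persisted `done` set is what makes restarting safe; real systems keep equivalent progress state so a peak-hours interruption loses no work.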
Why Avro? Avro is used in multiple products, such as Hadoop, and from multiple programming languages. Having a schema and serialization framework is advantageous when working with multiple programmers and with other products such as Hadoop.

Schema: with Avro, each value is associated with an Avro schema (created in JSON format), typically created by the application programmer. An advantage of using Avro is that serialized values can be stored in a space-efficient manner. Avro has a number of primitive data types, including boolean, int, long, float, and string.

Bindings: Oracle NoSQL Database supports multiple binding types. Generic: schemas are treated dynamically (not fixed at build time). Specific bindings (SpecificAvroBinding) have the advantage of creating a POJO (Plain Old Java Object) class with getter and setter methods for each field in the schema. JSON bindings (JsonAvroBinding) are easy to read or create and can interoperate with other programs that use JSON objects. Raw: low-level, with no serialization performed.

Schema evolution is important with large databases, where you can't simply update every key/value pair in the store. Different schemas (within constraints defined in the Avro specification) can be used when data is read or written: the schema used to read data does not need to be exactly the same as the one used to write it. For example, imagine we have a key/value record representing profile information for a user, and a new requirement to add an alternate email address. The field is added and a default value is established. From then on, whenever a new key/value pair is added or profile information is updated, the alternate email address is included. On reads (for example, displaying the profile information), the alternate email address may not have been populated yet, and that is fine: the default value can be displayed.
This allows complete flexibility in terms of providing the updated field over time.
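The profile example above can be sketched without the Avro library: a reader schema adds `alt_email` with a default, so old records written before the field existed still read cleanly. The schema representation here is a simplification of Avro's actual default-value rule.

```python
# Sketch of Avro-style schema evolution: reader-side defaults fill in
# fields that old records were written without.
def read_with_schema(record, reader_schema):
    """reader_schema is a list of (field_name, default) pairs."""
    out = {}
    for field, default in reader_schema:
        out[field] = record.get(field, default)  # default if absent
    return out

# An old record, written before alt_email existed:
old_record = {"name": "Pat", "email": "pat@example.com"}

# The new reader schema adds alt_email with a default value:
reader_schema = [("name", None),
                 ("email", None),
                 ("alt_email", "none@example.com")]

profile = read_with_schema(old_record, reader_schema)
```

No bulk rewrite of the store is needed: records pick up the new field lazily, whenever they are next written, exactly as the slide describes.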
A new streaming API for Large Objects (recommended for sizes from about 1 MB up to hundreds of GB); examples would be audio files, video files, and medical imaging. New methods were created on the KVStore handle (getLOB, putLOB, deleteLOB, putLOBIfAbsent, putLOBIfPresent). The major difference is the input stream used to chunk the Large Object: the smaller chunks can be stored across the KVStore (multiple shards), depending on size. In addition, the chunks are stored in parallel, so write/read operations are much faster.
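The chunking idea behind the LOB API can be sketched as splitting a stream into fixed-size pieces, each stored as its own record so the pieces can be spread across shards and written in parallel. The chunk size and key scheme here are illustrative assumptions, not the product's internal format.

```python
# Sketch of LOB chunking: a large stream becomes many small records.
import io

CHUNK_SIZE = 4  # tiny for demonstration; real chunks are far larger

def chunk_stream(stream, lob_key):
    """Split a binary stream into (lob_key, index) -> bytes records."""
    chunks = {}
    i = 0
    while True:
        piece = stream.read(CHUNK_SIZE)
        if not piece:
            break
        chunks[(lob_key, i)] = piece  # each chunk is its own record
        i += 1
    return chunks

chunks = chunk_stream(io.BytesIO(b"0123456789"), "video/42")
```

Reassembly is just concatenating the chunks in index order; because each chunk is independent, reads and writes of different chunks can proceed concurrently, which is where the speedup comes from.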
External table support allows you to access data in external sources as if it were a table in the Oracle relational database. Through Oracle's external table support, you can access Oracle NoSQL Database key/value pairs as if they were rows in Oracle Database. This allows you to issue SQL read statements such as SELECT and SELECT COUNT(*) where the results are obtained from Oracle NoSQL Database. Since SELECT statements can refer to multiple tables, a query can look at both Oracle NoSQL Database information and data that resides directly in the Oracle Database. It also means the data can be accessed via JDBC. Sample programs and javadoc are available. Event Processing: the cartridge will work with Oracle Event Processing.
From http://www.slideshare.net/jmusser/j-musser-apishotnotgluecon2012, slide 23
There's a web-based Admin GUI, which is a great way to get started. Most production sites with lots of nodes will probably use the CLI (command-line interface) to start/stop the system and use the GUI to check on status. The system keeps track of the status of the system and the various storage nodes, as well as the performance statistics and throughput for each node. In a future release of Oracle NoSQL Database, the administration functionality will also be available via Oracle Enterprise Manager.