2. Why Kite?
1. Codify expert patterns and practices for building data-oriented systems and applications.
2. Let developers focus on business logic, not plumbing or infrastructure.
3. Provide smart defaults for platform choices.
4. Support piecemeal adoption via loosely-coupled modules.
3. Kite Data Module
Provides APIs to interact with data in Hadoop.
The data module contains APIs and utilities for defining and performing actions on datasets.
entities
schemas
datasets
dataset repositories
loading data
dataset writers
viewing data
4. Entities, Schemas
Entity:
A single record in a dataset, analogous to a row in a relational database table (similar to a plain
Java object).
Schemas:
A schema specifies the field names and data types for a dataset. Kite relies on Apache Avro for
schema definition. A schema can be created:
◦ using the Java API (the Avro schema is inferred from a Java class or Avro data)
◦ using the command line (the Avro schema is inferred from a Java class or CSV data)
5. Datasets
Datasets:
A collection of zero or more entities, analogous to an RDBMS table.
The HDFS implementation of a dataset is stored as Snappy-compressed Avro data files by default.
We can also store it in the column-oriented Parquet format.
◦ Performance of a dataset can be improved with a partition strategy
◦ Based on one or more fields in the entity
◦ Partitioning can use hash, identity, or date (year, month, day, hour) strategies
◦ It provides coarse-grained organization
◦ A partition strategy is configured in a JSON-based format
◦ A partition strategy can be applied only when a dataset is created and cannot be altered later
◦ We can work with a subset of a dataset's entities using the Views API
◦ Datasets are identified using URIs
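The hash strategy mentioned above can be sketched in plain Java. This is an illustration of the idea only, not Kite's actual partitioner implementation; the field value and bucket count are made-up examples.

```java
// Illustration only: how a hash partition strategy assigns an entity to a
// bucket based on one of its fields. Kite's real partitioners live in the
// Kite SDK; this sketch just mirrors "hash the source field, mod buckets".
public class HashPartitionSketch {
    // Mask out the sign bit so negative hash codes still map to a valid bucket.
    static int bucket(Object sourceFieldValue, int numBuckets) {
        return (sourceFieldValue.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        // An entity with id 42 lands in the same bucket every time,
        // which is what makes partition pruning possible.
        System.out.println(bucket(42, 16)); // prints 10
    }
}
```

Because the bucket is a pure function of the field value, queries that filter on that field can skip every other partition.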
6. Datasets
◦ Dataset URIs:
Depending on the dataset scheme, we can specify a dataset URI using one of the following patterns.
Hive: dataset:hive:<namespace>/<dataset>
HDFS: dataset:hdfs:/<path>/<namespace>/<dataset-name>
Local FS: dataset:file:/<path>/<namespace>/<dataset-name>
HBase: dataset:hbase:<zookeeper>/<dataset-name>
◦ View URIs:
A view URI is constructed by changing the prefix of a dataset URI from ‘dataset:’ to ‘view:’. Query
arguments can be added as name/value pairs, similar to query arguments in an HTTP URL.
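The prefix-swap convention can be sketched in a few lines of Java. The namespace, dataset name, and the `year` query argument below are made-up examples, not from the slides.

```java
// Sketch of the URI convention described above: a view URI is a dataset URI
// with the "dataset:" prefix replaced by "view:", plus HTTP-style query args.
public class ViewUriSketch {
    static String toViewUri(String datasetUri, String name, String value) {
        return datasetUri.replaceFirst("^dataset:", "view:") + "?" + name + "=" + value;
    }

    public static void main(String[] args) {
        String dataset = "dataset:hive:default/movies"; // hypothetical dataset
        System.out.println(toViewUri(dataset, "year", "2015"));
        // prints view:hive:default/movies?year=2015
    }
}
```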
7. Dataset Repositories, Loading, Dataset Writers, Viewing Data
Dataset Repositories:
• The physical storage location for datasets. It is equivalent to a database in the RDBMS model.
• Required for logical grouping, security, access controls, backup policies, etc.
• Each dataset belongs to exactly one dataset repository.
• Kite does not provide functionality for copying/moving a dataset from one dataset repository to
another. (However, it can be done via MapReduce.)
Loading:
• We can load comma-separated values into a dataset repository using the CLI.
Dataset Writers:
• Used to add entities to datasets.
Viewing Data:
• We can query the data using Hive/Impala.
• We can also use the CLI.
9. Generate Schema
• A Kite dataset is defined using an Avro schema.
• It can be manually written or generated from a Java object or a CSV data file.
• CLI command for:
• Schema generation from Java class
kite-dataset obj-schema org.kitesdk.cli.example.Movie -o movie.avsc
• Schema generation from CSV file
kite-dataset csv-schema movie.csv --class Movie -o movie.avsc
10. Example – Schema Generation
package org.kitesdk.examples.data;
/** Movie class */
class Movie {
private int id;
private String title;
private String releaseDate;
. . .
public Movie() {
// Empty constructor for serialization purposes
}
}
{
"type":"record",
"name":"Movie",
"namespace":"org.kitesdk.examples.data",
"fields":[
{"name":"id","type":"int"},
{"name":"title","type":"string"},
{"name":"releaseDate","type":"string"}
]
}
11. Create Dataset
A dataset is created using the Avro schema.
kite-dataset create movie --schema movie.avsc
Partition Strategy:
• Logical partitions for improving performance
• Specified using a JSON file
Example: movie.json
[ {
"source" : "id",
"type" : "identity",
"name" : "id"
}]
kite-dataset create movie --schema movie.avsc --partition-by movie.json
12. Create Dataset
Column Mapping:
• Specifies how data should be stored in HBase for maximum performance
• Specified in a JSON file
• Each definition is a JSON object with the following fields:
• source – the field in the entity
• type – where the field data is stored (cells in HBase)
• family – the column family in the HBase table
• qualifier – the column name in the HBase table
Example
{"source" : "timestamp", "type" : "column", "family" : "m", "qualifier" : "ts"}
• There are five mapping types:
1. column 2. counter 3. keyAsColumn 4. key 5. version
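Putting the definitions above together, a column mapping file is a JSON array with one such object per entity field. The field names (`id`, `title`, `viewCount`) and the family/qualifier choices below are hypothetical examples; only the structure follows the field definitions above.

```json
[
  {"source" : "id",        "type" : "key"},
  {"source" : "title",     "type" : "column",  "family" : "m", "qualifier" : "title"},
  {"source" : "viewCount", "type" : "counter", "family" : "m", "qualifier" : "views"}
]
```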
13. Populate-Validate-Update-Annihilate Dataset
Populate Dataset:
There are various ways to populate a dataset:
• Importing from CSV files
• Copying from another dataset
• Using Flume ingestion, etc.
Validate Dataset:
The ‘show’ command can be used to validate the loaded data.
Update Dataset:
Kite supports schema evolution via Avro.
Annihilate Dataset:
Delete the dataset when it is no longer required.
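As a sketch, the lifecycle above maps onto `kite-dataset` CLI commands roughly as follows. The dataset name `movie` is carried over from the earlier examples, the CSV file name is illustrative, and exact flags may vary by Kite version.

```
kite-dataset csv-import movies.csv movie        # populate from a CSV file
kite-dataset show movie                         # validate: print loaded records
kite-dataset update movie --schema movie.avsc   # update: apply an evolved schema
kite-dataset delete movie                       # annihilate: remove the dataset
```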