8. Create Projection DDL
The CREATE PROJECTION statement specifies:
– The column list and encoding
– The base query (joins can also be specified)
– The sort order
– Segmentation and K-safety

CREATE PROJECTION snmp_p1 (
    host ENCODING RLE,
    interface ENCODING RLE,
    time ENCODING COMMONDELTA_COMP,
    metric ENCODING RLE,
    value ENCODING DELTAVAL
) AS
SELECT
    host,
    interface,
    time,
    metric,
    value
FROM snmp
ORDER BY host, interface, metric, time
SEGMENTED BY HASH(host, interface, time, metric) ALL NODES KSAFE;
14. Determining the Projection Sort Order
– In the ORDER BY clause:
1. Start with the query predicates
– Order the predicate columns from lowest to highest cardinality
– Run-length encode (RLE) the columns in the sort order
2. Order columns based on JOIN and/or GROUP BY
– Order the JOIN and/or GROUP BY columns from lowest to highest cardinality
3. Order the remaining columns until the average run length falls below 10

ORDER BY host, interface, metric, time
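The cardinality ranking behind steps 1 and 2 can be checked directly. The following query is a sketch using the snmp table from the earlier slide, and is not part of the original deck:

```sql
-- Count distinct values per candidate sort column; place the
-- lowest-cardinality columns first in the projection's ORDER BY.
SELECT COUNT(DISTINCT host)      AS host_card,
       COUNT(DISTINCT interface) AS interface_card,
       COUNT(DISTINCT metric)    AS metric_card,
       COUNT(DISTINCT time)      AS time_card
FROM snmp;
```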
15. Projection Segmentation
– Segment large tables and replicate small tables
– Choose segmentation columns that give a random data distribution
– Columns with relatively unique values
– Columns used infrequently in query predicates
– Distribute data so that joins are processed locally
– Most common: a segmented fact table joined to replicated dimension tables
– For large dimension tables, fact-to-fact joins, and self-joins, consider Identically Segmented Projections (ISPs)

SEGMENTED BY HASH(host, interface) ALL NODES OFFSET 0;
16. Segmentation Clause
– SEGMENTED BY HASH
– Distributes data evenly across the nodes
– The values should be more than 80% unique and random
– Do not use columns that appear in query predicates
– If no column or combination of columns is suitable for segmentation, create a new column for segmentation
– ALL NODES {OFFSET N}
– ALL NODES
– ALL NODES OFFSET 1
– ALL NODES KSAFE
– Equivalent to creating one projection with ALL NODES OFFSET 0 and another with ALL NODES OFFSET 1 (when k=1)
– Used to create identical buddy projections

SEGMENTED BY HASH(host, interface, time, metric) ALL NODES KSAFE;
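For illustration, the KSAFE shorthand is equivalent (for k=1) to creating two explicit buddy projections. The projection names below are hypothetical:

```sql
-- Buddy projections: identical definitions, segmented with offsets
-- 0 and 1 so each segment has a copy on a different node.
CREATE PROJECTION snmp_b0 AS
SELECT * FROM snmp
SEGMENTED BY HASH(host, interface, time, metric) ALL NODES OFFSET 0;

CREATE PROJECTION snmp_b1 AS
SELECT * FROM snmp
SEGMENTED BY HASH(host, interface, time, metric) ALL NODES OFFSET 1;
```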
19. Run-Length Encoding – How Is It Processed?
– Run-length-encoded columns are sorted so that identical values are contiguous
– Each "bucket" maps to a set of rows in the subsequent columns
– A query predicate on one column limits how much data is read from the other columns

[Figure: a Students table with columns Gender, Class, Pass, and Name, sorted so that Gender forms runs (F, M), Class forms runs (Fresh, Junior, Senior, Soph), and Pass forms runs (F, T); only the Name values in the bucket matching the predicate need to be read.]

SELECT Name
FROM Students
WHERE Class='Junior'
  and Gender='M'
  and Pass='T'
20. Run-Length Encoding – When to Stop
– Keep adding columns to the sort order and applying RLE until the average run length falls below 10
– To determine the run length per bucket:
– Projection definition: … ORDER BY gender, class, pass;
– SELECT count(*)
  FROM students
  GROUP BY gender, class, pass;
– When RLE offers no further benefit, consider other encoding and compression options for the remaining columns
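Building on that GROUP BY, the average run length itself can be computed in one step. This wrapper query is a sketch and not part of the original slide:

```sql
-- Average rows per (gender, class, pass) bucket; if this drops
-- below about 10, stop extending the RLE sort order here.
SELECT AVG(bucket_rows) AS avg_run_length
FROM (
    SELECT COUNT(*) AS bucket_rows
    FROM students
    GROUP BY gender, class, pass
) buckets;
```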
21. Run-Length Encoding – When to Stop
Stop when the average RLE run length falls below 10.

[Figure: per-bucket run lengths for a Students table sorted by Gender, Class, and Pass. Average run lengths: Gender 375, Class 93.7, Pass 46.8, Name 2.5. Name's average run length is below 10, so RLE stops before that column. The example query is the same as on slide 19: SELECT Name FROM Students WHERE Class='Junior' and Gender='M' and Pass='T'.]
25. Pre-Join Projections: Projection Example

CREATE PROJECTION fact_cust_pjp (
    f_id ENCODING RLE,
    customer_id ENCODING RLE,
    name
)
AS SELECT
    f.id,
    f.customer_id,
    d.name
FROM fact f, customer_dimension d
WHERE f.customer_id = d.id
ORDER BY d.name, f.customer_id, f.id
SEGMENTED BY HASH (f.id)
ALL NODES;

> Example query that uses the pre-join projection:
SELECT name, f.id
FROM fact f, customer_dimension d
WHERE f.customer_id = d.id;

> Example query that does not use the pre-join projection:
SELECT name, d.id
FROM fact f, customer_dimension d
WHERE f.customer_id = d.id;

Projection contents:
f_id  customer_id  Name
10    1            Steve
20    1            Steve
30    2            Mary
40    1            Steve
50    2            Mary
60    1            Steve
37. Top-K Projection Example
Example Top-K query:
SELECT meter_id, reading_date, reading_value FROM readings
LIMIT 5 OVER (PARTITION BY meter_id ORDER BY reading_date DESC);

Create a Top-K projection partitioned by meter_id that stores the five most recent readings for each gas meter:
CREATE PROJECTION readings_topk (meter_id, recent_date, recent_value)
AS SELECT meter_id, reading_date, reading_value
FROM readings LIMIT 5 OVER (PARTITION BY meter_id ORDER BY reading_date DESC);
41. Example of a Projection with Expressions
Example query containing an expression:
SELECT a, b, a*b FROM values;

Create a table containing the data the query needs:
CREATE TABLE values (a INT, b INT);

Create a projection that includes the expression:
CREATE PROJECTION values_product (a, b, product_value)
AS SELECT a, b, a*b FROM values
SEGMENTED BY HASH(a) ALL NODES KSAFE;
Facilitator Keys
Now, let's start to discuss the Vertica Analytics Platform. These are the 6 specific things that make Vertica, Vertica as we discussed earlier.
Facilitator Keys
As you can see on the slide, there are three different schemas for the database: customer, employee, and store.
Vertica has one database per cluster, so schemas are used to separate out different entities within a database.
You may want separate schemas for security purposes.
(Example HR department may only have access to the employee schema while sales may only have access to the store data.)
Within Vertica, schemas are seen as a good way of separating data for security purposes.
One database, different schemas within that database.
Permissions can be granted at the schema level.
We create tables in the schema defining the columns and data types. Here we will look at the customer table that has 3 columns: A, B, and C. Tables are logical, but they do play a role in projection design. Underlying projections inherit table characteristics such as column data types, field length, and partitioning. Projections can only contain columns from the anchor table.
From the table, we define projections. Projections are how we store data and the metadata is defined by the CREATE PROJECTION DDL that defines the columns to be included in the projection, the encoding, the sort order and the segmentation. Here we will have 3 projections built for different query sets for optimization and efficiencies.
Projections then have logical containers called ROS containers that help the query optimizer find the files it needs to answer a query.
Within the containers, the fdb files, each with an associated pidx file (position index file), store the actual data. They are the only physical storage of the data. Do not open these files in a text editor; doing so will corrupt your Vertica database.
Tables are logical objects in Vertica. Queries reference tables, and the optimizer picks the best projection(s) to answer them. Tables are owned by a user and created in a schema.
To improve performance, use the narrowest data type (varchar, varbinary, numeric) that fits each data field. For example, if a US state column is defined as varchar(25) but only ever holds 2-character codes, set it to a more specific size to shrink the column. When a query runs against the column, it pulls the whole column width, so a more specific data type helps the query run more efficiently.
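As a sketch of this point (the table and column names are made up for illustration):

```sql
-- A two-letter US state code fits in VARCHAR(2); declaring it as
-- VARCHAR(25) would make every query against the column pull a
-- wider column than necessary.
CREATE TABLE customer_address (
    customer_id INT,
    state       VARCHAR(2)   -- e.g. 'CA', 'NY'
);
```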
Primary Key Constraints
Tables in Vertica have typical table constraints such as: Primary Key, Foreign Key, Unique and Null. It is important to note that when you apply the PK constraint, the NOT NULL and UNIQUE constraints are added implicitly. However, if you remove the PK constraint, the NOT NULL constraint continues to apply to the column.
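A minimal sketch of this behavior (the table is illustrative):

```sql
-- PRIMARY KEY implicitly adds NOT NULL and UNIQUE to the column.
-- Dropping the PK later leaves the NOT NULL constraint in place.
CREATE TABLE customer_dimension (
    id   INT PRIMARY KEY,
    name VARCHAR(50)
);
```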
FOREIGN KEY Constraints
A foreign key (FK) is a column that is used to join a table to other tables to ensure referential integrity of the data. A FOREIGN KEY constraint is a rule that states that a column can contain only values from the PRIMARY KEY column on a specific dimension table.
A column with a FOREIGN KEY constraint can contain NULL values if it does not also have a NOT NULL constraint, even though the NULL value does not appear in the PRIMARY KEY column of the dimension table. This allows rows to be inserted into the table even if the foreign key is not yet known.
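A sketch of such a foreign key declaration, reusing the customer_dimension example from elsewhere in this deck:

```sql
-- customer_id may only hold values from customer_dimension.id,
-- or NULL, since no NOT NULL constraint is declared here.
CREATE TABLE fact (
    id          INT,
    customer_id INT REFERENCES customer_dimension(id)
);
```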
Enforcement
HP Vertica does not enforce the uniqueness of primary keys when they are loaded into a table, but rather at query time. A table's foreign key constraints are not enforced during data load either. Consequently, a constraint violation error can happen when:
An inner join query is processed.
An outer join is treated as an inner join due to the presence of a foreign key.
A new pre-join projection anchored on the table with the foreign key constraint is refreshed.
To detect constraint violations, you can load data without committing it using the COPY statement with the NO COMMIT option, and then perform a post-load check using the ANALYZE_CONSTRAINTS function.
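That workflow might look like the following; the file path and delimiter are placeholders:

```sql
-- Load without committing, then check constraints before deciding
-- whether to COMMIT or ROLLBACK.
COPY fact FROM '/tmp/fact_data.csv' DELIMITER ',' NO COMMIT;
SELECT ANALYZE_CONSTRAINTS('fact');
-- If ANALYZE_CONSTRAINTS reports no violations:
COMMIT;
```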
Next we have projections. Projections are how we organize the storage of the data in Vertica.
Every table has at least one superprojection, which contains all the columns from the anchor table. A superprojection is necessary to ensure that at least one projection contains all the columns from the table to service any query against the table.
Additional, query-specific projections may be created to optimize more challenging queries that may not be well served by the superprojection. Query-specific projections only need to contain the columns required for the queries they are optimizing. The order of the columns in query-specific projections may be adjusted to meet the specific needs of a query.
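A sketch of such a query-specific projection, reusing the snmp example from the slides; the projection name, sort order, and segmentation choice are assumptions:

```sql
-- Only the columns one target query needs, sorted on that query's
-- predicate columns rather than the superprojection's order.
CREATE PROJECTION snmp_by_metric (metric, host, value)
AS SELECT metric, host, value
FROM snmp
ORDER BY metric, host
SEGMENTED BY HASH(host) ALL NODES KSAFE;
```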
Here we can see a basic CREATE PROJECTION DDL, defining the columns and encoding with the sort order, the segmentation clause and the select statement from the table it is based on. Next we are going to look at manual projection design.
Most projection optimization needs can be fulfilled using the Database Designer, as discussed in the Essentials class. But today we will look at what goes into building a projection: how to choose the columns, sort order, encoding, and segmentation.
Later we will look at how the various parts of the projection design can help optimize queries.
Facilitator Keys
To build a projection, we will assume we are designing this projection with a specific query or groups of queries in mind.
First you will need to choose the columns and then consider the sort order and encoding types that will minimize the projection storage footprint and allow for an efficient query.
Finally you will define the projection segmentation: will the projection be segmented or replicated, and if it's segmented, which columns will you segment on?
Optional Exercise: using the graphic have students tell you where in the DDL these steps are specified.
Facilitator Keys
Having pre-join projections causes data to be joined at load time, so the projection can be directly used for the query without additional processing of a join.
This slows down loading time, which is a consideration for the MoveOutInterval, if loading to WOS. Also, the restriction that you must load dimension data first means that you need to pay attention to load order.
Note that in the customer_dimension table, the id is created as a primary key field that cannot be null. This is joined to the customer_id field in the fact table.
This shows the ddl to create the pre-join projection shown on the right.
It will not answer the second query since it requests the id field from the dimension table, which is not part of the pre-join projection.
Query Execution
SELECT d.name, count(*)
FROM fact f JOIN customer_dimension d
ON f.customer_id = d.id
WHERE d.name = 'Steve'
GROUP BY d.name;
The join is performed as a single scan of the fact_cust_pjp projection.
Facilitator Keys
Vertica prepends "agg_" to the projection name.
DBD does not create live aggregate projections when optimizing for queries. Live aggregate projections must be created manually.
There is a practical, though not declared, limit to the number of aggregate projections per anchor table, since the aggregations are done at load time.
Requirements
You must include a GROUP BY clause, and it must appear at the end of the CREATE PROJECTION statement.
When you create a live aggregate projection for a table, HP Vertica automatically aggregates the data from the anchor table and loads it into the live aggregate projection. On subsequent loads through the anchor table, HP Vertica updates both any regular projections and any live aggregate projections associated with the anchor table.
An anchor table must exist to perform any DML.
To create a live aggregate projection, an anchor table must be created through which data can be queried and loaded.
At least one regular projection must exist to serve as the base for the aggregated projection.
Then, a base projection must be created to serve as the anchor projection for the live aggregate projection.
Finally, the live aggregate projection can be created. It will use a base projection as the source of its data.
The live aggregate projection will have some or all of the columns from the base projection. In addition, it will contain one or more aggregate columns.
The query in the DDL will contain some type of aggregation SQL to generate the columns and aggregate the data in the aggregated columns.
Queries containing aggregates are made directly to the Live Aggregate Projection (not the table)
Unlike regular projections, queries against live aggregate projections are made directly against the live aggregate projection.
Normally, a query is run against a table and the optimizer chooses the projection to be used. In the case of live aggregate projections, the table name must be replaced with the name of the live aggregate projection.
Important: You must create an anchor table and an anchor projection before you create a live aggregate projection.
The anchor projection must specify a segmentation that is a subset of the live aggregate projection's segmentation.
NOTE: In the DDL, you do not specify this is an anchor table in the conventional way. The query in the DDL of the live aggregate projection (see step 3) is an aggregation query.
If you do not specify the segmentation for the anchor projection, HP Vertica creates a projection that is segmented by all columns. If the anchor projection is segmented on all columns, you cannot create a live aggregate projection.
You cannot use the ORDER BY clause when you create a live aggregate projection. Internally, the data is ordered on the GROUP BY columns. However, when you retrieve the data, the results do not necessarily display in that order. Use the ORDER BY clause to sort the results when querying the live aggregate projection.
Before you create a live aggregate projection, you must create an anchor projection. The anchor projection's segmentation must be a subset of the live aggregate projection's segmentation. The anchor table cannot be unsegmented.
The GROUP BY and PARTITION BY clauses must be supersets of the anchor projection segmentation.
Live aggregate projections are only stored in the ROS.
Live aggregate projections must be segmented.
The list of columns in the SELECT list for a live aggregate projection must be in the same order as the list of columns in the GROUP BY clause.
HP Vertica supports the following aggregations for live aggregate projections:
SUM [Aggregate]
Computes the sum of an expression over a group of rows. It returns a DOUBLE PRECISION value for a floating-point expression. Otherwise, the return value is the same as the expression data type.
MAX [Aggregate]
Returns the greatest value of an expression over a group of rows. The return value is the same as the expression data type.
MIN [Aggregate]
Returns the smallest value of an expression over a group of rows. The return value is the same as the expression data type.
COUNT [Aggregate]
Returns the number of rows in each group of the result set for which the expression is not NULL. The return value is a BIGINT.
The COUNT() aggregate function is different from the COUNT() analytic function. The COUNT() analytic function returns the number over a group of rows within a window.
Live aggregate projections cannot contain DISTINCT aggregates.
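Putting the requirements above together, the three-step setup might be sketched as follows; the table, projection, and column names are all illustrative:

```sql
-- 1. Anchor table: data is loaded and queried through it.
CREATE TABLE sales (customer_id INT, amount INT);

-- 2. Anchor projection: its segmentation must be a subset of the
--    live aggregate projection's GROUP BY columns.
CREATE PROJECTION sales_p AS
SELECT customer_id, amount FROM sales
SEGMENTED BY HASH(customer_id) ALL NODES KSAFE;

-- 3. Live aggregate projection: the SELECT list matches the order
--    of the GROUP BY columns, GROUP BY comes last, and no ORDER BY
--    is allowed in the definition.
CREATE PROJECTION sales_agg AS
SELECT customer_id, SUM(amount) AS total_amount
FROM sales
GROUP BY customer_id;

-- Query the live aggregate projection directly, not the table:
SELECT customer_id, total_amount FROM sales_agg
ORDER BY total_amount DESC;
```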
Another type of Live Aggregate projection is a Top K projections. For optimal performance of Top-K queries, create a Top-K projection that aggregates the data in the table for fast access. Querying the pre-aggregated data directly from the Top-K projection is usually faster than querying the data from the anchor table and then calculating the top k rows.
Top-K projections are a type of live aggregate projection. All the requirements and limitations for live aggregate projections apply to Top-K projections as well.
In this slide, we see a Top-K query that will bring back meter readings for the five most recent reading dates.
With a regular projection, the query would need to go into a table that contains all reading dates, then filter out the irrelevant dates.
By creating a Top-K projection, the projection contains only the data needed by the query. Because of this, the query can simply lift the tuples from the Top-K projection without having to do any processing, which makes the query run more quickly.
Note: Customers often use aggregate queries to get their top k; Vertica recommends an analytic query instead, as Vertica is an analytical database.
The following considerations are unique to Top-K projections:
Top-K projections must include LIMIT, OVER, PARTITION BY, and ORDER BY clauses. In other words, Top-K projections are inherently designed to contain a subset of the total data in a table, so you must use these clauses to group and limit the data in the projection.
When creating a Top-K projection, you can only use an ORDER BY clause on the PARTITION BY clause inside an OVER() clause.
The columns in the PARTITION BY and ORDER BY clauses must also be in the SELECT list.
The PARTITION BY clause must be a superset of the anchor projection's segmentation. In other words, if you segment the Top-K projection, the anchor projection must be segmented only on columns that appear in the Top-K projection's PARTITION BY clause.
You cannot use a PARTITION AUTO clause when creating a Top-K projection. Partitioning of a Top-K projection must name a specific column or expression.
You cannot use the DISTINCT keyword in a Top-K projection definition.
Top-K projections support ORDER BY NULLS FIRST/LAST.
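Given the segmentation rule above, an anchor projection compatible with the slide's readings_topk projection might be sketched like this; the projection name and segmentation choice are assumptions:

```sql
-- The Top-K projection partitions by meter_id, so the anchor
-- projection's segmentation must stay within that column set.
CREATE PROJECTION readings_p AS
SELECT meter_id, reading_date, reading_value
FROM readings
SEGMENTED BY HASH(meter_id) ALL NODES KSAFE;
```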
Facilitator Keys
The expression calculates values based on data in the anchor table and stores it in the projection. When you load data into an anchor table that has a projection that uses expressions, HP Vertica automatically calculates the values using the data from that table and inserts the data into the new projection.
Projections with expressions behave the same as normal projections, except you cannot perform any merge operations on the anchor table.
A projection that uses expressions is not a live aggregate projection unless it aggregates the data. If it does not aggregate data, you can continue to delete and update data in the anchor table, but you cannot perform any merge operations on the anchor table. If the projection with expressions does aggregate data, it behaves like a live aggregate projection.
Using expressions when defining projections gives you more flexible sorting. You can sort the data based on the calculated results of an expression. In addition, you can eliminate resource-intensive computations. HP Vertica loads the new data into the anchor table. Then it performs the calculations and loads the resulting data into the projection. That way, the data is already calculated each time you query the projection.
Suppose you want to calculate the product of two values, as shown in the query on this slide.
To run this query, you must create a table. The simple example on this slide shows a table that contains two INTEGER columns.
To create a projection that calculates the product of a and b, use a CREATE PROJECTION statement like the one shown on this slide.
Notice that the aggregate query is actually contained in the DDL of the projection. It is this aggregated query that causes the projection to aggregate data as data is loaded to it via the anchor table.
To query that projection, you must use the name that HP Vertica assigned to it, or its buddy projection.