9. Projections: Use the Tool to Create the Optimal Data Layout

Student_ID  Name                 Gender  Class      Score  Grade
1256678     Cappiello, Emilia    F       Sophomore  62     D
1254038     Dalal, Alana         F       Senior     92     A
1278858     Orner, Katy          F       Junior     76     C
1230807     Frigo, Avis          M       Senior     64     D
1210466     Stober, Saundra      F       Junior     90     A
1249290     Borba, Milagros      F       Freshman   96     A
1244262     Sosnowski, Hillary   F       Junior     68     D
1252490     Nibert, Emilia       F       Sophomore  59     F
1267170     Popovic, Tanisha     F       Freshman   95     A
1248100     Schreckengost, Max   M       Senior     76     C
1243483     Porcelli, Darren     M       Junior     67     D
1230382     Sinko, Erik          M       Freshman   91     A
1240224     Tarvin, Julio        M       Sophomore  85     B
1222781     Lessig, Elnora       F       Junior     63     D
1231806     Thon, Max            M       Sophomore  82     B
1246648     Trembley, Allyson    F       Junior     100    A

Example query:
select avg( Score ) from example where
Class = 'Junior' and Gender = 'F' and Grade = 'A'
12. Compression and Encoding: How the Data Is Held (Image 3)

[Figure: the sample student table stored column by column. Each sorted column is compressed and encoded, so repeated values in the Gender, Class, and Grade columns collapse into short encoded runs.]
13. Efficient Search with Fewer I/Os

[Figure: evaluating the example query against the sorted, encoded columns. The 1st I/O reads an entire column; the 2nd, 3rd, and 4th I/Os use offsets to read only the matching ranges of the remaining columns.]

Example query:
select avg( Score ) from example where
Class = 'Junior' and Gender = 'F' and Grade = 'A'
21. Comprehensive vs. Incremental Design

Comprehensive design: sample data from all tables and multiple queries (or no queries at all) go into the Database Designer, which produces multiple projections.
Incremental design: sample data from only the relevant tables and a specific query go into the Database Designer, which produces projections specific to that query.
43. Sample DDL

Open and review the deployment script.
Check the encoding, sorting, and segmentation.

CREATE PROJECTION fact_p
/*+basename(fact_p),createtype(D)*/
(
  a ENCODING RLE,
  b ENCODING RLE,
  c,
  id ENCODING RLE
) AS
SELECT a, b, c, id FROM Fact
ORDER BY a, id, b
SEGMENTED BY HASH(c) ALL NODES
KSAFE;

CREATE PROJECTION dim_p
/*+basename(dim_p),createtype(D)*/
(
  x ENCODING RLE,
  y ENCODING RLE,
  z,
  id ENCODING RLE
) AS
SELECT x, y, z, id FROM Dim
ORDER BY id, x, y
UNSEGMENTED ALL NODES;
98. Determining a Projection's Sort Order

In the ORDER BY clause:
1. Start with the query predicates
- Order the predicate columns from low to high cardinality
- Run-length encode (RLE) the columns in the sort order
2. Next, order columns used in JOINs and/or GROUP BYs
- Order the JOIN and/or GROUP BY columns from low to high cardinality
3. Add the remaining columns until the average run length falls below 10

Fragment from the running example (see the full sketch after the next slide):
CREATE ...
SELECT ...
ORDER BY host, interface, metric, time
SEGMENTED BY ...
99. Projection Segmentation

Segment large tables and replicate small tables.
Choose segmentation columns that distribute data randomly:
- Columns with relatively unique values
- Columns used infrequently in query predicates
Distribute so that JOINs are processed locally:
- Most commonly, a segmented fact table joined to replicated dimension tables
- For large dimension tables, fact-to-fact joins, and self-joins, consider identically segmented projections (ISPs)

Fragment from the running example:
CREATE ...
SELECT ...
ORDER BY ...
SEGMENTED BY hash(host, interface) ALL NODES OFFSET 0;
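Putting the two fragments together, a minimal sketch of the full statement might look like the following. The table name metrics and the column list are assumptions; only the ORDER BY and SEGMENTED BY fragments appear on the slides.

CREATE PROJECTION metrics_p
(
  host ENCODING RLE,      -- lowest cardinality first
  interface ENCODING RLE,
  metric ENCODING RLE,
  time                    -- highest cardinality last, left unencoded
) AS
SELECT host, interface, metric, time FROM metrics
ORDER BY host, interface, metric, time
SEGMENTED BY HASH(host, interface) ALL NODES OFFSET 0;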
100. The Segmentation Clause

SEGMENTED BY HASH
- Distributes data evenly across nodes
- The values should be more than 80% unique and random
- Do not use columns that appear in query predicates
- If no suitable column or combination of columns exists, create a new column just for segmentation

ALL NODES {OFFSET N}
- ALL NODES
- ALL NODES OFFSET 1

ALL NODES KSAFE
- Equivalent to creating one projection with ALL NODES OFFSET 0 and another with ALL NODES OFFSET 1 (when k=1)
- Used to create identical buddy projections

SEGMENTED BY hash(host, interface, time, metric) ALL NODES KSAFE;
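As a hedged illustration of the KSAFE equivalence above (the metrics table and projection names are assumptions), with k=1 the single KSAFE clause corresponds to creating the two buddy projections explicitly:

-- Buddy projection 0 (assumed names; equivalent to ALL NODES KSAFE with k=1)
CREATE PROJECTION metrics_p_b0 (host, interface, time, metric) AS
SELECT host, interface, time, metric FROM metrics
SEGMENTED BY HASH(host, interface, time, metric) ALL NODES OFFSET 0;

-- Buddy projection 1, identical but offset by one node
CREATE PROJECTION metrics_p_b1 (host, interface, time, metric) AS
SELECT host, interface, time, metric FROM metrics
SEGMENTED BY HASH(host, interface, time, metric) ALL NODES OFFSET 1;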
103. Run-Length Encoding: How Is It Processed?

Sort the run-length-encoded columns so that equal values are adjacent.
Each "bucket" maps to a set of rows in the columns that follow.
A query predicate on one column limits how much data must be read from the other columns.

[Figure: Gender buckets (F, M) subdivide into Class buckets (Fresh, Junior, Senior, Soph), which subdivide into Pass buckets (F, T), which map to ranges of Name rows.]

SELECT Name
FROM Students
WHERE Class='Junior'
and Gender='M'
and Pass='T'
104. Run-Length Encoding: When to Stop

Keep adding columns to the sort order and applying RLE until the average run length falls below 10.
To determine the run count per bucket:
- Projection definition: … ORDER BY gender, class, pass;
- SELECT count(*)
  FROM students
  GROUP BY gender, class, pass;
Once RLE no longer pays off, consider other encoding and compression options for the remaining columns.
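To turn those per-bucket counts into the average run length the rule refers to, one hedged approach is simply to average them; this query is a sketch, not from the slides:

SELECT AVG(run_length) AS avg_run_length
FROM (
  -- one row per bucket, i.e. per run in the sort order
  SELECT COUNT(*) AS run_length
  FROM students
  GROUP BY gender, class, pass
) runs;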
105. Run-Length Encoding: When to Stop

Stop applying RLE once the run count drops below 10.

[Figure: per-bucket run counts for the Gender, Class, Pass, and Name columns, with average run lengths of 375, 93.7, 46.8, and 2.5 respectively. Name, at an average of 2.5, is below the threshold of 10 and is not worth run-length encoding.]

SELECT Name
FROM Students
WHERE Class='Junior'
and Gender='M'
and Pass='T'
116. Top-K Projection Example

Example Top-K query:
SELECT meter_id, reading_date, reading_value FROM readings
LIMIT 5 OVER (PARTITION BY meter_id ORDER BY reading_date DESC);

Create a Top-K projection partitioned by meter_id that stores the five most recent meter readings for each gas meter:
CREATE PROJECTION readings_topk (meter_id, recent_date, recent_value) AS
SELECT meter_id, reading_date, reading_value FROM readings
LIMIT 5 OVER (PARTITION BY meter_id ORDER BY reading_date DESC);
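Because Top-K projections are queried directly rather than through the anchor table (see the notes later in this deck), retrieving the pre-aggregated rows is a simple scan. This usage example is a sketch:

SELECT meter_id, recent_date, recent_value FROM readings_topk;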
123. Merge Join Requirements

The data on both sides of the JOIN must be sorted on the join key columns:
- Sort the projections on the join key, or
- Sort within a subquery using an ORDER BY clause (see the sketch below)
Equality predicates are applied first to reduce the size of the merge join (the JOIN is pushed down):
- The optimizer first looks for a projection designed around the columns used in the query's predicates
- A predicate column is placed first in the projection's ORDER BY only when the query has a single-value equality predicate on that column
Recommended for large dimension tables, to avoid spilling to disk.
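A hedged sketch of the subquery approach, reusing the Fact/Dim example tables from the surrounding slides: sorting the inner side in a subquery presents sorted input on the join key even when no suitably sorted projection exists.

SELECT *
FROM Fact F
JOIN (SELECT * FROM Dim ORDER BY id) D
  ON F.id = D.id;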
124. Merge Join Optimization

Sort both the fact and dimension projections on the JOIN clause columns.

Sample query:
SELECT * FROM Fact F, Dim D WHERE F.id = D.id;

Sample projections:
CREATE PROJECTION Fact_p
(
  a ENCODING RLE,
  b ENCODING RLE,
  c,
  id ENCODING RLE
) AS
SELECT a, b, c, id FROM Fact
ORDER BY id, b
SEGMENTED BY HASH(c) ALL NODES;

CREATE PROJECTION Dim_p
(
  x ENCODING RLE,
  y ENCODING RLE,
  z,
  id ENCODING RLE
) AS
SELECT x, y, z, id FROM Dim
ORDER BY id, x, y
UNSEGMENTED ALL NODES;
125. Merge Join with Predicate Pushdown

Sort the fact projection first by the column filtered in the predicate, then sort both the fact and dimension projections on the JOIN clause columns.

Sample query:
SELECT * FROM Fact F, Dim D WHERE F.id = D.id AND f.a = 10;

Sample projections:
CREATE PROJECTION Fact_p
(
  a ENCODING RLE,
  b ENCODING RLE,
  c,
  id ENCODING RLE
) AS
SELECT a, b, c, id FROM Fact
ORDER BY a, id, b
SEGMENTED BY HASH(c) ALL NODES;

CREATE PROJECTION Dim_p
(
  x ENCODING RLE,
  y ENCODING RLE,
  z,
  id ENCODING RLE
) AS
SELECT x, y, z, id FROM Dim
ORDER BY id, x, y
UNSEGMENTED ALL NODES;
128. Designing for Local Joins

Replicate the small dimension projection and segment the large fact projection.

[Figure: fact data segmented across Node1, Node2, and Node3 (a segmented projection), with the dimension data replicated in full on every node (a replicated/unsegmented projection), so each node can join its fact rows locally.]
129. Segmented and Replicated Projections

Segment the fact table and replicate the dimension table.

Sample query:
SELECT * FROM fact F
JOIN dim D ON F.id = D.id;

Sample projections:
CREATE PROJECTION fact_p
(
  a ENCODING RLE,
  b ENCODING RLE,
  c ENCODING RLE,
  id
) AS
SELECT a, b, c, id FROM fact
ORDER BY id, a, b, c
SEGMENTED BY HASH(id) ALL NODES;

CREATE PROJECTION dim_p
(
  x ENCODING RLE,
  y ENCODING RLE,
  z,
  id
) AS
SELECT x, y, z, id FROM dim
ORDER BY id, x, y
UNSEGMENTED ALL NODES;
131. ISP Example

When both the fact and dimension tables are large, segmenting both projections on the join key stores the rows joined by primary key/foreign key on the same node.

[Figure: a large fact table hash-segmented on its foreign key and a large dimension table hash-segmented on its primary key, distributed across Node1, Node2, and Node3 so that matching rows land on the same node.]
132. Identically Segmented Projections

Segment both the fact and dimension projections on the JOIN clause columns.

Sample query:
SELECT * FROM fact F
JOIN dim D ON F.id = D.id;

Sample projections:
CREATE PROJECTION fact_p
(
  a ENCODING RLE,
  b ENCODING RLE,
  c ENCODING RLE,
  id
) AS
SELECT a, b, c, id FROM fact
ORDER BY id, a, b, c
SEGMENTED BY HASH(id) ALL NODES;

CREATE PROJECTION dim_p
(
  x ENCODING RLE,
  y ENCODING RLE,
  z,
  id
) AS
SELECT x, y, z, id FROM dim
ORDER BY id, x, y
SEGMENTED BY HASH(id) ALL NODES;
133. Network Operators

When the data being joined is not available locally, it requires a network operation and is redistributed at run time.
Look for these network operations in the query plan:
- BROADCAST (distributes a complete temporary copy of the data to each node)
Access Path:
+-JOIN HASH [LeftOuter] [Cost: 40K, Rows: 10K (NO STATISTICS)] (PATH ID: 1) Inner (BROADCAST)
| Join Filter: (T1.a > T2.y)
| Materialize at Output: T1.b
| Execute on: All Nodes
- RESEGMENT (distributes temporary identically segmented data to each node)
Access Path:
+-JOIN HASH [Cost: 639, Rows: 10K (NO STATISTICS)] (PATH ID: 1) Inner (RESEGMENT)
| Join Cond: (T1.a = T2.y)
| Materialize at Output: T1.b
| Execute on: All Nodes
135. GROUP BY Optimization

Design for the optimized GROUP BY operator:
- Create projections for GROUP BY PIPE or GROUP BY HASH
Design for LOCAL GROUP BY:
- Segment so that each group is processed on only one node
136. The GROUP BY HASH Operator

SELECT count(*) FROM cust GROUP BY cust.state;

The hash table must be completely built before the result set can be returned to the user.

[Figure: unsorted cust.state values (CA, MA, AL, CA, DE, AL, MA, DE) feed an in-memory hash map of value-to-count pairs; only once every value has been counted is the result returned to the user.]
137. The GROUP BY PIPE Operator

SELECT count(*) FROM cust GROUP BY cust.state;

Uses less memory and is faster than GROUP BY HASH.

[Figure: sorted cust.state values (AL, AL, CA, CA, DE, DE, MA, MA) stream through the operator; as each group ends, its count is emitted and returned to the user immediately.]
138. GROUP BY PIPE

GROUP BY PIPE is essential when aggregating large volumes of data or large numbers of groups:
- It can stream an unlimited number of tuples
- The total number of groups does not matter
- If there is a selective predicate, optimize for that predicate instead
- Equality predicates are applied before the GROUP BY PIPE
The output of GROUP BY PIPE is sorted.
Look for the GROUP BY operator in the query plan (see the sketch below):
- GROUP BY HASH or GROUP BY PIPELINED
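A hedged way to check which operator the optimizer chose, using the cust example from the surrounding slides:

EXPLAIN SELECT count(*) FROM cust GROUP BY cust.state;

The resulting plan contains either a GROUP BY HASH or a GROUP BY PIPELINED step.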
139. GROUP BY PIPE Optimization

All columns in the GROUP BY clause must appear first in the projection's ORDER BY clause.

Sample query:
SELECT count(*) FROM cust GROUP BY a, b, c;

Sample projection:
CREATE PROJECTION cust_p
(
  a ENCODING RLE,
  b ENCODING RLE,
  c ENCODING RLE,
  d,
  e
) AS
SELECT a, b, c, d, e FROM cust
ORDER BY a, b, c
SEGMENTED BY HASH(d) ALL NODES;
140. DISTRIBUTED GROUP BY

When data is distributed randomly across nodes, it must be redistributed to execute the GROUP BY:
- Resegment on the GROUP BY columns
What if each group existed on only one node?
- That is achieved by resegmenting on the GROUP BY columns
The query plan shows the redistribution:
- For a distributed GROUP BY, it shows RESEGMENT GROUP
141. LOCAL GROUP BY Optimization

Segment on the GROUP BY columns.

Sample query:
SELECT count(*) FROM cust GROUP BY a, b, c;

Sample projection:
CREATE PROJECTION cust_p
(
  a ENCODING RLE,
  b ENCODING RLE,
  c ENCODING RLE,
  d, e
) AS
SELECT a, b, c, d, e FROM cust
ORDER BY a, b, c
SEGMENTED BY HASH(a, b, c) ALL NODES;
142. GROUP BY Resegmentation

For a projection declared SEGMENTED BY HASH(a,b) ALL NODES, with the following GROUP BY clauses in a query:
- GROUP BY a
  - Resegmentation is required, because the segmentation includes a value (b) not in the GROUP BY
- GROUP BY a,b
  - No resegmentation is required at run time
- GROUP BY a,b,c
  - No resegmentation is required at run time
- GROUP BY a+1,b
  - Resegmentation is required, because of the expression on column a
144. Count Distinct Optimization

Same approach as GROUP BY PIPE.

Sample query:
SELECT a, count(distinct b) FROM cust
GROUP BY a;

Sample projection:
CREATE PROJECTION cust_p
(
  a ENCODING RLE,
  b ENCODING RLE,
  c ENCODING RLE
) AS
SELECT a, b, c FROM cust
ORDER BY a, b
SEGMENTED BY HASH(a, b) ALL NODES;
145. Approximate Count Distinct

Error within +/-1%, with up to a 97% reduction in time.
Maintains the precision the user specifies.
Results can be rolled up, for example from hourly to daily.
Distinct values do not need to reside on every node.
Uses memory more efficiently and returns results faster.

SELECT a, APPROXIMATE_COUNT_DISTINCT (b)
FROM cust
GROUP BY a;
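The rollup capability mentioned above is typically done with synopses; Vertica provides APPROXIMATE_COUNT_DISTINCT_SYNOPSIS and APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS for this. The table and column names below are assumptions, and the workflow (storing hourly synopses, then combining them) is a sketch:

-- Build one synopsis per hour and store them (assumed hour column).
CREATE TABLE hourly_synopses AS
SELECT hour, APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(b) AS syn
FROM cust
GROUP BY hour;

-- Roll the hourly synopses up into a single daily distinct count.
SELECT APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(syn) FROM hourly_synopses;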
164. Related Pages

Best Practices for Projection Optimization (English)
Query Tuning with Vertica: Dos and Don'ts (English)
Query Optimization Using Projections (English)
Troubleshooting Vertica Query Performance with System Tables (English)
Physical Schema (English)
Creating a Database Design (English)
Facilitator Keys
The next 3 slides walk step by step through the DBD and its importance in the overall structure of your database.
When preparing to run the DBD, data should be loaded (it is not required, but it makes more sense to load data) and should be truly representative of all activities, including typical insert, select, and delete statements. Make sure queries are solid.
The DBD outputs a SQL file containing scripts to create the projections it has decided are needed to optimize the database.
You can run the DBD against a full data set or a sample data set.
Comprehensive mode: can take an unlimited number of queries.
Incremental mode: about 100 queries are recommended.
The DBD does take a fair amount of resource allocation.
Deploying projections: you can review the DBD projections in the output or have the DBD deploy them automatically.
Facilitator Keys
Let's review the files generated in the DBD output directory and what you need to be aware of.
The output directory files include:
[design name]_deploy.sql: contains DDL statements that:
Create new optimized projections for the design you are deploying
Refresh the new projections with initial data
Drop the unused/unoptimized projections being replaced
[design name]_design.sql: contains the new design's projection definitions
[design name]_params.txt: contains the parameters used by the DBD when creating a specific design
designer.log: contains details on any errors or other output generated while running the DBD
Facilitator Keys
Create a representative sample of your data. Sample tables should be no more than 10 GB, so if you have a 100 GB table and a 10 GB table, a representative amount would be 10 GB for the 100 GB table and 1 GB for the 10 GB table.
When loading sample data into Vertica, the data should be representative when presenting it to the DBD.
Vertica specs dictate 40% free disk space on each node. One reason is that a lack of disk space will cause problems when designing through the DBD, because it will not have enough disk space to generate a replacement projection before dropping the projection it replaces.
Facilitator Keys
When providing sample queries to the database, it is important to include various statements. The DBD will ignore queries whose syntax is incorrect.
Comprehensive DBD: unlimited queries.
Query specific (incremental): about 100 queries.
It is important to include predicates; it doesn't matter whether A=15, only that A is searched on.
Queries are grouped by similarity and assigned a weighted balance.
The DBD clusters queries that are similar to each other. (It picks one representative from each cluster.)
It gives the representative query a weight equal to the cardinality/size of the cluster it belongs to. This lets the DBD process fewer queries overall, making the process faster.
NOTE: when using the DBD in MC, you can choose to use a sample set of queries in a file or choose the option to use the Query Request Table.
Facilitator Keys
The Database Designer is accessed through admintools: select the Configuration menu, then select Run Database Designer. You can also use the Management Console in 7.0, which is a web-based interface.
Comprehensive Design
A good idea after the initial load of data into Vertica. If your data fundamentally changes over time, it is ideal to run the DBD again.
If you rerun a comprehensive design on pre-existing DBD projections, it is faster than the first comprehensive design phase. The Database Designer does not encode any data that it already encoded, and it does not optimize projections that it has already optimized for storage. Comprehensive mode allows for, but does not require, a file of commonly run queries. If one is provided, the DBD can also create query-specific projections and will optimize superprojections for query and storage performance.
The DBD in comprehensive mode will drop old projections and replace them with newer projections.
Incremental Design
The DBD must be provided with a file containing one or more queries to optimize against.
It only adds projections; it does not drop projections. (This may affect the overall storage footprint, because you may have old, unused projections stored in the database.)
If users run the DBD in query-specific mode many times, projections may build up. (The DBD does not drop old projections when run in query-specific mode.)
When the DBD is run, you must select a database, and then one or more schemas within that database to use for analysis. If you want to run the DBD against a single table, you can:
For a comprehensive or incremental design, put that table into a new schema and run the DBD against just that schema
For an incremental design only, select the schema that currently contains the table and pass select * from <table>; as the query
Facilitator Keys
There are two things to be aware of when you run the DBD.
You may want the DBD to optimize for query performance, or perhaps optimize the storage footprint. We recommend that you run a balanced design.
When optimizing for query performance, the DBD creates more query-specific projections for each table. (Queries will run faster, but loads will take longer and the footprint will be larger.)
When optimizing for storage footprint, the DBD creates very few projections. (Queries will be slower, loads faster.)
A balanced design is recommended. Adjustments can be made by running in query-specific mode to improve performance on a select set of queries.
Ask
How does Vertica optimize the storage footprint when running the DBD?
Answer: It tries every possible encoding and compression type on every column. For each column, it selects the encoding and compression type that most reduces the data size.
How does Vertica optimize for query performance when running the DBD?
Answer: It generates a set of candidate projections for each table, invokes the optimizer to determine the query cost of the candidate projections, and picks the projection with the lowest query cost.
Facilitator Keys
Let's discuss how many projections per table are ideal.
As the slide indicates, the more projections you have, the better the query performance; the fewer projections you have, the faster the load rates and the smaller the storage footprint. On the flip side, more projections can mean slower load rates and a larger footprint.
As indicated, 2-5 projections per table is common (1-2 superprojections and 1-2 query-specific projections).
Facilitator Keys
So what are some of the advantages of using the DBD?
You do not have to do a lot of the manual tuning that you may have had to do in your old environment.
You do not have to deploy what the DBD tells you. You have the flexibility to add columns, change the ORDER BY, or replicate vs. segment.
It is a major benefit that the Optimizer runs the DBD and queries. This means the DBD will evaluate your sample queries exactly the way the real queries will run.
An incremental design type may be employed some time after the comprehensive design to provide additional queries to Vertica that should be considered for optimization.
This process may (but not always) result in additional incremental projections being created when the existing design does not provide optimal query performance.
There is not always a one-to-one ratio between new queries and new projections. Sometimes the DBD may create a single new incremental projection that serves several new queries, if they use the same table.
Encourage experimentation in a test environment to see what happens, so that students can weigh the tradeoff between performance and the extra file space required for any extra projections created to meet the needs of the newly added query optimizations.
Note: a good rule of thumb is to provide no more than 10-15 queries in the list for an incremental optimization. However, up to 100 queries may be provided.
The query file provided to the DBD will be parsed for syntax errors, but not other errors, so make sure the queries are solid.
Theoretically, a comprehensive run can take an unlimited number of queries, but it is best to keep it around 100 or less. For incremental, go with around 10 or less.
You can cheat the DBD by repeating important queries so that the DBD identifies a common pattern and is more likely to optimize against it.
It is very important to include predicates. It doesn't matter if A = 15; it is simply important that A is searched on.
Encourage backups of all these scripts after each run of the DBD for archival purposes. This provides a way to restore a previous design if there are deployment issues with the new design, or for any other reason it might be desirable to get back to the previous design.
There are three deployment-related scripts at this location:
[designName]_deploy.sql: contains the entire comprehensive design script. It:
Creates the new projections
Refreshes the new projections
Drops projections from the previous design
[designName]_design.sql: contains only the DDL to create the new projections. It does not refresh or drop any projections. It is included as the create portion of the deploy script.
[designName]_projection_backup_[sequenceNumber].sql: contains a script to re-create the design that existed prior to running the deploy script. The projections created by the backup script are the ones being dropped by the deploy script.
This shows running the deployment script from vsql or a command prompt.
New note added for 7.0
All nodes should be up when running the deployment script, since this allows the creation of buddy projections.
When you run the deployment script in MC, you must save it with the Export Design button or the Save button; otherwise, when you deploy, the script gets deleted.
It is possible that there were known issues with the sample data that you can resolve now.
The cardinality of the sample may have been off, causing the DBD to make one decision about encoding that you now want to change.
The ORDER BY could be altered if you know of additional queries that could be run but did not have examples of them.
The segmentation could be changed if you want to set up for local joins or a local GROUP BY.
These topics are covered later in the Advanced Query Performance Design topic, where students will better understand why to look at these sections and how to detect where additional optimization may be possible before running the deployment script.
Notes:
In general, for incremental optimization, the ORDER BY clause in the projection DDL should follow the ORDER BY clause of a query. This causes the data to be grouped on each node before being aggregated by the initiator, reducing the initiator node's processing time for the query.
Keep in mind that the DDL being viewed may be either the superprojection or an incremental projection designed to support another query. So be cautious when modifying the ORDER BY clause in the DDL to optimize for a specific query; doing so may adversely impact the optimization of another query.
Consider MODULARHASH to distribute data more evenly than HASH (which distributes data using a normal statistical distribution). Use MODULARHASH if you can hash-segment your data using a column with a regular pattern, such as a sequential unique identifier. The DBD in 6.1.2 utilizes MODULARHASH. A sketch follows.
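A minimal hedged sketch of MODULARHASH segmentation; the events table and its columns are assumptions, not from the course materials:

-- MODULARHASH on a sequential unique identifier (assumed names)
CREATE PROJECTION events_p (event_id, payload) AS
SELECT event_id, payload FROM events
ORDER BY event_id
SEGMENTED BY MODULARHASH(event_id) ALL NODES;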
Database Designer Logging Projection Data
HP Vertica 7.1 allows you to log information about the projections that the Optimizer recommends. Database Designer considers these projections when creating a design. HP Vertica stores this information in two Data Collector (DC) tables:
DC_DESIGN_PROJECTION_CANDIDATES
DC_DESIGN_QUERY_PROJECTION_CANDIDATES
When you enable logging and start Database Designer, the Optimizer proposes a set of ideal projections based on the options that you specify. The logs contain information about:
Whether or not the projections are actually created when the design is deployed.
How projections are optimized.
Whether the projections are created with or without the ideal criteria that the Optimizer identified.
If you do not deploy the design immediately, review the logs to determine if you want to change the design and proposed projections. If Database Designer deployed the design, you can still manually create some of the projections that Database Designer did not create in the deployed design.
By default, logging the Database Designer design data is disabled. To enable it, turn on the configuration parameter DBDLogInternalDesignProcess:
=> SELECT SET_CONFIG_PARAMETER('DBDLogInternalDesignProcess','1');
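Once logging is enabled and a design has run, the candidate information can be inspected directly; this query is a sketch against one of the DC tables named above:

=> SELECT * FROM dc_design_projection_candidates LIMIT 10;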
To build a projection, we will assume we are designing it with a specific query or group of queries in mind.
First, choose the columns; then consider the sort order and encoding types that will minimize the projection's storage footprint and allow for an efficient query.
Finally, define the projection segmentation: will the projection be segmented or replicated, and if it is segmented, which columns will you segment on?
Optional exercise: using the graphic, have students tell you where in the DDL these steps are specified.
Facilitator Keys
Now, let's start to discuss the Vertica Analytics Platform. These are the six specific things that make Vertica "Vertica," as we discussed earlier.
Vertica prepends "agg_" to the projection name.
The DBD does not create live aggregate projections when optimizing for queries. Live aggregate projections must be created manually.
There is a practical (but no declared) limit to the number of aggregate projections per anchor table, since the aggregations are done at load time.
Requirements
You must include a GROUP BY clause, and it must appear at the end of the CREATE PROJECTION statement.
When you create a live aggregate projection for a table, HP Vertica automatically aggregates the data from the anchor table and loads it into the live aggregate projection. On subsequent loads through the anchor table, HP Vertica updates both any regular projections and any live aggregate projections associated with the anchor table.
An anchor table must exist to perform any DML.
To create a live aggregate projection, an anchor table must first be created through which data can be queried and loaded.
At least one regular projection must exist to serve as the base for the aggregate projection.
So a base projection must be created to serve as the anchor projection for the live aggregate projection.
Finally, the live aggregate projection can be created. It uses a base projection as the source of its data.
The live aggregate projection will have some or all of the columns from the base projection. In addition, it will contain one or more aggregate columns.
The query in the DDL will contain some type of aggregation SQL to generate the columns and aggregate the data in the aggregated columns. A sketch of these steps follows.
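A minimal hedged sketch of the three steps (anchor table, anchor projection, live aggregate projection); the sales table and all column names are assumptions:

-- Step 1: the anchor table through which data is loaded and queried.
CREATE TABLE sales (customer_id INT, amount FLOAT);

-- Step 2: a regular (anchor) projection; its segmentation must be a
-- subset of the live aggregate projection's segmentation.
CREATE PROJECTION sales_p (customer_id, amount) AS
SELECT customer_id, amount FROM sales
SEGMENTED BY HASH(customer_id) ALL NODES;

-- Step 3: the live aggregate projection, defined with a GROUP BY at
-- the end of the statement.
CREATE PROJECTION sales_agg (customer_id, total_amount) AS
SELECT customer_id, SUM(amount) FROM sales
GROUP BY customer_id;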
Queries containing aggregates are made directly against the live aggregate projection (not the table).
Unlike regular projections, queries against live aggregate projections are made directly against the projection itself.
Normally, a query is run against a table and the optimizer chooses the projection to be used. In the case of live aggregate projections, the table name must be replaced with the name of the live aggregate projection.
Important: You must create an anchor table and an anchor projection before you create a live aggregate projection.
The anchor projection must specify a segmentation that is a subset of the live aggregate projection's segmentation.
NOTE: In the DDL, you do not specify this is an anchor table in the conventional way. The query in the DDL of the live aggregate projection (see step 3) is an aggregation query.
If you do not specify the segmentation for the anchor projection, HP Vertica creates a projection that is segmented by all columns. If the anchor projection is segmented on all columns, you cannot create a live aggregate projection.
You cannot use the ORDER BY clause when you create a live aggregate projection. Internally, the data is ordered on the GROUP BY column. However, when you retrieve the data, the results do not necessarily display in that order. Use the ORDER BY clause to sort the results of querying the live aggregate projection:
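For example (a sketch using the hypothetical sales_agg projection from the earlier step-by-step sketch):

SELECT customer_id, total_amount
FROM sales_agg
ORDER BY total_amount DESC;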
Before you create a live aggregate projection, you must create an anchor projection. The anchor projection's segmentation must be a subset of the live aggregate projection's segmentation. The anchor table cannot be unsegmented.
The GROUP BY and PARTITION BY clauses must be supersets of the anchor projection segmentation.
Live aggregate projections are only stored in the ROS.
Live aggregate projections must be segmented.
The list of columns in the SELECT list for a live aggregate projection must be in the same order as the list of columns in the GROUP BY clause.
HP Vertica supports the following aggregations for live aggregate projections:
SUM [Aggregate]
Computes the sum of an expression over a group of rows. It returns a DOUBLE PRECISION value for a floating-point expression. Otherwise, the return value is the same as the expression data type.
MAX [Aggregate]
Returns the greatest value of an expression over a group of rows. The return value is the same as the expression data type.
MIN [Aggregate]
Returns the smallest value of an expression over a group of rows. The return value is the same as the expression data type.
COUNT [Aggregate]
Returns the number of rows in each group of the result set for which the expression is not NULL. The return value is a BIGINT.
The COUNT() aggregate function is different from the COUNT() analytic function. The COUNT() analytic function returns the number over a group of rows within a window.
Live aggregate projections cannot contain DISTINCT aggregates.
Another type of live aggregate projection is the Top-K projection. For optimal performance of Top-K queries, create a Top-K projection that aggregates the data in the table for fast access. Querying the pre-aggregated data directly from the Top-K projection is usually faster than querying the data from the anchor table and then calculating the top k rows.
Top-K projections are a type of live aggregate projection. All the requirements and limitations for live aggregate projections apply to Top-K projections as well.
In this slide, we see a Top-K query that will bring back meter readings for the past five reading dates.
With a regular projection, the query would need to go into a table that contains all reading dates, then filter out the irrelevant dates.
By creating a Top-K projection, the projection contains only the data needed by the query. Because of that, the query can simply lift the tuples from the Top-K projection without doing any processing. This makes the query run more quickly.
Note: normally customers may use aggregate queries to get their top k; Vertica recommends an analytic query, as it is an analytical database.
The following considerations are unique to Top-K projections:
Top-K projections must include LIMIT, OVER, PARTITION BY, and ORDER BY clauses. In other words, Top-K projections are inherently designed to contain a subset of the total data in a table, so you must use one of these organizing characteristics to group and limit the data in the projection.
When creating a Top-K projection, you can only use an ORDER BY clause on the PARTITION BY clause inside an OVER() clause.
The columns in the PARTITION BY and ORDER BY clauses must also be in the SELECT list.
The PARTITION BY clause must be a superset of the anchor projection segmentation. In other words, if segmenting the Top-K projection, the anchor projection may be segmented only on the leading columns of the Top-K projection's segmentation clause.
You cannot use a PARTITION AUTO clause when creating a Top-K projection. The partitioning of a Top-K projection must indicate a specific column or expression.
You cannot use the DISTINCT keyword in a Top-K projection definition.
Top-K projections support ORDER BY NULLS FIRST/LAST.
Facilitator Keys
In this example, we look at joining the fact table purchases to the dimension table customers. The projection for purchases is not sorted on the cust_id column that we will use to join it with the dimension table customers. The projection for customers is also not sorted on the join column cust_id. Because neither projection is sorted, Vertica will perform the join by hashing the dimension projection customers into memory. It will then scan through the fact (outer) projection purchases, creating an output tuple as each match is found. When the join is complete, the entire contents of the join are sent to the initiator node to be aggregated with the join results from the other nodes.
For joins in which the inner (dimension) table is small, a hash join can be OK. However, for joins between very large tables, the resulting hash table can be very large, using lots of memory, or even running out of memory and spilling to disk, which is even worse than a purely memory-based hash.
In this example, we have projections for the fact (outer) table purchases and the dimension (inner) table customers that we want to join on cust_id. Because both projections are ordered by the column cust_id, Vertica can join the individual tuples and send them to the initiator node as the joins are made. Vertica does not need to create a hash table to accomplish this.
When this situation occurs, Vertica selects a MERGE JOIN as the operation used to bring the data together. This is shown in the query plan on the join step.
Additional information: it is not possible to force Vertica to perform a MERGE JOIN using a hint or notation in the query. The Vertica optimizer chooses this more efficient join methodology based on seeing optimally designed projections. If the dimension table is small and not sorted by the join column, Vertica may perform a SORT MERGE JOIN on the dimension data to sort it in memory by the join key. When this happens, the optimizer may choose to do the join between the projections using a MERGE JOIN instead of a HASH JOIN. A hedged sketch of such projections follows.
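A minimal sketch of projections that would let the optimizer choose a MERGE JOIN for this example; the column lists are assumptions beyond cust_id, which is named in the notes above:

-- Fact projection, sorted and segmented on the join key (assumed columns)
CREATE PROJECTION purchases_p (cust_id, amount) AS
SELECT cust_id, amount FROM purchases
ORDER BY cust_id
SEGMENTED BY HASH(cust_id) ALL NODES;

-- Small dimension projection, sorted on the join key and replicated
CREATE PROJECTION customers_p (cust_id, name) AS
SELECT cust_id, name FROM customers
ORDER BY cust_id
UNSEGMENTED ALL NODES;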