Why Is the Apache Arrow Format Fast?

Kouhei Sutou
ClearCode Inc.
db tech showcase ONLINE 2020
2020-12-08
(Slides rendered with Rabbit 3.0.1)
Apache Arrow and Me
- First commit: 2016-12-21
- Committer: 2017-05-10
- PMC member: 2017-09-15
- As of 2020-11-25: #2 in commit count (out of 508 contributors)
Apache Arrow
- Provides the foundation for data analysis tools
- All-in-one: everything such tools need
- Supports many programming languages
"Everything"? Like What!?
A data format, fast processing of that data, conversion to and from various other data formats, transparent reading and writing of local and remote data, fast RPC, and more -- but explaining all of it only earns the reaction "sounds impressive, but I don't get it!"
Today's Topic
The Apache Arrow format
The Apache Arrow Format
- A data format
  - For both wire transfer and in-memory use
- For tabular data
  - That is, data-frame-style data
- Fast!
Fast!
- Fast data exchange!
- Fast data processing!
Fast Data Exchange!
Switching to the Apache Arrow format speeds things up:
- Before: data source --(CSV etc.)--> data processing
- After: data source --(Apache Arrow)--> data processing: faster
(Apache Arrow logo is licensed under the Apache License 2.0, © 2016-2020 The Apache Software Foundation)
Use Case: Apache Spark
- Spark --(pickle)--> PySpark
- Spark --(Apache Arrow)--> PySpark: faster
Execution time, on a 0s-20s scale (shorter is faster).
Data at https://arrow.apache.org/blog/2017/07/26/spark-arrow/
Use Case: Amazon Athena
- S3 --(CSV)--> Athena via Federated Query with Apache Arrow: faster than the built-in query path
Time to process 100 million rows, on a 0s-8s scale (shorter is faster).
Data at https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-federation-sdk#performance
Use Case: RAPIDS
[Diagram: "RAPIDS: End-to-End GPU Accelerated Data Science" -- cuDF/cuIO (analytics and data preparation), cuML (machine learning), cuGraph (graph analytics), PyTorch/TensorFlow/MXNet (deep learning), and cuxfilter/pyViz/plotly (visualization), all sharing GPU memory and scaled out with Dask.]
https://docs.rapids.ai/overview/RAPIDS%200.15%20Release%20Deck.pdf#page=3
Use Case: RAPIDS
[Diagram: "Data Processing Evolution: Faster Data Access, Less Data Movement" --
- Hadoop processing, reading from disk: an HDFS read/write around every query, ETL, and ML training step
- Spark in-memory processing: a single HDFS read; 25-100x improvement, less code, language flexible, primarily in-memory
- Traditional GPU processing: GPU read/CPU write around every step; 5-10x improvement, more code, language rigid, substantially on GPU
- RAPIDS: a single Arrow read, then query, ETL, and ML training all stay on the GPU; 50-100x improvement, same code, language flexible, primarily on GPU]
https://docs.rapids.ai/overview/RAPIDS%200.15%20Release%20Deck.pdf#page=8
Why Is It Fast?
- Low serialization cost
  - Data is ready to send right away
- Low deserialization cost
  - Received data is ready to use right away
Serialization
1. Prepare the metadata
2. Send the metadata plus the raw data itself
- Fast because the raw data is never transformed!
- Doing nothing is the fastest!
Example of Transforming the Raw Data: JSON
0x01 0x02 (array of 8-bit integers)
↓
"[1,2]" (JSON)
0x01 → 0x31 (number → ASCII character '1')
0x02 → 0x32 (number → ASCII character '2')
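A minimal sketch of that conversion cost, using Python's json module: the raw bytes need no work at all, while JSON must transform every value into ASCII text and allocate a new buffer for it.

```python
import json

# The raw bytes as they sit in memory: nothing to do before sending them.
raw = bytes([0x01, 0x02])

# JSON has to transform every value into ASCII text (and allocate for it).
encoded = json.dumps(list(raw))   # the two bytes become the text "[1, 2]"

# 0x01 became the character '1', which is byte 0x31 -- a real conversion.
assert encoded[1] == "1" and ord("1") == 0x31
```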
Using the Raw Data As-Is...
- No CPU spent on conversion
  - Fast
- Zero memory allocated for converted data
  - Large allocations are expensive
  - No need for tricks like reusing a fixed-size scratch buffer
  - Fast
Deserialization
1. Parse the metadata
2. Locate the raw data using the metadata
- Fast because the raw data can be used as-is!
- Doing nothing is the fastest!
Example of Restoring the Raw Data: JSON
"[1,2]" (JSON)
↓
0x01 0x02 (array of 8-bit integers)
0x31 → 0x01 (ASCII character '1' → number)
0x32 → 0x02 (ASCII character '2' → number)
When the Raw Data Can Be Extracted Directly...
- No CPU spent on conversion
  - Fast
- Zero memory allocated for converted data
  - The existing bytes are used as-is: zero-copy
  - Fast
- Data can be used directly via memory mapping
  - Handles on-disk datasets larger than memory
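The memory-mapping point can be sketched with Python's mmap module on a hypothetical values.bin file (an assumption for illustration, not an Arrow file): the OS pages bytes in on demand, so values are read in place without first copying the whole file into memory.

```python
import mmap
import os
import struct
import tempfile

# A stand-in data file: three little-endian 32-bit integers.
path = os.path.join(tempfile.mkdtemp(), "values.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<3i", 1, 2, 3))

# Map the file and read element 2 in place: only the touched page is
# brought in, so files larger than RAM can be processed the same way.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        third = struct.unpack_from("<i", m, 4 * 2)[0]  # reads the mapped bytes directly

print(third)
```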
{,De}serialization Cost
- Apache Arrow format
  - Essentially just the cost of parsing the metadata
- Most other formats
  - Data conversion (CPU)
  - Scratch-memory allocation (memory)
What About Data Size?
So it's CPU- and memory-friendly and fast -- but how about data size?
- Large data makes network and I/O the bottleneck
- We want to spend resources on the actual data processing
- Data exchange must not become the bottleneck
Data Size
- Not particularly small
  - It's the raw data as-is, so size grows with the data
- Compression with Zstandard and LZ4 is supported
  - No longer zero-copy, but a fraction of the size
  - Compression and decompression are required → the raw data can no longer be used directly
  - CPU and memory load go up, but network and I/O load go down
  - Effective when network or I/O is the bottleneck
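The tradeoff can be sketched with Python's zlib as a stand-in (Zstandard and LZ4 are outside the standard library, so this is an assumption for illustration): compression shrinks the buffer considerably, but it spends CPU on both ends and the decompressed result is a fresh allocation, so zero-copy is lost.

```python
import zlib

# Repetitive column-like data: 100,000 little-endian 32-bit values.
raw = b"".join((i % 100).to_bytes(4, "little") for i in range(100_000))

compressed = zlib.compress(raw, level=3)
print(len(raw), len(compressed))   # the compressed buffer is a fraction of raw

# Reading it back requires a decompression pass (CPU + a fresh allocation).
assert zlib.decompress(compressed) == raw
```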
Data Size and Read Speed with Compression
https://ursalabs.org/blog/2020-feather-v2/
Fast Data Exchange: Summary
- {,De}serialization is fast
  - The raw data is used as-is, so little work is needed
  - CPU- and memory-friendly
- Compression is also supported
  - Use it when network or I/O is the bottleneck
  - CPU and memory load go up, but the data-exchange bottleneck may go away
Handling the Exchanged Data
- Data analysis is more than data exchange
  - Speeding up only the exchange does not make a foundation
- Data processing has to be fast too!
  - And fast data processing hinges on data structure
Data Structures for Fast Processing
- Basic principle:
  - Keep related data close together
- Effects:
  - Fewer CPU cache misses
  - SIMD can be exploited
Related Data During Analysis
- Analysis is mostly per-column processing
  - Aggregation, sorting, filtering, ...
- So keep data from the same column close together
  - A columnar format
Columnar Format
For a table with columns a, b, c and rows 1, 2, 3:

                    | Apache Arrow | RDBMS
Values managed per  | column       | row
Fast access per     | column       | row
Data Layout Within Each Column
- Keep related data close together
- Lay data out for constant-time access
- Lay data out for SIMD
  - Align it
    - Alignment: place data boundaries on multiples of 64 and the like
  - Eliminate branches
Layout: Booleans and Numbers
Fixed-length data, so values are placed contiguously
32-bit integers: [1, 2, 3]
0x01 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x03 ...
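The fixed-width layout above can be sketched with Python's struct module: because every element is 4 bytes, element i sits at byte offset 4*i, which is what makes access constant-time.

```python
import struct

# [1, 2, 3] as consecutive little-endian 32-bit integers, as on the slide.
buf = struct.pack("<3i", 1, 2, 3)
print(buf.hex(" "))   # 01 00 00 00 02 00 00 00 03 00 00 00

def get_int32(buf, i):
    # Fixed width: element i is always at byte offset 4 * i -- O(1) access.
    return struct.unpack_from("<i", buf, 4 * i)[0]

assert get_int32(buf, 2) == 3
```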
Layout: Strings and Byte Sequences
Stored as a data buffer plus an offsets array
UTF-8 strings: ["Hello", "", "!"]
Data buffer: "Hello!"
Offsets array: [0, 5, 5, 6]
Length of element i: offsets[i+1] - offsets[i]
Data of element i: data buffer[offsets[i]..offsets[i+1]]
Note: a length→data→length→data→... layout would not allow constant-time access to element i
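The slide's layout in a few lines of Python: one concatenated data buffer plus an offsets array gives constant-time access to element i, including the empty string.

```python
# ["Hello", "", "!"] stored as one data buffer plus offsets, as on the slide.
data = b"Hello!"
offsets = [0, 5, 5, 6]   # element i spans data[offsets[i]:offsets[i + 1]]

def get_string(i):
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

assert [get_string(i) for i in range(3)] == ["Hello", "", "!"]
assert len(get_string(1)) == offsets[2] - offsets[1]   # the empty element
```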
null and Branching
Approaches to null support:
- A dedicated null value: Julia's missing, R's NA
- A separate validity bitmap: Apache Arrow
Bitmaps make it possible to eliminate branches
null, Branching, and SIMD
[1, null, 3] + [null, 2, 5]:
  validity: 1 0 1          0 1 1
  data:     1 X 3          X 2 5
A bitwise AND of the validity bitmaps → 0 0 1
A SIMD add of the data, ignoring nulls → X X 8
Result: [null, null, 8]
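A scalar sketch of that example (SIMD hardware would do the same thing lane by lane): one bitwise AND of the validity bitmaps replaces per-element null checks, and the data lanes are added unconditionally. Here bit i of a bitmap marks element i as valid, least-significant bit first, as in Arrow.

```python
# [1, null, 3]: elements 0 and 2 valid -> bits 1,0,1 -> 0b101 (LSB first)
valid_a, data_a = 0b101, [1, 0, 3]   # null slots hold an arbitrary dummy value
# [null, 2, 5]: elements 1 and 2 valid -> bits 0,1,1 -> 0b110
valid_b, data_b = 0b110, [0, 2, 5]

# One bitwise AND settles validity for the whole batch: no per-element branch.
valid_out = valid_a & valid_b        # 0b100: only element 2 is valid

# Add every lane unconditionally; invalid lanes compute garbage that the
# bitmap tells readers to ignore.
data_out = [x + y for x, y in zip(data_a, data_b)]

result = [v if (valid_out >> i) & 1 else None for i, v in enumerate(data_out)]
print(result)   # [None, None, 8]
```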
null, Branching, and SIMD: Benchmarks
https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
Fast Data Processing: Summary
- Data structure is key to fast processing
  - A columnar format suits data analysis
- Layouts that allow constant-time access
- SIMD-friendly data representation
  - Alignment, and validity bitmaps for null
Summary
- Apache Arrow
  - Some impressive thing that seems useful for data analysis!
  - The full picture wasn't explained, on the excuse that it only earns "I don't get it!"
  - Neither was how to use it!
- The Apache Arrow format
  - This much did get explained!
Summary: The Apache Arrow Format
- The Apache Arrow format
  - A data format for wire transfer and in-memory use
  - For tabular data
- The Apache Arrow format is fast
  - Fast data exchange
  - Fast data processing
  - A format whose data can be exchanged and then processed at speed right away
Summary: Why Data Exchange Is Fast
- The raw data is exchanged as-is
  - Low {,de}serialization cost
  - CPU- and memory-friendly
- When network/I/O load must come down
  - Zstandard/LZ4 compression
  - CPU and memory load go up, but network and I/O load go down
Summary: Why Data Processing Is Fast
- Optimized data structures
  - Columnar format
  - SIMD-friendly data representation
- Optimized implementations
  - Not covered this time!!!
Coming Next!
- Option 1: Fast processing of Apache Arrow data
- Option 2: Fast RPC for Apache Arrow data
- Option 3: Using Apache Arrow from the ○○ language
- ...
Next Steps
- Want to know more? Invite us to present at your event or company!
  - https://www.clear-code.com/contact/
- Want to use it? Ask us for commercial support!
  - https://www.clear-code.com/contact/
- Want to join Apache Arrow development?
  - https://arrow.apache.org/community/
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 

Recently uploaded (20)

Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 

Why Apache Arrow format is fast

  • 1. Why Is the Apache Arrow Format Fast? (slides rendered with Rabbit 3.0.1) — Kouhei Sutou, ClearCode Inc. — db tech showcase ONLINE 2020, 2020-12-08
  • 2. Apache Arrow and me: first commit on 2016-12-21; committer since 2017-05-10; PMC member since 2017-09-15; 2nd in commit count (out of 508 contributors) as of 2020-11-25
  • 3. Apache Arrow: provides the foundation for data analysis tools; "batteries included" — everything such tools need; supports many programming languages
  • 4. "Everything" meaning what?! A data format, fast processing of that data, conversion to and from various other data formats, transparent reads and writes of local and remote data, fast RPC... but whenever I explain all of it, the reaction is "sounds impressive, but I don't get it!"
  • 5. Today's topic: the Apache Arrow format
  • 6. The Apache Arrow format: a data format — for both on-the-wire and in-memory use; for tabular data, i.e. data-frame-style data; fast!
  • 7. Fast! Data interchange is fast! Data processing is fast!
  • 8. Data interchange is fast! Switching from CSV and the like to the Apache Arrow format speeds up the path from data source to data processing. (Apache Arrow logo is licensed under Apache License 2.0, © 2016-2020 The Apache Software Foundation)
  • 9. Use case: Apache Spark. Replacing pickle with Arrow for Spark→PySpark transfer cuts execution time (chart axis 0s–20s; shorter is faster). Data at https://arrow.apache.org/blog/2017/07/26/spark-arrow/
  • 10. Use case: Amazon Athena. Athena Federated Query reads S3 data via Arrow and is faster than the built-in CSV query path. Time to process 100 million rows (chart axis 0s–8s; shorter is faster). Data at https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-federation-sdk#performance
  • 11. Use case: RAPIDS — "End-to-End GPU Accelerated Data Science": cuDF/cuIO (analytics, data preparation), cuML (machine learning), cuGraph (graph analytics), PyTorch/TensorFlow/MxNet (deep learning), cuxfilter/pyViz/plotly (visualization), all on Dask and shared GPU memory. Diagram from https://docs.rapids.ai/overview/RAPIDS%200.15%20Release%20Deck.pdf#page=3
  • 12. Use case: RAPIDS — "Data Processing Evolution: Faster Data Access, Less Data Movement": Hadoop reads from HDFS at every step; Spark keeps data in memory (25-100x improvement, less code, language flexible); traditional GPU processing adds CPU↔GPU copies at every step (5-10x improvement, more code, language rigid); RAPIDS reads Arrow directly on the GPU (50-100x improvement, same code, language flexible, primarily on GPU). Diagram from https://docs.rapids.ai/overview/RAPIDS%200.15%20Release%20Deck.pdf#page=8
  • 13. Why is it fast? Low serialization cost — data is ready to send immediately; low deserialization cost — received data is ready to use immediately
  • 14. Serialization: 1. prepare the metadata; 2. send the metadata plus the raw data itself. Fast because the raw data is never transformed — doing nothing is the fastest!
  • 15. An example that does transform the raw data: JSON. 0x01 0x02 (an array of 8-bit integers) becomes "[1,2]" (JSON): 0x01 → 0x31 (integer → ASCII character '1'), 0x02 → 0x32 (integer → ASCII character '2')
  • 16. Using the raw data as-is means: no CPU spent on conversion — fast; zero memory allocated for converted data — large allocations are expensive, and there is no scratch buffer to manage — fast
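The cost difference on slides 14–16 can be made concrete with a small Python sketch (illustrative only — Arrow's real IPC layer also writes Flatbuffers-based metadata alongside the buffers): JSON must transform every value into ASCII text and allocate a new string, while an Arrow-style payload is just the buffer that already exists in memory.

```python
import array
import json

# Raw data: 8-bit integers, already laid out as contiguous bytes.
values = array.array("B", [1, 2])

# JSON serialization converts every value into text
# (1 -> '1' = 0x31, 2 -> '2' = 0x32) and allocates a new string.
as_json = json.dumps(list(values))   # '[1, 2]'

# Arrow-style serialization: the payload is the raw buffer itself,
# so "serializing" is just taking the bytes that are already there.
as_raw = values.tobytes()            # b'\x01\x02'

print(as_json, as_raw)
```

The same asymmetry applies on the receiving side: the JSON string has to be parsed back into integers, while the raw bytes are already the integers.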
  • 17. Deserialization: 1. parse the metadata; 2. locate the raw data from the metadata. Fast because the raw data is usable as-is — doing nothing is the fastest!
  • 18. An example that must convert back: JSON. "[1,2]" (JSON) becomes 0x01 0x02 (an array of 8-bit integers): 0x31 (ASCII character '1') → 0x01, 0x32 (ASCII character '2') → 0x02
  • 19. When the raw data can be used directly: no CPU spent on conversion — fast; zero memory allocated for converted data — the received bytes are used as-is, i.e. zero-copy — fast; the data can even be used through a memory map — on-disk data larger than memory can be handled
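The zero-copy and memory-mapping idea from slide 19 can be sketched with Python's standard library alone (a stand-in for Arrow's actual readers; native little-endian byte order is assumed): the file's bytes are reinterpreted in place as 32-bit integers, with no conversion pass and no copy, so a file larger than RAM could be processed the same way.

```python
import mmap
import struct
import tempfile

# Write three native 32-bit integers to a file, standing in for an
# Arrow buffer on disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack("=3i", 1, 2, 3))
    path = f.name

# Memory-map the file and reinterpret the bytes as int32 values.
# memoryview.cast makes no copy: it reads the mapped pages directly.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        ints = memoryview(m).cast("i")
        print(list(ints))   # [1, 2, 3]
        ints.release()      # release the view before the map closes
```

`ints[1]` is a plain indexed read into the mapped file; nothing is parsed and nothing is allocated per element.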
  • 20. {De,}serialization cost: Apache Arrow format — almost nothing but the cost of parsing metadata; most other formats — data conversion (CPU) plus working-memory allocation (memory)
  • 21. (recap) Data interchange is fast! Switching from CSV and the like to the Apache Arrow format speeds up the path from data source to data processing
  • 22. (recap) Use case: Apache Spark — replacing pickle with Arrow for Spark→PySpark transfer cuts execution time. Data at https://arrow.apache.org/blog/2017/07/26/spark-arrow/
  • 23. (recap) Use case: Amazon Athena — Federated Query via Arrow is faster than the built-in CSV path for 100 million rows. Data at https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-federation-sdk#performance
  • 24. (recap) Use case: RAPIDS — reading Arrow directly on the GPU yields a 50-100x improvement. Diagram from https://docs.rapids.ai/overview/RAPIDS%200.15%20Release%20Deck.pdf#page=8
  • 25. What about data size? CPU- and memory-friendly and fast, sure — but how big is the data? Large data makes the network and IO the bottleneck; resources should go to the actual data processing; data interchange must not become the bottleneck
  • 26. Data size: not especially small — the raw data is kept as-is, so size grows with the amount of data. Compression with Zstandard and LZ4 is supported: no longer zero-copy, but a fraction of the size; compression and decompression are needed, so the raw data can no longer be used directly; CPU and memory load go up, but network and IO load go down — a win when network or IO is the bottleneck
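The trade-off on slide 26 can be sketched with stdlib `zlib` as a stand-in for Zstandard and LZ4 (the codecs Arrow actually supports): a repetitive column buffer shrinks dramatically, but decompression now costs CPU and a fresh allocation, so the buffer is no longer usable zero-copy.

```python
import struct
import zlib

# A buffer of 100,000 int32 values with many repeats, standing in
# for one Arrow column buffer (400,000 bytes uncompressed).
raw = struct.pack("<100000i", *([7] * 100000))

compressed = zlib.compress(raw)      # Arrow uses Zstandard or LZ4

print(len(raw), len(compressed))     # compressed is a small fraction

# Decompression restores the exact bytes, but spends CPU and
# allocates a new buffer -- the zero-copy property is gone.
assert zlib.decompress(compressed) == raw
```

Whether this is a net win depends on the workload: it pays off exactly when network or IO bandwidth, not CPU, is the limiting resource.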
  • 27. Data size and read speed under compression: see the benchmarks at https://ursalabs.org/blog/2020-feather-v2/
  • 28. Fast data interchange, summarized: {de,}serialization is fast — the raw data is used as-is, so there is little to do, and it is CPU- and memory-friendly; compression is also supported — use it when network or IO is the bottleneck; CPU and memory load rise, but the interchange bottleneck may disappear
  • 29. What about the exchanged data? Data analysis is more than data interchange — speeding up interchange alone does not make a foundation; data processing must be fast too, and for fast processing the data structure matters
  • 30. Data structures for fast processing. Principle: keep related data close together. Effects: fewer CPU cache misses; SIMD becomes usable
  • 31. What data is related during analysis? Analysis mostly works column by column — aggregation, sorting, filtering, ...; so keep values of the same column close together — a columnar format
  • 32. Columnar format: an RDBMS manages values row by row, so whole rows are the fast access unit; Apache Arrow manages values column by column, so whole columns are the fast access unit
  • 33. Data placement within each column: keep related data close; place it for constant-time access; place it for SIMD — align it (alignment: start buffers on boundaries that are multiples of 64 and the like) and eliminate branches
  • 34. Booleans and numbers: fixed-width data, so values are placed contiguously. 32-bit integers [1, 2, 3] (little-endian): 0x01 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x03 ...
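The byte layout on slide 34 can be checked in a couple of lines (a little-endian machine is assumed): element i of a fixed-width column lives at byte offset `i * width`, which is what makes constant-time access trivial.

```python
import array

# 32-bit integers [1, 2, 3] stored contiguously: element i starts
# at byte offset 4 * i, so indexing needs no scan and no parse.
buf = array.array("i", [1, 2, 3]).tobytes()

print(buf.hex(" ", 4))
# On a little-endian machine:
# 01000000 02000000 03000000
```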
  • 35. Strings and byte sequences: one data buffer plus an offsets array. UTF-8 strings ["Hello", "", "!"]: data buffer "Hello!", offsets array [0, 5, 5, 6]; length of item i = offsets[i+1] - offsets[i]; data of item i = buffer[offsets[i]..offsets[i+1]]. Note: interleaving length, data, length, data, ... would make constant-time access to item i impossible
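The offset-based layout from this slide, sketched in Python: one contiguous data buffer plus an offsets array with n+1 entries, giving constant-time access to any item via two offset lookups and one slice.

```python
# The variable-length string layout: ["Hello", "", "!"].
strings = ["Hello", "", "!"]

# Build the single data buffer and the n+1 offsets.
data = b"".join(s.encode("utf-8") for s in strings)   # b"Hello!"
offsets = [0]
for s in strings:
    offsets.append(offsets[-1] + len(s.encode("utf-8")))
# offsets == [0, 5, 5, 6]

def get(i):
    # Constant-time access: two offset lookups and one slice.
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

assert [get(i) for i in range(3)] == ["Hello", "", "!"]
```

Had lengths been interleaved with the data instead, finding item i would require scanning every preceding item, which is exactly the pattern the slide warns against.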
  • 36. Null and branching. Approaches to null support: a sentinel null value (Julia: missing, R: NA) or a separate validity bitmap (Apache Arrow); with a bitmap, branches can be eliminated
  • 37. Null, branching, and SIMD: adding [1, null, 3] and [null, 2, 5] gives [null, null, 8] — add the data buffers with SIMD, ignoring the null slots, and combine the validity bitmaps with a bitwise AND
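The bitmap trick from this slide, sketched in pure Python (real implementations do the data adds with SIMD; a plain loop stands in here): every slot is added unconditionally with no per-element branch, and nullness is combined with a single bitwise AND of the validity bitmaps.

```python
# Element-wise add of [1, null, 3] and [null, 2, 5] using Arrow-style
# validity bitmaps: bit i is 1 when element i is valid (bit 0 = slot 0).
a_data, a_valid = [1, 0, 3], 0b101   # slot 1 is null; its data is junk
b_data, b_valid = [0, 2, 5], 0b110   # slot 0 is null

# Add every slot unconditionally -- no branch per element, which is
# what lets real implementations run this with SIMD...
out_data = [x + y for x, y in zip(a_data, b_data)]

# ...and combine nullness with one bitwise AND of the bitmaps.
out_valid = a_valid & b_valid        # 0b100: only slot 2 is valid

result = [v if (out_valid >> i) & 1 else None
          for i, v in enumerate(out_data)]
print(result)   # [None, None, 8]
```

With a sentinel null value instead of a bitmap, every element would need an "is this null?" check inside the add loop, which defeats vectorization.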
  • 38. Null, branching, and SIMD: benchmarks at https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
  • 39. Fast data processing, summarized: the data structure is crucial — a columnar format fits data analysis; place data for constant-time access; represent data in a SIMD-friendly way — alignment and validity bitmaps for null
  • 40. Summary: Apache Arrow — some impressive thing that seems useful for data analysis! The full picture was not explained ("people don't get it," remember), and neither was how to use it! The Apache Arrow format — this part, at least, was explained!
  • 41. Summary — the Apache Arrow format: a data format for on-the-wire and in-memory use, for tabular data. It is fast: data interchange is fast and data processing is fast — a format you can exchange and then immediately process at high speed
  • 42. Summary — why data interchange is fast: the raw data is exchanged as-is, so {de,}serialization cost is low and it is CPU- and memory-friendly; to lower network and IO load there is Zstandard/LZ4 compression — CPU and memory load go up, but network and IO load go down
  • 43. Summary — why data processing is fast: an optimized data structure — columnar format and SIMD-usable data representation; an optimized implementation — not covered this time!!!
  • 44. Next time! Idea 1: fast processing of Apache Arrow data; Idea 2: fast RPC for Apache Arrow data; Idea 3: how to use Apache Arrow from language X; ...
  • 45. Next steps: want this presented at your event or company? https://www.clear-code.com/contact/ — want commercial support to start using it? https://www.clear-code.com/contact/ — want to join Apache Arrow development? https://arrow.apache.org/community/