SlideShare a Scribd company logo
1 of 47
Download to read offline
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Fast data processing
with Ruby and Apache Arrow
Sutou Kouhei
ClearCode Inc.
RubyKaigi 2022
2022-09-10
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Sutou Kouhei
A president Ruby committer
The president of ClearCode Inc.
クリアコードの社長
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Sutou Kouhei
The Apache Arrow PMC chair
PMC: Project Management Committee
Apache Arrowのプロジェクト管理委員会のリーダー
✓
#2 commits(コミット数2位)
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Sutou Kouhei
The pioneer in Ruby and Arrow
A Ruby committer
Maintain some standard libraries/default gems
標準ライブラリーとかデフォルトgemのメンテナンスをしている
✓
✓
The author of Red Arrow
✓
Red Arrow:
The official Apache Arrow library for Ruby
公式のRuby用のApache Arrowライブラリー
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Why do I work on Red Arrow?
なぜRed Arrowの開発をしているか
To use Ruby for data processing too!
データ処理でもRubyを使いたい!
At least a part of data processing
データ処理の全部と言わず一部だけでも
✓
✓
Data processing is an important task
データ処理は最近の重要なタスクの1つ
# of Rubyists will be increased by this
データ処理にRubyを使えるようになるとRubyistが増えるはず
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Current situation
Negative spiral
今は負のスパイラル
Few users
Small community Few developers
Few useful tools
How to break the negative spiral?
どうやってこの負のスパイラルを打開する?
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Expand useful tools
with few developers
少人数で便利なツールを増やせればいいんじゃない?
Negative spiral
Positive spiral
Larger community More users
More developers More useful tools
Small community Few users
Few developers Few useful tools
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
But how?
でもどうやって?
Apache Arrow
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Apache Arrow
Cross-language dev platform for data
複数言語対応のデータ用の開発プラットフォーム
Ruby community doesn't need to dev everything
Rubyコミュニティーがすべてを開発しなくてもよい
✓
We can share common implementations
共通の実装を言語を超えて共有できる
✓
✓
Today's highlighted features
今日注目する機能
Fast data processing(高速データ処理)
✓
Fast data interchange(高速データ交換)
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
My approach
私のアプローチ
Negative spiral
Positive spiral
Apache Arrow
Larger community More users
More developers More useful tools
Small community Few users
Few developers Few useful tools
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Goal of this talk
このトークのゴール
Negative spiral
Positive spiral
Apache Arrow
Larger community More users
More developers More useful tools
Small community Few users
Few developers Few useful tools
You want to use Ruby
for some data processings
いくつかのデータ処理でRubyを使いたくなる
Especially, you want to implement a BI tool
特にBIツールを作りたくなる
✓
✓
You join Red Data Tools project
Red Data Toolsプロジェクトに参加する
It provides data processing tools for Ruby
Ruby用のデータ処理ツールを提供するプロジェクト
https://red-data-tools.github.io/
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Fast data processing
高速データ処理
Ruby is slow to process data
Rubyでデータを処理すると遅い
✓
Resolve in external process:(別プロセスで解決)
Use case: Web app, batch process for Web app
Use fast data processing module (e.g.: DB)
DBとか速いデータ処理モジュールを使う
✓
✓
Resolve in the same process:(プロセス内で解決)
Use cases: IRB, batch process for Web app
Implement core features in other fast lang
他の速い言語でコアの機能を実装
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
External process
別プロセス
Ruby External process
Fast data processing
Request
Response
Ruby External process
Popular case
in current Ruby usage
今のRubyの使われ方だとよくあるケース
✓
Small response: No problem
レスポンスが小さい場合は問題ない
✓
Large response:
レスポンスが大きい場合:
Sending/receiving response are slow
レスポンスの送信・受信処理が遅い
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Sending/receiving response
レスポンスの送受信
Ruby External process
Serialize
Deserialize
Send
Ruby External process
Serialize/deserialize
are slow
シリアライズ・デシリアライズが遅い
✓
How to speed them up?
どうやって高速化すればよいか
Apache Arrow format
✓
Serialize/deserialize cost ≒ 0
シリアライズ・デシリアライズコストがほぼ0
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Why Apache Arrow format is fast
Apache Arrowフォーマットはなぜ速いのか
Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1
Apache Arrowフォーマットは
なぜ速いのか
須藤功平
株式会社クリアコード
db tech showcase ONLINE 2020
2020-12-08
https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Apache Arrow Flight SQL
Ruby SQL DB
Fast data processing
Request (SQL)
Response (Apache Arrow data)
Ruby SQL DB
gRPC based protocol
gRPCベースのプロトコル
NOTE: Other network libraries
such as UCX can be used
UCXなど他のネットワークライブラリーも使える
✓
✓
Specialized to Apache Arrow format
Apache Arrowフォーマットに特化
Serialize/deserialize cost ≒ 0
シリアライズ・デシリアライズコストがほぼ0
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Red Arrow Flight SQL
require "arrow-flight-sql"
location = "grpc://server:2929"
client = ArrowFlight::Client.new(location)
sql_client = ArrowFlightSQL::Client.new(client)
info = sql_client.execute("SELECT * FROM logs")
info.endpoints.each do |endpoint|
reader = sql_client.do_get(endpoint.ticket)
reader.read_all
end
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Which SQL DBs support
Apache Arrow Flight SQL?
SQL DB Support?
MySQL No
PostgreSQL No
BigQuery No
Trino No
Dremio Yes
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Why don't most SQL DBs support it?
どうしてほとんどのSQL DBはサポートしていないの?
Flight SQL is a new protocol
Apache Arrow Flight SQLは新しいプロトコルだから
The first release: 2022-02(最初のリリース)
✓
Still experimental(まだ実験的扱い)
✓
✓
Tradition SQL DBs may not support
MySQL・PostgreSQLとか昔からあるSQL DBはサポートしないかも
✓
New SQL DBs will support because...
新しいSQL DBはサポートするはず。なぜなら…
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Compatibility is important
互換性が重要だから
New SQL DBs often use major protocols
新しいSQL DBは既存のメジャーなプロトコルを使うことが多い
To reuse existing client libraries
ユーザーは既存のクライアントライブラリーで新しいSQL DBを使える
✓
✓
For example:
MySQL protocol: TiDB, ...
✓
PostgreSQL protocol: Cloud Spanner, CockroachDB, ...
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Future
将来
Flight SQL client libraries will be increased
Flight SQLのクライアントライブラリーが充実するだろう
✓
New SQL DBs will support Flight SQL
新しいSQL DBはFlight SQLをサポートするだろう
To reuse existing client libraries
既存のクライアントライブラリーを再利用するため
✓
✓
BI tools will support Flight SQL by default
BIツールはデフォルトでFlight SQLをサポートするだろう
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
What should we do next?
私たちは次はなにをするべき?
Negative spiral
Positive spiral
Larger community More users
More developers More useful tools
Small community Few users
Few developers Few useful tools
Implement an Active Record
adapter for Flight SQL
Flight SQL用のActive Recordアダプターを
実装するといいんじゃないかな
For easy to use from Ruby on Rails apps
Ruby on Railsアプリから使いやすくなるはず
✓
✓
Join Red Data Tools!
Red Data Toolsで開発しようぜ!
https://red-data-tools.github.io/
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
But I'm using MySQL/PostgreSQL...
でも、MySQL/PostgreSQLを使っているし。。。
ADBC: Apache Arrow Database Connectivity
https://voltrondata.com/news/simplifying-database-connectivity-with-arrow-
flight-sql-and-adbc/
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
ADBC
Generic fast
SQL DB client API
任意のSQL DBに接続できる高速なAPI
We can use Flight SQL
through ADBC
ADBC経由でFlight SQLも使える
✓
✓
Flight SQL is the most fast driver
Flight SQLが最速のドライバー
But the same API for all SQL DBs is useful
like Active Record for Rubyists
なんだけど、すべてのSQL DBに同じAPIでアクセスできるのは便利
Active Recordも便利でしょ?
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
ADBC and Ruby
Implementing Ruby bindings
Rubyバインディングを実装中
✓
Join Red Data Tools
to implement an Active Record adapter
for ADBC!
Red Data ToolsでADBC用のActive Recordアダプターを開発しようぜ!
https://red-data-tools.github.io/
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Red ADBC
require "adbc"
options = {
driver: "adbc_driver_sqlite",
filename: ":memory:",
}
ADBC::Database.open(**options) do |database|
database.connect do |connection|
puts(connection.query("SELECT 1"))
end
end
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Wrap up: External process case
まとめ:別プロセスの場合
Large response is slow
レスポンスが大きいときに遅い
Bottle neck is serialize/deserialize
ボトルネックはシリアライズ・デシリアライズ
✓
✓
Apache Arrow Flight SQL/ADBC
そこでApache Arrow Flight SQL/ADBCですよ!
✓
Let's implement AR adapters for them
Active Recordのアダプターを実装しようぜ!
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
In-process
同一プロセス Ruby Library
Fast data processing
Pass data
Return processed data
Ruby Library
Not popular case
in current Ruby usage
今のRubyの使われ方だとあまりないケース
Use case: process data
on local/remote storage
IRB, batch process for Web app
ユースケース:ローカル・リモートストレージにあるデータの処理
✓
✓
Need a fast data processing library
implemented in other fast language
高速にデータ処理できる他の速い言語で実装されたライブラリーが必要
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Fast language?
速い言語?
C/C++
✓
Rust
✓
Julia
✓
...
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
A C++ case: Apache Arrow
Apache Arrow has a computation module
Apache Arrowは計算モジュールも提供している
✓
Ruby bindings: Red Arrow/red-arrow
RubyバインディングはRed Arrow/red-arrow
✓
Data frame based on Red Arrow
Red Arrowベースのデータフレームもある
RedAmber/red_amber by @heronshoes
Red Data Toolsメンバーでもある鈴木さんが開発
https://mybinder.org/v2/gh/RubyData/docker-stacks/master?
filepath=red-amber.ipynb
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Red Arrow
require "datasets-arrow"
codes = Datasets::PostalCodeJapan.new.to_arrow
require "arrow"
sliced_codes = codes.slice do |slicer|
slicer.prefecture == "東京都"
end
puts(sliced_codes)
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Red Amber
require "datasets-arrow"
codes = Datasets::PostalCodeJapan.new.to_arrow
require "red_amber"
data_frame = RedAmber::DataFrame.new(codes)
prefecture = data_frame[:prefecture]
puts(data_frame[prefecture == "東京都"])
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
A C++ case: DuckDB
Similar to SQLite
but for data analytics
データ分析向けのSQLiteみたいなやつ
Fast aggregation/filter/sort
高速な集計・フィルター・ソート
✓
✓
Ruby bindings: ruby-duckdb by @suketa
Rubyコミッターでもある助田さんがRubyバインディングを開発
We can impl. fast data processing with DuckDB
DuckDBを使って高速データ処理を実現できる!
✓
If we can interchange data w/ DuckDB fast...
DuckDBと高速にデータ交換できればね…
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Is fast data interchange important?
高速データ交換は重要なの?
Ruby DuckDB
Fast data processing
Load data (!)
Read result (!)
Ruby DuckDB
If (!) are slow, total data processing is also slow
(!)が遅いと全体のデータ処理も遅くなる
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Fast data interchange
高速なデータ交換
Ruby DuckDB
Fast data processing
Pass Apache Arrow data directly
Read result with C data interface
Ruby DuckDB
Use data as-is: zero-copy
データをそのまま使う:ゼロコピー
✓
Apache Arrow C data/stream interface
C ABI for fast data interchange
高速にデータ交換するためのC ABI
✓
FYI: C ABI in Ruby: MemoryView
参考:RubyもMmeoryViewというC ABIを提供している
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
C data interface
struct ArrowArray {
// Array data description
int64_t length;
int64_t null_count;
int64_t offset;
int64_t n_buffers;
int64_t n_children;
const void** buffers;
struct ArrowArray** children;
struct ArrowArray* dictionary;
// Release callback
void (*release)(struct ArrowArray*);
// Opaque producer-specific data
void* private_data;
};
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
DuckDB with Apache Arrow
require "datasets-arrow"
codes = Datasets::PostalCodeJapan.new.to_arrow
require "arrow-duckdb"
db = DuckDB::Database.open
c = db.connect
c.register("codes", codes) do # Use Apache Arrow data as-is
c.query("SELECT * FROM codes WHERE prefecture = ?",
"東京都", # Tokyo
output: :arrow) # Output as Apache Arrow data
.to_table # C data interface
end
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
C data interface on Web
Web上でもC data interface
Some unofficial WebAssembly ports exist
非公式ながらいくつかWebAssembly対応のApache Arrowライブラリーがある
Rust based, Go based, ...
✓
✓
WASM Ruby + C data I/F is useful?
WebAssembly版のRubyとC data interfaceでなんかできるかも?
FYI: DuckDB supports WebAssembly too
参考:DuckDBもWebAssemblyをサポートしている
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
A Rust case: Arrow DataFusion
SQL query engine
Internal memory layout is Apache Arrow
DuckDBと似ているが内部のメモリーレイアウトはApache Arrow
✓
✓
Direct Ruby bindings:(直接のバインディング)
arrow-datafusion by @jychen7 with Magnus
✓
✓
Ruby bindings via C API:(C API経由)
datafusion-c with cargo-c
✓
Red DataFusion with datafusion-c
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Ruby bindings via C API
C API経由のRubyバインディング
Negative spiral
Positive spiral
Apache Arrow
Larger community More users
More developers More useful tools
Small community Few users
Few developers Few useful tools
To develop with
devs from other langs
他言語の開発者と一緒に開発するため
C API is useful for other languages too
C APIはJavaやGoなど他の言語のバインディング開発でも有用
✓
✓
C API provides a normal C library
C APIは普通のCライブラリーを提供する
Headers: Generated by cbindgen automatically
✓
Shared libraries: Built with cargo-c
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Red DataFusion
require "datasets-arrow"
codes = Datasets::PostalCodeJapan.new.to_arrow
require "datafusion"
context = DataFusion::SessionContext.new
context.register("codes", codes) # C data interface
data_frame = context.sql(<<-SQL)
SELECT * FROM codes WHERE prefecture = '東京都'
SQL
puts(data_frame.to_table) # C data interface
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Remote data
リモートデータ
Recent modules can read remote data
最近のモジュールはリモートデータを読み込める
✓
At least these modules support it:
少なくとも次のモジュールはできる
Apache Arrow
✓
DuckDB
✓
Apache Arrow DataFusion
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Remote data example: DuckDB
SELECT COUNT(*)
FROM parquet_scan('s3://ookla-open-data/parquet/performance/*/*/*/*.parquet',
HIVE_PARTITIONING=1);
-- ┌───────┐
-- │ count_star() │
-- ├───────┤
-- │ 144567188 │
-- └───────┘
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Wrap up: In-process case
まとめ:同一プロセスの場合
Ruby Library
Fast data processing
Pass data
Return processed data
Ruby Library
To process data
on local/remote storage
ローカル・リモートストレージにあるデータの処理
✓
Low-level APIs (bindings)
are preparing
低レベルのAPIは整備できてきた
✓
Let's implement high-level API!
高レベルのAPIを実装しようぜ!
e.g.: RedAmber, Active Record adapters, ...
https://red-data-tools.github.io/
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Acknowledgments
謝辞
Voltron Data
Supports my Apache Arrow related work
私のApache Arrow関係の作業にお金を払ってくれている
✓
You can join Voltron Data or ClearCode
to work on Apache Arrow as a job
Voltron Dataさんかクリアコードに転職すると
仕事でApache Arrowを開発できるよ!
https://www.clear-code.com/recruitment/
✓
✓
Red Data Tools members
They develop data processing tools for Ruby!
Ruby用のデータ処理ツールを開発しているよ!
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Wrap up
まとめ
Can use Ruby for fast data processing
with Apache Arrow in some cases
Apache Arrowを使えば高速なデータ処理にRubyを使える…こともある
External process: Fast data interchange
別プロセスの場合:高速データ交換機能を使う
✓
In-process: Fast data processing/interchange
同一プロセスの場合:高速データ処理・交換機能を使う
✓
✓
But we still have missing pieces
でも、まだ足りないところがある
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Goal of this talk
このトークのゴール
Negative spiral
Positive spiral
Apache Arrow
Larger community More users
More developers More useful tools
Small community Few users
Few developers Few useful tools
You want to use Ruby
for some data processings
いくつかのデータ処理でRubyを使いたくなる
Especially, you want to implement a BI tool
特にBIツールを作りたくなる
✓
✓
You join Red Data Tools project
Red Data Toolsプロジェクトに参加する
It provides data processing tools for Ruby
Ruby用のデータ処理ツールを提供するプロジェクト
https://red-data-tools.github.io/
✓
✓

More Related Content

More from Kouhei Sutou

Better CSV processing with Ruby 2.6
Better CSV processing with Ruby 2.6Better CSV processing with Ruby 2.6
Better CSV processing with Ruby 2.6Kouhei Sutou
 
Apache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォームApache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォームKouhei Sutou
 
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システムMySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システムKouhei Sutou
 
MySQL 8.0でMroonga
MySQL 8.0でMroongaMySQL 8.0でMroonga
MySQL 8.0でMroongaKouhei Sutou
 
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!Kouhei Sutou
 
MariaDBとMroongaで作る全言語対応超高速全文検索システム
MariaDBとMroongaで作る全言語対応超高速全文検索システムMariaDBとMroongaで作る全言語対応超高速全文検索システム
MariaDBとMroongaで作る全言語対応超高速全文検索システムKouhei Sutou
 
PGroonga 2 – Make PostgreSQL rich full text search system backend!
PGroonga 2 – Make PostgreSQL rich full text search system backend!PGroonga 2 – Make PostgreSQL rich full text search system backend!
PGroonga 2 – Make PostgreSQL rich full text search system backend!Kouhei Sutou
 
PGroonga 2 - PostgreSQLでの全文検索の決定版
PGroonga 2 - PostgreSQLでの全文検索の決定版PGroonga 2 - PostgreSQLでの全文検索の決定版
PGroonga 2 - PostgreSQLでの全文検索の決定版Kouhei Sutou
 
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システムPostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システムKouhei Sutou
 
Improve extension API: C++ as better language for extension
Improve extension API: C++ as better language for extensionImprove extension API: C++ as better language for extension
Improve extension API: C++ as better language for extensionKouhei Sutou
 
MySQL・PostgreSQLだけで作る高速でリッチな全文検索システム
MySQL・PostgreSQLだけで作る高速でリッチな全文検索システムMySQL・PostgreSQLだけで作る高速でリッチな全文検索システム
MySQL・PostgreSQLだけで作る高速でリッチな全文検索システムKouhei Sutou
 
全文検索でRedmineをさらに活用!
全文検索でRedmineをさらに活用!全文検索でRedmineをさらに活用!
全文検索でRedmineをさらに活用!Kouhei Sutou
 
MySQL・PostgreSQL上で動かす全文検索エンジン「Groonga」セミナー
MySQL・PostgreSQL上で動かす全文検索エンジン「Groonga」セミナーMySQL・PostgreSQL上で動かす全文検索エンジン「Groonga」セミナー
MySQL・PostgreSQL上で動かす全文検索エンジン「Groonga」セミナーKouhei Sutou
 
PHPでPostgreSQLとPGroongaを使って高速日本語全文検索!
 PHPでPostgreSQLとPGroongaを使って高速日本語全文検索! PHPでPostgreSQLとPGroongaを使って高速日本語全文検索!
PHPでPostgreSQLとPGroongaを使って高速日本語全文検索!Kouhei Sutou
 
MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?
MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?
MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?Kouhei Sutou
 

More from Kouhei Sutou (20)

Better CSV processing with Ruby 2.6
Better CSV processing with Ruby 2.6Better CSV processing with Ruby 2.6
Better CSV processing with Ruby 2.6
 
Apache Arrow
Apache ArrowApache Arrow
Apache Arrow
 
Apache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォームApache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォーム
 
Apache Arrow
Apache ArrowApache Arrow
Apache Arrow
 
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システムMySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
 
MySQL 8.0でMroonga
MySQL 8.0でMroongaMySQL 8.0でMroonga
MySQL 8.0でMroonga
 
My way with Ruby
My way with RubyMy way with Ruby
My way with Ruby
 
Red Data Tools
Red Data ToolsRed Data Tools
Red Data Tools
 
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
 
MariaDBとMroongaで作る全言語対応超高速全文検索システム
MariaDBとMroongaで作る全言語対応超高速全文検索システムMariaDBとMroongaで作る全言語対応超高速全文検索システム
MariaDBとMroongaで作る全言語対応超高速全文検索システム
 
PGroonga 2 – Make PostgreSQL rich full text search system backend!
PGroonga 2 – Make PostgreSQL rich full text search system backend!PGroonga 2 – Make PostgreSQL rich full text search system backend!
PGroonga 2 – Make PostgreSQL rich full text search system backend!
 
PGroonga 2 - PostgreSQLでの全文検索の決定版
PGroonga 2 - PostgreSQLでの全文検索の決定版PGroonga 2 - PostgreSQLでの全文検索の決定版
PGroonga 2 - PostgreSQLでの全文検索の決定版
 
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システムPostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
 
Improve extension API: C++ as better language for extension
Improve extension API: C++ as better language for extensionImprove extension API: C++ as better language for extension
Improve extension API: C++ as better language for extension
 
PGroonga & Zulip
PGroonga & ZulipPGroonga & Zulip
PGroonga & Zulip
 
MySQL・PostgreSQLだけで作る高速でリッチな全文検索システム
MySQL・PostgreSQLだけで作る高速でリッチな全文検索システムMySQL・PostgreSQLだけで作る高速でリッチな全文検索システム
MySQL・PostgreSQLだけで作る高速でリッチな全文検索システム
 
全文検索でRedmineをさらに活用!
全文検索でRedmineをさらに活用!全文検索でRedmineをさらに活用!
全文検索でRedmineをさらに活用!
 
MySQL・PostgreSQL上で動かす全文検索エンジン「Groonga」セミナー
MySQL・PostgreSQL上で動かす全文検索エンジン「Groonga」セミナーMySQL・PostgreSQL上で動かす全文検索エンジン「Groonga」セミナー
MySQL・PostgreSQL上で動かす全文検索エンジン「Groonga」セミナー
 
PHPでPostgreSQLとPGroongaを使って高速日本語全文検索!
 PHPでPostgreSQLとPGroongaを使って高速日本語全文検索! PHPでPostgreSQLとPGroongaを使って高速日本語全文検索!
PHPでPostgreSQLとPGroongaを使って高速日本語全文検索!
 
MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?
MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?
MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?
 

Recently uploaded

Cyclistic Memberships Data Analysis Project
Cyclistic Memberships Data Analysis ProjectCyclistic Memberships Data Analysis Project
Cyclistic Memberships Data Analysis Projectdanielbell861
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
Create Data Model & Conduct Visualisation in Power BI Desktop
Create Data Model & Conduct Visualisation in Power BI DesktopCreate Data Model & Conduct Visualisation in Power BI Desktop
Create Data Model & Conduct Visualisation in Power BI DesktopThinkInnovation
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 

Recently uploaded (13)

Cyclistic Memberships Data Analysis Project
Cyclistic Memberships Data Analysis ProjectCyclistic Memberships Data Analysis Project
Cyclistic Memberships Data Analysis Project
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
Create Data Model & Conduct Visualisation in Power BI Desktop
Create Data Model & Conduct Visualisation in Power BI DesktopCreate Data Model & Conduct Visualisation in Power BI Desktop
Create Data Model & Conduct Visualisation in Power BI Desktop
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 

RubyKaigi 2022 - Fast data processing with Ruby and Apache Arrow

  • 1. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Fast data processing with Ruby and Apache Arrow Sutou Kouhei ClearCode Inc. RubyKaigi 2022 2022-09-10
  • 2. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Sutou Kouhei A president Ruby committer The president of ClearCode Inc. クリアコードの社長
  • 3. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Sutou Kouhei The Apache Arrow PMC chair PMC: Project Management Committee Apache Arrowのプロジェクト管理委員会のリーダー ✓ #2 commits(コミット数2位) ✓
  • 4. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Sutou Kouhei The pioneer in Ruby and Arrow A Ruby committer Maintain some standard libraries/default gems 標準ライブラリーとかデフォルトgemのメンテナンスをしている ✓ ✓ The author of Red Arrow ✓ Red Arrow: The official Apache Arrow library for Ruby 公式のRuby用のApache Arrowライブラリー ✓ ✓
  • 5. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Why do I work on Red Arrow? なぜRed Arrowの開発をしているか To use Ruby for data processing too! データ処理でもRubyを使いたい! At least a part of data processing データ処理の全部と言わず一部だけでも ✓ ✓ Data processing is an important task データ処理は最近の重要なタスクの1つ # of Rubyists will be increased by this データ処理にRubyを使えるようになるとRubyistが増えるはず ✓ ✓
  • 6. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Current situation Negative spiral 今は負のスパイラル Few users Small community Few developers Few useful tools How to break the negative spiral? どうやってこの負のスパイラルを打開する?
  • 7. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Expand useful tools with few developers 少人数で便利なツールを増やせればいいんじゃない? Negative spiral Positive spiral Larger community More users More developers More useful tools Small community Few users Few developers Few useful tools
  • 8. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 But how? でもどうやって? Apache Arrow
  • 9. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Apache Arrow Cross-language dev platform for data 複数言語対応のデータ用の開発プラットフォーム Ruby community doesn't need to dev everything Rubyコミュニティーがすべてを開発しなくてもよい ✓ We can share common implementations 共通の実装を言語を超えて共有できる ✓ ✓ Today's highlighted features 今日注目する機能 Fast data processing(高速データ処理) ✓ Fast data interchange(高速データ交換) ✓ ✓
  • 10. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 My approach 私のアプローチ Negative spiral Positive spiral Apache Arrow Larger community More users More developers More useful tools Small community Few users Few developers Few useful tools
  • 11. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Goal of this talk このトークのゴール Negative spiral Positive spiral Apache Arrow Larger community More users More developers More useful tools Small community Few users Few developers Few useful tools You want to use Ruby for some data processings いくつかのデータ処理でRubyを使いたくなる Especially, you want to implement a BI tool 特にBIツールを作りたくなる ✓ ✓ You join Red Data Tools project Red Data Toolsプロジェクトに参加する It provides data processing tools for Ruby Ruby用のデータ処理ツールを提供するプロジェクト https://red-data-tools.github.io/ ✓ ✓
  • 12. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Fast data processing 高速データ処理 Ruby is slow to process data Rubyでデータを処理すると遅い ✓ Resolve in external process:(別プロセスで解決) Use case: Web app, batch process for Web app Use fast data processing module (e.g.: DB) DBとか速いデータ処理モジュールを使う ✓ ✓ Resolve in the same process:(プロセス内で解決) Use cases: IRB, batch process for Web app Implement core features in other fast lang 他の速い言語でコアの機能を実装 ✓ ✓
  • 13. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 External process 別プロセス Ruby External process Fast data processing Request Response Ruby External process Popular case in current Ruby usage 今のRubyの使われ方だとよくあるケース ✓ Small response: No problem レスポンスが小さい場合は問題ない ✓ Large response: レスポンスが大きい場合: Sending/receiving response are slow レスポンスの送信・受信処理が遅い ✓ ✓
  • 14. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Sending/receiving response レスポンスの送受信 Ruby External process Serialize Deserialize Send Ruby External process Serialize/deserialize are slow シリアライズ・デシリアライズが遅い ✓ How to speed them up? どうやって高速化すればよいか Apache Arrow format ✓ Serialize/deserialize cost ≒ 0 シリアライズ・デシリアライズコストがほぼ0 ✓ ✓
  • 15. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Why Apache Arrow format is fast Apache Arrowフォーマットはなぜ速いのか Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 Apache Arrowフォーマットは なぜ速いのか 須藤功平 株式会社クリアコード db tech showcase ONLINE 2020 2020-12-08 https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/
  • 16. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Apache Arrow Flight SQL Ruby SQL DB Fast data processing Request (SQL) Response (Apache Arrow data) Ruby SQL DB gRPC based protocol gRPCベースのプロトコル NOTE: Other network libraries such as UCX can be used UCXなど他のネットワークライブラリーも使える ✓ ✓ Specialized to Apache Arrow format Apache Arrowフォーマットに特化 Serialize/deserialize cost ≒ 0 シリアライズ・デシリアライズコストがほぼ0 ✓ ✓
  • 17. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Red Arrow Flight SQL require "arrow-flight-sql" location = "grpc://server:2929" client = ArrowFlight::Client.new(location) sql_client = ArrowFlightSQL::Client.new(client) info = sql_client.execute("SELECT * FROM logs") info.endpoints.each do |endpoint| reader = sql_client.do_get(endpoint.ticket) reader.read_all end
  • 18. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Which SQL DBs support Apache Arrow Flight SQL? SQL DB Support? MySQL No PostgreSQL No BigQuery No Trino No Dremio Yes
  • 19. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Why don't most SQL DBs support it? どうしてほとんどのSQL DBはサポートしていないの? Flight SQL is a new protocol Apache Arrow Flight SQLは新しいプロトコルだから The first release: 2022-02(最初のリリース) ✓ Still experimental(まだ実験的扱い) ✓ ✓ Tradition SQL DBs may not support MySQL・PostgreSQLとか昔からあるSQL DBはサポートしないかも ✓ New SQL DBs will support because... 新しいSQL DBはサポートするはず。なぜなら… ✓
  • 20. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Compatibility is important 互換性が重要だから New SQL DBs often use major protocols 新しいSQL DBは既存のメジャーなプロトコルを使うことが多い To reuse existing client libraries ユーザーは既存のクライアントライブラリーで新しいSQL DBを使える ✓ ✓ For example: MySQL protocol: TiDB, ... ✓ PostgreSQL protocol: Cloud Spanner, CockroachDB, ... ✓ ✓
  • 21. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Future 将来 Flight SQL client libraries will be increased Flight SQLのクライアントライブラリーが充実するだろう ✓ New SQL DBs will support Flight SQL 新しいSQL DBはFlight SQLをサポートするだろう To reuse existing client libraries 既存のクライアントライブラリーを再利用するため ✓ ✓ BI tools will support Flight SQL by default BIツールはデフォルトでFlight SQLをサポートするだろう ✓
  • 22. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 What should we do next? 私たちは次はなにをするべき? Negative spiral Positive spiral Larger community More users More developers More useful tools Small community Few users Few developers Few useful tools Implement an Active Record adapter for Flight SQL Flight SQL用のActive Recordアダプターを 実装するといいんじゃないかな For easy to use from Ruby on Rails apps Ruby on Railsアプリから使いやすくなるはず ✓ ✓ Join Red Data Tools! Red Data Toolsで開発しようぜ! https://red-data-tools.github.io/ ✓
  • 23. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 But I'm using MySQL/PostgreSQL... でも、MySQL/PostgreSQLを使っているし。。。 ADBC: Apache Arrow Database Connectivity https://voltrondata.com/news/simplifying-database-connectivity-with-arrow- flight-sql-and-adbc/
  • 24. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 ADBC Generic fast SQL DB client API 任意のSQL DBに接続できる高速なAPI We can use Flight SQL through ADBC ADBC経由でFlight SQLも使える ✓ ✓ Flight SQL is the most fast driver Flight SQLが最速のドライバー But the same API for all SQL DBs is useful like Active Record for Rubyists なんだけど、すべてのSQL DBに同じAPIでアクセスできるのは便利 Active Recordも便利でしょ? ✓ ✓
  • 25. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 ADBC and Ruby Implementing Ruby bindings Rubyバインディングを実装中 ✓ Join Red Data Tools to implement an Active Record adapter for ADBC! Red Data ToolsでADBC用のActive Recordアダプターを開発しようぜ! https://red-data-tools.github.io/ ✓
  • 26. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Red ADBC require "adbc" options = { driver: "adbc_driver_sqlite", filename: ":memory:", } ADBC::Database.open(**options) do |database| database.connect do |connection| puts(connection.query("SELECT 1")) end end
  • 27. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Wrap up: External process case まとめ:別プロセスの場合 Large response is slow レスポンスが大きいときに遅い Bottle neck is serialize/deserialize ボトルネックはシリアライズ・デシリアライズ ✓ ✓ Apache Arrow Flight SQL/ADBC そこでApache Arrow Flight SQL/ADBCですよ! ✓ Let's implement AR adapters for them Active Recordのアダプターを実装しようぜ! ✓
  • 28. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 In-process 同一プロセス Ruby Library Fast data processing Pass data Return processed data Ruby Library Not popular case in current Ruby usage 今のRubyの使われ方だとあまりないケース Use case: process data on local/remote storage IRB, batch process for Web app ユースケース:ローカル・リモートストレージにあるデータの処理 ✓ ✓ Need a fast data processing library implemented in other fast language 高速にデータ処理できる他の速い言語で実装されたライブラリーが必要 ✓
  • 29. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Fast language? 速い言語? C/C++ ✓ Rust ✓ Julia ✓ ... ✓
  • 30. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 A C++ case: Apache Arrow Apache Arrow has a computation module Apache Arrowは計算モジュールも提供している ✓ Ruby bindings: Red Arrow/red-arrow RubyバインディングはRed Arrow/red-arrow ✓ Data frame based on Red Arrow Red Arrowベースのデータフレームもある RedAmber/red_amber by @heronshoes Red Data Toolsメンバーでもある鈴木さんが開発 https://mybinder.org/v2/gh/RubyData/docker-stacks/master? filepath=red-amber.ipynb ✓ ✓
  • 31. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Red Arrow require "datasets-arrow" codes = Datasets::PostalCodeJapan.new.to_arrow require "arrow" sliced_codes = codes.slice do |slicer| slicer.prefecture == "東京都" end puts(sliced_codes)
  • 32. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Red Amber require "datasets-arrow" codes = Datasets::PostalCodeJapan.new.to_arrow require "red_amber" data_frame = RedAmber::DataFrame.new(codes) prefecture = data_frame[:prefecture] puts(data_frame[prefecture == "東京都"])
  • 33. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 A C++ case: DuckDB Similar to SQLite but for data analytics データ分析向けのSQLiteみたいなやつ Fast aggregation/filter/sort 高速な集計・フィルター・ソート ✓ ✓ Ruby bindings: ruby-duckdb by @suketa Rubyコミッターでもある助田さんがRubyバインディングを開発 We can impl. fast data processing with DuckDB DuckDBを使って高速データ処理を実現できる! ✓ If we can interchange data w/ DuckDB fast... DuckDBと高速にデータ交換できればね… ✓ ✓
  • 34. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Is fast data interchange important? 高速データ交換は重要なの? Ruby DuckDB Fast data processing Load data (!) Read result (!) Ruby DuckDB If (!) are slow, total data processing is also slow (!)が遅いと全体のデータ処理も遅くなる
  • 35. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Fast data interchange 高速なデータ交換 Ruby DuckDB Fast data processing Pass Apache Arrow data directly Read result with C data interface Ruby DuckDB Use data as-is: zero-copy データをそのまま使う:ゼロコピー ✓ Apache Arrow C data/stream interface C ABI for fast data interchange 高速にデータ交換するためのC ABI ✓ FYI: C ABI in Ruby: MemoryView 参考:RubyもMmeoryViewというC ABIを提供している ✓ ✓
  • 36. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 C data interface struct ArrowArray { // Array data description int64_t length; int64_t null_count; int64_t offset; int64_t n_buffers; int64_t n_children; const void** buffers; struct ArrowArray** children; struct ArrowArray* dictionary; // Release callback void (*release)(struct ArrowArray*); // Opaque producer-specific data void* private_data; };
  • 37. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 DuckDB with Apache Arrow require "datasets-arrow" codes = Datasets::PostalCodeJapan.new.to_arrow require "arrow-duckdb" db = DuckDB::Database.open c = db.connect c.register("codes", codes) do # Use Apache Arrow data as-is c.query("SELECT * FROM codes WHERE prefecture = ?", "東京都", # Tokyo output: :arrow) # Output as Apache Arrow data .to_table # C data interface end
  • 38. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 C data interface on Web Web上でもC data interface Some unofficial WebAssembly ports exist 非公式ながらいくつかWebAssembly対応のApache Arrowライブラリーがある Rust based, Go based, ... ✓ ✓ WASM Ruby + C data I/F is useful? WebAssembly版のRubyとC data interfaceでなんかできるかも? FYI: DuckDB supports WebAssembly too 参考:DuckDBもWebAssemblyをサポートしている ✓ ✓
  • 39. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 A Rust case: Arrow DataFusion SQL query engine Internal memory layout is Apache Arrow DuckDBと似ているが内部のメモリーレイアウトはApache Arrow ✓ ✓ Direct Ruby bindings:(直接のバインディング) arrow-datafusion by @jychen7 with Magnus ✓ ✓ Ruby bindings via C API:(C API経由) datafusion-c with cargo-c ✓ Red DataFusion with datafusion-c ✓ ✓
  • 40. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Ruby bindings via C API C API経由のRubyバインディング Negative spiral Positive spiral Apache Arrow Larger community More users More developers More useful tools Small community Few users Few developers Few useful tools To develop with devs from other langs 他言語の開発者と一緒に開発するため C API is useful for other languages too C APIはJavaやGoなど他の言語のバインディング開発でも有用 ✓ ✓ C API provides a normal C library C APIは普通のCライブラリーを提供する Headers: Generated by cbindgen automatically ✓ Shared libraries: Built with cargo-c ✓ ✓
  • 41. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Red DataFusion require "datasets-arrow" codes = Datasets::PostalCodeJapan.new.to_arrow require "datafusion" context = DataFusion::SessionContext.new context.register("codes", codes) # C data interface data_frame = context.sql(<<-SQL) SELECT * FROM codes WHERE prefecture = '東京都' SQL puts(data_frame.to_table) # C data interface
  • 42. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Remote data リモートデータ Recent modules can read remote data 最近のモジュールはリモートデータを読み込める ✓ At least these modules support it: 少なくとも次のモジュールはできる Apache Arrow ✓ DuckDB ✓ Apache Arrow DataFusion ✓ ✓
  • 43. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Remote data example: DuckDB SELECT COUNT(*) FROM parquet_scan('s3://ookla-open-data/parquet/performance/*/*/*/*.parquet', HIVE_PARTITIONING=1); -- ┌───────┐ -- │ count_star() │ -- ├───────┤ -- │ 144567188 │ -- └───────┘
  • 44. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Wrap up: In-process case まとめ:同一プロセスの場合 Ruby Library Fast data processing Pass data Return processed data Ruby Library To process data on local/remote storage ローカル・リモートストレージにあるデータの処理 ✓ Low-level APIs (bindings) are preparing 低レベルのAPIは整備できてきた ✓ Let's implement high-level API! 高レベルのAPIを実装しようぜ! e.g.: RedAmber, Active Record adapters, ... https://red-data-tools.github.io/ ✓ ✓
  • 45. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Acknowledgments 謝辞 Voltron Data Supports my Apache Arrow related work 私のApache Arrow関係の作業にお金を払ってくれている ✓ You can join Voltron Data or ClearCode to work on Apache Arrow as a job Voltron Dataさんかクリアコードに転職すると 仕事でApache Arrowを開発できるよ! https://www.clear-code.com/recruitment/ ✓ ✓ Red Data Tools members They develop data processing tools for Ruby! Ruby用のデータ処理ツールを開発しているよ! ✓ ✓
  • 46. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Wrap up まとめ Can use Ruby for fast data processing with Apache Arrow in some cases Apache Arrowを使えば高速なデータ処理にRubyを使える…こともある External process: Fast data interchange 別プロセスの場合:高速データ交換機能を使う ✓ In-process: Fast data processing/interchange 同一プロセスの場合:高速データ処理・交換機能を使う ✓ ✓ But we still have missing pieces でも、まだ足りないところがある ✓
  • 47. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Goal of this talk このトークのゴール Negative spiral Positive spiral Apache Arrow Larger community More users More developers More useful tools Small community Few users Few developers Few useful tools You want to use Ruby for some data processings いくつかのデータ処理でRubyを使いたくなる Especially, you want to implement a BI tool 特にBIツールを作りたくなる ✓ ✓ You join Red Data Tools project Red Data Toolsプロジェクトに参加する It provides data processing tools for Ruby Ruby用のデータ処理ツールを提供するプロジェクト https://red-data-tools.github.io/ ✓ ✓