Red Arrow - Ruby and Apache Arrow Powered by Rabbit 3.0.1
Red Arrow
Ruby and Apache Arrow
Sutou Kouhei
ClearCode Inc.
RubyKaigi Takeout 2021
2021-09-11
Sutou Kouhei
A president Rubyist
The president of ClearCode Inc.
Sutou Kouhei
An Apache Arrow contributor
✓ A member of the Apache Arrow PMC (Project Management Committee)
✓ #2 by number of commits
Sutou Kouhei
The pioneer in Ruby and Arrow
✓ The author of Red Arrow
✓ Red Arrow:
  ✓ The official Apache Arrow library for Ruby
  ✓ GObject Introspection based bindings
  ✓ Apache Arrow GLib is developed for Red Arrow
GObject Introspection?
A way to implement bindings
Ruby bindings 2016 - How to create bindings 2016 Powered by Rabbit 2.2.0
Ruby bindings 2016
How to create bindings 2016
Kouhei Sutou
ClearCode Inc.
RubyKaigi 2016
2016-09-09
https://rubykaigi.org/2016/presentations/ktou.html
Why do I work on Red Arrow?
✓ To use Ruby for data processing!
  ✓ At least for a part of data processing
✓ Results of my 5 years of work:
  ✓ We can use Ruby for some data processing!
Goal of this talk
✓ You want to use Ruby for some data processing
✓ You join the Red Data Tools project
Red Data Tools project?
Red Data Tools is a project that provides data processing tools for Ruby
https://red-data-tools.github.io/
Data processing?
... how?
0. Why do you want to process data?
✓ What problem do you want to solve?
✓ What data is needed for it?
✓ ...
No Red Arrow support in this area
1. Collect data
✓ Where is the data?
✓ Where is the collected data stored?
✓ ...
Some Red Arrow support in this area
Common datasets
require "datasets"
Datasets::Iris.new
Datasets::PostalCodeJapan.new
Datasets::Wikipedia.new
Red Datasets
https://github.com/red-data-tools/red-datasets
Output: Local file
require "datasets-arrow"
dataset = Datasets::PostalCodeJapan.new
dataset.to_arrow.save("codes.csv")
dataset.to_arrow.save("codes.arrow")
Red Datasets Arrow
https://github.com/red-data-tools/red-datasets-arrow
#save
✓ General serialize API for table data
  ✓ Serialize as the specified format
  ✓ If you use a Red Arrow object for in-memory table data, you can serialize it to many formats! Cool!
✓ Extensible!
#save: Implementation
module Arrow
  class Table
    def save(output)
      saver = TableSaver.new(self, output)
      saver.save
    end
  end
end
#save: Implementation
class Arrow::TableSaver
  def save
    # Dispatch to save_as_csv and friends based on the detected format
    format = detect_format(@output)
    __send__("save_as_#{format}")
  end

  def save_as_csv
    # ... (body omitted on the slide)
  end
end
#save: Extend by Red Parquet
module Parquet::ArrowTableSavable
  def save_as_parquet
    # ... (body omitted on the slide)
  end
  # Mixing this module in adds the new save_as_* method to the saver
  Arrow::TableSaver.include(self)
end
Red Parquet is a subproject of Red Arrow
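The dispatch-and-extend pattern on the slides above can be tried without Red Arrow installed. The following is a minimal sketch, not the Red Arrow implementation: `SketchSaver`, `SketchTsvSavable`, and the extension-based `detect_format` are hypothetical names and assumptions made for illustration, and each `save_as_*` method returns a string instead of writing a file.

```ruby
# Hypothetical stand-in for Arrow::TableSaver; not the Red Arrow implementation.
class SketchSaver
  def initialize(rows, output)
    @rows = rows
    @output = output
  end

  # Pick the save_as_* method that matches the detected format.
  def save
    format = detect_format(@output)
    handler = "save_as_#{format}"
    unless respond_to?(handler, true)
      raise ArgumentError, "unsupported format: #{format}"
    end
    __send__(handler)
  end

  private

  # Assumption: the format is simply the file extension.
  def detect_format(output)
    File.extname(output).delete_prefix(".")
  end

  def save_as_csv
    @rows.map { |row| row.join(",") }.join("\n")
  end
end

# An extension adds a format by mixing in one more save_as_* method.
module SketchTsvSavable
  def save_as_tsv
    @rows.map { |row| row.join("\t") }.join("\n")
  end
  SketchSaver.include(self)
end
```

With this shape, supporting a new format needs no change to `save` itself: loading the extension is enough, which is what `require "parquet"` appears to do for Red Arrow.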
#save: Extended
require "datasets-arrow"
require "parquet"
dataset = Datasets::PostalCodeJapan.new
dataset.to_arrow.save("codes.parquet")
Output: Online storage: Fluentd
✓ fluent-plugin-s3-arrow:
  ✓ Collect data by Fluentd
  ✓ Format data as Apache Parquet by Red Arrow
  ✓ Store data to Amazon S3 by fluent-plugin-s3
  ✓ Developed by @kanga33 (Kagawa-san) at Speee/Red Data Tools
https://github.com/red-data-tools/fluent-plugin-s3-arrow/
Output: Online storage: Red Arrow
require "datasets-arrow"
require "arrow-dataset"
dataset = Datasets::PostalCodeJapan.new
url = URI("s3://mybucket/codes.parquet")
dataset.to_arrow.save(url)
Implementing...
#save: Implementing...
class Arrow::TableSaver
  def save
    if @output.is_a?(URI)
      __send__("save_to_uri")
    else
      __send__("save_to_file")
    end
  end
end
Collect data w/ Red Arrow: Wrap up
✓ Usable as a serializer for common formats
✓ Usable as a writer to common locations in the near future...
2. Read data
✓ What format is used?
✓ Where is the collected data?
✓ How large is the collected data?
Format
require "arrow"
table = Arrow::Table.load("data.csv")
table = Arrow::Table.load("data.json")
table = Arrow::Table.load("data.arrow")
table = Arrow::Table.load("data.orc")
.load
✓ General deserialize API for table data
  ✓ Deserialize common formats
✓ Extensible!
.load: Implementation
module Arrow
  def Table.load(input)
    loader = TableLoader.new(self, input)
    loader.load
  end
end
.load: Implementation
class Arrow::TableLoader
  def load
    # Dispatch to load_as_csv and friends based on the detected format
    format = detect_format(@input)
    __send__("load_as_#{format}")
  end

  def load_as_csv
    # ... (body omitted on the slide)
  end
end
.load: Extend by Red Parquet
module Parquet::ArrowTableLoadable
  def load_as_parquet
    # ... (body omitted on the slide)
  end
  Arrow::TableLoader.include(self)
end
Red Parquet is a subproject of Red Arrow
.load: Extended
require "parquet"
table = Arrow::Table.load("data.parquet")
.load: More extensible
class Arrow::TableLoader
  def load
    if @input.is_a?(URI)
      __send__("load_from_uri")
    else
      __send__("load_from_file")
    end
  end
end
.load: Extend by Red Arrow Dataset
module ArrowDataset::ArrowTableLoadable
  def load_from_uri
    # ... (body omitted on the slide)
  end
  Arrow::TableLoader.include(self)
end
Red Arrow Dataset is a subproject of Red Arrow
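The location-based dispatch can be sketched the same way. This is an illustration under stated assumptions, not Red Arrow code: `SketchLoader` and `SketchUriLoadable` are hypothetical names, and the loaders only report what they would read instead of touching a file or S3.

```ruby
require "uri"

# Hypothetical stand-in for Arrow::TableLoader; not the Red Arrow implementation.
class SketchLoader
  def initialize(input)
    @input = input
  end

  # URI objects go to load_from_uri; anything else is treated as a local path.
  def load
    if @input.is_a?(URI)
      __send__("load_from_uri")
    else
      __send__("load_from_file")
    end
  end

  private

  def load_from_file
    [:file, @input]
  end
end

# A separate library adds remote support by mixing in load_from_uri.
module SketchUriLoadable
  def load_from_uri
    [:uri, @input.scheme, @input.host, @input.path]
  end
  SketchLoader.include(self)
end
```

Because `URI("s3://...")` returns an object that includes the `URI` module, `is_a?(URI)` is enough to tell a remote location from a plain path string.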
Location: Online storage
require "arrow-dataset"
url = URI("s3://bucket/path...")
table = Arrow::Table.load(url)
Location: RDBMS
require "arrow-activerecord"
User.all.to_arrow
Red Arrow Active Record
https://github.com/red-data-tools/red-arrow-activerecord
Location: Network
require "arrow-flight"
client = ArrowFlight::Client.new(url)
info = client.list_flights[0]
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all
Introducing Apache Arrow Flight: A Framework for Fast Data Transport
https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
Large data
✓ Apache Arrow format
  ✓ Designed for large data
✓ What is needed for large data
  ✓ Fast load
  ✓ ...
Fast load: Benchmark
require "csv"
require "datasets-arrow"
dataset = Datasets::PostalCodeJapan.new
table = dataset.to_arrow # 124271 records
n = 5
n.times do |i|
  table.save("codes.#{i}.csv")
  table.save("codes.#{i}.arrow")
  CSV.read("codes.#{i}.csv")            # Ruby's csv library
  Arrow::Table.load("codes.#{i}.csv")   # Red Arrow, CSV
  Arrow::Table.load("codes.#{i}.arrow") # Red Arrow, Apache Arrow format
  table = table.concatenate([table])    # Double the data for the next round
end
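The slide lists the calls being compared but not how they are timed. One way to time them is sketched below; `measure` and `time_loaders` are hypothetical helpers, not part of Red Arrow or of the original benchmark, and the workload that is actually run is a trivial stand-in so the sketch works without any gems.

```ruby
# Returns the elapsed wall-clock seconds for the given block.
def measure
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Runs each labeled loader once and returns {label => seconds}.
def time_loaders(loaders)
  loaders.transform_values { |loader| measure { loader.call } }
end

# With Red Arrow installed, the loaders from the slide could be passed like:
#   time_loaders(
#     "CSV.read"           => -> { CSV.read("codes.0.csv") },
#     "Arrow::Table (csv)" => -> { Arrow::Table.load("codes.0.csv") },
#     "Arrow::Table (arrow)" => -> { Arrow::Table.load("codes.0.arrow") },
#   )
# Stand-in workload so that this sketch runs on its own:
timings = time_loaders(
  "split lines" => -> { ("a,b,c\n" * 1_000).lines.map { |line| line.chomp.split(",") } },
)
```

A monotonic clock is used so that system clock adjustments during a run do not distort the measurement.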
[Figure] Fast load: Benchmark: All
[Figure] Fast load: Benchmark: Red Arrow
How to implement fast load
Apache Arrowフォーマットはなぜ速いのか (Why the Apache Arrow format is fast)
Sutou Kouhei
ClearCode Inc.
db tech showcase ONLINE 2020
2020-12-08
https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/
Read data with Red Arrow: Wrap up
✓ Easy to read common formats
✓ Easy to read from common locations
✓ Large data ready
3. Explore data
✓ Preprocess data
  ✓ Filter out needless data
  ✓ ...
✓ Summarize data and visualize it
✓ ...
Red Arrow can be used for some operations
Filter: Red Arrow
require "datasets-arrow"
table = Datasets::PostalCodeJapan.new.to_arrow
table.n_rows # 124271
filtered_table = table.slice do |slicer|
  slicer.prefecture == "東京都" # Tokyo
end
filtered_table.n_rows # 3887
Filter: Performance
require "datasets-arrow"
dataset = Datasets::PostalCodeJapan.new
arrow_dataset = dataset.to_arrow
# Row by row in Ruby
dataset.find_all do |row|
  row.prefecture == "東京都" # Tokyo
end # 1.256s
# Column-wise with Red Arrow
arrow_dataset.slice do |slicer|
  slicer.prefecture == "東京都" # Tokyo
end # 0.001s
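The two filters above differ in shape as well as in speed. The sketch below shows that difference in plain Ruby with a few made-up records: the row-oriented version tests one record at a time, while the column-oriented version compares a whole column to get a boolean mask and then applies the mask to every column. It only illustrates the idea behind `slice`. It does not reproduce Red Arrow's internals or its performance, which presumably come from native columnar code.

```ruby
# Row-oriented: one Hash per record, like iterating over dataset rows.
rows = [
  {prefecture: "Tokyo", code: "1000001"},
  {prefecture: "Osaka", code: "5300001"},
  {prefecture: "Tokyo", code: "1000002"},
]
row_result = rows.find_all { |row| row[:prefecture] == "Tokyo" }

# Column-oriented: one Array per column, like a table in Arrow format.
columns = {
  prefecture: rows.map { |row| row[:prefecture] },
  code: rows.map { |row| row[:code] },
}

# Step 1: compare the whole column at once to get a boolean mask.
mask = columns[:prefecture].map { |value| value == "Tokyo" }

# Step 2: apply the same mask to every column.
filtered = columns.transform_values do |values|
  values.each_index.select { |i| mask[i] }.map { |i| values[i] }
end
```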
[Figure] Filter: Performance
Apache Arrow data: Interchangeable
✓ With low cost thanks to fast load
✓ The number of systems that can handle Apache Arrow data is increasing
  ✓ e.g. DuckDB: in-process SQL OLAP DBMS (SQLite-like DBMS for OLAP)
    OLAP: OnLine Analytical Processing
Filter: DuckDB
require "arrow-duckdb"
codes = Datasets::PostalCodeJapan.new.to_arrow
db = DuckDB::Database.open
c = db.connect
c.register("codes", codes) do # Use codes without copy
  c.query("SELECT * FROM codes WHERE prefecture = ?",
          "東京都", # Tokyo
          output: :arrow) # Output as Apache Arrow data
    .to_table.n_rows # 3887
end
Summarize: Group + aggregation
iris = Datasets::Iris.new.to_arrow
iris.group(:label).count(:sepal_length)
#   count(sepal_length) label
# 0                  50 Iris-setosa
# 1                  50 Iris-versicolor
# 2                  50 Iris-virginica
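For readers unfamiliar with group + aggregation, the result above is roughly what the following plain-Ruby sketch computes. The records are a tiny illustrative sample, not the Iris dataset. The sketch also assumes that `count` ignores missing values, as SQL's `COUNT(column)` does. The slide does not state this.

```ruby
# A tiny illustrative sample (not the real dataset).
records = [
  {label: "Iris-setosa", sepal_length: 5.1},
  {label: "Iris-setosa", sepal_length: 4.9},
  {label: "Iris-versicolor", sepal_length: 7.0},
  {label: "Iris-virginica", sepal_length: 6.3},
  {label: "Iris-virginica", sepal_length: nil}, # missing value
]

# Group by label, then count the non-missing sepal_length values per group.
counts = records
  .group_by { |record| record[:label] }
  .transform_values { |group| group.count { |record| !record[:sepal_length].nil? } }
```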
Visualize: Charty
require "charty"
Charty.backends.use("pyplot")
# iris is the Arrow table from the previous slide
Charty.scatter_plot(data: iris,
                    x: :sepal_length,
                    y: :sepal_width,
                    color: :label)
  .save("iris.png")
[Figure] Visualize: Charty: Result
4. Use insight
✓ Write a report
✓ Build a model
✓ ...
No Red Arrow support in this area for now
Can be used for passing data to other tools, as with DuckDB and Charty
Data processing and Red Arrow
✓ Red Arrow helps us in some areas
  ✓ Collect, read and explore data
✓ Some tools can integrate with Red Arrow
  ✓ Fluentd, DuckDB, Charty, ...
Red Arrow and Ruby 3.0
✓ MemoryView support
✓ Ractor support
MemoryView
MemoryView provides the features to share multidimensional homogeneous arrays of fixed-size elements in memory among extension libraries.
https://docs.ruby-lang.org/en/master/doc/memory_view_md.html
https://tech.speee.jp/entry/2020/12/24/093131 (Japanese)
Numeric arrays in Red Arrow
✓ Arrow::NumericArray family
  ✓ 1-dimensional numeric array
✓ Arrow::Tensor
  ✓ Multidimensional homogeneous numeric arrays
MemoryView: Red Arrow
✓ Arrow::NumericArray family
  ✓ Export as MemoryView: Supported
  ✓ Import from MemoryView: Not yet
✓ Arrow::Tensor
  ✓ Export/Import: Not yet
Join Red Data Tools to work on this!
MemoryView: C++
✓ Some problems were found by the Red Arrow work
  ✓ private can't be used as a member name (it is a C++ keyword)
  ✓ A const variable can't be assigned to, even with a cast
✓ Ruby 3.1 should fix them
Ractor
Ractor is designed to provide a parallel execution feature of Ruby without thread-safety concerns.
https://docs.ruby-lang.org/en/master/doc/ractor_md.html
https://techlife.cookpad.com/entry/2020/12/26/131858 (Japanese)
Red Arrow and concurrency
✓ Red Arrow data is immutable
✓ Ractor can share frozen objects
Ractor: Red Arrow
require "datasets-arrow"
table = Datasets::PostalCodeJapan.new.to_arrow
Ractor.make_shareable(table)
Ractor.new(table) do |t|
  t.slice do |slicer|
    slicer.prefecture == "東京都" # Tokyo
  end
end
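The same sharing pattern can be tried with plain Ruby data. The sketch below uses a frozen array of hashes with made-up records in place of an Arrow table. `ractor_result` and `count_in_ractor` are hypothetical helpers. `ractor_result` exists only because newer Ruby versions provide `Ractor#value` for fetching a result, while Ruby 3.0 provides `Ractor#take`.

```ruby
# Deeply frozen data can be shared between Ractors without copying.
ROWS = Ractor.make_shareable([
  {prefecture: "Tokyo", code: "1000001"},
  {prefecture: "Osaka", code: "5300001"},
  {prefecture: "Tokyo", code: "1000002"},
])

# Fetch a Ractor's result with whichever method this Ruby version provides.
def ractor_result(ractor)
  ractor.respond_to?(:value) ? ractor.value : ractor.take
end

# Count matching rows inside a separate Ractor.
def count_in_ractor(rows, prefecture)
  ractor = Ractor.new(rows, prefecture) do |shared_rows, target|
    shared_rows.count { |row| row[:prefecture] == target }
  end
  ractor_result(ractor)
end
```

Calling `count_in_ractor(ROWS, "Tokyo")` runs the filter in a separate Ractor and returns the count to the caller. Ruby may print a warning that Ractor is experimental.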
Ractor: Red Arrow: Benchmark
n_ractors = 4
n_jobs_per_ractor = 1000
n_jobs = n_ractors * n_jobs_per_ractor
# Sequential: run all jobs in the main Ractor
n_jobs.times do
  table.slice {|s| s.prefecture == "東京都"}
end
# Parallel: split the same number of jobs across Ractors
n_ractors.times.collect do
  Ractor.new(table, n_jobs_per_ractor) do |t, n|
    n.times {t.slice {|s| s.prefecture == "東京都"}}
  end
end.each(&:take)
[Figure] Ractor: Red Arrow: Benchmark
Wrap up
✓ Ruby can be used in some data processing work
  ✓ Red Arrow helps you!
✓ Ruby 3.0 has useful features for data processing work
  ✓ Red Arrow has started supporting them
Goal of this talk
✓ You want to use Ruby for some data processing
✓ You join the Red Data Tools project
Future work
✓ Implement DataFusion bindings by adding a C API to DataFusion
  ✓ DataFusion: Apache Arrow native query execution framework written in Rust
    https://github.com/apache/arrow-datafusion/
✓ Add an Active Record like API to Red Arrow
✓ Improve MemoryView/Ractor support
Red Data Tools
Join us!
https://red-data-tools.github.io/
https://gitter.im/red-data-tools/en
https://red-data-tools.github.io/ja/
https://gitter.im/red-data-tools/ja
OSS Gate on-boarding
✓ Supports OSS projects such as Ruby & Red Arrow in accepting newcomers
✓ Contact me if you are interested!
  ✓ OSS project members who want to accept newcomers
  ✓ Companies which want to support OSS Gate on-boarding
https://oss-gate.github.io/on-boarding/
ClearCode Inc.
✓ Recruitment: Developer to work on Red Arrow related business
  ✓ https://www.clear-code.com/recruitment/
✓ Business: Apache Arrow/Red Arrow related technical support/consulting
  ✓ https://www.clear-code.com/contact/

More Related Content

What's hot

Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
 
Intro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the CloudIntro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the CloudDaniel Zivkovic
 
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)Amazon Web Services
 
R Jobs on the Cloud
R Jobs on the CloudR Jobs on the Cloud
R Jobs on the CloudJohn Doxaras
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormNati Shalom
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Sonal Raj
 
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...Chris Fregly
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Robert Evans
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormEugene Dvorkin
 
Scalable Machine Learning with PySpark
Scalable Machine Learning with PySparkScalable Machine Learning with PySpark
Scalable Machine Learning with PySparkLadle Patel
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Amazon Web Services
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
 
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsReal time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsData Con LA
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Amazon Web Services
 
In-Memory Evolution in Apache Spark
In-Memory Evolution in Apache SparkIn-Memory Evolution in Apache Spark
In-Memory Evolution in Apache SparkKazuaki Ishizaki
 

What's hot (20)

Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
Intro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the CloudIntro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the Cloud
 
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
 
R Jobs on the Cloud
R Jobs on the CloudR Jobs on the Cloud
R Jobs on the Cloud
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using Storm
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
Introduction to Storm
Introduction to StormIntroduction to Storm
Introduction to Storm
 
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
 
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Scalable Machine Learning with PySpark
Scalable Machine Learning with PySparkScalable Machine Learning with PySpark
Scalable Machine Learning with PySpark
 
Escape from Hadoop
Escape from HadoopEscape from Hadoop
Escape from Hadoop
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsReal time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
 
SparkTokyo2019
SparkTokyo2019SparkTokyo2019
SparkTokyo2019
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
 
In-Memory Evolution in Apache Spark
In-Memory Evolution in Apache SparkIn-Memory Evolution in Apache Spark
In-Memory Evolution in Apache Spark
 

Similar to RubyKaigi Takeout 2021 - Red Arrow - Ruby and Apache Arrow

RubyKaigi 2022 - Fast data processing with Ruby and Apache Arrow
RubyKaigi 2022 - Fast data processing with Ruby and Apache ArrowRubyKaigi 2022 - Fast data processing with Ruby and Apache Arrow
RubyKaigi 2022 - Fast data processing with Ruby and Apache ArrowKouhei Sutou
 
DiUS Computing Lca Rails Final
DiUS  Computing Lca Rails FinalDiUS  Computing Lca Rails Final
DiUS Computing Lca Rails FinalRobert Postill
 
Crossing the Bridge: Connecting Rails and your Front-end Framework
Crossing the Bridge: Connecting Rails and your Front-end FrameworkCrossing the Bridge: Connecting Rails and your Front-end Framework
Crossing the Bridge: Connecting Rails and your Front-end FrameworkDaniel Spector
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Presto Fast SQL on Anything
Presto Fast SQL on AnythingPresto Fast SQL on Anything
Presto Fast SQL on AnythingAlluxio, Inc.
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Python in the land of serverless
Python in the land of serverlessPython in the land of serverless
Python in the land of serverlessDavid Przybilla
 
Rails and the Apache SOLR Search Engine
Rails and the Apache SOLR Search EngineRails and the Apache SOLR Search Engine
Rails and the Apache SOLR Search EngineDavid Keener
 
Introduction to Rails - presented by Arman Ortega
Introduction to Rails - presented by Arman OrtegaIntroduction to Rails - presented by Arman Ortega
Introduction to Rails - presented by Arman Ortegaarman o
 
Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar Timothy Spann
 
Rotzy - Building an iPhone Photo Sharing App on Google App Engine
Rotzy - Building an iPhone Photo Sharing App on Google App EngineRotzy - Building an iPhone Photo Sharing App on Google App Engine
Rotzy - Building an iPhone Photo Sharing App on Google App Enginegeehwan
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoSadayuki Furuhashi
 
Super simple introduction to REST-APIs (2nd version)
Super simple introduction to REST-APIs (2nd version)Super simple introduction to REST-APIs (2nd version)
Super simple introduction to REST-APIs (2nd version)Patrick Savalle
 
Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialSourcesense
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
 
SQL for Everything at CWT2014
SQL for Everything at CWT2014SQL for Everything at CWT2014
SQL for Everything at CWT2014N Masahiro
 
PART-3 : Mastering RTOS FreeRTOS and STM32Fx with Debugging
PART-3 : Mastering RTOS FreeRTOS and STM32Fx with DebuggingPART-3 : Mastering RTOS FreeRTOS and STM32Fx with Debugging
PART-3 : Mastering RTOS FreeRTOS and STM32Fx with DebuggingFastBit Embedded Brain Academy
 

Similar to RubyKaigi Takeout 2021 - Red Arrow - Ruby and Apache Arrow (20)

RubyKaigi 2022 - Fast data processing with Ruby and Apache Arrow
RubyKaigi 2022 - Fast data processing with Ruby and Apache ArrowRubyKaigi 2022 - Fast data processing with Ruby and Apache Arrow
RubyKaigi 2022 - Fast data processing with Ruby and Apache Arrow
 
DiUS Computing Lca Rails Final
DiUS  Computing Lca Rails FinalDiUS  Computing Lca Rails Final
DiUS Computing Lca Rails Final
 
Crossing the Bridge: Connecting Rails and your Front-end Framework
Crossing the Bridge: Connecting Rails and your Front-end FrameworkCrossing the Bridge: Connecting Rails and your Front-end Framework
Crossing the Bridge: Connecting Rails and your Front-end Framework
 
Big Data Processing with .NET and Spark (SQLBits 2020)
RubyKaigi Takeout 2021 - Red Arrow - Ruby and Apache Arrow

  • 1. Red Arrow - Ruby and Apache Arrow Powered by Rabbit 3.0.1 Red Arrow Ruby and Apache Arrow Sutou Kouhei ClearCode Inc. RubyKaigi Takeout 2021 2021-09-11
  • 2. Sutou Kouhei: a president Rubyist
      • The president of ClearCode Inc.
  • 3. Sutou Kouhei: an Apache Arrow contributor
      • A member of the Apache Arrow PMC (Project Management Committee)
      • Second in number of commits
  • 4. Sutou Kouhei: the pioneer in Ruby and Arrow
      • The author of Red Arrow
      • Red Arrow:
          • The official Apache Arrow library for Ruby
          • GObject Introspection based bindings
          • Apache Arrow GLib is developed for Red Arrow
  • 5. GObject Introspection? A way to implement bindings. See the talk Ruby bindings 2016 - How to create bindings 2016 (Kouhei Sutou, ClearCode Inc., RubyKaigi 2016, 2016-09-09): https://rubykaigi.org/2016/presentations/ktou.html
  • 6. Why do I work on Red Arrow?
      • To use Ruby for data processing, at least for a part of it
      • Result of my 5 years of work: we can use Ruby for some data processing!
  • 7. Goal of this talk
      • You want to use Ruby for some data processing
      • You join the Red Data Tools project
  • 8. Red Data Tools project? Red Data Tools is a project that provides data processing tools for Ruby: https://red-data-tools.github.io/
  • 9. Data processing? ... how?
  • 10. 0. Why do you want to process data?
      • What problem do you want to solve?
      • What data is needed for it?
      • ...
      • No Red Arrow support in this area
  • 11. 1. Collect data
      • Where is the data?
      • Where is the collected data stored?
      • ...
      • Some Red Arrow support in this area
  • 12. Common datasets
      require "datasets"
      Datasets::Iris.new
      Datasets::PostalCodeJapan.new
      Datasets::Wikipedia.new
      Red Datasets: https://github.com/red-data-tools/red-datasets
  • 13. Output: Local file
      require "datasets-arrow"
      dataset = Datasets::PostalCodeJapan.new
      dataset.to_arrow.save("codes.csv")
      dataset.to_arrow.save("codes.arrow")
      Red Datasets Arrow: https://github.com/red-data-tools/red-datasets-arrow
  • 14. #save: a general serialize API for table data
      • Serializes as the specified format: if you use a Red Arrow object for in-memory table data, you can serialize it to many formats. Cool!
      • Extensible!
  • 15. #save: Implementation
      module Arrow
        class Table
          def save(output)
            saver = TableSaver.new(self, output)
            saver.save
          end
        end
      end
  • 16. #save: Implementation
      class Arrow::TableSaver
        def save
          format = detect_format(@output)
          # Dispatch to save_as_csv, save_as_arrow, ... by format name
          __send__("save_as_#{format}")
        end

        def save_as_csv
          # Body omitted on the slide
        end
      end
  • 17. #save: Extend by Red Parquet
      module Parquet::ArrowTableSavable
        def save_as_parquet
          # Body omitted on the slide
        end
        Arrow::TableSaver.include(self)
      end
      Red Parquet is a subproject of Red Arrow.
  • 18. #save: Extended
      require "datasets-arrow"
      require "parquet"
      dataset = Datasets::PostalCodeJapan.new
      dataset.to_arrow.save("codes.parquet")
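The dispatch and extension pattern from slides 16 to 18 can be tried without Red Arrow. The following is a minimal sketch, not the Red Arrow implementation: `ToyTableSaver`, `ToyTsvSavable` and the TSV format are hypothetical names chosen for illustration, format detection is assumed to use the file extension, and the savers return strings instead of writing files so the result is easy to inspect.

```ruby
# Hypothetical stand-in for Arrow::TableSaver (illustration only).
class ToyTableSaver
  def initialize(rows, output)
    @rows = rows
    @output = output
  end

  # Dispatch to save_as_<format>; <format> is taken from the file extension.
  def save
    format = detect_format(@output)
    __send__("save_as_#{format}")
  end

  private

  # Assumption: the format is simply the extension without the dot.
  def detect_format(output)
    File.extname(output).delete_prefix(".")
  end

  def save_as_csv
    @rows.map { |row| row.join(",") }.join("\n")
  end
end

# An extension adds a format by defining save_as_<format> and
# including itself into the saver class, as Red Parquet does on slide 17.
module ToyTsvSavable
  def save_as_tsv
    @rows.map { |row| row.join("\t") }.join("\n")
  end

  ToyTableSaver.include(self)
end

rows = [["100-0001", "Tokyo"], ["530-0001", "Osaka"]]
csv = ToyTableSaver.new(rows, "codes.csv").save
tsv = ToyTableSaver.new(rows, "codes.tsv").save
puts csv
puts tsv
```

In this sketch an unknown extension raises NoMethodError from `__send__`, which is the price of the name-based dispatch: supporting a new format needs no change to the `save` method itself.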
  • 19. Output: Online storage: Fluentd
      • fluent-plugin-s3-arrow:
          • Collects data by Fluentd
          • Formats data as Apache Parquet by Red Arrow
          • Stores data to Amazon S3 by fluent-plugin-s3
          • By @kanga33 at Speee/Red Data Tools
      • https://github.com/red-data-tools/fluent-plugin-s3-arrow/
  • 20. Output: Online storage: Red Arrow
      require "datasets-arrow"
      require "arrow-dataset"
      dataset = Datasets::PostalCodeJapan.new
      url = URI("s3://mybucket/codes.parquet")
      dataset.to_arrow.save(url)
      Under implementation...
  • 21. #save: Implementing...
      class Arrow::TableSaver
        def save
          if @output.is_a?(URI)
            __send__("save_to_uri")
          else
            __send__("save_to_file")
          end
        end
      end
  • 22. Collect data with Red Arrow: Wrap up
      • Usable as a serializer for common formats
      • Usable as a writer to common locations in the near future...
  • 23. 2. Read data
      • What format is used?
      • Where is the collected data?
      • How large is the collected data?
  • 24. Format
      require "arrow"
      table = Arrow::Table.load("data.csv")
      table = Arrow::Table.load("data.json")
      table = Arrow::Table.load("data.arrow")
      table = Arrow::Table.load("data.orc")
  • 25. .load: a general deserialize API for table data
      • Deserializes common formats
      • Extensible!
  • 26. .load: Implementation
      module Arrow
        def Table.load(input)
          loader = TableLoader.new(self, input)
          loader.load
        end
      end
  • 27. .load: Implementation
      class Arrow::TableLoader
        def load
          format = detect_format(@input)
          # Dispatch to load_as_csv, load_as_arrow, ... by format name
          __send__("load_as_#{format}")
        end

        def load_as_csv
          # Body omitted on the slide
        end
      end
  • 28. .load: Extend by Red Parquet
      module Parquet::ArrowTableLoadable
        def load_as_parquet
          # Body omitted on the slide
        end
        Arrow::TableLoader.include(self)
      end
      Red Parquet is a subproject of Red Arrow.
  • 29. .load: Extended
      require "parquet"
      table = Arrow::Table.load("data.parquet")
  • 30. .load: More extensible
      class Arrow::TableLoader
        def load
          if @input.is_a?(URI)
            __send__("load_from_uri")
          else
            __send__("load_from_file")
          end
        end
      end
  • 31. .load: Extend by Red Arrow Dataset
      module ArrowDataset::ArrowTableLoadable
        def load_from_uri
          # Body omitted on the slide
        end
        Arrow::TableLoader.include(self)
      end
      Red Arrow Dataset is a subproject of Red Arrow.
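The location-based dispatch from slides 30 and 31 can be sketched with only the standard library. This is an illustration under assumptions, not Red Arrow code: `ToyTableLoader` and `ToyUriLoadable` are hypothetical names, and the loaders return a description of what they would read instead of touching a file or a network.

```ruby
require "uri"

# Hypothetical stand-in for Arrow::TableLoader (illustration only).
class ToyTableLoader
  def initialize(input)
    @input = input
  end

  # URI objects go to load_from_uri; anything else is treated as a local path.
  def load
    if @input.is_a?(URI)
      load_from_uri
    else
      load_from_file
    end
  end

  private

  def load_from_file
    [:file, @input]
  end
end

# An extension supplies load_from_uri and includes itself into the loader,
# mirroring what slide 31 shows for Red Arrow Dataset.
module ToyUriLoadable
  def load_from_uri
    [:uri, @input.scheme, @input.host, @input.path]
  end

  ToyTableLoader.include(self)
end

from_file = ToyTableLoader.new("data.parquet").load
from_uri = ToyTableLoader.new(URI("s3://bucket/data.parquet")).load
p from_file
p from_uri
```

The `is_a?(URI)` check works because `URI(...)` returns a `URI::Generic` (or a subclass), and that class includes the `URI` module.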
  • 32. Location: Online storage
      require "arrow-dataset"
      url = URI("s3://bucket/path...")
      table = Arrow::Table.load(url)
  • 33. Location: RDBMS
      require "arrow-activerecord"
      User.all.to_arrow
      Red Arrow Active Record: https://github.com/red-data-tools/red-arrow-activerecord
  • 34. Location: Network
      require "arrow-flight"
      client = ArrowFlight::Client.new(url)
      info = client.list_flights[0]
      reader = client.do_get(info.endpoints[0].ticket)
      table = reader.read_all
      Introducing Apache Arrow Flight: A Framework for Fast Data Transport: https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
  • 35. Large data
      • Apache Arrow format: designed for large data
      • What large data needs: fast load, ...
  • 36. Fast load: Benchmark
      require "csv" # for CSV.read
      require "datasets-arrow"
      dataset = Datasets::PostalCodeJapan.new
      table = dataset.to_arrow # 124271 records
      n = 5
      n.times do |i|
        table.save("codes.#{i}.csv")
        table.save("codes.#{i}.arrow")
        CSV.read("codes.#{i}.csv")
        Arrow::Table.load("codes.#{i}.csv")
        Arrow::Table.load("codes.#{i}.arrow")
        # Double the number of records for the next round
        table = table.concatenate([table])
      end
  • 37. Fast load: Benchmark: All (chart)
  • 38. Fast load: Benchmark: Red Arrow (chart)
  • 39. How to implement fast load: see the talk Apache Arrowフォーマットはなぜ速いのか (Why is the Apache Arrow format fast?) by Sutou Kouhei, ClearCode Inc., db tech showcase ONLINE 2020, 2020-12-08: https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/
  • 40. Read data with Red Arrow: Wrap up
      • Easy to read common formats
      • Easy to read from common locations
      • Large data ready
  • 41. 3. Explore data
      • Preprocess data: filter out needless data, ...
      • Summarize data and visualize it, ...
      • Red Arrow can be used for some operations
  • 42. Filter: Red Arrow
      require "datasets-arrow"
      table = Datasets::PostalCodeJapan.new.to_arrow
      table.n_rows # 124271
      filtered_table = table.slice do |slicer|
        slicer.prefecture == "東京都" # Tokyo
      end
      filtered_table.n_rows # 3887
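One way to picture a column-based filter like the one on slide 42 is as two steps: evaluate the condition against a single column to get a boolean mask, then apply that mask to every column. The sketch below shows this idea in plain Ruby. It is an assumption made for illustration: the slides do not show how Red Arrow evaluates the block, and the `columns` hash is a toy table, not an Arrow data structure.

```ruby
# Toy column-oriented table: one Ruby array per column.
# (Assumption for illustration; real Arrow columns are typed buffers.)
columns = {
  code: ["1000001", "5300001", "1000002"],
  prefecture: ["Tokyo", "Osaka", "Tokyo"],
}

# Step 1: evaluate the condition on one column to build a boolean mask.
mask = columns[:prefecture].map { |value| value == "Tokyo" }

# Step 2: keep only the positions where the mask is true, in every column.
filtered = columns.transform_values do |values|
  values.select.with_index { |_, i| mask[i] }
end

p mask
p filtered
```

The condition touches only the `prefecture` column, so no per-row object has to be built to decide which rows to keep.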
  • 43. Filter: Performance
      dataset = Datasets::PostalCodeJapan.new
      arrow_dataset = dataset.to_arrow
      dataset.find_all do |row|
        row.prefecture == "東京都" # Tokyo
      end # 1.256s
      arrow_dataset.slice do |slicer|
        slicer.prefecture == "東京都" # Tokyo
      end # 0.001s
  • 44. Filter: Performance (chart)
  • 45. Apache Arrow data: Interchangeable
      • With low cost thanks to fast load
      • Apache Arrow data ready systems are increasing, e.g. DuckDB: an in-process SQL OLAP DBMS (an SQLite-like DBMS for OLAP; OLAP: OnLine Analytical Processing)
  • 46. Filter: DuckDB
      require "arrow-duckdb"
      codes = Datasets::PostalCodeJapan.new.to_arrow
      db = DuckDB::Database.open
      c = db.connect
      c.register("codes", codes) do # Use codes without copy
        c.query("SELECT * FROM codes WHERE prefecture = ?",
                "東京都", # Tokyo
                output: :arrow) # Output as Apache Arrow data
          .to_table.n_rows # 3887
      end
  • 47. Summarize: Group + aggregation
      iris = Datasets::Iris.new.to_arrow
      iris.group(:label).count(:sepal_length)
      #   count(sepal_length)           label
      # 0                  50     Iris-setosa
      # 1                  50 Iris-versicolor
      # 2                  50  Iris-virginica
  • 48. Visualize: Charty
      require "charty"
      Charty.backends.use("pyplot")
      Charty.scatter_plot(data: iris,
                          x: :sepal_length,
                          y: :sepal_width,
                          color: :label)
        .save("iris.png")
  • 49. Visualize: Charty: Result (chart)
  • 50. 4. Use insight
      • Write a report
      • Build a model
      • ...
      • No Red Arrow support in this area for now, but it can be used for passing data to other tools such as DuckDB and Charty
  • 51. Data processing and Red Arrow
      • Red Arrow helps us in some areas: collecting, reading and exploring data
      • Some tools can integrate with Red Arrow: Fluentd, DuckDB, Charty, ...
  • 52. Red Arrow and Ruby 3.0
      • MemoryView support
      • Ractor support
  • 53. MemoryView: "MemoryView provides the features to share multidimensional homogeneous arrays of fixed-size element on memory among extension libraries."
      • https://docs.ruby-lang.org/en/master/doc/memory_view_md.html
      • https://tech.speee.jp/entry/2020/12/24/093131 (Japanese)
  • 54. Numeric arrays in Red Arrow
      • Arrow::NumericArray family: 1-dimensional numeric arrays
      • Arrow::Tensor: multidimensional homogeneous numeric arrays
  • 55. MemoryView: Red Arrow
      • Arrow::NumericArray family
          • Export as MemoryView: supported
          • Import from MemoryView: not yet
      • Arrow::Tensor
          • Export/Import: not yet
      • Join Red Data Tools to work on this!
  • 56. MemoryView: C++
      • Some problems were found during this work:
          • Can't use private as a member name
          • Can't assign to a const variable with a cast
      • Ruby 3.1 will fix them
  • 57. Ractor: "Ractor is designed to provide a parallel execution feature of Ruby without thread-safety concerns."
      • https://docs.ruby-lang.org/en/master/doc/ractor_md.html
      • https://techlife.cookpad.com/entry/2020/12/26/131858 (Japanese)
  • 58. Red Arrow and concurrency
      • Red Arrow data is immutable
      • Ractor can share frozen objects
  • 59. Ractor: Red Arrow
      require "datasets-arrow"
      table = Datasets::PostalCodeJapan.new.to_arrow
      Ractor.make_shareable(table)
      Ractor.new(table) do |t|
        t.slice do |slicer|
          slicer.prefecture == "東京都" # Tokyo
        end
      end
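The sharing rule from slide 58 can be tried without Red Arrow. In the minimal sketch below, a deeply frozen array of hashes stands in for an immutable table; the data and the `ractor_result` helper are assumptions made for illustration. The helper exists only because newer Ruby versions provide `Ractor#value` while Ruby 3.0 provides `Ractor#take`.

```ruby
# Toy stand-in for an immutable table: an array of row hashes.
rows = [
  {prefecture: "Tokyo", code: "1000001"},
  {prefecture: "Osaka", code: "5300001"},
  {prefecture: "Tokyo", code: "1000002"},
]

# Deep-freeze the data so that Ractors can share it without copying.
Ractor.make_shareable(rows)

# Hypothetical helper: use Ractor#value where it exists, else Ractor#take.
def ractor_result(ractor)
  ractor.respond_to?(:value) ? ractor.value : ractor.take
end

# Each Ractor receives the same shared object and filters it in parallel.
ractors = 2.times.map do
  Ractor.new(rows) do |shared_rows|
    shared_rows.count { |row| row[:prefecture] == "Tokyo" }
  end
end

counts = ractors.map { |ractor| ractor_result(ractor) }
p counts
```

Without `Ractor.make_shareable`, passing `rows` to `Ractor.new` would deep-copy it for each Ractor, which is the cost that immutable data avoids.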
  • 60. Red Arrow - Ruby and Apache Arrow Powered by Rabbit 3.0.1 Ractor: Red Arrow: Benchmark n_ractors = 4 n_jobs_per_ractor = 1000 n_jobs = n_ractors * n_jobs_per_ractor n_jobs.times do table.slice {|s| s.prefecture == "東京都"} end n_ractors.times.collect do Ractor.new(table, n_jobs_per_ractor) do |t, n| n.times {t.slice {|s| s.prefecture == "東京都"}} end end.each(&:take)
  • 61. Ractor: Red Arrow: Benchmark
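The benchmark slides show the two workloads but not how they were timed. One way to time a sequential loop against a Ractor-based one is `Benchmark.realtime` from the standard library. This is a sketch under stated assumptions, not the measurement code from the talk: a frozen integer array (`data`) stands in for the Arrow table, the job counts are reduced, and `ractor_result` is a helper introduced only for compatibility between `Ractor#take` and `Ractor#value`.

```ruby
require "benchmark"

# Helper for this sketch only: use Ractor#value when it exists,
# otherwise fall back to Ractor#take.
def ractor_result(ractor)
  ractor.respond_to?(:value) ? ractor.value : ractor.take
end

# Hypothetical stand-in for the shared, immutable table.
data = Ractor.make_shareable((1..10_000).to_a)

n_ractors = 4
n_jobs_per_ractor = 50
n_jobs = n_ractors * n_jobs_per_ractor

# Sequential: every job runs in the main Ractor.
sequential_results = nil
sequential_time = Benchmark.realtime do
  sequential_results = n_jobs.times.collect do
    data.count {|x| x.even?}
  end
end

# Parallel: the same number of jobs, split evenly across Ractors.
parallel_results = nil
parallel_time = Benchmark.realtime do
  ractors = n_ractors.times.collect do
    Ractor.new(data, n_jobs_per_ractor) do |d, n|
      n.times.collect {d.count {|x| x.even?}}
    end
  end
  parallel_results = ractors.flat_map {|r| ractor_result(r)}
end

puts "sequential: #{sequential_time.round(3)}s"
puts "parallel:   #{parallel_time.round(3)}s"
```

The measured times depend on the machine, the Ruby version, and the workload, so the sketch prints them rather than assuming which variant is faster.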
  • 62. Wrap up. Ruby can be used for some data processing work, and Red Arrow helps you! Ruby 3.0 has useful features for data processing work, and Red Arrow has started supporting them.
  • 63. Goal of this talk. You want to use Ruby for some data processing. You join the Red Data Tools project.
  • 64. Future work. Implement DataFusion bindings by adding a C API to DataFusion (DataFusion: an Apache Arrow native query execution framework written in Rust, https://github.com/apache/arrow-datafusion/). Add an Active Record like API to Red Arrow. Improve MemoryView/Ractor support.
  • 65. Red Data Tools: Join us! https://red-data-tools.github.io/ https://gitter.im/red-data-tools/en https://red-data-tools.github.io/ja/ https://gitter.im/red-data-tools/ja
  • 66. OSS Gate on-boarding. Helps OSS projects such as Ruby and Red Arrow accept newcomers. Contact me if you are interested: OSS project members who want to accept newcomers, and companies that want to support OSS Gate on-boarding. https://oss-gate.github.io/on-boarding/
  • 67. ClearCode Inc. Recruitment: developers to work on Red Arrow related business: https://www.clear-code.com/recruitment/ Business: Apache Arrow/Red Arrow related technical support and consulting: https://www.clear-code.com/contact/