Apache Arrow as a full stack data
engineering solution
Alessandro Molina
@__amol__
https://alessandro.molina.fyi/
Who am I, Alessandro
● Maintainer of TurboGears2,
Apache Arrow Contributor,
Author of DukPy and DEPOT
● Director of Engineering at Voltron Data Labs
● Author of
“Modern Python Standard Library Cookbook” and
“Crafting Test-Driven Software with Python”.
What’s Apache Arrow?
● a data interchange standard
● an in-memory format
● a networking format
● a storage format
● an i/o library
● a computation engine
● a tabular data library
● a query engine
● a partitioned dataset manager
So much there!
The Apache Arrow project is a huge effort, aimed at solving the fundamental problems of the data analytics world.
Since it aims to provide a “write everywhere, run everywhere” experience, it’s easy to get lost if you don’t know where to start.
PyArrow is the entry point to the Apache Arrow
ecosystem for Python developers, and it can easily
give you access to many of the benefits of Arrow itself.
Introducing PyArrow
● Apache Arrow was born as a Columnar Data Format
● So the fundamental type in PyArrow is a “column of data”, which is exposed by the pyarrow.Array object and its subclasses.
● At this level, PyArrow is similar to one-dimensional NumPy arrays.
PyArrow Arrays
import pyarrow as pa
# Arrays can be made of numbers
>>> pa.array([1, 2, 3, 4, 5])
<pyarrow.lib.Int64Array object at 0xffff77d75d20>
# Or strings
>>> pa.array(["A", "B", "C", "D", "E"])
<pyarrow.lib.StringArray object at 0xffff77d75b40>
# And even complex objects
>>> pa.array([{"a": 5}, {"a": 7}])
<pyarrow.lib.StructArray object at 0xffff77d75d20>
# Arrays can also be masked (the mask must be a NumPy
# boolean array, where True marks a null value)
>>> import numpy as np
>>> pa.array([1, 2, 3, 4, 5],
...          mask=np.array([True, False, True, False, True]))
<pyarrow.lib.Int64Array object at 0xffff77d75d80>
Compared to classic NumPy arrays, PyArrow arrays are a bit more complex.
● They pair the buffer holding the data with a second buffer holding the validity bitmap, so null values can be represented explicitly instead of relying on sentinels like None.
● Arrays of strings also retain the guarantee of a single contiguous buffer for the values.
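To see that structure, here is a minimal sketch inspecting the buffers behind a string array (the values are hypothetical):

import pyarrow as pa

# An array with a null: the validity bitmap marks position 1 as null
arr = pa.array(["spam", None, "eggs"])
print(arr.null_count)      # 1
print(len(arr.buffers()))  # 3: validity bitmap, offsets, string values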
Introducing PyArrow Tables
● As Arrays are “columns”, a group of them can form a pyarrow.Table
● Tables are actually constituted of pyarrow.ChunkedArray columns, so that appending rows to them is a cheap operation.
● At this level, PyArrow is similar to pandas DataFrames
PyArrow Tables
>>> table = pa.table([
... pa.array([1, 2, 3, 4, 5]),
... pa.array(["a", "b", "c", "d", "e"]),
... pa.array([1.0, 2.0, 3.0, 4.0, 5.0])
... ], names=["col1", "col2", "col3"])
>>> table.take([0, 1, 4])
col1: [[1,2,5]]
col2: [["a","b","e"]]
col3: [[1,2,5]]
>>> table.schema
col1: int64
col2: string
col3: double
Compared to pandas, PyArrow tables are fully implemented in C++ and never modify data in place.
Tables are based on ChunkedArrays, so appending data to them is a zero-copy operation: a new table is created that references the data from the existing table as the first chunks of its arrays and the added data as the new chunks.
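A minimal sketch of that chunking behavior, with hypothetical data:

import pyarrow as pa

table = pa.table({"col1": [1, 2, 3]})
more_rows = pa.table({"col1": [4, 5]})

# The combined table references the existing chunks; nothing is copied.
combined = pa.concat_tables([table, more_rows])
print(combined.column("col1").num_chunks)  # 2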
The Acero compute engine in Arrow is able to
provide many common analytics and
transformation capabilities, like joining, filtering
and aggregating data in tables.
Running Analytics
The Acero compute engine
powers the analytics and
transformation capabilities
available on tables.
Many pyarrow.compute functions provide kernels that work on tables, and Table exposes join, filter and group_by methods.
import pyarrow as pa
import pyarrow.compute as pc
>>> table = pa.table([
... pa.array(["a", "a", "b", "b", "c", "d", "e", "c"]),
... pa.array([11, 20, 3, 4, 5, 1, 4, 10]),
... ], names=["keys", "values"])
>>> table.filter(pc.field("values") == 4)
keys: [["b","e"]]
values: [[4,4]]
>>> table.group_by("keys").aggregate([("values", "sum")])
values_sum: [[31,7,15,1,4]]
keys: [["a","b","c","d","e"]]
>>> table1 = pa.table({'id': [1, 2, 3],
... 'year': [2020, 2022, 2019]})
>>>
>>> table2 = pa.table({'id': [3, 4],
... 'n_legs': [5, 100],
... 'animal': ["Brittle stars", "Centipede"]})
>>>
>>> table1.join(table2, keys="id")
id: [[3,1,2]]
year: [[2019,2020,2022]]
n_legs: [[5,null,null]]
animal: [["Brittle stars",null,null]]
PyArrow, NumPy and pandas
One of the original design goals of Apache Arrow was to allow easy exchange of data without the cost of converting it across multiple formats or marshaling it before transfer.
In the spirit of those capabilities, PyArrow provides
copy-free support for converting data to and from
pandas and numpy.
If you have data in PyArrow you can invoke to_numpy on pyarrow.Array, and to_pandas on both pyarrow.Array and pyarrow.Table, to get them as NumPy or pandas objects without any additional conversion cost.
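A minimal sketch of those conversions, assuming numeric data without nulls (the case where they can be zero-copy):

import pyarrow as pa

arr = pa.array([1, 2, 3, 4, 5])
np_values = arr.to_numpy()       # a NumPy view, zero-copy for numeric data

table = pa.table({"col1": arr})
df = table.to_pandas()           # a pandas DataFrame, no marshaling step
back = pa.Table.from_pandas(df)  # and back to Arrow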
And it’s fast!
>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> data = [a % 5 for a in range(100000000)]
>>> npdata = np.array(data)
>>> padata = pa.array(data)
>>> import timeit
>>> timeit.timeit(
... lambda: np.unique(npdata, return_counts=True),
... number=1
... )
1.5212857750011608
>>> timeit.timeit(
... lambda: pc.value_counts(padata),
... number=1
... )
0.3754262370057404
Very fast!
In [3]: timeit df = pd.DataFrame(dict_of_numpy_arrays)
82.5 ms ± 865 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: timeit df = pa.table(dict_of_numpy_arrays).to_pandas()
50.2 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# pandas itself can also delegate CSV parsing to PyArrow
df = pd.read_csv("large.csv", engine="pyarrow")
Full-Stack Solution
● DISK: Arrow Storage Format. Data can be stored in the Arrow disk format itself.
● MEMORY: Arrow In-Memory Format. When loaded, the data will still be in the Arrow format.
● COMPUTE: Acero. Computation can be performed natively on the Arrow format.
● NETWORK: Arrow Flight. The Arrow format can be used to ship data across the network through Arrow Flight.
Arrow from disk to memory
● Saving data in the Arrow format allows PyArrow to leverage the exact same format for on-disk and in-memory data.
● This means that no marshaling cost is paid when loading the data back.
● It also allows leveraging memory mapping, so data is not processed until it’s actually accessed.
● This reduces the latency of accessing data from seconds to milliseconds.
● Memory mapping also allows managing data bigger than memory.
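A minimal sketch of a memory-mapped read of an Arrow IPC file (the file name is hypothetical):

import pyarrow as pa
import pyarrow.ipc as ipc

# The file is mapped, not read: bytes are only paged in when actually accessed.
with pa.memory_map("data.arrow", "r") as source:
    table = ipc.open_file(source).read_all()
print(table.num_rows)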
Arrow format does not solve it all
● The Arrow format can make working with your data very fast
● But it is expensive in terms of disk space, as it’s optimized for fast computation and SIMD instructions, not for storage size.
● It natively supports compression algorithms, but those come at a cost that nullifies most of the benefits of using the Arrow format in the first place.
● The Arrow format is a great hot format, but there are better solutions for cold storage.
total 1.3G
-rw-r--r-- 1 root root 1.2G Nov 2 16:10 data.arrow
-rw-r--r-- 1 root root 155M Nov 2 16:10 data.pqt
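A minimal sketch of producing both files from the same table (the file names match the listing above, the data is hypothetical):

import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

table = pa.table({"col1": list(range(1_000_000))})

feather.write_feather(table, "data.arrow", compression="uncompressed")  # hot format
pq.write_table(table, "data.pqt")  # cold storage, much smaller on disk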
Yes, you can read 17 Million Rows in 9ms*
* for some definitions of read
From memory to network: Arrow Flight
● Arrow Flight is a protocol and implementation, provided in Arrow itself, that is optimized for transferring columnar data in the Apache Arrow format.
● pyarrow.flight.FlightServerBase provides the server implementation, and pyarrow.flight.connect lets you create clients that connect to Flight servers.
● Flight hooks directly into gRPC, so no marshaling or unmarshaling happens when sending data over the network.
● https://arrow.apache.org/cookbook/py/flight.html
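A minimal sketch of both sides of a Flight exchange, loosely following the cookbook pattern (location, ticket contents and data are hypothetical):

import pyarrow as pa
import pyarrow.flight as flight

class TinyServer(flight.FlightServerBase):
    def do_get(self, context, ticket):
        # Every GET request streams back the same small table.
        table = pa.table({"col1": [1, 2, 3]})
        return flight.RecordBatchStream(table)

server = TinyServer(location="grpc://0.0.0.0:8815")
# server.serve()  # blocking call; run the server in its own process

# Client side: fetch the stream and materialize it as a Table.
client = flight.connect("grpc://0.0.0.0:8815")
table = client.do_get(flight.Ticket(b"ignored")).read_all()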
Arrow Flight speed
Based on the same foundations that we saw for dealing with data on disk, using Arrow Flight for data on the network can provide major performance gains compared to other existing solutions for transferring data.
Full-Stack Solution, evolved
● DISK: Arrow Storage Format. Data can be stored in the Arrow disk format itself.
● MEMORY: Arrow In-Memory Format. When loaded, the data will still be in the Arrow format.
● COMPUTE: Acero. Computation can be performed natively on the Arrow format.
● NETWORK: Arrow Flight. The Arrow format can be used to ship data across the network through Arrow Flight.
● COLD STORAGE: Parquet. PyArrow natively supports optimized Parquet loading.
● FLIGHT SQL / ADBC: ADBC & FlightSQL. Native support for fetching data from databases in Arrow format.
● NANOARROW: NanoArrow. Sharing Arrow data between languages and libraries in the same process.
Arrow & Database: FlightSQL
● Flight SQL aims to provide broadly similar functionality to existing APIs like JDBC and ODBC
● Built on Flight, it provides an efficient wire format that supports features like encryption and authentication out of the box, while allowing for further optimizations like parallel data access
● You get the performance of Flight, with the convenience of a SQL database.
● FlightSQL is mostly a transport for higher-level APIs; you are not meant to use it directly.
Arrow & Database: ADBC
● A standard database interface built around Arrow data, especially for efficiently fetching large datasets (i.e. with minimal or no serialization and copying)
● ADBC can leverage FlightSQL or connect directly to the database (it currently supports Postgres, DuckDB, SQLite, …)
● Optimized for transferring column-major data instead of row-major data like most database drivers.
● Supports both SQL dialects and the emerging Substrait standard.
Arrow & Database: ADBC
import pyarrow
import pandas
from pandas.testing import assert_frame_equal
from adbc_driver_sqlite import dbapi

# Assumed setup: connect through the ADBC SQLite driver (in-memory database)
sqlite = dbapi.connect()

# Fetch the results as an Arrow Table
with sqlite.cursor() as cur:
    cur.execute('SELECT 1, "foo", 2.0')
    assert cur.fetch_arrow_table() == pyarrow.table(
        {
            "1": [1],
            '"foo"': ["foo"],
            "2.0": [2.0],
        }
    )

# Fetch the results as a pandas DataFrame
with sqlite.cursor() as cur:
    cur.execute('SELECT 1, "foo", 2.0')
    assert_frame_equal(
        cur.fetch_df(),
        pandas.DataFrame(
            {
                "1": [1],
                '"foo"': ["foo"],
                "2.0": [2.0],
            }
        ),
    )
Questions?
● PyArrow Documentation
https://arrow.apache.org/docs/python/getstarted.html
● PyArrow Cookbook
https://arrow.apache.org/cookbook/py/index.html