In the last decade, Apache Parquet has become the standard format for storing tabular data on disk, regardless of the technology stack used. This is due to its read/write performance, efficient compression, interoperability, and, not least, the solid results it delivers with just the default settings.
While these default settings and access patterns already provide decent performance, by understanding the format in more detail and using recent developments, one can get much better performance, smaller files, and utilise Parquet's newer partial reading features to read even smaller subsets of a file for a given query.
This talk aims to provide insight into the Parquet format and its recent developments that are useful for end users' daily workflows. The only prior knowledge required is knowing what a DataFrame/tabular data is.
2. About me
• Uwe Korn
https://mastodon.social/@xhochy / @xhochy
• CTO at Data Science startup QuantCo
• Previously worked as a Data Engineer
• A lot of OSS, notably Apache {Arrow, Parquet} and conda-forge
• PyData Südwest Co-Organizer
5. Apache Parquet
1. Columnar, on-disk storage format
2. Started in 2012 by Cloudera and Twitter
3. Later, it became Apache Parquet
4. Fall 2016 brought full Python & C++ Support
5. State-of-the-art since the Hadoop era, still going strong
6. Clear benefits
1. Columnar makes vectorized operations fast
2. Efficient encodings and compression make it small
3. Predicate-pushdown brings computation to the I/O layer
4. Language-independent and widespread; common exchange format
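To make the predicate-pushdown point concrete, here is a minimal pyarrow sketch; the file name, column names, and threshold are invented for illustration. Only the requested columns and only row groups whose statistics can match the filter are read from disk.

import pyarrow.parquet as pq

# The predicate is evaluated at the I/O layer: row groups whose min/max
# statistics rule out fare_amount > 100 are skipped entirely.
table = pq.read_table(
    "trips.parquet",
    columns=["tpep_pickup_datetime", "fare_amount"],
    filters=[("fare_amount", ">", 100)],
)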
22. Data Types?
• Well, actually…
• …it doesn’t save much on disk.
• By choosing the optimal types (lossless cast to e.g. float32 or uint8) on a month of New York Taxi trips:
Saves 963 bytes 😥 of 20.6 MiB
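A minimal sketch of such a lossless downcast with pandas; the file names are assumptions, and pd.to_numeric only shrinks a column when no information is lost.

import pandas as pd

df = pd.read_parquet("yellow_tripdata_2021-01.parquet")

# Downcast each numeric column to the smallest type that still holds every
# value, e.g. float64 -> float32 or int64 -> int8 where possible.
for col in df.select_dtypes("number").columns:
    downcast = "float" if df[col].dtype.kind == "f" else "integer"
    df[col] = pd.to_numeric(df[col], downcast=downcast)

df.to_parquet("yellow_tripdata_2021-01_downcast.parquet")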
30. Compression Algorithm
• Datasets:
• New York Yellow Taxi Trips 2021-01
• New York Yellow Taxi Trips 2021-01 with a custom prediction
• gov.uk (House) Price Paid dataset
• COVID-19 Epidemiology
• Time measurements: Pick the median of five runs
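A rough sketch of such a measurement loop with pyarrow, taking the median of five runs per codec; the dataset path and the list of codecs are assumptions, and only write time and file size are measured here.

import os
import time
import pyarrow.parquet as pq

table = pq.read_table("yellow_tripdata_2021-01.parquet")

for codec in ["snappy", "gzip", "brotli", "lz4", "zstd"]:
    timings = []
    for _ in range(5):
        start = time.perf_counter()
        pq.write_table(table, f"trips_{codec}.parquet", compression=codec)
        timings.append(time.perf_counter() - start)
    size_mib = os.path.getsize(f"trips_{codec}.parquet") / 2**20
    print(f"{codec}: {sorted(timings)[2]:.2f}s (median of 5), {size_mib:.1f} MiB")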
43. Compression
1. Let's stick with ZStandard for now, as it seems a good tradeoff between speed and size.
2. In some cases (e.g. slow network drives), it might be worth also considering Brotli
• …but Brotli is relatively slow to decompress.
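With pyarrow this boils down to something along these lines; the paths are placeholders, and level 3 matches the default compression level recommended later in the summary.

import pyarrow.parquet as pq

table = pq.read_table("trips.parquet")

# ZStandard: a good balance between (de)compression speed and file size.
pq.write_table(
    table,
    "trips_zstd.parquet",
    compression="zstd",
    compression_level=3,
)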
44. RowGroup size
1. If you plan to partially access the data, RowGroups are the common place to filter.
2. If you want to read the whole data, fewer (larger) RowGroups are better.
3. Compression & encoding also work better with larger RowGroups.
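A sketch of steering the RowGroup size with pyarrow and reading a single group back; the paths and the group size are placeholders.

import pyarrow.parquet as pq

table = pq.read_table("trips.parquet")

# Smaller RowGroups make partial reads cheaper; fewer, larger ones favour
# full scans and give encodings & compression more data to work with.
pq.write_table(table, "trips_rg.parquet", row_group_size=128_000)

pf = pq.ParquetFile("trips_rg.parquet")
print(pf.metadata.num_row_groups)
first_group = pf.read_row_group(0)  # reads only this one RowGroup from disk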
52. Hand-Crafted Delta
1. Let's take the timestamps in the NYC Taxi Trips dataset
2. Sort by pickup date
3. Compute a delta column for both dates
4. 17.5% saving on the whole file.
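One possible realisation of such a hand-crafted delta with pandas is sketched below; the column names follow the NYC Yellow Taxi schema, but the exact delta scheme is an assumption, not necessarily the one used in the talk.

import pandas as pd

df = pd.read_parquet("yellow_tripdata_2021-01.parquet")

# Sort by pickup time, then store both timestamps as small integer deltas:
# dropoff relative to pickup, and pickup relative to the previous row.
df = df.sort_values("tpep_pickup_datetime")
df["dropoff_delta_s"] = (
    (df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"])
    .dt.total_seconds()
    .astype("int64")  # the timestamps have second resolution
)
df["pickup_delta_s"] = (
    df["tpep_pickup_datetime"].diff().dt.total_seconds().fillna(0).astype("int64")
)
# The first pickup timestamp has to be kept (e.g. as metadata) so the
# original columns can be reconstructed when reading.
df = df.drop(columns=["tpep_pickup_datetime", "tpep_dropoff_datetime"])
df.to_parquet("trips_delta.parquet")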
53. Order your data
1. With our hand-crafted delta, it was worth sorting the data
2. Sorting can help in general, but in these tests it only paid off for the Price Paid dataset, where it saved 25%; all the other datasets actually got larger
54. Summary
1. Adjusting your data types is helpful in memory, but has no significant effect on disk
2. Store high-entropy floats as Byte Stream Split encoded columns (see the sketch below)
3. Check whether sorting has an effect
4. Delta Encoding in Parquet would be useful; use a hand-crafted delta for now
5. Zstd on level 3/4 seems like a good default compression setting
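For point 2, pyarrow's per-column encodings can request BYTE_STREAM_SPLIT, as sketched below; the input file and the column name "prediction" are assumptions, and pyarrow requires dictionary encoding to be switched off when column_encoding is given.

import pyarrow.parquet as pq

table = pq.read_table("trips_with_prediction.parquet")

pq.write_table(
    table,
    "trips_bss.parquet",
    compression="zstd",
    compression_level=3,
    use_dictionary=False,  # required when column_encoding is specified
    column_encoding={"prediction": "BYTE_STREAM_SPLIT"},
)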
56. What do we get?
1. Run once with the default settings
2. Test all compression settings, but also…
1. … use hand-crafted delta.
2. … use Byte Stream Split on predictions.