SlideShare a Scribd company logo
1 of 24
Download to read offline
Boosting command line experience
Python meets AWK
Kirill Pavlov
Technical Recruiter, Terminal 1
October 22, 2017
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 1 / 20
README.md
A lot of Python/AWK examples here.
Source code and slides available online.
At the end: build a stock trading system and check NYSE:CS.
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 2 / 20
Table of content
1. Problem Background
2. AWK Bootcamp in 5 min
3. Tabtools architecture and features
4. Stock example with NYSE:CS
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 3 / 20
Table of content
1. Problem Background
2. AWK Bootcamp in 5 min
3. Tabtools architecture and features
4. Stock example with NYSE:CS
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 4 / 20
Background
Yandex, year 2010. Hadoop was not widly adopted.
10Gb of archived ads data daily: time, ad_id, site_id, clicks.
Task: daily data aggregation (simple functions: group by, sum, join) and feature
generation for further machine learning classification.
Solution: released a set of command line scripts.
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 5 / 20
Example
This presentation uses UCI machine learning Higgs boson data: 11M objects, 28
attributes, 7.5Gb unarchived.
Questions:
1. What is the maximum value of lepton_eta?
2. What is the average value of lepton_phi by class 0 and 1?
3. Filter objects with m_jj > 0.75 (8.9M objects) and sort them by m_wbb.
Solutions:
1. In-memory Python with Pandas.
2. Databse SQL queries (PostgreSQL and Docker).
3. Command line with AWK.
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 6 / 20
Demo Time
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 6 / 20
Reality Check
It’s not as agile as it seems. You work inside the company network.
1. You don’t have sudo rights and your admin does not want to install anything for you.
Like no database or user privileges, etc.
2. The server does not have GitHub/Internet access and the only deployment
possible is Java JARs or C/C++/etc. So, no NodeJS/Python packages. And of course
no R/Matlab/Excel.
3. Get better at command line tools ;)
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 7 / 20
Table of content
1. Problem Background
2. AWK Bootcamp in 5 min
3. Tabtools architecture and features
4. Stock example with NYSE:CS
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 8 / 20
Basic concepts
1. AWK1
— language for streaming columnar data processing. Standard in unix-like OS.
2. Actual AWK is outdated, use mawk (fast) or gawk (flexible).
3. Limited data structures: strings, associative arrays (hash maps) and regexps.
4. Built-in variables:
$1, $2, . . . ($0 is entire record)
NR - number of processe lines (records)
NF - number of columns (fields)
5. Use vars without declaration. Default values are 0s. One liners. Hipster friendly.
1
Tutorial by Bruce Barnett. Careful, he writes his blog in txt
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 9 / 20
AWK Examples 1 & 2
1. Count number of words and lines at codeconf.hk:
cat codeconf.md | awk ’{w += NF}END{print NR, w}’
370 1445
2. Most popular words on codeconf.hk website:
cat codeconf.md 
| awk ’{for(i=1; i<=NF; i++) words[tolower($i)]++}
END{for(w in words) print w, words[w]}’ 
| sort -k2 -nr
Most popular non stop-words: "Serverles" and "Android".
SEO winners: Davide Benvegnù and Richard Cohen.
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 10 / 20
AWK Examples 3
3. Find the longest line in the text (if-then-else example):
cat codeconf.md | awk ’{
l = length > length(l) ? $0 : l
}END{
print length(l), l
}’
146 * We believe that the Hong Kong developer community is skilled and
diverse, but that often these skills end up hidden away in big organisations.
Demo Time
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 11 / 20
Table of content
1. Problem Background
2. AWK Bootcamp in 5 min
3. Tabtools architecture and features
4. Stock example with NYSE:CS
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 12 / 20
Basic concepts
1. Special files format: tsv + header (meta information). Easy to convert and
autogenerate headers.
# Date Open High Low Close Volume
2014-02-21 84.35 84.45 83.9 83.45 17275.0
2. Python script manages file descriptors headers, convert column names to column
numbers and executes command line command, e.g. cat/tail/sort.
3. Heavy lifting goes to awk: tawk (map) and tgrp (map-reduce).
4. Based on command line expressions, it generates awk command and executes it
with incoming stream.
5. Visual sugar: tpretty and tplot.
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 13 / 20
Features
1. Streaming expressions: parametrized running/total sum/average/maximum2
.
2. Aggregators: first, last, min, max, count.
3. Modules: deque.
4. Build to self-contained 2k LOC portable python (2.7, 3.3+) scripts.
5. All together: zero-configuration extensible sql in command line. It is readable and
faster than a generic python/cython code (even after shedskin) and perl.
2
moving maximum in linear time with deque implemented on top of awk associative arrays.
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 14 / 20
Solutions comparison
Dell xps 15, 16Gb RAM, 8 CPUs:
Python PostgreSQL gawk mawk Tabtools
Read time 104.4 180.3 0 0 0
Q1: "max" time 0 15.2 22.8 12.2 12.8
Q2: "group + avg" time 0 5.8 30.5 12.6 26.63
Q3: "filter + sort" time 21.3 33.6 174.2 36.3 33.5
Total, sec. 125.7 243.9 227.5 61.1 72.9
3
Uses Ω(n log(n)) complexity instead of Ω(n). Could be improved.
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 15 / 20
Table of content
1. Problem Background
2. AWK Bootcamp in 5 min
3. Tabtools architecture and features
4. Stock example with NYSE:CS
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 16 / 20
Data description
Credit Suisse (NYSE:CS) daily stock data from Yahoo Finance: ’CS.csv’ + ’cs.tsv’.
cat cs.tsv | tgrp 
-k "Week=strftime("%U", DateEpoch(Date))" 
-g "Date=FIRST(Date)" 
-g "Open=FIRST(Open)" 
-g "High=MAX(High)" 
-g "Low=MIN(Low)" 
-g "Close=LAST(Close)" 
-g "Volume=SUM(Volume)" 
| ttail 
| tsrt -k Date:desc 
| tpretty
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 17 / 20
Demo Time
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 17 / 20
Demo: metrics
1. Moving Average for windown size 200 and 50.
2. Exponential moving average for window size 26 and 13.
3. MACD(26, 12, 9) histogram.
4. Moving maximum and minimum for window size 14.
5. Fast and Slow Stochastics.
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 18 / 20
Demo: plot (expected and actual)
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 19 / 20
Thank you!
Kirill Pavlov <k@p99.io>, Recruiter, Terminal 1.
GitHub: @pavlov99 | Presentation: 2017-10-22-codeconf | tabtools
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 20 / 20

More Related Content

What's hot

CHEP 2018: A Python upgrade to the GooFit package for parallel fitting
CHEP 2018: A Python upgrade to the GooFit package for parallel fittingCHEP 2018: A Python upgrade to the GooFit package for parallel fitting
CHEP 2018: A Python upgrade to the GooFit package for parallel fittingHenry Schreiner
 
Beyond Lists - Functional Kats Conf Dublin 2015
Beyond Lists - Functional Kats Conf Dublin 2015Beyond Lists - Functional Kats Conf Dublin 2015
Beyond Lists - Functional Kats Conf Dublin 2015Phillip Trelford
 
PyCon Poland 2016: Maintaining a high load Python project: typical mistakes
PyCon Poland 2016: Maintaining a high load Python project: typical mistakesPyCon Poland 2016: Maintaining a high load Python project: typical mistakes
PyCon Poland 2016: Maintaining a high load Python project: typical mistakesViach Kakovskyi
 
2019 IRIS-HEP AS workshop: Boost-histogram and hist
2019 IRIS-HEP AS workshop: Boost-histogram and hist2019 IRIS-HEP AS workshop: Boost-histogram and hist
2019 IRIS-HEP AS workshop: Boost-histogram and histHenry Schreiner
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
 
What is the best full text search engine for Python?
What is the best full text search engine for Python?What is the best full text search engine for Python?
What is the best full text search engine for Python?Andrii Soldatenko
 
DPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for AnalysisDPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for AnalysisHenry Schreiner
 
Building social network with Neo4j and Python
Building social network with Neo4j and PythonBuilding social network with Neo4j and Python
Building social network with Neo4j and PythonAndrii Soldatenko
 
Digital RSE: automated code quality checks - RSE group meeting
Digital RSE: automated code quality checks - RSE group meetingDigital RSE: automated code quality checks - RSE group meeting
Digital RSE: automated code quality checks - RSE group meetingHenry Schreiner
 
TPC-DS performance evaluation for JAQL and PIG queries - Andrii Vozniuk, Serg...
TPC-DS performance evaluation for JAQL and PIG queries - Andrii Vozniuk, Serg...TPC-DS performance evaluation for JAQL and PIG queries - Andrii Vozniuk, Serg...
TPC-DS performance evaluation for JAQL and PIG queries - Andrii Vozniuk, Serg...Andrii Vozniuk
 
Querying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonQuerying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonDaniel Rodriguez
 
Austin Python Meetup 2017: How to Stop Worrying and Start a Project with Pyth...
Austin Python Meetup 2017: How to Stop Worrying and Start a Project with Pyth...Austin Python Meetup 2017: How to Stop Worrying and Start a Project with Pyth...
Austin Python Meetup 2017: How to Stop Worrying and Start a Project with Pyth...Viach Kakovskyi
 
PyCon Russian 2015 - Dive into full text search with python.
PyCon Russian 2015 - Dive into full text search with python.PyCon Russian 2015 - Dive into full text search with python.
PyCon Russian 2015 - Dive into full text search with python.Andrii Soldatenko
 
Weather of the Century: Visualization
Weather of the Century: VisualizationWeather of the Century: Visualization
Weather of the Century: VisualizationMongoDB
 
The Weather of the Century Part 3: Visualization
The Weather of the Century Part 3: VisualizationThe Weather of the Century Part 3: Visualization
The Weather of the Century Part 3: VisualizationMongoDB
 
The Weather of the Century
The Weather of the CenturyThe Weather of the Century
The Weather of the CenturyMongoDB
 

What's hot (20)

R and C++
R and C++R and C++
R and C++
 
R and cpp
R and cppR and cpp
R and cpp
 
CHEP 2018: A Python upgrade to the GooFit package for parallel fitting
CHEP 2018: A Python upgrade to the GooFit package for parallel fittingCHEP 2018: A Python upgrade to the GooFit package for parallel fitting
CHEP 2018: A Python upgrade to the GooFit package for parallel fitting
 
Beyond Lists - Functional Kats Conf Dublin 2015
Beyond Lists - Functional Kats Conf Dublin 2015Beyond Lists - Functional Kats Conf Dublin 2015
Beyond Lists - Functional Kats Conf Dublin 2015
 
PyCon Poland 2016: Maintaining a high load Python project: typical mistakes
PyCon Poland 2016: Maintaining a high load Python project: typical mistakesPyCon Poland 2016: Maintaining a high load Python project: typical mistakes
PyCon Poland 2016: Maintaining a high load Python project: typical mistakes
 
2019 IRIS-HEP AS workshop: Boost-histogram and hist
2019 IRIS-HEP AS workshop: Boost-histogram and hist2019 IRIS-HEP AS workshop: Boost-histogram and hist
2019 IRIS-HEP AS workshop: Boost-histogram and hist
 
Influxdb and time series data
Influxdb and time series dataInfluxdb and time series data
Influxdb and time series data
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
Pybind11 - SciPy 2021
Pybind11 - SciPy 2021Pybind11 - SciPy 2021
Pybind11 - SciPy 2021
 
What is the best full text search engine for Python?
What is the best full text search engine for Python?What is the best full text search engine for Python?
What is the best full text search engine for Python?
 
DPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for AnalysisDPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for Analysis
 
Building social network with Neo4j and Python
Building social network with Neo4j and PythonBuilding social network with Neo4j and Python
Building social network with Neo4j and Python
 
Digital RSE: automated code quality checks - RSE group meeting
Digital RSE: automated code quality checks - RSE group meetingDigital RSE: automated code quality checks - RSE group meeting
Digital RSE: automated code quality checks - RSE group meeting
 
TPC-DS performance evaluation for JAQL and PIG queries - Andrii Vozniuk, Serg...
TPC-DS performance evaluation for JAQL and PIG queries - Andrii Vozniuk, Serg...TPC-DS performance evaluation for JAQL and PIG queries - Andrii Vozniuk, Serg...
TPC-DS performance evaluation for JAQL and PIG queries - Andrii Vozniuk, Serg...
 
Querying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonQuerying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with python
 
Austin Python Meetup 2017: How to Stop Worrying and Start a Project with Pyth...
Austin Python Meetup 2017: How to Stop Worrying and Start a Project with Pyth...Austin Python Meetup 2017: How to Stop Worrying and Start a Project with Pyth...
Austin Python Meetup 2017: How to Stop Worrying and Start a Project with Pyth...
 
PyCon Russian 2015 - Dive into full text search with python.
PyCon Russian 2015 - Dive into full text search with python.PyCon Russian 2015 - Dive into full text search with python.
PyCon Russian 2015 - Dive into full text search with python.
 
Weather of the Century: Visualization
Weather of the Century: VisualizationWeather of the Century: Visualization
Weather of the Century: Visualization
 
The Weather of the Century Part 3: Visualization
The Weather of the Century Part 3: VisualizationThe Weather of the Century Part 3: Visualization
The Weather of the Century Part 3: Visualization
 
The Weather of the Century
The Weather of the CenturyThe Weather of the Century
The Weather of the Century
 

Similar to Boosting command line experience with python and awk

Introducing TiDB [Delivered: 09/25/18 at Portland Cloud Native Meetup]
Introducing TiDB [Delivered: 09/25/18 at Portland Cloud Native Meetup]Introducing TiDB [Delivered: 09/25/18 at Portland Cloud Native Meetup]
Introducing TiDB [Delivered: 09/25/18 at Portland Cloud Native Meetup]Kevin Xu
 
A New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKA New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKShu-Jeng Hsieh
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAPEDB
 
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]Animesh Singh
 
C++ Windows Forms L01 - Intro
C++ Windows Forms L01 - IntroC++ Windows Forms L01 - Intro
C++ Windows Forms L01 - IntroMohammad Shaker
 
Princeton RSE Peer network first meeting
Princeton RSE Peer network first meetingPrinceton RSE Peer network first meeting
Princeton RSE Peer network first meetingHenry Schreiner
 
MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB
 
SC20 SYCL and C++ Birds of a Feather 19th Nov 2020
SC20 SYCL and C++ Birds of a Feather 19th Nov 2020SC20 SYCL and C++ Birds of a Feather 19th Nov 2020
SC20 SYCL and C++ Birds of a Feather 19th Nov 2020rodburns
 
Beam_installation123456785678777777.pptx
Beam_installation123456785678777777.pptxBeam_installation123456785678777777.pptx
Beam_installation123456785678777777.pptxSravanthiVaka1
 
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019corehard_by
 
Mock Hell PyCon DE and PyData Berlin 2019
Mock Hell PyCon DE and PyData Berlin 2019Mock Hell PyCon DE and PyData Berlin 2019
Mock Hell PyCon DE and PyData Berlin 2019Edwin Jung
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Stamatis Zampetakis
 

Similar to Boosting command line experience with python and awk (20)

Introducing TiDB [Delivered: 09/25/18 at Portland Cloud Native Meetup]
Introducing TiDB [Delivered: 09/25/18 at Portland Cloud Native Meetup]Introducing TiDB [Delivered: 09/25/18 at Portland Cloud Native Meetup]
Introducing TiDB [Delivered: 09/25/18 at Portland Cloud Native Meetup]
 
A New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKA New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDK
 
tit
tittit
tit
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
 
Development Workflows on AWS
Development Workflows on AWSDevelopment Workflows on AWS
Development Workflows on AWS
 
Amazed by AWS Series #4
Amazed by AWS Series #4Amazed by AWS Series #4
Amazed by AWS Series #4
 
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
 
Big Data Tools in AWS
Big Data Tools in AWSBig Data Tools in AWS
Big Data Tools in AWS
 
C++ Windows Forms L01 - Intro
C++ Windows Forms L01 - IntroC++ Windows Forms L01 - Intro
C++ Windows Forms L01 - Intro
 
Princeton RSE Peer network first meeting
Princeton RSE Peer network first meetingPrinceton RSE Peer network first meeting
Princeton RSE Peer network first meeting
 
MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
 
SC20 SYCL and C++ Birds of a Feather 19th Nov 2020
SC20 SYCL and C++ Birds of a Feather 19th Nov 2020SC20 SYCL and C++ Birds of a Feather 19th Nov 2020
SC20 SYCL and C++ Birds of a Feather 19th Nov 2020
 
2012: ql.io and Node.js
2012: ql.io and Node.js2012: ql.io and Node.js
2012: ql.io and Node.js
 
Beam_installation123456785678777777.pptx
Beam_installation123456785678777777.pptxBeam_installation123456785678777777.pptx
Beam_installation123456785678777777.pptx
 
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
 
(C)NodeJS
(C)NodeJS(C)NodeJS
(C)NodeJS
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
Mock Hell PyCon DE and PyData Berlin 2019
Mock Hell PyCon DE and PyData Berlin 2019Mock Hell PyCon DE and PyData Berlin 2019
Mock Hell PyCon DE and PyData Berlin 2019
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 

Recently uploaded

chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
buds n tech IT solutions
buds n  tech IT                solutionsbuds n  tech IT                solutions
buds n tech IT solutionsmonugehlot87
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 

Recently uploaded (20)

chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
buds n tech IT solutions
buds n  tech IT                solutionsbuds n  tech IT                solutions
buds n tech IT solutions
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 

Boosting command line experience with python and awk

  • 1. Boosting command line experience Python meets AWK Kirill Pavlov Technical Recruiter, Terminal 1 October 22, 2017 Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 1 / 20
  • 2. README.md A lot of Python/AWK examples here. Source code and slides available online. At the end: build a stock trading system and check NYSE:CS. Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 2 / 20
  • 3. Table of content 1. Problem Background 2. AWK Bootcamp in 5 min 3. Tabtools architecture and features 4. Stock example with NYSE:CS Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 3 / 20
  • 4. Table of content 1. Problem Background 2. AWK Bootcamp in 5 min 3. Tabtools architecture and features 4. Stock example with NYSE:CS Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 4 / 20
  • 5. Background Yandex, year 2010. Hadoop was not widly adopted. 10Gb of archived ads data daily: time, ad_id, site_id, clicks. Task: daily data aggregation (simple functions: group by, sum, join) and feature generation for further machine learning classification. Solution: released a set of command line scripts. Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 5 / 20
  • 6. Example This presentation uses UCI machine learning Higgs boson data: 11M objects, 28 attributes, 7.5Gb unarchived. Questions: 1. What is the maximum value of lepton_eta? 2. What is the average value of lepton_phi by class 0 and 1? 3. Filter objects with m_jj > 0.75 (8.9M objects) and sort them by m_wbb. Solutions: 1. In-memory Python with Pandas. 2. Databse SQL queries (PostgreSQL and Docker). 3. Command line with AWK. Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 6 / 20
  • 7. Demo Time Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 6 / 20
  • 8. Reality Check It’s not as agile as it seems. You work inside the company network. 1. You don’t have sudo rights and your admin does not want to install anything for you. Like no database or user privileges, etc. 2. The server does not have GitHub/Internet access and the only deployment possible is Java JARs or C/C++/etc. So, no NodeJS/Python packages. And of course no R/Matlab/Excel. 3. Get better at command line tools ;) Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 7 / 20
  • 9. Table of content 1. Problem Background 2. AWK Bootcamp in 5 min 3. Tabtools architecture and features 4. Stock example with NYSE:CS Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 8 / 20
  • 10. Basic concepts 1. AWK1 — language for streaming columnar data processing. Standard in unix-like OS. 2. Actual AWK is outdated, use mawk (fast) or gawk (flexible). 3. Limited data structures: strings, associative arrays (hash maps) and regexps. 4. Built-in variables: $1, $2, . . . ($0 is entire record) NR - number of processe lines (records) NF - number of columns (fields) 5. Use vars without declaration. Default values are 0s. One liners. Hipster friendly. 1 Tutorial by Bruce Barnett. Careful, he writes his blog in txt Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 9 / 20
  • 11. AWK Examples 1 & 2 1. Count number of words and lines at codeconf.hk: cat codeconf.md | awk ’{w += NF}END{print NR, w}’ 370 1445 2. Most popular words on codeconf.hk website: cat codeconf.md | awk ’{for(i=1; i<=NF; i++) words[tolower($i)]++} END{for(w in words) print w, words[w]}’ | sort -k2 -nr Most popular non stop-words: "Serverles" and "Android". SEO winners: Davide Benvegnù and Richard Cohen. Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 10 / 20
  • 12.
  • 13.
  • 14. AWK Examples 3 3. Find the longest line in the text (if-then-else example): cat codeconf.md | awk ’{ l = length > length(l) ? $0 : l }END{ print length(l), l }’ 146 * We believe that the Hong Kong developer community is skilled and diverse, but that often these skills end up hidden away in big organisations. Demo Time Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 11 / 20
  • 15. Table of content 1. Problem Background 2. AWK Bootcamp in 5 min 3. Tabtools architecture and features 4. Stock example with NYSE:CS Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 12 / 20
  • 16. Basic concepts 1. Special files format: tsv + header (meta information). Easy to convert and autogenerate headers. # Date Open High Low Close Volume 2014-02-21 84.35 84.45 83.9 83.45 17275.0 2. Python script manages file descriptors headers, convert column names to column numbers and executes command line command, e.g. cat/tail/sort. 3. Heavy lifting goes to awk: tawk (map) and tgrp (map-reduce). 4. Based on command line expressions, it generates awk command and executes it with incoming stream. 5. Visual sugar: tpretty and tplot. Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 13 / 20
  • 17. Features 1. Streaming expressions: parametrized running/total sum/average/maximum2 . 2. Aggregators: first, last, min, max, count. 3. Modules: deque. 4. Build to self-contained 2k LOC portable python (2.7, 3.3+) scripts. 5. All together: zero-configuration extensible sql in command line. It is readable and faster than a generic python/cython code (even after shedskin) and perl. 2 moving maximum in linear time with deque implemented on top of awk associative arrays. Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 14 / 20
  • 18. Solutions comparison Dell xps 15, 16Gb RAM, 8 CPUs: Python PostgreSQL gawk mawk Tabtools Read time 104.4 180.3 0 0 0 Q1: "max" time 0 15.2 22.8 12.2 12.8 Q2: "group + avg" time 0 5.8 30.5 12.6 26.63 Q3: "filter + sort" time 21.3 33.6 174.2 36.3 33.5 Total, sec. 125.7 243.9 227.5 61.1 72.9 3 Uses Ω(n log(n)) complexity instead of Ω(n). Could be improved. Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 15 / 20
  • 19. Table of content 1. Problem Background 2. AWK Bootcamp in 5 min 3. Tabtools architecture and features 4. Stock example with NYSE:CS Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 16 / 20
  • 20. Data description Credit Suisse (NYSE:CS) daily stock data from Yahoo Finance: ’CS.csv’ + ’cs.tsv’. cat cs.tsv | tgrp -k "Week=strftime("%U", DateEpoch(Date))" -g "Date=FIRST(Date)" -g "Open=FIRST(Open)" -g "High=MAX(High)" -g "Low=MIN(Low)" -g "Close=LAST(Close)" -g "Volume=SUM(Volume)" | ttail | tsrt -k Date:desc | tpretty Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 17 / 20
  • 21. Demo Time Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 17 / 20
  • 22. Demo: metrics 1. Moving Average for windown size 200 and 50. 2. Exponential moving average for window size 26 and 13. 3. MACD(26, 12, 9) histogram. 4. Moving maximum and minimum for window size 14. 5. Fast and Slow Stochastics. Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 18 / 20
  • 23. Demo: plot (expected and actual) Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 19 / 20
  • 24. Thank you! Kirill Pavlov <k@p99.io>, Recruiter, Terminal 1. GitHub: @pavlov99 | Presentation: 2017-10-22-codeconf | tabtools Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 20 / 20