Despite the existence of data analysis tools such as R, SQL, Excel, and others, they are still insufficient to cope with today's big-data analysis needs.
The author proposes a CUI (Character User Interface) toolset with dozens of functions to neatly handle tabular data in TSV (Tab-Separated Values) files.
It implements many basic and useful functions that have not been implemented in existing software. Each function borrows ideas from the Unix philosophy and covers the most frequent pre-analysis tasks during the initial exploratory stage of data analysis projects.
It also greatly speeds up basic analysis tasks, such as drawing cross tables, Venn diagrams, etc., whereas existing software inevitably requires rather complicated programming and debugging even for these basic tasks.
Here, tabular data mainly means TSV (Tab-Separated Values) files, as well as other CSV (Comma-Separated Values)-type files, which are all widely used for storing data and are suitable for data analysis.
1. A Hacking Toolset
for Big Tabular Files
Toshiyuki Shimono
Uhuru Corporation
IEEE BigData 2016, Washington DC, Dec 5.
2. First of all..
3. What is my software like?
• Handles tabular data files (esp. TSV)
in a neat and advanced style.
• A bunch of more than 50 functions in CUI.
• A supplement to other analysis software,
including Excel, SQL, R, Python, etc.
• Speed comparable to existing software.
4. As a result,
• Data analysis projects become
super swift at
initial deciphering and pre-processing,
thanks to 50+ useful functions.
• Each function, being CUI,
has several command-line options,
often handles the heading line well,
often utilizes color properly,
and often handles the Ctrl+C signal wisely.
• Coloring is utilized a lot.
For example, periodic coloring lets you easily read
the big numbers of this big-data age.
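As an illustration of the idea (a Python sketch, not the toolset's actual code), periodic coloring can be done by splitting a number into 3-digit groups and alternating ANSI colors:

  # Illustrative sketch: alternate ANSI colors over 3-digit groups
  # so long numbers such as 1234567890 are readable at a glance.
  def color_number(s, colors=("\033[33m", "\033[36m"), reset="\033[0m"):
      groups = []
      i = len(s)
      while i > 0:
          groups.append(s[max(0, i - 3):i])
          i -= 3
      groups.reverse()
      return "".join(colors[k % 2] + g for k, g in enumerate(groups)) + reset

  print(color_number("1234567890"))  # "1", "234", "567", "890" in alternating colors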
6. Why have I created it?
• Reducing the labor of data pre-processing
is worth a lot!
• You have given up many analyses and hypothesis
verifications just because they take time.
• But this situation would change dramatically!
• The author noticed that:
• Debugging/testing of programs takes a lot of time.
• Thus, if these routines are neatly programmed and
prepared beforehand, the change would happen.
• So I designed a kind of philosophy to implement those
routines, and am programming accordingly.
• And swift data analysis is actually in high
demand!
7. Design philosophy, in part.
• Each command works independently.
• The name of a command is a combination
of at most 2 English words, with some exceptions.
• Swiftness is a key factor in many aspects:
• To recall the existence and the name of a
function.
• Ease of understanding how a program works.
• ..
8. Comparison with other software.
• Unix/Linux :
• Not mature enough for tabular files.
• Note: the Unix philosophy is worthwhile to utilize.
• Excel :
• Problems exist in transparency, in reproducibility,
and in coordination with other software.
• SQL :
• Requires elaborate designing of tables.
• R, Python :
• Require loading time when files are many and large.
10. Examples of functions.
• Venn diagram (complex-condition cardinalities)
• Cross table (2-way contingency table)
• Column extraction (simpler than AWK)
• Table look-up (cf. SQL join, Excel VLOOKUP)
• Key-value handling (e.g. sameness check for 2 KV-tables)
• Random sampling, shuffling (line-wise)
• Quantiles, integrated histogram
• etc.
11. Other useful functions
• Putting colors on character strings
12. Common sub-functions
• Self-manual, also with a shorter version.
• Heading-line handling (omitting, utilizing).
• Showing the ongoing result on Ctrl+C
interruption.
• Decimal floating-point calculation if appropriate.
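The Ctrl+C behavior, for instance, could be sketched like this (an illustrative Python sketch of the idea applied to a line-counting filter, not the toolset's actual Perl code):

  # Sketch of the Ctrl+C sub-function: on SIGINT, print the counts
  # accumulated so far instead of dying silently.
  import signal, sys
  from collections import Counter

  counts = Counter()

  def show_partial(signum, frame):
      for key, n in counts.most_common(10):
          print(f"{n}\t{key}", file=sys.stderr)
      sys.exit(130)  # conventional exit code for SIGINT

  signal.signal(signal.SIGINT, show_partial)

  for line in sys.stdin:
      counts[line.rstrip("\n")] += 1

  for key, n in counts.most_common():
      print(f"{n}\t{key}")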
14. colsummary : grasping what each column is.
1st (white) : column number
2nd (green) : how many distinct values are in the column
3rd (blue) : numerical average (optionally can be omitted)
4th (yellow) : column name
5th (white) : value range ( min .. max )
6th (gray) : most frequent values ( the quantity can be specified )
7th (green) : frequencies ( highest .. lowest ; x means multiplicity )
When you get new data, deciphering it starts.
Extracting the information above is
fairly enough to go on to the next analysis step.
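An illustrative Python equivalent of these fields (a rough sketch; the real command surely handles non-numeric columns, numeric ranges, and ragged rows more carefully):

  # Per-column summary of a TSV with a heading line.
  # Assumes rectangular rows; the range shown here is lexicographic.
  import sys
  from collections import Counter

  rows = [line.rstrip("\n").split("\t") for line in sys.stdin]
  header, body = rows[0], rows[1:]
  for i, name in enumerate(header):
      col = [r[i] for r in body]
      freq = Counter(col)
      nums = [float(v) for v in col if v.replace(".", "", 1).lstrip("-").isdigit()]
      avg = sum(nums) / len(nums) if nums else ""
      print(i + 1, len(freq), avg, name,
            f"{min(col)}..{max(col)}", freq.most_common(3), sep="\t")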
15. colsummary command
[Screenshot: colsummary output for olympic.tsv, the numbers of medals Japan won in past Olympic Games (1912-2016). Annotations point to the number of distinct values, the column name, the value range, the numeric average, the frequent values, and the frequencies of the most and the fewest values; the multiplicity of each frequency appears after "x".]
The biggest hurdle in data deciphering quickly goes away with this command.
16. freq : how many lines are which string?
Functionally equivalent to Unix/Linux " sort | uniq -c " ,
but much more speedy in computation.
Note that many sub-functions, such as specifying the output order,
can be implemented, so the author has already implemented them.
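The core of such a counter, sketched in Python (illustrative, not the tool's code):

  # One hash-counting pass, no global sort -- this is essentially
  # where the speed advantage over "sort | uniq -c" comes from.
  import sys
  from collections import Counter

  counts = Counter(line.rstrip("\n") for line in sys.stdin)
  for value, n in counts.most_common():  # one possible output order
      print(f"{n}\t{value}")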
17. 2-way contingency table
• Given a 2-column table, counting the number of lines
consisting of (ui, vj) and tabulating them often occurs.
• You can do it with the Excel "Pivot Table" function.
• Note that it is practically impossible to perform with a single
plain SQL query, since the output columns depend on the data.
18. 2-way contingency table (crosstable)
• Almost impossible with a single SQL query.
• Excel pivot tables are actually tedious/error-
prone.
19. crosstable (2-way contingency table)
Provides the cross table from a
2-column table.
(Extract the 3rd and 4th columns) (Add blue color to "0")
You may draw many cross tables from one table of data.
The crosstable command provides cross tables very quickly.
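What crosstable computes can be sketched as follows (illustrative Python, not the tool's code): count the (u, v) pairs, then lay the counts out on a grid.

  # From 2 columns of (u, v) pairs to a 2-way contingency table.
  # Assumes well-formed 2-column TSV input.
  import sys
  from collections import Counter

  pairs = [tuple(line.rstrip("\n").split("\t")[:2]) for line in sys.stdin]
  count = Counter(pairs)
  us = sorted({u for u, _ in pairs})
  vs = sorted({v for _, v in pairs})
  print("", *vs, sep="\t")                      # header row
  for u in us:
      print(u, *(count[(u, v)] for v in vs), sep="\t")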
20. cols : extracting columns
• Easier than AWK and Unix cut.
cols -t 2 ⇒ moves the 2nd column to the rightmost.
cols -h 3 ⇒ moves the 3rd column to the leftmost.
cols -p 5,9..7 ⇒ shows the 5th, 9th, 8th, and 7th columns.
cols -d 6..9 ⇒ shows all except the 6th, 7th, 8th, and 9th columns.
-d stands for deleting, -p for printing,
-h for head, -t for tail.
21. "cols" – column extraction.
The existing commands are not enough:
cut : you cannot realize the intention of "cut -f 2,1" (cut never reorders columns).
awk : a one-line program is easy but insufficient.
"cols" has many functions :
Range specification : "1..5" -> 1,2,3,4,5. "5..1" -> 5,4,3,2,1.
Negative specification : -N means the N-th column from the rightmost.
Deleting columns : -d 4 -> omits the 4th column.
Head/tail : -h 3 -> moves the 3rd column to the leftmost.
You can also specify columns by name, according to the 1st line.
Speedy enough. ( awk > cols > cut )
The author also prepared an awk command-sentence generator
because awk is twice as fast as "cols".
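The range notation can be made concrete with a small sketch (a hypothetical parse_spec helper, written only for illustration):

  # Sketch of the range notation: "5,9..7" -> [5, 9, 8, 7]; "-1" -> last column.
  def parse_spec(spec, ncols):
      out = []
      for part in spec.split(","):
          if ".." in part:
              a, b = (int(x) for x in part.split(".."))
              step = 1 if a <= b else -1
              out.extend(range(a, b + step, step))
          else:
              out.append(int(part))
      # negative N counts from the rightmost column
      return [n if n > 0 else ncols + 1 + n for n in out]

  print(parse_spec("5,9..7", 10))  # [5, 9, 8, 7]
  print(parse_spec("-1", 10))      # [10]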
22. Venn diagrams
Determining the cardinality of all the regions
is important before the main analysis.
23. venn4 (cardinalities for complex conditions)
[Diagram: a 4-set Venn diagram with regions A, B, C, D.]
• The cardinalities of the data sets concerned are important.
• As a technique, determining the cardinalities of the records of your data files at
the beginning greatly helps the following tasks, because what one of
your data files is can often be guessed from its number of records.
• The functions of venn4 are actually difficult to reproduce by hand. Try it if you doubt it!
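What venn4 computes can be sketched like this (illustrative Python on toy sets, not the tool's code): map each key to a 4-bit membership pattern, then count the patterns.

  # Cardinality of every region among 4 key sets A, B, C, D.
  from collections import Counter
  from itertools import chain

  def venn_counts(*sets):
      pattern = Counter()
      for key in set(chain(*sets)):
          bits = "".join("1" if key in s else "0" for s in sets)
          pattern[bits] += 1
      return pattern

  A, B = {1, 2, 3}, {2, 3, 4}
  C, D = {3, 4, 5}, {1, 5}
  for bits, n in sorted(venn_counts(A, B, C, D).items()):
      print(bits, n)  # e.g. "1110 1" means the region in A, B, C but not D holds 1 key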
24. Comparison with other packages
Primary usage :
Excel = spreadsheet ; R = statistics ; Pandas = matrices etc. ; SQL = DB manipulation ; Hadoop = distributed DB ; Unix = file manipulation ; Our soft = initial deciphering and pre-processing.
Ease of use :
Excel ◎ ; R ○ ; Pandas △ ; SQL △ ; Hadoop × ; Unix ○ ; Our soft ○.
Transparency :
Excel × ; R ○ ; Pandas ○ ; SQL depends ; Hadoop depends ; Unix ○ ; Our soft ○.
Forward compatibility :
Excel △ ; R △ ; Pandas too new ; SQL ○-◎ ; Hadoop too new ; Unix ◎ ; Our soft too new.
Processing speed :
Excel × ; R skill required ; Pandas ○ ; SQL △ ; Hadoop ◎ in a narrow scope ; Unix ○ ; Our soft ○.
Large-size data handling :
Excel × ; R skill required ; Pandas ○ ; SQL specific ; Hadoop ◎ ; Unix specific ; Our soft ◎.
High-quantity file processing :
Excel × ; R △ ; Pandas -- ; SQL -- ; Unix ○ ; Our soft ◎.
Small-data handling :
Excel ◎ ; R ◎ ; Pandas ○ ; SQL needs knowing SQL grammar ; Hadoop × ; Unix ◎ ; Our soft ◎.
Column selection :
Excel (alphabetically) ; R name & number ; Pandas similar to R ; SQL column name ; Hadoop column name ; Unix column number (cut/AWK) ; Our soft name/number, also by range.
◎ : very good ; ○ : good ; △ : moderate ; × : bad.
25. How you will use it.
1. Get the data, or get access to the data.
2. Transform it into TSV if necessary.
3. Manipulate it with operations such as
cross tabulation,
Venn diagrams,
extracting columns/lines that you like, or randomly,
…
4. Accumulate the findings,
e.g. copy/paste into Excel or PowerPoint.
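For step 2, a minimal TSV conversion could look like this (an illustrative Python sketch; real data may need more care with embedded tabs and newlines):

  # Minimal CSV -> TSV conversion, respecting quoted commas.
  import csv, sys

  for row in csv.reader(sys.stdin):
      # Replace tabs/newlines inside fields so each record stays on one line.
      print("\t".join(f.replace("\t", " ").replace("\n", " ") for f in row))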
28. Grasping tabular data.
• By taking a look with an editor.
• By taking a look with the Unix less command.
• By reading the description documents.
• Extracting lines 1, 2, 4, 8, 16, 32, .. or 1, 10, 100, ..
• Randomly sampling lines with some low
probability.
• ..
Anyway, checking whether the data is suitable
enough for some specific operation is
fairly difficult. It requires a lot of checking.
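Both line-extraction tricks fit in a few lines (an illustrative sketch, not the toolset's code):

  # Two quick peeks at a big file: lines 1,2,4,8,16,... to stdout,
  # and a ~1% Bernoulli sample of lines to stderr.
  import random, sys

  target = 1
  for i, line in enumerate(sys.stdin, start=1):
      if i == target:             # exponentially spaced extraction
          sys.stdout.write(f"{i}\t{line}")
          target *= 2
      if random.random() < 0.01:  # random sampling with low probability
          sys.stderr.write(line)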
29. Prepared function groups.
• Transforming formats
• Grasping what each given tabular data set is like
• Operating column-specific functions
• Treating data as a 2-column key + value table
• Combining tables, such as looking up
• Normalizing functions of relational databases
• Seeing/comparing the distribution(s) of
numbers
30. Interface design policy
• Each command name consists of 1 or 2 words.
• Option switches are utilized a lot:
-a -b .. -z ← various minor additional functions.
-A -B .. -Z ← great mode changes in functions.
-= ← specifies that the beginning line is treated
properly as a heading line, i.e. the list of column
names.
-~ ← specifies that some small function is turned over.
• Swiftness is the important factor:
To recall which functions exist and what they are named.
To understand what will happen, for users.
To be as speedy as other Unix commands.
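To make the convention concrete, here is a hypothetical parser for these switches (illustration only; the real tools' option handling is not shown in the slides):

  # Hypothetical sketch of the switch convention.
  import sys

  heading, toggle, mode, minor = False, False, None, set()
  for arg in sys.argv[1:]:
      if arg == "-=":
          heading = True            # treat line 1 as the list of column names
      elif arg == "-~":
          toggle = True             # turn over some small function
      elif len(arg) == 2 and arg[0] == "-" and arg[1].isupper():
          mode = arg[1]             # -A .. -Z : great mode change
      elif len(arg) == 2 and arg[0] == "-" and arg[1].islower():
          minor.add(arg[1])         # -a .. -z : minor additional function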
33. Ex-2) silhouette (integrated histogram)
1. Suppose you are interested in the distribution
of the ratio of the follower count to the
following count of each Twitter account.
2. To properly understand the ratio distribution,
stratifying by the follower count of each
account is also considered:
≧5000, ≧500 among the rest, ≧50 ditto, and the others.
3. The silhouette command, given the data,
produces a PDF file as shown on the right.
4. You can read that, with 48% probability, an
account with ≧5000 followers has more than
2 times as many followers as the number it
follows.
(On average, 95% of Twitter accounts do not.)
[Plot: integrated histograms of the ratio per stratum; horizontal axis from 0% to 100%.]
The command name
"silhouette" comes
from the image of people
of various heights
aligned in order of height;
see the curve
connecting their heads.
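The "integrated histogram" is essentially an empirical cumulative distribution; a minimal sketch (illustrative Python on toy data, not the silhouette command itself):

  # Sort the values and pair each with its rank/N: the empirical CDF,
  # i.e. the curve "connecting the heads".
  def ecdf(values):
      xs = sorted(values)
      n = len(xs)
      return [(x, (k + 1) / n) for k, x in enumerate(xs)]

  ratios = [0.5, 1.0, 2.5, 3.0, 8.0]  # follower/following ratios (toy data)
  for x, p in ecdf(ratios):
      print(f"{p:7.0%}  {x}")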
34. Features and summaries
Useful functions for analyses in TSV,
especially for deciphering and pre-processing.
Handles a heading line properly if it exists.
Speed comparable to existing software.
Can be used on any machine with Perl installed.
The ongoing result is displayed on Ctrl+C.
The software will appear on GitHub.