SlideShare a Scribd company logo
1 of 99
Download to read offline
two visualization tools
     for log files



                       Eugene Kirpichov, 2010
            ekirpichev@mirantis.com, ekirpichov@gmail.com


 Download: http://jkff.info/presentations/two-visualization-tools.pdf
    Slideshare: http://slideshare.net/jkff/two-visualization-tools
Plan
• intro / philosophy
• splot, tplot
  – purpose
  – basics
  – plenty of examples
  – options
• installation
Intro
• I wanted to visualize the behavior of my code
• I only had logs
• I found no existing tools to be good enough
  – suggestions welcome
So I wrote them
tplot (for timeplot)
      http://github.com/jkff/timeplot

splot (for stateplot)
      http://github.com/jkff/splot

P.S. both are in Haskell – “for fun”, but turned out to pay its weight in gold

All new features took a couple lines of code and usually worked immediately

This helped when I really needed a feature quickly
Philosophy


- I want to see X!
- If X is in the log, you’re almost done.

 Easily map logs to tools’ input, let tools do the rest.
Philosophy
Do not depend on log format
  – cat log | text-processing oneliner | plot



Do not depend on domain
  – Visualize arbitrary “signals”
Mode of operation


cat log.txt | awk ‘/…/{something simple}’
            | tplot some parameters …


  You can use perl or whatever, but awk is really freakin’ damn simple.
               You can learn enough of awk in 1 minute.
             /REGEX/{print something}, and $n is n-th field.


                  Actually awk is very powerful, but simple things are simple
Philosophy


The “awk ,something simple-” part is really simple
   – Adapt to typical log messages
Philosophy
What are typical log messages?
    –   An error happened
    –   The request took 100
    –   Machine UNIT001 started/finished reading data
    –   The current temperature is 96 F
    –   Search returned 974 results
    –   URL responded: NOT_FOUND
    –   …
Philosophy
What are typical questions?
    – Show me the big picture!
    – Show me X over time!
    – Analyze X over time!
        • Percentiles
        • Buckets
    – How did X and Y behave together?
    – …
Log                UNIT006 2010-11-13 06:23:27.975 P5872 Info Begin 9a444fde86544c7195



                        awk ‘{time=$2 “ “ $3; core=$1 “          “   $4} 
            One-liner        /Begin /       {print time          "   >" core " blue"} 
                             /GetCommonData/{print time          "   >" core " orange"} 
                             /End /         {print time          "   <" core}‘ 



                        2010-12-07 13:52:44.738 >UNIT01-P2368 blue
    Trace               2010-12-07 13:52:44.912 >UNIT01-P2368 orange
                        2010-12-07 13:52:44.912 <UNIT01-P2368


            One-liner   splot -bh 1 -w 1400 -h 800 -expire 10000




Pretty picture
Tool 1 - splot
        “state plot”

you’ve got a zillion workers
they all work on something
  what is the big picture?
1 actor = 1 thread
1 actor = 1 cluster core
How to use it?
Usage: splot [-o PNGFILE] [-w WIDTH] [-h HEIGHT] [-bh BARHEIGHT] [-tf TIMEFORMAT]
             [-tickInterval TICKINTERVAL]
  -o PNGFILE    - filename to which the output will be written in PNG format.
                  If omitted, it will be shown in a window.
  -w, -h        - width and height of the resulting picture. Default 640x480.
  -bh           - height of the bar depicting each individual process. Default 5 pixels.
                  Use 1 or so if you have a lot of them.
  -tf           - time format, as in http://linux.die.net/man/3/strptime but with
                  fractional seconds supported via %OS - will parse 12.4039 or 12,4039
  -tickInterval - ticks on the X axis will be this often (in millis).
  -sort SORT    - sort tracks by SORT, where: 'time' - sort by time of first event,
                  'name' - sort by track name

Input is read from stdin. Example input (speaks for itself):
2010-10-21 16:45:09,431 >foo green
2010-10-21 16:45:09,541 >bar green
2010-10-21 16:45:10,631 >foo yellow
2010-10-21 16:45:10,725 >foo red
2010-10-21 16:45:10,930 >bar blue
2010-10-21 16:45:11,322 <foo
2010-10-21 16:45:12,508 <bar

'>FOO COLOR' means 'start a bar of color COLOR on track FOO',
'<FOO' means 'end the current bar for FOO'.
Log               UNIT006 2010-11-13 06:23:27.975 P5872 Info Begin 9a444fde86544c7195



                       awk ‘{time=$2 “ “ $3; core=$1 “          “   $4} 
           One-liner        /Begin /       {print time          "   >" core " blue"} 
                            /GetCommonData/{print time          "   >" core " orange"} 
                            /End /         {print time          "   <" core}‘ 



                       2010-12-07 13:52:44.738 >UNIT01-P2368 blue
Trace (stdin)          2010-12-07 13:52:44.912 >UNIT01-P2368 orange
                       2010-12-07 13:52:44.912 <UNIT01-P2368


           One-liner   splot -bh 1 -w 1400 -h 800 -expire 10000




Pretty picture
Trace format
                            Track          Color

2010-12-15 21:10:34.938 >UNIT65 blue
          Time                      Rise


                            Track

2010-12-15 21:10:34.938 <UNIT65
          Time              Fall
Trace format
TIME >ACTOR COLOR
TIME <ACTOR

2010-12-07 13:52:44.738 >UNIT01-P2368 blue
2010-12-07 13:52:44.912 >UNIT01-P2368 orange
2010-12-07 13:52:44.912 <UNIT01-P2368
Log               UNIT006 2010-11-13 06:23:27.975 P5872 Info Begin 9a444fde86544c7195



                       awk ‘{time=$2 “ “ $3; core=$1 “          “   $4} 
           One-liner        /Begin /       {print time          "   >" core " blue"} 
                            /GetCommonData/{print time          "   >" core " orange"} 
                            /End /         {print time          "   <" core}‘ 



                       2010-12-07 13:52:44.738 >UNIT01-P2368 blue
    Trace              2010-12-07 13:52:44.912 >UNIT01-P2368 orange
                       2010-12-07 13:52:44.912 <UNIT01-P2368


           One-liner   splot -bh 1 -w 1400 -h 800 -expire 10000




Pretty picture
From log to trace
UNIT006 2010-11-13 06:23:27.975 P5872 Info Begin 9a444fde86544c7



awk ‘{time=$3 “ “ $4; core=$1 “                    “     $9} 
     /Begin /       {print time                    "     >" core " blue"} 
     /GetCommonData/{print time                    "     >" core " orange"} 
     /End /         {print time                    "     <" core}‘ 
     log.txt | splot

2010-11-13 06:23:27.975 >UNIT006-P5872 blue




                                    Actor
                          Ordered by time of 1st event




                                                                        Time
Log               UNIT006 2010-11-13 06:23:27.975 P5872 Info Begin 9a444fde86544c7195



                       awk ‘{time=$2 “ “ $3; core=$1 “          “   $4} 
           One-liner        /Begin /       {print time          "   >" core " blue"} 
                            /GetCommonData/{print time          "   >" core " orange"} 
                            /End /         {print time          "   <" core}‘ 



                       2010-12-07 13:52:44.738 >UNIT01-P2368 blue
    Trace              2010-12-07 13:52:44.912 >UNIT01-P2368 orange
                       2010-12-07 13:52:44.912 <UNIT01-P2368


           One-liner   splot -bh 1 -w 1400 -h 800 -expire 10000




Pretty picture
How to create a PNG


    -o out.png
How to change window size


     -w 1400 -h 800
What if the bars are too thick?


                     -bh 1

  (for example, 1 bar per process, 2000 processes)
What if the ticks are too often?


         -tickInterval 5000

    (for example, the log spans 1.5 hours)
What if the time format is not
     %Y-%m-%d %H:%M:%OS?




 -tf ‘[%H-%M-%OS %Y/%m/%d]’


           (man strptime)
    (%OS for fractional seconds)
What if I want to sort on actor name
      (not time of first event)?


                     -sort name
(for example, your actor names are like “JOB-MACHINE-PID”,
   and you want to clearly differentiate jobs in output)
What if the ‘<‘ is lost?
• Process was killed before it said ‘Done with X’

• Assume X takes no more than T
• Use “-expire T”
What if the ‘<‘ is lost?
    -expire 10000
What if the ‘<‘ is lost?
    -expire 10000
What if the ‘>’ is lost?
• You’re processing a log in pieces

• Use ‘-phantom COLOR’.
• Tracks starting with ‘<‘ will be prepended with ‘>COLOR’.
So what can it do, again?


  It can help you see a pattern
   that is hard/impossible to see by other means
           (just like any other visualization)




  P.S. All examples below are drawn by one-liners.
N jobs run concurrently,
         then all but one finish,
       it continues sequentially:
too sequentially to saturate the cluster.
For the early tasks, fetching data takes a pathologically large time.
Sometimes it takes a lot of time for other tasks, too, but not that much.
We’re spraying tasks all over the cluster as fast as we can (900/s),
                                but they are just too short.

                              Spraying slows down over time.




1900
cores




        2s
Utilization is better, but there are some strange pauses.
Component A calls component B
   Component A’s impression    Component B’s impression




Diagnosis: Slow inter-component transport!
There are interrupts in task generation                 Long tail causes under-utilization

                           Flow of tasks not big enough to saturate cluster




               Memcached gets slow at times
Guidelines
What are the actors?
    – Processes
        • Name them like “MACHINE-PID-THREAD” or “JOB-MACHINE-PID-THREAD”
        • Make sure your log is verbose enough for that
    – Tasks
        • Better show those who process them (not tasks themselves)
Guidelines
What are the states?
     –   Example: “fetch data”, then “compute”, then “write result”
     –   Make sure your log shows boundaries


{time=…; actor=…}
/Started fetching/{print   time   “   >”   actor “ blue”}
/Computing…/      {print   time   “   >”   actor “ orange”}
/Done computing/ {print    time   “   >”   actor “ green”}
/Written result/ {print    time   “   <“   actor}
Guidelines
How to differentiate between actor groups?
        – Make them of different colour


Log: ….   Deliver JOBID.TASKID


BEGIN    {color[0]="green"; color[1]="orange"}
         {time=$3 " " $4; core=$1 "-" $9}
/Deliver/{id=$NF; sub(/..*/,"",id); job[core]=id;
          print time " >" job[core] "-" core " " color[id%2]}
/End /   {print time " <" job[core] "-" core}
That’s how it looks. One job preempts the other.
(1st job’s processes are killed, 2nd’s are spawned)
Example
awk 'BEGIN{color[0]="green"; color[1]="orange"}                              
     {time=$3 " " $4; core=$1 "-" $9}                                        
     /Deliver/{id=$NF; sub(/..*/,"",id); job[core]=id;                      
               print time " >" job[core] "-" core " " color[id%2]}           
     /End /   {print time " <" job[core] "-" core}‘                          
     "$f"                                                                    
  | splot -bh 1 -w 1400 -h 800 -tickInterval 30000 -o "$f.utilization.png“   
    -sort time -expire 15000
P.S.
To draw a “global picture” you’ll need a global time axis.

Try greg – http://code.google.com/p/greg
Tool 2 - tplot
             “time plot”

how do these quantitative characteristics
     change together over time?
When load increased,
                                                         both hits and misses
      Request times are clustered in 2 groups            got more expensive
             (cache hits and misses)




“arc” is for “arcadia” (Yandex Server) – it’s from a load test of rabota.yandex.ru
Some “splits” are slow. Their impact is not negligible at all
We’re basically keeping up with the flow of tasks, lagging ~1s behind.
Client calls Gateway. Client thinks it takes 50ms, Gateway thinks it takes ~2ms
Job 168 preempts job 167 and see how cluster usage share changes.
Numbes of “waves” being processed by cluster at each moment




See anything
in common?
It has slightly more options than splot
         Usage: tplot [-o OFILE] [-of {png|pdf|ps|svg|x}] [-or 640x480] -if IFILE [-tf TF]
                      [-k Pat1 Kind1 -k Pat2 Kind2 ...] [-dk KindN] [-fromTime TIME] [-toTime TIME]
           -o OFILE - output file (required if -of is not x)
           -of       - output format (x means draw result in a window, default: extension of -o)
                       x is only available if you installed timeplot with --flags=gtk
           -or       - output resolution (default 640x480)
           -if IFILE - input file; '-' means 'read from stdin'
           -tf TF    - time format: 'num' means that times are floating-point numbers
                       (for instance, seconds elapsed since an event); 'date PATTERN' means that times are dates
                       in the format specified by PATTERN - see http://linux.die.net/man/3/strptime,
                       for example, [%Y-%m-%d %H:%M:%S] parses dates like [2009-10-20 16:52:43].
                       We also support %OS for fractional seconds (i.e. %OS will parse 12.4039 or 12,4039).
                       Default: 'date %Y-%m-%d %H:%M:%OS'
           -k P K    - set diagram kind for tracks matching regex P (in the format of regex-tdfa, which
                       is at least POSIX-compliant and supports some GNU extensions) to K
                       (-k clauses are matched till first success)
           -dk       - set default diagram kind
           -fromTime - filter records whose time is >= this time (formatted according to -tf)
           -toTime   - filter records whose time is < this time (formatted according to -tf)

         Input format: lines of the following form:
         1234 >A - at time 1234, activity A has begun
         1234 <A - at time 1234, activity A has ended
         1234 !B - at time 1234, pulse event B has occured
         1234 @B COLOR - at time 1234, the status of B became such that it is appropriate to draw it with color COLOR :)
         1234 =C VAL - at time 1234, parameter C had numeric value VAL (for example, HTTP response time)
         1234 =D `EVENT - at time 1234, event EVENT occured in process D (for example, HTTP response code)
         It is assumed that many events of the same kind may occur at once.
         Diagram kinds:
           ‘none’ - do not plot this track
           'event' is for event diagrams: activities are drawn like --[===]--- , pulse events like --|--
           'duration XXXX' - plot any kind of diagram over the *durations* of events on a track (delimited by > ... <)
              for example 'duration quantile 300 0.25,0.5,0.75' will plot these quantiles of durations of the events.
              This is useful where your log looks like 'Started processing' ... 'Finished processing': you can plot
              processing durations without computing them yourself.
           'duration[C] XXXX' - same as 'duration', but of a track's name we only take the part before character C.
              For example, if you have processes named 'MACHINE-PID' (i.e. UNIT027-8532) say 'begin something' /
              'end something' and you're interested in the properties of per-machine durations, use duration[-].
           'count N' is for activity counts: a 'histogram' is drawn with granularity of N time units, where
              the bin corresponding to [t..t+N) has value 'what was the maximal number of active events
              in that interval', or 'what was the number of impulses in that interval'.
           'freq N [TYPE]' is for event frequency histograms: a histogram of type TYPE (stacked or
              clustered, default clustered) is drawn for each time bin of size N, about the distribution
              of various ` events
           'hist N [TYPE]' is for event frequency histograms: a histogram of type TYPE (stacked or
              clustered, default clustered) is drawn for each time bin of size N, about the counts of
              various ` events
           'quantile N q1,q2,..' (example: quantile 100 0.25,0.5,0.75) - a bar chart of corresponding
              quantiles in time bins of size N
           'binf N v1,v2,..' (example: binf 100 1,2,5,10) - a bar chart of frequency of values falling
              into bins min..v1, v1..v2, .., v2..max in time bins of size N
           'binh N v1,v2,..' (example: binf 100 1,2,5,10) - a bar chart of counts of values falling
              into bins min..v1, v1..v2, .., v2..max in time bins of size N
           'lines' - a simple line plot of numeric values
           'dots'   - a simple dot plot of numeric values
           'cumsum' - a simple line plot of the sum of the numeric values
           'sum N' - a simple line plot of the sum of the numeric values in time bins of size N
         N is measured in units or in seconds.
But it’s used the same way

         Log file              INFO 2010-12-02 07:08:10.422 [Pool-1] A task arrived
                               INFO 2010-12-02 07:08:10.440 [Pool-2] A task arrived
                               INFO 2010-12-02 07:08:10.518 [Pool-3] Task finished


                               awk ‘{t=$2 “ “ $3} 
                One-liner           /arrived/{print t “ >running”; print t “ !begin/5s”} 
                                    /finished/{print t “ <running”; print t “ !end/5s”}’

                               2010-12-02   07:08:10.422   !begin/5s
                               2010-12-02   07:08:10.422   >running
                               2010-12-02   07:08:10.440   !begin/5s
Trace file (input for tools)   2010-12-02
                               2010-12-02
                                            07:08:10.440
                                            07:08:10.518
                                                           >running
                                                           !end/5s
                               2010-12-02   07:08:10.518   <running



                One-liner      tplot -dk ‘count 5’ -if - -of x -or 1400x800




      Pretty picture
Trace format

                            Track

2010-12-15 21:10:34.938 =fruit `Oranges
          Time                      Event
Event types
                                                                        !error
             >computing               <computing

            Activity start            Activity end                   Impulse



                                     @UNIT05 green

                                Colored activity start

       t           =t 36.6                                 fruit

36.6                                                               =fruit `Oranges
                                                                                     Apples
                                                                                     Oranges
                                                                                     Cucumbers
                                                                                     Melons
                                                                                     Potatoes


       Continuous observation                            Categorical (discrete) observation
How to map logs to that?
Log: Starting   “Reduce” phase…

   • TIME >reduce
   • When “>” finished, say “<“.


Log: TASKID   – fetching data…

   •   TIME   >num-fetching-data
   •   TIME   @state-of-TASKID blue
   •   TIME   =state-of-TASKID `fetching-data
   •   TIME   =state `fetching-data
How to map logs to that?
Log: GET   /image.php

    • TIME =url `/image.php
    • TIME >/image.php-MACHINE.THREADID
           – Do not fear – see use case later
           – Say “<“ when you’ve generated the response


Log: Accessing    database DB002 – error NOTFOUND!
    • TIME !error-DB002
    • TIME =error-DB002 `NOTFOUND
    • TIME =who-failed `DB002
How to map logs to that?
Log: Search    returned 973 results

      • TIME =search-results 973


Log: Request   took 34ms

      • TIME =response-time 34
But it’s used the same way

         Log file              INFO 2010-12-02 07:08:10.422 [Pool-1] A task arrived
                               INFO 2010-12-02 07:08:10.440 [Pool-2] A task arrived
                               INFO 2010-12-02 07:08:10.518 [Pool-3] Task finished


                               awk ‘{t=$2 “ “ $3} 
                One-liner           /arrived/{print t “ >running”; print t “ !begin/5s”} 
                                    /finished/{print t “ <running”; print t “ !end/5s”}’

                               2010-12-02   07:08:10.422   !begin/5s
                               2010-12-02   07:08:10.422   >running
                               2010-12-02   07:08:10.440   !begin/5s
Trace file (input for tools)   2010-12-02
                               2010-12-02
                                            07:08:10.440
                                            07:08:10.518
                                                           >running
                                                           !end/5s
                               2010-12-02   07:08:10.518   <running



                One-liner      tplot -dk ‘count 5’ -if - -of x -or 1400x800




      Pretty picture
Let tools do the rest
• Choose diagram kinds
• Map the trace to diagrams
• 1 diagram per track
  -k REGEX1 KIND1
  -k REGEX2 KIND2
  …
  -dk DEFAULT-KIND

  -k search-results ‘quantile 1 0.5,0.75,0.95’ -k return-code ‘freq 1’ -dk none
Choose your poison diagram kind
     ‘none’ - do not plot this track
     'event' is for event diagrams: activities are drawn like --[===]--- , pulse events like --|--
     'duration XXXX' - plot any kind of diagram over the *durations* of events on a track (delimited by > ... <)
        for example 'duration quantile 300 0.25,0.5,0.75' will plot these quantiles of durations of the events.
        This is useful where your log looks like 'Started processing' ... 'Finished processing': you can plot
        processing durations without computing them yourself.
     'duration[C] XXXX' - same as 'duration', but of a track's name we only take the part before character C.
        For example, if you have processes named 'MACHINE-PID' (i.e. UNIT027-8532) say 'begin something' /
        'end something' and you're interested in the properties of per-machine durations, use duration[-].
     'count N' is for activity counts: a 'histogram' is drawn with granularity of N time units, where
        the bin corresponding to [t..t+N) has value 'what was the maximal number of active events
        in that interval', or 'what was the number of impulses in that interval'.
     'freq N [TYPE]' is for event frequency histograms: a histogram of type TYPE (stacked or
        clustered, default clustered) is drawn for each time bin of size N, about the distribution
        of various ` events
     'hist N [TYPE]' is for event frequency histograms: a histogram of type TYPE (stacked or
        clustered, default clustered) is drawn for each time bin of size N, about the counts of
        various ` events
     'quantile N q1,q2,..' (example: quantile 100 0.25,0.5,0.75) - a bar chart of corresponding
        quantiles in time bins of size N
     'binf N v1,v2,..' (example: binf 100 1,2,5,10) - a bar chart of frequency of values falling
        into bins min..v1, v1..v2, .., v2..max in time bins of size N
     'binh N v1,v2,..' (example: binf 100 1,2,5,10) - a bar chart of counts of values falling
        into bins min..v1, v1..v2, .., v2..max in time bins of size N
     'lines' - a simple line plot of numeric values
     'dots'   - a simple dot plot of numeric values
     'cumsum' - a simple line plot of the sum of the numeric values
     'sum N' - a simple line plot of the sum of the numeric values in time bins of size N
TL;DR
‘none’
‘event’
 'event' is for event diagrams: activities are drawn like --[===]--- , pulse events like --|--


Which ‘computation sites’ were active
at any given time?

12/9/2010   5:31:25   >site-0
12/9/2010   5:31:25   >site-4
12/9/2010   5:31:25   >site-1
12/9/2010   5:31:25   >site-5
12/9/2010   5:31:25   >site-3
12/9/2010   5:31:25   >site-2
12/9/2010   5:35:27   <site-2
12/9/2010   5:35:28   >site-6
12/9/2010   5:36:14   <site-4
12/9/2010   5:36:15   >site-7
…

-k site event
‘event’
Other uses:
• How did a long activity influence the rest?
   – Like “data reloading” etc
• Which machines were doing anything at any given time?
   – Log: “machine X started/finished Y”
   – Trace: “>X” / “<X”
   – event : when was X > 0
‘count’
INFO 2010-12-02 07:08:10.422 [Pool-1] A task arrived
INFO 2010-12-02 07:08:10.440 [Pool-2] A task arrived
INFO 2010-12-02 07:08:10.518 [Pool-3] Task finished



…
2010-12-02   07:08:10.422   !begin/5s
2010-12-02   07:08:10.422   >running
2010-12-02   07:08:10.440   !begin/5s
2010-12-02   07:08:10.440   >running
2010-12-02   07:08:10.518   !end/5s
2010-12-02   07:08:10.518   <running
…                                       count 5

Tasks started/finished per 5s
Max active tasks per 5s
‘lines’ and ‘dots’




-dk lines
                     2010-11-11 18:00:27.24.343 =arc 1089.3


            Nothing special              -dk dots
‘sum’ and ‘cumsum’
sum N
• lines over sum of values in bins 0..N, N..2N etc seconds


cumsum
• lines over sum of values from the beginning of the log
How much time memcached took on a 360-node cluster, in each 10-second interval

2010-12-09 01:00:57.738 =memcached/10s 0.059

It was quite unstable.




                                                             -k memcached ‘sum 10’

                                                             Ok, actually
                                                             duration[-] sum 10

                                                             Stay tuned!
Records flow:

   read from socket to “uncalibrated” queue  moved to “time-buffered” queue (TQ)
    expired  written to console
00013846   58.27768326   [332]   Dequeued uncalibrated: 0         58.27768326   =2-moved 0
00013847   58.29270172   [332]   TQ dequeued 0 entries            58.29270172   =3-expired 0
00013848   58.57195663   [332]   Dequeued uncalibrated: 0         58.57195663   =2-moved 0
00013849   58.57243347   [332]   TQ dequeued 59 entries           58.57243347   =3-expired 59
00013850   58.57282257   [332]   Dequeued uncalibrated: 0         58.57282257   =2-moved 0
00013851   58.57336044   [332]   Read 10000 entries from socket   58.57336044   =1-read 10000
00013852   58.57374191   [332]   Read 10000 entries from socket   58.57374191   =1-read 10000
00013853   58.57406998   [332]   Read 10000 entries from socket   58.57406998   =1-read 10000
00013854   58.57439423   [332]   Read 10000 entries from socket   58.57439423   =1-read 10000
00013855   58.57477188   [332]   Read 10000 entries from socket   58.57477188   =1-read 10000
00013856   58.57511139   [332]   Writer dequeued 59 records       58.57511139   =4-written 59




-k dropped dots
-dk cumsum
Records flow:

read from socket to “uncalibrated” queue  moved to “time-buffered” queue (TQ)
 expired  written to console

                                                              These two guys
                                                              keep up.

                                                              We ‘move’ as fast
                                                              as we ‘read’.
Records flow:

read from socket to “uncalibrated” queue  moved to “time-buffered” queue (TQ)
 expired  written to console




                                                             We dequeue ‘expired’
                                                             entries about as fast
                                                             And we could write them
                Actually same slope,
                     Different scale
                                                             equally fast

                                                             But most of the time
                                                             we don’t have any expired entries!
Records flow:

read from socket to “uncalibrated” queue  moved to “time-buffered” queue (TQ)
 expired  written to console




                                                                     Why are no entries
                                                                     expired here?




                                                                     Because they’re
                                                                     dropped from TQ!

                                                                     We should increase
                                                                     TQ buffer size
increase buffer 10x…
         …et voila




Hey, where’s that ‘dropped’ graph? ;)
‘freq’ and ‘hist’
-dk ‘hist 60’            Return codes of a pinger program.




14:05:23 =Code `java.io.IOException
two jobs competing for a cluster
-k relative ‘freq 5 stacked’ -k absolute ‘hist 5 stacked’




      2010-12-10 00:00:30.422 =absolute `244
      2010-12-10 00:00:30.422 =relative `244
‘binh’ and ‘binf’

            08:00:42 =Time 35.8


            -dk ‘binh 15 100,500,1000,5000’




            -dk ‘binf 15 100,500,1000,5000’
-dk ‘binf 60 200,500,1000,2000,5000’
‘quantile’
-dk ‘quantile 3600 0.5’




       min/max/med
 of some supersecret value
      from Yandex 
-dk ‘quantile 60 0.5,0.7,0.9’




How long a “polite” pinger program usually had to wait for a host
‘duration’
• Log: “Started quizzling”, “Finished quizzling”
• We wonder about quizzling durations

• Trace >quizzle, <quizzle
• ‘duration XXX’ plots XXX over durations
• Examples:
   –   duration   dots
   –   duration   sum 10
   –   duration   binh 100,200,500
   –   duration   quantile 1 0.5,0.75,0.95


• duration[SEP] is more universal
‘duration*SEP+’
•   Log: “UNIT035 started quizzling”, “UNIT035 Finished quizzling”
•   We wonder about quizzling durations


•   Trace >quizzle@UNIT035, <quizzle@UNIT035
•   ‘duration[SEP] XXX’ plots XXX over durations
•   Like ‘duration XXX’ but durations of all actors go to 1 track
•   Examples:
     –   duration[@]   dots
     –   duration[@]   sum 10
     –   duration[@]   binh 100,200,500
     –   duration[@]   quantile 1 0.5,0.75,0.95
UNIT011 is on blade 1, UNIT051 is on blade 5, memcached is on blade 1.

UNIT011   2010-12-09   01:54:41.927   P3964    Info   Begin 390256d1-ce56-4f23-8428-1e1b109ab61c/51
                                                                                                               <memcached-UNIT011.P3964
UNIT011   2010-12-09   01:54:41.928   P3964   Debug   GetCommonData 390256d1-ce56-4f23-8428-1e1b109ab61c/51
UNIT051   2010-12-09   01:54:42.045   P3832    Info   Begin 390256d1-ce56-4f23-8428-1e1b109ab61c/99
UNIT051   2010-12-09   01:54:42.045   P3164    Info   Begin 390256d1-ce56-4f23-8428-1e1b109ab61c/98
                                                                                                                  memcached-UNIT011
UNIT051   2010-12-09   01:54:42.046   P3164   Debug   GetCommonData 390256d1-ce56-4f23-8428-1e1b109ab61c/98
UNIT051   2010-12-09   01:54:42.046   P3832   Debug   GetCommonData 390256d1-ce56-4f23-8428-1e1b109ab61c/99
UNIT011   2010-12-09   01:54:42.132   P2740    Info   Begin 390256d1-ce56-4f23-8428-1e1b109ab61c/135
UNIT011   2010-12-09   01:54:42.132   P4032    Info   Begin 390256d1-ce56-4f23-8428-1e1b109ab61c/136
UNIT011   2010-12-09   01:54:42.133   P2740   Debug   GetCommonData 390256d1-ce56-4f23-8428-1e1b109ab61c/135
UNIT011   2010-12-09   01:54:42.133   P4032   Debug   GetCommonData 390256d1-ce56-4f23-8428-1e1b109ab61c/136


awk ‘{t=$2 “ “ $3; p=“memcached-” $1 “.” $4}
     /Begin /{print t " >” p}
     /GetCommonData /{print t " <“ p}'

2010-12-09    01:54:41.853    <memcached-UNIT011.P3964
2010-12-09    01:54:41.927    >memcached-UNIT011.P3964
2010-12-09    01:54:42.001    <memcached-UNIT051.P3164
2010-12-09    01:54:42.002    <memcached-UNIT051.P3832
2010-12-09    01:54:42.045    >memcached-UNIT051.P3832
2010-12-09    01:54:42.045    >memcached-UNIT051.P3164
2010-12-09    01:54:42.128    <memcached-UNIT011.P2740

-dk ‘duration[.] binf 10 0.001,0.002,0.005,0.01,0.05‘



So, apparently, memcached access times
for blade 1 are smaller.

Who’d have thought 
To reiterate
• Take your log
• Trivially map it to a trace (I use ‘awk’)
   /PATTERN/{print something to trace}

• Choose diagram kinds and map trace to them
   -k REGEX KIND, -dk DEFAULT-KIND

• Plot!

cat log.txt | awk ‘…’ | tplot -k …
Options
How to specify input?

-if -       read trace from stdin
-if FILE    read trace from FILE
How to specify output?


-of x                      output to a window
                           (install with --flags=gtk)
-o FILE.{png,svg,pdf,ps}   output to a file
How to produce a bigger image?


-or WIDTHxHEIGHT   Output resolution (default 640x480)
How to specify time format?


-tf num                         time is a real number
-tf ‘date %Y-%m-%d %H:%M:%OS’   time is a date in format of strptime
What if I need just a part of the log?
-fromTime ‘2010-09-12 14:33:00’
-toTime   ‘2010-09-12 16:00:00’

Specify one or both, in format of -tf.

(much better than doing this in the shell pipeline)
How to draw multiple graphs
                for a single track?


Use +dk and +k REGEX KIND instead of -dk and -k.

EXPLANATION:
a track is drawn acc. to all matching +k, to +dk AND ALSO to the
first matching -k, or -dk if none of -k match
Example
 cat log.txt
| grep 'UNIT0[15]1'
| awk '/Begin /        {print $3 " " $4 " >memcached-" $1 "." $9}
       /GetCommonData /{print $3 " " $4 " <memcached-" $1 "." $9}'
| tplot -if - -dk 'duration[.] binf 10 0.001,0.002,0.005,0.01,0.05' -of x
Example
Deliver 244.390256d1-ce56-4f23-8428-1e1b109ab61c/51
awk '{time=$3 " " $4; core=$1 "-" $9}                                        
     /Deliver/{id=$NF; sub(/..*/,"",id);                                    
               print time " =relative `" id;                                 
               print time " =absolute `" id }'                               
    "$f"                                                                     
   | tplot -if - -k relative 'freq 5 stacked' -k absolute 'hist 5 stacked'   
     -o "$f.share.png"




 P.S. Or you could use
 “+k” instead of “-k”:
 +dk “freq 5 stacked”
 +dk “hist 5 stacked”
How much data can they handle?

              200-300Mb is ok.
         If you have more, split it.

                   Into 10-minute bins:

    awk '{hour=substr($4,0,4); sub(/:/,"-",hour); 
         print >(FILENAME "-" hour "0")}' $log
P.S. Installation on Linux
• Install http://hackage.haskell.org/platform/
     – Better from source (not from package manager – may be outdated)
•   cabal update
•   cabal install alex
•   cabal install gtk2hs-buildtools
•   Add /home/YOURSELF/.cabal/bin to PATH
     –   Don’t use ~/.cabal/bin
• Restart your shell (to update PATH)
• Install dev libraries for glib, cairo, pango, gtk
     –   sudo apt-get install libglib2.0-dev libcairo2-dev libpango1.0-dev libgtk2.0-dev
•   cabal install gtk
•   cabal install [--flags=gtk] timeplot
•   cabal install splot
P.S. Installation on Windows
• Install http://hackage.haskell.org/platform/
• Follow http://jystic.com/2010/10/20/installing-gtk2hs-on-windows/
•   cabal update
•   cabal install [--flags=gtk] timeplot
•   cabal install splot
•   Install http://www.cygwin.com/ for awk or whatever
•   Use cygwin shell
P.S. Installation on Mac
• Install XCode
  – Or install the development toolchain (gcc, linker
    etc) in a different way, installing xcode is easiest.
• Proceed as on Linux
P.S. What if I have trouble installing it?
Please tell me.

I’ll help you and fix this document accordingly.

ekirpichov@gmail.com
That’s it

                  Thanks

      If you liked the presentation,
  The best way to say your “thanks” is
to use the tools and to spread the word

     OK, the really best way is to contribute
        http://github.com/jkff/timeplot
          http://github.com/jkff/splot

More Related Content

What's hot

Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedBrendan Gregg
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityBrendan Gregg
 
Do more than one thing at the same time, the Python way
Do more than one thing at the same time, the Python wayDo more than one thing at the same time, the Python way
Do more than one thing at the same time, the Python wayJaime Buelta
 
FreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame GraphsFreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame GraphsBrendan Gregg
 
The Ring programming language version 1.9 book - Part 57 of 210
The Ring programming language version 1.9 book - Part 57 of 210The Ring programming language version 1.9 book - Part 57 of 210
The Ring programming language version 1.9 book - Part 57 of 210Mahmoud Samir Fayed
 
Opendaylight app development
Opendaylight app developmentOpendaylight app development
Opendaylight app developmentvjanandr
 
Open Source Systems Performance
Open Source Systems PerformanceOpen Source Systems Performance
Open Source Systems PerformanceBrendan Gregg
 
Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Dan Lynn
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 
pstack, truss etc to understand deeper issues in Oracle database
pstack, truss etc to understand deeper issues in Oracle databasepstack, truss etc to understand deeper issues in Oracle database
pstack, truss etc to understand deeper issues in Oracle databaseRiyaj Shamsudeen
 
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solverS4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solverPraveen Narayanan
 
Linux Performance Tools
Linux Performance ToolsLinux Performance Tools
Linux Performance ToolsBrendan Gregg
 
Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)
Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)
Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)mahesh madushanka
 
Kubernetes Tutorial
Kubernetes TutorialKubernetes Tutorial
Kubernetes TutorialCi Jie Li
 
JavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for DummiesJavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for DummiesCharles Nutter
 
Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Uri Laserson
 
Tutorial Stream Reasoning SPARQLstream and Morph-streams
Tutorial Stream Reasoning SPARQLstream and Morph-streamsTutorial Stream Reasoning SPARQLstream and Morph-streams
Tutorial Stream Reasoning SPARQLstream and Morph-streamsJean-Paul Calbimonte
 
Analyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE MethodAnalyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE MethodBrendan Gregg
 
MeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisMeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisBrendan Gregg
 

What's hot (20)

Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
 
Do more than one thing at the same time, the Python way
Do more than one thing at the same time, the Python wayDo more than one thing at the same time, the Python way
Do more than one thing at the same time, the Python way
 
FreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame GraphsFreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame Graphs
 
The Ring programming language version 1.9 book - Part 57 of 210
The Ring programming language version 1.9 book - Part 57 of 210The Ring programming language version 1.9 book - Part 57 of 210
The Ring programming language version 1.9 book - Part 57 of 210
 
Opendaylight app development
Opendaylight app developmentOpendaylight app development
Opendaylight app development
 
Open Source Systems Performance
Open Source Systems PerformanceOpen Source Systems Performance
Open Source Systems Performance
 
Storm Anatomy
Storm AnatomyStorm Anatomy
Storm Anatomy
 
Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
pstack, truss etc to understand deeper issues in Oracle database
pstack, truss etc to understand deeper issues in Oracle databasepstack, truss etc to understand deeper issues in Oracle database
pstack, truss etc to understand deeper issues in Oracle database
 
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solverS4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
 
Linux Performance Tools
Linux Performance ToolsLinux Performance Tools
Linux Performance Tools
 
Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)
Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)
Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)
 
Kubernetes Tutorial
Kubernetes TutorialKubernetes Tutorial
Kubernetes Tutorial
 
JavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for DummiesJavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for Dummies
 
Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)
 
Tutorial Stream Reasoning SPARQLstream and Morph-streams
Tutorial Stream Reasoning SPARQLstream and Morph-streamsTutorial Stream Reasoning SPARQLstream and Morph-streams
Tutorial Stream Reasoning SPARQLstream and Morph-streams
 
Analyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE MethodAnalyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE Method
 
MeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisMeetBSD2014 Performance Analysis
MeetBSD2014 Performance Analysis
 

Viewers also liked

YUI is Sexy - 使用 YUI 作為開發基礎
YUI is Sexy - 使用 YUI 作為開發基礎YUI is Sexy - 使用 YUI 作為開發基礎
YUI is Sexy - 使用 YUI 作為開發基礎Joseph Chiang
 
شيت دمحمددسوقى
شيت دمحمددسوقىشيت دمحمددسوقى
شيت دمحمددسوقىabdoo2020
 
Aportes al reglamento de propaganda electoral, publicidad estatal y neutralidad
Aportes al reglamento de propaganda electoral, publicidad estatal y neutralidadAportes al reglamento de propaganda electoral, publicidad estatal y neutralidad
Aportes al reglamento de propaganda electoral, publicidad estatal y neutralidadAsociación Civil Transparencia
 
TD Personal Brand Journey July2016
TD Personal Brand Journey July2016TD Personal Brand Journey July2016
TD Personal Brand Journey July2016Tony D'Onofrio
 
Cuestionario De Convivencia Alumnos
Cuestionario De Convivencia AlumnosCuestionario De Convivencia Alumnos
Cuestionario De Convivencia Alumnosmgvaamonde
 
Student recruitment landscape 2.6
Student recruitment landscape 2.6Student recruitment landscape 2.6
Student recruitment landscape 2.6whatunichennai
 
Joomla!, WordPress e Blogs
Joomla!, WordPress e BlogsJoomla!, WordPress e Blogs
Joomla!, WordPress e BlogsMarcio Okabe
 
Cubrir Riesgo Cambiario de ABC y Evaluar Flujo de Caja Descontado
Cubrir Riesgo Cambiario de ABC y Evaluar Flujo de Caja DescontadoCubrir Riesgo Cambiario de ABC y Evaluar Flujo de Caja Descontado
Cubrir Riesgo Cambiario de ABC y Evaluar Flujo de Caja DescontadoOmar Castellanos R.
 
Notice Aspiration centralisée Beam Electrolux
Notice Aspiration centralisée Beam ElectroluxNotice Aspiration centralisée Beam Electrolux
Notice Aspiration centralisée Beam ElectroluxHomexity
 
A s oct 2013 full web
A s oct 2013 full webA s oct 2013 full web
A s oct 2013 full webMadhavbaug
 
Edutech2015v2 150705224119-lva1-app6891
Edutech2015v2 150705224119-lva1-app6891Edutech2015v2 150705224119-lva1-app6891
Edutech2015v2 150705224119-lva1-app6891Vera Kovaleva
 
The Public Opinion Landscape: Election 2016
The Public Opinion Landscape: Election 2016The Public Opinion Landscape: Election 2016
The Public Opinion Landscape: Election 2016GloverParkGroup
 
Guidelines for Working Smarter, Not Harder
Guidelines for Working Smarter, Not HarderGuidelines for Working Smarter, Not Harder
Guidelines for Working Smarter, Not HarderAdam Sicinski
 

Viewers also liked (20)

Presentacion mercado de derivados
Presentacion mercado de derivadosPresentacion mercado de derivados
Presentacion mercado de derivados
 
YUI is Sexy - 使用 YUI 作為開發基礎
YUI is Sexy - 使用 YUI 作為開發基礎YUI is Sexy - 使用 YUI 作為開發基礎
YUI is Sexy - 使用 YUI 作為開發基礎
 
[52nd KUG PP] Intro KUG
[52nd KUG PP] Intro KUG[52nd KUG PP] Intro KUG
[52nd KUG PP] Intro KUG
 
شيت دمحمددسوقى
شيت دمحمددسوقىشيت دمحمددسوقى
شيت دمحمددسوقى
 
Aportes al reglamento de propaganda electoral, publicidad estatal y neutralidad
Aportes al reglamento de propaganda electoral, publicidad estatal y neutralidadAportes al reglamento de propaganda electoral, publicidad estatal y neutralidad
Aportes al reglamento de propaganda electoral, publicidad estatal y neutralidad
 
TD Personal Brand Journey July2016
TD Personal Brand Journey July2016TD Personal Brand Journey July2016
TD Personal Brand Journey July2016
 
Cuestionario De Convivencia Alumnos
Cuestionario De Convivencia AlumnosCuestionario De Convivencia Alumnos
Cuestionario De Convivencia Alumnos
 
Student recruitment landscape 2.6
Student recruitment landscape 2.6Student recruitment landscape 2.6
Student recruitment landscape 2.6
 
Joomla!, WordPress e Blogs
Joomla!, WordPress e BlogsJoomla!, WordPress e Blogs
Joomla!, WordPress e Blogs
 
Cubrir Riesgo Cambiario de ABC y Evaluar Flujo de Caja Descontado
Cubrir Riesgo Cambiario de ABC y Evaluar Flujo de Caja DescontadoCubrir Riesgo Cambiario de ABC y Evaluar Flujo de Caja Descontado
Cubrir Riesgo Cambiario de ABC y Evaluar Flujo de Caja Descontado
 
resume
resumeresume
resume
 
i.materialise
i.materialisei.materialise
i.materialise
 
Notice Aspiration centralisée Beam Electrolux
Notice Aspiration centralisée Beam ElectroluxNotice Aspiration centralisée Beam Electrolux
Notice Aspiration centralisée Beam Electrolux
 
Proyecto 4-katherine-piraban
Proyecto 4-katherine-pirabanProyecto 4-katherine-piraban
Proyecto 4-katherine-piraban
 
ABA SEER Vice Chair.PDF
ABA SEER Vice Chair.PDFABA SEER Vice Chair.PDF
ABA SEER Vice Chair.PDF
 
A s oct 2013 full web
A s oct 2013 full webA s oct 2013 full web
A s oct 2013 full web
 
Edutech2015v2 150705224119-lva1-app6891
Edutech2015v2 150705224119-lva1-app6891Edutech2015v2 150705224119-lva1-app6891
Edutech2015v2 150705224119-lva1-app6891
 
7maravillas
7maravillas7maravillas
7maravillas
 
The Public Opinion Landscape: Election 2016
The Public Opinion Landscape: Election 2016The Public Opinion Landscape: Election 2016
The Public Opinion Landscape: Election 2016
 
Guidelines for Working Smarter, Not Harder
Guidelines for Working Smarter, Not HarderGuidelines for Working Smarter, Not Harder
Guidelines for Working Smarter, Not Harder
 

Similar to Two visualization tools

NS2: AWK and GNUplot - PArt III
NS2: AWK and GNUplot - PArt IIINS2: AWK and GNUplot - PArt III
NS2: AWK and GNUplot - PArt IIIAjit Nayak
 
Scrapping Your Inefficient Engine
Scrapping Your Inefficient EngineScrapping Your Inefficient Engine
Scrapping Your Inefficient EngineEdwin Brady
 
NS2-tutorial.ppt
NS2-tutorial.pptNS2-tutorial.ppt
NS2-tutorial.pptWajath
 
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation CenterDUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation CenterAndrey Kudryavtsev
 
When the OS gets in the way
When the OS gets in the wayWhen the OS gets in the way
When the OS gets in the wayMark Price
 
Analyze database system using a 3 d method
Analyze database system using a 3 d methodAnalyze database system using a 3 d method
Analyze database system using a 3 d methodAjith Narayanan
 
Working with NS2
Working with NS2Working with NS2
Working with NS2chanchal214
 
Understanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New YorkUnderstanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New YorkRachel Warren
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkHolden Karau
 
Spark Autotuning - Strata EU 2018
Spark Autotuning - Strata EU 2018Spark Autotuning - Strata EU 2018
Spark Autotuning - Strata EU 2018Holden Karau
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfcookie1969
 
OSNoise Tracer: Who Is Stealing My CPU Time?
OSNoise Tracer: Who Is Stealing My CPU Time?OSNoise Tracer: Who Is Stealing My CPU Time?
OSNoise Tracer: Who Is Stealing My CPU Time?ScyllaDB
 
Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time togetherTed Dunning
 
Cs757 ns2-tutorial-exercise
Cs757 ns2-tutorial-exerciseCs757 ns2-tutorial-exercise
Cs757 ns2-tutorial-exercisePratik Joshi
 
Best Practices in Handling Performance Issues
Best Practices in Handling Performance IssuesBest Practices in Handling Performance Issues
Best Practices in Handling Performance IssuesOdoo
 
Cloud burst tutorial
Cloud burst tutorialCloud burst tutorial
Cloud burst tutorial주영 송
 

Similar to Two visualization tools (20)

NS2: AWK and GNUplot - PArt III
NS2: AWK and GNUplot - PArt IIINS2: AWK and GNUplot - PArt III
NS2: AWK and GNUplot - PArt III
 
Ns network simulator
Ns network simulatorNs network simulator
Ns network simulator
 
NS2-tutorial.pdf
NS2-tutorial.pdfNS2-tutorial.pdf
NS2-tutorial.pdf
 
Scrapping Your Inefficient Engine
Scrapping Your Inefficient EngineScrapping Your Inefficient Engine
Scrapping Your Inefficient Engine
 
NS2-tutorial.ppt
NS2-tutorial.pptNS2-tutorial.ppt
NS2-tutorial.ppt
 
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation CenterDUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
 
When the OS gets in the way
When the OS gets in the wayWhen the OS gets in the way
When the OS gets in the way
 
Analyze database system using a 3 d method
Analyze database system using a 3 d methodAnalyze database system using a 3 d method
Analyze database system using a 3 d method
 
Working with NS2
Working with NS2Working with NS2
Working with NS2
 
Understanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New YorkUnderstanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New York
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New York
 
Spark Autotuning - Strata EU 2018
Spark Autotuning - Strata EU 2018Spark Autotuning - Strata EU 2018
Spark Autotuning - Strata EU 2018
 
Sge
SgeSge
Sge
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
 
OSNoise Tracer: Who Is Stealing My CPU Time?
OSNoise Tracer: Who Is Stealing My CPU Time?OSNoise Tracer: Who Is Stealing My CPU Time?
OSNoise Tracer: Who Is Stealing My CPU Time?
 
Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time together
 
Cs757 ns2-tutorial-exercise
Cs757 ns2-tutorial-exerciseCs757 ns2-tutorial-exercise
Cs757 ns2-tutorial-exercise
 
Best Practices in Handling Performance Issues
Best Practices in Handling Performance IssuesBest Practices in Handling Performance Issues
Best Practices in Handling Performance Issues
 
Introduction to ns2
Introduction to ns2Introduction to ns2
Introduction to ns2
 
Cloud burst tutorial
Cloud burst tutorialCloud burst tutorial
Cloud burst tutorial
 

Recently uploaded

Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 

Recently uploaded (20)

20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 

Two visualization tools

  • 1. two visualization tools for log files Eugene Kirpichov, 2010 ekirpichev@mirantis.com, ekirpichov@gmail.com Download: http://jkff.info/presentations/two-visualization-tools.pdf Slideshare: http://slideshare.net/jkff/two-visualization-tools
  • 2. Plan • intro / philosophy • splot, tplot – purpose – basics – plenty of examples – options • installation
  • 3. Intro • I wanted to visualize the behavior of my code • I only had logs • I found no existing tools to be good enough – suggestions welcome
  • 4. So I wrote them tplot (for timeplot) http://github.com/jkff/timeplot splot (for stateplot) http://github.com/jkff/splot P.S. both are in Haskell – “for fun”, but turned out to pay its weight in gold All new features took a couple lines of code and usually worked immediately This helped when I really needed a feature quickly
  • 5. Philosophy - I want to see X! - If X is in the log, you’re almost done. Easily map logs to tools’ input, let tools do the rest.
  • 6. Philosophy Do not depend on log format – cat log | text-processing oneliner | plot Do not depend on domain – Visualize arbitrary “signals”
  • 7. Mode of operation cat log.txt | awk ‘/…/{something simple}’ | tplot some parameters … You can use perl or whatever, but awk is really freakin’ damn simple. You can learn enough of awk in 1 minute. /REGEX/{print something}, and $n is n-th field. Actually awk is very powerful, but simple things are simple
  • 8. Philosophy The “awk ,something simple-” part is really simple – Adapt to typical log messages
  • 9. Philosophy What are typical log messages? – An error happened – The request took 100 – Machine UNIT001 started/finished reading data – The current temperature is 96 F – Search returned 974 results – URL responded: NOT_FOUND – …
  • 10. Philosophy What are typical questions? – Show me the big picture! – Show me X over time! – Analyze X over time! • Percentiles • Buckets – How did X and Y behave together? – …
  • 11. Log UNIT006 2010-11-13 06:23:27.975 P5872 Info Begin 9a444fde86544c7195 awk ‘{time=$2 “ “ $3; core=$1 “ “ $4} One-liner /Begin / {print time " >" core " blue"} /GetCommonData/{print time " >" core " orange"} /End / {print time " <" core}‘ 2010-12-07 13:52:44.738 >UNIT01-P2368 blue Trace 2010-12-07 13:52:44.912 >UNIT01-P2368 orange 2010-12-07 13:52:44.912 <UNIT01-P2368 One-liner splot -bh 1 -w 1400 -h 800 -expire 10000 Pretty picture
  • 12. Tool 1 - splot “state plot” you’ve got a zillion workers they all work on something what is the big picture?
  • 13. 1 actor = 1 thread
  • 14. 1 actor = 1 cluster core
  • 15. How to use it? Usage: splot [-o PNGFILE] [-w WIDTH] [-h HEIGHT] [-bh BARHEIGHT] [-tf TIMEFORMAT] [-tickInterval TICKINTERVAL] -o PNGFILE - filename to which the output will be written in PNG format. If omitted, it will be shown in a window. -w, -h - width and height of the resulting picture. Default 640x480. -bh - height of the bar depicting each individual process. Default 5 pixels. Use 1 or so if you have a lot of them. -tf - time format, as in http://linux.die.net/man/3/strptime but with fractional seconds supported via %OS - will parse 12.4039 or 12,4039 -tickInterval - ticks on the X axis will be this often (in millis). -sort SORT - sort tracks by SORT, where: 'time' - sort by time of first event, 'name' - sort by track name Input is read from stdin. Example input (speaks for itself): 2010-10-21 16:45:09,431 >foo green 2010-10-21 16:45:09,541 >bar green 2010-10-21 16:45:10,631 >foo yellow 2010-10-21 16:45:10,725 >foo red 2010-10-21 16:45:10,930 >bar blue 2010-10-21 16:45:11,322 <foo 2010-10-21 16:45:12,508 <bar '>FOO COLOR' means 'start a bar of color COLOR on track FOO', '<FOO' means 'end the current bar for FOO'.
  • 16. Log UNIT006 2010-11-13 06:23:27.975 P5872 Info Begin 9a444fde86544c7195 awk ‘{time=$2 “ “ $3; core=$1 “ “ $4} One-liner /Begin / {print time " >" core " blue"} /GetCommonData/{print time " >" core " orange"} /End / {print time " <" core}‘ 2010-12-07 13:52:44.738 >UNIT01-P2368 blue Trace (stdin) 2010-12-07 13:52:44.912 >UNIT01-P2368 orange 2010-12-07 13:52:44.912 <UNIT01-P2368 One-liner splot -bh 1 -w 1400 -h 800 -expire 10000 Pretty picture
  • 17. Trace format Track Color 2010-12-15 21:10:34.938 >UNIT65 blue Time Rise Track 2010-12-15 21:10:34.938 <UNIT65 Time Fall
  • 18. Trace format TIME >ACTOR COLOR TIME <ACTOR 2010-12-07 13:52:44.738 >UNIT01-P2368 blue 2010-12-07 13:52:44.912 >UNIT01-P2368 orange 2010-12-07 13:52:44.912 <UNIT01-P2368
  • 19. Log UNIT006 2010-11-13 06:23:27.975 P5872 Info Begin 9a444fde86544c7195 awk ‘{time=$2 “ “ $3; core=$1 “ “ $4} One-liner /Begin / {print time " >" core " blue"} /GetCommonData/{print time " >" core " orange"} /End / {print time " <" core}‘ 2010-12-07 13:52:44.738 >UNIT01-P2368 blue Trace 2010-12-07 13:52:44.912 >UNIT01-P2368 orange 2010-12-07 13:52:44.912 <UNIT01-P2368 One-liner splot -bh 1 -w 1400 -h 800 -expire 10000 Pretty picture
  • 20. From log to trace UNIT006 2010-11-13 06:23:27.975 P5872 Info Begin 9a444fde86544c7 awk ‘{time=$3 “ “ $4; core=$1 “ “ $9} /Begin / {print time " >" core " blue"} /GetCommonData/{print time " >" core " orange"} /End / {print time " <" core}‘ log.txt | splot 2010-11-13 06:23:27.975 >UNIT006-P5872 blue Actor Ordered by time of 1st event Time
  • 21. Log UNIT006 2010-11-13 06:23:27.975 P5872 Info Begin 9a444fde86544c7195 awk ‘{time=$2 “ “ $3; core=$1 “ “ $4} One-liner /Begin / {print time " >" core " blue"} /GetCommonData/{print time " >" core " orange"} /End / {print time " <" core}‘ 2010-12-07 13:52:44.738 >UNIT01-P2368 blue Trace 2010-12-07 13:52:44.912 >UNIT01-P2368 orange 2010-12-07 13:52:44.912 <UNIT01-P2368 One-liner splot -bh 1 -w 1400 -h 800 -expire 10000 Pretty picture
  • 22. How to create a PNG -o out.png
  • 23. How to change window size -w 1400 -h 800
  • 24. What if the bars are too thick? -bh 1 (for example, 1 bar per process, 2000 processes)
  • 25. What if the ticks are too often? -tickInterval 5000 (for example, the log spans 1.5 hours)
  • 26. What if the time format is not %Y-%m-%d %H:%M:%OS? -tf ‘[%H-%M-%OS %Y/%m/%d]’ (man strptime) (%OS for fractional seconds)
  • 27. What if I want to sort on actor name (not time of first event)? -sort name (for example, your actor names are like “JOB-MACHINE-PID”, and you want to clearly differentiate jobs in output)
  • 28. What if the ‘<‘ is lost? • Process was killed before it said ‘Done with X’ • Assume X takes no more than T • Use “-expire T”
  • 29. What if the ‘<‘ is lost? -expire 10000
  • 30. What if the ‘<‘ is lost? -expire 10000
  • 31. What if the ‘>’ is lost? • You’re processing a log in pieces • Use ‘-phantom COLOR’. • Tracks starting with ‘<‘ will be prepended with ‘>COLOR’.
  • 32. So what can it do, again? It can help you see a pattern that is hard/impossible to see by other means (just like any other visualization) P.S. All examples below are drawn by one-liners.
  • 33. N jobs run concurrently, then all but one finish, it continues sequentially: too sequentially to saturate the cluster.
  • 34. For the early tasks, fetching data takes a pathologically large time. Sometimes it takes a lot of time for other tasks, too, but not that much.
  • 35. We’re spraying tasks all over the cluster as fast as we can (900/s), but they are just too short. Spraying slows down over time. 1900 cores 2s
  • 36. Utilization is better, but there are some strange pauses.
  • 37. Component A calls component B Component A’s impression Component B’s impression Diagnosis: Slow inter-component transport!
  • 38. There are interrupts in task generation Long tail causes under-utilization Flow of tasks not big enough to saturate cluster Memcached gets slow at times
  • 39. Guidelines What are the actors? – Processes • Name them like “MACHINE-PID-THREAD” or “JOB-MACHINE-PID-THREAD” • Make sure your log is verbose enough for that – Tasks • Better show those who process them (not tasks themselves)
  • 40. Guidelines What are the states? – Example: “fetch data”, then “compute”, then “write result” – Make sure your log shows boundaries {time=…; actor=…} /Started fetching/{print time “ >” actor “ blue”} /Computing…/ {print time “ >” actor “ orange”} /Done computing/ {print time “ >” actor “ green”} /Written result/ {print time “ <“ actor}
  • 41. Guidelines How to differentiate between actor groups? – Make them of different colour Log: …. Deliver JOBID.TASKID BEGIN {color[0]="green"; color[1]="orange"} {time=$3 " " $4; core=$1 "-" $9} /Deliver/{id=$NF; sub(/..*/,"",id); job[core]=id; print time " >" job[core] "-" core " " color[id%2]} /End / {print time " <" job[core] "-" core}
  • 42. That’s how it looks. One job preempts the other. (1st job’s processes are killed, 2nd’s are spawned)
  • 43. Example awk 'BEGIN{color[0]="green"; color[1]="orange"} {time=$3 " " $4; core=$1 "-" $9} /Deliver/{id=$NF; sub(/..*/,"",id); job[core]=id; print time " >" job[core] "-" core " " color[id%2]} /End / {print time " <" job[core] "-" core}‘ "$f" | splot -bh 1 -w 1400 -h 800 -tickInterval 30000 -o "$f.utilization.png“ -sort time -expire 15000
  • 44. P.S. To draw a “global picture” you’ll need a global time axis. Try greg – http://code.google.com/p/greg
  • 45. Tool 2 - tplot “time plot” how do these quantitative characteristics change together over time?
  • 46. When load increased, both hits and misses Request times are clustered in 2 groups got more expensive (cache hits and misses) “arc” is for “arcadia” (Yandex Server) – it’s from a load test of rabota.yandex.ru
  • 47. Some “splits” are slow. Their impact is not negligible at all
  • 48. We’re basically keeping up with the flow of tasks, lagging ~1s behind.
  • 49. Client calls Gateway. Client thinks it takes 50ms, Gateway thinks it takes ~2ms
  • 50. Job 168 preempts job 167 and see how cluster usage share changes.
  • 51. Numbes of “waves” being processed by cluster at each moment See anything in common?
  • 52. It has slightly more options than splot Usage: tplot [-o OFILE] [-of {png|pdf|ps|svg|x}] [-or 640x480] -if IFILE [-tf TF] [-k Pat1 Kind1 -k Pat2 Kind2 ...] [-dk KindN] [-fromTime TIME] [-toTime TIME] -o OFILE - output file (required if -of is not x) -of - output format (x means draw result in a window, default: extension of -o) x is only available if you installed timeplot with --flags=gtk -or - output resolution (default 640x480) -if IFILE - input file; '-' means 'read from stdin' -tf TF - time format: 'num' means that times are floating-point numbers (for instance, seconds elapsed since an event); 'date PATTERN' means that times are dates in the format specified by PATTERN - see http://linux.die.net/man/3/strptime, for example, [%Y-%m-%d %H:%M:%S] parses dates like [2009-10-20 16:52:43]. We also support %OS for fractional seconds (i.e. %OS will parse 12.4039 or 12,4039). Default: 'date %Y-%m-%d %H:%M:%OS' -k P K - set diagram kind for tracks matching regex P (in the format of regex-tdfa, which is at least POSIX-compliant and supports some GNU extensions) to K (-k clauses are matched till first success) -dk - set default diagram kind -fromTime - filter records whose time is >= this time (formatted according to -tf) -toTime - filter records whose time is < this time (formatted according to -tf) Input format: lines of the following form: 1234 >A - at time 1234, activity A has begun 1234 <A - at time 1234, activity A has ended 1234 !B - at time 1234, pulse event B has occured 1234 @B COLOR - at time 1234, the status of B became such that it is appropriate to draw it with color COLOR :) 1234 =C VAL - at time 1234, parameter C had numeric value VAL (for example, HTTP response time) 1234 =D `EVENT - at time 1234, event EVENT occured in process D (for example, HTTP response code) It is assumed that many events of the same kind may occur at once. Diagram kinds: ‘none’ - do not plot this track 'event' is for event diagrams: activities are drawn like --[===]--- , pulse events like --|-- 'duration XXXX' - plot any kind of diagram over the *durations* of events on a track (delimited by > ... <) for example 'duration quantile 300 0.25,0.5,0.75' will plot these quantiles of durations of the events. This is useful where your log looks like 'Started processing' ... 'Finished processing': you can plot processing durations without computing them yourself. 'duration[C] XXXX' - same as 'duration', but of a track's name we only take the part before character C. For example, if you have processes named 'MACHINE-PID' (i.e. UNIT027-8532) say 'begin something' / 'end something' and you're interested in the properties of per-machine durations, use duration[-]. 'count N' is for activity counts: a 'histogram' is drawn with granularity of N time units, where the bin corresponding to [t..t+N) has value 'what was the maximal number of active events in that interval', or 'what was the number of impulses in that interval'. 'freq N [TYPE]' is for event frequency histograms: a histogram of type TYPE (stacked or clustered, default clustered) is drawn for each time bin of size N, about the distribution of various ` events 'hist N [TYPE]' is for event frequency histograms: a histogram of type TYPE (stacked or clustered, default clustered) is drawn for each time bin of size N, about the counts of various ` events 'quantile N q1,q2,..' (example: quantile 100 0.25,0.5,0.75) - a bar chart of corresponding quantiles in time bins of size N 'binf N v1,v2,..' (example: binf 100 1,2,5,10) - a bar chart of frequency of values falling into bins min..v1, v1..v2, .., v2..max in time bins of size N 'binh N v1,v2,..' (example: binf 100 1,2,5,10) - a bar chart of counts of values falling into bins min..v1, v1..v2, .., v2..max in time bins of size N 'lines' - a simple line plot of numeric values 'dots' - a simple dot plot of numeric values 'cumsum' - a simple line plot of the sum of the numeric values 'sum N' - a simple line plot of the sum of the numeric values in time bins of size N N is measured in units or in seconds.
  • 53. But it’s used the same way Log file INFO 2010-12-02 07:08:10.422 [Pool-1] A task arrived INFO 2010-12-02 07:08:10.440 [Pool-2] A task arrived INFO 2010-12-02 07:08:10.518 [Pool-3] Task finished awk ‘{t=$2 “ “ $3} One-liner /arrived/{print t “ >running”; print t “ !begin/5s”} /finished/{print t “ <running”; print t “ !end/5s”}’ 2010-12-02 07:08:10.422 !begin/5s 2010-12-02 07:08:10.422 >running 2010-12-02 07:08:10.440 !begin/5s Trace file (input for tools) 2010-12-02 2010-12-02 07:08:10.440 07:08:10.518 >running !end/5s 2010-12-02 07:08:10.518 <running One-liner tplot -dk ‘count 5’ -if - -of x -or 1400x800 Pretty picture
  • 54. Trace format Track 2010-12-15 21:10:34.938 =fruit `Oranges Time Event
  • 55. Event types !error >computing <computing Activity start Activity end Impulse @UNIT05 green Colored activity start t =t 36.6 fruit 36.6 =fruit `Oranges Apples Oranges Cucumbers Melons Potatoes Continuous observation Categorical (discrete) observation
  • 56. How to map logs to that? Log: Starting “Reduce” phase… • TIME >reduce • When “>” finished, say “<“. Log: TASKID – fetching data… • TIME >num-fetching-data • TIME @state-of-TASKID blue • TIME =state-of-TASKID `fetching-data • TIME =state `fetching-data
  • 57. How to map logs to that? Log: GET /image.php • TIME =url `/image.php • TIME >/image.php-MACHINE.THREADID – Do not fear – see use case later – Say “<“ when you’ve generated the response Log: Accessing database DB002 – error NOTFOUND! • TIME !error-DB002 • TIME =error-DB002 `NOTFOUND • TIME =who-failed `DB002
  • 58. How to map logs to that? Log: Search returned 973 results • TIME =search-results 973 Log: Request took 34ms • TIME =response-time 34
  • 59. But it’s used the same way Log file INFO 2010-12-02 07:08:10.422 [Pool-1] A task arrived INFO 2010-12-02 07:08:10.440 [Pool-2] A task arrived INFO 2010-12-02 07:08:10.518 [Pool-3] Task finished awk ‘{t=$2 “ “ $3} One-liner /arrived/{print t “ >running”; print t “ !begin/5s”} /finished/{print t “ <running”; print t “ !end/5s”}’ 2010-12-02 07:08:10.422 !begin/5s 2010-12-02 07:08:10.422 >running 2010-12-02 07:08:10.440 !begin/5s Trace file (input for tools) 2010-12-02 2010-12-02 07:08:10.440 07:08:10.518 >running !end/5s 2010-12-02 07:08:10.518 <running One-liner tplot -dk ‘count 5’ -if - -of x -or 1400x800 Pretty picture
  • 60. Let tools do the rest • Choose diagram kinds • Map the trace to diagrams • 1 diagram per track -k REGEX1 KIND1 -k REGEX2 KIND2 … -dk DEFAULT-KIND -k search-results ‘quantile 1 0.5,0.75,0.95’ -k return-code ‘freq 1’ -dk none
  • 61. Choose your poison diagram kind ‘none’ - do not plot this track 'event' is for event diagrams: activities are drawn like --[===]--- , pulse events like --|-- 'duration XXXX' - plot any kind of diagram over the *durations* of events on a track (delimited by > ... <) for example 'duration quantile 300 0.25,0.5,0.75' will plot these quantiles of durations of the events. This is useful where your log looks like 'Started processing' ... 'Finished processing': you can plot processing durations without computing them yourself. 'duration[C] XXXX' - same as 'duration', but of a track's name we only take the part before character C. For example, if you have processes named 'MACHINE-PID' (i.e. UNIT027-8532) say 'begin something' / 'end something' and you're interested in the properties of per-machine durations, use duration[-]. 'count N' is for activity counts: a 'histogram' is drawn with granularity of N time units, where the bin corresponding to [t..t+N) has value 'what was the maximal number of active events in that interval', or 'what was the number of impulses in that interval'. 'freq N [TYPE]' is for event frequency histograms: a histogram of type TYPE (stacked or clustered, default clustered) is drawn for each time bin of size N, about the distribution of various ` events 'hist N [TYPE]' is for event frequency histograms: a histogram of type TYPE (stacked or clustered, default clustered) is drawn for each time bin of size N, about the counts of various ` events 'quantile N q1,q2,..' (example: quantile 100 0.25,0.5,0.75) - a bar chart of corresponding quantiles in time bins of size N 'binf N v1,v2,..' (example: binf 100 1,2,5,10) - a bar chart of frequency of values falling into bins min..v1, v1..v2, .., v2..max in time bins of size N 'binh N v1,v2,..' (example: binf 100 1,2,5,10) - a bar chart of counts of values falling into bins min..v1, v1..v2, .., v2..max in time bins of size N 'lines' - a simple line plot of numeric values 'dots' - a simple dot plot of numeric values 'cumsum' - a simple line plot of the sum of the numeric values 'sum N' - a simple line plot of the sum of the numeric values in time bins of size N
  • 62. TL;DR
  • 64. ‘event’ 'event' is for event diagrams: activities are drawn like --[===]--- , pulse events like --|-- Which ‘computation sites’ were active at any given time? 12/9/2010 5:31:25 >site-0 12/9/2010 5:31:25 >site-4 12/9/2010 5:31:25 >site-1 12/9/2010 5:31:25 >site-5 12/9/2010 5:31:25 >site-3 12/9/2010 5:31:25 >site-2 12/9/2010 5:35:27 <site-2 12/9/2010 5:35:28 >site-6 12/9/2010 5:36:14 <site-4 12/9/2010 5:36:15 >site-7 … -k site event
  • 65. ‘event’ Other uses: • How did a long activity influence the rest? – Like “data reloading” etc • Which machines were doing anything at any given time? – Log: “machine X started/finished Y” – Trace: “>X” / “<X” – event : when was X > 0
  • 66. ‘count’ INFO 2010-12-02 07:08:10.422 [Pool-1] A task arrived INFO 2010-12-02 07:08:10.440 [Pool-2] A task arrived INFO 2010-12-02 07:08:10.518 [Pool-3] Task finished … 2010-12-02 07:08:10.422 !begin/5s 2010-12-02 07:08:10.422 >running 2010-12-02 07:08:10.440 !begin/5s 2010-12-02 07:08:10.440 >running 2010-12-02 07:08:10.518 !end/5s 2010-12-02 07:08:10.518 <running … count 5 Tasks started/finished per 5s Max active tasks per 5s
  • 67. ‘lines’ and ‘dots’ -dk lines 2010-11-11 18:00:27.24.343 =arc 1089.3 Nothing special -dk dots
  • 68. ‘sum’ and ‘cumsum’ sum N • lines over sum of values in bins 0..N, N..2N etc seconds cumsum • lines over sum of values from the beginning of the log
  • 69. How much time memcached took on a 360-node cluster, in each 10-second interval 2010-12-09 01:00:57.738 =memcached/10s 0.059 It was quite unstable. -k memcached ‘sum 10’ Ok, actually duration[-] sum 10 Stay tuned!
  • 70. Records flow: read from socket to “uncalibrated” queue  moved to “time-buffered” queue (TQ)  expired  written to console 00013846 58.27768326 [332] Dequeued uncalibrated: 0 58.27768326 =2-moved 0 00013847 58.29270172 [332] TQ dequeued 0 entries 58.29270172 =3-expired 0 00013848 58.57195663 [332] Dequeued uncalibrated: 0 58.57195663 =2-moved 0 00013849 58.57243347 [332] TQ dequeued 59 entries 58.57243347 =3-expired 59 00013850 58.57282257 [332] Dequeued uncalibrated: 0 58.57282257 =2-moved 0 00013851 58.57336044 [332] Read 10000 entries from socket 58.57336044 =1-read 10000 00013852 58.57374191 [332] Read 10000 entries from socket 58.57374191 =1-read 10000 00013853 58.57406998 [332] Read 10000 entries from socket 58.57406998 =1-read 10000 00013854 58.57439423 [332] Read 10000 entries from socket 58.57439423 =1-read 10000 00013855 58.57477188 [332] Read 10000 entries from socket 58.57477188 =1-read 10000 00013856 58.57511139 [332] Writer dequeued 59 records 58.57511139 =4-written 59 -k dropped dots -dk cumsum
  • 71. Records flow: read from socket to “uncalibrated” queue  moved to “time-buffered” queue (TQ)  expired  written to console These two guys keep up. We ‘move’ as fast as we ‘read’.
  • 72. Records flow: read from socket to “uncalibrated” queue  moved to “time-buffered” queue (TQ)  expired  written to console We dequeue ‘expired’ entries about as fast And we could write them Actually same slope, Different scale equally fast But most of the time we don’t have any expired entries!
  • 73. Records flow: read from socket to “uncalibrated” queue  moved to “time-buffered” queue (TQ)  expired  written to console Why are no entries expired here? Because they’re dropped from TQ! We should increase TQ buffer size
  • 74. increase buffer 10x… …et voila Hey, where’s that ‘dropped’ graph? ;)
  • 75. ‘freq’ and ‘hist’ -dk ‘hist 60’ Return codes of a pinger program. 14:05:23 =Code `java.io.IOException
  • 76. two jobs competing for a cluster -k relative ‘freq 5 stacked’ -k absolute ‘hist 5 stacked’ 2010-12-10 00:00:30.422 =absolute `244 2010-12-10 00:00:30.422 =relative `244
  • 77. ‘binh’ and ‘binf’ 08:00:42 =Time 35.8 -dk ‘binh 15 100,500,1000,5000’ -dk ‘binf 15 100,500,1000,5000’
  • 78. -dk ‘binf 60 200,500,1000,2000,5000’
  • 79. ‘quantile’ -dk ‘quantile 3600 0.5’ min/max/med of some supersecret value from Yandex 
  • 80. -dk ‘quantile 60 0.5,0.7,0.9’ How long a “polite” pinger program usually had to wait for a host
  • 81. ‘duration’ • Log: “Started quizzling”, “Finished quizzling” • We wonder about quizzling durations • Trace >quizzle, <quizzle • ‘duration XXX’ plots XXX over durations • Examples: – duration dots – duration sum 10 – duration binh 100,200,500 – duration quantile 1 0.5,0.75,0.95 • duration[SEP] is more universal
  • 82. ‘duration*SEP+’ • Log: “UNIT035 started quizzling”, “UNIT035 Finished quizzling” • We wonder about quizzling durations • Trace >quizzle@UNIT035, <quizzle@UNIT035 • ‘duration[SEP] XXX’ plots XXX over durations • Like ‘duration XXX’ but durations of all actors go to 1 track • Examples: – duration[@] dots – duration[@] sum 10 – duration[@] binh 100,200,500 – duration[@] quantile 1 0.5,0.75,0.95
  • 83. UNIT011 is on blade 1, UNIT051 is on blade 5, memcached is on blade 1. UNIT011 2010-12-09 01:54:41.927 P3964 Info Begin 390256d1-ce56-4f23-8428-1e1b109ab61c/51 <memcached-UNIT011.P3964 UNIT011 2010-12-09 01:54:41.928 P3964 Debug GetCommonData 390256d1-ce56-4f23-8428-1e1b109ab61c/51 UNIT051 2010-12-09 01:54:42.045 P3832 Info Begin 390256d1-ce56-4f23-8428-1e1b109ab61c/99 UNIT051 2010-12-09 01:54:42.045 P3164 Info Begin 390256d1-ce56-4f23-8428-1e1b109ab61c/98 memcached-UNIT011 UNIT051 2010-12-09 01:54:42.046 P3164 Debug GetCommonData 390256d1-ce56-4f23-8428-1e1b109ab61c/98 UNIT051 2010-12-09 01:54:42.046 P3832 Debug GetCommonData 390256d1-ce56-4f23-8428-1e1b109ab61c/99 UNIT011 2010-12-09 01:54:42.132 P2740 Info Begin 390256d1-ce56-4f23-8428-1e1b109ab61c/135 UNIT011 2010-12-09 01:54:42.132 P4032 Info Begin 390256d1-ce56-4f23-8428-1e1b109ab61c/136 UNIT011 2010-12-09 01:54:42.133 P2740 Debug GetCommonData 390256d1-ce56-4f23-8428-1e1b109ab61c/135 UNIT011 2010-12-09 01:54:42.133 P4032 Debug GetCommonData 390256d1-ce56-4f23-8428-1e1b109ab61c/136 awk ‘{t=$2 “ “ $3; p=“memcached-” $1 “.” $4} /Begin /{print t " >” p} /GetCommonData /{print t " <“ p}' 2010-12-09 01:54:41.853 <memcached-UNIT011.P3964 2010-12-09 01:54:41.927 >memcached-UNIT011.P3964 2010-12-09 01:54:42.001 <memcached-UNIT051.P3164 2010-12-09 01:54:42.002 <memcached-UNIT051.P3832 2010-12-09 01:54:42.045 >memcached-UNIT051.P3832 2010-12-09 01:54:42.045 >memcached-UNIT051.P3164 2010-12-09 01:54:42.128 <memcached-UNIT011.P2740 -dk ‘duration[.] binf 10 0.001,0.002,0.005,0.01,0.05‘ So, apparently, memcached access times for blade 1 are smaller. Who’d have thought 
  • 84. To reiterate • Take your log • Trivially map it to a trace (I use ‘awk’) /PATTERN/{print something to trace} • Choose diagram kinds and map trace to them -k REGEX KIND, -dk DEFAULT-KIND • Plot! cat log.txt | awk ‘…’ | tplot -k …
  • 86. How to specify input? -if - read trace from stdin -if FILE read trace from FILE
  • 87. How to specify output? -of x output to a window (install with --flags=gtk) -o FILE.{png,svg,pdf,ps} output to a file
  • 88. How to produce a bigger image? -or WIDTHxHEIGHT Output resolution (default 640x480)
  • 89. How to specify time format? -tf num time is a real number -tf ‘date %Y-%m-%d %H:%M:%OS’ time is a date in format of strptime
  • 90. What if I need just a part of the log? -fromTime ‘2010-09-12 14:33:00’ -toTime ‘2010-09-12 16:00:00’ Specify one or both, in format of -tf. (much better than doing this in the shell pipeline)
  • 91. How to draw multiple graphs for a single track? Use +dk and +k REGEX KIND instead of -dk and -k. EXPLANATION: a track is drawn acc. to all matching +k, to +dk AND ALSO to the first matching -k, or -dk if none of -k match
  • 92. Example cat log.txt | grep 'UNIT0[15]1' | awk '/Begin / {print $3 " " $4 " >memcached-" $1 "." $9} /GetCommonData /{print $3 " " $4 " <memcached-" $1 "." $9}' | tplot -if - -dk 'duration[.] binf 10 0.001,0.002,0.005,0.01,0.05' -of x
  • 93. Example Deliver 244.390256d1-ce56-4f23-8428-1e1b109ab61c/51 awk '{time=$3 " " $4; core=$1 "-" $9} /Deliver/{id=$NF; sub(/..*/,"",id); print time " =relative `" id; print time " =absolute `" id }' "$f" | tplot -if - -k relative 'freq 5 stacked' -k absolute 'hist 5 stacked' -o "$f.share.png" P.S. Or you could use “+k” instead of “-k”: +dk “freq 5 stacked” +dk “hist 5 stacked”
  • 94. How much data can they handle? 200-300Mb is ok. If you have more, split it. Into 10-minute bins: awk '{hour=substr($4,0,4); sub(/:/,"-",hour); print >(FILENAME "-" hour "0")}' $log
  • 95. P.S. Installation on Linux • Install http://hackage.haskell.org/platform/ – Better from source (not from package manager – may be outdated) • cabal update • cabal install alex • cabal install gtk2hs-buildtools • Add /home/YOURSELF/.cabal/bin to PATH – Don’t use ~/.cabal/bin • Restart your shell (to update PATH) • Install dev libraries for glib, cairo, pango, gtk – sudo apt-get install libglib2.0-dev libcairo2-dev libpango1.0-dev libgtk2.0-dev • cabal install gtk • cabal install [--flags=gtk] timeplot • cabal install splot
  • 96. P.S. Installation on Windows • Install http://hackage.haskell.org/platform/ • Follow http://jystic.com/2010/10/20/installing-gtk2hs-on-windows/ • cabal update • cabal install [--flags=gtk] timeplot • cabal install splot • Install http://www.cygwin.com/ for awk or whatever • Use cygwin shell
  • 97. P.S. Installation on Mac • Install XCode – Or install the development toolchain (gcc, linker etc) in a different way, installing xcode is easiest. • Proceed as on Linux
  • 98. P.S. What if I have trouble installing it? Please tell me. I’ll help you and fix this document accordingly. ekirpichov@gmail.com
  • 99. That’s it Thanks If you liked the presentation, The best way to say your “thanks” is to use the tools and to spread the word OK, the really best way is to contribute http://github.com/jkff/timeplot http://github.com/jkff/splot