SlideShare a Scribd company logo
1 of 45
Download to read offline
Korea Woman Training Center,
Kangwondo, Korea Dec 16th-19th, 2012



        The road of multi/many Core
                computing


                                       Osvaldo Gervasi
                    Dept of Mathematics and Computer Science
                                       University of Perugia
                                         osvaldo@unipg.it
Outline
●   Introduction
●   General Purpose GPU Computing
●   GPU and CPU evolution
●   OpenCL: a standard for programming
    heterogeneous devices
●   A case study: Advanced Encryption Standard
    (AES)
●   The OpenSSL library
●   Scheduling issues
●   Conclusions
         O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   2
Moore's law




                                                                 Source: Wikipedia

O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   3
GPGPU Computing
●   GPU computing or GPGPU “is the use of a GPU
    (graphics processing unit) to carry on general
    purpose scientific and engineering computing”
    [Nvidia].




                                                          Sapphire ATI Radeon HD 4550 GPU
        Nvidia TESLA GPU

        O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   4
Heterogeneous Computing
●   Heterogeneous Computing is the transparent
    use of all computational devices to carry out
    general purpose scientific and engineering
    computing.                              Very promising
Android                                                                                architecture for
 and                                                                            hererogeneous computing:
Ubuntu
                                                                                  it is built on 32nm low-
support
                                                                                   power HKMG (High-K
                                                                                      Metal Gate), and
Enables                                                                             features a dual-core
Google                                                                          1.7GHz mobile CPU built
 Open                                                                           on ARM® Cortex™-A15
Source                                                                              architecture plus an
project                                                                          integrated ARM Mali™-
                                                                                 T604 GPU for increased
       The Arndale Board based on ARM Cortex-A15 with                           performance density and
                                                                                      energy efficiency
      Mali-T604 Samsung Exynos 5250 development platform
          O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012             5
The future of Super Computing
       Centers: the MontBlanc EU project
●   Heterogeneous Computing and minimization of
    power consumption: the new HPC Center of the
    future!    http://www.montblanc-project.eu

                                                                                         MontBlanc
                                                                                        selected the
                                                                                         Samsung
                                                                                         Exynos 5
                                                                                        Processors




        O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012         6
NVIDIA Tegra T3




    Linux
  support:
  Linux for
    Tegra
    (L4T)


Quad-core NVIDIA Tegra T3 based Embedded Toradex Colibri T30 Computer On Module, announced on
January 31, 2012. The cores are ARM Cortex-A9. The GPU is a 520 ULP GeForce.
Audi had selected the Tegra 3 processor for its in-vehicle infotainment systems and digital instruments
display. The processor will be integrated into Audi's entire line of vehicles worldwide, beginning in 2013.



               O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012         7
General Purpose GPU Computing
●   GPUs are:
    –   cheap and powerful
    –   ready to use
    –   highly parallel (thousands of cores)
    –   suitable for SIMD applications
●   SIMD architectures may help solving a large set of
    computational problems:
    –   Data Mining
    –   Cryptography
    –   Earth sciences
    –   Montecarlo simulations
    –   Astrophysics ….
           O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   8
GPU evolution vs. CPU evolution




O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   9
Computational Graphics

Formal Definition
The production of bitmap images based on data
acquired from an external source or computed by
means of a computational model
Phases
●   Definition of the objects in the scene
●   Image rendering
Graphic Pipeline
●   Set of operations for the graphic rendering
       O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   10
Rendering operations
●   Transfer of the scene description: the set of vertex defining
    the objects, the data associated to the scene illumination, the
    textures, the observer's point of view.
●   Vertex transformations: rotations, scaling and objects'
    translation
●   Clipping: elimination of the objects or parts of them not
    visible from the observer's point of view.
●   Lighting and shading: evaluation of the interactions of the
    light sources with the shapes, evaluating their shadowing.
●   Rasterization: generation of the bitmap image. 3D
    coordinates are transformed in 2D coordinates. Textures and
    other graphic effects are also applied.


          O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   11
GPU's evolution

●   Starting from 1995, 3D graphics performance
    issues emerged thanks to the success of video
    games.
●   The OpenGL and DirectX specifications were
    released, hiding the complexity of programming
    3D graphics accelerators
●   The graphic pipeline started to be executed in
    the GPU



         O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   12
GPU's evolution
●   In 2000 the shading operations are included in the
    GPU capabilities:
    –   Vertex shading: manages and transforms the vertex
        positions in an object
    –   Pixel or fragment shading: manages the image pixels,
        enabling the texture mapping
    –   Geometrical shading: starting from the vertex of a
        given object builds more complex objects.
●   Shading capabilities became programmable
    –   each shader were executed on dedicated units
    –   GPUs became flexible almost like CPUs

          O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   13
GPU's evolution
●   In 2005 the Unified Shader Model is introduced:
    the various shading operations are performed
    using a common set of APIs.
●   The shading units are all identical.
●   In 2007 the concept of General Purpose GPU
    became a reality:
    –   NVIDIA released Compute Unified Device Architecture
        (CUDA)
    –   AMD released Brook+
    –   These frameworks allow to access the compute
        devices for general purpose calculations.
          O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   14
The multicore era

●   In the same years (2005-2007) CPUs became
    multicore




Intel Yonah (Core Duo) low-                                         AMD Opteron 2212 introduced in
power dual core processor,                                          August 2006. The first AMD dual
introduced on January 2006                                          core (opteron 875) was released on
Intel Core i7-3920XM Processor                                      April 2005.
Extreme Edition has 6 cores/12                                      AMD Opteron 6366 HE has been
threads (Q4'12).                                                    announced Nov. 5, 2012 and has 16
                                                                    cores and high energy efficiency.
          O. Gervasi, University of Perugia -   FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012    15
Transparent programming of
                   heterogeneous devices
●   In 2008 Khronos Compute Working Group released
    the Open Computing Language (OpenCL)
OpenCL is the first open, royalty-free standard for
cross-platform, parallel programming of modern
processors found in personal computers, servers and
hand-held/embedded devices. [Khronos].




        O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   16
OpenCL by Kronos Group




●   Multi/Many Core Heterogeneous Computing Standard
●   Runs on several devices (CPU, GPU, DSP, etc)
●   Cross vendor (nVidia, AMD, Intel, etc)
●   Portable (Linux, Windows, MacOS)
        O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   17
OpenCL architecture
●   Platform model: abstraction of computing devices
    managed by a single host
●   Execution model: defines the instructions set to be
    executed by the OpenCL devices (kernel) and the
    instructions initializing and controlling the kernels'
    execution (host program).
●   Memory model: defines the memory objects, types of
    memory and how the host and the devices access them.
●   Programming model: defines the type of parallel
    execution performed (on data or on tasks).
●   Framework model: set of APIs and C99 extensions to
    implement host and kernel programs.
          O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   18
The Platform model




The platform model defines the roles of the host and
the devices and provides an abstract hardware
model for devices
      O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   19
The execution model

●   Host program: set of instructions which initialize and
    manage the execution environment of the Compute Device
●   Kernel program: set of instructions executed by the
    Compute Devices
●   The Host prepares the various kernels' execution
●   Each Compute Device executes the kernel
●   Calculations are made by the Work-items (which are
    grouped in Work-groups) each work item executes the
    same program on different data



          O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   20
The memory model
            workgroup 1                         workgroup N



                                                                                  Work-item scope




                                                                                  Work-group scope




                                                                                  Kernel scope




O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012               21
The Framework model

●   Extensions to C99:
    –   Vector data type
    –   Image data type
    –   Conformance to the IEEE-754 - IEEE Standard for
        Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-
        1985)
    –   Memory objects
●   Limitations respect C99:
    –   No recursion
    –   No standard libraries

          O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   22
A case study: AES
●   The Advanced Encryption Standard (AES) algorithm
    plays a big role in the current encryption communications
    and security technologies.
●   Standard FIPS-197
●   The algorithm has been developed by Joan Daemen and
    Vincent Rijmen that submitted it with the ”Rindael”
    codename.
●   The algorithm is symmetric, iterate, block based.
●   Data blocks of 128 bit, Keys of 128, 192 or 256 bits.
●   Due to its characteristics, it can greatly benefit from a
    parallel implementation and in particular from a GPU
    implementation.
          O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   23
The AES algorithm
State = input
                                                              each byte of the state is combined with the
AddRoundKey ( State , RoundKey [ 0 ] )                        round key using bitwise XOR
for r = 1 to rounds−1
                                       a non-linear substitution step where each byte is       replaced with
      SubBytes ( State )               another according to a lookup table
                                       a transposition step where each row of the state is shifted
      ShiftRows ( State )              cyclically a certain number of steps.
      MixColumns ( State )             a mixing operation which operates on the columns of the state,
                                             combining the four bytes in each column
      AddRoundKey ( State , RoundKey [ r ] )                     each byte of the state is combined with the
                                                                 round key using bitwise XOR
end
                               a non-linear substitution step where each byte is       replaced with
SubBytes ( State )             another according to a lookup table
ShiftRows ( State )            a transposition step where each row of the state is shifted
                               cyclically a certain number of steps.
AddRoundKey ( State          , RoundKey [ rounds ] ) each byte of the state is combined with the
                                                               round key using bitwise XOR
output = State

            O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012             24
Implementation

●   Read input file (plain text or ciphered)
●   Read AES parameters
●   Transfer memory objects to device global memory
●   Key expansion
●   Perform kernel on the OpenCl device
●   Transfer memory objects from device global
    memory



         O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   25
Performance tests
●   Hardware description
     –   ATI Firestream 9270 (vendor implementation of OpenCL)
     –   Nvidia GeForce 8600 GT (vendor implementation of OpenCL)
     –   CPU Intel Duo E8500 (AMD OpenCL driver)
Device ATI RV770                                             Device GeForce 8600 GT
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 10                              CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_WORK_ITEM_SIZES: 256 / 256 / 256               CL_DEVICE_MAX_COMPUTE_UNITS: 4
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256                           CL_DEVICE_MAX_WORK_ITEM_SIZES: 
CL_DEVICE_MAX_CLOCK_FREQUENCY: 750 MHz
CL_DEVICE_IMAGE_SUPPORT: 0                                   512 / 512 / 64
CL_DEVICE_GLOBAL_MEM_SIZE: 512 MByte                         CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 256 Mbyte                      CL_DEVICE_MAX_CLOCK_FREQUENCY: 1188 
CL_DEVICE_QUEUE_PROPERTIES:                                  MHz
CL_QUEUE_PROFILING_ENABLE                                    CL_DEVICE_IMAGE_SUPPORT: 1
Device Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz            CL_DEVICE_GLOBAL_MEM_SIZE: 255 
CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU                           MByte
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 1024            CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024                          CL_DEVICE_QUEUE_PROPERTIES: 
CL_DEVICE_MAX_CLOCK_FREQUENCY: 3166 MHz                      CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 0
CL_DEVICE_GLOBAL_MEM_SIZE: 1024 MByte
CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte
CL_DEVICE_QUEUE_PROPERTIES:
CL_QUEUE_PROFILING_ENABLE


               O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   26
Performance tests




O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   27
Performance tests
    Ignoring the time spent to copy data in memory




O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   28
AES performance tests
   Including the time spent to copy data in memory




O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   29
AES performance tests




O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   30
OpenSSL library

●   FLOSS Security since 1998
●   SSL TSL Tookit
●   Projects based on openssl:                                                                   penC
                                                                                                     L
                                                                                              nO
                                                                  ased o
    –   Apache (mod_ssl)                                     ine b
                                                      S L Eng
        OpenVPN                                   penS reated
                                              An O en c
    –

    –   SSH                                   has be

    …..




         O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012     31
Performance tests

●   The performance tests have been carried out using the
    speed benchmark tool distributed with the openssl library.
●   We used the following hardware:
    –   Intel T7300 Core 2 Duo a 2.00GHz, 2 GB RAM DDR2
    –   Intel i7 870 Quad Core (Hyperthreading) 3.0GHz,                                     4 GB di
        RAM DDR3
    –   Nvidia GTX 580 (16 Compute Units) 772MHz (512 Processing
        Element or Stream Processors at 1.5GHz) 1.5 GB VRAM
        DDR5
    –   Two versions of the algorithm have been impleted: Sbox
        defined in the Constant Memory and the same installed on the
        Global Memory.
           O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012     32
Performance tests

       Data processed as a function of the packet size




O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   33
Performance tests

Speed-up of the same GPU running on 2 separate CPUs




O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   34
Performance tests
Measure of the data transfer from the host memory (RAM) to the device memory (VRAM) and vice-versa:

                      This is a measure of the overhead of the memory transfer




        O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012        35
Performance tests
                 Speed-up of two variants of the algorithm
          (Sbox defined in Constant Memory or in Global memory)




O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   36
Scheduling issues

●   The impressive amount of resources available
    through the GPGPU approach addresses important
    issues related to the efficiency of scheduling of
    modern operating systems in hybrid architectures.
●   Usually it is up to the user decide the type of device
    to use. This is resulting in an inefficient or
    inappropriate scheduling process and to a not
    optimized usage of hardware resources.
●   We are studying an H-system simulator to test
    scheduling algorithms for hybrid systems.

         O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   37
The simulator HPSim

●   The model aims to simulate a H-system
    composed by:
    –   a set of processors (CPUs) and graphics cards
        (GPUs) used as compute units to execute
        heterogeneous jobs
    –   a classifier selecting the type of compute device
        (CPU or GPU)
    –   a scheduler which implements the policy to be
        evaluated.



          O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   38
The simulator HPSim

●   The proposed CPU-GPU simulation model is
    defined in terms of:
    –   a set of state variables describing the system
             –   Devices
             –   Jobs
             –   Queues
             –   Scheduler
    –   a state transition function which determines its
        progression through a finite set of discrete events




           O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   39
The simulator HPSim
●   The simulator provides the following features:
    –   Creation of the user-specified hardware in terms of number of
        CPUs and GPUs.
    –   Generation of the system load, setting the number of jobs.
    –   Tuning of the inter-arrival Job time.
    –   Selection of the Job composition. It allows to specify the
        probability to generate a given number of Realtime, GPU
        User and CPU User Job.
    –   Setting Classifier simulation.
    –   Selection of qt strategy.



            O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   40
The simulator HPSim
●   We are focusing our work on three main aspects
    –   We implemented a simple use-case considering a single
        non-preemptive priority queue. We are working to
        increase the possible cases.
    –   We are carrying out a study of inter-arrival of real
        systems and the implementation of the linux scheduler
        (CFS) to validate the simulator.
    –   We are adding new features to the simulator:
            –   New scheduling policies
            –   Implementation of a graphic interface
            –   Automatic tools for the generation of charts for the analysis of the
                performance of the scheduling strategies.


           O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   41
Conclusions
●   The technology trend is moving towards an high parallelism
    and high energy efficiency, particularly on mobile devices.
●   CPUs make available several cores...
●   GPUs make available an impressive high number of
    specialized cores, suitable for general purpose applications,
    particularly efficient on a SIMD approach.
●   OpenCL makes feasible to use any type of such
    computational resources in a portable and transparent way...
●   A new, very promising scenario is rising ...




           O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   42
Acknowledgement




Dr. Flavio Vella                                         Prof. Sergio Tasso


                High Performance Computing Lab
  Department of Mathematics and Computer Science
                             Unversity of Perugia
    O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   43
Invitation to submit papers to
                     ICCSA 2013
                      Ho Chi Minh City, June 24-27 2012




                http://www.iccsa.org

Deadline for submissions: Jan 15, 2013

    O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   44
Thank you!!!

O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012   45

More Related Content

Viewers also liked

VA Education Benefits
VA Education BenefitsVA Education Benefits
VA Education Benefitsgranimal
 
Studies in application of Augmented Reality in E-Learning Courses
Studies in application of Augmented Reality in E-Learning CoursesStudies in application of Augmented Reality in E-Learning Courses
Studies in application of Augmented Reality in E-Learning CoursesHimanshu Bansal
 
Архитектура А/Б тестирования: сделай сам
Архитектура А/Б тестирования: сделай самАрхитектура А/Б тестирования: сделай сам
Архитектура А/Б тестирования: сделай самSergey Xek
 
Discussion continuum: Obesitat
Discussion continuum: ObesitatDiscussion continuum: Obesitat
Discussion continuum: ObesitatXplore Health
 
Health Tips by VisitHazara.Com
Health Tips by VisitHazara.ComHealth Tips by VisitHazara.Com
Health Tips by VisitHazara.ComMuneer Qureshi
 
KVH プラグ&トレード™
KVH プラグ&トレード™KVH プラグ&トレード™
KVH プラグ&トレード™KVH Co. Ltd.
 
Półmaraton 2013(1)
Półmaraton 2013(1)Półmaraton 2013(1)
Półmaraton 2013(1)franekmaster
 
Discussion continuum - Kto placi za rozwoj lekow
Discussion continuum - Kto placi za rozwoj lekowDiscussion continuum - Kto placi za rozwoj lekow
Discussion continuum - Kto placi za rozwoj lekowXplore Health
 
CSR-friendly tax policy: Unlocking value and aligning interests
CSR-friendly tax policy: Unlocking value and aligning interestsCSR-friendly tax policy: Unlocking value and aligning interests
CSR-friendly tax policy: Unlocking value and aligning interestsWayne Dunn
 
Lab vocabulary - Classroom activities
Lab vocabulary - Classroom activitiesLab vocabulary - Classroom activities
Lab vocabulary - Classroom activitiesXplore Health
 
More photos
More photosMore photos
More photossstjohn
 
Imperia esfera gurgaon 37 C 7428424386
Imperia esfera gurgaon 37 C 7428424386Imperia esfera gurgaon 37 C 7428424386
Imperia esfera gurgaon 37 C 7428424386Adore Global Pvt. Ltd
 
OpenStack系列公开课2 -20130508
OpenStack系列公开课2 -20130508OpenStack系列公开课2 -20130508
OpenStack系列公开课2 -20130508OpenCity Community
 
Manifestação da República do Paraná contra Lula
Manifestação da República do Paraná contra LulaManifestação da República do Paraná contra Lula
Manifestação da República do Paraná contra LulaMiguel Rosario
 

Viewers also liked (20)

VA Education Benefits
VA Education BenefitsVA Education Benefits
VA Education Benefits
 
Studies in application of Augmented Reality in E-Learning Courses
Studies in application of Augmented Reality in E-Learning CoursesStudies in application of Augmented Reality in E-Learning Courses
Studies in application of Augmented Reality in E-Learning Courses
 
Архитектура А/Б тестирования: сделай сам
Архитектура А/Б тестирования: сделай самАрхитектура А/Б тестирования: сделай сам
Архитектура А/Б тестирования: сделай сам
 
Discussion continuum: Obesitat
Discussion continuum: ObesitatDiscussion continuum: Obesitat
Discussion continuum: Obesitat
 
Health Tips by VisitHazara.Com
Health Tips by VisitHazara.ComHealth Tips by VisitHazara.Com
Health Tips by VisitHazara.Com
 
KVH プラグ&トレード™
KVH プラグ&トレード™KVH プラグ&トレード™
KVH プラグ&トレード™
 
Traplate
TraplateTraplate
Traplate
 
Slackline
SlacklineSlackline
Slackline
 
DA_31May2016
DA_31May2016DA_31May2016
DA_31May2016
 
Półmaraton 2013(1)
Półmaraton 2013(1)Półmaraton 2013(1)
Półmaraton 2013(1)
 
Discussion continuum - Kto placi za rozwoj lekow
Discussion continuum - Kto placi za rozwoj lekowDiscussion continuum - Kto placi za rozwoj lekow
Discussion continuum - Kto placi za rozwoj lekow
 
CSR-friendly tax policy: Unlocking value and aligning interests
CSR-friendly tax policy: Unlocking value and aligning interestsCSR-friendly tax policy: Unlocking value and aligning interests
CSR-friendly tax policy: Unlocking value and aligning interests
 
2º residencia granada
2º residencia granada2º residencia granada
2º residencia granada
 
Lab vocabulary - Classroom activities
Lab vocabulary - Classroom activitiesLab vocabulary - Classroom activities
Lab vocabulary - Classroom activities
 
More photos
More photosMore photos
More photos
 
Loaf of bread
Loaf of breadLoaf of bread
Loaf of bread
 
Imperia esfera gurgaon 37 C 7428424386
Imperia esfera gurgaon 37 C 7428424386Imperia esfera gurgaon 37 C 7428424386
Imperia esfera gurgaon 37 C 7428424386
 
LBC: Kudavi
LBC: Kudavi LBC: Kudavi
LBC: Kudavi
 
OpenStack系列公开课2 -20130508
OpenStack系列公开课2 -20130508OpenStack系列公开课2 -20130508
OpenStack系列公开课2 -20130508
 
Manifestação da República do Paraná contra Lula
Manifestação da República do Paraná contra LulaManifestação da República do Paraná contra Lula
Manifestação da República do Paraná contra Lula
 

Similar to The road to multi/many core computing

Compressing of Magnetic Resonance Images with Cuda
Compressing of Magnetic Resonance Images with CudaCompressing of Magnetic Resonance Images with Cuda
Compressing of Magnetic Resonance Images with Cudaijtsrd
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Stefano Di Carlo
 
Benchmark of common AI accelerators: NVIDIA GPU vs. Intel Movidius
Benchmark of common AI accelerators: NVIDIA GPU vs. Intel MovidiusBenchmark of common AI accelerators: NVIDIA GPU vs. Intel Movidius
Benchmark of common AI accelerators: NVIDIA GPU vs. Intel MovidiusbyteLAKE
 
Lecture2 cuda spring 2010
Lecture2 cuda spring 2010Lecture2 cuda spring 2010
Lecture2 cuda spring 2010haythem_2015
 
Cloud, Distributed, Embedded: Erlang in the Heterogeneous Computing World
Cloud, Distributed, Embedded: Erlang in the Heterogeneous Computing WorldCloud, Distributed, Embedded: Erlang in the Heterogeneous Computing World
Cloud, Distributed, Embedded: Erlang in the Heterogeneous Computing WorldOmer Kilic
 
13th kandroid OpenGL and EGL
13th kandroid OpenGL and EGL13th kandroid OpenGL and EGL
13th kandroid OpenGL and EGLJungsoo Nam
 
IIT ropar_CUDA_Report_Ankita Dewan
IIT ropar_CUDA_Report_Ankita DewanIIT ropar_CUDA_Report_Ankita Dewan
IIT ropar_CUDA_Report_Ankita DewanAnkita Dewan
 
IIT ropar_CUDA_Report_Ankita Dewan
IIT ropar_CUDA_Report_Ankita DewanIIT ropar_CUDA_Report_Ankita Dewan
IIT ropar_CUDA_Report_Ankita DewanAnkita Dewan
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeOfer Rosenberg
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit pptSandeep Singh
 
Rocketick accelerated verilog simulations
Rocketick  accelerated verilog simulationsRocketick  accelerated verilog simulations
Rocketick accelerated verilog simulationschiportal
 
IRJET- FPGA based Controller Design for Mobile Robots
IRJET- FPGA based Controller Design for Mobile RobotsIRJET- FPGA based Controller Design for Mobile Robots
IRJET- FPGA based Controller Design for Mobile RobotsIRJET Journal
 
Droidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon Berlin
 
Performance and power comparisons between nvidia and ati gpus
Performance and power comparisons between nvidia and ati gpusPerformance and power comparisons between nvidia and ati gpus
Performance and power comparisons between nvidia and ati gpusijcsit
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDASavith Satheesh
 
GPU Computing: An Introduction
GPU Computing: An IntroductionGPU Computing: An Introduction
GPU Computing: An Introductionijtsrd
 

Similar to The road to multi/many core computing (20)

Compressing of Magnetic Resonance Images with Cuda
Compressing of Magnetic Resonance Images with CudaCompressing of Magnetic Resonance Images with Cuda
Compressing of Magnetic Resonance Images with Cuda
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
 
GPU Programming with Java
GPU Programming with JavaGPU Programming with Java
GPU Programming with Java
 
Benchmark of common AI accelerators: NVIDIA GPU vs. Intel Movidius
Benchmark of common AI accelerators: NVIDIA GPU vs. Intel MovidiusBenchmark of common AI accelerators: NVIDIA GPU vs. Intel Movidius
Benchmark of common AI accelerators: NVIDIA GPU vs. Intel Movidius
 
Lecture2 cuda spring 2010
Lecture2 cuda spring 2010Lecture2 cuda spring 2010
Lecture2 cuda spring 2010
 
Cloud, Distributed, Embedded: Erlang in the Heterogeneous Computing World
Cloud, Distributed, Embedded: Erlang in the Heterogeneous Computing WorldCloud, Distributed, Embedded: Erlang in the Heterogeneous Computing World
Cloud, Distributed, Embedded: Erlang in the Heterogeneous Computing World
 
PG-Strom
PG-StromPG-Strom
PG-Strom
 
Apu fc & s project
Apu fc & s projectApu fc & s project
Apu fc & s project
 
13th kandroid OpenGL and EGL
13th kandroid OpenGL and EGL13th kandroid OpenGL and EGL
13th kandroid OpenGL and EGL
 
IIT ropar_CUDA_Report_Ankita Dewan
IIT ropar_CUDA_Report_Ankita DewanIIT ropar_CUDA_Report_Ankita Dewan
IIT ropar_CUDA_Report_Ankita Dewan
 
IIT ropar_CUDA_Report_Ankita Dewan
IIT ropar_CUDA_Report_Ankita DewanIIT ropar_CUDA_Report_Ankita Dewan
IIT ropar_CUDA_Report_Ankita Dewan
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit ppt
 
Rocketick accelerated verilog simulations
Rocketick  accelerated verilog simulationsRocketick  accelerated verilog simulations
Rocketick accelerated verilog simulations
 
Fo2410191024
Fo2410191024Fo2410191024
Fo2410191024
 
IRJET- FPGA based Controller Design for Mobile Robots
IRJET- FPGA based Controller Design for Mobile RobotsIRJET- FPGA based Controller Design for Mobile Robots
IRJET- FPGA based Controller Design for Mobile Robots
 
Droidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imagination
 
Performance and power comparisons between nvidia and ati gpus
Performance and power comparisons between nvidia and ati gpusPerformance and power comparisons between nvidia and ati gpus
Performance and power comparisons between nvidia and ati gpus
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
 
GPU Computing: An Introduction
GPU Computing: An IntroductionGPU Computing: An Introduction
GPU Computing: An Introduction
 

Recently uploaded

Amil Baba in Pakistan Kala jadu Expert Amil baba Black magic Specialist in Is...
Amil Baba in Pakistan Kala jadu Expert Amil baba Black magic Specialist in Is...Amil Baba in Pakistan Kala jadu Expert Amil baba Black magic Specialist in Is...
Amil Baba in Pakistan Kala jadu Expert Amil baba Black magic Specialist in Is...Amil Baba Company
 
Gripping Adult Web Series You Can't Afford to Miss
Gripping Adult Web Series You Can't Afford to MissGripping Adult Web Series You Can't Afford to Miss
Gripping Adult Web Series You Can't Afford to Missget joys
 
Amil baba in Pakistan amil baba Karachi amil baba in pakistan amil baba in la...
Amil baba in Pakistan amil baba Karachi amil baba in pakistan amil baba in la...Amil baba in Pakistan amil baba Karachi amil baba in pakistan amil baba in la...
Amil baba in Pakistan amil baba Karachi amil baba in pakistan amil baba in la...Amil Baba Company
 
Aesthetic Design Inspiration by Slidesgo.pptx
Aesthetic Design Inspiration by Slidesgo.pptxAesthetic Design Inspiration by Slidesgo.pptx
Aesthetic Design Inspiration by Slidesgo.pptxsayemalkadripial4
 
NO,1 Amil baba in Lahore Astrologer in Karachi amil baba in pakistan amil bab...
NO,1 Amil baba in Lahore Astrologer in Karachi amil baba in pakistan amil bab...NO,1 Amil baba in Lahore Astrologer in Karachi amil baba in pakistan amil bab...
NO,1 Amil baba in Lahore Astrologer in Karachi amil baba in pakistan amil bab...Amil Baba Company
 
Taken Pilot Episode Story pitch Document
Taken Pilot Episode Story pitch DocumentTaken Pilot Episode Story pitch Document
Taken Pilot Episode Story pitch Documentf4ssvxpz62
 
The Fine Line Between Honest and Evil Comics by Salty Vixen
The Fine Line Between Honest and Evil Comics by Salty VixenThe Fine Line Between Honest and Evil Comics by Salty Vixen
The Fine Line Between Honest and Evil Comics by Salty VixenSalty Vixen Stories & More
 
NO1 WorldWide Amil baba in pakistan Amil Baba in Karachi Black Magic Islamaba...
NO1 WorldWide Amil baba in pakistan Amil Baba in Karachi Black Magic Islamaba...NO1 WorldWide Amil baba in pakistan Amil Baba in Karachi Black Magic Islamaba...
NO1 WorldWide Amil baba in pakistan Amil Baba in Karachi Black Magic Islamaba...Amil baba
 
定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一
定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一
定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一lvtagr7
 
在线办理曼大毕业证曼尼托巴大学毕业证成绩单留信学历认证
在线办理曼大毕业证曼尼托巴大学毕业证成绩单留信学历认证在线办理曼大毕业证曼尼托巴大学毕业证成绩单留信学历认证
在线办理曼大毕业证曼尼托巴大学毕业证成绩单留信学历认证nhjeo1gg
 
GRADE 7 NEW PPT ENGLISH 1 [Autosaved].pp
GRADE 7 NEW PPT ENGLISH 1 [Autosaved].ppGRADE 7 NEW PPT ENGLISH 1 [Autosaved].pp
GRADE 7 NEW PPT ENGLISH 1 [Autosaved].ppJasmineLinogon
 
Call Girl Price Andheri WhatsApp:+91-9833363713
Call Girl Price Andheri WhatsApp:+91-9833363713Call Girl Price Andheri WhatsApp:+91-9833363713
Call Girl Price Andheri WhatsApp:+91-9833363713Sonam Pathan
 
Biswanath Byam Samiti Open Quiz 2022 by Qui9 Grand Finale
Biswanath Byam Samiti Open Quiz 2022 by Qui9 Grand FinaleBiswanath Byam Samiti Open Quiz 2022 by Qui9 Grand Finale
Biswanath Byam Samiti Open Quiz 2022 by Qui9 Grand FinaleQui9 (Ultimate Quizzing)
 
Call Girls Near The Corus Hotel New Delhi 9873777170
Call Girls Near The Corus Hotel New Delhi 9873777170Call Girls Near The Corus Hotel New Delhi 9873777170
Call Girls Near The Corus Hotel New Delhi 9873777170Sonam Pathan
 
Call Girls Near Taurus Sarovar Portico Hotel New Delhi 9873777170
Call Girls Near Taurus Sarovar Portico Hotel New Delhi 9873777170Call Girls Near Taurus Sarovar Portico Hotel New Delhi 9873777170
Call Girls Near Taurus Sarovar Portico Hotel New Delhi 9873777170Sonam Pathan
 
办理滑铁卢大学毕业证成绩单|购买加拿大文凭证书
办理滑铁卢大学毕业证成绩单|购买加拿大文凭证书办理滑铁卢大学毕业证成绩单|购买加拿大文凭证书
办理滑铁卢大学毕业证成绩单|购买加拿大文凭证书zdzoqco
 
Statement Of Intent - - Copy.documentfile
Statement Of Intent - - Copy.documentfileStatement Of Intent - - Copy.documentfile
Statement Of Intent - - Copy.documentfilef4ssvxpz62
 
Vip Delhi Ncr Call Girls Best Services Available
Vip Delhi Ncr Call Girls Best Services AvailableVip Delhi Ncr Call Girls Best Services Available
Vip Delhi Ncr Call Girls Best Services AvailableKomal Khan
 
原版1:1复刻帕森斯设计学院毕业证Parsons毕业证留信学历认证
原版1:1复刻帕森斯设计学院毕业证Parsons毕业证留信学历认证原版1:1复刻帕森斯设计学院毕业证Parsons毕业证留信学历认证
原版1:1复刻帕森斯设计学院毕业证Parsons毕业证留信学历认证jdkhjh
 

Recently uploaded (20)

Amil Baba in Pakistan Kala jadu Expert Amil baba Black magic Specialist in Is...
Amil Baba in Pakistan Kala jadu Expert Amil baba Black magic Specialist in Is...Amil Baba in Pakistan Kala jadu Expert Amil baba Black magic Specialist in Is...
Amil Baba in Pakistan Kala jadu Expert Amil baba Black magic Specialist in Is...
 
Environment Handling Presentation by Likhon Ahmed.pptx
Environment Handling Presentation by Likhon Ahmed.pptxEnvironment Handling Presentation by Likhon Ahmed.pptx
Environment Handling Presentation by Likhon Ahmed.pptx
 
Gripping Adult Web Series You Can't Afford to Miss
Gripping Adult Web Series You Can't Afford to MissGripping Adult Web Series You Can't Afford to Miss
Gripping Adult Web Series You Can't Afford to Miss
 
Amil baba in Pakistan amil baba Karachi amil baba in pakistan amil baba in la...
Amil baba in Pakistan amil baba Karachi amil baba in pakistan amil baba in la...Amil baba in Pakistan amil baba Karachi amil baba in pakistan amil baba in la...
Amil baba in Pakistan amil baba Karachi amil baba in pakistan amil baba in la...
 
Aesthetic Design Inspiration by Slidesgo.pptx
Aesthetic Design Inspiration by Slidesgo.pptxAesthetic Design Inspiration by Slidesgo.pptx
Aesthetic Design Inspiration by Slidesgo.pptx
 
NO,1 Amil baba in Lahore Astrologer in Karachi amil baba in pakistan amil bab...
NO,1 Amil baba in Lahore Astrologer in Karachi amil baba in pakistan amil bab...NO,1 Amil baba in Lahore Astrologer in Karachi amil baba in pakistan amil bab...
NO,1 Amil baba in Lahore Astrologer in Karachi amil baba in pakistan amil bab...
 
Taken Pilot Episode Story pitch Document
Taken Pilot Episode Story pitch DocumentTaken Pilot Episode Story pitch Document
Taken Pilot Episode Story pitch Document
 
The Fine Line Between Honest and Evil Comics by Salty Vixen
The Fine Line Between Honest and Evil Comics by Salty VixenThe Fine Line Between Honest and Evil Comics by Salty Vixen
The Fine Line Between Honest and Evil Comics by Salty Vixen
 
NO1 WorldWide Amil baba in pakistan Amil Baba in Karachi Black Magic Islamaba...
NO1 WorldWide Amil baba in pakistan Amil Baba in Karachi Black Magic Islamaba...NO1 WorldWide Amil baba in pakistan Amil Baba in Karachi Black Magic Islamaba...
NO1 WorldWide Amil baba in pakistan Amil Baba in Karachi Black Magic Islamaba...
 
定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一
定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一
定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一
 
在线办理曼大毕业证曼尼托巴大学毕业证成绩单留信学历认证
在线办理曼大毕业证曼尼托巴大学毕业证成绩单留信学历认证在线办理曼大毕业证曼尼托巴大学毕业证成绩单留信学历认证
在线办理曼大毕业证曼尼托巴大学毕业证成绩单留信学历认证
 
GRADE 7 NEW PPT ENGLISH 1 [Autosaved].pp
GRADE 7 NEW PPT ENGLISH 1 [Autosaved].ppGRADE 7 NEW PPT ENGLISH 1 [Autosaved].pp
GRADE 7 NEW PPT ENGLISH 1 [Autosaved].pp
 
Call Girl Price Andheri WhatsApp:+91-9833363713
Call Girl Price Andheri WhatsApp:+91-9833363713Call Girl Price Andheri WhatsApp:+91-9833363713
Call Girl Price Andheri WhatsApp:+91-9833363713
 
Biswanath Byam Samiti Open Quiz 2022 by Qui9 Grand Finale
Biswanath Byam Samiti Open Quiz 2022 by Qui9 Grand FinaleBiswanath Byam Samiti Open Quiz 2022 by Qui9 Grand Finale
Biswanath Byam Samiti Open Quiz 2022 by Qui9 Grand Finale
 
Call Girls Near The Corus Hotel New Delhi 9873777170
Call Girls Near The Corus Hotel New Delhi 9873777170Call Girls Near The Corus Hotel New Delhi 9873777170
Call Girls Near The Corus Hotel New Delhi 9873777170
 
Call Girls Near Taurus Sarovar Portico Hotel New Delhi 9873777170
Call Girls Near Taurus Sarovar Portico Hotel New Delhi 9873777170Call Girls Near Taurus Sarovar Portico Hotel New Delhi 9873777170
Call Girls Near Taurus Sarovar Portico Hotel New Delhi 9873777170
 
办理滑铁卢大学毕业证成绩单|购买加拿大文凭证书
办理滑铁卢大学毕业证成绩单|购买加拿大文凭证书办理滑铁卢大学毕业证成绩单|购买加拿大文凭证书
办理滑铁卢大学毕业证成绩单|购买加拿大文凭证书
 
Statement Of Intent - - Copy.documentfile
Statement Of Intent - - Copy.documentfileStatement Of Intent - - Copy.documentfile
Statement Of Intent - - Copy.documentfile
 
Vip Delhi Ncr Call Girls Best Services Available
Vip Delhi Ncr Call Girls Best Services AvailableVip Delhi Ncr Call Girls Best Services Available
Vip Delhi Ncr Call Girls Best Services Available
 
原版1:1复刻帕森斯设计学院毕业证Parsons毕业证留信学历认证
原版1:1复刻帕森斯设计学院毕业证Parsons毕业证留信学历认证原版1:1复刻帕森斯设计学院毕业证Parsons毕业证留信学历认证
原版1:1复刻帕森斯设计学院毕业证Parsons毕业证留信学历认证
 

The road to multi/many core computing

  • 1. Korea Woman Training Center, Kangwondo, Korea Dec 16th-19th, 2012 The road of multi/many Core computing Osvaldo Gervasi Dept of Mathematics and Computer Science University of Perugia osvaldo@unipg.it
  • 2. Outline ● Introduction ● General Purpose GPU Computing ● GPU and CPU evolution ● OpenCL: a standard for programming heterogeneous devices ● A case study: Advanced Encryption Standard (AES) ● The OpenSSL library ● Scheduling issues ● Conclusions O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 2
  • 3. Moore's law Source: Wikipedia O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 3
  • 4. GPGPU Computing ● GPU computing or GPGPU “is the use of a GPU (graphics processing unit) to carry on general purpose scientific and engineering computing” [Nvidia]. Sapphire ATI Radeon HD 4550 GPU Nvidia TESLA GPU O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 4
  • 5. Heterogeneous Computing ● Heterogeneous Computing is the transparent use of all computational devices to carry out general purpose scientific and engineering computing. Very promising Android architecture for and hererogeneous computing: Ubuntu it is built on 32nm low- support power HKMG (High-K Metal Gate), and Enables features a dual-core Google 1.7GHz mobile CPU built Open on ARM® Cortex™-A15 Source architecture plus an project integrated ARM Mali™- T604 GPU for increased The Arndale Board based on ARM Cortex-A15 with performance density and energy efficiency Mali-T604 Samsung Exynos 5250 development platform O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 5
  • 6. The future of Super Computing Centers: the MontBlanc EU project ● Heterogeneous Computing and minimization of power consumption: the new HPC Center of the future! http://www.montblanc-project.eu MontBlanc selected the Samsung Exynos 5 Processors O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 6
  • 7. NVIDIA Tegra T3 Linux support: Linux for Tegra (L4T) Quad-core NVIDIA Tegra T3 based Embedded Toradex Colibri T30 Computer On Module, announced on January 31, 2012. The cores are ARM Cortex-A9. The GPU is a 520 ULP GeForce. Audi had selected the Tegra 3 processor for its in-vehicle infotainment systems and digital instruments display. The processor will be integrated into Audi's entire line of vehicles worldwide, beginning in 2013. O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 7
  • 8. General Purpose GPU Computing ● GPUs are: – cheap and powerful – ready to use – highly parallel (thousands of cores) – suitable for SIMD applications ● SIMD architectures may help solving a large set of computational problems: – Data Mining – Cryptography – Earth sciences – Montecarlo simulations – Astrophysics …. O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 8
  • 9. GPU evolution vs. CPU evolution O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 9
  • 10. Computational Graphics Formal Definition The production of bitmap images based on data acquired from an external source or computed by means of a computational model Phases ● Definition of the objects in the scene ● Image rendering Graphic Pipeline ● Set of operations for the graphic rendering O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 10
  • 11. Rendering operations ● Transfer of the scene description: the set of vertex defining the objects, the data associated to the scene illumination, the textures, the observer's point of view. ● Vertex transformations: rotations, scaling and objects' translation ● Clipping: elimination of the objects or parts of them not visible from the observer's point of view. ● Lighting and shading: evaluation of the interactions of the light sources with the shapes, evaluating their shadowing. ● Rasterization: generation of the bitmap image. 3D coordinates are transformed in 2D coordinates. Textures and other graphic effects are also applied. O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 11
  • 12. GPU's evolution ● Starting from 1995, 3D graphics performance issues emerged thanks to the success of video games. ● The OpenGL and DirectX specifications were released, hiding the complexity of programming 3D graphics accelerators ● The graphic pipeline started to be executed in the GPU O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 12
  • 13. GPU's evolution ● In 2000 the shading operations are included in the GPU capabilities: – Vertex shading: manages and transforms the vertex positions in an object – Pixel or fragment shading: manages the image pixels, enabling the texture mapping – Geometrical shading: starting from the vertex of a given object builds more complex objects. ● Shading capabilities became programmable – each shader were executed on dedicated units – GPUs became flexible almost like CPUs O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 13
  • 14. GPU's evolution ● In 2005 the Unified Shader Model is introduced: the various shading operations are performed using a common set of APIs. ● The shading units are all identical. ● In 2007 the concept of General Purpose GPU became a reality: – NVIDIA released Compute Unified Device Architecture (CUDA) – AMD released Brook+ – These frameworks allow to access the compute devices for general purpose calculations. O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 14
  • 15. The multicore era ● In the same years (2005-2007) CPUs became multicore Intel Yonah (Core Duo) low- AMD Opteron 2212 introduced in power dual core processor, August 2006. The first AMD dual introduced on January 2006 core (opteron 875) was released on Intel Core i7-3920XM Processor April 2005. Extreme Edition has 6 cores/12 AMD Opteron 6366 HE has been threads (Q4'12). announced Nov. 5, 2012 and has 16 cores and high energy efficiency. O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 15
  • 16. Transparent programming of heterogeneous devices ● In 2008 Khronos Compute Working Group released the Open Computing Language (OpenCL) OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and hand-held/embedded devices. [Khronos]. O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 16
  • 17. OpenCL by Kronos Group ● Multi/Many Core Heterogeneous Computing Standard ● Runs on several devices (CPU, GPU, DSP, etc) ● Cross vendor (nVidia, AMD, Intel, etc) ● Portable (Linux, Windows, MacOS) O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 17
  • 18. OpenCL architecture ● Platform model: abstraction of computing devices managed by a single host ● Execution model: defines the instructions set to be executed by the OpenCL devices (kernel) and the instructions initializing and controlling the kernels' execution (host program). ● Memory model: defines the memory objects, types of memory and how the host and the devices access them. ● Programming model: defines the type of parallel execution performed (on data or on tasks). ● Framework model: set of APIs and C99 extensions to implement host and kernel programs. O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 18
  • 19. The Platform model The platform model defines the roles of the host and the devices and provides an abstract hardware model for devices O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 19
  • 20. The execution model ● Host program: set of instructions which initialize and manage the execution environment of the Compute Device ● Kernel program: set of instructions executed by the Compute Devices ● The Host prepares the various kernels' execution ● Each Compute Device executes the kernel ● Calculations are made by the Work-items (which are grouped in Work-groups) each work item executes the same program on different data O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 20
  • 21. The memory model workgroup 1 workgroup N Work-item scope Work-group scope Kernel scope O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 21
  • 22. The Framework model ● Extensions to C99: – Vector data type – Image data type – Conformance to the IEEE-754 - IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754- 1985) – Memory objects ● Limitations respect C99: – No recursion – No standard libraries O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 22
  • 23. A case study: AES ● The Advanced Encryption Standard (AES) algorithm plays a big role in the current encryption communications and security technologies. ● Standard FIPS-197 ● The algorithm has been developed by Joan Daemen and Vincent Rijmen that submitted it with the ”Rindael” codename. ● The algorithm is symmetric, iterate, block based. ● Data blocks of 128 bit, Keys of 128, 192 or 256 bits. ● Due to its characteristics, it can greatly benefit from a parallel implementation and in particular from a GPU implementation. O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 23
  • 24. The AES algorithm State = input each byte of the state is combined with the AddRoundKey ( State , RoundKey [ 0 ] ) round key using bitwise XOR for r = 1 to rounds−1 a non-linear substitution step where each byte is replaced with SubBytes ( State ) another according to a lookup table a transposition step where each row of the state is shifted ShiftRows ( State ) cyclically a certain number of steps. MixColumns ( State ) a mixing operation which operates on the columns of the state, combining the four bytes in each column AddRoundKey ( State , RoundKey [ r ] ) each byte of the state is combined with the round key using bitwise XOR end a non-linear substitution step where each byte is replaced with SubBytes ( State ) another according to a lookup table ShiftRows ( State ) a transposition step where each row of the state is shifted cyclically a certain number of steps. AddRoundKey ( State , RoundKey [ rounds ] ) each byte of the state is combined with the round key using bitwise XOR output = State O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 24
  • 25. Implementation ● Read input file (plain text or ciphered) ● Read AES parameters ● Transfer memory objects to device global memory ● Key expansion ● Perform kernel on the OpenCl device ● Transfer memory objects from device global memory O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 25
  • 26. Performance tests ● Hardware description – ATI Firestream 9270 (vendor implementation of OpenCL) – Nvidia GeForce 8600 GT (vendor implementation of OpenCL) – CPU Intel Duo E8500 (AMD OpenCL driver) Device ATI RV770 Device GeForce 8600 GT CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU CL_DEVICE_MAX_COMPUTE_UNITS: 10 CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU CL_DEVICE_MAX_WORK_ITEM_SIZES: 256 / 256 / 256 CL_DEVICE_MAX_COMPUTE_UNITS: 4 CL_DEVICE_MAX_WORK_GROUP_SIZE: 256 CL_DEVICE_MAX_WORK_ITEM_SIZES:  CL_DEVICE_MAX_CLOCK_FREQUENCY: 750 MHz CL_DEVICE_IMAGE_SUPPORT: 0 512 / 512 / 64 CL_DEVICE_GLOBAL_MEM_SIZE: 512 MByte CL_DEVICE_MAX_WORK_GROUP_SIZE: 512 CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte CL_DEVICE_MAX_MEM_ALLOC_SIZE: 256 Mbyte CL_DEVICE_MAX_CLOCK_FREQUENCY: 1188  CL_DEVICE_QUEUE_PROPERTIES: MHz CL_QUEUE_PROFILING_ENABLE CL_DEVICE_IMAGE_SUPPORT: 1 Device Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz CL_DEVICE_GLOBAL_MEM_SIZE: 255  CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU MByte CL_DEVICE_MAX_COMPUTE_UNITS: 2 CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 1024 CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024 CL_DEVICE_QUEUE_PROPERTIES:  CL_DEVICE_MAX_CLOCK_FREQUENCY: 3166 MHz CL_QUEUE_PROFILING_ENABLE CL_DEVICE_IMAGE_SUPPORT: 0 CL_DEVICE_GLOBAL_MEM_SIZE: 1024 MByte CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 26
  • 27. Performance tests O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 27
  • 28. Performance tests Ignoring the time spent to copy data in memory O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 28
  • 29. AES performance tests Including the time spent to copy data in memory O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 29
  • 30. AES performance tests O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 30
  • 31. OpenSSL library ● FLOSS Security since 1998 ● SSL TSL Tookit ● Projects based on openssl: penC L nO ased o – Apache (mod_ssl) ine b S L Eng OpenVPN penS reated An O en c – – SSH has be ….. O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 31
  • 32. Performance tests ● The performance tests have been carried out using the speed benchmark tool distributed with the openssl library. ● We used the following hardware: – Intel T7300 Core 2 Duo a 2.00GHz, 2 GB RAM DDR2 – Intel i7 870 Quad Core (Hyperthreading) 3.0GHz, 4 GB di RAM DDR3 – Nvidia GTX 580 (16 Compute Units) 772MHz (512 Processing Element or Stream Processors at 1.5GHz) 1.5 GB VRAM DDR5 – Two versions of the algorithm have been impleted: Sbox defined in the Constant Memory and the same installed on the Global Memory. O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 32
  • 33. Performance tests Data processed as a function of the packet size O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 33
  • 34. Performance tests Speed-up of the same GPU running on 2 separate CPUs O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 34
  • 35. Performance tests Measure of the data transfer from the host memory (RAM) to the device memory (VRAM) and vice-versa: This is a measure of the overhead of the memory transfer O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 35
  • 36. Performance tests Speed-up of two variants of the algorithm (Sbox defined in Constant Memory or in Global memory) O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 36
  • 37. Scheduling issues ● The impressive amount of resources available through the GPGPU approach addresses important issues related to the efficiency of scheduling of modern operating systems in hybrid architectures. ● Usually it is up to the user decide the type of device to use. This is resulting in an inefficient or inappropriate scheduling process and to a not optimized usage of hardware resources. ● We are studying an H-system simulator to test scheduling algorithms for hybrid systems. O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 37
  • 38. The simulator HPSim ● The model aims to simulate a H-system composed by: – a set of processors (CPUs) and graphics cards (GPUs) used as compute units to execute heterogeneous jobs – a classifier selecting the type of compute device (CPU or GPU) – a scheduler which implements the policy to be evaluated. O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 38
  • 39. The simulator HPSim ● The proposed CPU-GPU simulation model is defined in terms of: – a set of state variables describing the system – Devices – Jobs – Queues – Scheduler – a state transition function which determines its progression through a finite set of discrete events O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 39
  • 40. The simulator HPSim ● The simulator provides the following features: – Creation of the user-specified hardware in terms of number of CPUs and GPUs. – Generation of the system load, setting the number of jobs. – Tuning of the inter-arrival Job time. – Selection of the Job composition. It allows to specify the probability to generate a given number of Realtime, GPU User and CPU User Job. – Setting Classifier simulation. – Selection of qt strategy. O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 40
  • 41. The simulator HPSim ● We are focusing our work on three main aspects – We implemented a simple use-case considering a single non-preemptive priority queue. We are working to increase the possible cases. – We are carrying out a study of inter-arrival of real systems and the implementation of the linux scheduler (CFS) to validate the simulator. – We are adding new features to the simulator: – New scheduling policies – Implementation of a graphic interface – Automatic tools for the generation of charts for the analysis of the performance of the scheduling strategies. O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 41
  • 42. Conclusions ● The technology trend is moving towards an high parallelism and high energy efficiency, particularly on mobile devices. ● CPUs make available several cores... ● GPUs make available an impressive high number of specialized cores, suitable for general purpose applications, particularly efficient on a SIMD approach. ● OpenCL makes feasible to use any type of such computational resources in a portable and transparent way... ● A new, very promising scenario is rising ... O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 42
  • 43. Acknowledgement Dr. Flavio Vella Prof. Sergio Tasso High Performance Computing Lab Department of Mathematics and Computer Science Unversity of Perugia O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 43
  • 44. Invitation to submit papers to ICCSA 2013 Ho Chi Minh City, June 24-27 2012 http://www.iccsa.org Deadline for submissions: Jan 15, 2013 O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 44
  • 45. Thank you!!! O. Gervasi, University of Perugia - FGIT 2012, Kangwondo, South Korea, Dec 20 th, 2012 45