QsNetIII Adaptively Routed Network For HPC

QsNetIII: an Adaptively Routed Network for
High Performance Computing

Duncan Roweth, Quadrics Ltd
Hot Interconnects August 2008

28/8/2008 Quadrics Ltd

Quadrics Background

• Develops interconnect products for the HPC market
– HPC Linux systems
– AlphaServer SC systems
• Quadrics is owned by the Finmeccanica group
• Quadrics was 12 years old in July


QsNet Networks

• Multi-stage switch network
• Components
– Adapter: Elan
– Router: Elite
– Switches, cables
– Firmware, drivers, libraries
– Diagnostics, documentation
• HPC specific features
– Adaptive routing
– Hardware barrier & broadcast


Communication Model
[Figure: communication model diagram; labels: Virtual Address, Process]


Quadrics Networks

• Elan1 / Elite1, 1994, Meiko Computing Surface 2
– Source chooses between pre-defined routes
• Elan3 / Elite3, 2000, first Quadrics product, QsNet
– First use of packet-by-packet adaptive routing
– Crosspoint router, x8
• Elan4 / Elite4, 2004, QsNetII
– Reduced latency, increased bandwidth
– Increased support for offloading collectives
• Elan5 / Elite5, 2008, QsNetIII
– General purpose crosspoint router, increased radix, x32
– Highly programmable adapter


What is Adaptive Routing?

• Switch networks typically provide many
paths between any two points
• In an adaptively routed network
routers make packet by packet decisions
on the route to use based on
– Queue occupancy
– Channel usage
– Error rates and state
– Class of traffic


Why is Adaptive Routing Important?

• Most HPC networks are statically routed
– They use pre-determined paths between nodes
• Static routing can work well
– If traffic pattern is known in advance
– If traffic pattern is persistent
– If traffic pattern is uniform (i.e. application is load balanced)
– If there are no errors
• These conditions are not met by real codes on production
HPC systems (see LLNL and Sandia results)
• Adaptive routing solves these problems
– Delivering significantly better aggregate bandwidths and worst
case latencies on real systems running real codes
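
The contrast between the two approaches can be illustrated with a toy model. This is only a sketch, not a model of any real router: it assumes a hypothetical static rule that picks the up-link from the destination number alone, and an idealised adaptive rule that always picks the least loaded up-link. The function names and the traffic pattern are invented for illustration.

```python
from collections import Counter

def static_route(dest, n_uplinks):
    # Assumed static rule: the up-link depends only on the destination.
    return dest % n_uplinks

def max_link_load(destinations, n_uplinks, adaptive):
    """Return the number of packets carried by the busiest up-link."""
    load = Counter({link: 0 for link in range(n_uplinks)})
    for dest in destinations:
        if adaptive:
            # Idealised adaptive rule: choose the least loaded up-link.
            link = min(range(n_uplinks), key=lambda l: load[l])
        else:
            link = static_route(dest, n_uplinks)
        load[link] += 1
    return max(load.values())

# A non-uniform pattern: every destination maps to the same static up-link.
pattern = [0, 4, 8, 12, 16, 20, 24, 28]
print(max_link_load(pattern, 4, adaptive=False))  # 8
print(max_link_load(pattern, 4, adaptive=True))   # 2
```

With this pattern the static rule puts all eight packets on one up-link, while the adaptive rule spreads them two per link.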


Benefits of Adaptive Routing

• Bandwidth achieved when 1024 nodes all communicate at the same time
• Plots show the distribution of measured bandwidths

System    Interconnect   Min   Max   Average
Atlas     Infiniband      95   762   263
Thunder   QsNetII        248   403   369
Data from Lawrence Livermore National Lab, published at the Sonoma OpenFabrics workshop April 2007


Benefits of Adaptive Routing

• Classic QsNetII all-to-all bandwidth scaling graph


Ordering Considerations

• Adaptively routed packets can arrive out of order
– Problems for stream devices, e.g. multipath Ethernet
• Message ordering is required in HPC
– But within a message we are free to deliver the bulk data in
arbitrary order
Get it there as fast as possible then tell me that it is done
• QsNet ordering
– Packets contain the destination virtual address at which to write
the data
– Bulk data transfers can arrive out of order and can be replayed
– Atomic transactions are sequenced
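
A minimal sketch of why carrying the destination address in each packet makes bulk data order-independent. The tuple layout, the `deliver` helper and the 256 byte segment size used here are illustrative assumptions, not the QsNet wire format.

```python
import random

SEGMENT = 256  # illustrative segment size in bytes

def deliver(packets, memory):
    """Write each packet's payload at the destination address it carries."""
    for addr, payload in packets:
        memory[addr:addr + len(payload)] = payload

message = bytes(range(256)) * 4  # 1024 bytes of bulk data
packets = [(off, message[off:off + SEGMENT])
           for off in range(0, len(message), SEGMENT)]

random.seed(1)
random.shuffle(packets)          # the network may reorder packets

memory = bytearray(len(message))
deliver(packets, memory)
print(bytes(memory) == message)  # True, whatever the arrival order

deliver(packets[:2], memory)     # replaying packets is harmless
print(bytes(memory) == message)  # True
```

Because every write is addressed, arrival order and replays do not change the final contents; only the completion notification needs to be ordered after the data.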


Adaptive Routing in QsNetIII

• More flexible than QsNetII
– Operates over arbitrary sets of links
– More opportunities to use the technique
– Higher radix switches

• Select a subset of lightly loaded output ports based on:
– Destination
– Link state, errors etc
– Number of pending acks (programmable threshold)
• Programmable algorithm for selecting from this subset:
– First free, next free, random
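
The two-stage selection above can be sketched as follows. This is an interpretation under stated assumptions, not the Elite5 implementation: the `Port` fields, the meaning given to "first free" and "next free", and all names are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Port:
    index: int
    up: bool = True        # link state
    pending_acks: int = 0  # packets sent but not yet acknowledged

def candidates(ports, reachable, ack_threshold):
    """Stage 1: ports that reach the destination, are up, and are lightly loaded."""
    return [p for p in ports
            if p.index in reachable and p.up and p.pending_acks < ack_threshold]

def select(cands, policy, last=-1, rng=random):
    """Stage 2: pick one candidate using the configured policy."""
    if not cands:
        return None
    if policy == "first_free":
        return cands[0]
    if policy == "next_free":
        # Assumed meaning: first candidate after the port used last time, wrapping round.
        later = [p for p in cands if p.index > last]
        return (later or cands)[0]
    if policy == "random":
        return rng.choice(cands)
    raise ValueError(policy)

ports = [Port(0, pending_acks=5), Port(1, up=False), Port(2), Port(3, pending_acks=1)]
cands = candidates(ports, reachable={0, 1, 2, 3}, ack_threshold=4)
print([p.index for p in cands])                  # [2, 3]
print(select(cands, "first_free").index)         # 2
print(select(cands, "next_free", last=2).index)  # 3
print(select(cands, "next_free", last=3).index)  # 2
```

Port 0 is excluded because it exceeds the pending-ack threshold and port 1 because its link is down; the policy then chooses among the remaining ports.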


Adaptive Routing: standard case

– All top switches are equivalent, select one
– Adaptive routing selects a lightly loaded path


Implementation of Fat Tree Networks

• Connect M × N-way node switches by N × M-way top switches
• In this case M = 16, N = 4
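
A small sketch of the wiring, reading the slide as M node switches with N up-links each, joined by N top switches with M ports each. The rule that up-link j of node switch i goes to port i of top switch j is a common convention assumed here, and `fat_tree_links` is a hypothetical helper.

```python
def fat_tree_links(m, n):
    """List ((node_switch, up_link), (top_switch, port)) pairs."""
    # Assumed wiring: up-link j of node switch i goes to port i of top switch j.
    return [((i, j), (j, i)) for i in range(m) for j in range(n)]

links = fat_tree_links(m=16, n=4)
print(len(links))  # 64 up-links in total

# Every top switch ends up with exactly one link to every node switch.
ports_per_top = {}
for (_node, _up), (top, port) in links:
    ports_per_top.setdefault(top, set()).add(port)
print(sorted(len(p) for p in ports_per_top.values()))  # [16, 16, 16, 16]
```

Since each node switch reaches every top switch, all top switches are equivalent for upward traffic.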


Adaptive Routing in the Top Switch

• If top switch radix ≤ router radix / 2
– i.e. 16 for Elite5, 2048-way networks
• Router provides multiple top switches
– Select which to use based on load
• Example:
– Traffic from A to B via routers 210 and
300 is blocked by traffic between 300
and 200.
– The router providing 300, 301, 302 and
303 can select a different path


Adaptive Routing on the Final Hop

• Multiple connections to a node
• Switch can select a free path
• Reduces end-point contention

• Simple case is not optimal
• Spreading the connections
– Improves fault tolerance
– Reduces network contention
• Routing decision is made higher
in the network


Adaptive routing in the presence of errors

• In a production system with 1000s
of links it is not uncommon for a
small number to be broken – until
the next maintenance slot
• Adaptive routing minimises the
impact
• Example:
– Link between routers 10 and 20 is
broken
– Router 10 dynamically selects paths
via 21, 22 and 23, spreading the load.
– Reverse case: avoid sending to 10
via 20, either by resetting 20’s links or
by updating switches 11, 12 and 13.


Small Packet Support

• Aim to get as close to line rate as possible with small packets
• For example:
– Small put
– 32 byte packet

• Adapter has multiple packet engines
• Adapters support up to 64 outstanding packets per link
– Doubles if we use both links
• Switches provide 32 virtual channels per output link
• Prioritisation – buffering on input to the router


Barrier & Broadcast Support

• Switches broadcast over
a range of output links
• Combine Acks / Nacks
• Contiguous in QsNetII
• Sparse in QsNetIII
• Barrier implementation
– Network conditional
– Broadcast release


Elan5 – Device Overview

• 2× QsNetIII links
– 20Gbit/s/direction after protocol
• PCIe, PCIe2 host interface
• Multiple packet engines
• 512KB of high bandwidth on-chip local memory
• SDRAM interface to optional local memory
• Buffer manager, object cache
• Details in ISC Dresden paper

[Figure: Elan5 adapter block diagram. Labels: two CX4/QsNetIII links; seven packet engines, each with 16K inst cache and 9K data buffers; fabric (x8); bridge; host I/F; local memory; local functions; object cache tags; TLB; buffer manager; external cache; cmd launch; free list; SDRAM i/f; Ext i/f; 16K x 8 x 8 banks = 1MB ECC RAM; PLL; clocks; external EEPROM; SERDES; PCIe, 16 lanes; DDRII]


Elite5 – Device Overview

• 64 × 32 crosspoint router
– Direct & buffered input from each link
– 8K of input buffering per link
• 32 virtual channels per link
• Physical layer DDR XAUI (6.25GHz)
• Adaptive routing
• Hardware barrier and broadcast
• Memory mapped stats & error
counters accessed out-of-band


QsNetIII Device Overview

Item                     Elan                     Elite
Design                   Semi custom ASIC         Semi custom ASIC
Manufacturing partners   LSI / TSMC G90 process   LSI / TSMC G90 process
Clock                    500 MHz                  312 MHz
Package                  High performance BGA     High performance BGA
Pins                     672                      982
Power                    < 17W                    < 18W


QsNetIII Implementation

• Node switch chassis
– 128 links down to the nodes
– 128 links up to the top switches
– Backplane connects 2 sets of cards
• Top switches
– 256 links down to the node switches
– Range of system sizes:

Ports   Radix   Per Chassis
512     4       64
1024    8       32
2048    16      16
4096    32      8

[Figures: QsNetIII switch logical design; QsNetIII switch implementation]
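
The system size figures appear consistent with simple arithmetic on the chassis link counts given on this slide. The sketch below assumes that the number of ports is 128 node links times the top switch radix, and that a top switch chassis divides its 256 down links evenly among top switches of that radix; both relations are inferred from the numbers rather than stated in the slides.

```python
NODE_CHASSIS_DOWN_LINKS = 128  # links down to the nodes, per node switch chassis
TOP_CHASSIS_DOWN_LINKS = 256   # links down to the node switches, per top switch chassis

def system_size(radix):
    """Return (ports, top switches per chassis) for a given top switch radix."""
    ports = NODE_CHASSIS_DOWN_LINKS * radix
    per_chassis = TOP_CHASSIS_DOWN_LINKS // radix
    return ports, per_chassis

for radix in (4, 8, 16, 32):
    print(radix, *system_size(radix))
# 4 512 64
# 8 1024 32
# 16 2048 16
# 32 4096 8
```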


QsNetIII Network 1024–way

• Fat tree, constructed from 8 × 128-way node switches connected by
128 × 8-way top switches


QsNetIII Implementation – Cables

• QSFP connectors throughout
• Copper cables (e.g. Gore) 1-10m
• Active copper cables (e.g. Gore), 8-20m
• Optical cables (e.g. Luxtera), 5-300m
– PVDF Plenum rated
– LSZH available as an option
• No longer Quadrics proprietary

• Likely usage:
– Short copper cables from nodes
– Optical cables between switches


QsNetIII Fault Tolerance

• All of the QsNetII Features
– CRCs on every packet
– Automatic retransmission
– Redundant routes
– Adaptive routing avoids failed links
– Redundant, hot-pluggable PSUs and fans

+ Line rate testing of each link as it comes up
– Switches generate CRPAT, CJPAT or PRBS packets
– Links are only added to the route tables when they are (a) up, (b)
connect to the right place, and (c) can transfer data at full line rate
without error.


QsNetIII Implementation – HP BladeSystem

• Elan5 mezzanine adapter
– 2 QsNet links, PCI-E x8 Gen2
– 128 MB of memory
• Elite5 switch module
– Full bandwidth
– 16 links to the blades (via backplane)
– 16 links to back of the module


Current Status

• Elite5 silicon in Bristol
• Elan5 at TSMC, first parts expected
in 3-4 weeks
• Switch PCBs, chassis, backplane,
controllers are working
• First adapter PCBs are ready
– PCI-Express x16, HP Blade,
ExpressModule (Sun Blade)
• We are porting the QsNetII software
• Components at SC08 in Austin
• First customer shipment in Q1 of 2009


Future Work

• QsNetIII hardware
– Low cost 32-way switch
– 1024-way single chassis switch

• QsNetIII Software
– General framework for optimised collectives
– Support for “multiport” networks - “fat” nodes have multiple
connections to the same rail
– Ethernet firmware for the network adapter


Conclusions

• Adaptive routing underwrites the scalability of HPC systems
designed to run a single large application
• Adaptive routing has been a feature of QsNet systems since 2000
• QsNetIII offers significant enhancements over both QsNetII and
competing products


Thank you for listening


Additional Material


Packet Format

• Packet size of up to 4K made up of 256 byte packet segment and
continuations, 8 byte ACK


Impact of static routing on latency

Data from Thunderbird cluster, Sandia National Lab
Big increases in worst case latency with number of nodes


Impact of static routing on latency

Data from Thunderbird cluster, Sandia National Lab
Big variation in worst case latency across a large job


Software Model – Firmware & Drivers

• Base firmware in the ROMs
• Firmware modules loadable with the device driver
– Elan, OpenFabrics, 10GE Ethernet, …
• Kernel modules
– elan5, elan, rms
• Device dependent library (libelan5)
• Device independent library (libelan)
• User libraries


Software Model – Elan Libraries

• Point-to-point message passing
• One-sided put/get
• Transparent rail striping
• Optimised collectives
• Locks and atomic ops
• Global memory allocation


QsNetIII Performance Summary

• Similar latencies to QsNetII
– The 1.3 to 2 microseconds of latency is mostly in the host PCI and
memory system
• Higher issue rates
– Improved link utilisation on small transfers
• Higher bandwidths
– 1.5 to 2.25 GB/sec/link depending on host interface
• Bi-directional host interface
– 2× improvement over QsNetII
• Broadcast and barrier in hardware
• Continued development of adaptive routing underwrites scaling
to high node counts

