JUQUEEN: IBM Blue Gene/Q® Supercomputer System at the Jülich Supercomputing Centre

JUQUEEN (see Figure 1) is a highly scalable supercomputer funded mainly by the Gauss Centre for Supercomputing (2015) and by the Helmholtz Association (2015), and it is hosted by the Jülich Supercomputing Centre (2015b). It is a 28-rack IBM Blue Gene/Q® system that combines 28,672 compute nodes through a high-speed network and provides an overall peak performance of 5.9 Petaflops.

Scientists of RWTH Aachen University, Forschungszentrum Jülich or the German Research School for Simulation Sciences (GRS) are also qualified to apply for computing time within the Jülich-Aachen Research Alliance (2015) (JARA-HPC).

JUQUEEN system
3.1 System configuration

JUQUEEN, the Blue Gene/Q system hosted by the Jülich Supercomputing Centre (2015b), was installed in four steps. The final stage of expansion went into production in March 2013 and has the following configuration (Jülich Supercomputing Centre, 2015a):

The design of the Blue Gene/Q system is based on the International Business Machines Corporation (IBM) PowerPC A2 processing architecture (International Business Machines Corporation, 2013, 2015). Each processor includes 16 compute cores dedicated to the user application, plus an additional core allocated to operating system administrative functions and a redundant spare core. Because system services execute on their own core, the random asynchronous delays they would otherwise impose on user processes are suppressed. Such noise can significantly degrade the scalability of an architecture. As every core supports 4-way simultaneous multithreading (SMT), a Blue Gene/Q node can run up to 64 independent hardware threads. Every core is assisted by a 4-wide double-precision floating-point unit (SIMD).
The processor core architecture is relatively simple and implements a standard 64-bit Power instruction set architecture. A particular feature of this core architecture is its support for an auxiliary execution unit. For Blue Gene/Q, a Quad Floating-Point Processing Unit (QPU) was developed. The QPU processes vectors of four 64-bit elements. In each clock cycle it can perform four fused multiply-add operations in parallel, that is, eight floating-point operations. As in previous generations of Blue Gene, the vector arithmetic instructions may involve a permutation of the vector elements, which is used to implement complex arithmetic (without the need for separate shuffle operations, as in other vector instruction set architectures). With each of the 16 QPUs able to complete four multiply-add operations per clock cycle at a clock speed of 1.6 GHz, the peak performance of a node is 204.8 GFlop/s.

This performance can only be exploited if it is balanced by a powerful memory subsystem. As can be seen in Figure 2, a large fraction of the die space is occupied by the L2 cache, which has a capacity of 32 MBytes. Data is moved between external memory and this last-level cache by two memory controllers (MC 0 and MC 1). The L2 cache is shared by all processor cores. A central crossbar switch connects it to all cores and to the network subsystem. The cores can read from the L2 cache at an aggregate maximum bandwidth of 409.6 GByte/s.

Blue Gene/Q incorporates novel architectural advances that contribute to the system's outstanding performance and help users simplify the programming of such highly scalable many-core systems.
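The peak-performance figures quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the original system documentation: it assumes the usual convention that one fused multiply-add counts as two floating-point operations, and all constant and function names are illustrative.

```python
# Back-of-the-envelope check of the peak-performance figures in the text.
# All names are illustrative; the numeric inputs are taken from the text.

CORES_PER_NODE = 16        # compute cores available to the user application
SMT_WAYS = 4               # hardware threads per core (4-way SMT)
FMA_PER_CYCLE = 4          # fused multiply-adds per QPU per clock cycle
FLOPS_PER_FMA = 2          # assumption: one multiply plus one add
CLOCK_HZ = 1_600_000_000   # 1.6 GHz
NODES = 28_672             # compute nodes in the 28-rack system
L2_READ_BYTES_PER_S = 409.6e9  # aggregate L2 read bandwidth per node


def hardware_threads_per_node() -> int:
    """Independent hardware threads a single node can run."""
    return CORES_PER_NODE * SMT_WAYS


def node_peak_flops() -> int:
    """Peak double-precision floating-point rate of one node, in Flop/s."""
    return CORES_PER_NODE * FMA_PER_CYCLE * FLOPS_PER_FMA * CLOCK_HZ


def system_peak_flops() -> int:
    """Peak floating-point rate of the whole machine, in Flop/s."""
    return NODES * node_peak_flops()


def l2_bytes_per_flop() -> float:
    """Bytes readable from L2 per peak floating-point operation."""
    return L2_READ_BYTES_PER_S / node_peak_flops()


if __name__ == "__main__":
    print(f"threads per node : {hardware_threads_per_node()}")
    print(f"node peak        : {node_peak_flops() / 1e9:.1f} GFlop/s")
    print(f"system peak      : {system_peak_flops() / 1e15:.2f} PFlop/s")
    print(f"L2 bytes per flop: {l2_bytes_per_flop():.1f}")
```

Under these assumptions the node-level result matches the 204.8 GFlop/s stated above, and multiplying by the 28,672 nodes gives roughly the 5.9 Petaflops quoted for the full system.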
• Hardware-based speculative execution capabilities facilitate efficient multi-threading for long code sections, even those with potential data dependencies. If conflicts are detected, the hardware can backtrack and redo the work with minimal effect on application performance.
• Hardware-based transactional memory helps programmers avoid the potentially complex integration of locks and helps eliminate bottlenecks caused by deadlocking, when threads become stuck during the locking process. It helps to deliver efficient and effective multi-threading while reducing the need for complicated programming.
• The L1 prefetcher can run in normal stream prefetching mode, which adaptively balances resources to prefetch L2 cache lines in response to observed memory traffic. In addition, it can use four list-based prefetching engines to record memory access patterns in arbitrarily long code segments on the first iteration of a loop and play back this pattern in subsequent iterations. On subsequent passes, this list is adaptively refined for missing or extra cache misses; the mechanism can be activated by program directives.

To interconnect the compute nodes, a proprietary high-speed network in a 5D torus topology is used, providing the following advantages:

• A reduced latency of ∼3 µs for point-to-point communication and ∼6 µs within collectives and barriers.
• A good trade-off between nearest-neighbour and bisection bandwidths.
• Extremely flexible partitioning into independent, non-interfering sub-machines.
• Direct hardware support for MPI collective reduce and all-reduce operations, so that single-pass floating-point reductions can be executed at near link bandwidth.
• Flexible network configurability, which allows Blue Gene/Q to spare out failed lasers without disrupting a running application.
• New networking hardware that supports offloading of the I/O traffic from the compute cores.

One unique feature of the Blue Gene architecture line is the dense integration of a large number of nodes within a single rack. Figure 3 shows the packaging hierarchy of the system. Unlike in previous generations of Blue Gene, where air was used to remove the heat generated by the compute nodes, in the new generation of machines the nodes are directly connected to a liquid cooling system. 90% of the heat originating from the compute nodes is taken away directly by the water; the remaining fraction, coming mainly from the power supplies, is still moved out of the rack by air. Engineered with fewer moving parts and built-in redundancy, Blue Gene/Q has proven to be extremely reliable. Designed with a small footprint and low power requirements, Blue Gene/Q was ranked as the most energy-efficient supercomputer in the world by the Green500 in Nov. 2011 (Green500, 2011).
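Returning to the 5D torus interconnect described above, the wrap-around structure of such a topology can be illustrated with a small sketch. The shape (4, 4, 4, 4, 2) used here is an assumption chosen purely for illustration, since the text does not give the torus dimensions, and the function names are hypothetical rather than part of any Blue Gene/Q interface.

```python
from itertools import product

# Hypothetical 5D torus shape for illustration only (not from the text).
SHAPE = (4, 4, 4, 4, 2)


def torus_neighbours(coord, shape):
    """Return the set of distinct nearest neighbours of `coord`.

    Each dimension wraps around, so stepping past the last index lands on
    index 0. In a dimension of size 2 the +1 and -1 steps reach the same
    node, so that dimension contributes only one distinct neighbour.
    """
    neighbours = set()
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            candidate = list(coord)
            candidate[axis] = (candidate[axis] + step) % size
            candidate = tuple(candidate)
            if candidate != tuple(coord):
                neighbours.add(candidate)
    return neighbours


def torus_distance(a, b, shape):
    """Minimum hop count between nodes `a` and `b`, using wrap-around links."""
    return sum(min((x - y) % s, (y - x) % s) for x, y, s in zip(a, b, shape))


def torus_diameter(shape):
    """Largest minimum hop count between any two nodes of the torus."""
    origin = (0,) * len(shape)
    nodes = product(*(range(s) for s in shape))
    return max(torus_distance(origin, node, shape) for node in nodes)


if __name__ == "__main__":
    origin = (0, 0, 0, 0, 0)
    print("distinct neighbours:", len(torus_neighbours(origin, SHAPE)))
    print("hops to (3,0,0,0,1):", torus_distance(origin, (3, 0, 0, 0, 1), SHAPE))
    print("diameter           :", torus_diameter(SHAPE))
```

The sketch shows why a torus keeps hop counts low: because every dimension wraps around, the worst-case distance in each dimension is half its size, so adding dimensions raises the node count much faster than the diameter.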

Figure 2: Physical layout of the Blue Gene/Q chip.