JURECA: Modular supercomputer at Jülich Supercomputing Centre

JURECA is a petaflop-scale modular supercomputer operated by the Jülich Supercomputing Centre at Forschungszentrum Jülich. The system combines a flexible Cluster module, based on T-Platforms V-Class blades with a balanced selection of best-of-its-kind components, with a scalability-focused Booster module, delivered by Intel and Dell EMC and based on the Intel Xeon Phi many-core processor. With this novel architecture, the system supports a wide variety of high-performance computing and data analytics workloads.


Introduction
Since July 2015, the Jülich Supercomputing Centre (JSC) at Forschungszentrum Jülich (Forschungszentrum Jülich, 2018a) operates the JURECA (Jülich Research on Exascale Cluster Architectures) system as the successor of the popular JUROPA (Jülich Research on Petaflop Architectures) supercomputer. From 2015 to 2017, the JURECA Cluster (see Figure 1) served as a general-purpose supercomputing resource and, in accordance with Forschungszentrum Jülich's dual architecture strategy, augmented the leadership-class, highly scalable IBM Blue Gene/Q system JUQUEEN (Forschungszentrum Jülich, 2015). In 2017, JURECA was itself augmented with a many-core processor based Booster module to enable highly scalable applications to leverage the system more efficiently. Funding for both JURECA modules was granted by the Helmholtz Association (Helmholtz Association, 2018) through the program "Supercomputing & Big Data".

The Cluster and Booster modules are tightly integrated and operated as a single system following the modular supercomputing paradigm pioneered by JSC in the context of the DEEP series of EU-funded projects (Eicker et al., 2016). The modular supercomputing concept enables users to distribute their workloads flexibly across different, architecturally diverse modules in order to place different phases or subroutines of their workload on the hardware best suited for the execution. The software features required to leverage this architecture are being made available to the wider JURECA user community in the course of 2018.

The JURECA Cluster module was designed by JSC together with the hardware vendor T-Platforms (T-Platforms, 2018) to serve as a versatile scientific instrument for compute- and data-intensive (simulation) science that is equally suited for capacity and capability workloads. The JURECA Booster module was designed by JSC and Intel (Intel Corporation, 2018) in 2016 as a highly scalable compute architecture leveraging the latest available Intel networking and processor technology. The system was delivered by Intel together with its partner Dell EMC (Dell EMC, 2018) in 2017.

JURECA system details
The JURECA modular supercomputer consists of two separate, but tightly integrated, compute modules. The architecture of the Cluster module is an evolution of the JUROPA architecture and follows a best-of-breed approach, combining the most advanced commodity hardware and software technologies available in the industry. The JURECA Cluster is itself a heterogeneous system offering nodes with different memory sizes (128 GiB, 256 GiB, 512 GiB as well as 1 TiB), nodes with graphics processing unit (GPU) accelerators, as well as GPU-equipped nodes for visualization and other post-processing needs. The architecture of the Booster module is designed to best serve highly scalable simulation workloads that are able to leverage the high core counts and wide vector units of the Intel Xeon Phi many-core processors.

Cluster module
The JURECA Cluster consists of 1,733 compute nodes of type T-Platforms V-Class V210S as well as 75 GPU-accelerated V210F blades hosted in V5050 chassis (see Figure 2). Moreover, 64 Supermicro F618R2-RT+ twin-blade servers (512 GiB memory nodes) and 12 Supermicro 1028GR-TR visualization nodes are available. All systems feature two Intel Xeon E5-2680 v3 12-core Haswell central processing units (CPUs) (Intel Corporation, 2018b), which support up to 24 hardware threads each. Each CPU supports the AVX 2.0 instruction set architecture extension and can perform two 256-bit wide (i.e., four double-precision floating-point numbers) multiply-add operations per cycle. The peak performance of a (non-accelerated) JURECA Cluster node is 0.96 TFlop/s. The maximum memory bandwidth of the node is 136 GB/s.
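The quoted node peak can be reproduced with a simple back-of-the-envelope calculation. The 2.5 GHz value below is the nominal base clock of the Xeon E5-2680 v3; under sustained AVX load the actual frequency may differ, so this is a sketch of the arithmetic rather than a measured figure.

```python
# Peak floating-point performance of a non-accelerated JURECA Cluster node.
# Assumes the nominal 2.5 GHz base clock of the Xeon E5-2680 v3.
sockets = 2
cores_per_socket = 12
# 2 FMA units x (multiply + add) x 4 double-precision lanes per 256-bit vector
flops_per_cycle = 2 * 2 * 4
clock_hz = 2.5e9
peak_flops = sockets * cores_per_socket * flops_per_cycle * clock_hz
print(f"{peak_flops / 1e12:.2f} TFlop/s")  # -> 0.96 TFlop/s
```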
The two sockets are connected by a bi-directional 9.6 GT/s (gigatransfers per second) Intel QuickPath Interconnect (QPI) link. In the JURECA Cluster module, 2133 MHz DDR4 memory technology is used. Applications that support the use of GPU accelerators can take advantage of the two additional NVIDIA K80 graphics processing units available in 75 compute nodes. The GPUs are connected with PCI Express Generation 3.0 (16 lanes) links providing a peak of 32 GB/s bidirectional host-device bandwidth. Each K80 GPU is equipped with 2×12 GB GDDR5 memory and offers 4,992 CUDA cores that provide an additional 2.9 TFlop/s peak performance (5.8 TFlop/s per node) and 480 GB/s memory bandwidth per GPU. The 12 visualization nodes are equipped with two NVIDIA K40 GPUs intended for remote visualization usage.

The JURECA Cluster compute nodes are connected with Mellanox extended data rate (EDR) InfiniBand providing 100 Gb/s (12.5 GB/s) link bandwidth and MPI latencies around one microsecond. The host channel adapters (HCAs) are connected via PCI Express Generation 3.0 (16 lanes). The Cluster components are interconnected in a three-level full fat-tree topology, which provides full bisection bandwidth and non-blocking communication for appropriate communication patterns.

A particular emphasis during the design of the JURECA Cluster has been put on the storage connection, in order to meet the increasing data requirements of simulation sciences as well as the needs of emerging data-intensive sciences. All offered global (parallel) filesystems on JURECA are mounted from the central Jülich Storage Cluster (JUST) (Forschungszentrum Jülich, 2018c) using IBM's General Parallel File System (GPFS). Users with access to several systems in the supercomputing facility at JSC work with the same filesystems on all systems, so that data movement is minimized and workflows are simplified. The storage network connection of the Cluster is realized using InfiniBand-to-Ethernet gateways bridging the internal InfiniBand network with the facility's Terabit Ethernet backbone. This connection type was selected as it allows for more than 100 GB/s aggregate filesystem bandwidth as well as a high per-node filesystem performance that is hardware-wise only limited by the performance of the fourteen data rate (FDR) InfiniBand links (56 Gb/s) towards the gateways.

Booster module
The JURECA Booster consists of 1,640 compute nodes of type Dell PowerEdge C6320P (see Figure 3). All systems feature one Intel Xeon Phi 7250-F CPU (Intel Corporation, 2018a) with 68 cores, a base frequency of 1.4 GHz, and 4 hardware threads per physical core. The processor package includes 16 GiB of high-bandwidth, multi-channel DRAM (MCDRAM) with a bandwidth of up to 500 GB/s. The peak performance of a Booster compute node is 3 TFlop/s. Each node is equipped with an additional 96 GiB of DDR4 memory clocked at 2400 MHz.

The Booster compute nodes are connected with 100 Gb/s Intel Omni-Path Architecture (OPA). The host fabric interfaces (HFIs) are integrated in the CPU package but internally connected with PCI Express Generation 3.0 (16 lanes). The Booster components are interconnected in a three-level full fat-tree topology. The Cluster and Booster modules are linked through 198 router nodes, each equipped with one InfiniBand HCA and one OPA HFI, enabling Cluster-Booster communication with up to 19.8 Tb/s (about 2.5 TB/s) aggregate bandwidth. The Booster connects to the JUST cluster to access the same filesystems as are available on the Cluster. The storage connection is realized with 26 router nodes equipped with two HFI ports and two 40 Gigabit Ethernet connections to the facility Ethernet fabric. The nominal network speed of the storage connection is 260 GB/s; in practice, a lower performance is observed due to software limitations.
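Both the node peak and the Cluster-Booster link bandwidth quoted above follow from simple arithmetic. The per-cycle throughput below assumes the two AVX-512 vector units of the Knights Landing core, each performing one fused multiply-add on eight double-precision lanes; this is a back-of-the-envelope sketch, not a vendor specification.

```python
# Peak performance of one Xeon Phi 7250-F Booster node.
cores = 68
clock_hz = 1.4e9  # base frequency
# 2 AVX-512 VPUs x (multiply + add) x 8 double-precision lanes per 512-bit vector
flops_per_cycle = 2 * 2 * 8
node_peak = cores * clock_hz * flops_per_cycle
print(f"node peak: {node_peak / 1e12:.2f} TFlop/s")  # -> ~3 TFlop/s

# Aggregate Cluster-Booster bandwidth through the router nodes,
# assuming one 100 Gb/s link per router on each side.
routers = 198
link_gbps = 100
aggregate_tbps = routers * link_gbps / 1000
print(f"Cluster-Booster: {aggregate_tbps:.1f} Tb/s")  # -> 19.8 Tb/s
```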

Software
JURECA's software stack is largely based on open-source software. Login and compute nodes run the CentOS 7 Linux operating system with a careful setup that balances ease of use and a low entrance barrier with the requirements, such as minimal operating system jitter, of large-scale capability clusters.
JURECA uses the open-source Slurm workload manager (SchedMD LLC, 2018) in combination with the ParaStation resource management system, which has a proven track record in scalability, reliability and performance on several clusters operated by JSC. The ParTec ParaStation ClusterSuite (ParTec Cluster Competence Center GmbH, 2018) is used for system provisioning and health monitoring. On JURECA, the Intel and ParTec ParaStation Message Passing Interface (MPI) implementations are supported. In addition, the CUDA-aware MPI implementation MVAPICH2-GDR is available for mixed MPI+CUDA applications. Different compilers, optimized mathematical libraries and pre-compiled community codes are available; we refer to the JURECA webpage (Forschungszentrum Jülich, 2018b) for more information. Monitoring of batch jobs is possible using the latest version of the LLview (Forschungszentrum Jülich, 2018d) graphical monitoring tool. Scientists can also use UNICORE (UNICORE Forum e.V., 2018) to create, submit and monitor jobs on the JURECA system.

The software functionality required for high-speed communication between Cluster and Booster via MPI is implemented in ParaStation. At the time of the Booster deployment in 2017, the software was available at proof-of-concept level. It is being matured in the course of 2018 and made available, along with the necessary enhancements of the workload manager, in steps to the wider JURECA community.
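To illustrate typical usage, a minimal Slurm batch script for an MPI job might look as follows; the partition name and resource limits are hypothetical placeholders and should be checked against the current JURECA documentation (Forschungszentrum Jülich, 2018b).

```shell
#!/bin/bash
#SBATCH --job-name=example          # job name shown by squeue/LLview
#SBATCH --nodes=2                   # number of compute nodes
#SBATCH --ntasks-per-node=24        # one MPI rank per physical core (Cluster node)
#SBATCH --time=01:00:00             # wall-clock limit
#SBATCH --partition=batch           # hypothetical partition name

# Launch the MPI application through srun, which hands off to the
# ParaStation process management on JURECA.
srun ./my_mpi_application
```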

Hardware components
As of this writing, JURECA consists of the hardware components described in the previous sections. An up-to-date description of the hardware (and software) configuration of the system is maintained on the JURECA webpage (Forschungszentrum Jülich, 2018b).

Benchmark results
Using 1,764 JURECA Cluster compute nodes without accelerators, a Linpack performance of 1.42 PFlop/s was measured, placing the system at spot 50 in the November 2015 Top500 list (Top500, 2015). The Cluster module consumed on average 825 kW during the Linpack run, i.e., about 1.72 GFlop/s/W. JURECA entered the Green500 list in November 2015 at place 112 (Green500, 2015). On the High Performance Conjugate Gradients (HPCG) benchmark, the JURECA Cluster achieved 68.3 TFlop/s in 2015, corresponding to place 18 in the November 2015 HPCG list (HPCG, 2015).
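The quoted efficiency can be cross-checked directly from the Linpack result and the average power draw (a sketch assuming the 825 kW average applies to the whole run):

```python
# Power efficiency of the 2015 Cluster Linpack run.
linpack_flops = 1.42e15  # 1.42 PFlop/s measured Linpack performance
avg_power_w = 825e3      # 825 kW average power draw during the run
efficiency = linpack_flops / avg_power_w / 1e9
print(f"{efficiency:.2f} GFlop/s/W")  # -> 1.72 GFlop/s/W
```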
In 2017, following the installation of the Booster module, a combined Linpack performance of 3.78 PFlop/s was measured with 1,760 Cluster and 1,600 Booster compute nodes. The upgrade placed the system at spot 29 in the November 2017 Top500 list (Top500, 2017). With an average of 2.81 GFlop/s/W, the system ranked at spot 55 in the November 2017 Green500 list (Green500, 2017).
Access to JURECA
Scientists and engineers interested in using the capacities and capabilities of JURECA for their research have to apply for JURECA compute time resources by submitting an adequate proposal in answer to the corresponding compute time calls, which are published in January and July every year. Submitted proposals are evaluated scientifically through a competitive peer-review process. Additionally, the review process includes a technical assessment of the applicant's ability to efficiently perform parallel computations utilizing a large number of compute cores on JURECA.
Two calls are conducted twice every year. One is conducted jointly by peers in computational science and engineering at Forschungszentrum Jülich and RWTH Aachen University, accepting proposals from these two institutions only (the so-called JARA-HPC/VSR Call)

Figure 1: Jülich Research on Exascale Cluster Architectures (JURECA) at the Jülich Supercomputing Centre. The left picture shows the Cluster module in 2015. The right picture shows the complete system after the Booster deployment in 2017. Copyright: Forschungszentrum Jülich.
(a) Back (left) and side (right) view of a T-Platforms V-Class V210S dual-socket blade server as used in JURECA. The GPU-accelerated V210F blades host two additional PCIe devices and fit in two chassis slots. (b) Front (left) and back (right) view of the T-Platforms V5050 chassis. Each chassis can host ten V210S or, alternatively, five V210F blades.

Figure 2: T-Platforms V-Class components used in the JURECA system. Copyright: T-Platforms.

Figure 3: Example of a Dell C6320P server system. The model used in the JURECA Booster deviates slightly from the version shown due to the processor type utilized. Copyright: Dell Technologies.

Figure 4: Allocated compute time (left) and number of projects (right) on JURECA by scientific field in the computing time period from the 1st of November 2015 to the 30th of April 2016. Percentages are shown for shares above 3 %.
(Forschungszentrum Jülich, 2018e). The other one (NIC Call) is performed by the John von Neumann Institute for Computing (NIC) (John von Neumann Institute for Computing, 2018), a joint organization of the three Helmholtz centers Forschungszentrum Jülich, Deutsches Elektronen-Synchrotron DESY (Deutsches Elektronen Synchrotron, 2018) and the GSI Helmholtzzentrum für Schwerionenforschung (GSI Helmholtzzentrum für Schwerionenforschung, 2018), accepting proposals from all other German universities and research institutions. Applicants have to demonstrate that they are qualified in their respective field and that they have appropriate knowledge in high-performance computing.

Scientists with challenging compute- or data-intensive scientific problems who require access to JURECA in order to lay the necessary software foundation for the preparation of a successful proposal can obtain a limited compute time budget on JURECA, along with expert support by a JSC simulation lab (Forschungszentrum Jülich, 2018f), by answering the bi-annual call for preparatory access and support resources (Forschungszentrum Jülich, 2018e).

Between 2015 and 2018, JURECA Cluster compute time was available for all eligible scientists via the NIC Calls. Starting from 2018, only the JURECA Booster module is made available via the national NIC Call for an interim period until approximately 2020. Compute time on the Cluster is only available via the JARA-HPC/VSR Call or for NIC users that can leverage the Cluster and Booster concurrently.