While the Grid’s distributed approach has proven very successful, the computing needs of the LHC experiments keep expanding, so the ATLAS collaboration has been exploring the potential of integrating high-performance computing (HPC) centres into the Grid’s distributed environment. HPC harnesses the power of purpose-built supercomputers constructed from specialised hardware, and is widely used in other scientific disciplines.
However, HPC poses significant challenges for ATLAS data processing. Access to supercomputer installations is typically subject to more restrictions than access to Grid sites, and their CPU architectures may not be suitable for ATLAS software. Their scheduling mechanisms favour very large jobs spanning many thousands of nodes, which is atypical of ATLAS workflows. Finally, a supercomputer installation may be geographically distant from the storage hosting ATLAS data, which can pose network problems.
Despite these challenges, ATLAS collaborators have successfully exploited several HPC centres over the last few years, including some near the top of the famous Top500 list of supercomputers. Technological barriers were overcome by isolating the main computation from the parts requiring network access, such as data transfer. Software issues were resolved by using container technology, which allows ATLAS software to run on any operating system, and by developing “edge services”, which enable computations to run in an offline mode without the need to contact external services.
The most recent HPC centre to process ATLAS data is Vega – the first new petascale EuroHPC JU machine, hosted at the Institute of Information Science in Maribor, Slovenia. Vega started operation in April 2021 and consists of 960 nodes, each of which contains 128 physical CPU cores, for a total of 122 880 physical or 245 760 logical cores. To put this in perspective, the total number of cores provided to ATLAS from Grid resources is around 300 000.
Due to close connections with the community of ATLAS physicists in Slovenia, some of whom were heavily involved in the design and commissioning of Vega, the ATLAS collaboration was one of the first users to be granted official time allocations. This was to the benefit of both the ATLAS collaboration, which could take advantage of a significant extra resource, and Vega, which was supplied with a steady, well-understood stream of jobs to assist in the commissioning phase.
Vega was almost continuously occupied with ATLAS jobs from the moment it was switched on, and the periods when fewer jobs were running were due either to other users on Vega or to a lack of ATLAS jobs to submit. This huge additional computing power – essentially doubling ATLAS’s available resources – was invaluable, allowing several large-scale data-processing campaigns to run in parallel. As such, the ATLAS collaboration heads towards the restart of the LHC with a fully refreshed Run 2 data set and corresponding simulations, many of which have been significantly extended in statistics thanks to the additional resources provided by Vega.
It is a testament to the robustness of ATLAS’s distributed computing systems that they could be scaled up to incorporate a single site equivalent in size to the entire Grid. While Vega will eventually be given over to other science projects, some fraction will continue to be dedicated to ATLAS. What’s more, this successful experience shows that ATLAS members (and their data) are ready to jump on the next available HPC centre and fully exploit its potential.