HPC-AI Technology Survey 2023: Operating Environments

EXECUTIVE SUMMARY

Intersect360 Research surveyed the user community for High Performance Computing (HPC) and artificial intelligence (AI) on a wide range of technology issues. The complete study analyzes users’ current computing systems, processing elements, storage systems, networks, operating environments, cloud computing usage, and selected forward-looking trends. Our goal in this analysis is to provide an overview of how HPC-AI systems are configured, including the breadth of technologies most commonly used. The survey audience included members of the worldwide HPC-AI user community spanning industry, government, and academia.

The Intersect360 Research reports in this HPC-AI Technology Survey series cover the following segments:

  • Systems, CPUs, and Accelerators: including system vendors installed; current and planned installations and user preferences for CPUs and accelerators; system utilization rates; and usage of liquid cooling.
  • Storage and Interconnect Technologies: including total active HPC data; storage configurations spanning on-node, attached storage arrays, and cloud storage; parallel file system usage; system interconnects and speeds; and composable infrastructure.
  • Operating Environments: including installations of operating systems, middleware packages, and developer tools.
  • Cloud Computing: including the current and planned proportion of HPC computing and storage in the public cloud, and the top named cloud vendors.

This report provides a detailed examination of the software operating environments, including operating systems and middleware, that are used as part of HPC-AI infrastructure. We look at the top operating system variants, cluster and workload management tools, and container technologies, with analysis by user sector—commercial, academic, and government. More detailed analysis of complete software environments, including application software, will be provided in reports from the forthcoming Software Environments survey from Intersect360 Research.

HPC-AI operating systems are almost always versions of Linux. CentOS is the most common, but alternatives have gained ground following changes to its support structure. Cluster management tools are diverse, with OpenHPC (itself a collection of tools) leading a fragmented field. Slurm is the dominant workload management tool, though there are alternatives. The use of containers is on the rise, with Singularity and Apptainer, which share a common heritage, in the mix with Docker and Kubernetes.

In all these cases, the systems and tools used differ dramatically by economic sector. In general, commercial HPC-AI users favor professional, supported software packages, whereas their public-sector counterparts tend more toward open-source alternatives.

As important as the underlying software can be in maximizing efficient performance, it is often an afterthought in infrastructure deployment for HPC-AI. Pre-installed or bundled software is common. Furthermore, software is “sticky.” For a new software tool to gain adoption, it may need to be dramatically better than an incumbent package that administrators are accustomed to using.