HPC-AI Technology Survey 2023: Storage and Interconnect Technologies


Intersect360 Research surveyed the user community for High Performance Computing (HPC) and artificial intelligence (AI) on a wide range of technology issues. The complete study analyzes users’ current computing systems, processing elements, storage systems, networks, operating environments, cloud computing usage, and selected forward-looking trends. Our goal in this analysis is to provide an overview of how HPC-AI systems are configured, including the breadth of technologies most commonly used. The survey audience included members of the worldwide HPC-AI user community spanning industry, government, and academia.

Reports in this Intersect360 Research HPC-AI Technology Survey series include the following segmentations:

  • Systems, CPUs, and Accelerators: including system vendors installed; current and planned installations and user preferences for CPUs and accelerators; system utilization rates; and usage of liquid cooling.
  • Storage and Interconnect Technologies: including total active HPC data; storage configurations spanning on-node, attached storage arrays, and cloud storage; parallel file system usage; system interconnects and speeds; and composable infrastructure.
  • Operating Environments: including installations of operating systems, middleware packages, and developer tools.
  • Cloud Computing: including current and planned proportion of computing and storage in public cloud for HPC and top named cloud vendors.

This report provides a detailed examination of the storage and interconnect technologies that comprise respondents’ HPC-AI infrastructure. We look at the top storage vendors for HPC-AI, as well as the various technologies at play for data storage and system interconnects, including an analysis of the role of Ethernet versus InfiniBand for systems. We also look at trends in data processing units (DPUs), file systems, and storage archives, including the role of cloud computing for storage.

Both HPC and AI are data-hungry endeavors. Over 80% of respondents indicated they had over 1 petabyte (PB) of active data as part of their HPC-AI environments. Over 70% said their projected near-term data growth was more than 10% per year, and 13% said active data would grow more than 25% per year.
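To put these reported growth rates in perspective, a simple compound-growth projection (an illustration, not part of the survey data; the starting volume and horizon are assumptions) shows how quickly active data scales at 10% versus 25% per year:

```python
def projected_data(initial_pb: float, annual_growth: float, years: int) -> float:
    """Project active data volume under constant compound annual growth."""
    return initial_pb * (1 + annual_growth) ** years

# Hypothetical site starting at the survey's 1 PB threshold, five-year horizon:
print(round(projected_data(1.0, 0.10, 5), 2))  # 10% per year -> 1.61 PB
print(round(projected_data(1.0, 0.25, 5), 2))  # 25% per year -> 3.05 PB
```

Even the lower reported growth rate implies roughly a 60% increase in active data over five years, while the 25% rate triples it.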

Dell was the leading vendor by total mentions, followed by HPE and DDN. The position of Dell and HPE at the top reflects a common tendency among users to bundle storage purchases with compute cluster or system purchases. DDN remains the most commonly named storage-focused vendor in our survey, outpacing enterprise-focused brands such as NetApp and EMC. VAST Data has grown quickly in recent years, surpassing other high-performance storage vendors, including Panasas and Weka, in survey share.

Parallel file system usage has increased significantly, perhaps owing to the rise of AI. Lustre is now pulling away as the clear leading parallel file system in HPC-AI. 47% of users reported “broad usage” of Lustre; this is a significant gain from three years ago.

InfiniBand is still the most commonly used system interconnect for HPC-AI, with Ethernet not far behind. Together, InfiniBand and Ethernet represent the majority of system interconnect usage. Ethernet is more common in the commercial sector and in smaller systems, whereas InfiniBand skews to the public sector and to larger systems. InfiniBand is also preferred for higher-speed interconnects.

This is a time of unusual upheaval in HPC, with the rapid adoption of new technologies to accommodate the computational requirements of AI. These results point to a pending seismic shift in processing technologies and the systems that run them, with significant implications for storage and networking technologies as well. We advise our clients to build the flexibility needed to respond to these volatile dynamics. Scalability in this context will not simply mean "biggest" or "fastest" in a monolithic sense. Storage and networking strategies should incorporate both HPC and AI, with multiple data and storage form factors, and the ability to adopt new technologies and standards, from edge to core to cloud.