Overview
On October 16, 2023, the U.S. Department of Energy (DOE) launched the High Performance Data Facility (HPDF), a scientific user facility specializing in advanced infrastructure for data-intensive science. DOE’s Office of Science (SC) named Thomas Jefferson National Accelerator Facility (Jefferson Lab) as the HPDF lead, locating the HPDF Hub infrastructure at the lab’s campus in Newport News, Virginia. The HPDF Project will be a partnership between Jefferson Lab and Lawrence Berkeley National Laboratory (LBNL). The two labs moved immediately to form an integrated team led by Jefferson Lab.
HPDF will be a first-of-its-kind SC user facility that fits within and adds world-class capabilities to the Advanced Scientific Computing Research (ASCR) and SC data and computing infrastructure ecosystem. The facility’s mission will be to enable and accelerate scientific discovery by delivering state-of-the-art data management infrastructure, capabilities, and tools. As a cornerstone of the DOE’s Integrated Research Infrastructure (IRI) initiative, a successful, fully realized HPDF will be widely recognized as a national and international leader in uplifting data science and high-performance data infrastructure.
HPDF is envisioned as a hub-and-spoke model, in which the Hub will host centralized resources and enable high-priority DOE mission applications at Spoke sites by deploying and orchestrating distributed infrastructure at the Spokes or other locations. The number and variety of Spokes are expected to grow and evolve with mission requirements, consistent with SC’s deep experience with the Cooperative Stewardship model for major research infrastructure.
The project team is tasked with designing and delivering a geographically resilient and innovative HPDF, capable of meeting the needs of diverse users, institutions, and use cases. The joint project will itself provide the template for the first Spoke partnerships and blaze new paths in institutional engagement and outreach in the emerging era of artificial intelligence-enabled integrated science.
Data Life Cycle
HPDF will offer innovative production data services and software tools to support the entire data life cycle. A key goal is facilitating data management and interoperability, i.e., making data available to a broad scientific community, providing for new technologies and user access patterns, and preserving the data for future use.
- Data storage, access, and discovery. Core capabilities to spearhead scientific data-driven discovery and provide access to data through various services (e.g., search tools) and interfaces (e.g., web and application programming interfaces).
- Data life cycle tools and services. Support for the entire data life cycle, providing tools and services to move, clean, process, analyze, share, and collaborate while supporting new technologies, such as AI and novel hardware or software technologies. User support and data stewards will help users and scientific communities advance their data analysis.
- Data preservation. Storage and access through a data repository framework that ensures FAIR data support and availability for future use via federated archives and catalogs while providing digital object identifiers in partnership with the Office of Scientific and Technical Information (OSTI).
- Seamless data and compute infrastructure. The free flow of data and workloads among HPDF resources, including institutional clusters, cloud, and high-performance computing using IRI.
Project Lead Institutions
Jefferson Lab’s existing data center infrastructure will be expanded into a new, free-standing data center building, the Jefferson Lab Data Center (JLDC). The Commonwealth of Virginia will share in funding this project. Completion of the building and infrastructure required for the HPDF Project is anticipated in 2029. JLDC and HPDF team subject matter experts have visited leading laboratories to gather data center architecture best practices and lessons learned.
Jefferson Lab’s leadership in scientific computing and data science in the nuclear physics community provides a conduit to connect nuclear physics experimental use cases and high-performance computing, analogous to the role Berkeley Lab plays in connecting high-energy physics use cases to HPC facilities.
Berkeley Lab manages five DOE Office of Science National User Facilities, which together provide nearly 14,000 researchers per year with advanced capabilities in high-performance computing and data science, chemical sciences, materials synthesis and characterization, and genomic science.
The HPDF Hub infrastructure at Berkeley Lab will be located in Shyh Wang Hall. This existing state-of-the-art, LEED Gold-certified facility uses free-air cooling and is also home to NERSC and ESnet. The building design allows for highly efficient operations, with a power usage effectiveness (PUE) below 1.1 for its current systems. Wang Hall has 20,000 square feet of existing machine room space and an additional 10,000 square feet of unfinished shell space to accommodate future growth.
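For readers unfamiliar with the metric, power usage effectiveness is conventionally defined as total facility power divided by the power delivered to IT equipment, so a value below 1.1 means cooling, power distribution, and other overhead add less than 10 percent on top of the computing load. The sketch below works through that arithmetic. The function name and the kilowatt figures are hypothetical and chosen only for illustration; they are not measurements from Wang Hall.

```python
def power_usage_effectiveness(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Return PUE, the ratio of total facility power to IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    if total_facility_kw < it_equipment_kw:
        raise ValueError("total facility power cannot be less than IT equipment power")
    return total_facility_kw / it_equipment_kw


# Hypothetical figures for illustration only (not Wang Hall data):
# 1,000 kW of IT load plus 80 kW of cooling and distribution overhead.
pue = power_usage_effectiveness(total_facility_kw=1080.0, it_equipment_kw=1000.0)

# Overhead as a fraction of the IT load; an ideal facility would have 0.
overhead_fraction = pue - 1.0

print(f"PUE = {pue:.2f}, overhead = {overhead_fraction:.0%} of IT load")
```

With these assumed numbers the ratio comes out under the 1.1 threshold quoted for the building's current systems.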