2024 Research Computing Center Tactical Plans

Monday 04/08/2024

We are more than a quarter of the way through 2024, and it feels like it has just begun! Here is a brief list of projects that the Research Computing Center is working on this year:

Finish migration of Research Archival volumes to new Ceph platform 

We began transferring data from our old ZFS-based filesystem to the new Ceph filesystem in Fall 2023. We intend to complete the transfer for the remaining customers in the coming months. The Ceph filesystem will provide better stability and manageability, and it will make it easier to increase storage on the system.

Add more GPUs to the High-Performance Computing (HPC) cluster 

GPU resources are in high demand in the FSU research community. To meet the demand, we are adding more of them to the HPC cluster and plan to increase the ratio of GPU to CPU nodes throughout the year. Our first batch of six NVIDIA A4500 nodes is already on order, and we plan to make several more purchases this year.

Implement liquid cooling in the Sliger Data Center

In the first half of 2024, we will deploy 28 liquid-cooled servers to the HPC cluster, marking a groundbreaking shift from traditional air-cooling methods. These servers, equipped with AMD EPYC 9454 48-core processors, will be housed at the Sliger Data Center, which has undergone upgrades to accommodate liquid-cooled systems. Future plans include further adoption of liquid-cooled servers and the addition of a second dedicated chilled water loop at Sliger. Installation of the second loop is slated to begin later in 2024, along with the addition of more HPC compute nodes.

Rolling software updates

During the first week of December 2023, we upgraded the HPC cluster from CentOS 8.3 to AlmaLinux 8.6. Throughout 2024, we will continue updating more software packages and libraries. Because users require different versions of software for their research workflows, we will maintain the existing versions of packages side-by-side with the new versions on the HPC cluster. Unlike last year’s upgrade, these software updates will be ongoing and will not require an offline maintenance period.

Upgrade Open OnDemand to Version 3 

We have upgraded our HPC web interface, Open OnDemand, to version 3. This upgrade brings bug fixes and user interface enhancements to the portal. With Open OnDemand, you can upload and download files; create, edit, submit, and monitor jobs; run GUI applications; and connect via SSH, all from a web browser, with no client software to install or configure. We encourage you to give it a try: https://ood.rcc.fsu.edu

Ongoing improvements to self-service portal 

We are continuing to improve the usability of our web self-service portal, https://rcc.fsu.edu/manage. You may have noticed that we recently integrated single sign-on CAS authentication, so when you are logged into other FSU websites (my.fsu.edu or canvas.fsu.edu), you are automatically logged into the RCC self-service portal. We also plan to add:

  • Technical details about our cluster, including compute node details for each queue.
  • Improved reporting of HPC and storage usage, with options to receive email reports.
  • Improved purchase information for resource owners and managers. This includes queue and storage usage statistics and the ability to see the status of your purchases online.

Improving our outreach and education

Part of the core mission of the RCC is to provide education and training opportunities to the FSU research community. We will continue to do this in 2024 by providing workshops in the Spring and Fall semesters. In addition, we are creating an HPC video training course called “HPC Driver’s Ed”, which will debut in the first half of 2024. 

The HPC Driver’s Ed course is designed to help new users understand the basics of HPC and how to best utilize HPC for research and instruction. The collection of videos, slides, and links to documentation will assist in safe and easy navigation of the HPC cluster. 

Additional items 

  • Work is ongoing on the RCCTool, a command-line utility to see information about your HPC account, view usage details, and more.
  • We are upgrading additional components of the InfiniBand infrastructure to 100 Gbps.