Specific Technical Requirements:
• Design, architect, and oversee the implementation of #Linux-based HPC clusters and storage
• Deploy physical hardware using #HPC deployment tools and configuration and orchestration tools (Ansible)
• Parallel file system (#GPFS) performance tuning, monitoring and troubleshooting
• Perform systems benchmarking and develop automated tests for the HPC environment to ensure the reliability and efficiency of our computational infrastructure
• InfiniBand network maintenance and troubleshooting
• Automate and monitor the HPC user lifecycle process
• Slurm installation, configuration, performance tuning and troubleshooting
• Plan, design, and implement a transition from the LSF scheduler to Slurm
• Manage the Slurm scheduler and translate research policies into scheduler configurations
• Consult with faculty and students to develop research pipelines for use on the HPC cluster
• Develop and maintain the user lifecycle software suite in #Python and implement a #CI/CD pipeline
• Test and automate upgrades of critical system applications using #Ansible and #shell scripts