
HPC hardware engineer with proven expertise in designing, developing, and testing hardware components. Skilled at troubleshooting technical issues and implementing solutions that improve system performance. Works effectively with diverse teams and clients, and embraces continuous learning and new challenges in dynamic environments.
Met 99.999% uptime targets while operating Eldorado and APU HPC clusters.
Managed the ROCm driver and firmware lifecycle across compute fleets.
Diagnosed APU hardware memory topology and Infinity Fabric firmware issues.
Troubleshot InfiniBand issues to keep communication stable across distributed systems.
Ran automated provisioning, patching, and firmware updates across all clusters.
Oversaw root-cause investigations and verified the effectiveness of hardware refreshes.