Highly skilled HPC and data center Linux administrator, DevOps engineer, and Site Reliability Engineer (SRE) with 18+ years of experience managing and optimizing Linux-based infrastructure in high-performance computing (HPC) environments. Expertise in IBM LSF (Load Sharing Facility), system automation, and performance tuning; adept at troubleshooting, scripting, and ensuring high availability for mission-critical systems.
Experienced in Jenkins and GitLab Runner administration, managing CI/CD pipelines to streamline software delivery and infrastructure automation. Maintained 5 Jenkins controllers whose build agents included around 50 VMs, EC2 instances (via the EC2 plugin), bare-metal servers, and Kubernetes clusters. Experienced in JFrog Artifactory for artifact management and software distribution. Skilled in Python and Bash scripting to automate system administration and reduce manual intervention.
Proficient in AWS services, including EC2, S3, IAM, and VPC, and in Terraform for infrastructure automation. Hands-on experience with Docker and Kubernetes, including Amazon EKS (Elastic Kubernetes Service), for containerized application deployment and orchestration.
Strong knowledge of enterprise server hardware (Dell PowerEdge, NetApp HCI, Supermicro, and HPE), including deployment, racking and stacking, component replacement, maintenance, troubleshooting, and performance optimization. Strong expertise in Yellow Pages (NIS), Auto Mounts, SOS (Service on Site), and Scratch Areas for centralized authentication, network information services, and high-performance storage management. Experienced in maintaining openITCOCKPIT and Checkmk for system and network health monitoring.
Managed and maintained an HPC environment of more than 5,300 cores, handling job scheduling, resource allocation, and performance tuning.
Operating Systems: RHEL, CentOS, Ubuntu, SUSE Linux
Cluster & Job Scheduling: IBM LSF, SLURM
Scripting & Automation: Bash, Python, Ansible, Terraform
Cloud & Virtualization: VMware, AWS (EC2, S3, IAM, VPC, Route 53, Lambda, CloudFormation, EKS), OpenStack, KVM, LXD
Containers & Orchestration: Docker, Kubernetes, Amazon EKS
Storage & Networking: NFS, SAN, iSCSI, LDAP, TCP/IP, DNS, DHCP, Yellow Pages (NIS), Auto Mounts, SOS, Scratch Areas
Security & Monitoring: SELinux, firewalld, Nagios, Zabbix, Prometheus, ELK Stack, openITCOCKPIT, Checkmk
Configuration Management: Ansible, Puppet, Chef
Version Control & CI/CD: Git, GitLab, SVN, Jenkins, GitLab Runner, CI/CD pipelines, JFrog Artifactory
Server Hardware: Dell PowerEdge, NetApp HCI, Supermicro, HPE, Cisco UCS, IBM RS/6000, RAID, BIOS/UEFI configuration, racking, stacking, and component replacement
• AWS Certified Solutions Architect – Associate
• Red Hat Certified System Administrator (RHCSA)
• Oracle 10x