Madhukar Kanduri

Linux & Datacenter Administration
Sunnyvale, CA

Summary

Results-driven Datacenter and Lab Systems Administrator specializing in rack-level AI hardware integration and liquid-cooled GPU/APU platforms. Proficient in managing compute infrastructure in Linux and VMware environments, maintaining performance through troubleshooting, monitoring, and system upgrades. Experienced with cooling systems and CDUs and with 24/7 production support, backed by a strong background in networking, backup and recovery solutions, and hardware management. Skilled in root cause analysis (RCA), vendor management, and AI/GPU server bring-up and validation in high-density environments.

Overview

10 years of professional experience
3 Certifications

Work History

Datacenter/Lab Technician

AMD
07.2024 - Current
  • Lead rack-level integration, bring-up, validation, and lifecycle management of liquid-cooled HPE Cray MI200 GPU/APU blade systems and AMD MI300X GPU platforms on Supermicro and ZT Systems, including component replacement, failure analysis, troubleshooting, and RMA to ensure reliability, performance, and minimal downtime.
  • Perform hands-on data center operations including rack & stack, cabling, power integration, and airflow optimization for high-density AI and HPC compute racks, ensuring scalable designs with efficient power distribution and cooling.
  • Integrate, configure, and maintain power infrastructure by managing PDUs, performing firmware updates, validating power stability, and ensuring safe and efficient power delivery across liquid-cooled and air-cooled GPU environments.
  • Support end-to-end system qualification and deployment workflows, including firmware flashing using Dediprog, CRD flashing via Jenkins, driver validation, ROCm installation, post-deployment testing, and specialized configurations such as Wombat and UART setup using Raspberry Pi.
  • Deploy, provision, and maintain operating systems using Canonical MAAS to streamline OS installations, automate updates, and support large-scale GPU lab environments.
  • Configure, deploy, and troubleshoot networking components to ensure secure, high-performance connectivity, while monitoring system health and performance using Zabbix and Grafana to proactively identify and resolve issues.
  • Collaborate closely with facilities, vendors, and cross-functional engineering teams to diagnose and resolve rack-level power, cooling, hardware integration, and infrastructure issues in GPU lab and data center environments.
  • Manage hardware inventory, asset tracking, and audits using SnipeIT and SuperGodzilla, while maintaining accurate records, integration procedures, test results, and hardware changes in Confluence for operational transparency and collaboration.
  • Coordinate task orchestration and issue resolution using Conductor and GitLab, tracking data center and lab tickets to ensure timely resolution of complex hardware, firmware, and infrastructure-related issues.

Linux & Data Center Operations Engineer

Infosys-Broadcom
05.2021 - 03.2024
  • Supported data center rack deployments and ensured infrastructure readiness for production environments through detailed power, cooling, and temperature audits.
  • Managed and monitored data center servers, physical hardware, security systems, and access controls to maintain operational integrity and compliance.
  • Coordinated with colocation facilities, data center operators, and third-party vendors for hardware installation, decommissioning, rack-level changes, and on-site repairs while adhering to security and operational standards.
  • Installed, configured, and maintained Linux and Windows operating systems, including firmware updates, OS upgrades, patching, and resolution of OS and software-related issues.
  • Performed hands-on hardware installation and troubleshooting for compute, network, and storage components, resolving infrastructure incidents and minimizing downtime.
  • Collaborated closely with network and storage administrators to address connectivity issues, plan upgrades, and manage vendor dependencies.
  • Managed accounts and data center access through the Cyxtera portal, and monitored DCIM platforms such as Device42 for infrastructure visibility and asset tracking.
  • Documented operational procedures, maintained compliance records, supported 24×7 infrastructure uptime, and executed vulnerability remediation to meet enterprise reliability and security requirements.

Linux/VMware Compute Operations Engineer

Maryland Dept of Health
09.2020 - 04.2021
  • Installed, configured, and patched Linux, Unix, and Windows applications in Ansible environments.
  • Utilized Ansible for task automation, privilege escalation, and managing playbooks, roles, and inventory.
  • Resolved Linux/Unix issues on physical and virtual servers and collaborated with cross-functional teams.
  • Managed user accounts, security patches, and repositories with Yum and Chocolatey.
  • Administered Windows servers, performed disaster recovery, and coordinated release management with Git.
  • Worked with the ELK stack and F5 load balancers.

Linux/VMware Compute Operations Engineer

Meta
01.2020 - 08.2020
  • Optimized business operations systems to enhance daily efficiency.
  • Coordinated hardware replacements and issue resolution with HP, Dell, and VMware.
  • Managed VMware vSphere operations, including vSAN health and vMotion.
  • Performed datastore management, network configurations, and software upgrades.
  • Utilized VMware tools for automation and operational management.
  • Conducted Linux VM migrations, resource sizing, and log analysis.

Linux System Administrator

WORLDPAC
04.2019 - 12.2019
  • Performed general Linux and VMware administration.
  • Worked with Dell EMC RecoverPoint, vSAN, VPC creation, and cross-region networking (VPC peering).
  • Handled SSL certificate management, EDI troubleshooting, email security (AppRiver), and IMAP/DNS configurations.
  • Performed Veeam backups, hardware replacements, and upgrades.

Linux/Unix High Performance Computing

KLA-Tencor
04.2018 - 03.2019
  • Managed Linux environments (OpenSUSE, RedHat, CentOS, Ubuntu) across desktop and server editions.
  • Built and troubleshot server racks (16U, 42U, 48U) for projects like Lotus, ADC 2.0, and Exige.
  • Proficient in VMware, KVM, OpenStack, and Supermicro hardware, including JBOD enclosures.
  • Documented procedures for system builds, upgrades, and Netgear switch configurations.
  • Configured RAID and network settings and flashed BIOS firmware, using tools such as YaST and deployment tools such as Clonezilla.
  • Managed compute nodes, DVP systems, and remote tools (IPMI, iDRAC) in data centers and clean rooms.

Linux/VMware/Cloud (AWS) System Administrator

Stanford University (UIT)
11.2016 - 03.2018
  • Managed Linux OS configurations and AWS systems.
  • Administered AWS services (EC2, S3, Route53, IAM) with automation via Ansible and Puppet.
  • Performed backups, recovery, and configuration management with IBM TSM, Git, and ServiceNow.
  • Automated security tasks, applied patches, and monitored systems with tools such as OEM, Nagios, Qualys, Splunk, and OSSEC.
  • Built and managed VMware and Dell PowerEdge hosts, including network and storage configurations.

Linux (Azure) Support Engineer/Admin

Microsoft
12.2015 - 10.2016
  • Supported Microsoft Azure Linux platforms, resolving issues across distributions and Azure services.
  • Troubleshot Azure configurations, resolved critical production issues, and enhanced platform performance.
  • Automated tasks with Bash scripts and managed Azure deployments using OpenStack templates.
  • Collaborated with network and storage teams for tuning and feature enhancements.
  • Provided 24/7 support, managed incidents, and utilized tools like Azure Chef, Jenkins, and Ansible.

Education

Bachelor's

Osmania University
Hyderabad, India
04.2010

Master's - Business Administration With IT

NPU
Fremont, CA
09.2015

Skills

  • Data Centers & Labs
  • Rack & Stack Servers, Network & Storage Devices
  • Linux and Windows OS Installations
  • Server Configuration
  • Networking
  • Backup and Recovery tools
  • Hardware Deployment
  • Monitoring Tools
  • Cloud Service
  • Virtualization
  • Demand forecasting
  • Documentation
  • Cross-functional team leadership
  • Datacenter Operations Management
  • Procurement
  • Asset Management

Certification

NVIDIA-Certified Associate: AI Infrastructure and Operations (NCA-AIIO)
