IT professional with over 15 years of experience designing, implementing, and overseeing large-scale server infrastructures. Specializes in NVIDIA GPU technology, AI/ML, and containerized deployments. Proficient in Linux administration, including RHEL, CentOS, and Ubuntu, with extensive experience in system configuration and issue resolution.
Key Skills:
Server Infrastructure: Expertise in managing server hardware and diverse server technologies to deliver scalable IT solutions.
Proof of Concept (POC) Labs: Demonstrated success in establishing POC labs, assessing hardware/software integration strategies, and advancing state-of-the-art machine learning and HPC systems.
NVIDIA GPU Technology: Deployed 8-node GPU super pods utilizing HGX servers with A100 GPUs and managed a 9-node DGX-1 cluster.
ML/DL Platforms: Created solutions for ML/DL platforms using Kubernetes and Docker.
Automation & High Availability: Well-versed in automating system setups, setting up high-availability servers, and executing complex network configurations.
Hadoop Ecosystem: Proficient in HDFS, YARN, Hive, HBase, Spark, and Ambari setup and management.
Lifecycle Management: Skilled at managing the lifecycle of server hardware from setup through upgrades, ongoing performance monitoring, and decommissioning.
Collaboration & Leadership: Team player skilled in working with cross-functional teams to design and execute high-availability systems and disaster recovery plans.
Technical Support & Documentation: Adept at providing pre-sales and post-sales technical support, preparing technical documentation, and ensuring optimal system performance and capacity management.
Presales quarterback supporting multibillion-dollar enterprise customers, building relationships with their engineering teams and helping them solve business-critical engineering problems.
Evaluated emerging technologies to stay current on industry trends, making informed decisions for technology adoption.
Set up state-of-the-art Proof of Concept (POC) labs for new hardware configurations by collaborating with technology partners, understanding the latest technology, and applying it to current market requirements.
Collaborated with Trace3's NorCal AMs/AEs to build pipelines, showcase technology roadmaps, and introduce new products beneficial to their customers.
Engaged with OEM and vendor SEs to learn about their products and how they can provide solutions for our customers.
Assessed new hardware and software integration strategies and evaluated their applicability to Trace3/Groupware customers' requirements.
Monitored and improved implementation procedures and working practices, and conducted architecture chalk talks on design concepts.
Created detailed documentation for systems architecture, troubleshooting issues, support guidelines, system metrics, and project information plans.
Contributed to writing Statements of Work (SOWs) for proposals and installations.
Developed Bills of Materials (BOMs) for new servers and storage on different OEM and vendor platforms.
Played a key engineering role in the design, development, and operation of large-scale infrastructure systems spanning 12,000 servers and 500 TB of storage.
Deployed an 8-node GPU super pod using HGX servers with A100 GPUs for a customer.
Managed a 9-node DGX-1 cluster at the Trace3 POC lab, used by the NVIDIA team, customers, and internal Trace3 teams to showcase and test solutions.
Installed and tested NVAIE and Run.ai as part of solution testing and POC
Built, evolved, and scaled state-of-the-art machine learning GPU and HPC servers and systems infrastructure powering data and artificial intelligence platforms.
Installed and verified TensorFlow, Caffe2, NVIDIA CUDA drivers and libraries on bare-metal servers and as Docker images.
Implemented scalable solutions to solve complex problems for ML and DL platforms using Kubernetes, adding GPU and HPC servers to existing infrastructure as needed.
Set up and managed container-based deployment solutions using Docker and Kubernetes
Conducted post-sale requirements gathering, analysis and documentation.
Worked with Docker images, Docker Hub, and Docker registries, creating custom Docker container images, tagging, and pushing the images.
Hands-on experience with the NVIDIA Docker/Docker2 plugin for deploying GPU-accelerated applications on Linux GPU servers.
Upgraded and validated NVIDIA CUDA drivers on GPU server platforms.
Built POC Kubernetes cluster on bare metal servers and automated application deployments
Added worker nodes as necessary to increase capacity and support scaling of microservices running on Kubernetes.
Monitored storage capacity and tracked growth trends to inform decisions about data needs driven by application changes.
Installed new systems, storage solutions, and software, drawing on a solid grasp of server hardware and operating system functionality.
Experienced in setting up high-availability/clustered servers critical to keeping business operations running smoothly.
Working knowledge of multiple scripting languages for developing automation and reporting solutions.
Collaborated within a team to design system and storage configurations.
Communicated with partners and clients to update product and implementation status at technical or functional level.
Provided 2nd and 3rd level technical support and troubleshooting to internal and external clients.
Worked with customers and prospective customers to develop integrated solutions and led detailed architectural discussions to facilitate delivery of comprehensive solutions.
Participated in system development life cycle from requirements analysis through system implementation.
Sr. Systems Engineer
Trace3 (Formerly Groupware Technology Inc.)
02.2013 - 03.2019
Installed, configured, and upgraded Agile Systems, providing hardware and kernel troubleshooting for RHEL 7.x, Ubuntu, and CentOS 7.x.
Oversaw hardware performance to identify issues causing network or filesystem I/O degradation, troubleshot hardware components to reduce equipment downtime substantially, and resolved problems with Linux OS kernel software modules, including process management, memory management, and hardware drivers.
Proposed technically feasible solutions for new system designs and recommended options for improving the performance of technical components.
Participated in system development life cycle from requirements analysis through system implementation.
Maintained proficiency in automating OS installation and deployment using Stacki, Kickstart, Ansible, and PXE. Installed a Kickstart server, creating and modifying scripts for PXE installations to standardize the Linux install image/configuration.
Modified and created Ansible playbooks for infrastructure implementation and Continuous Integration.
Built custom software packages/patches (RPM, Red Hat Package Manager) for Linux systems.
Configured, installed, and onboarded Hadoop servers/racks to different Hadoop clusters, including the installation and configuration of HDFS, YARN, Hive, HBase, Spark, Pig, and Ambari.
Worked in a Big Data environment with an understanding of the complete data flow: source, store (data lake), and ETL (extract, transform, load).
Planned, built, and migrated server/Hadoop infrastructure to a new datacenter.
Assisted the Sales team with pre-sales and post-sales activities, including writing technical requirements and assisting in the quote process for purchasing system components.
Participated in primary design meetings for customer disaster recovery planning
Provided support and services for Groupware customers and performed root cause analysis for production issues.
Mentored junior engineers in advanced concepts, fostering professional development within the team.
Installed and upgraded NetApp filers in both ONTAP 7-mode and C-mode
Performed filer head swaps from 6xxx to 8xxx, installed new disk shelves, and created new aggregates on the filers.
Assisted with and implemented migration from NetApp 7-mode to cDOT using the 7MTT CFT tool.
Sr. Linux Administrator (Datacenter Build)
Groupware Technology
06.2012 - 02.2013
Built Linux and Oracle Infrastructure, including Oracle Database and Oracle RAC.
Installed and configured Red Hat Enterprise Linux on bare metal servers and virtual machines.
Configured Blade server chassis and created server profiles for OS installation.
Collaborated with database administration to ensure optimal database performance and maintain development applications.
Tuned kernel parameters based on application/database requirements.
Discovered and attached LUNs in RHEL, OVM, and Citrix Xen servers, configuring them with Linux multipathing.
Managed Linux user accounts, groups, directories, and file permissions.
Performed server consolidation and virtualization using Citrix Xen and Oracle VM (OVM).
Installed and supported bare-metal and virtual Red Hat Linux systems with Oracle 11g R2 RAC and Oracle Applications 11i/R12.
Monitored system resources, logs, and disk usage using Nagios and Ganglia, and performed backups/restores.
Created and migrated users on Active Directory and Office 365, including mailbox migration from Exchange Server to Office 365 using MigrationWiz.
Deployed Red Hat Network Satellite server for patch management and provisioning.
Created and cloned channels to custom channels in RHN Satellite Server for server registration.
Re-signed RPMs with GPG keys and uploaded them to custom channels.
Cloned errata and pushed them to the software channel in RHN Satellite Server.
Systems Administrator (Production Support)
Microsoft Corporation/Razorfish
05.2010 - 06.2012
Supported 24/7 business-critical services in Red Hat/Oracle Enterprise Linux environments.
Ensured optimal server uptime, monitoring system health and performing routine maintenance.
Conducted regular audits of user accounts and access permissions, maintaining compliance with regulatory standards and organizational policies.
Implemented server virtualization using VMware (ESX Cluster)
Maintained documentation of Linux and Oracle systems architecture and support procedures
Responded to live issues and executed changes across various environments
Developed best practices for capacity planning, monitoring, security, and recovery strategies
Performed OS and kernel upgrades on Red Hat/Oracle Enterprise Linux servers
Set up automated installation methods on Oracle Enterprise Linux (Kickstart)
Used monitoring tools like Nagios, Cacti, and GroundWork for early performance issue detection.
Linux/Unix System Administrator (Datacenter Build and Migration)
The Men's Wearhouse
12.2009 - 05.2010
Installed, configured, and upgraded Red Hat Linux (RHEL) and SUSE Linux systems.
Administered and troubleshot Linux servers running business-critical applications.
Configured Kickstart/AutoYaST servers and booted images using PXE on RHEL and SLES.
Experienced in server consolidation and virtualization using User-Mode Linux (UML), Xen, and VMware.
Managed disk storage with Veritas Volume Manager (VxVM) and Solaris Volume Manager.
Monitored system resources, logs, and disk usage with Nagios; scheduled backups and restores with Veritas NetBackup.
Jr. Systems Administrator (Internship)
3i Infotech Inc
03.2009 - 11.2009
Troubleshot development, test, and production environments on UNIX, Linux, and Windows servers.
Installed, configured, and upgraded Red Hat Linux AS and Windows servers.
Managed user permissions, quotas, and backups on Solaris and Linux servers.
Monitored servers and network components using SCOM 2007.
Performed security administration, backups, and resource monitoring.
Installed patches and packages on Windows servers.
Configured web servers using Apache and IIS on Solaris and Windows servers.
Education
MS - Electrical Engineering
Florida International University
Miami, USA
B.Tech - Electronics & Communication Engineering
JNTU Hyderabad
Hyderabad, India
Skills
Technical architecture
Infrastructure Automation
Customer Satisfaction
Solution Optimization
Improving processes
Technical Documentation and Reporting
Problem-solving aptitude
Active Listening
Team Collaboration
Critical Thinking
Adaptability and Flexibility
DevOps Practices
AI/ML
GPU
Performance Optimization
Big Data Solutions
Servers and Storage
Shell, Python, Ansible
Docker, Kubernetes
HPE, Dell, Supermicro, Lenovo, DGX
NetApp, Pure, EMC
RHEL, Ubuntu, DGX OS
AWS
Certification
Red Hat RHCSA (RHEL 7)
Timeline
Sr. Systems Engineer
Trace3 (Formerly Groupware Technology Inc.)
02.2013 - 03.2019
Sr. Linux Administrator (Datacenter Build)
Groupware Technology
06.2012 - 02.2013
Systems Administrator (Production Support)
Microsoft Corporation/Razorfish
05.2010 - 06.2012
Linux/Unix System Administrator (Datacenter Build and Migration)