RAJESH ARORA

Newark, USA

Summary

Highly skilled HPC, Data Center Linux Administrator, DevOps, and Site Reliability Engineer (SRE) with 18+ years of experience managing and optimizing Linux-based infrastructure in high-performance computing (HPC) environments. Expertise in IBM LSF (Load Sharing Facility), system automation, and performance tuning to enhance operational efficiency. Adept at troubleshooting, scripting, and ensuring high availability for mission-critical systems. Strong knowledge of server hardware, including deployment, maintenance, and troubleshooting of enterprise-grade servers, as well as racking, stacking, and replacing components. Experienced in Jenkins and GitLab Runner administration, managing CI/CD pipelines to streamline software delivery and infrastructure automation. Maintained 5 Jenkins masters with around 50 build VMs, EC2 instances (using the EC2 plugin), bare-metal servers, and Kubernetes clusters as slaves. Proficient in AWS cloud services, including EC2, S3, IAM, VPC, and Terraform for infrastructure automation. Experienced in JFrog Artifactory for artifact management and software distribution. Hands-on experience with Docker and Kubernetes, including Amazon EKS (Elastic Kubernetes Service), for containerized application deployment and orchestration. Strong expertise in Yellow Pages (NIS), Auto Mounts, SOS (Service on Site), and Scratch Areas for centralized authentication, network information services, and high-performance storage management. Skilled in Python and Bash script automation to improve efficiency and reduce manual intervention in system administration. Expertise in Dell PowerEdge, NetApp HCI, Supermicro, and HPE server hardware, including installation, maintenance, and performance optimization for enterprise computing environments. Experienced in maintaining and managing IT OpenCockpit and Checkmk monitoring tools for system and network health monitoring. 
Managed and maintained an HPC environment with over 5300 cores, ensuring optimal job scheduling, resource allocation, and performance tuning.

Overview

20 years of professional experience
1 Certification

Work History

Senior DevOps HPC & Build/Release Infra Engineer

Lucid Motors
Newark, California
01.2022 - Current
  • Administer and maintain Linux-based servers in a large-scale HPC data center environment managing over 1900 cores
  • Manage and optimize IBM LSF job scheduling, ensuring efficient resource allocation for HPC workloads
  • Automate system administration tasks using Bash, Python, and Ansible
  • Develop and maintain Python and Bash scripts for automated monitoring, reporting, and system maintenance
  • Monitor system performance and troubleshoot issues using Nagios, Zabbix, ELK Stack, IT OpenCockpit, and Checkmk
  • Implement security best practices, including firewall rules, SELinux policies, and system hardening
  • Deploy, configure, and maintain server hardware, including RAID setups, BIOS/UEFI tuning, firmware updates, racking, stacking, and component replacement
  • Administer Jenkins and GitLab Runners, ensuring smooth execution of CI/CD pipelines for software builds and infrastructure automation
  • Maintain 5 Jenkins masters with around 50 build VMs, EC2 instances (using the EC2 plugin), bare-metal servers, and Kubernetes clusters as slaves
  • Deploy and manage AWS infrastructure, provisioning EC2 instances, configuring IAM roles, setting up S3 storage, and implementing security best practices
  • Manage JFrog Artifactory, optimizing artifact storage and distribution for software development workflows
  • Deploy and manage Docker and Kubernetes environments, ensuring efficient container orchestration with Amazon EKS
  • Configure and manage Yellow Pages (NIS), Auto Mounts, SOS, and Scratch Areas for centralized authentication and storage management
  • Collaborate with cross-functional teams to support scientific computing, enterprise applications, and DevOps processes
  • Manage and maintain Dell PowerEdge, NetApp HCI, Supermicro, and HPE server hardware, ensuring optimal performance and reliability

Senior HPC DevOps/System Administrator

Synaptics Incorporated
01.2020 - 01.2022
  • Deployed and managed LSF clusters, optimized job scheduling, and ensured high availability in an HPC environment managing 5300+ cores
  • Configured and maintained NFS, LDAP, Yellow Pages (NIS), Auto Mounts, SOS, Scratch Areas, and other network services to support HPC environments
  • Designed and implemented automated patching and system updates to improve security and compliance
  • Developed Python and Bash scripts for log analysis, system monitoring, and proactive issue resolution
  • Installed and maintained enterprise-grade servers, including hardware diagnostics, troubleshooting, racking, stacking, and replacing faulty components
  • Managed Jenkins and GitLab Runners to streamline CI/CD workflows and automate deployment processes
  • Maintained 5 Jenkins masters with around 50 build VMs, EC2 instances, bare-metal servers, and Kubernetes clusters as slaves
  • Implemented AWS-based cloud solutions, deploying scalable infrastructure using Terraform and CloudFormation
  • Administered JFrog Artifactory, ensuring secure and efficient artifact storage for CI/CD workflows
  • Implemented Docker containerization strategies and orchestrated workloads using Kubernetes and Amazon EKS
  • Worked on cloud-based solutions to extend HPC capabilities using AWS and OpenStack
  • Maintained and managed IT OpenCockpit and Checkmk monitoring tools to ensure system performance and reliability
  • Managed and maintained Dell PowerEdge, NetApp HCI, Supermicro, and HPE server hardware, ensuring system stability and performance

DevOps Engineer

Apple Inc
08.2019 - 01.2020
  • Worked closely with the Apple Customer Care and Apple GBI teams to enroll new Apple devices in production Oracle and Cassandra databases
  • Created new environments and maintained existing QA and production environments for DEP, ABM, and AEM
  • Debugged and fixed device replacement and fraud device issues in coordination with various teams within Apple
  • Provided 24x7 production support for Oracle/Cassandra databases hosted on Unix/Linux servers
  • Handled application deployments and environment configuration using Chef, and automated software installation using Chef cookbooks
  • Used Splunk, AppDynamics, Zabbix, Kafka, and AWS CloudWatch for instance, network, system, and application monitoring
  • Managed Apple weekly releases by deploying new code to the production environment using tools such as Jenkins, GitHub, Carnival, and Docker
  • Provided configuration management support by creating and cutting release branches and integrating the code before release
  • Ran post-release jobs and fixed post-release issues
  • Provided application support on Java servers (JBoss, Tomcat, WebLogic) and applications on production servers
  • Wrote SQL queries to identify and troubleshoot issues within Oracle and SQL Server databases
  • Configured and troubleshot application servers (WebLogic, JBoss, Tomcat)
  • Debugged live site issues using Splunk search queries and fixed them in coordination with the DBA and development teams
  • Worked on migrating databases from Oracle to Cassandra
  • Worked on automating CI/CD jobs
  • Wrote Python and Bash scripts to automate daily jobs

Cloud Operation Lead Engineer

Arlo Technologies
San Jose, CA
08.2018 - 08.2019
  • Company Overview: Arlo Technologies provides security and video monitoring solutions for home and business, bringing together deep expertise in product design, wireless connectivity, RF engineering, and cloud infrastructure
  • Troubleshot build, deployment, and production support issues on Java servers running on bare metal and cloud infrastructure, working through an SLA-based ServiceNow ticketing system
  • Created and maintained fully automated CI/CD pipelines for code deployment using Jenkins and Harness
  • Built and deployed Docker containers to break up a monolithic app into microservices, improving developer workflow, increasing scalability, and optimizing speed
  • Identified and troubleshot issues within Oracle and SQL Server databases, drawing on solid PL/SQL experience
  • Maintained and troubleshot DynamoDB
  • Used the Splunk query language for log analysis to debug live site issues
  • Deployed Java applications on production servers
  • Handled application deployments and environment configuration using Chef, and automated software installation using Chef cookbooks
  • Used Splunk, AppDynamics, Zabbix, Kafka, and AWS CloudWatch for instance, network, system, and application monitoring
  • Implemented Kubernetes to deploy, load balance, scale, and manage Docker containers with multiple namespaced versions
  • Used Maven and Ant extensively as build tools to produce deployable artifacts (JAR, WAR, and EAR) from source code
  • Designed and implemented VPC services to extend the customer's on-premises data center into the AWS cloud using AWS VPC, VPN, and Direct Connect
  • Used AWS S3 as a build artifact repository, creating release-based buckets to store module- and branch-based artifacts
  • Migrated Arlo Technologies' Akamai edge caching solution to AWS CloudFront
  • Managed Amazon Web Services: VPC, EC2, S3, DynamoDB, CLI, Route 53, ELB, Auto Scaling, ACL, SQS, SNS, CloudFormation, KMS, and IAM
  • Worked hands-on with CloudFormation templates and Terraform
  • Worked with engineering, QA, business, and other teams to ensure automated test efforts were tightly integrated with the build system, and helped fix errors during builds and deployments
  • Wrote Python and Bash scripts for automation
  • Installed Jenkins on a Linux machine and created a master and slave configuration to implement multiple parallel builds through a build farm
  • Maintained Arlo's highly available, secure, multi-zone Dev/QA/Staging and Production environments on AWS, deploying applications such as Wowza Streaming Engine, Utramedia, Apache web servers, and Tomcat servers
  • Launched Amazon EC2 instances from Amazon machine images (Linux/Ubuntu) and configured them for specific applications
  • Installed, configured, and maintained Linux/Solaris servers, NIS, DNS, NFS, mailing lists, Sendmail, Apache, FTP, and SSHD
  • Created secured cloud infrastructure using VPC for staging and development environments on AWS

Release Engineer and Application support

PayPal Inc.
San Jose, CA
04.2010 - 08.2018
  • Provided 24x7 support for an ecommerce online payment solution by maintaining an automated build process, enhancing software installation (RPM) and service control, and resolving live site issues as they arise in a Red Hat Linux environment
  • Provided application support on Java servers (JBoss, Tomcat, WebLogic) and Sparta applications on production servers
  • Wrote SQL queries to identify and troubleshoot issues within Oracle and SQL Server databases
  • Configured and troubleshot application servers (WebLogic, JBoss, Tomcat)
  • Deployed Java applications in a continuous release format
  • Troubleshot applications to diagnose problems on the PayPal production website and coordinated with operations teams to create strategies and detailed plans for deployment sequencing and timing
  • Coordinated with various teams, including the PayPal NOC, UNIX delivery, and Tivoli teams, to maintain 100% uptime, fix issues, and perform upgrades and migration tasks on production and test environments
  • Supported a large-scale, high-availability, 24x7 production environment on UNIX/Linux and Windows-based network systems at PayPal Inc.
  • Monitored system health and uptime using Big Brother and Nagios
  • Wrote Python, Perl, and Bash scripts to automate various tasks and reports
  • Worked at all levels of the PayPal organization, collaborating closely with the engineering, quality, operations, deployment, and production support teams
  • Deployed PayPal products and applications into production environments
  • Worked closely with application developers to devise robust deployment, operating, monitoring, and reporting processes for PayPal applications, and tuned application configuration to optimize performance per developer requests
  • Applied Unix systems administration skills at PayPal Inc. to smooth daily operations across software and applications engineering, release engineering/configuration management, scripting and applications/web development, the software development life cycle, incident, problem, change, and release management, large-scale distributed application environments, Java and C++, UNIX environments and platforms, web architecture, and load-balancing appliances in a web environment
  • Pushed C and Java code to production environments (RHEL 5.4 hosts and JBoss application servers) using Tivoli framework tools and the RPM utility
  • Web server administration: configured and managed multiple web server instances in Apache
  • Participated in PayPal's tech-refresh and cap-add project to migrate the OS (from Red Hat 6.2 to RHEL 5.4), deploying code through third-party tools such as Turbo-Roller and Tivoli scripts and bringing servers back into traffic

Lead Consultant (UNIX and Application Support Engineer)

Genpact Headstrong India private Ltd (Client Agilent Technologies)
08.2005 - 02.2010
  • Provided IT infrastructure support for the Agilent production environment of 1000+ Linux RHEL 4AS, HP9000/Sun Solaris servers and 1000+ Windows 2000 servers supporting Agilent customers, including mission-critical production applications
  • Administered AIX 5L, Linux (Red Hat), SCO OpenServer, and UnixWare systems
  • Linux/Solaris/HP-UX/Windows 2000 server administration: maintained around 1000 Solaris/Linux/HP-UX/Windows 2000 servers (onsite), monitoring and managing their memory and CPU loads, disk space, and running processes to support the website
  • Configured the Endeca search tool
  • Pushed configuration and code to various environments
  • Monitored production, UAT, and development websites
  • Troubleshot issues within the defined SLA and escalated them when required
  • Built new environments and websites and deployed code to JBoss and WebLogic servers
  • Remotely administered HP servers using secured telnet and SSH sessions
  • Configured and troubleshot application servers such as BroadVision and WebLogic, including determining process states and restarting hung processes
  • Integrated Apache with JBoss/Tomcat using mod_jk connectors by compiling Apache
  • Built and compiled Apache with different modules as each environment required
  • Coordinated with various teams, including the Akamai and Deloitte teams, to maintain 100% uptime
  • Configured and troubleshot rsync issues in production and UAT environments

Education

Bachelor of Science - Computer Science And Engineering

Bangalore University

Skills

Operating Systems: RHEL, CentOS, Ubuntu, SUSE Linux

Cluster & Job Scheduling: IBM LSF, SLURM

Scripting & Automation: Bash, Python, Ansible, Terraform

Cloud & Virtualization: VMware, AWS (EC2, S3, IAM, VPC, Route 53, Lambda, CloudFormation, Terraform, EKS), OpenStack, KVM, LXD

Containers and Orchestration: Docker, Kubernetes, Amazon EKS

Storage & Networking: NFS, SAN, iSCSI, LDAP, TCP/IP, DNS, DHCP, Yellow Pages (NIS), Auto Mounts, SOS, Scratch Areas

Security & Monitoring: SELinux, Firewalld, Nagios, Zabbix, Prometheus, ELK Stack, IT OpenCockpit, Checkmk

Configuration Management: Ansible, Puppet, Chef

Version Control & CI/CD: Git, GitLab, SVN, Jenkins, GitLab Runner, CI/CD pipelines, JFrog Artifactory

Server Hardware: Dell PowerEdge, NetApp HCI, Supermicro, HPE, Cisco UCS, RAID, BIOS/UEFI configurations, IBM RS/6000, racking, stacking, and component replacement

Certification

• AWS Certified Solutions Architect – Associate

• Red Hat Certified System Administrator (RHCSA)

• Oracle 10x
