RAJESH ARORA

Newark, USA

Summary

Highly skilled HPC, Data Center Linux Administrator, DevOps, and Site Reliability Engineer (SRE) with 18+ years of experience managing and optimizing Linux-based infrastructure in high-performance computing (HPC) environments. Expertise in IBM LSF (Load Sharing Facility), system automation, and performance tuning to enhance operational efficiency. Adept at troubleshooting, scripting, and ensuring high availability for mission-critical systems. Strong knowledge of server hardware, including deployment, maintenance, and troubleshooting of enterprise-grade servers, as well as racking, stacking, and replacing components. Experienced in Jenkins and GitLab Runner administration, managing CI/CD pipelines to streamline software delivery and infrastructure automation. Maintained 5 Jenkins masters with around 50 build VMs, EC2 instances (using the EC2 plugin), bare-metal servers, and Kubernetes clusters as slaves. Proficient in AWS cloud services, including EC2, S3, IAM, VPC, and Terraform for infrastructure automation. Experienced in JFrog Artifactory for artifact management and software distribution. Hands-on experience with Docker and Kubernetes, including Amazon EKS (Elastic Kubernetes Service), for containerized application deployment and orchestration. Strong expertise in Yellow Pages (NIS), Auto Mounts, SOS (Service on Site), and Scratch Areas for centralized authentication, network information services, and high-performance storage management. Skilled in Python and Bash script automation to improve efficiency and reduce manual intervention in system administration. Expertise in Dell PowerEdge, NetApp HCI, Supermicro, and HPE server hardware, including installation, maintenance, and performance optimization for enterprise computing environments. Experienced in maintaining and managing IT OpenCockpit and Checkmk monitoring tools for system and network health monitoring. 
Managed and maintained an HPC environment with over 5300 cores, ensuring optimal job scheduling, resource allocation, and performance tuning.

Work History

Senior DevOps HPC & Build/Release Infra Engineer

Lucid Motors

Newark, California

01.2022 - Current

Administer and maintain Linux-based servers in a large-scale HPC data center environment managing over 1900 cores
Manage and optimize IBM LSF job scheduling, ensuring efficient resource allocation for HPC workloads
Automate system administration tasks using Bash, Python, and Ansible
Develop and maintain Python and Bash scripts for automated monitoring, reporting, and system maintenance
Monitor system performance and troubleshoot issues using Nagios, Zabbix, ELK Stack, IT OpenCockpit, and Checkmk
Implement security best practices, including firewall rules, SELinux policies, and system hardening
Deploy, configure, and maintain server hardware, including RAID setups, BIOS/UEFI tuning, firmware updates, racking, stacking, and component replacement
Administer Jenkins and GitLab Runners, ensuring smooth execution of CI/CD pipelines for software builds and infrastructure automation
Maintained 5 Jenkins masters with around 50 build VMs, EC2 instances (using the EC2 plugin), bare-metal servers, and Kubernetes clusters as slaves
Deploy and manage AWS infrastructure, provisioning EC2 instances, configuring IAM roles, setting up S3 storage, and implementing security best practices
Manage JFrog Artifactory, optimizing artifact storage and distribution for software development workflows
Deploy and manage Docker and Kubernetes environments, ensuring efficient container orchestration with Amazon EKS
Configure and manage Yellow Pages (NIS), Auto Mounts, SOS, and Scratch Areas for centralized authentication and storage management
Collaborate with cross-functional teams to support scientific computing, enterprise applications, and DevOps processes
Manage and maintain Dell PowerEdge, NetApp HCI, Supermicro, and HPE server hardware, ensuring optimal performance and reliability

Senior HPC DevOps/System Administrator

Synaptics Incorporated

01.2020 - 01.2022

Deployed and managed LSF clusters, optimized job scheduling, and ensured high availability in an HPC environment managing 5300+ cores
Configured and maintained NFS, LDAP, Yellow Pages (NIS), Auto Mounts, SOS, Scratch Areas, and other network services to support HPC environments
Designed and implemented automated patching and system updates to improve security and compliance
Developed Python and Bash scripts for log analysis, system monitoring, and proactive issue resolution
Installed and maintained enterprise-grade servers, including hardware diagnostics, troubleshooting, racking, stacking, and replacing faulty components
Managed Jenkins and GitLab Runners to streamline CI/CD workflows and automate deployment processes
Maintained 5 Jenkins masters with around 50 build VMs, EC2 instances, bare-metal servers, and Kubernetes clusters as slaves
Implemented AWS-based cloud solutions, deploying scalable infrastructure using Terraform and CloudFormation
Administered JFrog Artifactory, ensuring secure and efficient artifact storage for CI/CD workflows
Implemented Docker containerization strategies and orchestrated workloads using Kubernetes and Amazon EKS
Worked on cloud-based solutions to extend HPC capabilities using AWS and OpenStack
Maintained and managed IT OpenCockpit and Checkmk monitoring tools to ensure system performance and reliability
Managed and maintained Dell PowerEdge, NetApp HCI, Supermicro, and HPE server hardware, ensuring system stability and performance

DevOps Engineer

Apple Inc

08.2019 - 01.2020

Working closely with the Apple Customer Care and Apple GBI teams to enroll new Apple devices in production Oracle and Cassandra databases
Creating new environments and maintaining existing QA and production environments for DEP, ABM, and AEM
Debugging and fixing device-replacement and fraudulent-device issues in coordination with various teams within Apple
Providing 24/7 production support on Oracle/Cassandra databases hosted on Unix/Linux servers
Application deployments and environment configuration using Chef, with automated software installation using Chef cookbooks
Using Splunk, AppDynamics, Zabbix, Kafka, and AWS CloudWatch for instance, network, systems, and application monitoring
Handling Apple's weekly releases by deploying new code to the production environment using tools such as Jenkins, GitHub, Carnival, and Docker
Providing configuration management support by creating and cutting release branches and integrating the code before release
Taking care of running post release jobs and fixing the post release issues
Providing application support on Java servers (JBoss, Tomcat, WebLogic) and applications on production servers
Writing SQL queries to identify and troubleshoot issues within Oracle and SQL Server databases
Configuring and troubleshooting problems related to the WebLogic, JBoss, and Tomcat application servers
Debugging live-site issues using Splunk search queries and fixing them in coordination with the DBA and development teams
Working on database migration from Oracle to Cassandra
Working on automating CI/CD jobs
Writing scripts in Python and Bash to automate the daily jobs

Cloud Operation Lead Engineer

Arlo Technologies

San Jose, CA

08.2018 - 08.2019

Company Overview: Arlo Technologies provides security and video monitoring solutions for home and business
It is focused on bringing together deep expertise in product design, wireless connectivity and RF engineering, cloud infrastructure
Troubleshot build, deployment, and production-support issues on Java servers running on bare metal and on cloud infrastructure, tracked through an SLA-based ServiceNow ticketing system
Created and maintained fully automated CI/CD pipelines for code deployment using Jenkins and Harness
Built and deployed Docker containers to break up a monolithic app into microservices, improving developer workflow, increasing scalability, and optimizing speed
Identified and troubleshot issues within Oracle and SQL Server databases, with good experience in PL/SQL
Maintaining and troubleshooting issues related to DynamoDB
Using the Splunk search language (SPL) for log analysis to debug live-site issues
Deploying Java applications on production servers
Application deployments and environment configuration using Chef, with automated software installation using Chef cookbooks
Using Splunk, AppDynamics, Zabbix, Kafka, and AWS CloudWatch for instance, network, systems, and application monitoring
Implemented Kubernetes to deploy, scale, load-balance, and manage Docker containers with multiple namespaced versions
Extensive experience using Maven and Ant as build tools for building of deployable artifacts (jar, war & ear) from source code
Designed and implemented a VPC to extend the customer's on-premises data center into AWS, using AWS VPC, VPN, and Direct Connect services
Used AWS S3 as a build artifact repository, creating release-based buckets to store module- and branch-based artifacts
Migrated Arlo Technologies' Akamai edge caching solution to AWS CloudFront
Managed Amazon Web Services: VPC, EC2, S3 buckets, DynamoDB, CLI, Route 53, ELB, Auto Scaling, ACL, SQS, SNS, CloudFormation, KMS, and IAM
Hands-on experience with CloudFormation templates and Terraform
Worked with engineers, QA, business, and other teams to ensure automated test efforts are tightly integrated with the build system and to fix errors encountered during builds and deployments
Scripting in Python and Bash for automation purposes
Installed Jenkins on a Linux machine and created a master and slave configuration to implement multiple parallel builds through a build farm
Maintained Arlo's highly available, secure, multi-zone Dev/QA/Staging and Production environments on AWS and deployed applications such as Wowza Streaming Engine, Utramedia, Apache web servers, and Tomcat servers
Launching Amazon EC2 instances using Amazon images (Linux/Ubuntu) and configuring launched instances for specific applications
Installing, configuring, and maintaining Linux/Solaris servers, NIS, DNS, NFS, mailing lists, Sendmail, Apache, FTP, and SSHD
Created secured cloud infrastructure using VPC for staging and development environments on AWS

Release Engineer and Application support

PayPal Inc.

San Jose, CA

04.2010 - 08.2018

Provided 24x7 support for an ecommerce online payment solution by maintaining an automated build process, enhancing software installation (RPM) and service control, and resolving live site issues as they arise in a Red Hat Linux environment
Provided application support on Java servers (JBoss, Tomcat, WebLogic) and Sparta applications on production servers
Wrote SQL queries to identify and troubleshoot issues within Oracle and SQL Server databases
Configured and troubleshot problems related to the WebLogic, JBoss, and Tomcat application servers
Deployed Java applications in continuous release format
Troubleshot applications to diagnose problems on the PayPal production website and coordinated with operations teams to create strategies and detailed plans for deployment sequencing and timing
Coordinated with various teams, including the PayPal NOC, UNIX delivery, and Tivoli teams, to maintain 100% uptime, fix issues, and perform upgrades and migration tasks on production and test environments
Supported large-scale, high-availability, 24x7 production environment on UNIX/LINUX and Windows based Network Systems of PayPal Inc
Monitored system health and uptime using Big Brother and Nagios
Wrote Python, Perl, and Bash scripts to automate various tasks and reports
Worked across all levels of the PayPal organization, collaborating closely with the engineering, quality, operations, deployment, and production support teams
Involved in deploying PayPal products and applications into production environments
Worked closely with application developers to devise robust deployment, operating, monitoring, and reporting processes for PayPal applications, and tuned application configuration to optimize performance as requested by developers
Applied Unix systems administration skills at PayPal Inc. to smooth daily operations across software and applications engineering, release engineering/configuration management, scripting and applications/web development, the software development life cycle, incident, problem, change, and release management, large-scale distributed application environments, Java and C++, UNIX environments and platforms, web architecture, and load-balancing appliances in a web environment
Pushed C and Java code to production environments (RHEL 5.4 hosts and JBoss application servers) using Tivoli framework tools and the RPM utility
Web server administration: configuring and managing multiple web server instances in Apache
Involved in PayPal's tech-refresh and cap-add projects to migrate the OS (from Red Hat 6.2 to RHEL 5.4), deploy code through third-party tools such as Turbo-Roller and Tivoli scripts, and bring servers back into traffic

Lead Consultant (UNIX and Application Support Engineer)

Genpact Headstrong India Private Ltd (Client: Agilent Technologies)

08.2005 - 02.2010

IT infrastructure support for Agilent Production environment having 1000+ Linux RHEL 4AS, HP9000/SUN-Solaris servers & 1000+ Windows 2000 servers supporting Agilent customers including the mission-critical/production applications
System administration of AIX 5L, Linux (Red Hat), SCO OpenServer, and UnixWare
Linux/Solaris/HP-UX/Windows 2000 server administration: maintaining around 1000 Solaris/Linux/HP-UX/Windows 2000 servers (onsite), monitoring and managing their memory/CPU loads, disk space, and running processes in order to support the website
Responsible for configuring the Endeca search tool
Pushing configuration and code to various environments
Monitoring of production, UAT, and development websites
Troubleshot issues within the defined SLA and escalated them when required
Building new environments and websites and deploying code to the JBoss and WebLogic servers
Remote administration of HP servers using secure telnet and SSH sessions
Configuring and troubleshooting application servers such as BroadVision and WebLogic, including determining process states and restarting hung processes
Integrated Apache with JBoss/Tomcat using mod_jk connectors by compiling Apache
Building and compiling Apache with different modules as each environment required
Coordinating with various teams, including the Akamai and Deloitte teams, to maintain 100% uptime
Configuring and troubleshooting rsync issues in production and UAT environments

Education

Bachelor of Science - Computer Science and Engineering

Bangalore University

Skills

Operating Systems: RHEL, CentOS, Ubuntu, SUSE Linux

Cluster & Job Scheduling: IBM LSF, SLURM

Scripting & Automation: Bash, Python, Ansible, Terraform

Cloud & Virtualization: VMware, AWS (EC2, S3, IAM, VPC, Route 53, Lambda, CloudFormation, Terraform, EKS), OpenStack, KVM, LXD

Containers and Orchestration: Docker, Kubernetes, Amazon EKS

Storage & Networking: NFS, SAN, iSCSI, LDAP, TCP/IP, DNS, DHCP, Yellow Pages (NIS), Auto Mounts, SOS, Scratch Areas

Security & Monitoring: SELinux, Firewalld, Nagios, Zabbix, Prometheus, ELK Stack, IT OpenCockpit, Checkmk

Configuration Management: Ansible, Puppet, Chef

Version Control & CI/CD: Git, GitLab, SVN, Jenkins, GitLab Runner, CI/CD pipelines, JFrog Artifactory

Server Hardware: Dell PowerEdge, NetApp HCI, Supermicro, HPE, Cisco UCS, RAID, BIOS/UEFI configurations, IBM RS/6000, racking, stacking, and component replacement

Certification

• AWS Certified Solutions Architect – Associate

• Red Hat Certified System Administrator (RHCSA)

• Oracle 10x
