
Nandan Yadla

Bridgeport, CT

Summary

Experienced Site Reliability Engineer/DevOps Engineer with 8+ years of expertise in designing, implementing, and securing complex cloud infrastructures and deployments on Amazon Web Services (AWS). Experienced in designing, implementing, maintaining, and automating highly available and scalable applications.

Overview

9
years of professional experience

Work History

SRE / DevOps Engineer

AIG
02.2023 - Current

Managed builds and deployments across Development, Testing, Pre-production, and Production environments, implementing a high-availability architecture with Disaster Recovery in Amazon Web Services (AWS) across multiple availability zones

  • Performed Infrastructure Monitoring using tools, such as Prometheus and Grafana, to gain insights into clusters and address issues as needed
  • Built performance dashboard in Grafana for tracking K8S metrics
  • Performed thorough analysis of application performance using Real User Monitoring (RUM) metrics in New Relic, monitoring the four Golden Signals and Apdex ratings
  • Implemented efficient incident management techniques, leading to a significant decrease in both Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) for incidents using New Relic
  • Performed routine monitoring health checkups using New Relic & Splunk and created custom reports to present the stakeholders with status of the applications
  • Performed capacity assessments to calculate CPU and memory requirements for new servers according to current needs
  • Used Terraform to set up AWS infrastructure, including EC2 instances, S3 buckets, Virtual Private Cloud (VPC), Public and Private Subnets, IAM roles and policies, Route Tables, Security Groups, Storage Groups, Elastic Load Balancer (ELB) & Application Load Balancer (ALB), and Elastic Kubernetes Service (EKS)
  • Worked with Terraform for IaC (Infrastructure as Code), ensuring consistent and reproducible infrastructure for staging and production environments
  • Hands-on experience with writing Terraform Scripts to create infrastructure for Dev, UAT, Staging and Prod environments
  • Administered and maintained Docker runtime environment, managing Docker images using versioning and lifecycle management strategies
  • Deployed and managed OpenShift clusters in production environments, ensuring scalability, high availability, and efficient resource utilization for containerized applications
  • Wrote Ansible playbooks to configure, deploy, and maintain software components of the existing infrastructure
  • Developed automation scripts and deployment pipelines to support Blue-Green deployments, ensuring reliable and repeatable processes
  • Used Git as source code repositories and managed Git repositories for branching, merging, and tagging
  • Implemented and maintained Jenkins pipelines, streamlining the CI/CD process, and ensuring efficient software delivery
  • Wrote and maintained Jenkinsfiles, defining stages and steps and integrating with a variety of tools for building, testing, and deploying applications
  • Created a fully automated build and deployment platform by coordinating code builds, promotions, and orchestrated deployments using Jenkins
  • Worked on Docker container snapshots, attaching to a running container, removing images, managing directory structures, and managing containers
  • Worked with Kubernetes to automate deployment, scaling, and management of containerized web applications
  • Designed and implemented rollback strategies, allowing for quick and seamless rollback to the blue environment in case of issues or failures in the green environment

DevOps Engineer

HMS Holdings
01.2022 - 02.2023
  • Managed multiple applications across different environments such as DEV, QA, UAT, PRE-PROD and PROD for various releases, designed instance strategies and created Release Calendar
  • Played a key role in the installation, configuration, and administration of Red Hat Linux (versions 4.x, 5.x, and 6.1) and Windows servers, utilizing Kickstart and JumpStart servers
  • Provided support for applications running on these platforms
  • Utilized Helm to install Prometheus and Grafana, enabling monitoring of applications in the Kubernetes cluster
  • Proficiently conducted Infrastructure Monitoring using industry-leading tools, including Prometheus and Grafana, to gain valuable insights into cluster performance and swiftly address any arising issues
  • Created custom dashboards in Amazon CloudWatch using scripts to monitor EC2 performance, CPU utilization, memory usage, and disk usage; used Splunk for application performance monitoring to provide visibility and actionable insights
  • Implemented comprehensive monitoring and alerting solutions using New Relic, enabling proactive identification and resolution of performance issues across critical applications and infrastructure
  • Developed custom dashboards and reports in New Relic to visualize real-time performance data and provide actionable insights to engineering and operations teams
  • Configured and maintained CI/CD pipelines on OpenShift by integrating Jenkins, GitLab, or other tools, automating the deployment of microservices and applications in a reliable and consistent manner
  • Set up AWS CloudWatch alarms for server performance monitoring, including CPU utilization and disk usage, ensuring proactive identification of potential issues
  • Streamed AWS CloudWatch logs to Splunk by triggering AWS Lambda and pushing events for real-time analysis
  • Set up and maintained the ELK (Elasticsearch, Logstash, Kibana) platform, parsing unstructured logs with regular expressions into structured JSON format
  • Performed log aggregation and analysis using tools like Elasticsearch, Logstash, and Kibana (ELK stack)
  • Leveraged log data to identify patterns, troubleshoot issues, and optimize system performance
  • Managed the performance dashboard in Kibana to track application metrics and application performance
  • Enhanced server security by configuring NGINX to block malicious requests, prevent directory listing, and limit access
  • Proactively resolved issues in Ansible environments by troubleshooting and comparing run lists across environments, identifying the causes of failures, editing playbooks, maintaining coverage, and managing configurations
  • Assisted System Administrators in troubleshooting configuration management and network-related issues, including IP networking problems involving firewalls, DNS, and load balancers
  • Collaborated with the team to identify and resolve issues, ensuring smooth functioning of the network infrastructure
  • Utilized Terraform for provisioning AWS cloud infrastructure, contributing to the creation of AWS Batch policies to enable Lambda functions for various tasks and leveraging the full potential of AWS Batch features
  • Created and managed S3 buckets, implemented policies, and utilized S3 and Glacier for archival storage and backup
  • Contributed to the development of a Maven-based build environment, focusing on building, testing, and integrating import and export components of the Java framework

DevOps Engineer

Synchrony
05.2020 - 07.2021
  • Worked on DevOps/Agile operations processes and tools (code review, unit test automation, build and release automation, and environment, service, incident, and change management)
  • Worked closely with the development and operations organizations to implement the necessary tools and processes to support the automation of builds, deployments, testing, and infrastructure using Ansible
  • Created Chef Cookbooks to deploy new software and plugins as well as manage deployments to the production Jenkins server
  • Worked in setting up Chef Infrastructure, Chef-repo, and Bootstrapping Chef nodes
  • Implemented automated local user provisioning on VMs created in OpenStack and AWS Cloud through Chef recipes
  • Involved heavily in setting up the CI/CD pipeline using Jenkins, Maven, Nexus, GitHub, Puppet and AWS
  • Designed and created multiple deployment strategies using CI/CD pipelines and configuration management tools to ensure zero downtime and shorten deployment cycles through automated deployments
  • Designed and deployed AWS solutions using EC2 instances, EBS, S3, RDS, Elastic Load Balancer, and Auto Scaling groups
  • Worked on creating Docker containers and Docker consoles
  • Maintained and monitored Docker in a cloud-based service in production
  • Used Jenkins pipelines to drive all microservices builds out to the Docker registry and deploy them to Kubernetes; created and managed Pods using Kubernetes
  • Worked extensively with Jenkins, configuring and maintaining it for continuous integration and end-to-end automation of all builds and deployments
  • Developed build automation and build pipelines using Jenkins and Maven
  • Configured various Jenkins plugins to automate the workflow, keep build jobs running smoothly, and implement continuous integration and deployment
  • Wrote PowerShell scripts for the team's use with customers; the scripts were heavily used and saved considerable time on each case
  • Managed and configured SVN/Git, resolved source code management issues, and managed branching, merging, and the code freeze process
  • Collaborated closely with Development and Testing teams, providing process design, management, and support for source code control, compilation, change management, and production release management on AWS
  • Developed an automation system utilizing AWS Lambda and PowerShell scripts with JSON templates for remediating AWS services
  • Developed Splunk queries and dashboards on AWS to analyze and optimize application performance and capacity, enabling data-driven decision-making
  • Integrated Git into the Continuous Integration (CI) environment along with AnthillPro and Jenkins
  • Used Nagios as a monitoring tool to identify and resolve infrastructure problems before they affected critical processes, and worked on Nagios event handlers for automatic restart of failed applications and services
  • Developed automated deployment scripts using Maven and Python to deploy WAR files, properties files, and database changes to development, QA, and Staging/Production servers
  • Performed daily maintenance routines on Linux servers, monitoring system access, managing file space, and tuning the system for optimum performance

Build & Release Engineer

Vodafone
02.2018 - 04.2020
  • Worked extensively with Azure PaaS and Azure IaaS - Virtual Networks, Virtual Machines, Cloud Services, Resource Groups, Express Route, VPN, Load Balancing, Application Gateways, Auto-Scaling, and Traffic Manager
  • Designed and implemented end-to-end CI/CD pipelines using Azure DevOps, ensuring automated builds, tests, and deployments for applications
  • Containerized applications using Docker and orchestrated containerized workloads on Azure Kubernetes Service (AKS), enhancing scalability, and reducing infrastructure overhead
  • Worked in creating and managing infrastructure using Azure Resource Manager (ARM) templates, enabling consistent and reproducible deployments
  • Designed and implemented fully automated server build management, monitoring, and deployment using technologies such as Splunk, Shell scripts, GitLab, Maven, Jenkins, SonarQube, Nexus, JUnit, and Ansible
  • Used Elastic Load Balancers (ALB & CLB) to distribute traffic to EC2 instances in round-robin fashion and to health-check EC2 instances, along with Route 53
  • Utilized Helm charts to install Prometheus and Grafana, enabling monitoring of application performance in the Kubernetes cluster
  • Developed Ansible playbooks to automatically generate start and stop scripts for application services, enhancing operational efficiency and ensuring consistent server shutdown and startup procedures
  • Developed Python scripts with scheduling capabilities to automate routine tasks, such as data backups, system maintenance, and report generation
  • Wrote Python scripts that automatically organize, rename, and move files based on predefined rules, optimizing file management and ensuring data consistency
  • Developed Ansible scripts for automated server provisioning and Docker container deployments, reducing provisioning and deployment time
  • Designed and implemented Azure virtual servers using Ansible roles to ensure seamless deployment of web applications
  • Created and maintained Ansible scripts for automated deployment of CI/CD applications using Kubernetes and YAML files to ensure efficiency and consistency
  • Implemented auditing for Azure Kubernetes Service clusters and monitored logs within specific namespaces using Azure Monitor and Azure Log Analytics
  • Employed Azure Monitor and Log Analytics for monitoring and logging various application logs
  • Established a robust CI/CD pipeline using Jenkins, GitHub, Azure DevOps, Maven, and Azure Virtual Machines
  • Managed Docker images, container snapshots, image removal, and Docker volumes using Azure Container Registry and Azure Kubernetes Service
  • Integrated continuous integration systems with Git repositories and maintained Maven project dependencies by establishing parent-child relationships between projects
  • Installed and configured Azure Artifacts for sharing artifacts among internal teams, enhancing build efficiency
  • Automated server build management, monitoring, and deployment using technologies such as Splunk, Shell scripts, Azure DevOps, GitLab, Maven, Jenkins, SonarQube, and Nexus
  • Implemented Elastic Load Balancer (ELB) and Azure Traffic Manager for load balancing and health checking of Azure Virtual Machines along with Azure DNS
  • Proficiently used Azure Kubernetes Service (AKS), Azure Container Service (ACS), Docker, Docker Swarm, and Ansible for building automation pipelines and managing production deployments
  • Utilized Azure Boards and Kanban boards for agile workflow visualization and project management
  • Designed and implemented robust Source Code Management (SCM) processes and procedures using GitHub and Azure DevOps
  • Demonstrated expertise in JIRA for issue tracking, project management, change management, and release management
  • Supported and developed tools for integration, automated testing, and release management

Linux Administrator

Prolifics Global technology
03.2016 - 01.2018
  • Installed, configured, and administered Red Hat Linux servers and provided server support
  • Planned and performed upgrades to Linux (RHEL, SUSE 10, 11, CentOS) operating systems and hardware maintenance such as changing memory modules and replacing disk drives
  • Supported server builds, patching, user administration, deployment, software installation, performance tuning, troubleshooting, and KVM
  • Installed and configured Oracle 7.x/8.x
  • Handled NFS, Auto Mount, DNS, and LDAP related issues
  • Monitored CPU, memory, physical disks, hardware and software RAID, multipath, file systems, and networks
  • Performed failover and integrity tests on new servers before rolling out to production
  • Wrote shell scripts to automate daily tasks, document changes in the environment and on each server, and analyze error logs, user logs, and /var/log/messages
  • Good understanding of the OSI model and the TCP/IP protocol suite (DNS, IP, ARP, TCP, UDP, SMTP, FTP, and TFTP)
  • Knowledge of Routers and Switches, Subnet, VLAN, TCP/IP, Ethernet, VPN, OSI model, Firewall
  • Worked on network security, including NAT/PAT, ACLs, AAA, and ASA firewalls
  • Created local repositories on Linux servers and performed server updates, patching, upgrade, and package installations using RPM and YUM
  • Installed Firmware Upgrades, Kernel patches, systems configuration, performance tuning on Linux systems
  • Extensive knowledge on Server administration, Kernel upgrade and deployment of patches and applying all firewall and security policies with emphasis on maintaining best practices
  • Identified, troubleshot, and resolved problems with the OS build failures
  • Installed, configured, and customized services such as Sendmail, Apache, and FTP servers to meet user needs and requirements
  • Delivered customer-focused support through phone calls and ticket-based communications
  • Managed user accounts, groups, and access levels
  • Monitored system performance, including virtual memory, swap space, disk utilization, and CPU utilization
  • Implemented logical volume management
  • Administered systems security and user access using Role-Based Access Control
  • Installed and maintained applications on Linux servers
  • Performed regular system maintenance, including operating system and application patching on Linux Servers
  • Made recommendations for systems modifications to improve network and hardware components as needed

Education

MSICS -

University Of Bridgeport
Bridgeport, CT
01.2023

B.Com - Computers

Osmania University
Hyderabad, India
01.2016

Skills

  • Operating Systems: Linux (Ubuntu, RHEL, CentOS, Fedora, Solaris, SUSE), Windows
  • Cloud Platform: Amazon Web Services (AWS)
  • Version Control System / SCM: Git, GitHub, Bitbucket, AWS CodeCommit
  • Infrastructure Monitoring: Prometheus, Nagios, Grafana, Amazon CloudWatch
  • Application Monitoring: AppDynamics, New Relic, Splunk, Datadog, Dynatrace, ELK Stack, Kibana
  • Infrastructure Provisioning: Terraform, AWS CloudFormation
  • Continuous Integration (CI) Tools: Jenkins, Azure DevOps, GitHub Actions, Travis CI, Bamboo
  • Continuous Deployment (CD) Tools: ArgoCD
  • Containerization: Docker
  • Orchestration Platforms: Kubernetes, Docker Swarm
  • Artifactory/Repositories: JFrog, Nexus, S3
  • Configuration Management: Ansible
  • Data Streaming Tools: Apache Kafka
  • Testing Tools: SonarQube, Selenium, JUnit, Pytest, Karma, Jasmine, TestNG
  • Languages: Java, Python, JavaScript, PHP, HTML, Node.js
  • Databases: MySQL, Oracle, MongoDB, DynamoDB
  • Scripting Languages: Bash, Python, Perl, PowerShell, Groovy, HCL, JSON, YAML
  • Web Servers: Apache HTTP Server, Nginx, IIS, Cherokee
  • Application Servers: Apache Tomcat, IBM WebSphere, JBoss, Jetty, NodeJS, WebLogic, Oracle Application Server
  • Ticketing/Bug Tracking: JIRA, ServiceNow
  • Chaos Engineering: Litmus, Chaos Monkey, Gremlin, Chaos Toolkit

Professional Summary

  • Experienced in provisioning, configuring, and troubleshooting various AWS cloud services such as VPC, Route 53, Security Groups, IAM, EC2, ELB (Load Balancers), S3, RDS, ASG, SNS, CloudWatch, CloudFront, and CloudFormation templates.
  • Experienced in deploying, maintaining, and troubleshooting applications on-premises as well as on cloud platforms such as AWS.
  • Experienced in defining and managing Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) to ensure the delivery and performance of services align with business requirements, maintaining high availability and reliability of the application.
  • Designed and implemented an observability infrastructure from scratch, including monitoring, logging, and tracing components, to ensure real-time insights into system performance.
  • Experienced in setting up and instrumenting logging, tracing, alerting, and monitoring tools such as Splunk, AppDynamics, Dynatrace, New Relic, Datadog, ELK Stack, Prometheus, Grafana, and OpenTelemetry for performance analysis and troubleshooting in Kubernetes environments.
  • Performed Application Monitoring using tools like New Relic, AppDynamics, Datadog, Dynatrace to measure the metrics such as Response time, Latency, Error Rates, Throughput, Uptime/Downtime.
  • Performed distributed tracing using AppDynamics, Dynatrace, Datadog & New Relic for troubleshooting and performing root cause analysis.
  • Performed thorough analysis of the web application performance by utilizing Real User Monitoring (RUM) Metrics using tools like New Relic, AppDynamics, Datadog, Dynatrace, by monitoring the four Golden Signals (Latency, Traffic, Errors, and Saturation), evaluating the user satisfaction using Apdex ratings, and assessing the XHR actions.
  • Participated in incident management and provided 24x7 on-call support for production environments.
  • Successfully reduced Mean Time to Detect (MTTD) and Mean Time to Recovery (MTTR) by implementing proactive monitoring and troubleshooting procedures, resulting in faster incident resolution and minimized downtime.
  • Conducted post-mortem analyses and identified gaps in the available runbooks, updating them regularly to anticipate future incidents.
  • Resolved bottlenecks (OOM issues, thread issues, heap dumps, garbage collection), optimized configurations, and worked with development teams to improve system performance and reduce latency.
  • Created custom dashboards in Dynatrace and Grafana to present observability insights and findings from MELT (Metrics, Events, Logs, Traces) data to stakeholders.
  • Experienced in Log aggregation and analysis using tools like Splunk and Elasticsearch, Logstash, and Kibana (ELK stack). Leveraged log data to identify patterns, troubleshoot issues, and optimize system performance.
  • Experienced in building performance dashboards by integrating the log and performing analysis using tools like ELK Stack and Splunk with Grafana and Kibana.
  • Experienced with Splunk Query Language (SPL) to search log data and create meaningful reports and visualizations.
  • Developed automated remediation scripts in Python and Bash to resolve common issues swiftly, reducing Mean Time to Detect (MTTD) and Mean Time to Resolution (MTTR).
  • Conducted performance profiling and optimization efforts based on insights derived from Observability.
  • Troubleshot network connectivity problems within Kubernetes clusters, resolving issues related to service discovery, DNS resolution, network policies, and ingress configurations.
  • Debugged and resolved issues related to deployments and replication controllers, ensuring successful rollout and scaling of application components, troubleshooting issues with rolling updates, and managing replica sets.
  • Worked extensively in setting up intelligent alerting systems using tools such as PagerDuty and AWS CloudWatch Alarms, enabling rapid response to critical incidents.
  • Hands-on experience with implementing and enhancing the end-to-end workflow in the CI/CD pipelines using Jenkins.
  • Experienced in various programming and scripting languages, especially Shell and Python, with a focus on DevOps tools and CI/CD; configured, deployed, and supported cloud services on AWS.
  • Used Terraform to set up AWS infrastructure, including EC2 instances, S3 buckets, Virtual Private Cloud (VPC), Public and Private Subnets, IAM roles and policies, Route Tables, Security Groups, Storage Groups, Elastic Load Balancer (ELB) & Application Load Balancer (ALB), and Elastic Kubernetes Service (EKS).
  • Monitored and analyzed the performance of round-robin scheduling through metrics and logging, making data-driven decisions to optimize the load-balancing configuration.
  • Implemented content compression techniques to reduce bandwidth usage and improve website loading times.
  • Used Chaos Toolkit to define and execute chaos experiments, validating the system's ability to recover from failures.
  • Integrated Python, Shell, and PowerShell scripts into DevOps pipelines to automate code builds, deployments, and testing, enabling continuous integration and continuous deployment (CI/CD).
  • Configured the Horizontal Pod Autoscaler (HPA) and verified that it was accurately defined, including the target resource (CPU or custom metrics), minimum and maximum replicas, and scaling thresholds.
  • Implemented Blue-Green deployment strategy, reducing downtime and mitigating risks during software releases, ensuring smooth transitions between versions.
  • Experienced in writing Puppet manifests and Ansible playbooks for the administration of multiple servers.
  • Experience working on Docker Hub, creating Docker images, and handling multiple images primarily for middleware installations and domain configurations.
  • Proficient in orchestrating Docker containers using a combination of tools such as Docker Compose and Kubernetes.
  • Integrated AWS Web Application Firewall (WAF) with ELB to enhance security and protect against common web threats.
  • Configured the Nginx Ingress Controller to handle path-based routing inside Kubernetes cluster.
  • Experienced in deploying JBoss, Apache Tomcat Web Server, IIS Server, Oracle WebLogic, and IBM WebSphere.
