Nandan Yadla

Bridgeport, CT

Summary

Experienced Site Reliability Engineer/DevOps Engineer with 8+ years of expertise in designing, implementing, and securing complex cloud infrastructures and deployments on Amazon Web Services (AWS). Experienced in designing, implementing, maintaining, and automating highly available and scalable applications.

Overview

years of professional experience

Work History

SRE / DevOps Engineer

AIG

02.2023 - Current

Managed builds and deployments across multiple environments such as Development, Testing, Pre-production & Production environments, implementing high availability architecture with Disaster Recovery in Amazon Web Services (AWS) across multiple availability zones

Performed Infrastructure Monitoring using tools, such as Prometheus and Grafana, to gain insights into clusters and address issues as needed
Built performance dashboard in Grafana for tracking K8S metrics
Analyzed application performance using Real User Monitoring (RUM) metrics in New Relic, tracking the four Golden Signals and Apdex ratings
Implemented efficient incident management techniques, leading to a significant decrease in both Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) for incidents using New Relic
Performed routine monitoring health checkups using New Relic & Splunk and created custom reports to present the stakeholders with status of the applications
Performed capacity assessments to calculate CPU and memory requirements for new servers as needed
Used Terraform to set up the AWS infrastructures such as launching the EC2 instances, S3 buckets, Virtual Private Cloud (VPC), Public and Private Subnets, IAM roles and policies, Route Tables, Security Groups, Storage Groups, Elastic Load Balancer (ELB) & Application Load Balancer (ALB) and Elastic Kubernetes Services (EKS)
Worked with Terraform for IAC (Infrastructure as Code), ensuring consistent and reproducible infrastructure for staging and production environments
Hands-on experience with writing Terraform Scripts to create infrastructure for Dev, UAT, Staging and Prod environments
Administered and maintained Docker runtime environment, managing Docker images using versioning and lifecycle management strategies
Deployed and managed OpenShift clusters in production environments, ensuring scalability, high availability, and efficient resource utilization for containerized applications
Wrote Ansible playbooks to configure, deploy, and maintain software components of the existing infrastructure
Developed automation scripts and deployment pipelines to support Blue-Green deployments, ensuring reliable and repeatable processes
Used Git as source code repositories and managed Git repositories for branching, merging, and tagging
Implemented and maintained Jenkins pipelines, streamlining the CI/CD process, and ensuring efficient software delivery
Wrote and maintained Jenkinsfiles, defining stages and steps and integrating with a variety of tools for building, testing, and deploying applications
Created a fully automated build and deployment platform by coordinating code builds, promotions, and orchestrated deployments using Jenkins
Worked on Docker container snapshots, attaching to a running container, removing images, managing directory structures, and managing containers
Worked with Kubernetes to automate deployment, scaling, and management of containerized web applications
Designed and implemented rollback strategies, allowing for quick and seamless rollback to the blue environment in case of issues or failures in the green environment

DevOps Engineer

HMS Holdings

Irving, TX

01.2022 - 02.2023

Managed multiple applications across different environments such as DEV, QA, UAT, PRE-PROD and PROD for various releases, designed instance strategies and created Release Calendar
Played a key role in the installation, configuration, and administration of Red Hat Linux (versions 4.x, 5.x, and 6.1) and Windows servers, utilizing Kickstart and JumpStart servers
Provided support for applications running on these platforms
Utilized Helm to install Prometheus and Grafana, enabling monitoring of application in the Kubernetes cluster
Proficiently conducted Infrastructure Monitoring using industry-leading tools, including Prometheus and Grafana, to gain valuable insights into cluster performance and swiftly address any arising issues
Created customized dashboards in Amazon CloudWatch using scripting techniques, enabling comprehensive monitoring of EC2 performance, CPU utilization, memory usage, disk usage, and leveraged Splunk for enhanced application performance monitoring, providing crucial visibility and actionable insights
Implemented comprehensive monitoring and alerting solutions using New Relic, enabling proactive identification and resolution of performance issues across critical applications and infrastructure
Developed custom dashboards and reports in New Relic to visualize real-time performance data and provide actionable insights to engineering and operations teams
Configured and maintained CI/CD pipelines on OpenShift by integrating Jenkins, GitLab, or other tools, automating the deployment of microservices and applications in a reliable and consistent manner
Set up AWS CloudWatch alarms for server performance monitoring, including CPU utilization and disk usage, ensuring proactive identification of potential issues
Streamed AWS CloudWatch logs to Splunk by triggering AWS Lambda and pushing events for real-time analysis
Set up and maintained the ELK (Elasticsearch, Logstash, Kibana) platform, parsing unstructured logs with regular expressions into structured JSON
Performed log aggregation and analysis using tools like Elasticsearch, Logstash, and Kibana (ELK stack)
Leveraged log data to identify patterns, troubleshoot issues, and optimize system performance
Managed the performance dashboard in Kibana to track application metrics and application performance
Enhanced server security by configuring NGINX to block malicious requests, prevent directory listing, and limit access
Proactively resolved issues related to Ansible environments, troubleshooting, and comparing run lists of different environments, identifying the cause of failures, editing playbooks, maintaining coverage, and managing the configurations
Assisted System Administrators in troubleshooting configuration management and network-related issues, including IP networking problems involving firewalls, DNS, and load balancers
Collaborated with the team to identify and resolve issues, ensuring smooth functioning of the network infrastructure
Utilized Terraform for provisioning AWS cloud infrastructure, contributing to the creation of AWS Batch policies to enable Lambda functions for various tasks and leveraging the full potential of AWS Batch features
Created and managed S3 buckets, implemented policies, and utilized S3 and Glacier for archival storage and backup
Contributed to the development of a Maven-based build environment, focusing on building, testing, and integrating import and export components of the Java framework

DevOps Engineer

Synchrony

Hyderabad, India

05.2020 - 07.2021

Worked on DevOps/Agile operations processes and tools (code review, unit test automation, build and release automation, and environment, service, incident, and change management)
Worked closely with the development and operations organizations to implement the necessary tools and processes to support the automation of builds, deployments, testing, and infrastructure using Ansible
Created Chef Cookbooks to deploy new software and plugins as well as manage deployments to the production Jenkins server
Worked in setting up Chef Infrastructure, Chef-repo, and Bootstrapping Chef nodes
Implemented automated local user provisioning on VMs created in OpenStack and AWS through Chef recipes
Involved heavily in setting up the CI/CD pipeline using Jenkins, Maven, Nexus, GitHub, Puppet and AWS
Designed and created multiple deployment strategies using CI/CD Pipelines and configuration management tools with extreme execution to ensure zero downtime and shortened deployment cycles via automated deployments
Worked in designing and deploying AWS solutions using EC2 instances, EBS, S3, RDS, Elastic Load Balancer and Auto scaling groups
Worked on creating Docker containers and Docker consoles
Maintained and monitored Docker in a cloud-based service during production
Used Jenkins pipelines to push all microservice builds to the Docker registry and deploy them to Kubernetes; created and managed Pods using Kubernetes
Extensively worked on Jenkins by configuring and maintaining for continuous integration and for end-to-end automation for all build and deployments
Developed build automation and build pipelines using Jenkins and Maven
Configured various Jenkins plugins to automate the workflow, keep build jobs running smoothly, and implement continuous integration and deployment
Wrote PowerShell scripts for the team's use with customers; the scripts were heavily used and saved significant time on each case
Managed and configured SVN/Git, resolved source code management issues, and managed branching, merging, and the code freeze process
Collaborated closely with Development and Testing teams, providing process design, management, and support for source code control, compilation, change management, and production release management on AWS
Developed an automation system utilizing AWS Lambda and PowerShell scripts with JSON templates for remediating AWS services
Developed Splunk queries and dashboards on AWS to analyze and optimize application performance and capacity, enabling data-driven decision-making
Worked on integrating GIT into the continuous Integration (CI) environment along with Anthill-Pro and Jenkins
Used Nagios as a monitoring tool to identify and resolve infrastructure problems before they affect critical processes and worked on Nagios Event handlers in case of automatic restart of failed applications and services
Developed automated deployment scripts using Maven and Python to deploy war files, properties file and database changes to development server or QA server and Staging/Production server
Performed daily maintenance routines on Linux servers, monitoring system access, managing file space, and tuning the system for optimum performance

Build & Release Engineer

Vodafone

Hyderabad, India

02.2018 - 04.2020

Worked extensively with Azure PaaS and Azure IaaS - Virtual Networks, Virtual Machines, Cloud Services, Resource Groups, Express Route, VPN, Load Balancing, Application Gateways, Auto-Scaling, and Traffic Manager
Designed and implemented end-to-end CI/CD pipelines using Azure DevOps, ensuring automated builds, tests, and deployments for applications
Containerized applications using Docker and orchestrated containerized workloads on Azure Kubernetes Service (AKS), enhancing scalability, and reducing infrastructure overhead
Worked in creating and managing infrastructure using Azure Resource Manager (ARM) templates, enabling consistent and reproducible deployments
Designed and implemented fully automated server build management, monitoring, and deployment using technologies such as Splunk, shell scripts, GitLab, Maven, Jenkins, SonarQube, Nexus, JUnit, and Ansible
Used Elastic Load balancer (ALB & CLB) for pinging EC2 instances in round-robin process and health checking of EC2 instances along with Route53
Utilized Helm charts to install Prometheus and Grafana, enabling monitoring of application performance in the Kubernetes cluster
Developed Ansible playbooks to automatically generate start and stop scripts for application services, enhancing operational efficiency and ensuring consistent server shutdown and startup procedures
Developed Python scripts with scheduling capabilities to automate routine tasks, such as data backups, system maintenance, and report generation
Wrote Python scripts that automatically organize, rename, and move files based on predefined rules, optimizing file management and ensuring data consistency
Developed Ansible scripts for automated server provisioning and Docker container deployments, reducing provisioning and deployment time
Designed and implemented Azure virtual servers using Ansible roles to ensure seamless deployment of web applications
Created and maintained Ansible scripts for automated deployment of CI/CD applications using Kubernetes and YAML files to ensure efficiency and consistency
Implemented auditing for Azure Kubernetes Service clusters and monitored logs within specific namespaces using Azure Monitor and Azure Log Analytics
Employed Azure Monitor and Log Analytics for monitoring and logging various application logs
Established a robust CI/CD pipeline using Jenkins, GitHub, Azure DevOps, Maven, and Azure Virtual Machines
Managed Docker images, container snapshots, image removal, and Docker volumes using Azure Container Registry and Azure Kubernetes Service
Integrated continuous integration systems with Git repositories and maintained Maven project dependencies by establishing parent-child relationships between projects
Installed and configured Azure Artifacts for sharing artifacts among internal teams, enhancing build efficiency
Automated server build management, monitoring, and deployment using technologies such as Splunk, shell scripts, Azure DevOps, GitLab, Maven, Jenkins, SonarQube, and Nexus
Implemented Elastic Load Balancer (ELB) and Azure Traffic Manager for load balancing and health checking of Azure Virtual Machines along with Azure DNS
Proficiently used Azure Kubernetes Service (AKS), Azure Container Service (ACS), Docker, Docker Swarm, and Ansible for building automation pipelines and managing production deployments
Utilized Azure Boards and Kanban boards for agile workflow visualization and project management
Designed and implemented robust Source Code Management (SCM) processes and procedures using GitHub and Azure DevOps
Demonstrated expertise in JIRA for issue tracking, project management, change management, and release management
Supported and developed tools for integration, automated testing, and release management

Linux Administrator

Prolifics Global technology

Hyderabad, India

03.2016 - 01.2018

Installed, configured, and administered Red Hat Linux servers and provided server support
Planned and performed upgrades to Linux (RHEL, SUSE 10/11, CentOS) operating systems and hardware maintenance such as changing memory modules and replacing disk drives
Supported server builds, patching, user administration, deployment, software installation, performance tuning, troubleshooting, and KVM
Installed and configured Oracle 7.x/8.x
Handling NFS, Auto Mount, DNS, LDAP related issues
Monitoring CPU, memory, physical disk, hardware, and software raid, multipath, file systems, networks
Performing failover and integrity test on new servers before rolling out to production
Wrote shell scripts to automate daily tasks, document changes in the environment and on each server, and analyze error logs, user logs, and /var/log/messages
Good understanding of the OSI model and the TCP/IP protocol suite (DNS, IP, ARP, TCP, UDP, SMTP, FTP, and TFTP)
Knowledge of Routers and Switches, Subnet, VLAN, TCP/IP, Ethernet, VPN, OSI model, Firewall
Worked on network security, including NAT/PAT, ACLs, AAA, and ASA firewalls
Created local repositories on Linux servers and performed server updates, patching, upgrade, and package installations using RPM and YUM
Installed Firmware Upgrades, Kernel patches, systems configuration, performance tuning on Linux systems
Extensive knowledge on Server administration, Kernel upgrade and deployment of patches and applying all firewall and security policies with emphasis on maintaining best practices
Identified, troubleshot, and resolved problems with the OS build failures
Installed, configured, and customized services such as Sendmail, Apache, and FTP servers to meet user needs and requirements
Delivered customer-focused support through phone calls and ticket-based communications
Managed user accounts, groups, and access levels
Monitored system performance, including virtual memory, swap space, disk utilization, and CPU utilization
Implemented logical volume management
Administered systems security and user access using Role-Based Access Control
Installed and maintained applications on Linux servers
Performed regular system maintenance, including operating system and application patching on Linux Servers
Made recommendations for systems modifications to improve network and hardware components as needed

Education

MSICS -

University Of Bridgeport

Bridgeport, CT

01.2023

B.com - Computers

Osmania University

Hyderabad, India

01.2016

Skills

Operating Systems: Linux (Ubuntu, RHEL, CentOS, Fedora, SOLARIS, SUSE), Windows
Cloud Platform: Amazon Web Services (AWS)
Version Control System / SCM: GIT, GitHub, Bitbucket, AWS Code Commit
Infrastructure Monitoring: Prometheus, Nagios, Grafana, Amazon Cloud Watch
Application Monitoring: AppDynamics, New Relic, Splunk, Datadog, Dynatrace, ELK Stack, Kibana
Infrastructure Provisioning: Terraform, AWS Cloud Formation
Continuous Integration (CI) Tools: Jenkins, Azure DevOps, GitHub Actions, TravisCI, Bamboo
Continuous Deployment (CD) Tools: ArgoCD
Containerization: Docker
Orchestration Platforms: Kubernetes, Docker Swarm
Artifactory/Repositories: JFrog, Nexus, S3
Configuration Management: Ansible
Data Streaming Tools: Apache Kafka
Testing Tools: SonarQube, Selenium, JUnit, Pytest, Karma, Jasmine, TestNG
Languages: Java, Python, JavaScript, PHP, HTML, NodeJS
Databases: MySQL, Oracle, MongoDB, DynamoDB
Scripting Languages: Bash, Python, Perl, PowerShell, Groovy, HCL, JSON, YAML
Web Servers: Apache HTTP Server, Nginx, IIS, Cherokee
Application Servers: Apache Tomcat, IBM WebSphere, JBoss, Jetty, NodeJS, WebLogic, Oracle Application Server
Ticketing/Bug Tracking: JIRA, ServiceNow
Chaos Engineering: Litmus, Chaos Monkey, Gremlin, Chaos Toolkit

PROFESSIONAL SUMMARY

Experienced in provisioning, configuring, and troubleshooting of various AWS cloud services such as VPC, Route53, Security Groups, IAM, EC2, ELB (Load Balancers), S3, RDS, ASG, SNS, CloudWatch, Cloud Front, & CloudFormation Templates.
Experienced in deploying, maintaining, and troubleshooting applications on-premises and on cloud platforms such as AWS.
Experienced in defining and managing Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) to ensure the delivery and performance of services align with business requirements, maintaining high availability and reliability of the application.
Designed and implemented an observability infrastructure from scratch, including monitoring, logging, and tracing components, to ensure real-time insights into system performance.
Experienced in setting up and instrumenting logging, tracing, alerting, and monitoring tools such as Splunk, AppDynamics, Dynatrace, New Relic, Datadog, ELK Stack, Prometheus, Grafana, and OpenTelemetry for performance analysis and troubleshooting in the Kubernetes environment.
Performed Application Monitoring using tools like New Relic, AppDynamics, Datadog, Dynatrace to measure the metrics such as Response time, Latency, Error Rates, Throughput, Uptime/Downtime.
Performed distributed tracing using AppDynamics, Dynatrace, Datadog & New Relic for troubleshooting and performing root cause analysis.
Performed thorough analysis of the web application performance by utilizing Real User Monitoring (RUM) Metrics using tools like New Relic, AppDynamics, Datadog, Dynatrace, by monitoring the four Golden Signals (Latency, Traffic, Errors, and Saturation), evaluating the user satisfaction using Apdex ratings, and assessing the XHR actions.
Participated in incident management and provided 24x7 on-call support for production environments.
Successfully reduced Mean Time to Detect (MTTD) and Mean Time to Recovery (MTTR) by implementing proactive monitoring and troubleshooting procedures, resulting in faster incident resolution and minimized downtime.
Conducted post-mortem analysis, identified gaps in the existing runbooks, and updated them regularly to anticipate future incidents.
Resolved bottlenecks (OOM issues, thread issues, heap dumps, garbage collection), optimized configurations, and worked with development teams to improve system performance and reduce latency.
Created custom dashboards in Dynatrace and Grafana to present observability insights and findings from MELT (Metrics, Events, Logs, Traces) data to stakeholders.
Experienced in Log aggregation and analysis using tools like Splunk and Elasticsearch, Logstash, and Kibana (ELK stack). Leveraged log data to identify patterns, troubleshoot issues, and optimize system performance.
Experienced in building performance dashboards by integrating the log and performing analysis using tools like ELK Stack and Splunk with Grafana and Kibana.
Experienced with Splunk Query Language (SPL) to search log data and create meaningful reports and visualizations.
Developed automated remediation scripts using Python and Bash Scripts to resolve common issues swiftly, reducing Mean time to detect (MTTD) and Mean time to resolution (MTTR).
Conducted performance profiling and optimization efforts based on insights derived from Observability.
Troubleshot network connectivity problems within Kubernetes clusters, resolving issues related to service discovery, DNS resolution, network policies, and ingress configurations.
Debugged and resolved issues related to deployments and replication controllers, ensuring successful rollout and scaling of application components, troubleshooting issues with rolling updates, and managing replica sets.
Worked extensively in setting up intelligent alerting systems using tools such as PagerDuty and AWS CloudWatch Alarms, enabling rapid response to critical incidents.
Hands-on experience with implementing and enhancing the end-to-end workflow in the CI/CD pipelines using Jenkins.
Experienced in various programming and scripting languages, especially Shell and Python, with a focus on DevOps tools and CI/CD; configured, deployed, and supported cloud services on AWS.
Used Terraform to set up the AWS infrastructures such as launching the EC2 instances, S3 buckets, Virtual Private Cloud (VPC), Public and Private Subnets, IAM roles and policies, Route Tables, Security Groups, Storage Groups, Elastic Load Balancer (ELB) & Application Load Balancer (ALB) and Elastic Kubernetes Services (EKS).
Monitored and analyzed the performance of round-robin scheduling through metrics and logging, making data-driven decisions to optimize the load-balancing configuration.
Implemented content compression techniques to reduce bandwidth usage and improve website loading times.
Used Chaos Toolkit to define and execute chaos experiments, validating the system's ability to recover from failures.
Integrated Python, Shell, and PowerShell scripts into DevOps pipelines to automate code builds, deployments, and testing, enabling continuous integration and continuous deployment (CI/CD).
Configured the Horizontal Pod Autoscaler (HPA), defining the target resource (CPU or custom metrics), minimum and maximum replicas, and scaling thresholds.
Implemented Blue-Green deployment strategy, reducing downtime and mitigating risks during software releases, ensuring smooth transitions between versions.
Experience in writing Puppet manifests and Ansible playbooks for administering multiple servers.
Experience working on Docker Hub, creating Docker images, and handling multiple images primarily for middleware installations and domain configurations.
Proficient in orchestrating the Docker containers using the combination of tools like Docker-compose and Kubernetes.
Integrated AWS Web Application Firewall (WAF) with ELB to enhance security and protect against common web threats.
Configured the Nginx Ingress Controller to handle path-based routing inside Kubernetes cluster.
Experience in deploying JBOSS, Apache Tomcat Web Server, IIS Server, Oracle WebLogic, and IBM WebSphere.
