Teja Appakondareddygari

Austin, TX

Summary

Experienced and strategic Lead Site Reliability Engineer with almost 10 years of distinguished IT experience. Demonstrated success in optimizing system performance, reducing costs, and ensuring unwavering availability for mission-critical applications. Proficient in capacity planning, disaster recovery, and the implementation of cutting-edge monitoring solutions. Eager to apply my expertise to elevate the reliability and scalability of crucial systems.

Overview

11 years of professional experience

Work History

Staff Site Reliability Engineer

Samsung Electronics America
12.2024 - Current
  • Design, implement, and manage scalable and resilient infrastructure solutions for Samsung's advertising technology platform.
  • Collaborate with development teams to integrate DevOps best practices into the software development lifecycle.
  • Implement and maintain CI/CD pipelines to automate software delivery and deployment processes.
  • Monitor, troubleshoot, and optimize system performance to ensure high availability and reliability.
  • Evaluate and estimate capacity and growth projections for the future.
  • Work closely with security teams to implement and enforce best practices for infrastructure security.
  • Participate in on-call rotations to provide 24/7 support for critical systems.
  • Continuously evaluate and implement new technologies to enhance the efficiency of our infrastructure.
  • Serve as the infrastructure and operations subject matter expert for the development team.
  • Plan for future capacity and growth, including disaster recovery and business continuity planning (BCP).

Principal Application Engineer - Site Reliability

Discover
04.2024 - 12.2024
  • Consult with teams and provide hands-on training in observability, incident management, and reliability best practices.
  • This includes defining SLOs/SLAs/SLIs and on-call support behaviors, troubleshooting, building support playbooks, implementing monitoring and alerting, setting logging standards, and conducting fragility and performance testing.
  • Successfully implemented Datadog for real-time monitoring and alerting, improving system observability and response times.
  • Leveraged Terraform and Kubernetes to automate infrastructure provisioning, ensuring scalable and reliable environments.
  • Conducted regular reviews to enforce best practices and align reliability strategies with business needs.
  • Led discussions on failure points, chaos testing, and capacity management, enhancing system resilience.
  • Ensure delivery teams in the product family track and meet annual operational goals (MTTR reduction, incident reduction, platform availability, SLO/SLA targets).
  • Ensure automated delivery for all family-level products.

Lead Site Reliability Engineer

Apple
03.2019 - 03.2024

Technical Leadership and Infrastructure Management:

  • Led and managed a team: Successfully led a team of 8 professionals dedicated to supporting a critical payments and billing platform.
  • Capacity Planning and Resource Optimization: Continuously optimized costs by pinpointing over-allocated resources and proposing design optimizations, resulting in a remarkable reduction of the AWS Compute bill from $130,000/month to $25,000/month within one year.
  • Hardware Decommission and Replacement: Strategically planned and executed the decommissioning and replacement of obsolete hardware, ensuring data integrity and preventing data loss.
  • Data Center Simulation and Recovery Planning: Orchestrated pull-the-plug events simulating a data center loss while meeting all recovery time and recovery point objectives.

Performance Optimization and Troubleshooting:

  • Contribution to Product Design: Continuously contributed to refining product design to align with production scenarios, ensuring seamless integration and optimal functionality.
  • Scalability and Performance Optimization: Proactively enhanced scalability and performance by assessing data growth, implementing auto-scaling techniques for peak loads, and optimizing SQL and MongoDB queries to achieve remarkable response times.
  • Database Optimization: Identified and resolved performance bottlenecks in databases using stress and load tests, significantly improving application response time.
  • Chronic Issue Resolution: Successfully troubleshot and resolved various chronic issues within the application.
  • Monitoring and Alerting: Implemented comprehensive monitoring and alerting for the entire platform, utilizing tools like AppDynamics, Grafana, Prometheus, Jobwatcher, custom scripts, OpenTelemetry, Splunk, Elastic, and CloudWatch.

System Health and Security:

  • Monitoring Dashboards: Developed monitoring and analytics dashboards to gain a holistic view of application health and identify potential anomalies.
  • Production Support and Security Compliance: Participated in production support calls, especially for troubleshooting priority-1 issues, and actively contributed as an escalation point of contact in on-call rotations.
  • SSL Certificate and Keystore Management: Created and managed SSL certificates and keystores, ensuring secure encryption-based communication between application servers and dependent systems.

Process and Documentation:

  • Change Management and Procedures: Authored and published change management, escalation, and emergency fix procedures and guidelines.
  • Detailed Deployment Planning: Prepared detailed deployment plans with comprehensive cutover steps.
  • High Availability and Disaster Recovery: Implemented high availability and created a disaster recovery runbook.
  • Documentation and Training: Authored documentation for the entire product and specific troubleshooting/support manuals for individual services.

Team Lead | Sr Production Software Engineer

Oracle Cerner
10.2016 - 03.2019
  • Team Leadership: Led a dynamic team of 4 engineers responsible for supporting back-end Java services within the Alert Management System of the CareAware Connect mobile communication application.
  • Automation and Performance Optimization: Implemented a Scheduler Job to capture TCP/UDP packets, enhancing understanding of data flow for troubleshooting network-related application issues.
  • Problem Resolution and Support: Provided valuable assistance in analyzing, reviewing, and resolving software issues and support cases.
    Corrected database replication issues by identifying data spikes and addressing root causes of duplicate transactions.
  • Alert Accuracy Improvement: Enhanced the accuracy of patient alerts to handheld devices for healthcare providers by optimizing server-side rules engine filters.
  • Documentation and Communication: Developed and managed a Known Error Database and analyzed operational support procedures.
  • Leadership and Guidance: Provided assistance and guidance to team members on various development areas.

Java/J2EE Developer

IMCS Group Inc
01.2016 - 10.2016
  • Comprehensive SDLC Involvement: Played a pivotal role in all phases of the Software Development Life Cycle (SDLC), covering design, analysis, modeling, development, and system testing.
  • Change Management and Documentation: Transformed Functional Design Documents (FDD) into actionable Change Request Documents (CRD) for developers, following stakeholder meetings.
  • Design Patterns and Development: Implemented the MVC design pattern for a scalable and maintainable architecture.
  • Development: Developed the presentation layer using JSP, HTML, and JavaScript.
    Crafted Application controllers, Business, and Data service modules for web applications.
  • Database Interaction: Implemented PL/SQL stored procedures, triggers, cursors, and views using Oracle.
  • Hibernate and Spring Integration: Utilized Hibernate integrated with Spring Framework, writing HQL queries for efficient data persistence.

Programmer Analyst Trainee

Cognizant Technology Solutions
12.2013 - 05.2014
  • Network Devices Proficiency: Install, monitor, and troubleshoot Cisco routers, multi-layer switches, ASA Firewalls, Dell SonicWALL firewalls, and Cisco Meraki & Aironet wireless networks.
  • Network Operations Expertise: Manage company VLANs, VPN tunnels, MPLS, and VOIP technology for seamless and secure network operations.
  • Technical Support Excellence: Provide escalated technical support to internal customers, effectively resolving incident and problem tickets.
  • Infrastructure Documentation: Maintain comprehensive documentation of the company's WAN/LAN infrastructure and security policies.
  • Cross-Functional Support: Support help desk, onsite support, and server management teams in resolving network, firewall, and VOIP-related issues.

Education

Master of Science - Computer Science

Texas A&M University
12-2015

Bachelor of Science - Computer Science And Engineering

LBRCE College of Engineering
05-2013

Skills

  • AWS cloud technologies and managed services
  • Java
  • Python
  • AppDynamics
  • Splunk
  • Datadog
  • Grafana & Prometheus
  • Oracle DB
  • MongoDB
  • Docker & Kubernetes, EKS
  • Terraform
  • AWS CloudFormation
  • OpenTelemetry
  • Linux & AIX
  • Jenkins
  • Helm Charts
  • GitHub
  • Ansible
  • Spinnaker
  • Scripting

Timeline

Staff Site Reliability Engineer

Samsung Electronics America
12.2024 - Current

Principal Application Engineer - Site Reliability

Discover
04.2024 - 12.2024

Lead Site Reliability Engineer

Apple
03.2019 - 03.2024

Team Lead | Sr Production Software Engineer

Oracle Cerner
10.2016 - 03.2019

Java/J2EE Developer

IMCS Group Inc
01.2016 - 10.2016

Programmer Analyst Trainee

Cognizant Technology Solutions
12.2013 - 05.2014

Bachelor of Science - Computer Science And Engineering

LBRCE College of Engineering

Master of Science - Computer Science

Texas A&M University