Accomplished Technical Architect and Sr. SRE Engineer known for delivering scalable solutions and optimizing performance. Expertise in AI-driven automation and capacity planning to enhance operational efficiency and minimize disruptions.
Overview
11 years of professional experience
Work History
Sr. SRE Engineer / Technical Architect
BJ's Wholesale Club
Westborough, USA
04.2019 - Current
Led automation initiatives that reduced manual workload, improving application efficiency by 30%.
Independently developed and deployed an AI-powered application with an end-to-end MLOps pipeline, integrating model training and monitoring.
IBM Sterling OMS SME: Acted as the Subject Matter Expert (SME) for IBM Sterling Order Management System (OMS), providing strategic guidance, troubleshooting, and optimization of Sterling OMS workflows to enhance order fulfillment and inventory management efficiency.
Performance Tuning: Monitored, analyzed, and optimized application and infrastructure performance to ensure scalability and reliability. Identified and resolved bottlenecks in the Call Center Portal, where slow response times impacted customer service, and in the Store Picking process, improving timely fulfillment of online orders.
Review & Implementation: Reviewed solutions with the customer architecture team and deployed automation tools in production. Example: Developed an automated configuration comparison tool, incorporating advanced parsing logic to minimize manual validation effort before deployment.
Collaboration with Development Teams: Partnered with developers to ensure reliability best practices were applied early in development. Example: After an outage investigation, proactively implemented error-handling improvements for a new feature to prevent similar failures in production.
Capacity Planning & Forecasting: Analyzed usage patterns to forecast resource needs and optimize agent topology for better workload distribution. Example: Ran a database space-reclaim activity that recovered unused storage and reduced database cost by 25%.
Automation for Incident Mitigation: Developed and deployed automation scripts to act as short-term fixes for production issues until permanent resolutions were implemented. Example: Created self-healing scripts that automatically restarted impacted services, ensuring uptime while long-term fixes were developed.
High Availability & Scalability: Designed and maintained highly available and scalable infrastructure to handle retail peak demand during high-traffic events like holiday sales.
Proactive Monitoring & Alerting: Integrated New Relic, Scalyr, Kafana, and LogRocket for real-time anomaly detection, proactively identifying system failures before they impacted customers.
Service Level Objectives (SLOs): Defined and monitored SLOs, SLIs, and SLAs to ensure alignment with business expectations. Example: Established non-functional requirements for new features, including a maximum response time and error rate thresholds.
Business Process Optimization: Automated complex and repetitive business processes to improve operational efficiency and minimize disruptions. Example: Developed automation for the store batch picking process, resolving fulfillment bottlenecks where orders were getting stuck and significantly reducing processing delays to ensure timely deliveries.
Root Cause Analysis & Incident Resolution: Led in-depth root cause analysis for production failures. Example: Diagnosed agent performance inefficiencies, refined execution processes, and optimized configurations to improve system responsiveness.
Change & Incident Management: Participated in post-incident reviews, tracked system changes, and ensured smooth rollouts with minimal downtime.
Automated Reporting Solutions: Developed cron-driven reporting automation, ensuring timely report generation for proactive system insights.
Complex Configuration Diff Comparer: Built a pre-deployment validation script that automated configuration comparisons, reducing manual verification time by 2-3 hours.
Technical Documentation & Training: Authored detailed process documentation and mentored teams on SRE principles, ensuring smooth knowledge transfer and adoption of best practices.
Identified and addressed recurring issues in Sterling OMS, reducing customer complaints and improving satisfaction.
Developed automation scripts using Python, PowerShell, and Bash, minimizing manual resource interaction.
Resolved performance bottlenecks in the Call Center Portal and Store Picking process, and developed self-healing automation and proactive monitoring solutions to reduce downtime and improve incident response.
Automated 75% of manual tasks, increasing platform reliability by 30%.
Led onshore and offshore teams, ensuring on-time deliverables with a high success rate.
Designed and executed an omnichannel strategy, bridging online and offline retail experiences.
Developed automation-first approaches, reducing time-to-market for new features and boosting customer retention rates.
Implemented a microservices-based architecture to improve scalability, flexibility, and deployment efficiency.
Improved API performance by optimizing SQL queries, cutting query execution times by 50% and improving response speed by 30%.
Automated deployment processes, cutting deployment time from hours to minutes and saving $100K annually in operational costs.
Implemented AWS cloud solutions, ensuring seamless integration with CI/CD pipelines.
Promotion Abuse Prevention: Implemented a solution to detect and prevent customers from placing orders with a one-day pass to exploit site promotions, ensuring fair usage while maintaining customer trust.
Streamlined incident management workflows, reducing resolution times and improving team efficiency.
Established efficient communication channels between development and support teams, decreasing incident frequency by 15%.