Narasimhan Venkadeswaran

San Jose, California

Summary

Results-driven Staff Site Reliability Engineer with over 15 years of experience in designing, implementing, and maintaining scalable systems. Proven expertise in leveraging automation, monitoring, and incident response to enhance system reliability and performance. Adept at collaborating with cross-functional teams to drive best practices in software development and operations. Committed to fostering a culture of continuous improvement and resilience in high-availability environments.

Overview

17 years of professional experience
1 Certification

Work History

SITE RELIABILITY ENGINEER/DevOps/MLOps Engineer

DOMINO DATA LAB
04.2021 - Current

As an Individual Contributor (IC)

  • Led the end-to-end migration of the self-hosted Teleport Unified Access Plane for Domino to a cloud-hosted solution, ensuring secure access for both cloud and on-premises deployments.
  • Designed and implemented Terraform modules for Teleport cloud deployment using CircleCI.
  • Created Helm charts from scratch for Kubernetes client interactions.
  • Developed procedures for Break Glass, audit logging, and granular RBAC within Kubernetes.
  • Integrated alert systems with PagerDuty for proactive incident management.
  • Produced thorough documentation while collaborating with multiple teams and conducting enablement sessions for the entire engineering organization.

As a Domino MLOps Platform Engineer

  • Championed the adoption of DevOps culture through several key initiatives:
    Defined the "definition of done" for service reliability in the Data Model Images building pipeline, Model Building and Publishing stack, and version control of build model artifacts.
    Monitored the end-to-end Domino MLOps pipeline at both infrastructure and application levels using tools like New Relic, Prometheus, and Grafana.
    Built a Terraform framework to capture Data Model and Machine Learning user journeys, tracking state transitions of models at each phase.

As an Embedded MLOps Engineer for Domino Enterprise Platform

  • Led by example and trained development teams in incident management and postmortem processes.
  • Fostered engineering team participation in owning incidents and conducting postmortems, increasing developer ownership and improving the reliability of the Domino MLOps product.
  • Reduced XL/Large incidents to Large/Medium at the 95th percentile and decreased monitoring ingestion costs by 30% through enhanced instrumentation.

Additional Achievements

  • Designed and implemented end-to-end event routing for monitoring and alerting within the existing IaaS framework, rolled out with Terraform to all Domino customers, resulting in a 30% reduction in the time to automatically identify and resolve incidents.
  • Developed Python modules and libraries to identify issues with critical services, enhancing observability and reducing mean time to recovery (MTTR) during customer incidents.
  • Improved the visibility and reliability of a critical Data Model Build component in the MLOps pipeline from 0% to over 75%.

PRINCIPAL ENGINEER

YAHOO
12.2007 - 04.2021

Principal Production Engineer

  • Owned and operated Yahoo's media content platform, serving 100,000 requests per second (RPS).
  • Achievements include scaling the Cassandra backend to 30TB across five data centers and redesigning the platform in collaboration with engineering to reduce end-to-end latency from 1.5 seconds to 500 milliseconds, achieving four nines (99.99%) availability.
  • Received the Yahoo Spot and U Rock awards.

Principal Production Engineer

  • Managed the unified API middle tier running on Kubernetes, serving 50,000 RPS for Yahoo sites including Finance, Sports, and the Homepage.
  • Technical skill sets include ingress routing on Kubernetes, service mesh implementation using Istio, and canary analysis using Kayenta.
  • Actively mentored junior engineers.

Senior Tech Lead

  • Led the Media Analytics Real-Time Pipeline utilizing Druid, Apache Storm, Apache Kafka, and API stacks to provide real-time analytics for content editors.
  • Responsibilities included maintaining and upgrading the real-time analytics platform, ensuring monitoring and observability.
  • Championed the adoption of Imply Pivot, a powerful visualization tool for real-time analytics.

Tech Lead

  • Managed a private cloud based on Chroot jail for Yahoo media, hosting hundreds of frontend (Node.js) and backend (Java) applications.

Education

Master of Arts - Economics

MKU University
04.2014

Master of Science - Multimedia Technology

College of Engineering Guindy
04.2006

Bachelor of Technology - Information Technology

Madurai Kamaraj University
03.2004

Skills

  • Reliability Engineering
  • DevOps/MLOps
  • Infrastructure as Code
  • Disaster Recovery/High Availability
  • Network Troubleshooting
  • Software Development and Scripting
  • Containerization Technologies
  • Database Technologies (Relational and NoSQL)
  • Near Real Time Analytics and Batch Processing
  • Cloud/On-Prem Datacenter
  • Monitoring, Observability

Technology Stack

  • Programming - Python, Bash, Go, Perl
  • IaC - Terraform
  • Container Orchestration and Cloud - Kubernetes, Docker, AWS, Azure, GCP
  • Version Control - Git
  • Monitoring & Observability - Prometheus, Grafana, New Relic
  • Deployment Pipeline - CircleCI, Argo, Jenkins
  • Incident Management - PagerDuty
  • NoSQL - Apache Cassandra, Redis, MongoDB, Apache Druid, Memcached, HBase
  • Logging - Fluentd, Fluent Bit, Splunk, ELK
  • Stream Processing - Apache Storm
  • Queueing/Stream Processing/Big Data - Kafka, RabbitMQ, Pulsar, Hadoop
  • Relational Databases - MySQL, PostgreSQL
  • Search Engines - Vespa, Elasticsearch
  • Ingress - Apache Traffic Server, Nginx

Certification

  • CKAD: Certified Kubernetes Application Developer. Issued Jul 2022 · Expires Jul 2025. Credential ID LF-72g4g7hudy
  • HashiCorp Certified: Terraform Associate. Expires May 2024. Credential ID 9cc04409-823d-45c5-b81b-e09cc727b5d6
  • CKA: Certified Kubernetes Administrator. Expires Apr 2025. Credential ID LF5qk3imd1ql
  • Microsoft Azure Fundamentals (AZ-900). Issued Feb 2022. Credential ID 992669881
  • AWS Certified Cloud Practitioner. Expires Dec 2024. Credential ID AWS02543957
  • MySQL DBA Administrator
  • RHCA (Red Hat Certified Administrator)

Additional Information


IPCOM000250809D - Method and System for Gossip Protocol based disaster recovery for Distributed web systems.

IPCOM000223854D - Method and System for Access Control of Mobile Network Applications using Fingerprints as Structural Patterns.
