
Vijay Kumar Annaldas

Austin

Summary

Highly accomplished and results-driven Data Engineering Leader with 10+ years of experience spearheading complex data initiatives across various domains. Proven expertise in architecting, developing, and maintaining petabyte-scale data platforms in cloud ecosystems (AWS, Azure) and Big Data (Hadoop, Spark) environments, consistently demonstrating a progressive mindset in adopting and mastering new technologies. Adept at driving significant cost savings, enhancing data quality and governance, ensuring regulatory compliance, and leveraging cutting-edge GenAI models such as Anthropic Claude and AWS Nova to boost productivity.

Overview

14 years of professional experience

Work History

Sr Data Engineer

Amazon.com Services
Austin
12.2020 - Current

HBA Data Lake Implementation

  • Directed a large-scale, multi-million-dollar initiative to design and implement an enterprise data lake using AWS services (S3, Glue, Lake Formation, Kinesis, DynamoDB, Firehose), enabling real-time OLTP data ingestion for the Amazon warehouse hiring ATS application and supporting 10+ teams' use cases across 3,000+ business users.
  • Leveraged AWS Lake Formation's fine-grained access control to enable secure, governed data sharing and self-service capabilities, resulting in $2M+ annual cost savings for analytical teams.
  • Developed comprehensive data governance and security controls, adhering to strict data classification standards (e.g., Highly Confidential, HR-PII) and ensuring end-to-end encryption, reducing compliance risks and potential financial penalties.
  • Architected a data lifecycle management solution leveraging Apache Iceberg for automated HR data depersonalization across OLAP systems, ensuring GDPR compliance and reducing privacy violation risks.
  • Collaborated with ML Engineers to design and implement robust data ingestion pipelines, optimizing MLOps architecture for seamless model training and inference and accelerating machine learning development cycles.
  • Implemented Infrastructure-as-Code using AWS CDK to automate data lake deployments, enabling a scalable and resilient architecture through CI/CD pipelines.

WFI Reporting Data Model

  • Engineered scalable, reusable ETL pipelines integrating diverse ATS platforms (Salesforce, Oleeo) into a centralized data warehouse, streamlining Amazon warehouse hiring analytics and marketing initiatives and yielding $300M+ in annual savings while reducing cost per hire.
  • Collaborated with cross-functional teams (Data Engineering, BI, Workforce Staffing, Data Scientists) to understand requirements, design solutions, and deliver high-impact data platforms supporting key hiring and staffing initiatives.
  • Architected a high-performance data lakehouse using Medallion Architecture (Bronze, Silver, Gold S3 tiers).
  • Utilized the ACID-compliant Apache Iceberg table format in AWS Glue, leveraging schema evolution and time travel features with PySpark for data ingestion and transformation.
  • Engineered robust CI/CD pipelines, integrating automated unit and integration testing with Pytest.
  • Led an engineering excellence program, combining hands-on mentorship with architecture reviews to foster team career progression and institutionalize cloud development best practices.
  • Leveraged GenAI coding assistants (Amazon Q, Cline) integrated with Amazon's internal Model Context Protocol (MCP) servers and Anthropic Claude Sonnet/Opus LLMs to enhance coding standards and project deliverables without compromising quality, significantly improving developer productivity.
  • Automated routine operations using GenAI tools, significantly reducing manual effort and operational costs for the team, estimated to save thousands of dollars monthly in labor costs.

Data Quality Framework

  • Reduced data quality issues by 85%, eliminating ~2,000 hours of annual manual data reconciliation/correction effort and preventing potential revenue losses of $3M+ through early detection of data anomalies before they impacted downstream customers.
  • Designed and implemented a scalable, automated, and reusable data quality framework using native AWS Glue Data Quality service, enabling validations across key dimensions (completeness, uniqueness, freshness).
  • Implemented an ML-based Data Quality rules recommendation engine to accelerate rule generation.
  • Integrated the data quality framework with multiple orchestration tools (Apache Airflow (MWAA), AWS Step Functions) for seamless execution of checks across various sources (Redshift, Glue, Iceberg).
  • Implemented centralized data quality monitoring and alerting, storing validation results in AWS Glue Catalog for reporting and publishing metrics to Amazon CloudWatch for real-time insights.
  • Educated and onboarded various teams, enabling seamless integration of the data quality framework into their ETL pipelines.

Data Platform Modernization

  • Led migration of 189+ Andes tables to a new data plane, architecting a Java-Python interoperability solution for seamless API integration using JPype, ensuring zero downtime and 99.99% data accuracy for mission-critical applications.
  • Implemented Redshift managed VPC endpoints across all clusters, eliminating public IP security vulnerabilities and establishing a robust, security-compliant connectivity framework with zero user downtime.
  • Optimized Redshift performance by implementing intelligent Workload Management (Auto WLM) and advanced Query Monitoring Rules (QMR), enabling ML-driven resource allocation and query prioritization; achieved 50% faster ETL processing and enhanced performance for concurrent analytics.
  • Spearheaded migration of 70+ production DAGs from self-managed Apache Airflow to Amazon MWAA, modernizing workflows with Airflow 2.x features via Infrastructure-as-Code (CDK); significantly enhanced scalability and operational efficiency through cloud-native orchestration.

Data Engineer

Transamerica Life Insurance Company
Dallas
06.2017 - 11.2020

IFRS 17 Data Integration Platform

  • Architected a robust data integration platform, unifying complex insurance data from diverse sources (Mainframe DB2, Oracle, file-based systems) across multiple product lines to enable IFRS 17 regulatory reporting.
  • Designed a scalable distributed processing framework using Spark RDDs and DataFrames for high-throughput parallel computation.
  • Streamlined team collaboration and reduced time-to-market by 25% through sprint planning and incremental releases.
  • Collaborated with business analysts and data governance teams to develop enterprise-wide data lake standards and common data models for insurance administration systems.
  • Developed an automated reconciliation framework between legacy systems and IFRS 17 measurements, significantly reducing the financial close cycle by 70%.
  • Engineered an optimized ETL framework utilizing Spark SQL and tuned complex SQL queries for efficient data manipulation, reducing data retrieval time by 30%.

Finance Data Warehouse Implementation

  • Architected a scalable financial data warehouse, integrating HDFS with Azure Synapse via Blob Storage, to process financial data with sub-hour latency for the reporting layer.
  • Designed a comprehensive reporting layer by combining data across diverse insurance domains into cross-domain views, serving various business use cases using stored procedures orchestrated via Azure Data Factory (ADF).
  • Implemented strategic partitioning, indexing, and sort keys, reducing query response time from minutes to seconds.
  • Built an automated data quality framework with real-time validation checks and reporting for missing records.

Enterprise Hadoop Platform Administration

  • Administered an enterprise Cloudera cluster supporting 100+ users, resulting in $10M in annual infrastructure cost savings.
  • Implemented automated resource scheduling and capacity planning, achieving 99.99% platform availability.
  • Created a comprehensive monitoring solution using Cloudera Manager APIs with automated alerting.
  • Established disaster recovery protocols and automated backup procedures, ensuring robust business continuity.
  • Created and maintained enterprise-wide Cloudera development standards and best practices; conducted periodic training and code reviews to ensure compliance.

Mainframe Developer / Big Data Engineer

Syntel
Chennai
06.2011 - 08.2015
  • Led development and enhancement of mission-critical COBOL applications for healthcare claims processing, ensuring compliance with federal regulations and key healthcare initiatives.
  • Architected system enhancements supporting major healthcare reform mandates, ensuring high performance and data accuracy.
  • Optimized DB2 queries and JCL workflows, reducing complex claims processing batch time from 6 hours to 2 hours through historical data archiving.
  • Implemented automated validation systems for critical healthcare coding standards (ICD-10, CPT, HCPCS).
  • Conducted technical training sessions and code reviews, enhancing team capabilities and fostering healthcare domain expertise.
  • Architected and executed large-scale healthcare data migration from legacy systems to Hadoop (CDH 4.3), successfully migrating petabytes of sensitive claims data with zero data loss.
  • Implemented parallel processing strategies using Sqoop and Hive to optimize data transfer rates for large-scale mainframe datasets.
  • Created a comprehensive data lineage tracking system for regulatory compliance and audit requirements.
  • Built a monitoring and alerting framework for migration jobs, ensuring timely identification and resolution of data quality issues.

Education

Master of Science

University of Houston-Clear Lake
Houston
12.2016

Skills

Programming Languages: Python, Scala, SQL, UNIX Shell Scripting, TypeScript

Big Data Technologies: Hadoop, Spark, Hive, Pig, Sqoop, MapReduce, Flink

ETL & Data Integration: Mainframes, SSIS, Informatica (PowerCenter, BDE), Apache Airflow

Databases & Data Warehousing: SQL Server, PostgreSQL, MySQL, Oracle, Snowflake, IBM DB2

Streaming & Real-Time Processing: Apache Kafka, Kafka Streams, Apache Flink, Amazon Kinesis, Spark Streaming

DevOps & CI/CD: Git, Jenkins, Docker

Azure (Azure Data Factory, Synapse, Azure Databricks, Blob Storage)

AWS (EC2, S3, Glue, Redshift, DynamoDB, RDS, Athena, Kinesis, Firehose, MWAA, EMR, CDK, CloudFormation, Route 53, Step Functions, Lambda, IAM, Lake Formation, Bedrock, SageMaker)

Timeline

Sr Data Engineer

Amazon.com Services
12.2020 - Current

Data Engineer

Transamerica Life Insurance Company
06.2017 - 11.2020

Mainframe Developer / Big Data Engineer

Syntel
06.2011 - 08.2015

Master of Science

University of Houston-Clear Lake