Saeed Hussain

Lead Data Engineer | Cloud & Streaming Data Engineering Expert
Jersey City, NJ

Summary

Senior Data Engineer with 13+ years of experience architecting and optimizing large-scale, cloud-native data platforms across AWS, Azure, and GCP. Expert in building high-performance, fault-tolerant data ecosystems using Spark, Flink, Kafka, Delta Lake, and modern Data Mesh and Lakehouse architectures. Skilled in developing streaming and batch pipelines, implementing CDC frameworks, and enforcing metadata-driven governance, lineage, and observability at enterprise scale. Proficient in Python, Scala, Rust, SQL, and Terraform, with advanced expertise in schema evolution, data virtualization (Trino, Denodo, Starburst), and distributed query optimization. Adept at enabling feature stores, real-time analytics (Pinot, Druid, Materialize), and ML data pipelines, while driving platform scalability, cost efficiency, and DataOps automation across complex, federated data environments.

Overview

14
years of professional experience

Skills

Work History

Lead Data Engineer

Mavericks Labs
11.2020 - Current
  • Designed and implemented a multi-cloud Data Lakehouse ecosystem using Delta Lake and Apache Iceberg, unifying enterprise data domains across AWS and Azure with advanced schema evolution and governance frameworks.
  • Architected real-time ingestion pipelines leveraging Apache Flink, Kafka Streams, and Debezium CDC, ensuring low-latency event delivery and synchronization across analytical systems.
  • Built metadata-driven orchestration frameworks using Airflow 3.x and Dagster, integrating Apache Atlas and OpenMetadata for automated lineage tracking and impact analysis.
  • Enhanced query and compute performance with adaptive Spark execution, vectorized reads, and partition pruning, optimizing distributed query workloads.
  • Directed platform observability and reliability using OpenTelemetry, Prometheus, and Grafana, establishing proactive alerting and end-to-end monitoring across data pipelines.
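The metadata-driven orchestration pattern described above can be illustrated with a simplified sketch (not the production framework): dataset dependencies declared in a metadata catalog are expanded into a dependency-respecting execution order. The catalog names below are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical metadata catalog: each dataset declares its upstream
# dependencies. Names are illustrative, not from the actual platform.
CATALOG = {
    "raw_events": [],
    "cleansed_events": ["raw_events"],
    "dim_customers": [],
    "fact_orders": ["cleansed_events", "dim_customers"],
    "daily_revenue": ["fact_orders"],
}

def execution_order(catalog: dict) -> list:
    """Derive a task order in which every dataset's upstreams run first."""
    return list(TopologicalSorter(catalog).static_order())

order = execution_order(CATALOG)
print(order)
```

In a real deployment the same ordering would be handed to Airflow or Dagster as generated task dependencies, so adding a dataset to the catalog automatically rewires the DAG.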

Senior Data Engineer

Airbnb
04.2019 - 08.2020
  • Developed and maintained streaming data frameworks using Spark Structured Streaming, Kafka, and AWS Kinesis, processing 3B+ IoT records daily.
  • Modernized data warehouse architecture by migrating to Snowflake + dbt, embedding automated schema versioning, CI/CD validations, and Great Expectations-based testing.
  • Delivered cross-system query federation via Trino (PrestoSQL) and Starburst, enabling unified analytics across S3, Redshift, and PostgreSQL.
  • Engineered feature store ingestion pipelines (Feast + MLflow), reducing model training cycles and improving ML data freshness.
  • Standardized infrastructure-as-code deployments through Terraform, Kubernetes, and Docker, improving scalability and environment parity.
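The windowed aggregations behind the streaming frameworks above were built on Spark Structured Streaming; the core tumbling-window logic can be sketched in plain Python (an illustrative stand-in over (timestamp, key) pairs, not the Spark code itself):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per (key, window-start) for a tumbling event-time window.

    `events` is an iterable of (epoch_seconds, key) pairs -- a simplified
    stand-in for a Kafka/Kinesis record stream.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[(key, window_start)] += 1
    return dict(counts)

stream = [(0, "sensor-a"), (10, "sensor-a"), (61, "sensor-a"), (5, "sensor-b")]
result = tumbling_window_counts(stream)
print(result)  # {('sensor-a', 0): 2, ('sensor-a', 60): 1, ('sensor-b', 0): 1}
```

Spark expresses the same thing declaratively via `groupBy(window(...))`, with watermarking handling late-arriving events.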

Data Engineer I

Amida Technology Solutions
07.2015 - 03.2019
  • Built and optimized PySpark and Scala-based ETL jobs for petabyte-scale transactional datasets using AWS EMR and Azure HDInsight clusters.
  • Developed CDC ingestion layers with Kafka Connect + Debezium, integrating Oracle, MySQL, and PostgreSQL systems into unified analytical schemas.
  • Implemented automated data validation pipelines with Deequ and custom Python checks, integrated into Airflow DAGs for continuous monitoring.
  • Enhanced Spark performance using broadcast joins, predicate pushdown, and partition compaction, reducing compute costs by 40%.
  • Partnered with ML teams to deliver Petastorm/Arrow-based data delivery pipelines for distributed model training.
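The broadcast-join optimization above avoids shuffling the large side of a join; its hash-join core can be sketched in plain Python (illustrative only -- Spark applies this automatically when the small side falls under `spark.sql.autoBroadcastJoinThreshold`):

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join a large row set against a small, broadcastable dimension table.

    The small side is materialized into an in-memory hash map (the
    "broadcast"), so the large side streams through with no shuffle.
    """
    lookup = {row[key]: row for row in small_rows}   # broadcast side
    joined = []
    for row in large_rows:                           # streamed side
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined

facts = [{"cust_id": 1, "amount": 30}, {"cust_id": 2, "amount": 45}]
dims = [{"cust_id": 1, "segment": "retail"}]
rows = broadcast_hash_join(facts, dims, "cust_id")
print(rows)  # only cust_id 1 has a matching dimension row
```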

Associate Data Engineer

Koverse Inc.
05.2012 - 05.2015
  • Built foundational ETL workflows using Talend, Python, and SQL, consolidating ERP, CRM, and REST API data into PostgreSQL and HDFS layers.
  • Designed normalized and dimensional schemas (3NF, Star, Snowflake) for analytics and BI reporting using Hive and Impala.
  • Supported the migration from on-prem Hadoop to AWS S3 + EMR, laying groundwork for scalable, cloud-native data operations.
  • Automated ETL job deployments with Jenkins and Git, introducing CI/CD for data pipelines.
  • Created profiling scripts for schema drift detection, missing data thresholds, and outlier pattern analysis.
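The schema drift detection mentioned above reduces to comparing schema snapshots; a minimal sketch (real pipelines would also track nullability, precision, and nested structures):

```python
def detect_schema_drift(baseline: dict, observed: dict) -> dict:
    """Compare two {column: type} schema snapshots and report drift."""
    added = sorted(set(observed) - set(baseline))
    removed = sorted(set(baseline) - set(observed))
    retyped = sorted(
        col for col in set(baseline) & set(observed)
        if baseline[col] != observed[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}

baseline = {"id": "bigint", "email": "string", "created_at": "timestamp"}
observed = {"id": "string", "email": "string", "country": "string"}
drift = detect_schema_drift(baseline, observed)
print(drift)  # {'added': ['country'], 'removed': ['created_at'], 'retyped': ['id']}
```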

Projects

1. Multi-Cloud Lakehouse Integration Platform 

Tech: Delta Lake, Apache Iceberg, Spark, Airflow, Dagster, AWS, Azure, Apache Atlas, OpenMetadata 

Description: 

Designed and built a unified multi-cloud Lakehouse platform enabling cross-cloud analytics across AWS and Azure. Implemented Delta Lake and Apache Iceberg for versioned storage, schema evolution, and ACID guarantees. Created metadata-driven ingestion workflows using Airflow and Dagster with full lineage tracking via Apache Atlas and OpenMetadata. Introduced modular data zones, governed schema propagation, and scalable compute layers for structured, semi-structured, and unstructured workloads. 
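The additive schema evolution that Delta Lake and Iceberg provide (e.g. Delta's `mergeSchema` write option) can be illustrated with a simplified merge rule, shown here as a plain-Python sketch: new columns are absorbed, while incompatible type changes are rejected.

```python
def evolve_schema(table_schema: dict, incoming_schema: dict) -> dict:
    """Additively merge an incoming batch schema into the table schema.

    A simplified mirror of Delta Lake's `mergeSchema` behaviour: new
    columns are appended, existing columns must keep their type, and an
    incompatible change raises an error instead of silently corrupting data.
    """
    merged = dict(table_schema)
    for col, dtype in incoming_schema.items():
        if col in merged and merged[col] != dtype:
            raise TypeError(f"incompatible type change for column {col!r}: "
                            f"{merged[col]} -> {dtype}")
        merged[col] = dtype
    return merged

table = {"id": "bigint", "amount": "double"}
batch = {"id": "bigint", "amount": "double", "currency": "string"}
evolved = evolve_schema(table, batch)
print(evolved)  # {'id': 'bigint', 'amount': 'double', 'currency': 'string'}
```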

2. Real-Time Streaming & CDC Pipeline Framework 

Tech: Apache Flink, Kafka Streams, Kafka Connect, Debezium, Kubernetes, Terraform 

Description: 

Engineered a distributed streaming framework delivering continuous data synchronization across operational and analytical systems. Implemented Debezium-based CDC flows for relational sources and built transformation layers with Flink and Kafka Streams. Containerized and deployed the entire stack on Kubernetes using Terraform for infrastructure provisioning. Added schema-registry-driven compatibility rules and event routing patterns for consistent, deterministic streaming behavior across microservices. 
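Debezium change events carry an `op` field (`c`/`u`/`d`, plus `r` for snapshot reads) with `before`/`after` row images; applying them to a downstream replica reduces to a small state machine. A simplified sketch follows (real consumers also handle tombstones, transaction markers, and schema payloads):

```python
def apply_cdc_event(table: dict, event: dict, key: str = "id") -> None:
    """Apply one Debezium-style change event to an in-memory replica.

    `event` mimics the Debezium envelope: {"op": "c"|"u"|"d"|"r",
    "before": {...} | None, "after": {...} | None}.
    """
    op = event["op"]
    if op in ("c", "u", "r"):          # create / update / snapshot read
        row = event["after"]
        table[row[key]] = row
    elif op == "d":                    # delete: only `before` is populated
        table.pop(event["before"][key], None)

replica = {}
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}},
    {"op": "u", "before": {"id": 1, "name": "Ada"},
     "after": {"id": 1, "name": "Ada L."}},
    {"op": "d", "before": {"id": 1, "name": "Ada L."}, "after": None},
]
for e in events:
    apply_cdc_event(replica, e)
print(replica)  # {} -- the row was created, updated, then deleted
```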

3. Data Quality & Observability Automation Layer 

Tech: Spark, Python, Deequ, Great Expectations, Prometheus, OpenTelemetry, Grafana 

Description: 

Developed a comprehensive DataOps observability stack combining data validation, lineage propagation, and pipeline health monitoring. Automated quality checks using Deequ and Great Expectations integrated directly into Spark ETL and ELT workflows. Implemented OpenTelemetry-based tracing across pipelines and instrumented Prometheus exporters for system-level metrics. Built Grafana dashboards for operational visibility, schema drift detection, anomaly surfacing, and pipeline reliability insights.
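The validation layer follows the expectation pattern popularized by Deequ and Great Expectations: declare checks, evaluate them against a batch, surface failures. Its core can be sketched as follows (an illustrative simplification, not either library's API):

```python
def run_expectations(rows, expectations):
    """Evaluate declarative data-quality checks against a batch of rows.

    `expectations` maps a check name to a predicate over the full batch;
    the result mimics a validation report with per-check pass/fail.
    """
    return {name: bool(check(rows)) for name, check in expectations.items()}

batch = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": -3.0},   # bad record: negative amount
]
report = run_expectations(batch, {
    "order_id_not_null": lambda rows: all(r["order_id"] is not None for r in rows),
    "amount_non_negative": lambda rows: all(r["amount"] >= 0 for r in rows),
    "row_count_above_zero": lambda rows: len(rows) > 0,
})
print(report)  # amount_non_negative fails; the other checks pass
```

In the production stack, failed checks would emit Prometheus metrics and halt the Airflow DAG rather than just print a report.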

Education

Bachelor of Science - Computer Science

Preston University

Timeline

Lead Data Engineer

Mavericks Labs
11.2020 - Current

Senior Data Engineer

Airbnb
04.2019 - 08.2020

Data Engineer I

Amida Technology Solutions
07.2015 - 03.2019

Associate Data Engineer

Koverse Inc.
05.2012 - 05.2015

Bachelor of Science - Computer Science

Preston University