Saeed Hussain

Lead Data Engineer | Cloud & Streaming Data Engineering Expert
Jersey City, NJ

Summary

Senior Data Engineer with 13+ years of experience architecting and optimizing large-scale, cloud-native data platforms across AWS, Azure, and GCP. Expert in building high-performance, fault-tolerant data ecosystems using Spark, Flink, Kafka, Delta Lake, and modern Data Mesh and Lakehouse architectures. Skilled in developing streaming and batch pipelines, implementing CDC frameworks, and enforcing metadata-driven governance, lineage, and observability at enterprise scale. Proficient in Python, Scala, Rust, SQL, and Terraform, with advanced expertise in schema evolution, data virtualization (Trino, Denodo, Starburst), and distributed query optimization. Adept at enabling feature stores, real-time analytics (Pinot, Druid, Materialize), and ML data pipelines, while driving platform scalability, cost efficiency, and DataOps automation across complex, federated data environments.

Overview

14
years of professional experience

Skills

Work History

Lead Data Engineer

Mavericks Labs
11.2020 - Current
  • Designed and implemented a multi-cloud Data Lakehouse ecosystem using Delta Lake and Apache Iceberg, unifying enterprise data domains across AWS and Azure with advanced schema evolution and governance frameworks.
  • Architected real-time ingestion pipelines leveraging Apache Flink, Kafka Streams, and Debezium CDC, ensuring low-latency event delivery and synchronization across analytical systems.
  • Built metadata-driven orchestration frameworks using Airflow 3.x and Dagster, integrating Apache Atlas and OpenMetadata for automated lineage tracking and impact analysis.
  • Enhanced query and compute performance with adaptive Spark execution, vectorized reads, and partition pruning, optimizing distributed query workloads.
  • Directed platform observability and reliability using OpenTelemetry, Prometheus, and Grafana, establishing proactive alerting and end-to-end monitoring across data pipelines.
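The metadata-driven orchestration pattern described above can be illustrated with a simplified sketch (not the production framework): dataset dependencies declared in a metadata catalog are expanded into a dependency-respecting execution order. The catalog names below are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical metadata catalog: each dataset declares its upstream
# dependencies. Names are illustrative, not from the actual platform.
CATALOG = {
    "raw_events": [],
    "cleansed_events": ["raw_events"],
    "dim_customers": [],
    "fact_orders": ["cleansed_events", "dim_customers"],
    "daily_revenue": ["fact_orders"],
}

def execution_order(catalog: dict) -> list:
    """Derive a task order in which every dataset's upstreams run first."""
    return list(TopologicalSorter(catalog).static_order())

order = execution_order(CATALOG)
print(order)
```

In a real deployment the same ordering would be handed to Airflow or Dagster as generated task dependencies, so adding a dataset to the catalog automatically rewires the DAG.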

Senior Data Engineer

Airbnb
04.2019 - 08.2020
  • Developed and maintained streaming data frameworks using Spark Structured Streaming, Kafka, and AWS Kinesis, processing 3B+ IoT records daily.
  • Modernized data warehouse architecture by migrating to Snowflake + dbt, embedding automated schema versioning, CI/CD validations, and Great Expectations-based testing.
  • Delivered cross-system query federation via Trino (PrestoSQL) and Starburst, enabling unified analytics across S3, Redshift, and PostgreSQL.
  • Engineered feature store ingestion pipelines (Feast + MLflow), reducing model training cycles and improving ML data freshness.
  • Standardized infrastructure-as-code deployments through Terraform, Kubernetes, and Docker, improving scalability and environment parity.
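The windowed aggregations behind the streaming frameworks above were built on Spark Structured Streaming; the core tumbling-window logic can be sketched in plain Python (an illustrative stand-in over (timestamp, key) pairs, not the Spark code itself):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per (key, window-start) for a tumbling event-time window.

    `events` is an iterable of (epoch_seconds, key) pairs -- a simplified
    stand-in for a Kafka/Kinesis record stream.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[(key, window_start)] += 1
    return dict(counts)

stream = [(0, "sensor-a"), (10, "sensor-a"), (61, "sensor-a"), (5, "sensor-b")]
result = tumbling_window_counts(stream)
print(result)  # {('sensor-a', 0): 2, ('sensor-a', 60): 1, ('sensor-b', 0): 1}
```

Spark expresses the same thing declaratively via `groupBy(window(...))`, with watermarking handling late-arriving events.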

Data Engineer I

Amida Technology Solutions
07.2015 - 03.2019
  • Built and optimized PySpark and Scala-based ETL jobs for petabyte-scale transactional datasets using AWS EMR and Azure HDInsight clusters.
  • Developed CDC ingestion layers with Kafka Connect + Debezium, integrating Oracle, MySQL, and PostgreSQL systems into unified analytical schemas.
  • Implemented automated data validation pipelines with Deequ and custom Python checks, integrated into Airflow DAGs for continuous monitoring.
  • Enhanced Spark performance using broadcast joins, predicate pushdown, and partition compaction, reducing compute costs by 40%.
  • Partnered with ML teams to deliver Petastorm/Arrow-based data delivery pipelines for distributed model training.
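The broadcast-join optimization above avoids shuffling the large side of a join; its hash-join core can be sketched in plain Python (illustrative only -- Spark applies this automatically when the small side falls under `spark.sql.autoBroadcastJoinThreshold`):

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join a large row set against a small, broadcastable dimension table.

    The small side is materialized into an in-memory hash map (the
    "broadcast"), so the large side streams through with no shuffle.
    """
    lookup = {row[key]: row for row in small_rows}   # broadcast side
    joined = []
    for row in large_rows:                           # streamed side
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined

facts = [{"cust_id": 1, "amount": 30}, {"cust_id": 2, "amount": 45}]
dims = [{"cust_id": 1, "segment": "retail"}]
rows = broadcast_hash_join(facts, dims, "cust_id")
print(rows)  # only cust_id 1 has a matching dimension row
```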

Associate Data Engineer

Koverse Inc.
05.2012 - 05.2015
  • Built foundational ETL workflows using Talend, Python, and SQL, consolidating ERP, CRM, and REST API data into PostgreSQL and HDFS layers.
  • Designed normalized and dimensional schemas (3NF, Star, Snowflake) for analytics and BI reporting using Hive and Impala.
  • Supported the migration from on-prem Hadoop to AWS S3 + EMR, laying groundwork for scalable, cloud-native data operations.
  • Automated ETL job deployments with Jenkins and Git, introducing CI/CD for data pipelines.
  • Created profiling scripts for schema drift detection, missing data thresholds, and outlier pattern analysis.
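The schema drift detection mentioned above reduces to comparing schema snapshots; a minimal sketch (real pipelines would also track nullability, precision, and nested structures):

```python
def detect_schema_drift(baseline: dict, observed: dict) -> dict:
    """Compare two {column: type} schema snapshots and report drift."""
    added = sorted(set(observed) - set(baseline))
    removed = sorted(set(baseline) - set(observed))
    retyped = sorted(
        col for col in set(baseline) & set(observed)
        if baseline[col] != observed[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}

baseline = {"id": "bigint", "email": "string", "created_at": "timestamp"}
observed = {"id": "string", "email": "string", "country": "string"}
drift = detect_schema_drift(baseline, observed)
print(drift)  # {'added': ['country'], 'removed': ['created_at'], 'retyped': ['id']}
```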

Projects

1. Multi-Cloud Lakehouse Integration Platform 

Tech: Delta Lake, Apache Iceberg, Spark, Airflow, Dagster, AWS, Azure, Apache Atlas, OpenMetadata 

Description: 

Designed and built a unified multi-cloud Lakehouse platform enabling cross-cloud analytics across AWS and Azure. Implemented Delta Lake and Apache Iceberg for versioned storage, schema evolution, and ACID guarantees. Created metadata-driven ingestion workflows using Airflow and Dagster with full lineage tracking via Apache Atlas and OpenMetadata. Introduced modular data zones, governed schema propagation, and scalable compute layers for structured, semi-structured, and unstructured workloads. 
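The additive schema evolution that Delta Lake and Iceberg provide (e.g. Delta's `mergeSchema` write option) can be illustrated with a simplified merge rule, shown here as a plain-Python sketch: new columns are absorbed, while incompatible type changes are rejected.

```python
def evolve_schema(table_schema: dict, incoming_schema: dict) -> dict:
    """Additively merge an incoming batch schema into the table schema.

    A simplified mirror of Delta Lake's `mergeSchema` behaviour: new
    columns are appended, existing columns must keep their type, and an
    incompatible change raises an error instead of silently corrupting data.
    """
    merged = dict(table_schema)
    for col, dtype in incoming_schema.items():
        if col in merged and merged[col] != dtype:
            raise TypeError(f"incompatible type change for column {col!r}: "
                            f"{merged[col]} -> {dtype}")
        merged[col] = dtype
    return merged

table = {"id": "bigint", "amount": "double"}
batch = {"id": "bigint", "amount": "double", "currency": "string"}
evolved = evolve_schema(table, batch)
print(evolved)  # {'id': 'bigint', 'amount': 'double', 'currency': 'string'}
```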

2. Real-Time Streaming & CDC Pipeline Framework 

Tech: Apache Flink, Kafka Streams, Kafka Connect, Debezium, Kubernetes, Terraform 

Description: 

Engineered a distributed streaming framework delivering continuous data synchronization across operational and analytical systems. Implemented Debezium-based CDC flows for relational sources and built transformation layers with Flink and Kafka Streams. Containerized and deployed the entire stack on Kubernetes using Terraform for infrastructure provisioning. Added schema-registry-driven compatibility rules and event routing patterns for consistent, deterministic streaming behavior across microservices. 
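Debezium change events carry an `op` field (`c`/`u`/`d`, plus `r` for snapshot reads) with `before`/`after` row images; applying them to a downstream replica reduces to a small state machine. A simplified sketch follows (real consumers also handle tombstones, transaction markers, and schema payloads):

```python
def apply_cdc_event(table: dict, event: dict, key: str = "id") -> None:
    """Apply one Debezium-style change event to an in-memory replica.

    `event` mimics the Debezium envelope: {"op": "c"|"u"|"d"|"r",
    "before": {...} | None, "after": {...} | None}.
    """
    op = event["op"]
    if op in ("c", "u", "r"):          # create / update / snapshot read
        row = event["after"]
        table[row[key]] = row
    elif op == "d":                    # delete: only `before` is populated
        table.pop(event["before"][key], None)

replica = {}
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}},
    {"op": "u", "before": {"id": 1, "name": "Ada"},
     "after": {"id": 1, "name": "Ada L."}},
    {"op": "d", "before": {"id": 1, "name": "Ada L."}, "after": None},
]
for e in events:
    apply_cdc_event(replica, e)
print(replica)  # {} -- the row was created, updated, then deleted
```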

3. Data Quality & Observability Automation Layer 

Tech: Spark, Python, Deequ, Great Expectations, Prometheus, OpenTelemetry, Grafana 

Description: 

Developed a comprehensive DataOps observability stack combining data validation, lineage propagation, and pipeline health monitoring. Automated quality checks using Deequ and Great Expectations integrated directly into Spark ETL and ELT workflows. Implemented OpenTelemetry-based tracing across pipelines and instrumented Prometheus exporters for system-level metrics. Built Grafana dashboards for operational visibility, schema drift detection, anomaly surfacing, and pipeline reliability insights.
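The validation layer follows the expectation pattern popularized by Deequ and Great Expectations: declare checks, evaluate them against a batch, surface failures. Its core can be sketched as follows (an illustrative simplification, not either library's API):

```python
def run_expectations(rows, expectations):
    """Evaluate declarative data-quality checks against a batch of rows.

    `expectations` maps a check name to a predicate over the full batch;
    the result mimics a validation report with per-check pass/fail.
    """
    return {name: bool(check(rows)) for name, check in expectations.items()}

batch = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": -3.0},   # bad record: negative amount
]
report = run_expectations(batch, {
    "order_id_not_null": lambda rows: all(r["order_id"] is not None for r in rows),
    "amount_non_negative": lambda rows: all(r["amount"] >= 0 for r in rows),
    "row_count_above_zero": lambda rows: len(rows) > 0,
})
print(report)  # amount_non_negative fails; the other checks pass
```

In the production stack, failed checks would emit Prometheus metrics and halt the Airflow DAG rather than just print a report.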

Education

Bachelor of Science - Computer Science

Preston University

Timeline

Lead Data Engineer

Mavericks Labs
11.2020 - Current

Senior Data Engineer

Airbnb
04.2019 - 08.2020

Data Engineer I

Amida Technology Solutions
07.2015 - 03.2019

Associate Data Engineer

Koverse Inc.
05.2012 - 05.2015

Bachelor of Science - Computer Science

Preston University