

Passionate Lead Data Scientist & Cloud Engineer with 10+ years of experience architecting and leading scalable, high-performance data platforms that turn complex data into strategic business assets.
My expertise centers on AWS, where I've designed end-to-end solutions using Amazon S3 for robust data lakes, AWS Glue for serverless ETL/ELT, Amazon EMR (PySpark/Spark) for big data processing, Amazon Redshift and Spectrum for analytics warehousing, Amazon Kinesis for real-time streaming, AWS Lambda & Step Functions for event-driven orchestration, Amazon Athena for query-on-lake, and AWS Lake Formation for governance and security. I've led migrations to AWS-native architectures, optimized petabyte-scale pipelines (10–15TB+ daily), reduced costs by 35–45% through auto-scaling, Spot Instances, and serverless shifts, and achieved 99.9% uptime with sub-second latency for mission-critical analytics.
I also bring strong Azure capabilities, including Azure Data Factory for pipeline orchestration, Azure Databricks for Spark-based transformations and Delta Lake, Azure Synapse Analytics for unified warehousing and big data processing, and ADLS Gen2 for storage—enabling hybrid/multi-cloud strategies and seamless integrations when needed.
Beyond cloud platforms, I excel in data science and engineering leadership: mentoring teams of 8–12 engineers on best practices, enforcing data governance and quality frameworks, collaborating cross-functionally with data science, analytics, and product teams, and delivering resilient batch/streaming pipelines that support ML models, BI dashboards, and real-time decision-making. Tools like Python (Boto3/PySpark/Pandas), SQL, Apache Spark/Kafka, Airflow (MWAA), Terraform/CloudFormation, and Docker are my daily drivers for building reliable, cost-efficient systems.
What drives me is solving tough data challenges at scale—whether slashing processing times by 50–70%, enabling self-service analytics for hundreds of users, or aligning infrastructure with revenue-generating outcomes. I've delivered millions in business value through optimized, governed platforms.
Open to connecting on multi-cloud data architecture, AWS/Azure migrations, team leadership in data engineering, or opportunities to build next-gen scalable solutions. Let's discuss how we can drive impact together! 🚀
Cloud Data Engineering & Distributed Systems
• Pipeline orchestration and ETL across Databricks, Airflow, Glue, and Data Factory
• Distributed data processing (Spark, Databricks)
• Lakehouse and data warehouse architectures (Delta Lake, BigQuery, Redshift, etc.)
• Act as a technical liaison between customers, service engineering teams, and leadership to design and deploy AWS solutions.
• Led architecture of enterprise data lake on Amazon S3 with AWS Glue Crawlers and Data Catalog, enabling scalable ingestion and discovery for petabyte datasets.
• Designed and implemented high-throughput ETL/ELT pipelines using AWS Glue and PySpark, processing 10TB+ daily and reducing runtime by 45%.
• Architected real-time streaming solutions with Amazon Kinesis Data Streams/Firehose to S3/Redshift, supporting sub-second analytics for mission-critical apps.
• Optimized Amazon Redshift clusters and Spectrum queries on S3, improving BI query performance by 3x and saving $400K+ annually in compute/storage.
• Built serverless data pipelines using AWS Lambda triggered by S3 events, integrated with Glue jobs for automated transformations and cost efficiency.
• Led cross-functional teams in adopting AWS Step Functions for workflow orchestration, enhancing pipeline reliability and reducing manual interventions by 70%.
• Implemented data governance frameworks with AWS Lake Formation and IAM policies, ensuring compliance and secure access across 500+ users.
• Mentored 7+ junior engineers on AWS best practices (Glue, EMR, Lambda), improving team productivity by 30% and speeding up delivery.
• Engineered cost-optimization strategies across AWS services (Glue DPUs, EMR scaling, Athena partitioning), cutting monthly data platform expenses by 40%.
• Developed event-driven architectures with AWS Lambda and EventBridge, automating data quality checks and notifications via SNS.
• Integrated Amazon Athena for ad-hoc querying on S3 data lakes, enabling self-service analytics and reducing dependency on heavy warehousing.
• Built and implemented data infrastructure, ingesting and transforming data via ETL/ELT for large-scale apps.
• Designed and implemented large-scale distributed ETL pipelines across AWS and Databricks environments to support enterprise analytics.
• Led modernization from on-prem data platforms to cloud-based architectures, improving performance and operational efficiency by about 30%.
• Architected enterprise data models and analytics solutions supporting self-service reporting for 200+ users.
• Led a cross-functional team to design and implement a new CI/CD pipeline, reducing deployment time by 30% and presenting the results to senior leadership.
• Built and operated production MLOps platforms using SageMaker, Docker, MLflow, and CI/CD pipelines to standardize model delivery.
• Served as senior technical advisor, translating business requirements into scalable AWS solutions for data and analytics initiatives.
• Engineered a scalable AWS data pipeline using S3 and Lambda, ensuring GDPR compliance for a machine learning workflow.
• Designed and secured multi-account AWS environments using VPC Peering and Transit Gateway.
• Managed and mentored data engineers and analysts, improving delivery quality and architectural consistency.
• Enhanced data analysis and visualization using SQL, Python, R, SAS, and ArcGIS, applying regression analysis, web analytics, and predictive modeling and building dashboards in R Shiny, Power BI, and Tableau, reducing data analysis time by 50% and increasing data insights by 20%.
• Developed predictive analytics and statistical models to optimize program performance, operational efficiency, and financial forecasting for clients in healthcare, government, and technology.
• Automated reporting and pipeline workflows using Python, SQL, and Azure Data Factory, reducing manual processing time by up to 40%.
• Partnered with business stakeholders to translate requirements into scalable BI dashboards and data pipelines supporting enterprise KPIs and strategic decision-making.
• Established early MLOps best practices, including experiment tracking, dataset versioning, and reproducible model training pipelines.
• Conducted in-depth longitudinal analysis of student performance using SQL, Excel and R, driving a 20% improvement in test scores by identifying key performance gaps.
• Designed and implemented data-driven curriculum optimization experiments, reducing inefficiencies by 25% and enhancing learning outcomes.
• Translated complex data insights into actionable recommendations, presenting findings to senior stakeholders and aligning strategies with educational objectives.
• Partnered with school administrators and educators to develop interactive dashboards, streamlining student progress tracking and informing data-driven instructional decisions.
• Designed and implemented training sessions for staff on effective data utilization practices.
Cloud Platforms: Azure (AKS, ARM), AWS (EC2, EKS), GCP
DevOps Tools: JIRA, Jenkins, Slack, Azure DevOps
Build Tools: Ant, Maven, MSBuild
SCMs: SVN, Git, GitHub, Bitbucket, GitLab, Azure Git
IaC Tools: Terraform, CloudFormation
Containers/Orchestration: Docker, Kubernetes
Application/Web Servers: Tomcat, WebLogic 9.x/10.x/12c, Apache 2.x/1.3.x, JBoss 7.1
Operating Systems: Ubuntu 18.04, Red Hat Linux, Windows, HP-UX, Solaris 10
Programming & Scripting Languages: Ruby, Python, UNIX shell scripting (Ksh, Bash), Git Bash
Web Technologies: HTML5, CSS3, JavaScript, JSON
Frameworks and Libraries: Angular, Flask, RESTful APIs, React
Database Technologies: Oracle, SQL Server, MySQL, PostgreSQL, S3, RDS, DynamoDB
Methodologies: Agile, Scrum
Networking/Security Tools: IAM, ELB, PuTTY, VMware
• AWS Certified Solutions Architect – Professional
• AWS Certified Solutions Architect – Associate
• Certified Power BI Associate, Microsoft
• Certified Database Fundamentals (T-SQL), Microsoft
Recognized for outstanding leadership in mentoring junior data scientists and driving cross-functional collaboration at the City of New York.
• Developed a predictive analytics pipeline using Python and Scikit-learn to identify high-risk properties for health and safety violations, based on historical inspection, complaint, and maintenance data.
• Integrated multi-source data using SQL and PySpark in Databricks, ensuring clean, scalable datasets for machine learning model training.
• Designed and published Power BI dashboards for operational managers and field inspectors to visualize risk levels across buildings, zones, and violation types.
• Built a forecasting model using R and Amazon SageMaker to predict peak inspection periods and optimize inspector scheduling, improving field coverage and reducing overtime costs by 20%.
• Automated data extraction and cleansing using SQL scripts, enhancing the timeliness of inspection reports.
• Conducted clustering analysis to group buildings based on historical violations, population vulnerability, and inspection history to develop proactive inspection routes.
• Designed and implemented a financial forecasting system using predictive modeling (Random Forest, Linear Regression) to simulate various budget scenarios.
• Used Azure Machine Learning to deploy models and monitor performance in real-time.