Sandip Das

San Ramon, CA

Summary

Dynamic software development professional with over 12 years of experience, including more than 5 years specializing in data engineering. Expertise in designing, implementing, and maintaining robust batch and streaming data platforms, leveraging full-stack proficiency in Python, SQL, and Java, complemented by extensive knowledge of various RDBMS and data lake technologies. Success in executing ELT and ETL processes using tools such as Snowflake, Databricks, Pig, Hive, MapReduce, Spark, and YARN, alongside deploying jobs on Hadoop clusters with Cloudera and Hortonworks distributions. Skilled in creating automated data-movement frameworks using Python scripts and the Airflow scheduler, with a solid understanding of cloud technologies such as Amazon S3, Redshift, and Google Cloud Platform.

Work History

Data Engineer

DoorDash

02.2021 - Current

Designed and implemented an automated ETL platform to generate merchant invoices, supporting 500M in revenue.
Developed ETL processes to efficiently transform and load large datasets into cloud storage solutions.
Collaborated with cross-functional teams to ensure data integrity and optimize data models.
Mentored junior engineers on best practices for database architecture and data management.
Spearheaded initiatives to improve system performance, reducing query response times significantly.
Established data governance policies to maintain compliance with industry regulations and standards.
Led projects focusing on automation of manual processes, increasing operational efficiency across business units.
Deployed in-line validation to detect data anomalies.
Fine-tuned query performance and optimized database structures for faster, more accurate data retrieval and reporting.
Enhanced data quality by performing thorough cleaning, validation, and transformation tasks.
Optimized data processing by implementing efficient ETL pipelines and streamlining database design.
Provided technical guidance and mentorship to junior team members, fostering a collaborative learning environment within the organization.
Collaborated with cross-functional teams for seamless integration of data sources into the company's data ecosystem.

Data Engineer

PayPal

San Jose, CA

01.2014 - 11.2020

Designed, developed, and maintained an automated, low-latency, scalable data-movement framework that ingests incremental data from Oracle tables into Hadoop for analytics use cases, with strict SLA adherence and high data quality.
Designed and architected a data-movement solution from Oracle to Hadoop HDFS using Python and Airflow.
Worked with key stakeholders to understand requirements and implemented the solution in a timely manner.
Created a MySQL metadata-driven framework for data ingestion.
Added data quality checks for data consistency and a handshaking mechanism for downstream applications.
Created a snapshot feature to merge incremental data with historical data.
Provided daily maintenance support and optimized queries to run jobs faster.
Onboarded new use cases into the scalable framework.
Created two types of Hive tables, in Avro and Parquet file formats, for different use cases.
Developed and maintained a metadata-based batch data-movement framework that seamlessly moves historical and incremental data between heterogeneous source and target systems. This framework serves thousands of downstream users running analytical use cases.
Developed, maintained, and supported a data-movement and batch-replication framework that moves 7000+ objects daily across heterogeneous systems such as Oracle, Teradata, and Hadoop.
Scheduled jobs through Python daemons. Implemented a publish-subscribe model to extract data from the source once and reuse it for multiple targets.
Automated high-performance data processing systems to drive business growth and improve product experience.
Ensured high-quality software engineering practices in building data infrastructure and pipelines at scale.
Built and added Teradata TPT, Hadoop and Oracle connectors to the framework.
Ingested data from an AWS Redshift database using the JDBC connector.
Optimized all data ingestion processes for strict SLA adherence.
Created a metadata-driven onboarding and daily execution process.
Worked with customers to add new features and connectors.
Added data-quality checks throughout the framework to maintain data consistency.
Acted as technical lead and provided guidance to team members in resolving technical issues.
Designed and developed data pipelines using metadata-driven ETL tools and open-source data processing frameworks.
Provided production support and resolved high-priority incidents and development coding issues.
Optimized SQL queries for high-volume objects to improve the SLA of the data pipelines.
Worked with cross-functional teams to enable data insights through the data lifecycle.
Followed Scrum agile methodology for all daily activities.
Acquired extensive experience in troubleshooting data issues, analyzing end-to-end data pipelines, and working with users to resolve issues.
Consulted regularly with internal customers on application development project status, new project proposals and software-related technical issues.
Created and maintained a robust, scalable streaming data pipeline that moves billions of messages from Kafka to HDFS with minimal data loss and in a fault-tolerant way. This pipeline serves a variety of use cases, such as running fraud models and other risk use cases, generating dashboards, and feeding a reconciliation engine.
Provided development and maintenance support for streaming data pipelines using Kafka, Storm, and HDFS that move 1 billion messages and 100 TB of data daily.
Used Apache Kafka and Storm to ingest messages into HDFS.
Used Confluent schema registry to store schema of each Kafka topic.
Data security, vulnerability management, and datacenter migration.
As part of a data security initiative, implemented automated and secure access management for all ETL processes through the secure keymaker application.
Contributed to the seamless datacenter migration of 500+ ETL and data-movement hosts.
Developed and maintained a data-movement framework using the ETL tools Informatica and Ab Initio.
Worked closely with other business analysts and infrastructure specialists to deliver high availability solutions for mission-critical applications.
Installed and configured software applications and tested solutions for effectiveness.
Versed in complete software life cycle from preliminary needs analysis to enterprise-wide deployment and support.
Gathered data on integration issues and vulnerabilities and reported all findings, including improvement recommendations.

Software Engineer

PayPal

San Jose, CA

01.2007 - 01.2014

Designed, developed, and supported various ETL workflows and ingested data for various data marts.
Designed and architected Oracle change data capture using GoldenGate to create an operational information store.
Designed, developed, and maintained Informatica workflows to move data from source systems to various target data marts.
Converted legacy Ab Initio applications into Informatica mappings and workflows.
Scheduled all ETL jobs through the UC4 and Control-M schedulers.
Implemented performance tuning of long running critical data pipelines by adding parallelism.
Designed and developed a data pipeline to extract incremental data from Oracle for downstream use.

Education

Bachelor of Engineering -

Jadavpur University

Kolkata, India

Skills

Database: Snowflake, Postgres, Oracle, Teradata, MySQL, AWS Redshift
Big Data: Hadoop, HDFS, Hive, Pig, Spark, Kafka, Storm, ZooKeeper, Flume, Sqoop, YARN
Language: Python, SQL, Java, Unix shell scripting
Schedulers: Airflow, Control-M, UC4
ETL Tools: Informatica, Ab Initio, Talend, Ansible, Databricks, dbt
Data Visualization Tools: Sigma, Tableau
Spark framework

Performance tuning
SQL programming
Business intelligence
Problem-solving
Teamwork and collaboration
Excellent communication

Accomplishments

Built and maintained an automated data ingestion platform that generates 100K invoices across multiple business units, supporting 500M in revenue.

Contributed to building a data replication ETL platform spanning heterogeneous sources and targets such as Oracle, Teradata, and Hadoop.

Supervised and mentored a team of 4 junior team members.
