Pradeep Reddy

Summary

  • Data Engineer with over 10 years of experience designing, developing, and maintaining large-scale, cloud-native distributed systems.
  • Proficient in building ETL/ELT pipelines with Python and PySpark and in orchestrating workflows with Apache Airflow.
  • Experienced in scaling and optimizing large data transformations with Python, SQL, and Spark, including Azure Synapse Spark Pools for fast, efficient big data processing.
  • Skilled in building and integrating solutions across AWS and Azure, with hands-on expertise in AWS EC2, S3, RDS, and Azure-native services to deliver scalable, secure cloud architectures.
  • Managed complex data warehousing solutions using Snowflake and Azure Synapse Analytics, focusing on schema design, performance tuning, and cost optimization.
  • Extensive experience in data modeling and database design across relational and NoSQL systems, with strong expertise in writing complex, high-performance SQL queries.
  • Proficient in creating interactive business intelligence reports and dashboards with Power BI, building ETL workflows with SSIS, and developing SSRS reports for end-to-end data solutions.
  • Experienced in Agile development and CI/CD automation with Azure DevOps and Git, and skilled in containerization and orchestration with Docker and Kubernetes.
  • Focused on high standards of data quality, security, and governance, implementing best practices with Azure Policy, Azure Security Center, and serverless computing through Azure Functions.

Overview

11 years of professional experience

Work History

Sr. Data Engineer

Genesis
07.2023 - Current
  • Designed and developed scalable ETL pipelines using PySpark and Snowflake SQL on Databricks to process batch and real-time data workloads.
  • Built scalable data solutions in Databricks with complex Spark SQL logic and PySpark transformations.
  • Conducted performance tuning to improve reliability and cost-efficiency.
  • Led the migration architecture for transitioning on-premises data solutions to the Snowflake cloud data warehouse.
  • Integrated data from a variety of structured and semi-structured sources, including Veeva/Salesforce and APIs, into Snowflake for centralized data processing.
  • Built and deployed ETL pipelines in Azure Data Factory (ADF), Snowflake, and Databricks, managing data flows for business intelligence and reporting systems.
  • Developed metadata-driven ingestion and transformation logic for scalable ETL and reusability.
  • Managed Delta Lake on Databricks, organizing data into Bronze, Silver, and Gold layers to facilitate real-time and batch processing (a simplified transform is sketched below).
  • Integrated Azure Synapse and Snowflake into the data warehouse landscape for high-performance analytical workloads.
  • Used Delta Live Tables on Databricks for real-time pipeline orchestration with built-in quality checks and rollback features.
  • Implemented Delta Sharing to enable secure data exchange across organizational boundaries.
  • Built data models and asset structures for ODS, EDW, and Data Marts, aligning with reporting and BI consumption needs.
  • Automated deployments and versioning using Azure DevOps and GitHub CI/CD workflows.
  • Participated in Agile squad-based development, actively contributing to design discussions and sprint planning.
  • Environment: Python, Azure Blob Storage, Azure Data Factory, Azure Databricks, Azure Data Lake, REST APIs, Hackolade, PySpark, ETL Development, API Integration, Data Warehousing Concepts, Snowflake, Shell Scripting, Source Code Control, DevOps Tools, Cloud Security, Agile/Scrum, DataStage
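
The Delta Lake layering work above can be illustrated with a minimal PySpark sketch of a Bronze-to-Silver transform on Databricks; the paths, table, and column names (raw_orders, order_id, order_ts) are hypothetical placeholders rather than details from this engagement.

```python
# Minimal sketch: promote raw Bronze data to a cleansed Silver Delta table.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze_to_silver").getOrCreate()

# Read the raw (Bronze) layer stored as Delta.
bronze_df = spark.read.format("delta").load("/mnt/datalake/bronze/raw_orders")

# Basic cleansing and deduplication for the Silver layer.
silver_df = (
    bronze_df
    .filter(F.col("order_id").isNotNull())
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("ingest_date", F.to_date("order_ts"))
    .dropDuplicates(["order_id"])
)

# Write the curated Silver layer back as Delta, partitioned by ingestion date.
(
    silver_df.write.format("delta")
    .mode("overwrite")
    .partitionBy("ingest_date")
    .save("/mnt/datalake/silver/orders")
)
```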

Data Engineer

Merck Pharma
03.2021 - 06.2023
  • Implemented Medallion Architecture on Azure Databricks with Delta Lake to facilitate scalable, maintainable, and auditable data layers.
  • Built ingestion pipelines from SAP and cloud applications using ADF and PySpark.
  • Designed Delta Sharing flows to securely distribute curated datasets across internal analytics teams.
  • Developed reusable pipeline templates and libraries to standardize transformation logic.
  • Applied advanced performance tuning to Spark workloads, improving execution times and reducing compute costs.
  • Used Delta Live Tables for real-time pipeline orchestration with built-in quality checks and rollback features (see the sketch below).
  • Automated deployment and testing processes via Azure DevOps pipelines.
  • Used Azure Databricks for large-scale data transformation and processing, utilizing PySpark for data handling.
  • Applied Azure DevOps to manage CI/CD pipelines, streamline deployment, and ensure smooth application integration.
  • Oversaw the integration of MDM principles into the data architecture to ensure consistency, accuracy, and governance of master data across platforms, including schemas and integration patterns aligned with business intelligence and compliance requirements.
  • Utilized Unix/Linux systems for deploying and managing applications, employing Shell scripting for automation.
  • Developed high-volume REST APIs to enable seamless data integration and exchange.
  • Demonstrated strong understanding of Data Warehouse concepts, including real-time data ingestion, data modeling, dimensional modeling, and denormalized structures.
  • Conducted unit and end-to-end testing to validate ETL pipeline functionality and performance, ensuring robustness and accuracy in systems critical to production operations.
  • Environment: Python, Azure Blob Storage, Azure Data Factory, Azure Databricks, Azure Data Lake, REST APIs, Hackolade, PySpark, ETL Development, API Integration, Data Warehousing Concepts, Snowflake, Shell Scripting, Source Code Control, DevOps Tools, Cloud Security, Agile/Scrum, DataStage
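
The Delta Live Tables usage noted above can be sketched as follows; the source path, table, and column names are hypothetical, and the code runs only inside a Databricks DLT pipeline, where `spark` is provided by the runtime.

```python
# Minimal Delta Live Tables sketch with built-in expectations (quality checks).
# Source path and column names are hypothetical; runs only inside a DLT pipeline.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw ingest into the Bronze layer via Auto Loader")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")      # `spark` is provided by the DLT runtime
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/events")
    )

@dlt.table(comment="Curated Silver layer with quality checks applied")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")   # drop rows failing the check
@dlt.expect("recent_event", "event_ts >= '2021-01-01'")         # record violations as metrics
def silver_events():
    return (
        dlt.read_stream("bronze_events")
        .withColumn("event_ts", F.to_timestamp("event_ts"))
    )
```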

Data Engineer

Macy’s
10.2018 - 02.2021
  • Converted large datasets from SQL Server, MySQL, PostgreSQL, and CSV files into data frames using PySpark, handling tens of thousands of records in batch processing.
  • Researched and integrated Spark-Avro jars, developing PySpark code to save data frames to HDFS as Avro files.
  • Migrated data pipelines to AWS EMR clusters for processing, storing output in S3 and using Spark’s parallel processing to enhance data ingestion speed.
  • Developed and executed HQL scripts to create external tables within the raw data layer in Hive.
  • Implemented AWS IAM and encryption to secure data, ensuring privacy and compliance with industry standards.
  • Conducted online data migration from an ERP system using AWS DMS and AWS DataSync.
  • Leveraged Apache Spark to improve algorithm performance within the Hadoop ecosystem, integrating Spark with Snowflake for efficient data processing via Spark context, Spark-SQL, DataFrames, and RDDs.
  • Developed a script to transfer Avro-formatted data from HDFS to external tables in the raw data layer.
  • Created PySpark code using Spark SQL to transform Avro data in the raw layer into ORC format in the data service layer.
  • Managed PySpark scripts to create data frames from data service layer tables, storing results in a Hive data warehouse.
  • Set up Lambda jobs and configured IAM roles via the AWS CLI, utilizing AWS Athena and Glue for querying data.
  • Transferred data between AWS S3 and Snowflake for enhanced data integration.
  • Conducted data preprocessing and feature engineering using Python Pandas to support predictive analytics.
  • Migrated data pipelines from Informatica to run on AWS EMR clusters.
  • Developed and managed Airflow DAGs in Python, using Airflow to schedule and automate data ingestion tasks (see the DAG sketch below).
  • Worked with Informatica PowerCenter tools, including Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
  • Used Informatica PowerCenter for ETL, extracting, transforming, and loading data from diverse source systems into target databases.
  • Designed and implemented mappings in PowerCenter Designer, transforming data to meet specific requirements.
  • Created custom data models for a Data Warehouse that supports real-time data from multiple sources.
  • Environment: AWS Glue, PySpark, Matillion, AWS Redshift, AWS Redshift Spectrum, AWS Lambda, AWS Step Functions, AWS S3, AWS Athena, DynamoDB, Oracle DB, SQL Server, SSIS, CI/CD Pipelines (GitHub), Python, Scala, Spark SQL, EMR, Auto Scaling, Flink, CloudWatch, CloudFormation, Flat files
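
The Airflow scheduling work above can be illustrated with a minimal DAG sketch; the DAG id, schedule, and the ingest_from_s3 callable are hypothetical placeholders rather than pipelines from this role.

```python
# Minimal Airflow DAG sketch: schedule a daily ingestion task.
# DAG id, schedule, and the ingestion callable are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_from_s3(**context):
    # Placeholder: pull the day's files from S3 and load them into the raw layer.
    print("ingesting partition", context["ds"])

with DAG(
    dag_id="daily_s3_ingestion",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(
        task_id="ingest_from_s3",
        python_callable=ingest_from_s3,
    )
```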

Data Engineer

Yana Software Private Limited
09.2016 - 04.2018
  • Developed a custom File System plugin that allows unmodified Hadoop MapReduce programs, HBase, Pig, and Hive to directly access files.
  • Extensively utilized Expressions, Variables, and Row Count in SSIS packages.
  • Created MapReduce jobs using Pig Latin, actively engaging in ETL, data integration, and migration tasks.
  • Established and managed Hive tables with HiveQL, along with defining job flows (see the sketch below).
  • Used Sqoop for data imports and exports between Oracle Database and HDFS.
  • Built batch jobs and configuration files to automate processes through SSIS.
  • Designed SSIS packages to transfer data between SQL Server and Excel Spreadsheets.
  • Scheduled and deployed SSRS reports for daily, weekly, monthly, and quarterly outputs.
  • Set up and loaded Hive tables, executing MapReduce-backed Hive queries; developed a custom Hadoop File System plugin for Data Platform access.
  • Installed and configured Pig, writing scripts in Pig Latin.
  • Designed and implemented a large-scale parallel relation-learning system using MapReduce.
  • Conducted data validation and cleansing on staged records before loading into the Data Warehouse.
  • Automated extraction of various file types, including flat and Excel files, from sources like FTP and SFTP.
  • Environment: Hadoop, CDH, MapReduce, Pig, MS SQL Server, SQL Server Business Intelligence Development Studio, Hive, HBase, SSIS, Office, Excel, Flat Files, T-SQL
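
The Hive table work above can be illustrated with a small HiveQL sketch; issuing it through PySpark's Hive support here is purely for readability (the original work used Hive directly), and the database, table, and path names are hypothetical placeholders.

```python
# Minimal sketch: define an external Hive table over raw Avro files and query it.
# Database, table, and HDFS paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("raw_layer_tables")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS raw_db")

# External table over Avro files landed in the raw layer on HDFS.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_db.customer_raw (
        customer_id STRING,
        customer_name STRING,
        load_dt STRING
    )
    STORED AS AVRO
    LOCATION 'hdfs:///data/raw/customer'
""")

# Simple HiveQL validation query against the raw layer.
spark.sql("""
    SELECT customer_id, trim(customer_name) AS customer_name, load_dt
    FROM raw_db.customer_raw
    WHERE customer_id IS NOT NULL
""").show(5)
```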

Data Engineer

Cybage Software Private Limited
10.2014 - 08.2016
  • Designed robust, reusable, and scalable data-driven solutions and pipeline frameworks to automate ingestion, processing, and delivery of structured and unstructured data in both batch and real-time streams, leveraging Python programming.
  • Built data warehouse structures, including facts, dimensions, and aggregate tables, utilizing dimensional modeling techniques such as Star and Snowflake schemas.
  • Applied data transformations within Spark DataFrames, performing in-memory computations to generate response outputs.
  • Troubleshot and optimized Spark applications and Hive scripts for peak performance.
  • Employed Spark DataFrame API for analytics on Hive data and performed data validation tasks using Spark DataFrame operations.
  • Built comprehensive ETL models to process extensive customer feedback, yielding actionable insights and business solutions.
  • Utilized Spark Streaming to partition streaming data into batches for processing within the Spark engine (see the sketch below).
  • Developed Spark applications for data validation, cleansing, transformation, and custom aggregation, using Spark SQL to support data analysis for data scientists.
  • Automated data ingestion processes through scripts in PySpark and Scala, pulling from sources such as APIs, AWS S3, Teradata, and Snowflake.
  • Created an automated business category mapping system that aligns customer business information with external sources like Google, Facebook, Yelp, and Bing.
  • Developed a data quality control model to track and update outdated business information, using APIs for validation.
  • Oversaw a sentiment prediction model for customer reviews, ensuring high-performance ETL processes.
  • Performed data cleaning, pre-processing, and modeling using Spark and Python.
  • Implemented secure, real-time REST APIs for data consumption with AWS services (Lambda, API Gateway, Route 53, Certificate Manager, CloudWatch, Kinesis), integrated with Swagger, Okta, and Snowflake.
  • Developed automation scripts to transfer data from on-premises clusters to Google Cloud Platform (GCP).
  • Engaged in tuning and optimization of long-running Spark jobs and Hive/SQL queries.
  • Implemented real-time AWS CloudWatch log streaming to Splunk via Kinesis Firehose.
  • Developed a network monitoring dashboard using Django, Python, MongoDB, and JSON, tracking access points and performance metrics.
  • Built an application for monitoring, root cause analysis, and WLAN data management by parsing logs through Python Django and MongoDB, generating data in JSON format.
  • Environment: Python, Django, Flask, REST API, Pickling, SQL Server, MongoDB, Triggers, MySQL, Shell Scripting, AWS Kendra, AWS Neptune, AWS CloudWatch, AWS S3, MS Excel, VBA, AWS Redshift
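
The Spark Streaming micro-batching mentioned above can be sketched as follows; the socket source, host, port, and batch interval are hypothetical placeholders, using the DStream API that matches the Spark versions of that period.

```python
# Minimal Spark Streaming sketch: process a text stream in 10-second micro-batches.
# Source host/port and batch interval are hypothetical placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="review_stream")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Each micro-batch arrives as an RDD and can be transformed with normal Spark operations.
lines = ssc.socketTextStream("localhost", 9999)
word_counts = (
    lines.flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
word_counts.pprint()

ssc.start()
ssc.awaitTermination()
```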

Education

B.Tech - Computer Science

B.V.Raju Institute of Technology
06.2014

Skills

  • Programming Languages: Python, Java, R, SQL, Scala, Shell Scripting, XML
  • Databases: SQL Server, Oracle, Snowflake, Azure SQL, PostgreSQL, MySQL, Hive, Redshift
  • Big Data Technologies: Spark, Hive, Delta Lake, Kafka, Storm, Impala, Hadoop Ecosystem, Spark SQL
  • Cloud Technologies: Azure (Data Factory, Databricks, ADLS Gen2), AWS (EC2, ECS, S3, Glue, Lambda, Step Functions, Redshift, EMR, Athena, Kinesis, DMS, DataSync), Unity Catalog
  • ETL Tools: Azure Data Factory, Databricks, IBM DataStage, Informatica, SSIS
  • Orchestration & DevOps: Azure DevOps, Jenkins, GitHub, CI/CD Pipelines, Terraform, Shell Scripting
  • Data Visualization: Power BI, Tableau
  • Modeling Tools: Erwin, ER Studio
  • Git version control
  • ETL development
  • Big data processing
  • Python programming
  • Kafka streaming
  • NoSQL databases
  • Data pipeline design
  • Data modeling
  • API development
  • Hadoop ecosystem
  • Performance tuning
  • Data warehousing
  • Spark development
  • Advanced SQL
  • Data security
  • Metadata management
  • Real-time analytics
  • Scala programming
  • Data curating
  • Linux administration
  • Java development
  • Continuous integration
  • Data integration
  • SQL and databases
  • Database design
  • RDBMS
  • SQL programming
  • Data migration
  • Advanced analytics
  • Relational databases
  • Storage virtualization
  • Risk analysis
  • Business intelligence
  • Data analysis
  • Technology leadership work streams
  • Database administration
  • Backup and recovery
  • Big data technologies
  • SQL transactional replications
  • Data governance
  • Data acquisitions
  • Amazon Redshift
  • Enterprise resource planning software
  • Data programming
  • Analytical thinking
  • Advanced data mining
  • Large dataset management

Timeline

Sr. Data Engineer

Genesis
07.2023 - Current

Data Engineer

Merck Pharma
03.2021 - 06.2023

Data Engineer

Macy’s
10.2018 - 02.2021

Data Engineer

Yana Software Private Limited
09.2016 - 04.2018

Data Engineer

Cybage Software Private Limited
10.2014 - 08.2016

B.Tech - Computer Science

B.V.Raju Institute of Technology
06.2014