
SAITEJA RAPELLI

Austin, USA

Summary

  • 11+ years of experience in the IT industry, specializing in Azure tools and services, including Azure ADLS Gen2, Azure Blob Storage, Azure Synapse Analytics, Azure Data Factory, Azure Functions, Azure Stream Analytics, Azure Logic Apps, and Azure Cosmos DB
  • Utilized Azure ADLS Gen2 to create data lakes and triggered Azure Functions on file changes
  • Managed Azure Data Lake Storage, overseeing metadata for efficient organization of tables, partitions, and databases
  • Led the migration of on-premises data to Azure Synapse Analytics, optimizing data storage and streamlining analytics processes, resulting in a 20% reduction in query response times
  • Implemented real-time analytics solutions on Azure Databricks, providing stakeholders with timely insights and improving decision-making processes by 20%
  • Implemented a real-time data processing solution using Azure Stream Analytics, creating producers and consumers for data publication and processing within Azure Event Hubs
  • Designed and implemented a scalable Delta Lake architecture using Azure Databricks
  • Enhanced query performance by optimizing SnowSQL scripts in Snowflake, resulting in a 30% reduction in query execution time and improved data accessibility
  • Implemented Snowpipe in Snowflake for real-time data ingestion, reducing data latency by 50% and streamlining the staging process for faster analytics
  • Transformed data modeling with dbt, optimizing data pipelines and increasing analytics accuracy for streamlined insights and informed decision-making
  • Led Snowpark integration in Snowflake for complex data transformations, enhancing ETL efficiency and enabling advanced analytics within Snowflake's virtual data warehouse
  • Secured Azure Virtual Machine environments by configuring Network Security Groups, implementing Azure Active Directory roles, and regularly updating and patching instances
  • Implemented advanced access controls in Azure Active Directory, defining policies conditioned on IP address, time, and Virtual Network to strengthen security and compliance
  • Employed Azure Monitor to collect, store, and analyze Azure service and application logs efficiently
  • Attained low-latency performance by leveraging Azure Cosmos DB's indexing and partitioning for data retrieval
  • Selected and managed diverse database engines on Azure, including Azure SQL Database, MySQL, PostgreSQL, Oracle, and SQL Server, for optimal performance and flexibility
  • Proficient in developing and optimizing large-scale Spark applications with PySpark, utilizing RDDs, DataFrames, and Spark SQL for efficient analytics and processing
  • Experienced in architecting data pipelines with Azure Event Hubs, managing partitions and topics and integrating with Azure Stream Analytics for continuous analysis of streaming data
  • Optimized T-SQL query efficiency by implementing data partitioning and indexing, reducing data scanning and enhancing overall query performance
  • Utilized Azure SQL Database for operational needs, crafting complex SQL queries with joins, window functions, and indexing to optimize performance
  • Experienced in NoSQL databases such as Azure Cosmos DB and Cassandra, proficient in schema design, data modeling, and query optimization for scalable, high-performance storage and retrieval of unstructured data
  • Automated workflows with Azure Logic Apps, ensuring security, scalability, and reliable task orchestration
  • Implemented Snowflake for data warehousing, enhancing SQL-based analytics capabilities and improving data accessibility for cross-functional teams, boosting data-driven decision-making efficiency
  • Implemented Power BI solutions, collaborating with data analysts to create visualizations
  • Developed and implemented optimized data processing pipelines on Azure Databricks, resulting in a 30% reduction in processing time and improved overall system efficiency
  • Skilled in Azure DevOps version control, adept at branching, merging, and conflict resolution, ensuring code integrity
  • Proficient in Azure DevOps CI/CD pipelines, automating software build, test, and deployment for continuous integration and delivery
  • Streamlined project workflows using Git, Bitbucket, JIRA, Confluence, and Notion, fostering transparent collaboration and accelerating project delivery
  • Expert in Agile, leading cross-functional teams through sprint planning, stand-ups, and retrospectives for software delivery

Overview

  • 11 years of professional experience
  • 1 Certification

Work History

Sr. Data Engineer

Texas Department of Family and Protective Services
10.2023 - Current
  • Led the development of a customer insights dashboard, utilizing diverse data sources including Azure SQL Database, Azure Blob Storage, Azure Data Lake Storage Gen2, and Azure HDInsight, employing JSON, CSV, Parquet, and ORC formats
  • Orchestrated efficient data streaming into Azure Event Hubs for initial processing, leveraging producers and consumers
  • Optimized topic structures, improving processing efficiency by 40% and reducing costs
  • Implemented Azure Synapse Analytics for Spark transformations, leveraging Synapse catalogs for metadata storage, and crafting 100+ Spark jobs to process data from various sources including Azure Blob Storage and transactional servers
  • Optimized Azure Data Factory pipelines, reducing data processing time by 40%, enhancing ETL efficiency, and ensuring timely and accurate data delivery for analytics and reporting
  • Integrated Unity Catalog with Azure Databricks, fortifying security, meeting compliance standards, and elevating data governance for enhanced regulatory adherence
  • Implemented complex workflows with Apache Airflow, ensuring seamless orchestration, efficient task scheduling, and reliable data pipelines for analytics and reporting
  • Engineered transformations on individual files using Azure Functions, scheduled through Azure Logic Apps, ensuring timely execution and maintenance of data processing tasks with minimal manual intervention
  • Implemented Unity Catalog, fostering seamless metadata management, ensuring data consistency, and facilitating comprehensive insights across diverse data sources
  • Designed and managed workflows in Apache Airflow, creating DAGs with Python, SQL, and Bash operators and scheduling tasks with defined cron expressions for seamless execution and monitoring
  • Architected data pipelines for optimal transformation and loading into Azure Synapse Analytics, enabling large-scale data processing and reducing storage costs through efficient data management strategies
  • Implemented Azure RBAC to manage policies and permissions for different Azure resources, ensuring secure access and governance across the entire data infrastructure
  • Spearheaded the migration of on-premises data to Snowflake on Azure, reducing query times by 40% and enhancing overall data accessibility for the organization
  • Implemented a robust data governance framework on Snowflake, ensuring compliance with industry regulations and improving data quality, leading to more informed decision-making
  • Orchestrated the integration of external data sources with Snowflake on Azure, enabling real-time analytics and providing valuable insights that contributed to a 25% increase in operational efficiency
  • Developed Azure Functions and API Management for seamless financial data submission
  • Identified and implemented cost-saving measures on Azure Databricks by optimizing resource utilization, resulting in a 15% reduction in cloud infrastructure costs
  • Employed Spark and PySpark for scalable data processing, accelerating analytics workflows and handling large datasets effectively
  • Leveraged PySpark's RDD and DataFrame APIs within Spark for distributed data processing, enhancing performance and scalability
  • Applied Spark's machine learning libraries, utilizing PySpark for model development, training, and evaluation on diverse datasets
  • Established a robust CI/CD pipeline using Azure DevOps and Azure Functions for efficient financial data processing
  • Deployed Snowflake on Azure, utilizing SnowSQL for scalable data querying and management, enhancing analytics with features like automatic scaling and native support for semi-structured data
  • Worked with Python and Scala to transform Hive/SQL queries into Spark (RDDs, DataFrames, and Datasets), customizing them for financial data processing
  • Leveraged the Scala programming language to build microservices for financial data applications
  • Applied expertise in Spark SQL to manage Hive queries in an integrated Spark environment tailored for financial data analysis, improving query efficiency by 75%
  • Created data frames and datasets using Spark and Spark Streaming, then performed transformations and actions, catering to the unique requirements of financial data processing
  • Demonstrated experience with Azure Event Hubs for publish-subscribe messaging as a distributed commit log, with a particular focus on managing financial data streams
  • Leveraged Azure Logic Apps for debugging and monitoring scheduled jobs, streamlining troubleshooting processes within the workflow management system
  • Facilitated seamless integration of transformed data with Power BI for visualization, empowering stakeholders to derive actionable insights that led to informed decision-making with significant cost savings
  • Collaborated effectively with cross-functional teams, including business analysts, data scientists, and data engineers, ensuring alignment with business objectives and delivery of high-impact data solutions
  • Environment: Azure Blob Storage, Azure SQL Database, Azure Data Lake Storage, Azure HDInsight, Azure Databricks, Unity Catalog, Azure Logic Apps, Azure Synapse Analytics, Azure Event Hubs, Azure Functions, Azure DevOps, Snowflake on Azure, Python, Scala, Spark (PySpark, SparkSQL), Kafka, Power BI, Linux, Java, Airflow, PostgreSQL, Oracle PL/SQL, Flink

Sr. Data Engineer

Klaviyo Inc
03.2022 - 09.2023
  • Collaborated with data scientists and utilized Azure Data Factory, including Azure Data Factory data flows and data catalogs, to develop a Spending Classification model on corporate card data, enhancing data organization and analysis capabilities
  • Conducted extensive data exploration, gathering data from Azure Synapse Analytics, Azure SQL Database, and Azure Data Lake Storage Gen2 to facilitate comprehensive analysis and derive meaningful insights for informed decision-making processes
  • Leveraged Azure API Management for seamless API calls, employed Azure Functions to integrate data from Concur Expense management platform, and dynamically created Azure Data Lake Storage to streamline data management processes
  • Managed global financial data, leveraging Azure Cosmos DB and Azure SQL Database for efficient and scalable data storage
  • Leveraged Azure Data Factory data flows and Azure RBAC for efficient metadata storage and data management
  • Implemented feature engineering on large datasets using Azure HDInsight Spark clusters, optimizing data for model development and achieving a 20% improvement in processing efficiency
  • Successfully integrated machine learning models into Azure Databricks pipelines, enabling predictive analytics and improving business forecasting accuracy by 25%
  • Automated SQL queries and built Azure Logic Apps workflows for streamlined data processing and delivery, reducing manual intervention
  • Delivered data to data scientists in two modes - Azure Data Lake Storage and Azure Synapse Analytics, using Azure Functions for flexible and automated data access
  • Conducted data modeling in Snowflake on Azure, implementing STAR schema & Snowflake schema, and utilized Azure SQL Database for structured data representation with Azure RBAC ensuring secure access
  • Implemented Azure Data Factory ETL processes, enhancing data integration from diverse sources to Azure Synapse Analytics, optimizing performance, and ensuring seamless processing for analytics
  • Executed data cleansing and transformation tasks with PySpark on Azure Databricks, harnessing Spark's parallel processing capabilities for enhanced efficiency and performance
  • Established collaborative data science workflows on Azure Databricks, fostering cross-functional collaboration between data scientists, analysts, and engineers, leading to a 20% increase in productivity
  • Employed Spark SQL queries for seamless integration with various data sources, significantly improving data extraction, transformation, and loading (ETL) processes by 30%
  • Designed and executed Azure Logic Apps for orchestrating data workflows, significantly streamlining and automating complex data processing tasks across Azure services
  • Successfully integrated Azure Synapse Analytics for ad-hoc querying of data stored in Azure Data Lake Storage, providing users with quick insights, and facilitating dynamic, on-demand analysis
  • Deployed Azure Event Hubs for real-time data streaming, ingesting and processing high-velocity data, and enabling timely analytics for dynamic, event-driven applications
  • Orchestrated data migration and synchronization between on-premises databases and Azure using Azure Database Migration Service (DMS)
  • Configured Azure Monitor for comprehensive monitoring of data pipelines and infrastructure, proactively identifying and addressing performance issues, ensuring optimal system reliability and performance
  • Automated the ingestion of web server log data using Azure Stream Analytics, streamlining the process of storing data in Azure Data Lake Storage
  • Successfully implemented advanced techniques such as Partitioning, Dynamic Partitions, and Buckets in Hive, contributing to improved performance and logical data organization
  • Developed and implemented automated data quality checks on Azure Databricks, reducing data errors by 15% and ensuring high data integrity across the organization
  • Utilized Apache Airflow to automate and streamline data workflows, reducing data engineers' overhead and allowing them to focus on higher-value tasks
  • Enhanced data warehousing on Snowflake on Azure, ensuring scalability, multi-cloud flexibility, secure collaboration, time travel, versioning, and integration, including star schema & snowflake schema design for efficient analytics
  • Developed robust solutions for real-time data streaming using Apache Kafka on Azure, Azure Stream Analytics, and Azure Databricks, enabling immediate access and analysis of continuously generated data
  • Streamlined and optimized ETL processes using Azure Databricks, resulting in a 25% reduction in data processing time and increased data availability for business users
  • Environment: Azure Blob Storage, Azure Data Factory, Azure Synapse Analytics, Azure SQL Database, Azure Data Lake Storage, Azure Databricks, Azure Logic Apps, Azure HDInsight, Azure Event Hubs, Azure Functions, Azure RBAC, Azure Cosmos DB, Python, Scala, Spark (PySpark, SparkSQL), Kafka, Linux, Java, Apache Airflow, PostgreSQL, Snowflake on Azure

Data Engineer

JP Morgan Chase
07.2019 - 03.2022
  • Strengthened data security protocols by implementing Azure Identity and Access Management (IAM) policies and Virtual Network (VNet) configurations, ensuring restricted access to sensitive healthcare information
  • Implemented and optimized Azure HDInsight clusters for parallelized processing of large-scale healthcare datasets, reducing processing time and enhancing analytics capabilities
  • Streamlined healthcare data workflows by developing automated processes with Azure Functions and Logic Apps, improving efficiency and reducing manual intervention
  • Orchestrated scalable and cost-effective healthcare data storage solutions on Azure Blob Storage, facilitating seamless access and retrieval for various analytical purposes
  • Configured and optimized Azure Virtual Machines (VMs) to host healthcare applications, ensuring optimal performance and responsiveness for healthcare professionals and end-users
  • Implemented Power BI to create interactive and insightful dashboards, providing healthcare stakeholders with real-time visualizations for data-driven decision-making
  • Facilitated collaborative healthcare data analysis by creating shared data environments on Azure, fostering cross-functional teamwork and knowledge sharing
  • Developed automated reporting solutions using Azure Functions and Azure Blob Storage, ensuring timely and accurate generation of healthcare reports for internal and external stakeholders
  • Implemented IAM policies and VNet configurations to ensure healthcare data management compliance with industry regulations, enhancing trust and data integrity
  • Implemented cost-saving measures by optimizing resource allocation and usage across Azure VMs, Azure Blob Storage, and other Azure services, ensuring efficient healthcare data infrastructure management
  • Implemented Azure Databricks for efficient analysis and visualization of relationships in large-scale datasets
  • Leveraged Azure Stream Analytics for real-time data streaming, enabling rapid processing and analysis of streaming data sources
  • Applied Azure Machine Learning pipelines for end-to-end model development, from data preparation to model deployment
  • Participated in all stages of SDLC, including requirement analysis, design, coding, testing, and production, for big data projects on Azure
  • Extensively utilized Azure Data Factory to import/export data between RDBMS and Azure Data Lake, creating data pipelines that track the last saved value to perform incremental imports
  • Implemented efficient data storage solutions on Azure Data Lake Storage, optimizing healthcare data accessibility and retrieval for diverse analytical needs
  • Leveraged Azure Databricks for scalable and resource-efficient healthcare data processing, ensuring seamless scalability to handle growing volumes of data
  • Implemented Azure Cosmos DB for real-time processing of healthcare data, enabling immediate access and analysis of continuously generated information for timely insights
  • Utilized Azure Synapse Analytics to query and analyze structured healthcare data, optimizing performance and resource utilization for analytical purposes
  • Developed and optimized PySpark scripts for processing healthcare datasets, incorporating Azure Synapse SQL for structured data analysis, and creating Directed Acyclic Graphs (DAGs) for efficient workflow orchestration
  • Environment: Azure Virtual Machines (VM), Azure Blob Storage, Azure Functions, Azure Logic Apps, Azure HDInsight, Azure RBAC (Role-Based Access Control), Power BI, Hive, Hadoop, Spark, SparkSQL, Scala, PySpark, Python, Sqoop, Kafka, Oracle

Big Data Developer

Homesite Insurance
03.2017 - 06.2019
  • Enhanced ETL processes using Python, SQL, and Java, improving efficiency by 30% for a large-scale data pipeline, resulting in faster data retrieval and analysis
  • Implemented Hadoop and Hive for handling vast datasets, reducing processing time by 40% and enabling seamless analysis of healthcare data
  • Optimized MapReduce jobs, leveraging Pig and Java, to process and transform raw data efficiently, enhancing the overall performance of big data processing pipelines
  • Implemented HBase for real-time healthcare data processing, enabling immediate access to continuously generated information and supporting timely insights for stakeholders
  • Utilized YARN to ensure scalable and resource-efficient data processing, enabling seamless scalability to handle growing volumes of healthcare data
  • Integrated Spark and PySpark for real-time data streaming, facilitating rapid processing and analysis of streaming data sources, enhancing responsiveness and insights
  • Applied Spark SQL for structured data analysis within healthcare datasets, optimizing performance and resource utilization for analytical purposes
  • Managed and optimized HDFS storage solutions, ensuring efficient accessibility and retrieval of healthcare data for diverse analytical needs
  • Implemented data processing solutions using Teradata and Oracle databases, improving data integrity and facilitating seamless integration with existing systems
  • Developed and executed efficient PySpark scripts for processing healthcare datasets, incorporating SparkSQL for structured data analysis, and creating DAGs for workflow orchestration
  • Deployed Postgres for structured data analysis, improving analytical capabilities and providing healthcare professionals with enhanced reporting and visualization tools
  • Collaborated in SDLC stages, from requirement analysis to production, ensuring successful implementation and maintenance of big data projects, contributing to improved healthcare data infrastructure
  • Environment: Hadoop, MapReduce, HDFS, Hive, HBase, Pig, Sqoop, Oozie, Java, SQL, Cloudera Manager, Linux, Cluster Management

ETL Informatica Developer

Kroger
01.2014 - 02.2017
  • Enhanced ETL workflows using Informatica, SQL, and Python, optimizing data extraction from Oracle databases, resulting in a 20% improvement in data processing efficiency
  • Developed complex Informatica mappings to transform and load data, leveraging SQL and Python for efficient data integration, contributing to streamlined ETL processes
  • Implemented Java transformations in Informatica, enhancing data processing capabilities and improving overall performance in a large-scale ETL environment for critical business systems
  • Utilized SQL queries to optimize Oracle database interactions, improving data retrieval efficiency and ensuring seamless integration with Informatica ETL processes for timely insights
  • Collaborated on ETL design and implementation, integrating Informatica and Python scripts for data validation, ensuring data accuracy and reliability across multiple systems
  • Conducted performance tuning of Informatica workflows, optimizing SQL queries, and enhancing overall ETL processing speed, resulting in significant time and resource savings
  • Designed and implemented Informatica workflows to extract, transform, and load data from various sources, utilizing SQL for data profiling and quality assurance
  • Applied Python scripting for data cleansing and transformation within Informatica workflows, improving data quality and facilitating accurate reporting for business stakeholders
  • Executed ETL tasks using Informatica PowerCenter, incorporating SQL optimization techniques, and enhancing overall system efficiency for large-scale data integration projects
  • Integrated Oracle PL/SQL within Informatica workflows, ensuring seamless communication between ETL processes and Oracle databases, enhancing data consistency and reliability
  • Developed Java-based custom transformations in Informatica, enabling complex data manipulations and contributing to the successful implementation of intricate ETL solutions
  • Automated Informatica ETL processes using Python scripts, reducing manual intervention, improving workflow reliability, and ensuring data consistency across diverse business systems
  • Environment: Informatica PowerCenter 9.6, Oracle 11g, PuTTY, Shell Scripting, Notepad++, ETL, Manual Testing, UNIX/Linux

Education

Master’s -

Webster University
12.2013

Bachelor’s -

JNTUH College of Engineering Hyderabad
01.2011

Skills

  • Azure HDInsight
  • Azure Data Factory
  • ADLS Gen2
  • Azure Blob Storage
  • Azure Synapse Analytics
  • Azure Databricks
  • Azure Cosmos DB
  • Azure DevOps
  • Purview
  • Azure Function Apps
  • Azure Logic Apps
  • Entra ID
  • Azure Resource Manager
  • Azure Virtual Machines
  • Azure Load Balancer
  • Spark
  • Hadoop
  • HDFS
  • MapReduce
  • YARN
  • Hive
  • Oozie
  • Pig
  • Sqoop
  • Presto
  • Zeppelin
  • Flink
  • ZooKeeper
  • Python
  • Scala
  • Java
  • SAS
  • PySpark
  • SQL
  • PL/SQL
  • T-SQL
  • HBase
  • MongoDB
  • MySQL
  • SQL Server
  • Oracle
  • PostgreSQL
  • Snowflake
  • Teradata
  • Tableau
  • Power BI
  • scikit-learn
  • Pandas
  • NumPy
  • PyTorch
  • TensorFlow
  • Azure ML
  • Git
  • GitHub
  • Bitbucket
  • Shell scripting
  • PowerShell
  • Bash
  • UNIX/Linux
  • Kafka
  • Confluent Kafka
  • Azure Event Hubs

Certification

  • AZ-305 - Azure Solutions Architect
  • DEA-C01 - SnowPro Advanced: Data Engineer
