Sudheer K

Hyderabad

Summary

  • Experienced Sr. Data Engineer with over nine years of proven history designing and deploying scalable data ingestion pipelines with Big Data, AWS, Microsoft Azure, PySpark, and Python, and migrating workloads from on-premises systems to Azure cloud solutions.
  • Developed and implemented end-to-end data integration solutions with Azure Data Factory, orchestrating workflows through linked services, source and sink datasets, pipelines, and activities to extract, transform, and load data from diverse sources into target systems.
  • Hands-on experience in Azure Databricks for distributed data processing, transformation, validation, cleansing, and ensuring data quality and integrity.
  • Designed and implemented end-to-end data workflows with Azure Logic Apps, Azure Functions, and serverless solutions.
  • Extensive experience implementing solutions using AWS services (EC2, S3, and Redshift), Hadoop HDFS architecture, and the MapReduce framework; worked in AWS environments to develop and deploy custom Hadoop applications.
  • Hands-on experience with Python Boto3 for developing AWS Lambda functions.
  • Highly competent with Azure Event Hubs for efficiently ingesting real-time streaming data.
  • Strong proficiency with Azure Synapse Pipelines for orchestrating and managing data integration and transformation workflows.
  • Ample hands-on experience with Azure Blob Storage, ensuring efficient storage and retrieval of unstructured and semi-structured data.
  • Managed databases and Azure data platform services (Azure Data Lake Storage (ADLS), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), SQL Server, Oracle, and data warehouses; built multiple data lakes.
  • Proficient in Python and Scala using the Spark framework.
  • Worked with a range of file formats, including CSV, JSON, Parquet, and Avro, ensuring optimized storage, processing, and data interchange within data engineering pipelines and analytics workflows.
  • Hands-on experience implementing data pipeline solutions using Hadoop, Azure, ADF, Synapse, PySpark, MapReduce, Hive, Tez, Python, Scala, Azure Functions, Logic Apps, StreamSets, ADLS Gen2, and Snowflake.
  • Strong background in data pipeline development and data modelling.
  • Proficient in Kafka streaming, using its distributed messaging capabilities to build resilient, high-performing data flows.
  • Designed, deployed, and optimized scalable ML workflows on GCP, leveraging services such as Vertex AI, BigQuery, and Cloud Functions for seamless integration and automation.
  • Experience with OpenAI and related technologies, including GPT-4, Vertex AI, and Llama, with expertise in prompt design, prompt engineering, and fine-tuning large language model (LLM) capabilities.
  • Used GitHub Copilot, GitHub's AI-powered assistant, to streamline code development, improve code quality, and enhance collaboration within the data science team.
  • Built the codebase for a natural language processing and AI/ML framework; used C++ STL containers and algorithms in the application.
  • Applied ML, data mining, and statistical techniques to create new, scalable solutions for business problems.
  • Designed the AI/ML data pipeline for regular monitoring and performance evaluation of deployed ML models.
  • Worked with Google Cloud Storage (GCS) and Pub/Sub to manage streaming and batch data processing for ML pipelines.
  • Developed scalable microservices and REST APIs for model serving using Cloud Run and App Engine.
  • Expertise in Spark Streaming for designing and deploying real-time data pipelines that efficiently process large data volumes from diverse sources.
  • Scheduled Hadoop jobs with Apache Oozie and imported/exported data between HDFS and relational database systems using Sqoop.
  • Optimized Hive and Spark query performance through strategic bucketing and partitioning for efficient data retrieval and storage, with extensive hands-on experience tuning Spark jobs.
  • Performed analysis using Python libraries such as PySpark.
  • Strong experience with Elastic MapReduce (EMR) and setting up environments on Amazon EC2 instances.
  • Configured and managed Zookeeper to coordinate and synchronize distributed data processing systems.
  • Formulated and implemented data integration strategies connecting Snowflake with external systems, using Apache Airflow or custom-built orchestration frameworks to ensure seamless data movement and synchronization.
  • Integrated Snowflake with Azure Data Factory to orchestrate complex ETL pipelines, significantly optimizing data migration from diverse sources into Azure-based data warehouses.
  • Demonstrated expertise in Snowflake features such as Zero-Copy Cloning, Time Travel, and Data Sharing for efficient data management.
  • Implemented data pipelines using SnowSQL, Snowflake integrated services, and Snowpipe.
  • Implemented SQL analytical and window functions for advanced data analysis.
  • Proficient in Informatica Cloud for cloud-based data integration and management.
  • Partnered effectively with data analysts and stakeholders to implement data models, structures, and designs.

Overview

10 years of professional experience

Work History

Sr. Data Engineer

Wells Fargo
San Francisco
06.2024 - Current
  • Designed and implemented end-to-end data pipelines using Azure Data Factory to facilitate efficient data ingestion, transformation, and loading (ETL) from diverse data sources into Snowflake data warehouse.
  • Orchestrated robust data processing workflows utilizing Azure Databricks and Apache Spark for seamless large-scale data transformations and advanced analytics, improving data processing speed by 14% (see the sketch after this list).
  • Designed and implemented migration strategies for traditional systems to Azure (lift-and-shift, Azure Migrate, and other third-party tools) and worked across the Azure suite.
  • Developed real-time data streaming capabilities into Snowflake by seamlessly integrating Azure Event Hubs and Azure Functions, enabling prompt and reliable data ingestion.
  • Loaded data into Azure Synapse Analytics using Azure Data Factory.
  • Deployed Azure Data Lake Storage as a reliable and scalable data lake solution, implementing efficient data partitioning and retention strategies to store and manage both raw and processed data effectively.
  • Employed Azure Blob Storage for optimized data file storage and retrieval, implementing advanced techniques like compression and encryption to bolster data security and streamline storage costs.
  • Integrated Azure Logic Apps seamlessly into the data workflows, ensuring comprehensive orchestration and triggering of complex data operations based on specific events, enhancing overall data pipeline efficiency.
  • Enforced data governance and comprehensive data quality checks using Azure Data Factory and Snowflake, guaranteeing the highest standards of data accuracy and consistency.
  • Implemented robust data replication and synchronization strategies between Snowflake and other data platforms leveraging Azure Data Factory and Change Data Capture techniques, ensuring data integrity and consistency and reducing data inconsistencies.
  • Designed and implemented efficient data archiving and retention strategies utilizing Azure Blob Storage and leveraging Snowflake's Time Travel feature, ensuring optimal data management and regulatory compliance.
  • Developed and deployed Azure Functions to handle critical data preprocessing, enrichment, and validation tasks within the data pipelines, elevating the overall data quality and reliability.
  • Worked on Azure Machine Learning and Snowflake to architect and execute advanced analytics and machine learning workflows, enabling predictive analytics and data-driven insights and achieving a 23% improvement in predictive accuracy.
  • Developed custom monitoring and alerting solutions using Azure Monitor and Snowflake Query Performance Monitoring (QPM), providing proactive identification and resolution of performance bottlenecks.
  • Integrated Snowflake seamlessly with Power BI and Azure Analysis Services to deliver interactive dashboards and reports, empowering business users with self-service analytics capabilities.
  • Optimized data pipelines and Spark jobs in Azure Databricks through advanced techniques like Spark configuration tuning, data caching, and data partitioning, resulting in superior performance and efficiency.
  • Implemented comprehensive data cataloging and data lineage solutions using Azure Purview and Apache Atlas, enabling in-depth understanding and visualization of data assets and their interdependencies.
  • Architected and optimized high-performing Snowflake schemas, tables, and views to accommodate complex analytical queries and reporting requirements, ensuring exceptional scalability and query performance.
  • Collaborated closely with cross-functional teams including data scientists, data analysts, and business stakeholders, ensuring alignment with data requirements and delivering scalable and reliable data solutions.
  • Environment: Azure Data Factory, Azure Databricks, Snowflake data warehouse, Azure Event Hubs, Azure Functions, Azure Data Lake Storage, Azure Blob Storage, Azure Logic Apps, Azure Machine Learning, Azure Monitor, Power BI, Azure Analysis Services, Azure Purview, Apache Atlas.
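
Purely as an illustration of the Databricks transformation work referenced above, the sketch below shows a minimal PySpark job that reads raw JSON from ADLS, applies a simple validation and enrichment step, and writes partitioned Parquet to a curated zone. The storage account, container, and column names are hypothetical placeholders, not the actual pipelines or schemas.

    # Illustrative sketch only - account, container, and column names are placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("curate-orders").getOrCreate()

    # Read raw JSON landed in a hypothetical "raw" container on ADLS Gen2.
    raw = spark.read.json("abfss://raw@examplelake.dfs.core.windows.net/orders/")

    curated = (raw
               .filter(F.col("order_id").isNotNull())              # basic data-quality check
               .withColumn("order_ts", F.to_timestamp("order_ts"))  # enforce timestamp type
               .withColumn("order_date", F.to_date("order_ts")))    # derive the partition key

    # Write partitioned Parquet to a hypothetical "curated" container.
    (curated.write
        .mode("overwrite")
        .partitionBy("order_date")
        .parquet("abfss://curated@examplelake.dfs.core.windows.net/orders/"))

In practice such a notebook would be triggered from an Azure Data Factory pipeline; the trigger wiring is omitted here.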

Data Engineer

Charles Schwab
Westlake
04.2023 - 05.2024
  • Developed various data loading strategies and performed transformations to analyze datasets using the Hortonworks Hadoop distribution.
  • Implemented solutions for ingesting data from various sources using Big Data technologies such as Hadoop, the MapReduce framework, and Hive.
  • Developed PySpark applications using Spark SQL, DataFrames, and transformations via the Python API to implement business requirements on Hive staging tables and load the final transformed data into Hive master tables (see the sketch after this list).
  • Worked as a Hadoop consultant on technologies such as MapReduce, Pig, and Hive.
  • Involved in ingesting large volumes of credit data from multiple provider data sources to AWS S3. Created modular and independent components for AWS S3 connections, data reads.
  • Implemented data warehouse solutions in AWS Redshift by migrating data from S3 to Redshift.
  • Developed Spark code using Python to run in the EMR clusters.
  • Created User Defined Functions (UDF) using Scala to automate some business logic in the applications.
  • Automated jobs and data pipelines using AWS Step Functions and AWS Lambda, and configured performance metrics using AWS CloudWatch.
  • Worked with Apache Hadoop ecosystem components such as HDFS, Hive, Pig, and MapReduce.
  • Designed AWS Glue pipelines to ingest, process, and store data interacting with different services in AWS.
  • Implemented usage of Amazon EMR for processing Big Data across Hadoop Cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Developed a process to migrate local logs to CloudWatch for better integration and monitoring.
  • Executed programs using the Python API for Apache Spark (PySpark).
  • Helped DevOps engineers deploy code and debug issues.
  • Wrote Hadoop jobs to analyze data in text, sequence-file, and Parquet formats using Hive and Pig.
  • Worked on analyzing the Hadoop cluster and various Big Data components, including Pig, Hive, Spark, and Impala.
  • Populated database tables via AWS Kinesis Firehose and AWS Redshift.
  • Developed Spark code using Python and Spark-SQL for faster testing and data processing.
  • Created Hive External tables and loaded the data into tables and query data using HQL.
  • Developed ETL modules and data workflows for solution accelerators using PySpark and Spark SQL.
  • Used Spark SQL to process large volumes of structured data.
  • Extracted data from MySQL and AWS Redshift into HDFS using Kinesis.
  • Developed a PySpark application to create reporting tables with different masking rules in both Hive and MySQL, and made them available to newly built fetch APIs.
  • Wrote Spark code in Scala for data extraction, transformation, and aggregation across multiple file formats.
  • Environment: Big Data, Spark, Hive, Pig, Python, Hadoop, AWS, Databases, AWS Redshift, Agile, SQL, HQL, Impala, CloudWatch, AWS Kinesis.
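
As an illustration of the Hive staging-to-master loading pattern mentioned above, the sketch below reads a hypothetical staging table, applies a placeholder business rule and de-duplication, and overwrites a master table. The database, table, and column names are assumptions; the real business logic is not reproduced here.

    # Illustrative sketch only - database, table, and column names are placeholders.
    from pyspark.sql import SparkSession, functions as F

    # Enable Hive support so Spark can read and write metastore-backed tables.
    spark = (SparkSession.builder
             .appName("staging-to-master")
             .enableHiveSupport()
             .getOrCreate())

    staged = spark.table("staging_db.transactions_stg")

    master = (staged
              .filter(F.col("amount") > 0)          # stand-in for the actual business rule
              .dropDuplicates(["txn_id"])           # remove replayed records
              .withColumn("load_dt", F.current_date()))

    # Overwrite the (hypothetical) Hive master table with the transformed data.
    master.write.mode("overwrite").saveAsTable("master_db.transactions")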

Big Data Engineer

USAA
Texas
02.2022 - 03.2023
  • Designed and implemented a scalable ETL framework using Sqoop, Pig, and Hive to efficiently extract, transform, and load data from various sources, ensuring seamless data availability for consumption.
  • Processed data stored in Hadoop Distributed File System (HDFS), leveraging Hive to create external tables and developing reusable scripts for efficient table ingestion and repair across the project.
  • Developed robust ETL jobs using Spark and Scala to migrate data from Oracle to new MySQL tables, ensuring smooth data transfer and maintaining data integrity.
  • Leveraged the powerful capabilities of Spark, including RDDs, Data Frames, and Spark SQL, along with Spark-Cassandra Connector APIs, for diverse data tasks such as data migration and generating comprehensive business reports.
  • Engineered a high-performance Spark Streaming application for real-time sales analytics, enabling timely insights and decision-making (an illustrative sketch follows this list).
  • Conducted comprehensive analysis of source data, effectively handled data type modifications, and utilized Excel sheets, flat files, and CSV files to generate on-demand Power BI reports.
  • Analysed SQL scripts and devised optimal solutions using PySpark, ensuring efficient data processing and transformation.
  • Leveraged Sqoop to efficiently extract data from multiple data sources into HDFS, facilitating seamless data integration.
  • Orchestrated data imports from various sources, executed transformations using Hive and MapReduce, and loaded processed data into HDFS.
  • Successfully extracted data from MySQL databases into HDFS using Sqoop, enabling seamless data transfer and integration.
  • Implemented streamlined automation for deployments using YAML scripts, resulting in accelerated and efficient build and release processes.
  • Expertly utilized Apache Hive, Apache Pig, HBase, Apache Spark, Zookeeper, Flume, Kafka, and Sqoop, leveraging their capabilities to optimize data processing and management.
  • Developed data classification algorithms using MapReduce design patterns, enhancing data processing efficiency and accuracy.
  • Employed advanced techniques including combiners, partitioning, and distributed cache to optimize the performance of MapReduce jobs.
  • Effectively utilized Git and GitHub repositories for comprehensive source code management and version control, fostering efficient collaboration and ensuring traceability of code changes.
  • Environment: Sqoop, Pig, HDFS, Power BI, Apache Cassandra, ZooKeeper, Flume, Kafka, Apache Spark, Scala, Hive, Hadoop, Cloudera, HBase, MySQL, YAML, JIRA, Git, GitHub.
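
The real-time sales analytics noted above can be sketched with Spark Structured Streaming over Kafka, as below. The broker address, topic name, and event schema are hypothetical, and this is not the production application.

    # Illustrative sketch only - broker, topic, and schema are placeholders.
    # Requires the spark-sql-kafka connector package on the cluster.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("sales-stream").getOrCreate()

    # Hypothetical schema of the JSON sales events on the Kafka topic.
    schema = StructType([
        StructField("store_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
           .option("subscribe", "sales")                        # placeholder topic
           .load())

    sales = (raw
             .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

    # Per-store revenue over 5-minute event-time windows, tolerating late data.
    agg = (sales
           .withWatermark("event_ts", "10 minutes")
           .groupBy(F.window("event_ts", "5 minutes"), "store_id")
           .agg(F.sum("amount").alias("revenue")))

    query = (agg.writeStream
             .outputMode("update")
             .format("console")    # console sink for the sketch; a real sink would differ
             .start())
    query.awaitTermination()

In a production setting the aggregate would be written to a durable sink (for example a Cassandra or reporting table) rather than the console.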

Data Warehouse Developer

Zensar Technologies
Hyderabad
10.2018 - 06.2021
  • Conducted comprehensive requirement analysis to identify data extraction needs from various source systems, including Netezza, DB2, Oracle, and flat files, for seamless integration into the Salesforce application.
  • Designed and developed robust ETL processes using Informatica Power Center to efficiently extract data from diverse sources and load it into the target data warehouse.
  • Implemented advanced performance tuning techniques to optimize data mappings and address bottlenecks in the data transfer process, resulting in improved efficiency and faster data processing.
  • Utilized Informatica Power Center tools, such as Designer, Workflow Manager, Workflow Monitor, and Repository Manager, to streamline development, monitoring, and management of ETL workflows, ensuring smooth execution and enhanced productivity.
  • Created intricate data mappings from scratch, leveraging a wide range of Informatica Designer Tools, including Source Qualifier, Aggregate, Lookup, Expression, Normalizer, Filter, Router, Rank, Sequence Generator, Update Strategy, and Joiner transformations, to ensure accurate data transformation and seamless integration.
  • Implemented efficient Incremental Loading mappings using Mapping Variables and Parameter Files, enabling incremental data transfer, and optimizing the overall ETL process for efficient data synchronization.
  • Developed reusable Transformations and Mapplets to promote code reusability, reduce development effort, and enhance the maintainability of the ETL workflows.
  • Identified and resolved performance bottlenecks by leveraging the capabilities of the Netezza Database, optimizing Index Cache and Data Cache, and utilizing Rank, Lookup, Joiner, and Aggregator transformations for efficient data processing.
  • Created and executed Netezza SQL scripts to ensure accurate table loading, and developed SQL scripts for validating row counts and verifying data integrity, ensuring data accuracy and reliability (see the sketch after this list).
  • Conducted comprehensive debugging and troubleshooting of Informatica Sessions using the Debugger and Workflow Monitor, enabling timely issue resolution and ensuring the smooth execution of ETL workflows.
  • Utilized Session Logs and Workflow Logs for effective error handling and troubleshooting in the development (DEV) environment, ensuring the stability and integrity of the ETL processes.
  • Prepared detailed ETL design documents and Unit Test plans for Mappings, ensuring comprehensive documentation and adherence to rigorous testing procedures to deliver high-quality solutions.
  • Compiled meticulous code migration documents and collaborated closely with the release team to facilitate the seamless migration of Informatica Objects and Unix Scripts across development, test, and production environments, ensuring successful deployment and minimizing downtime.
  • Successfully deployed ETL component code into multiple environments, strictly following the necessary approvals and adhering to established release procedures, ensuring seamless integration and minimizing disruption.
  • Provided dedicated production support by executing sessions, diagnosing problems, and making necessary adjustments to mappings based on changes in business logic, ensuring the uninterrupted flow of data and smooth operation of the ETL workflows.
  • Conducted rigorous Unit testing and Integration testing of mappings and workflows to validate their functionality and reliability, ensuring the accuracy and integrity of data throughout the ETL process.
  • Ensured strict adherence to client security policies and obtained all required approvals for code migration between environments, safeguarding data privacy and maintaining compliance with regulatory standards.
  • Actively participated in daily status calls with internal teams and provided comprehensive weekly updates to clients through detailed status reports, fostering effective communication, transparency, and project alignment.
  • Environment: Informatica Power Center, Repository Manager, Designer, Workflow Manager, Workflow Monitor, Repository Administration Console, Netezza, Oracle Developer, Oracle 11g, SQL Server 2016, T-SQL, TOAD, UNIX, HP Quality Center, Autosys, MS Office Suite.
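
The row-count validation described above was written in Netezza SQL; purely as an illustration, the Python sketch below performs the same source-versus-target count check over ODBC using pyodbc. The DSNs and table names are hypothetical placeholders, and this is not the original validation script.

    # Illustrative sketch only - DSNs and table pairs are placeholders.
    import pyodbc

    SRC_DSN = "DSN=source_db"     # hypothetical ODBC DSN for the source system
    TGT_DSN = "DSN=netezza_dw"    # hypothetical ODBC DSN for the Netezza warehouse

    # Placeholder (source table, target table) pairs to compare.
    TABLES = [("SRC_SCHEMA.ORDERS", "DW_SCHEMA.ORDERS")]

    def row_count(dsn, table):
        # Table names come from the fixed list above, not user input.
        with pyodbc.connect(dsn) as conn:
            cur = conn.cursor()
            cur.execute(f"SELECT COUNT(*) FROM {table}")
            return cur.fetchone()[0]

    for src_table, tgt_table in TABLES:
        src = row_count(SRC_DSN, src_table)
        tgt = row_count(TGT_DSN, tgt_table)
        status = "OK" if src == tgt else "MISMATCH"
        print(f"{src_table} -> {tgt_table}: source={src} target={tgt} [{status}]")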

Data Warehouse Developer

Value Momentum
Hyderabad
01.2015 - 09.2018
  • Actively participated in Agile Scrum Methodology, engaging in daily stand-up meetings. Proficiently utilized Visual SourceSafe for Visual Studio 2010 for version control and effectively managed project progress using Trello.
  • Implemented advanced reporting functionalities in Power BI, including Drill-through and Drill-down reports with interactive Drop-down menus, data sorting capabilities, and subtotals for enhanced data analysis.
  • Employed Data warehousing techniques to develop a comprehensive Data Mart, serving as a reliable data source for downstream reporting. Developed a User Access Tool empowering users to create ad-hoc reports and execute queries for in-depth analysis within the proposed Cube.
  • Streamlined the deployment of SSIS Packages and optimized their execution through the creation of efficient job configurations.
  • Demonstrated expertise in building diverse Cubes and Dimensions using different architectures and data sources for Business Intelligence. Proficiently utilized MDX Scripting to enhance Cube functionality and support advanced analytics.
  • Automated report generation and Cube refresh processes by creating SSIS jobs, ensuring the timely and accurate delivery of critical information.
  • Excelled in deploying SSIS Packages to production, leveraging various configuration options to export package properties and achieve environment independence.
  • Utilized SQL Server Reporting Services (SSRS) to author, manage, and deliver comprehensive reports, both in print and interactive web-based formats.
  • Developed robust stored procedures and triggers to enforce data consistency and integrity during data entry operations.
  • Leveraged the power of Snowflake to facilitate seamless data sharing, enabling quick and secure data exchange without the need for complex data pipelines.
  • Environment: Windows server, MS SQL Server 2014, SSIS, SSAS, SSRS, SQL Profiler, Power BI, Performance Point Server, MS Office, SharePoint.

Education

Bachelor’s - Computer Science

Andhra University
05.2015

Master of Science - Information Studies

Trine University
12.2023

Skills

  • Azure Data Factory
  • Azure Databricks
  • Logic Apps
  • Function Apps
  • Snowflake
  • Azure DevOps
  • EC2
  • S3
  • Redshift
  • Lambda
  • CloudWatch
  • Glue
  • HQL
  • MapReduce
  • Hive
  • Python
  • PySpark
  • Scala
  • Kafka
  • Spark Streaming
  • Oozie
  • Sqoop
  • Zookeeper
  • Cloudera
  • Hortonworks
  • SQL
  • PL/SQL
  • HiveQL
  • HTML
  • CSS
  • JavaScript
  • XML
  • JSP
  • RESTful
  • SOAP
  • Windows (XP/7/8/10)
  • UNIX
  • LINUX
  • UBUNTU
  • CENTOS
  • Ant
  • Maven
  • GIT
  • GitHub
  • Eclipse
  • Visual Studio
  • MS SQL Server
  • Azure SQL DB
  • Azure Synapse
  • MS Excel
  • MS Access
  • Oracle
  • Cosmos DB

Timeline

Sr. Data Engineer

Wells Fargo
06.2024 - Current

Data Engineer

Charles Schwab
04.2023 - 05.2024

Big Data Engineer

USAA
02.2022 - 03.2023

Data Warehouse Developer

Zensar Technologies
10.2018 - 06.2021

Data Warehouse Developer

Value Momentum
01.2015 - 09.2018

Bachelor’s - Computer Science

Andhra University

Master of Science - Information Studies

Trine University