
Vasu Valasapalli

Prosper, TX

Summary

Data engineer experienced in designing and optimizing data pipelines to ensure seamless data flow. Applies advanced SQL and Python skills to build and maintain robust data architectures. Track record of implementing scalable solutions that enhance data integrity and support informed decision-making.

Overview

13 years of professional experience

Work History

Data Engineer

Diversified
09.2024 - Current
  • Leading the development of advanced data engineering solutions to optimize data processing, analytics, and machine learning workflows for enterprise clients
  • Designing and implementing scalable data pipelines using Snowflake, Spark, and AWS to support real-time and batch data processing needs
Responsibilities:
  • Developed and maintained optimal data pipeline architecture in the Snowflake data warehouse
  • Led data discovery, handling structured and unstructured data, cleaning and performing descriptive analysis, and storing as normalized tables for dashboards
  • Converted Hive/SQL queries into transformations using Python
  • Designed and developed scalable data pipelines and ETL workflows using Python, optimizing data ingestion, transformation, and storage processes
  • Built and maintained batch and streaming data pipelines, leveraging Apache Kafka, Spark Streaming, and AWS Kinesis for real-time data processing
  • Performed data deduplication and profiling for many production tables, and imported data files from S3 and the SQL Workbench data pumper into Redshift tables
  • Performed complex joins on Hive tables with various optimization techniques; created internal and external Hive tables with appropriate static and dynamic partitions for efficiency, and worked extensively with Hive DDLs and Hive Query Language (HQL)
  • Applied data mining and aggregation to reduce the customer complaint rate (CCR) and improve product quality in the marketplace
  • Used Spark and Scala to develop machine learning algorithms that analyze clickstream data
  • Used Spark SQL to pre-process, clean, and join large data sets
  • Opened SSH tunnels to Google Dataproc to access the YARN resource manager and monitor Spark jobs
  • Integrated Go-based services with distributed data storage systems such as Apache Hadoop and Apache Spark, facilitating real-time data processing and analytics
  • Worked with the Play framework and Akka for parallel processing
  • Developed Spark applications using Scala for easy Hadoop transitions
  • Wrote Spark jobs and streaming APIs using Scala and Python
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
  • Developed Spark code and Spark-SQL/Streaming for faster data testing and processing
  • Used Python to compare Spark's performance with Hive and SQL/Oracle
  • Involved in creating UDFs in Spark using Scala and Python programming languages
  • Managed the Hadoop cluster using Cloudera Manager
  • Designed and implemented schema evolution strategies and data lake architectures using Apache Iceberg for optimized storage and analytics
  • Used GitHub repository for code reviews, committing, and retrieving code
  • Identified, designed, and implemented internal process improvements: automating manual processes, optimizing data delivery, and re-designing infrastructure for greater scalability
  • Optimized performance by creating and modifying database triggers, stored procedures, and complex analytical queries, including multi-table joins, nested queries, and correlated subqueries
  • Converted designs into infrastructure code, developing solutions with AWS services (VPC, EC2, S3, ELB, EBS, RDS, IAM, CloudFormation, Route 53, CloudWatch, CloudFront, and CloudTrail) and orchestration tools such as Kubernetes, Docker, and Ansible
  • Provided technical leadership and mentorship to junior engineers, ensuring adherence to best practices in data engineering, automation, and cloud computing
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB
  • Built Snowpipe pipelines for data ingestion
  • Used Snowflake Clone and Time Travel features, and developed stored procedures, triggers, and tasks
  • Environment: Python, Snowflake, Spark-SQL, Spark, AWS, Kubernetes, Docker, AWS EMR, Amazon S3, DynamoDB, Hadoop, GitHub, Scala, SQL/Oracle.

Data Engineer

AllianzLife
10.2019 - 08.2024
  • Engineered large-scale data solutions to support AllianzLife’s actuarial and analytics teams
  • Designed and optimized ETL workflows in AWS to process high-volume insurance data
  • Developed scalable data lakes, integrating real-time streaming with Kafka and Spark for policy performance analysis
  • Automated performance calculations using Airflow and implemented cloud-based solutions with Scala and Akka to enhance data processing capabilities
Responsibilities:
  • Worked closely with business analysts to convert business requirements into technical requirements, preparing low- and high-level documentation, and collaborated closely with architects
  • Participated in all phases of the project life cycle, including data collection, data mining, data cleaning, model development, validation, and reporting
  • Created a data lake using Spark for downstream applications; designed and developed Scala workflows to pull data from cloud-based systems and apply transformations
  • Developed Spark applications using Spark-SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns
  • Developed and deployed serverless data processing solutions using AWS Lambda, Step Functions, Glue, and Athena, improving automation and efficiency
  • Implemented Infrastructure as Code (IaC) using AWS CDK and Terraform to automate cloud resource provisioning and ensure scalable deployments
  • Applied data modeling best practices, optimized database performance tuning, and enhanced query execution efficiency in Snowflake, Redshift, and DynamoDB
  • Performed transformations using Hive and MapReduce; copied .log and Snappy files into HDFS from Greenplum using Flume and Kafka, and loaded data into HDFS from MySQL using Sqoop
  • Imported required tables from RDBMS into HDFS using Sqoop, and used Storm/Spark Streaming with Kafka to stream real-time data into HBase
  • Implemented scalable infrastructure and optimized computational resources to support the training and deployment, ensuring high-performance and cost-effective operations at scale
  • Worked extensively on importing metadata into Hive and migrated existing tables and applications to Hive and the AWS cloud
  • Used AWS Redshift, S3, Spectrum, and Athena services to query the extensive amounts of data stored on S3 to create a Virtual Data Lake without going through the ETL process
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC, Parquet, and text files) into AWS Redshift
  • Developed views and templates with Python and Django's view controller and templating language to create a user-friendly website interface
  • Wrote MapReduce jobs for text mining in collaboration with a predictive analysis team, and worked with Hadoop components such as HBase, Spark, YARN, Kafka, Zookeeper, Pig, Hive, Sqoop, Oozie, Impala, and Flume
  • Performed extensive data reading and writing to and from CSV and Excel files using Pandas
  • Tasked with maintaining RDDs using SparkSQL
  • Implemented two workflows in the Data Platform
  • Built the simple workflow, the most basic workflow offered by the Data Platform, for ingesting data, curating it, and writing it to a data store (AWS DynamoDB), separating valid and invalid data using the rule engine
  • Used Spark Streaming to create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files
  • Worked with AWS-native workflow orchestration tools, such as Apache Airflow and Step Functions, to automate and monitor end-to-end data pipelines
  • Wrote HIVE UDFs as per requirements and handled different schemas and XML data
  • Implemented ETL code to load data from multiple sources into HDFS using Pig Scripts
  • Developed data pipeline using Python and Hive to load data into data link
  • Performed data analysis and data mapping for several data sources
  • Loaded data into S3 buckets using AWS Glue and PySpark
  • Involved in filtering data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB
  • Designed a new Member and Provider booking system, which allows providers to book new slots by sending the member leg and provider leg directly to TP through DataLink
  • Use Pandas, NumPy, and other Python tools to analyze various types of raw files, such as JSON, CSV, and XML
  • Developed Spark applications using Scala for easy Hadoop transitions
  • Wrote Spark jobs and streaming APIs using Scala and Python
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
  • Developed Spark code and Spark-SQL/Streaming for faster data testing and processing
  • Designed and implemented schema evolution strategies and data lake architectures using Apache Iceberg for optimized storage and analytics
  • Collaborated with cross-functional teams, including data scientists, analysts, and DevOps teams, to design efficient and scalable data solutions
  • Automated the existing scripts for performance calculations using scheduling tools like Airflow
  • Created cloud-based software solutions in Scala using Spray IO, Akka, and Slick
  • Fetched live stream data from DB2 into HBase tables using Spark Streaming and Apache Kafka
  • Scheduled and monitored job workflows using tools such as Oozie and Zookeeper
  • Populated HDFS and Cassandra with vast amounts of data using Apache Kafka
  • Environment: Map Reduce, HDFS, Hive, Pig, HBase, Python, SQL, Sqoop, Flume, Oozie, Impala, Scala, Spark, Apache Kafka, Play, AWS, AKKA, Zookeeper, Linux Red Hat, HP-ALM, Eclipse, Cassandra, SSIS.

Data Engineer

RBC
03.2018 - 09.2019
  • Developed and optimized financial data pipelines for RBC’s banking and investment platforms
  • Led real-time data transformations using Hive and MapReduce to improve data accessibility for reporting and analytics
  • Integrated Spark Streaming and Kafka for real-time transaction monitoring, improving fraud detection capabilities
  • Played a key role in Agile development, ensuring the timely delivery of data-driven solutions
Responsibilities:
  • Performed transformations using Hive and MapReduce; copied .log and Snappy files into HDFS from Greenplum using Flume and Kafka, and loaded data into HDFS from MySQL using Sqoop and Databricks
  • Imported required tables from RDBMS into HDFS using Sqoop, and used Storm/Spark Streaming with Kafka to stream real-time data into HBase
  • Developed and monitored applications in the Hortonworks Hadoop data lake environment
  • Developed scripts using SQL mappings and joins in Hive (HiveQL) to perform testing
  • Configured Spark Streaming in Python to receive real-time data from Kafka and store it in HDFS
  • Used PySpark to prepare data per business needs and published the processed data to HDFS
  • Performed data profiling, analysis, and visualization for PHI and sensitive data; analyzed and visualized the data from sources and created the mapping document
  • Conducted data profiling and analysis to identify database anomalies and nuances
  • Performed project-specific, ad-hoc analysis and developed reproducible scripts where analytical approaches met business requirements
  • Performed complex HQL coding, including data flows for evaluating business rules, and created the workflows needed to segregate data and load it into final database objects for data visualization using the analytical capabilities of Hive
  • Applied data modeling best practices, optimized database performance tuning, and enhanced query execution efficiency in Snowflake, Redshift, and DynamoDB
  • Performed ETL and Sqoop operations on the received data files to extract and load the data into proprietary databases, such as AAH, PSI, DB2, DataLake, etc.
  • Extended the design to document low-level design specifications, including data flow, workflow, integration rules, normalization, and standardization methods
  • Communicated the day-to-day progress of both on-site and offshore teams to the client manager and ensured work was tracked and completed per project schedules
  • Collaborated with cross-functional teams, including data scientists, analysts, and DevOps teams, to design efficient and scalable data solutions
  • Took an active part in the Agile development process, participating in all Agile ceremonies, including scrums, sprint planning, backlog grooming, sprint demos, and retrospectives
  • Environment: HDFS, Hive, HBase, Python, SQL, Sqoop, Flume, Oozie, Impala, Scala, Spark, Apache Kafka, Play, AKKA, Zookeeper, Linux Red Hat, HP-ALM.

Data Engineer

Insurance Auto Auctions
06.2016 - 02.2018
  • Designed and implemented data processing workflows to support Truss Health’s data analytics and regulatory reporting
  • Developed Spark-based ETL pipelines to ingest and transform healthcare data, ensuring compliance with industry standards
  • Built Apache Kafka-based real-time data streaming solutions to improve data accessibility and enable faster patient care and operational analytics decision-making
Responsibilities:
  • Developed scripts for data flow processes from various sources to target databases to test the flow of data
  • Used Spark, HiveQL, and Pig extensively for data retrieval, querying, storage, and transformation
  • Developed Apache Kafka streams using console consumers and created Phoenix tables on streaming JSON data for user validation
  • Created Spark Streaming jobs to move real-time and batch data from the source (DB2) to the data lake
  • Parsed different file types, such as XML, JSON, and fixed-width files, flattening them into text format and loading them into tables
  • Used Sqoop to load batch data from source to target
  • Implemented Kafka to consume the live streaming data and integrated the Zena tool using Shell scripts to automate the data process
  • Created Hive Tables and performed required queries on top of the tables
  • Scheduled jobs using time- and event-based triggers
  • Analyzed the quality, size, format, and frequency of data from sources
  • Designed transformation rules based on the tables involved and analyzed join criteria to capture the required fields
  • Designed mapping documents by analyzing the data, for reference during script development and unit testing as entry criteria for QA
  • Documented test cases and workflow instructions
  • Used JIRA and ALM to track issues and defects and to load test cases and test results
  • Facilitated knowledge transfer from the client to the offshore team, conveying the necessary technical and business knowledge
  • Environment: Map Reduce, Hive, Pig, HBase, Python, SQL, Sqoop, Flume, Oozie, Impala, Scala, Spark, Zookeeper, Linux Red Hat, HP-ALM, Eclipse.

Software Developer

Walmart Ellicot
03.2012 - 05.2016
  • Developed and maintained Walmart’s customer analytics and e-commerce applications using Python and Django
  • Engineered Spark-based data aggregation solutions to optimize inventory and customer sentiment analysis
  • Built scalable web applications for tracking customer complaints, improving issue resolution and customer satisfaction
  • Enhanced search engine optimization and database management systems for e-commerce platforms
Responsibilities:
  • Involved in the project's analysis, design, implementation, and testing
  • Exposed to various phases of the Software Development Life Cycle using Scrum Software development methodology
  • Developed views and templates with Python and Django's view controller and templating language to create a user-friendly website interface
  • Developed the customer complaints application using Django Framework, which includes Python code
  • Strong understanding and practical experience in developing Spark applications with Python
  • Developed Scala scripts and UDFs using both Data frames/SQL in Spark for Data aggregation
  • Designed, developed, tested, deployed, and maintained the website
  • Developed entire frontend and backend modules using Python on Django Web Framework
  • Developed Python scripts to update content in the database and manipulate files
  • Rewrote existing Java applications in Python to deliver a specific data format
  • Generated property lists for every application dynamically using Python
  • Designed and developed the website's UI using HTML, XHTML, AJAX, CSS, and JavaScript
  • Wrote Python scripts to parse XML documents and load the data in the database
  • Handled all the client-side validation using JavaScript
  • Performed testing using Django’s Test Module
  • Designed and developed a data management system using MySQL
  • Created unit test and regression test frameworks for existing and new code
  • Responsible for search engine optimization to improve the visibility of the website
  • Responsible for debugging and troubleshooting the web application
  • Environment: Python, Django, Java, MySQL, Linux, HTML, XHTML, CSS, AJAX, JavaScript, Apache Web Server.
  • Improved software efficiency by troubleshooting and resolving coding issues.
  • Saved time and resources by identifying and fixing bugs before product deployment.
  • Collaborated with cross-functional teams to deliver high-quality products on tight deadlines.
  • Developed software for desktop and mobile operating systems.
  • Increased development speed by automating repetitive tasks using scripts and tools.

Education

Bachelor of Science - Information Technology

JNTU
India

Skills

  • ETL development
  • Data warehousing
  • Data modeling
  • Data pipeline design
  • Scripting languages
  • SQL expertise
  • Machine learning
  • NoSQL databases
  • API development
  • Analytical thinking
  • Team building
  • Data analytics
  • SQL transactional replications
  • Big data technologies
  • SQL and databases

Languages

English
Full Professional

Timeline

Data Engineer

Diversified
09.2024 - Current

Data Engineer

AllianzLife
10.2019 - 08.2024

Data Engineer

RBC
03.2018 - 09.2019

Data Engineer

Insurance Auto Auctions
06.2016 - 02.2018

Software Developer

Walmart Ellicot
03.2012 - 05.2016

Bachelor of Science - Information Technology

JNTU