
Vasu Valasapalli

Prosper, TX

Summary

Data engineer experienced in designing and optimizing data pipelines to ensure seamless data flow. Applies advanced SQL and Python skills to build and maintain robust data architectures. Track record of implementing scalable solutions that enhance data integrity and support informed decision-making.

Overview

13 years of professional experience

Work History

Data Engineer

Diversified
09.2024 - Current
  • Leading the development of advanced data engineering solutions to optimize data processing, analytics, and machine learning workflows for enterprise clients
  • Designing and implementing scalable data pipelines using Snowflake, Spark, and AWS to support real-time and batch data processing needs
Responsibilities:
  • Developed and maintained optimal data pipeline architecture in the Snowflake data warehouse
  • Led data discovery, handling structured and unstructured data, cleaning and performing descriptive analysis, and storing as normalized tables for dashboards
  • Converted Hive/SQL queries into transformations using Python
  • Designed and developed scalable data pipelines and ETL workflows using Python, optimizing data ingestion, transformation, and storage processes
  • Built and maintained batch and streaming data pipelines, leveraging Apache Kafka, Spark Streaming, and AWS Kinesis for real-time data processing
  • Performed data deduplication and profiling for many production tables, and imported data files from S3 and the SQL Workbench data pumper into Redshift tables
  • Performed complex joins on Hive tables with various optimization techniques; created internal and external Hive tables with appropriate static and dynamic partitions for efficiency, and worked extensively with Hive DDLs and Hive Query Language (HQL)
  • Applied data mining and aggregation to reduce the customer complaint rate (CCR) and improve product quality in the marketplace
  • Used Spark and Scala to develop machine learning algorithms that analyze clickstream data
  • Used Spark SQL to pre-process, clean, and join large data sets
  • Opened SSH tunnels to Google Dataproc to access the YARN resource manager and monitor Spark jobs
  • Integrated Go-based services with distributed data storage systems such as Apache Hadoop and Apache Spark, facilitating real-time data processing and analytics
  • Worked with the Play framework and Akka for parallel processing
  • Developed Spark applications using Scala for easy Hadoop transitions
  • Wrote Spark jobs and streaming APIs using Scala and Python
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
  • Developed Spark code and Spark-SQL/Streaming for faster data testing and processing
  • Used Python to compare Spark's performance with Hive and SQL/Oracle
  • Involved in creating UDFs in Spark using Scala and Python programming languages
  • Managed the Hadoop cluster using Cloudera Manager
  • Designed and implemented schema evolution strategies and data lake architectures using Apache Iceberg for optimized storage and analytics
  • Used GitHub repository for code reviews, committing, and retrieving code
  • Identified, designed, and implemented internal process improvements: automating manual processes, optimizing data delivery, and re-designing infrastructure for greater scalability
  • Optimized performance by creating and modifying database triggers, stored procedures, and complex analytical queries, including multi-table joins, nested queries, and correlated subqueries
  • Converted designs into infrastructure code, developing solutions with AWS services (VPC, EC2, S3, ELB, EBS, RDS, IAM, CloudFormation, Route 53, CloudWatch, CloudFront, and CloudTrail) and orchestration tools such as Kubernetes, Docker, and Ansible
  • Provided technical leadership and mentorship to junior engineers, ensuring adherence to best practices in data engineering, automation, and cloud computing
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB
  • Built Snowpipe pipelines for data ingestion
  • Used Snowflake Clone and Time Travel features, and developed stored procedures, triggers, and tasks
  • Environment: Python, Snowflake, Spark-SQL, Spark, AWS, Kubernetes, Docker, AWS EMR, Amazon S3, DynamoDB, Hadoop, GitHub, Scala, SQL/Oracle.

Data Engineer

AllianzLife
10.2019 - 08.2024
  • Engineered large-scale data solutions to support AllianzLife’s actuarial and analytics teams
  • Designed and optimized ETL workflows in AWS to process high-volume insurance data
  • Developed scalable data lakes, integrating real-time streaming with Kafka and Spark for policy performance analysis
  • Automated performance calculations using Airflow and implemented cloud-based solutions with Scala and Akka to enhance data processing capabilities
Responsibilities:
  • Worked closely with business analysts to convert business requirements into technical requirements, preparing low- and high-level documentation, and collaborated closely with architects
  • Participated in all phases of the project life cycle, including data collection, data mining, data cleaning, model development, validation, and reporting
  • Created a data lake using Spark for downstream applications; designed and developed Scala workflows to pull data from cloud-based systems and apply transformations
  • Developed Spark applications using Spark-SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns
  • Developed and deployed serverless data processing solutions using AWS Lambda, Step Functions, Glue, and Athena, improving automation and efficiency
  • Implemented Infrastructure as Code (IaC) using AWS CDK and Terraform to automate cloud resource provisioning and ensure scalable deployments
  • Applied data modeling best practices, optimized database performance tuning, and enhanced query execution efficiency in Snowflake, Redshift, and DynamoDB
  • Performed transformations using Hive and MapReduce; copied .log and Snappy files into HDFS from Greenplum using Flume and Kafka, and loaded data into HDFS from MySQL using Sqoop
  • Imported required tables from RDBMS into HDFS using Sqoop, and used Storm/Spark Streaming with Kafka to stream real-time data into HBase
  • Implemented scalable infrastructure and optimized computational resources to support the training and deployment, ensuring high-performance and cost-effective operations at scale
  • Worked extensively on importing metadata into Hive and migrated existing tables and applications to Hive and the AWS cloud
  • Used AWS Redshift, S3, Spectrum, and Athena services to query the extensive amounts of data stored on S3 to create a Virtual Data Lake without going through the ETL process
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC, Parquet, and text files) into AWS Redshift
  • Developed views and templates with Python and Django's view controller and templating language to create a user-friendly website interface
  • Wrote MapReduce jobs for text mining in collaboration with a predictive analysis team, and worked with Hadoop components such as HBase, Spark, YARN, Kafka, Zookeeper, Pig, Hive, Sqoop, Oozie, Impala, and Flume
  • Performed extensive data reading and writing to and from CSV and Excel files using Pandas
  • Tasked with maintaining RDDs using SparkSQL
  • Implemented two workflows in the Data Platform
  • Built the simple workflow, the most basic workflow offered by the Data Platform, for ingesting data, curating it, and writing it to a data store (AWS DynamoDB), separating valid and invalid data using the rule engine
  • Used Spark Streaming to create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files
  • Worked with AWS-native workflow orchestration tools, such as Apache Airflow and Step Functions, to automate and monitor end-to-end data pipelines
  • Wrote HIVE UDFs as per requirements and handled different schemas and XML data
  • Implemented ETL code to load data from multiple sources into HDFS using Pig Scripts
  • Developed data pipeline using Python and Hive to load data into data link
  • Performed data analysis and data mapping for several data sources
  • Loaded data into S3 buckets using AWS Glue and PySpark
  • Involved in filtering data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB
  • Designed a new Member and Provider booking system, which allows providers to book new slots by sending the member leg and provider leg directly to TP through DataLink
  • Use Pandas, NumPy, and other Python tools to analyze various types of raw files, such as JSON, CSV, and XML
  • Developed Spark applications using Scala for easy Hadoop transitions
  • Wrote Spark jobs and streaming APIs using Scala and Python
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
  • Developed Spark code and Spark-SQL/Streaming for faster data testing and processing
  • Designed and implemented schema evolution strategies and data lake architectures using Apache Iceberg for optimized storage and analytics
  • Collaborated with cross-functional teams, including data scientists, analysts, and DevOps teams, to design efficient and scalable data solutions
  • Automated the existing scripts for performance calculations using scheduling tools like Airflow
  • Created cloud-based software solutions in Scala using Spray IO, Akka, and Slick
  • Fetched live stream data from DB2 into HBase tables using Spark Streaming and Apache Kafka
  • Scheduled and monitored job workflows using tools such as Oozie and Zookeeper
  • Populated HDFS and Cassandra with vast amounts of data using Apache Kafka
  • Environment: Map Reduce, HDFS, Hive, Pig, HBase, Python, SQL, Sqoop, Flume, Oozie, Impala, Scala, Spark, Apache Kafka, Play, AWS, AKKA, Zookeeper, Linux Red Hat, HP-ALM, Eclipse, Cassandra, SSIS.

Data Engineer

RBC
03.2018 - 09.2019
  • Developed and optimized financial data pipelines for RBC’s banking and investment platforms
  • Led real-time data transformations using Hive and MapReduce to improve data accessibility for reporting and analytics
  • Integrated Spark Streaming and Kafka for real-time transaction monitoring, improving fraud detection capabilities
  • Played a key role in Agile development, ensuring the timely delivery of data-driven solutions
Responsibilities:
  • Performed transformations using Hive and MapReduce; copied .log and Snappy files into HDFS from Greenplum using Flume and Kafka, and loaded data into HDFS from MySQL using Sqoop and Databricks
  • Imported required tables from RDBMS into HDFS using Sqoop, and used Storm/Spark Streaming with Kafka to stream real-time data into HBase
  • Developed and monitored applications in the Hortonworks Hadoop data lake environment
  • Developed scripts using SQL mappings and joins in Hive (HiveQL) to perform testing
  • Configured Spark Streaming in Python to receive real-time data from Kafka and store it in HDFS
  • Used PySpark to prepare data per business needs and published the processed data to HDFS
  • Performed data profiling, analysis, and visualization for PHI and sensitive data; analyzed and visualized the data from sources and created the mapping document
  • Conducted data profiling and analysis to identify database anomalies and nuances
  • Performed project-specific, ad-hoc analysis and developed reproducible scripts where analytical approaches met business requirements
  • Performed complex HQL coding, including data flows for evaluating business rules, and created the workflows needed to segregate data and load it into final database objects for data visualization using the analytical capabilities of Hive
  • Applied data modeling best practices, optimized database performance tuning, and enhanced query execution efficiency in Snowflake, Redshift, and DynamoDB
  • Performed ETL and Sqoop operations on the received data files to extract and load the data into proprietary databases, such as AAH, PSI, DB2, DataLake, etc.
  • Extended the design to document low-level design specifications, including data flow, workflow, integration rules, normalization, and standardization methods
  • Communicated the day-to-day progress of both on-site and offshore teams to the client manager and ensured work was tracked and completed per project schedules
  • Collaborated with cross-functional teams, including data scientists, analysts, and DevOps teams, to design efficient and scalable data solutions
  • Took an active part in the Agile development process, participating in all Agile ceremonies, including scrums, sprint planning, backlog grooming, sprint demos, and retrospectives
  • Environment: HDFS, Hive, HBase, Python, SQL, Sqoop, Flume, Oozie, Impala, Scala, Spark, Apache Kafka, Play, AKKA, Zookeeper, Linux Red Hat, HP-ALM.

Data Engineer

Insurance Auto Auctions
06.2016 - 02.2018
  • Designed and implemented data processing workflows to support Truss Health’s data analytics and regulatory reporting
  • Developed Spark-based ETL pipelines to ingest and transform healthcare data, ensuring compliance with industry standards
  • Built Apache Kafka-based real-time data streaming solutions to improve data accessibility and enable faster patient care and operational analytics decision-making
Responsibilities:
  • Developed scripts for data flow processes from various sources to target databases to test the flow of data
  • Used Spark, HiveQL, and Pig extensively for data retrieval, querying, storage, and transformation
  • Developed Apache Kafka streams using console consumers and created Phoenix tables on streaming JSON data for user validation
  • Created Spark Streaming jobs to move real-time and batch data from the source (DB2) to the data lake
  • Parsed different file types, such as XML, JSON, and fixed-width files, flattening them into text format and loading them into tables
  • Used Sqoop to load batch data from source to target
  • Implemented Kafka to consume the live streaming data and integrated the Zena tool using Shell scripts to automate the data process
  • Created Hive Tables and performed required queries on top of the tables
  • Scheduled jobs using time- and event-based triggers
  • Analyzed the quality, size, format, and frequency of data from sources
  • Designed transformation rules based on the tables involved and analyzed join criteria to capture the required fields
  • Designed mapping documents by analyzing the data, for reference during script development and unit testing as entry criteria for QA
  • Documented test cases and workflow instructions
  • Used JIRA and ALM to track issues and defects and to load test cases and test results
  • Facilitated knowledge transfer from the client to the offshore team, conveying the necessary technical and business knowledge
  • Environment: Map Reduce, Hive, Pig, HBase, Python, SQL, Sqoop, Flume, Oozie, Impala, Scala, Spark, Zookeeper, Linux Red Hat, HP-ALM, Eclipse.

Software Developer

Walmart Ellicot
03.2012 - 05.2016
  • Developed and maintained Walmart’s customer analytics and e-commerce applications using Python and Django
  • Engineered Spark-based data aggregation solutions to optimize inventory and customer sentiment analysis
  • Built scalable web applications for tracking customer complaints, improving issue resolution and customer satisfaction
  • Enhanced search engine optimization and database management systems for e-commerce platforms
Responsibilities:
  • Involved in the project's analysis, design, implementation, and testing
  • Exposed to various phases of the Software Development Life Cycle using Scrum Software development methodology
  • Developed views and templates with Python and Django's view controller and templating language to create a user-friendly website interface
  • Developed the customer complaints application using Django Framework, which includes Python code
  • Strong understanding and practical experience in developing Spark applications with Python
  • Developed Scala scripts and UDFs using both Data frames/SQL in Spark for Data aggregation
  • Designed, developed, tested, deployed, and maintained the website
  • Developed entire frontend and backend modules using Python on Django Web Framework
  • Developed Python scripts to update content in the database and manipulate files
  • Rewrote existing Java applications in Python to deliver a specific data format
  • Generated property lists for every application dynamically using Python
  • Designed and developed the website's UI using HTML, XHTML, AJAX, CSS, and JavaScript
  • Wrote Python scripts to parse XML documents and load the data in the database
  • Handled all the client-side validation using JavaScript
  • Performed testing using Django’s Test Module
  • Designed and developed a data management system using MySQL
  • Created unit test and regression test frameworks for existing and new code
  • Responsible for search engine optimization to improve the visibility of the website
  • Responsible for debugging and troubleshooting the web application
  • Environment: Python, Django, Java, MySQL, Linux, HTML, XHTML, CSS, AJAX, JavaScript, Apache Web Server.
  • Improved software efficiency by troubleshooting and resolving coding issues.
  • Saved time and resources by identifying and fixing bugs before product deployment.
  • Collaborated with cross-functional teams to deliver high-quality products on tight deadlines.
  • Developed software for desktop and mobile operating systems.
  • Increased development speed by automating repetitive tasks using scripts and tools.

Education

Bachelor of Science - Information Technology

JNTU
India

Skills

  • ETL development
  • Data warehousing
  • Data modeling
  • Data pipeline design
  • Scripting languages
  • SQL expertise
  • Machine learning
  • NoSQL databases
  • API development
  • Analytical thinking
  • Team building
  • Data analytics
  • SQL transactional replications
  • Big data technologies
  • SQL and databases

Languages

English
Full Professional

Timeline

Data Engineer

Diversified
09.2024 - Current

Data Engineer

AllianzLife
10.2019 - 08.2024

Data Engineer

RBC
03.2018 - 09.2019

Data Engineer

Insurance Auto Auctions
06.2016 - 02.2018

Software Developer

Walmart Ellicot
03.2012 - 05.2016

Bachelor of Science - Information Technology

JNTU