Dynamic and motivated IT professional with around 11 years of experience as a Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, big data analytics, cloud data engineering, data warehouse/data mart, data visualization, reporting, and data quality solutions.
In-depth knowledge of Hadoop architecture and its components, including YARN, HDFS, NameNode, DataNode, JobTracker, ApplicationMaster, ResourceManager, TaskTracker, and the MapReduce programming paradigm.
Extensive experience in Hadoop-led development of enterprise-level solutions utilizing Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, ZooKeeper, and YARN.
Profound experience in performing data ingestion and data processing (transformations, enrichment, and aggregations).
Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
Experienced in using Spark to improve the performance and optimization of existing Hadoop algorithms with SparkContext, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs; worked extensively with PySpark and Scala (illustrated in the sketch following this summary).
Handled ingestion of data from different data sources into HDFS using Sqoop and Flume, performed transformations using Hive and MapReduce, and loaded the transformed data back into HDFS.
Managed Sqoop jobs with incremental loads to populate Hive external tables.
Experience in importing streaming data into HDFS using Flume sources and sinks, and transforming the data using Flume interceptors.
Experience with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
Designed and developed Power BI graphical and visualization solutions from business requirement documents, and planned interactive dashboards.
Utilized Azure PaaS services to analyze, plan, and develop modern data solutions that facilitate data visualization.
Assessed the application's current state in production and evaluated how a new implementation would affect existing business procedures.
Extracted, transformed, and loaded data from source systems into Azure data storage services and Azure Data Lake Analytics using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL.
Processed data in Azure Databricks after ingesting it into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, and Azure DW).
Created serverless.yml files to provision AWS resources.
Experience with partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
Experience with different file formats such as Avro, Parquet, ORC, JSON, and XML.
Expertise in creating, debugging, scheduling, and monitoring jobs using Control-M and Oozie.
Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL.
Created Java applications to handle data in MongoDB and HBase; used Apache Phoenix to create a SQL layer on HBase.
Experience in developing enterprise-level solutions using batch processing (Apache Pig) and streaming frameworks (Spark Streaming, Apache Kafka, and Apache Flink).
Migrated databases from SQL databases (Oracle and SQL Server) to NoSQL databases (Cassandra/MongoDB).
Experience in designing and creating RDBMS tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.
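As a rough illustration of the Spark and Hive work summarized above, the following is a minimal PySpark sketch; the HDFS paths, database, and table/column names are hypothetical placeholders rather than actual project artifacts.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hive support is needed to write managed/external Hive tables.
    spark = (
        SparkSession.builder
        .appName("orders-enrichment")  # hypothetical job name
        .enableHiveSupport()
        .getOrCreate()
    )

    # Read raw data previously landed in HDFS (e.g. by Sqoop or Flume ingestion).
    orders = spark.read.parquet("hdfs:///data/raw/orders")        # hypothetical path
    customers = spark.read.parquet("hdfs:///data/raw/customers")  # hypothetical path

    # Transformation, enrichment, and aggregation with the DataFrame API.
    daily_totals = (
        orders.join(customers, "customer_id")
        .withColumn("order_date", F.to_date("order_ts"))
        .groupBy("order_date", "region")
        .agg(
            F.sum("amount").alias("total_amount"),
            F.countDistinct("customer_id").alias("unique_customers"),
        )
    )

    # Write the result as a date-partitioned Hive table to optimize downstream queries.
    (
        daily_totals.write
        .mode("overwrite")
        .partitionBy("order_date")
        .format("parquet")
        .saveAsTable("analytics.daily_order_totals")  # hypothetical database.table
    )

    spark.stop()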
Configured alerting rules and set up PagerDuty alerting in Grafana for Kafka, ZooKeeper, Druid, Cassandra, Spark, and various microservices.
Set up and maintained logging and monitoring subsystems using tools like Elasticsearch, Fluentd, Kibana, Prometheus, Grafana, and Alertmanager.
Expert in designing ETL data flows by creating mappings/workflows to extract data from SQL Server, and in data migration and transformation from Oracle, Access, and Excel sheets using SQL Server SSIS.
Expert in designing parallel jobs using various stages such as Join, Merge, Lookup, Remove Duplicates, Filter, Dataset, Lookup File Set, Complex Flat File, Modify, Aggregator, and XML.
Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, and other services of the AWS family.
Created and configured new batch jobs in the Denodo scheduler with email notification capabilities; implemented cluster settings for multiple Denodo nodes and set up load balancing to improve performance.
Instantiated, created, and maintained CI/CD (continuous integration and deployment) pipelines, applying automation to environments and applications; worked with various automation tools like Git, CFT, and Ansible.
Experienced with JSON-based RESTful web services and XML-based SOAP web services; worked on various applications using Python IDEs like Sublime Text and PyCharm.
Built and productionized predictive models on large datasets utilizing advanced statistical modeling, machine learning, and other data mining techniques.
Developed intricate algorithms based on deep-dive statistical analysis and predictive data modeling that were used to deepen relationships, strengthen longevity, and personalize interactions with customers.
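A minimal sketch of the predictive-modeling workflow mentioned above, using scikit-learn with a synthetic stand-in dataset; in practice the features would come from the data pipelines described earlier, and the model choice here is only an assumed baseline.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a prepared customer feature set.
    X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Fit a simple baseline classifier and score it before productionizing.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))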
Databricks Certified Data Engineer Professional
SnowPro Advanced Data Engineer
AWS Certified Data Engineer
Certified Azure Data Engineer Associate