Highly skilled data engineer with over five years of experience designing, developing, and optimizing data integration solutions in diverse environments. Adept at leveraging a wide range of technologies, including Apache Hadoop, Spark, Kafka, AWS, and various relational and NoSQL databases, to build robust ETL pipelines and real-time data processing systems. Experienced in Agile methodologies, data orchestration with tools such as NiFi and Airflow, and containerization with Docker and Kubernetes. Proven ability to implement data governance frameworks that ensure data quality and compliance, and to derive insights from large datasets using advanced analytics tools such as Zeppelin and Jupyter Notebooks. Extensive experience with cloud-based data warehousing, cloud infrastructure management, and CI/CD automation. Strong background in developing scalable data architectures and maintaining enterprise data warehouses, with hands-on experience in machine learning, data visualization, and performance tuning. A collaborative team player with a track record of delivering high-quality data solutions that drive business value.
Overview
6 years of professional experience
2 certifications
Work History
Data Engineer Intern
Thrive Software Solutions
WA
02.2024 - 05.2024
Utilized Apache Zeppelin and Jupyter Notebooks for advanced analytics, deriving insights from large datasets through statistical techniques and machine learning algorithms
Managed data orchestration and workflows efficiently with Apache NiFi and Luigi, handling various data formats including JSON, XML, Parquet, CSV, and ORC
Used Docker for containerization and Kubernetes for orchestration, facilitating the deployment and management of containerized applications
Implemented data governance frameworks with Apache Atlas and Collibra, ensuring data quality, privacy, and regulatory compliance
Leveraged Apache Kafka Streams and Amazon Kinesis for real-time data processing, optimizing streaming data pipelines for high-throughput and real-time analytics.
Graduate Research Assistant
Northern Illinois University
Dekalb, IL
01.2023 - 01.2024
Enhanced distributed data processing efficiency by leveraging Apache Hadoop, Spark, and Flink, focusing on in-memory and real-time stream processing
Implemented advanced resource management and scalable architectures using containerization and load balancing techniques
Developed optimized ETL processes with incremental loading and real-time data processing capabilities using Apache Kafka
Integrated automated monitoring and self-healing mechanisms into data pipelines, utilizing Apache Airflow for workflow orchestration
Ensured data quality and optimized performance by incorporating robust validation steps and profiling tools to identify and resolve bottlenecks.
AWS Data Engineer
Mindtree Ltd
Hyderabad
11.2020 - 07.2022
Developed a cloud migration strategy and implemented best practices using AWS services such as Database Migration Service (DMS) and Server Migration Service
Set up and built AWS infrastructure using resources such as VPC, EC2, S3, DynamoDB, IAM, EBS, Route 53, SNS, SES, SQS, CloudWatch, CloudTrail, Security Groups, Auto Scaling, and RDS through CloudFormation templates
Implemented Kubernetes with Docker for auto-scaling and continuous integration (CI), deploying Docker images through Kubernetes and using the Kubernetes dashboard for monitoring
Utilized AWS Lambda for serverless computing and trigger-based code execution
Implemented data warehouse solutions in Amazon Redshift and migrated data from various databases to AWS services
Developed Bash and Python scripts for AWS infrastructure creation and automation tasks
Orchestrated and migrated CI/CD processes using CloudFormation, Terraform, and Docker, set up across OpenShift, AWS, and VPCs
Developed Python programs for automating tasks like extracting metadata and lineage from tools, saving significant manual effort
Utilized Spark for improving performance and optimizing existing algorithms in Hadoop environments
Integrated real-time monitoring for data ingestion processes using AWS CloudWatch
Configured Airflow connection to AWS EMR cluster and developed bash shell bootstrap scripts for initializing the cluster with necessary configurations
Defined, created, and deployed Star Schema, Snowflake Schema, and Dimensional Data Modeling on an Enterprise Data Warehouse (EDW).
Big Data Engineer
Arcesium
Hyderabad
07.2018 - 10.2020
Worked in Agile environments using tools like Rally to maintain user stories and tasks
Utilized Agile methodology and SCRUM process, providing daily reports and participating in design and development phases
Developed Spark/PySpark-based ETL pipelines for migrating credit card transactions, account, and customer data into an enterprise Hadoop Data Lake
Migrated MapReduce jobs to Spark for better performance and used Spark RDDs, Python, and Scala for data transformations
Maintained data integration programs in Hadoop and RDBMS environments from both structured and semi-structured data sources
Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase
Utilized AWS services such as EMR, S3, Lambda, and SNS for data processing and storage
Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL
Designed SSIS Packages for ETL from various environments into SQL Server for SSAS cubes
Transformed Teradata scripts and stored procedures to SQL and Python for Snowflake's cloud platform
Defined, created, and deployed Star Schema, Snowflake Schema, and Dimensional Data Modeling on an EDW
Implemented Composite server for data virtualization and created restricted data access views using a REST API
Batch processed data from S3 to MongoDB, PostgreSQL, and MySQL
Queried and analyzed data from Cassandra using CQL and joined various tables using Spark and Scala
Built and published customized interactive Tableau reports and dashboards
Created multiple dashboards in Tableau for various business needs and used SQL Server Reporting Services (SSRS) for formatted reports
Performed performance tuning on Hive queries and UDFs
Supervised data profiling and validation to ensure accuracy between source and target systems
Configured topics in new Kafka clusters across environments and ingested data into Hadoop and Cassandra using Kafka
Implemented Apache Drill on Hadoop to join data from SQL and NoSQL data stores.
Hadoop Developer
GENPACT
Hyderabad
01.2018 - 06.2018
Installed the Oozie workflow engine to run multiple Hive and Pig jobs
Developed simple to complex MapReduce jobs using Hive and Pig
Developed MapReduce programs for data analysis and data cleaning
Implemented Avro and Parquet data formats for Apache Hive computations to handle custom business requirements
Integrated external data sources and APIs into GCP data solutions, ensuring data quality and consistency
Built data transformation pipelines using GCP services such as Dataflow and Apache Beam to cleanse, normalize, and enrich data
Built machine learning models to showcase big data capabilities using PySpark
Designed, implemented, and deployed within a customer’s existing Hadoop / Cassandra cluster a series of custom parallel algorithms for various customer-defined metrics and unsupervised learning models
Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster
Extensively used SSIS transformations such as Lookup, Derived Column, Data Conversion, Aggregate, Conditional Split, SQL Task, Script Task, and Send Mail Task
Performed data cleansing, enrichment, mapping tasks and automated data validation processes to ensure meaningful and accurate data was reported efficiently
Implemented Apache Pig scripts to load data from and store data into Hive.
Education
Master of Science - Management Information Systems