Experienced Data Engineer with nearly 2 years of hands-on work building and optimizing Big Data applications using Hadoop ecosystem technologies such as HDFS, Hive, Sqoop, and Apache Spark. I specialize in designing scalable data pipelines and automating data workflows, ensuring efficient processing and seamless integration with cloud platforms such as AWS. My focus is on delivering high-performance solutions and maintaining reliability in production environments.
Overview
2 years of professional experience
Work History
Big Data Engineer
Tata Consultancy Services
Hyderabad, Telangana
12.2021 - 12.2022
Crafted and fine-tuned complex Hive queries to handle massive volumes of data efficiently.
Integrated Hive with other big data technologies, such as Hadoop and Spark, to optimize data processing workflows.
Designed and managed Extract, Transform, Load (ETL) processes using Hive, leading to improved data consistency and accuracy.
Managed EMR cluster configurations and scaling based on workload requirements.
Leveraged Spark RDDs (Resilient Distributed Datasets) for low-level data processing tasks.
Integrated Spark with AWS Glue for automated ETL pipelines and schema evolution.
Integrated Spark with AWS Lambda for serverless data processing solutions.
Created lambda functions in AWS to run ECS containers.
Experienced in designing and implementing complex data integration solutions using Sqoop.
Proficient in using Sqoop to import and export data with complex schemas and data types.
Implemented Sqoop-based solutions to incrementally load data from external databases into Hadoop clusters.
Proficient in using Sqoop to automate data migrations between Hadoop clusters in different geographical regions.
Implemented data partitioning and shuffling strategies for optimization.
Created custom Spark applications for specific business use cases.
Optimized Spark jobs and data processing workflows for scalability, performance, and cost efficiency using techniques such as partitioning, compression, and caching.
Experienced in optimizing Spark SQL performance by tuning various configuration settings, such as memory allocation, caching, and serialization.
Expertise in using Spark SQL to process large-scale structured and semi-structured data sets, including querying, filtering, mapping, reducing, grouping, and aggregating data.
Proficient in managing and optimizing data storage solutions using Google Cloud Storage, ensuring efficient data organization, access, and security.
Experienced in deploying and managing data processing clusters with Google Dataproc, leveraging its scalability and automation features for large-scale data analysis.
Hands-on experience with managing Google Compute Engine instances, including image creation, network configuration, and instance scaling.
Strong knowledge of Google Cloud Functions triggers and bindings for seamless integration with various event-driven workflows.
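As context for the partitioning work described above, the core idea behind Spark's data partitioning (e.g., what partitionBy does when distributing rows across tasks) can be illustrated with a minimal, self-contained sketch. This is plain Python with hypothetical record fields, not the production Spark code:

```python
from collections import defaultdict

def hash_partition(records, key_fn, num_partitions):
    """Assign each record to a partition by hashing its key --
    the same strategy Spark uses to distribute rows across tasks."""
    partitions = defaultdict(list)
    for rec in records:
        partitions[hash(key_fn(rec)) % num_partitions].append(rec)
    return dict(partitions)

# Hypothetical customer events keyed by customer_id
events = [{"customer_id": i, "amount": i * 10} for i in range(8)]
parts = hash_partition(events, lambda r: r["customer_id"], num_partitions=4)
```

Choosing a partition key with high cardinality and even distribution, as sketched here, is what avoids the data skew that degrades Spark job performance.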
Data Engineer Intern
Magnibot Technology Solutions India Pvt Ltd
Bangalore
10.2020 - 08.2021
Integrated Spark with external data sources like JDBC and APIs for data extraction.
Integrated Spark with Hadoop ecosystems like HDFS and Hive for data storage and querying.
Collaborated with data engineers to design and optimize Spark data pipelines.
Familiarity with Spark DataFrame schema and data type operations, such as adding, renaming, and dropping columns, casting data types, and handling null values.
Knowledge of Spark DataFrame optimization techniques, such as predicate pushdown, column pruning, and vectorized execution, and their impact on query performance and resource utilization.
Designed and developed batch processing data pipelines on Amazon EMR using Apache Spark, Python, and Scala to process terabytes of data in a cost-effective and scalable manner.
Performed data analysis, data quality checks, and data profiling to support the business team.
Worked with Spark's data serialization formats (Avro, Parquet, JSON, etc.).
Utilized Spark for parsing logs and other unstructured data.
Designed and optimized Spark jobs for data deduplication.
Maintained and monitored Spark clusters on AWS EMR, ensuring high availability and fault tolerance.
Automated infrastructure provisioning and management on Google Compute Engine using Infrastructure as Code (IAC) tools like Terraform.
Developed serverless, event-driven workflows and applications on Google Cloud Functions, enabling real-time data processing and automation while reducing infrastructure complexity.
Proficient in Google Cloud Storage's versioning and object archiving features, ensuring data retention and compliance with data governance policies.
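The deduplication work mentioned above (keep only the latest record per key, typically implemented in Spark with a row_number window over an update timestamp) can be sketched in plain Python with hypothetical field names:

```python
def dedupe_latest(records, key="id", ts="updated_at"):
    """Keep only the most recent record per key -- the same logic a
    Spark window/row_number job applies at cluster scale."""
    latest = {}
    for rec in records:
        k = rec[key]
        if k not in latest or rec[ts] > latest[k][ts]:
            latest[k] = rec
    return list(latest.values())

# Hypothetical order-status records with duplicate keys
rows = [
    {"id": 1, "updated_at": "2022-01-01", "status": "new"},
    {"id": 1, "updated_at": "2022-03-05", "status": "shipped"},
    {"id": 2, "updated_at": "2022-02-10", "status": "new"},
]
deduped = dedupe_latest(rows)
```

In Spark the same comparison runs per partition and in parallel, which is why the partitioning and deduplication strategies above are usually tuned together.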
Education
Master of Science - Big Data Analytics & Information Technology
University of Central Missouri
Warrensburg, MO
03.2024
Assistant Delivery Manager at Tata Consultancy Services, Global Shared Services