- Involved in analysis, design, system architecture, process interface design, and design documentation.
- Responsible for developing prototypes of selected solutions and implementing complex big data projects with a focus on collecting, parsing, managing, analyzing, and visualizing large data sets using multiple platforms.
- Applied new technologies to solve big data problems and to develop innovative big data solutions.
- Developed various data loading strategies and performed transformations to analyze datasets using the Cloudera Distribution of Hadoop (CDH) ecosystem.
- Worked extensively on designing and developing multiple Spark/Scala ingestion pipelines, both real-time and batch (see the broadcast-join sketch below).
- Responsible for handling large datasets using partitioning, Spark in-memory capabilities, broadcast variables, and effective, efficient joins and transformations during the ingestion process itself.
- Worked on importing metadata into Hive/Impala and migrated existing legacy tables and applications to run on Hadoop using Spark, Hive, and Impala.
- Worked on POCs to perform change data capture (CDC) and Slowly Changing Dimension (SCD) handling in HDFS using Spark and Delta Lake, an open-source storage layer that brings ACID transactions to Apache Spark (see the Delta Lake merge sketch below).
- Extensively worked on a POC to ingest data from an S3 bucket into Snowflake using external stages (see the external-stage sketch below).
- Developed generic stored procedures using SnowSQL and JavaScript to transform and ingest transactional table data into Snowflake relational tables from external S3 stages.
- Worked on a prototype to create an external function in Snowflake that calls a remote service implemented in AWS Lambda (see the external-function sketch below).
- Developed multiple POCs using Spark, deployed them on a YARN cluster, and compared the performance of Spark with Hive and Impala.
- Responsible for performance tuning of Spark/Scala batch ETL jobs by adjusting configuration properties and using broadcast variables.
- Worked on batch processing for history loads and real-time processing of live data on Spark Streaming using the Lambda architecture.
- Developed a streaming pipeline to consume data from Kafka and ingest it into HDFS in near real time (see the Kafka streaming sketch below).
- Worked on performance tuning of Spark Streaming applications: setting the right batch interval, choosing the correct level of parallelism, and tuning memory.
- Implemented Spark SQL optimized joins to gather data from different sources and run ad-hoc queries on top of them.
- Wrote generic Spark/Scala UDFs to perform business logic operations at the record level (see the UDF sketch below).
- Developed Spark code in Scala and Spark SQL for faster testing and processing of data; loaded data into Spark RDDs and performed in-memory computation to generate output with lower memory usage.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Worked on parsing and converting JSON/XML files to tabular format in Hive/Impala using Spark Scala, Spark SQL, and the DataFrame API (see the JSON-parsing sketch below).
- Worked with various file formats and compression codecs: Text, JSON, XML, Avro, and Parquet, with Snappy, bzip2, and gzip compression.
- Performed transformations and actions on RDDs and Spark Streaming data.
- Involved in converting HiveQL queries into Spark transformations using Spark RDDs, Spark SQL, and Scala (see the HiveQL conversion sketch below).
- Installed the open-source Zeppelin Notebook for interactive use of the Spark Scala, PySpark, Spark SQL, and SparkR APIs via a web interface.
- Worked on integrating Zeppelin with LDAP for multi-user support in all environments.
- Responsible for estimating Zeppelin resource usage and configuring interpreters for optimal use.
- Developed Oozie workflows to automate loading data into HDFS and pre-processing it, and used ZooKeeper to coordinate clusters.
- Used ZooKeeper for various types of centralized configuration.
- Met with key stakeholders to discuss and understand all major aspects of the project, including scope, required tasks, and deadlines.
- Supervised Big Data projects and offered assistance and guidance to junior developers.
- Multi-tasked to keep all assigned projects running effectively and efficiently.
- Achieved challenging production goals on a consistent basis.
Environment: Hadoop, Cloudera Distribution, Scala, Python, Spark Core, Spark SQL, Spark Streaming, Hive, HBase, Pig, Sqoop, Kafka, ZooKeeper, Java 8, UNIX shell scripting, Zeppelin Notebook, Delta Lake, AWS S3, AWS Lambda, Snowflake, and SnowSQL.
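
Code sketches (illustrative):

Broadcast-join sketch. A minimal sketch of the batch-ingestion pattern the Spark pipeline bullets describe: joining a large fact dataset against a small dimension with a broadcast (map-side) join and writing partitioned output. All paths and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object IngestWithBroadcastJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("batch-ingest-sketch").getOrCreate()

    // Large fact data landed by an upstream feed (hypothetical path/schema).
    val transactions = spark.read.parquet("/data/raw/transactions")

    // Small dimension table; the broadcast hint forces a map-side join so
    // the large side is never shuffled.
    val customers = spark.read.parquet("/data/dim/customers")

    val enriched = transactions.join(broadcast(customers), Seq("customer_id"))

    // Partitioned output lets downstream readers prune by load date.
    enriched.write
      .partitionBy("load_date")
      .mode("overwrite")
      .parquet("/data/curated/transactions")

    spark.stop()
  }
}
```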
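
Delta Lake merge sketch. A minimal sketch of the CDC/SCD-style upsert pattern the Delta Lake POC bullet refers to, using Delta's MERGE API. Table paths, the join key, and the schema are assumptions; deletes could be handled by a conditional whenMatched on a change-type flag.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

object CdcMergeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cdc-merge-sketch")
      // Delta Lake requires these extensions on the session.
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
              "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Incoming CDC batch (hypothetical path; keyed by "id").
    val updates = spark.read.parquet("/data/cdc/incoming")

    val target = DeltaTable.forPath(spark, "/data/delta/customers")

    // Upsert: update matched rows, insert new ones, atomically.
    target.as("t")
      .merge(updates.as("s"), "t.id = s.id")
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .execute()

    spark.stop()
  }
}
```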
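
External-stage sketch. The stage-and-copy POC above was driven with SnowSQL and JavaScript procedures; this sketch shows equivalent CREATE STAGE and COPY INTO statements issued from Scala over the Snowflake JDBC driver. Account, credentials, bucket, and table names are placeholders, and in practice a storage integration would replace inline keys.

```scala
import java.sql.DriverManager

object SnowflakeCopySketch {
  def main(args: Array[String]): Unit = {
    // Connection details are placeholders.
    val conn = DriverManager.getConnection(
      "jdbc:snowflake://<account>.snowflakecomputing.com/?db=ANALYTICS&schema=RAW",
      "<user>", "<password>")
    val stmt = conn.createStatement()

    // External stage over the S3 landing bucket (names are hypothetical).
    stmt.execute(
      """CREATE STAGE IF NOT EXISTS raw_s3_stage
        |  URL = 's3://example-bucket/landing/'
        |  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
        |  FILE_FORMAT = (TYPE = JSON)""".stripMargin)

    // Bulk-load the staged files into a relational table.
    stmt.execute(
      """COPY INTO transactions_raw
        |  FROM @raw_s3_stage
        |  ON_ERROR = 'CONTINUE'""".stripMargin)

    conn.close()
  }
}
```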
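
External-function sketch. A sketch of wiring a Snowflake external function to an AWS Lambda fronted by API Gateway, again issued over JDBC. The integration name, role ARN, endpoint URLs, and function signature are all hypothetical.

```scala
import java.sql.DriverManager

object ExternalFunctionSketch {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:snowflake://<account>.snowflakecomputing.com/", "<user>", "<password>")
    val stmt = conn.createStatement()

    // API integration pointing at the API Gateway endpoint that fronts the
    // Lambda; the role ARN and URL prefix are placeholders.
    stmt.execute(
      """CREATE OR REPLACE API INTEGRATION lambda_int
        |  API_PROVIDER = aws_api_gateway
        |  API_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-ext-fn'
        |  API_ALLOWED_PREFIXES = ('https://abc123.execute-api.us-east-1.amazonaws.com/prod')
        |  ENABLED = TRUE""".stripMargin)

    // External function whose calls are routed to the remote service.
    stmt.execute(
      """CREATE OR REPLACE EXTERNAL FUNCTION enrich_address(addr VARCHAR)
        |  RETURNS VARIANT
        |  API_INTEGRATION = lambda_int
        |  AS 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/enrich'""".stripMargin)

    conn.close()
  }
}
```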
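
Kafka streaming sketch. A minimal Spark Streaming (DStream) pipeline consuming a Kafka topic and landing micro-batches in HDFS in near real time; the 30-second batch interval stands in for the batch-interval tuning mentioned above. Broker, topic, group id, and output path are assumptions.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-hdfs-sketch")
    // The batch interval is the key latency/throughput tuning knob.
    val ssc = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "hdfs-ingest",
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    // Persist each micro-batch of message values to HDFS as text files.
    stream.map(_.value).saveAsTextFiles("hdfs:///data/stream/events")

    ssc.start()
    ssc.awaitTermination()
  }
}
```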
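
UDF sketch. A minimal example of a record-level UDF of the kind described above; the normalization rule is a made-up stand-in for the actual business logic.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical record-level rule: trim and upper-case a code field,
    // passing nulls through safely.
    val normalizeCode = udf((code: String) =>
      Option(code).map(_.trim.toUpperCase).orNull)

    val df = Seq(("  ab12 ", 1), ("cd34", 2)).toDF("code", "qty")
    df.withColumn("code", normalizeCode($"code")).show()

    spark.stop()
  }
}
```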
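
JSON-parsing sketch. A minimal sketch of flattening line-delimited JSON into a tabular, Hive/Impala-queryable form with from_json and the DataFrame API; the schema, input path, and table name are assumptions. XML would follow the same shape with a parser such as spark-xml.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

object JsonToTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-to-table")
      .enableHiveSupport() // so the result lands in the Hive metastore
      .getOrCreate()

    // Hypothetical schema for the nested JSON payload.
    val schema = new StructType()
      .add("id", LongType)
      .add("user", new StructType()
        .add("name", StringType)
        .add("city", StringType))

    val raw = spark.read.text("/data/raw/events.json")

    // Parse each line and flatten nested fields into plain columns.
    val flat = raw
      .select(from_json(col("value"), schema).as("j"))
      .select(col("j.id"), col("j.user.name").as("user_name"),
              col("j.user.city").as("user_city"))

    flat.write.mode("overwrite").saveAsTable("events_flat")
    spark.stop()
  }
}
```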
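
HiveQL conversion sketch. An illustration of rewriting a HiveQL aggregate as DataFrame transformations, as in the conversion bullet above; the sales table and its columns are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object HiveQlToSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hiveql-to-spark")
      .enableHiveSupport()
      .getOrCreate()

    // Original HiveQL (illustrative):
    //   SELECT region, SUM(amount) FROM sales GROUP BY region;
    // The same logic expressed as DataFrame transformations:
    val result = spark.table("sales")
      .groupBy(col("region"))
      .agg(sum(col("amount")).as("total_amount"))

    result.show()
    spark.stop()
  }
}
```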