5+ years of experience as a Software Developer with expertise in the Hadoop ecosystem, Spark, and AWS and Azure tools, and exposure to Machine Learning, Mathematical Modeling, and Operations Research. Comfortable with R, Python, SAS, Weka, MATLAB, and relational databases.
· Analyzed business requirements and prepared detailed specifications that follow the project guidelines required for development.
· Built scalable distributed data solutions using Apache Hadoop and Spark.
· Used the Scala collections framework to store and process complex consumer information and Scala functional programming concepts to develop business logic (see the first sketch at the end of this section).
· Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files; worked on AWS EMR.
· Loaded generated HFiles into HBase for faster access to a large customer base without taking a performance hit.
· Worked with data from different sources using multiple input formats with GenericWritable and ObjectWritable.
· Used Spark and Spark SQL with the Scala API to read Parquet data and create tables in Hive (see the sketch at the end of this section).
· Configured Spark Streaming in Scala to receive real-time data from Apache Kafka and store the streamed data in HDFS (see the sketch at the end of this section).
· Collected logs from the physical machines and the OpenStack controller and integrated them into HDFS using Flume.
· Used StreamSets Data Collector to create an ETL pipeline for pulling data from RDBMS systems into HDFS.
· Deployed producer and consumer microservices inside a Kubernetes cluster and worked on connecting them to Kafka running outside the cluster.
· Implemented daily cron-style Oozie coordinator jobs that automate parallel loading of data into HDFS.
· Installed the Oozie workflow engine to run multiple Hive and Pig jobs, and benchmarked Hadoop and HBase clusters for internal use.
· Built Spark applications using PySpark and used Python for data engineering in the Spark framework.
· Implemented Kafka high-level consumers to read data from Kafka partitions and move it into HDFS.
· Used Azure Data Factory to schedule flows by connecting different pipelines and Databricks notebooks.
· Responsible for building scalable distributed data solutions using Hadoop; implemented a nine-node CDH3 Hadoop cluster on Red Hat Linux.
· Used Pig as an ETL tool for transformations, event joins, filters, and pre-aggregations.
· Moved files between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
· Transferred data from AWS S3 to AWS Redshift using Informatica.
· Used visualization tools such as Power View for Excel and Tableau to visualize data and generate reports.
· Analyzed system failures, identified root causes, and recommended courses of action.
· Wrote shell scripts to monitor the health of Hadoop daemon services and respond to warning or failure conditions.
· Managed and scheduled jobs on a Hadoop cluster and designed a data warehouse using Hive.
· Created an ETL guidelines document covering coding standards and naming conventions for development, along with production-support log and root cause analysis documents for troubleshooting DataStage jobs.
· Developed Pig Latin scripts to extract data from web server output files and load it into HDFS, and developed Pig UDFs to pre-process the data for analysis.
· Developed Hive queries for analysts.
· Developed Oozie workflows to automate loading data into HDFS and pre-processing it with Pig.
· Provided cluster coordination services through ZooKeeper.
· Collected log data from web servers and integrated it into HDFS using Flume.
· Implemented the Fair Scheduler on the JobTracker to share cluster resources among users' MapReduce jobs.
· Managed data coming from different sources and loaded data from the UNIX file system into HDFS.
· Implemented full Spark jobs with both the PySpark and Spark Scala APIs.
· Used Apache Kafka for large-scale data processing, real-time analytics, and real-time data streaming.
· Used Kafka for publish-subscribe messaging as a distributed commit log; experienced with its speed, scalability, and durability.
· Responsible for building scalable distributed data solutions using Hadoop; loaded data from the edge node into HDFS using shell scripting.
· Created HBase tables to store variable formats of PII data coming from different portfolios. Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
· Worked with highly engaged Informatics, Scientific Information Management, and enterprise IT teams.
· Involved in various phases of the Software Development Life Cycle (SDLC), such as design, development, and unit testing.
· Developed and deployed UI-layer logic for sites using JSP, XML, JavaScript, HTML/DHTML, and Ajax.
· Followed the Agile Scrum methodology for the development process.
· Developed prototype test screens in HTML and JavaScript.
· Developed JSPs for client data presentation and client-side data validation within forms.
· Experienced in writing PL/SQL stored procedures, functions, triggers, Oracle reports, and complex SQL.
· Migrated ETL jobs to Pig scripts for transformations, event joins, and pre-aggregations before storing the data in HDFS.
· Used the Hive data warehouse tool to analyze unified historical data in HDFS and identify issues and behavioral patterns.
· Worked with JavaScript to perform client-side form validations.
· Used Struts tag libraries as well as the Struts Tiles framework.
· Used JDBC with the Type 4 Oracle thin driver to access the database for application optimization and efficiency; created connections through JDBC and used JDBC statements to call stored procedures.
· Used Data Access Objects (DAO) to make the application more flexible toward future and legacy databases.
· Developed UDFs in Java as needed for use in Pig and Hive queries.
· Worked on Java application development using JSP, Servlets, Struts, Hibernate, Spring, and REST and SOAP web services.
· Implemented AWS EC2, Auto Scaling, and AWS APIs, and exposed them as RESTful web services.
· Actively involved in tuning SQL queries for better performance.
· Used Git for source code control; involved in unit testing and bug fixing.
Specific Technologies: Data Structures and Algorithms, Object-Oriented Programming Skills
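The bullet on Scala collections and functional programming refers to work along the lines of the following minimal sketch. The Consumer case class, its fields, and the aggregation are hypothetical illustrations, not the actual business logic.

```scala
// Hypothetical consumer record; field names are illustrative only
case class Consumer(id: String, segment: String, purchases: Seq[Double])

object ConsumerLogic {
  // Aggregate total spend per segment using immutable collections,
  // higher-order functions, and pattern matching
  def spendBySegment(consumers: Seq[Consumer]): Map[String, Double] =
    consumers
      .groupBy(_.segment)
      .map { case (segment, members) =>
        segment -> members.flatMap(_.purchases).sum
      }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      Consumer("c1", "retail", Seq(19.99, 5.00)),
      Consumer("c2", "retail", Seq(42.50)),
      Consumer("c3", "wholesale", Seq(310.00))
    )
    println(spendBySegment(sample)) // Map(retail -> 67.49, wholesale -> 310.0)
  }
}
```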
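A minimal sketch of reading Parquet with Spark SQL and registering a Hive table through the Scala API, as mentioned in the bullets above. The HDFS path, database, table name, and filter column are placeholder assumptions.

```scala
import org.apache.spark.sql.SparkSession

object ParquetToHive {
  def main(args: Array[String]): Unit = {
    // Hive support lets saveAsTable register the data in the Hive metastore
    val spark = SparkSession.builder()
      .appName("parquet-to-hive")
      .enableHiveSupport()
      .getOrCreate()

    // Path, column, and table names are placeholders
    val consumers = spark.read.parquet("hdfs:///data/curated/consumers")

    consumers
      .filter("active = true")      // example transformation
      .write
      .mode("overwrite")
      .saveAsTable("analytics.active_consumers")

    spark.stop()
  }
}
```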
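A minimal sketch of the Spark Streaming flow from Kafka into HDFS described above, assuming the spark-streaming-kafka-0-10 integration; the broker list, topic, consumer group, batch interval, and output path are placeholders.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-hdfs")
    val ssc  = new StreamingContext(conf, Seconds(30)) // 30-second micro-batches

    // Kafka consumer settings; broker list and group id are placeholders
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092,broker2:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "hdfs-loader",
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Persist each non-empty micro-batch to HDFS as text files
    stream.map(_.value).foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty())
        rdd.saveAsTextFile(s"hdfs:///data/raw/events/${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```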