10+ years of professional experience in information technology with expert hands-on experience in Big Data, Hadoop, Spark, Hive, Impala, Sqoop, Flume, Kafka, SQL tuning, ETL development, report development, SAS, database development, and data modeling, along with strong knowledge of Oracle database architecture.
Experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, GCP, Azure, Spring Boot, and Spark integration with Cassandra, Solr, and Zookeeper.
Strong experience in migrating other databases to Snowflake.
Managed databases and Azure Data Platform services (Azure Data Lake Storage (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), SQL Server, Oracle, and data warehouses, and built multiple data lakes.
Built a scalable, automated data pipeline using AWS services (Glue, EMR, Redshift) and GCP tools (BigQuery, Pub/Sub), integrating diverse data sources into Snowflake for analytics.
Extensive experience in Text Analytics, generating data visualizations using R, Python and creating dashboards using tools like Tableau, Power BI.
Adopted best practices for AWS Lambda security, ensuring compliance with industry standards and enhancing data protection protocols.
Created snowflake schemas by normalizing the dimension tables as appropriate and creating a sub-dimension named Demographic as a subset of the Customer dimension.
Built ETL (Extract, Transform, Load) processes that integrate various data sources to centralize and prepare data for analysis.
Leveraged Airflow for orchestration, dbt for data transformations, and Informatica for data quality. Implemented continuous data validation and governance using IDQ and AWS Lambda for serverless automation.
Expertise in Java programming with a good understanding of OOP, I/O, collections, exception handling, lambda expressions, and annotations
Conducted training sessions on Apache Flink for team members, fostering knowledge sharing and enhancing team capabilities in stream processing.
Ensured data integrity and compliance with regulations (e.g., GDPR, CCPA) while managing big data projects, safeguarding sensitive information.
Experienced in building automated regression scripts in Python for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB (a minimal validation sketch appears at the end of this summary).
Utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy. Experienced in creating and running Docker images with multiple microservices.
Experienced in utilizing AWS services such as Amazon S3 for data storage, Amazon RDS for relational database management, and Amazon Redshift for data warehousing solutions.
Implemented event-driven architectures with AWS Lambda, enabling real-time data processing and enhancing system responsiveness.
Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like Map Reduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala. Worked with Spark to improve the efficiency of existing algorithms using Spark Context, Spark SQL, Spark MLlib, DataFrames, Pair RDDs, and Spark on YARN.
Extensive experience in Microsoft Azure cloud computing, GCP, and SQL BI technologies.
Hands-on experience in Azure Cloud Services (PaaS & SaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, Azure Data Lake.
Good experience in tracking and logging end-to-end software application builds using Azure DevOps.
Used SQL Azure extensively for database needs in various applications.
Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their application (CI/CD) to deploying either on public or private cloud.
Strong experience in core Java, Scala, SQL, PL/SQL and Restful web services.
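A minimal sketch of the kind of cross-database validation script referenced above (the cx_Oracle and PyHive client libraries, connection settings, and table names are assumptions for illustration; a simple row-count comparison stands in for fuller column-level checks):

import cx_Oracle
from pyhive import hive

def oracle_count(dsn, user, password, table):
    # Row count of the table on the Oracle source side.
    conn = cx_Oracle.connect(user=user, password=password, dsn=dsn)
    try:
        cur = conn.cursor()
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]
    finally:
        conn.close()

def hive_count(host, table):
    # Row count of the corresponding Hive target table.
    conn = hive.connect(host=host)
    try:
        cur = conn.cursor()
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]
    finally:
        conn.close()

if __name__ == "__main__":
    src = oracle_count("orcl_dsn", "etl_user", "secret", "SALES.ORDERS")  # assumed source
    tgt = hive_count("hive-host.example.com", "sales.orders")             # assumed target
    print(f"Oracle={src} Hive={tgt} -> {'MATCH' if src == tgt else 'MISMATCH'}")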
Overview
11 years of professional experience
Work History
Senior Data Engineer
Humana
08.2023 - Current
Created powerful search solutions with Apache Solr, enabling users to perform detailed and fast full-text searches and complex queries
Built and managed scalable Aerospike clusters that efficiently handled massive data loads, ensuring minimal downtime and high system reliability
Integrated AWS Lambda with third-party APIs to automate data retrieval and processing tasks, reducing manual data entry efforts and improving accuracy
Worked on migration of data from On-Prem SQL server to Cloud database (Azure Synapse Analytics (DW) & Azure SQL DB)
Managed NoSQL database on GCP, designed for large analytical and operational workloads, often used for time-series data or high-throughput applications
Developed and deployed real-time data processing applications using Apache Flink, enabling the analysis of streaming data with latency as low as a few seconds
Developed interactive dashboards using tools like Tableau and Power BI to present complex data findings to stakeholders, enhancing decision-making processes
Developed and managed ETL processes using AWS Glue, enabling seamless data integration from multiple sources and automating data transformation tasks
Built a serverless RESTful API with AWS Lambda and API Gateway for a mobile application, achieving seamless integration with front-end components and improving response times
Created tabular models on Azure analysis services for meeting business reporting requirements
Implemented AWS security best practices, including IAM (Identity and Access Management), to manage user permissions and ensure data compliance with regulations like GDPR
Experienced in data transformations using Azure HDInsight and Hive for different file formats
Developed Spark and Spark SQL code to process data in Apache Spark on Azure HDInsight, performing the necessary transformations based on the source-to-target mappings (STMs) developed
Developed business intelligence solutions using SQL Server data tools to load data to SQL & Azure Cloud databases
Leveraged big data technologies (e.g., Hadoop, Spark) to analyze large datasets, driving insights that informed strategic business decisions
Created an automated data ingestion process utilizing AWS Lambda, which processed and transformed incoming data streams from IoT devices, resulting in a 60% reduction in manual intervention (a minimal handler sketch appears at the end of this section)
Analyzed, designed and built modern data solutions using Azure PaaS services to support visualization of data
Converted Talend joblets to support Snowflake functionality
Used Airflow for scheduling Hive, Spark, and MapReduce jobs (a minimal DAG sketch appears at the end of this section)
Developed Oozie workflows for scheduling and orchestrating the ETL cycle
Transformed date-related data into an application-compatible format by developing Apache Pig UDFs
Configured Zookeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase, and HDFS
Worked with programmer analysts, applying Tableau Server expertise to ETL, Teradata, and other EDW data integrations and development
Used Cloudera for building and managing secure, scalable data lakes that store raw data from various sources; Cloudera integrates with Spark, Hadoop, and other big data technologies for efficient data management
Used AWS Lambda for event-driven serverless computing, automating data processing tasks such as triggering ETL pipelines or handling data events from sources like S3
Consulted on Snowflake data platform solution architecture, design, development, and deployment, focused on bringing a data-driven culture across the enterprise
Worked on Oracle databases, Redshift, and Snowflake
Documented the requirements, including the available code, to be implemented using Spark, Hive, HDFS, HBase, and Elasticsearch
Developed Kafka consumer API in Scala for consuming data from Kafka topics
Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau via HiveServer2 for generating interactive reports
Used Sqoop to channel data from different sources of HDFS and RDBMS
Developed Scala scripts using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop
Experienced in writing live Real-time Processing and core jobs using Spark Streaming with Kafka as a data pipeline system
Integrated Oozie with Hue and scheduled workflows for multiple Hive, Pig and Spark Jobs
Developed Spark applications using Pyspark and Spark-SQL for data extraction, transformation and aggregation from multiple file formats
Involved in modeling different key risk indicators in Splunk and building extensive Hive and SPL queries to understand behavior across the customer life cycle
Utilized AWS Cost Explorer and Trusted Advisor to monitor and optimize cloud resource usage, resulting in significant cost savings
Created and managed Splunk DB Connect identities, connections, inputs, outputs, lookups, and access controls
Created dashboards, reports, and alerts for real-time monitoring in Splunk, Tableau, and Jaspersoft
Performed statistical analysis and predictive modeling using R to uncover trends and patterns, which helped the business make informed decisions
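A minimal sketch of the S3-triggered AWS Lambda ingestion handler referenced in this section; the bucket names, payload fields, and trigger shape are illustrative assumptions rather than project details:

import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Read raw IoT JSON objects dropped into S3, normalize them,
    # and write the cleaned records to a curated bucket.
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        payload = json.loads(raw)

        cleaned = {
            "device_id": payload.get("deviceId"),
            "ts": payload.get("timestamp"),
            "temperature_c": payload.get("temp"),
        }

        s3.put_object(
            Bucket="curated-iot-data",  # assumed target bucket
            Key=f"cleaned/{key}",
            Body=json.dumps(cleaned).encode("utf-8"),
        )

    return {"status": "ok", "records": len(records)}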
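A minimal Airflow DAG sketch for the Hive and Spark job scheduling referenced in this section; the script paths, nightly schedule, and use of BashOperator are assumptions for illustration:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_hive_spark_pipeline",   # assumed DAG name
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",        # run nightly at 02:00
    catchup=False,
) as dag:

    # Stage raw data into Hive (HQL script path is assumed).
    load_hive = BashOperator(
        task_id="load_hive_staging",
        bash_command="hive -f /opt/etl/hql/load_staging.hql",
    )

    # Transform and aggregate with a Spark job (application path is assumed).
    run_spark = BashOperator(
        task_id="run_spark_aggregation",
        bash_command="spark-submit --master yarn /opt/etl/spark/aggregate.py",
    )

    load_hive >> run_spark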
Data Analyst
HCL TECHNOLOGIES LTD
12.2019 - 05.2022
Company Overview: USAA
Worked in an Agile environment and used the Rally tool to maintain user stories and tasks
Implemented Apache Sentry to restrict the access on the Hive tables on a group level
Designed and implemented Kafka topics by configuring them in the new Kafka cluster across all environments
Created multiple dashboards in tableau for multiple business needs
Implemented Partitioning, Dynamic Partitions and Buckets in HIVE for efficient data access
Designed SSIS Packages to extract, transfer, load (ETL) existing data into SQL Server from different environments for the SSAS cubes (OLAP)
Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and Azure Data Lake Analytics
Designed & implemented database solutions in Azure SQL Data Warehouse, Azure SQL
Implemented Composite Server for data virtualization needs and created multiple views for restricted data access using a REST API
Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team using Tableau
Migrated MapReduce jobs to Spark jobs to achieve better performance
Involved in converting MapReduce programs into Spark transformations using Spark RDDs with Scala and Python
Developed Apache Spark applications using Spark for data processing from various streaming sources
Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala (see the sketch at the end of this section)
Queried and analyzed data from Cassandra for quick searching, sorting and grouping through CQL
Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them
Applied advanced Spark techniques such as text analytics and processing, leveraging in-memory computation
Implemented Apache Drill on Hadoop to join data from SQL and NoSQL databases and store it in Hadoop
Brought data from various sources into Hadoop and Cassandra using Kafka
Created and formatted Cross-Tab, Conditional, Drill-down, Top N, Summary, Form, OLAP, sub-reports, ad-hoc reports, parameterized reports, interactive reports, and custom reports using SQL Server Reporting Services (SSRS)
Designed and developed Oracle PL/SQL and shell scripts for data import/export, data conversions, and data cleansing
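A minimal sketch of rewriting a Hive aggregation query as PySpark DataFrame transformations, as referenced in this section; the database, table, and column names are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive_to_spark_example")
    .enableHiveSupport()   # read and write Hive tables directly
    .getOrCreate()
)

# Original Hive query, for reference:
#   SELECT customer_id, SUM(amount) AS total_amount
#   FROM claims.payments
#   WHERE payment_date >= '2020-01-01'
#   GROUP BY customer_id;

payments = spark.table("claims.payments")

totals = (
    payments
    .filter(F.col("payment_date") >= "2020-01-01")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)

# Persist the result back to Hive for downstream reporting.
totals.write.mode("overwrite").saveAsTable("claims.customer_payment_totals")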
Data Engineer
Avon Technologies Pvt Ltd
12.2018 - 12.2019
Company Overview: Proxima
Ingested data from Oracle databases using Sqoop and Flume, ensuring smooth data flow into our systems
Developed a custom Pig User-Defined Function (UDF) to convert various date and timestamp formats from unstructured files into standardized formats
Engaged in hands-on Extract, Transform, Load (ETL) processes with Ab Initio, managing data mapping and transformation in a complex, high-volume environment
Analyzed and managed system logs using tools like Splunk and syslog to ensure data integrity and system performance
Imported and exported data to and from Hadoop Distributed File System (HDFS) and Hive using Sqoop and Kafka
Developed MapReduce programs using Apache Hadoop, allowing efficient processing of large datasets
Validated Sqoop jobs and Shell scripts, ensuring accurate data loading without discrepancies
Also handled migration and testing of both static and transactional data across core systems
Utilized Apache Kafka to enhance data processing by transforming live streaming data with batch processing to generate insightful reports
Proficient in several open-source programming languages, including Perl, Python, Scala, and Java
Wrote scripts for managing HBase tables, including creating, truncating, dropping, and altering tables to store processed data for future analytics (a minimal helper sketch appears at the end of this section)
Designed and implemented self-service reporting solutions in Azure Data Lake Store Gen2 using an ELT (Extract, Load, Transform) approach
Developed data warehouse models in Snowflake for over 100 datasets, utilizing the Cape tool for efficient data management
Worked within Agile methodologies, participating in Scrum stories and sprints while focusing on data analytics and wrangling tasks in a Python environment
Tuned the performance of Phoenix/HBase, Hive queries, and Spark applications to optimize system performance
Installed Kafka to collect data from various sources, storing it for further consumption
Utilized a custom File System plugin to enable seamless access for Hadoop MapReduce programs, HBase, Pig, and Hive
Wrote PySpark and Spark SQL transformations in Azure Databricks to implement complex business rules and data transformations
Extended the functionality of Hive and Pig by writing custom UDFs, UDTFs, and UDAFs to meet specific project needs
Built and maintained a robust environment on Azure's Infrastructure as a Service (IaaS) and Platform as a Service (PaaS)
Implemented best practices for Continuous Integration and Continuous Development using Azure DevOps, ensuring effective code versioning
Architected and implemented medium to large-scale Business Intelligence (BI) solutions on Azure, leveraging various Azure Data Platform services, including Azure Data Lake, Data Factory, and Stream Analytics
Utilized the Azure Portal extensively, including Azure PowerShell, Storage Accounts, and Data Management for efficient operations
Created Azure PowerShell scripts to transfer data between the local file system and HDFS Blob storage
Managed various database and Azure Data Platform services, including Azure Data Lake, Data Factory, SQL Server, and Oracle, successfully building multiple Data Lakes
Developed ETL jobs using Spark-Scala to migrate data from Oracle databases to new Hive tables, ensuring efficient data handling
Gained experience in various scripting technologies, including Python and Unix shell scripts
Created Spark code using Scala and Spark-SQL/Streaming to facilitate quicker testing and processing of data
Developed middleware component services using Java Spring to fetch data from HBase through the Phoenix SQL layer for various web application use cases
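A minimal sketch of the kind of HBase table-management script referenced in this section, written against the happybase Python client as an assumption (the originals may equally have used the HBase shell); the host, table, and column-family names are illustrative:

import happybase

HBASE_HOST = "hbase-thrift.example.com"  # assumed Thrift gateway host

def recreate_table(name, families):
    # Drop the table if it exists, then create it with the given
    # column families -- the effect of a truncate-and-recreate script.
    conn = happybase.Connection(HBASE_HOST)
    try:
        if name.encode() in conn.tables():
            conn.delete_table(name, disable=True)
        conn.create_table(name, families)
    finally:
        conn.close()

if __name__ == "__main__":
    # One column family for processed metrics, keeping a few versions.
    recreate_table("processed_events", {"metrics": dict(max_versions=3)})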
ETL Developer
HI-Gate Infosystems Pvt. Ltd
07.2015 - 11.2018
Company Overview: Barrick Gold Corporation
Developed a Python utility to validate HDFS tables against source tables
Implemented code in Python to retrieve and manipulate data
Designed and implemented data processing systems on GCP using services such as BigQuery, Dataflow, and Dataproc
Built and managed data warehouses and data lakes on GCP, ensuring data integrity and security
Implemented real-time data streaming and processing solutions using GCP services like Pub/Sub and Apache Beam
Designed ETL Process using Informatica to load data from Flat Files, and Excel Files to target Oracle Data Warehouse database
Scheduled different Snowflake jobs using NiFi
Developed and maintained scalable data pipelines for ingesting, processing, and transforming large volumes of data
Designed and optimized data models and schemas for efficient storage and retrieval
Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS
Automated all the jobs for pulling data from FTP server to load data into Hive tables using Oozie workflows
Designed and implemented data ingestion processes to capture and load data from various sources into GCP storage systems such as Cloud Storage or Bigtable
Responsible for developing Python wrapper scripts that extract specific date ranges using Sqoop by passing the custom properties required for the workflow (a minimal wrapper sketch appears at the end of this section)
Involved in filtering data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables
Designed and developed UDFs to extend functionality in both Pig and Hive
Imported and exported data between MySQL and HDFS using Sqoop on a regular basis
Developed a shell script to create staging and landing tables with the same schema as the source and to generate the properties used by Oozie jobs
Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources
Developed Oozie workflows for executing Sqoop and Hive actions
Built various graphs for business decision-making using the Python matplotlib library
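A minimal sketch of a Python wrapper that launches a Sqoop import restricted to a date range, as referenced in this section; the JDBC connection string, credential handling, table, and target paths are assumptions:

import subprocess
import sys

def sqoop_import_range(table, start_date, end_date, target_dir):
    # Build and run a Sqoop import restricted to a date range.
    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://db-host.example.com/sales",  # assumed source
        "--username", "etl_user",
        "--password-file", "/user/etl/.sqoop_pwd",
        "--table", table,
        "--where", f"order_date >= '{start_date}' AND order_date < '{end_date}'",
        "--target-dir", target_dir,
        "--num-mappers", "4",
    ]
    return subprocess.run(cmd, check=True)

if __name__ == "__main__":
    start, end = sys.argv[1], sys.argv[2]  # e.g. 2019-01-01 2019-02-01
    sqoop_import_range("orders", start, end, f"/data/landing/orders/{start}")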
Hadoop Developer
Hewlett Packard
01.2014 - 06.2015
Company Overview: Dot-com team
Installed Oozie workflow engine to run multiple Hive and Pig Jobs
Developed simple to complex MapReduce jobs using Hive and Pig
Developed MapReduce programs for data analysis and data cleaning
Implemented Avro and Parquet data formats for Apache Hive computations to handle custom business requirements
Integrated external data sources and APIs into GCP data solutions, ensuring data quality and consistency
Built data transformation pipelines using GCP services like Dataflow or Apache Beam to cleanse, normalize, and enrich data
Built machine-learning models to showcase big data capabilities using PySpark and MLlib (a minimal model sketch appears at the end of this section)
Designed, implemented, and deployed a series of custom parallel algorithms for various customer-defined metrics and unsupervised learning models within a customer's existing Hadoop/Cassandra cluster
Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster
Extensively used SSIS transformations and tasks such as Lookup, Derived Column, Data Conversion, Aggregate, Conditional Split, SQL Task, Script Task, and Send Mail Task
Performed data cleansing, enrichment, mapping tasks and automated data validation processes to ensure meaningful and accurate data was reported efficiently
Implemented Apache Pig scripts to load data from and store data into Hive
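A minimal PySpark MLlib (pyspark.ml) sketch of the kind of model referenced in this section; the feature table, column names, and choice of logistic regression are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("mllib_demo").getOrCreate()

# Assumed feature table with numeric columns and a binary label.
df = spark.table("analytics.customer_features")

assembler = VectorAssembler(
    inputCols=["visits", "avg_order_value", "days_since_last_order"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="churned")

pipeline = Pipeline(stages=[assembler, lr])
train, test = df.randomSplit([0.8, 0.2], seed=42)

model = pipeline.fit(train)
predictions = model.transform(test)

auc = BinaryClassificationEvaluator(labelCol="churned").evaluate(predictions)
print(f"Test AUC: {auc:.3f}")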