Praveen Kumar

Austin

Summary

  • 9 years of experience as a Data Engineer and Data Analyst, including designing, developing, and implementing data models for enterprise-level applications and systems.
  • Hands-on experience installing, configuring, and using Apache Hadoop ecosystem components such as Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase, ZooKeeper, Sqoop, Kafka, and Storm.
  • Extensive experience developing T-SQL and Oracle PL/SQL scripts, stored procedures, and triggers for business logic implementation.
  • Hands-on working knowledge of Linux, Unix, and Windows operating systems and of AWS, Azure, and Google Cloud Platform, used to create and manage databases in the cloud and analyze data sets for machine learning applications.
  • Experience deploying major software solutions for high-end clients, meeting business requirements such as data processing, ingestion, analytics, and cloud migration from on-premises systems to Azure.
  • Hands-on experience configuring Hadoop clusters in enterprise environments, on VMware, and on Amazon Web Services (AWS) using EC2 instances.
  • Experienced with Splunk and the ELK stack (Elasticsearch, Logstash, and Kibana) for centralized logging; stored logs and metrics in S3 buckets using Lambda functions and used AWS Lambda to run code in AWS without managing servers.
  • Expertise in SQL Server Analysis Services (SSAS) and SQL Server Reporting Services (SSRS) tools.
  • Expertise in synthesizing Machine learning, Predictive Analytics and Big data technologies into integrated solutions.
  • Good understanding and knowledge of Microsoft Azure services such as HDInsight clusters, Blob storage, ADLS, and Data Factory.
  • Involved in writing SQL queries and PL/SQL programming; created new packages and procedures, and modified and tuned existing procedures and queries.
  • Hands on experience in ELK (Elasticsearch, Logstash, and Kibana) stack.
  • Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems (RDBMS) and from RDBMS to HDFS.
  • Experience in designing, building, and implementing a complete Hadoop ecosystem comprising MapReduce, HDFS, Hive, and MongoDB.
  • Experience with various Teradata utilities such as FastLoad, MultiLoad, BTEQ, and Teradata SQL Assistant.
  • Extensively worked with the Teradata utilities BTEQ, FastExport, and MultiLoad to export and load data to and from different source systems, including flat files.
  • Experience with HDI Spark, HDI Hadoop, HDI Hive Interactive, and HDI Kafka.
  • Designed, built, and managed ELK (Elasticsearch, Logstash, and Kibana) clusters.
  • SQL/NoSQL databases: Oracle, Teradata, MongoDB, DynamoDB.
  • Good at system analysis, ER and dimensional modeling, database design, and implementing RDBMS-specific features.
  • Experience working with NoSQL databases (HBase, Cassandra, and MongoDB), database performance tuning, and data modeling.
  • Experience in designing the Conceptual, Logical and Physical data modeling using Erwin and E/R Studio Data modeling tools.
  • Strong experience in writing SQL, PL/SQL, and Transact-SQL programs for stored procedures, triggers, and functions.
  • Expertise in designing complex mappings and in performance tuning of slowly changing dimension tables and fact tables.
  • Experienced in data scrubbing/cleansing, data quality, data mapping, data profiling, and data validation in ETL.
  • Used various Talend Hadoop components such as Hive, Pig, and Sqoop.
  • Implemented a log producer in Scala that watches application logs, transforms incremental log records, and sends them to a Kafka- and ZooKeeper-based log collection platform.
  • Good understanding of NoSQL databases like Cassandra and HBase.
  • Good experience with continuous Integration of application using Jenkins.
  • Extensive experience in SQL Server 2016/2012/2008 Business Intelligence tools - SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS).
  • Experienced in Agile methodologies, including attending daily scrums, maintaining user stories and burndown charts, backlog grooming, and retrospectives.

Overview

10 years of professional experience

Work History

Sr. Data Engineer

Wayside Publishing Ltd
01.2023 - Current
  • As a Data Engineer, provided technical expertise on Hadoop technologies as they relate to the development of analytics
  • Responsible for building scalable distributed data solutions using Big Data technologies such as Apache Hadoop, MapReduce, shell scripting, and Hive
  • Led end-to-end migration project, successfully transferring terabytes of data from on-premises databases to Amazon Redshift, optimizing data storage and query performance
  • Designed and implemented ETL pipelines using AWS Glue and Apache Spark, extracting data from various sources, transforming and cleaning it, and loading it into Redshift for analysis (see the sketch at the end of this role)
  • Developed optimized schema design for Redshift tables, leveraging columnar storage and compression techniques, resulting in improved query execution times
  • Created dimension and fact tables following Kimball methodology, ensuring accurate representation of business data and facilitating reporting
  • Automated data extraction and transformation processes using AWS Data Pipeline, reducing manual intervention and ensuring consistent and timely data updates
  • Designed and implemented AWS CloudFormation templates to provision and manage infrastructure resources for data processing pipelines, resulting in a more efficient and consistent deployment process
  • Developed CI/CD pipelines using Jenkins and YAML-based Jenkinsfiles to automate the deployment of CloudFormation stacks, ensuring reliable and repeatable infrastructure changes
  • Developed custom Python scripts to orchestrate ETL workflows, enhancing data reliability and scalability
  • Collaborated with cross-functional teams including data scientists and analysts to define data requirements, ensuring the delivered solutions met their analytical needs
  • Worked closely with stakeholders to gather business requirements, translating them into technical solutions that aligned with the overall data strategy
  • Created comprehensive technical documentation for ETL processes, ensuring knowledge transfer and providing a resource for troubleshooting and maintenance
  • Set up monitoring and alerting mechanisms using Amazon CloudWatch and AWS CloudTrail, proactively identifying and resolving issues to minimize downtime and data loss
  • Implemented data security measures such as encryption at rest and in transit for sensitive data, ensuring compliance with industry standards and regulations
  • Stayed updated with the latest AWS services and best practices through online courses, webinars, and conferences, applying new knowledge to improve data engineering processes
  • Conducted a thorough analysis of client requirements and existing case studies to design and implement tailored predictive models using GCP Vertex AI and AI/ML skills
  • Led the development of data pipelines using GCP Data Pipeline and GCP Glue, ensuring seamless ETL processes for efficient data processing and model training
  • Worked in an Agile project environment, participating in daily stand-ups, sprint planning, and sprint retrospective meetings to align with project goals and timelines
  • Independently switched technical skills based on project needs, demonstrating adaptability and versatility in handling diverse tasks
  • Ensured the successful deployment of software packages into a fully automated environment, utilizing expertise in software packaging and deployment
  • Collaborated with the DevSecOps team to implement CI/CD pipelines, Blue-Green deployments, and feature toggles using tools such as Git, Jenkins, and uDeploy
  • Documented and designed proposals for the predictive analytics implementation, providing clear communication to both technical and non-technical stakeholders
  • Environment: Hadoop 3.0, MapReduce, Hive 2.3, Agile, HDInsight, Apache Kafka, Azure, AWS, Oozie 5.1, Pig 0.17, HDFS, Spark 2.4, Python, HBase 1.2, OLAP, OLTP, Scala, SSIS, SSRS
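
A minimal, illustrative PySpark sketch of the Spark-to-Redshift ETL pattern described in this role; the bucket path, JDBC URL, credentials, and table names are hypothetical placeholders, and the production pipelines ran on AWS Glue rather than a standalone SparkSession.

    from pyspark.sql import SparkSession, functions as F

    # Placeholder connection details -- not the actual project configuration.
    SOURCE_PATH = "s3://example-bucket/raw/orders/"
    REDSHIFT_URL = "jdbc:redshift://example-cluster:5439/analytics"

    spark = SparkSession.builder.appName("orders-etl-sketch").getOrCreate()

    # Extract: read raw CSV files landed in S3.
    raw = spark.read.csv(SOURCE_PATH, header=True, inferSchema=True)

    # Transform: basic cleansing plus a load timestamp for auditing.
    clean = (
        raw.dropDuplicates(["order_id"])
           .na.drop(subset=["order_id", "customer_id"])
           .withColumn("order_amount", F.col("order_amount").cast("decimal(18,2)"))
           .withColumn("load_ts", F.current_timestamp())
    )

    # Load: append into a Redshift staging table over JDBC
    # (requires the Redshift JDBC driver on the Spark classpath).
    (clean.write.format("jdbc")
          .option("url", REDSHIFT_URL)
          .option("dbtable", "staging.orders")
          .option("user", "etl_user")
          .option("password", "***")
          .mode("append")
          .save())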

Sr. Data Engineer

CVS
06.2021 - 01.2023
  • As a Data Engineer, provided technical expertise on Hadoop technologies as they relate to the development of analytics
  • Responsible for building scalable distributed data solutions using Big Data technologies such as Apache Hadoop, MapReduce, shell scripting, and Hive
  • Configured Azure SQL Database with Azure Storage Explorer and with SQL Server
  • Involved in developing data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL
  • Led the development and optimization of data pipelines on Azure Databricks
  • Involved in setting up and configuring Azure Databricks clusters
  • Contributed to creating and maintaining comprehensive documentation for known issues and solutions, enabling support teams to resolve issues more efficiently
  • Set up and managed monitoring and alerting systems to promptly identify and respond to issues, ensuring minimal downtime and improved system reliability
  • Provided prompt and courteous support to end-users, addressing access issues, permissions, and basic inquiries, and effectively guided them through common usage challenges
  • Developed automation scripts to streamline support tasks, including log analysis, cluster scaling, and error detection, reducing manual effort and improving efficiency
  • Designed and maintained data lake architecture on Azure for storage and retrieval of large datasets
  • Implemented data security measures and access controls for sensitive data within Azure Databricks
  • Created automated data ingestion and transformation processes, reducing manual effort
  • Familiarity with DevOps practices and CI/CD pipelines for deploying Databricks solutions
  • Knowledge of best practices in data warehousing, data modeling, and data governance
  • Implemented various Azure platforms such as Azure SQL Database, Azure SQL Data Warehouse, Azure Analysis Services, HDInsight, Azure Data Lake and Data Factory
  • Developed ADF pipelines to load data from on-premises sources to Azure cloud storage and databases
  • Created data integration and technical solutions for Azure Data Lake Analytics, Azure Data Lake Storage, Azure Data Factory, Azure SQL databases and Azure SQL Data Warehouse for providing analytics
  • Worked on multiple tools like HDI Kafka, HDI Hive, Apache NiFi and Spark to create the flow
  • Worked in Azure environment for development and deployment of Custom Hadoop Applications
  • Used Agile (SCRUM) methodologies for Software Development
  • Wrote complex Hive queries to extract data from heterogeneous sources (Data Lake) and persist the data into HDFS
  • Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight/Databricks (see the Spark SQL sketch at the end of this role)
  • Involved in all phases of data mining, data collection, data cleaning, developing models, validation and visualization
  • Worked on machine learning over large-scale data using Spark and MapReduce
  • Primarily responsible for creating new Azure Subscriptions, data factories, Virtual Machines, SQL Azure Instances, SQL Azure DW instances, HDInsight clusters
  • Designed and implemented scalable Cloud Data and Analytical architecture solutions for various public and private cloud platforms using Azure
  • Designed and developed Big Data analytic solutions on a Hadoop-based platform and engaged clients in technical discussions
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig
  • Responsible for loading and transforming large sets of structured, semi-structured, and unstructured data
  • Developed Spark scripts using Python and Bash shell commands as per requirements
  • Worked with NoSQL databases such as HBase, creating tables to load large sets of semi-structured data coming from source systems
  • Implemented Security in Web Applications using Azure and deployed Web Applications to Azure
  • Responsible for translating business and data requirements into logical data models in support of enterprise data models, ODS, OLAP, OLTP, and operational data structures
  • Created SSIS packages to migrate data from heterogeneous sources such as MS Excel, flat files, and CSV files
  • Developed numerous MapReduce jobs in Scala for Data Cleansing and Analyzing Data in Impala
  • Applied various machine learning algorithms and statistical modeling like decision tree, logistic regression, Gradient Boosting Machine to build predictive model using scikit-learn package in Python
  • Created Data Pipeline using Processor Groups and multiple processors using Apache Nifi for Flat File, RDBMS as part of a POC using Amazon EC2
  • Worked with Microsoft Azure cloud services, storage accounts, Azure data storage, and Azure Data Factory
  • Worked closely with the SSIS and SSRS developers to explain complex data transformation logic
  • Designed Data Marts by following Star Schema and Snowflake Schema Methodology, using industry leading Data modeling tools
  • Worked on Spark Streaming and Apache Kafka to fetch live stream data
  • Architected Big Data solutions for projects and proposals using Hadoop, Spark, the ELK Stack, Kafka, and TensorFlow
  • Responsible for importing and exporting data from different sources such as MySQL and Teradata databases into HDFS using Sqoop, saving data in Avro, JSON, and ORC file formats
  • Administered and configured the ELK Stack (Elasticsearch, Logstash, Kibana) on AWS and performed log analysis
  • Environment: Hadoop 3.0, MapReduce, Hive 2.3, Agile, HDInsight, Apache Kafka, Azure, Oozie 5.1, Pig 0.17, HDFS, Spark 2.4, Python, HBase 1.2, OLAP, OLTP, Scala, SSIS, SSRS
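
A minimal, illustrative PySpark/Spark SQL sketch of the Databricks-style ingestion and ad-hoc aggregation pattern referenced in this role; the ADLS path, column names, and table names are hypothetical placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("adls-ingest-sketch").getOrCreate()

    # Hypothetical ADLS Gen2 location; on Databricks the storage account must
    # already be configured for access (e.g., via a service principal).
    SOURCE = "abfss://raw@examplestorage.dfs.core.windows.net/claims/2023/"

    claims = spark.read.parquet(SOURCE)
    claims.createOrReplaceTempView("claims_raw")

    # Ad-hoc aggregation with Spark SQL.
    daily = spark.sql("""
        SELECT claim_date,
               plan_code,
               COUNT(*)         AS claim_count,
               SUM(paid_amount) AS total_paid
        FROM claims_raw
        GROUP BY claim_date, plan_code
    """)

    # Persist as a partitioned table for downstream reporting
    # (assumes the 'analytics' database already exists).
    (daily.write.mode("overwrite")
          .partitionBy("claim_date")
          .saveAsTable("analytics.daily_claims_summary"))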

Sr. Data Engineer

Johnson & Johnson
04.2020 - 05.2021
  • Worked as a Data Engineer to review business requirements and compose source-to-target data mapping documents
  • Involved in all phases of SDLC using Agile and participated in daily scrum meetings with cross teams
  • Developed Big Data solutions focused on pattern matching and predictive modeling
  • Installed and configured a multi-node cluster in the cloud using Amazon Web Services (AWS) EC2
  • Implemented the Big Data solution using Hadoop, Hive, and Informatica to pull/load the data into the HDFS system
  • Responsible for loading, extracting and validation of client data
  • Used Python programs for data manipulation and to automate the generation of reports and dashboards from multiple data sources
  • Deployed and configured Elasticsearch, Logstash, and Kibana (ELK) for log analytics, full-text search, and application monitoring, integrated with AWS Lambda and CloudWatch
  • Worked on NoSQL databases including Cassandra
  • Implemented multi-data center and multi-rack Cassandra cluster
  • Coordinated with Data Architects on provisioning AWS EC2 infrastructure and deploying applications behind Elastic Load Balancing
  • Used PySpark to create, load, and transform data using SparkContext, RDDs, and DataFrames
  • Designed and developed data pipelines and ETL integration patterns using PySpark (Python on Spark) and Databricks
  • Created sheet selector to accommodate multiple chart types (Pie, Bar, Line) in a single dashboard by using parameters
  • Performed Reverse Engineering of the current application using Erwin, and developed Logical and Physical data models for Central Model consolidation
  • Performed Data Analysis and Data Profiling and worked on data transformations and data quality rules
  • Developed data pipelines to consume data from Enterprise Data Lake (MapR Hadoop distribution - Hive tables/HDFS) for analytics solution
  • Created Hive External tables to stage data and then move the data from Staging to main tables
  • Developed incremental and full-load Python processes to ingest data into Elasticsearch from Hive
  • Created data models for AWS Redshift and Hive from dimensional data models
  • Pulled data from the data lake (HDFS) and massaged it with various RDD transformations
  • Created Oozie workflow and Coordinator jobs to kick off the jobs on time for data availability
  • Developed REST services with Python Flask to write data into Elasticsearch indexes (see the sketch at the end of this role)
  • Worked in data management performing data analysis, gap analysis, and data mapping
  • Environment: Agile, AWS, Hadoop 3.0, Hive 2.3, HDFS, Python, PySpark, Cassandra 3.11, NoSQL
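
A minimal, illustrative Flask sketch of a REST endpoint that writes documents into an Elasticsearch index, in the spirit of the REST-service bullet above; the host, index name, and payload fields are hypothetical, and the example assumes the 8.x elasticsearch Python client.

    from flask import Flask, request, jsonify
    from elasticsearch import Elasticsearch

    app = Flask(__name__)

    # Hypothetical cluster endpoint; real deployments would load this from config.
    es = Elasticsearch("http://localhost:9200")

    @app.route("/events", methods=["POST"])
    def index_event():
        doc = request.get_json(force=True)
        if not doc:
            return jsonify({"error": "empty payload"}), 400
        # Write the JSON document into the 'events' index.
        result = es.index(index="events", document=doc)
        return jsonify({"id": result["_id"], "result": result["result"]}), 201

    if __name__ == "__main__":
        app.run(port=5000)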

Data Analyst/ Data Modeler

Fission Labs
09.2016 - 12.2019
  • Created Oozie workflow jobs for query scheduling and actions and automated the entire CI/CD using Scripts, Git and Jenkins
  • Wrote Hive queries to transform data for downstream processing
  • Created schemas in Hive with performance optimization using indexing, bucketing and partitioning
  • Implemented the big data pipeline with real-time processing using Python, PySpark, and the Hadoop ecosystem
  • Developed Spark code in Python using Spark SQL and DataFrames (see the sketch at the end of this role)
  • Worked on dimensional modeling and maintained dimensions like Product, Customer, and Region as a part of Snowflake schema and loaded the data in the member fact tables
  • Used Sqoop to insert and retrieve data from various RDBMS like Oracle and SQL Server
  • Managed full SDLC processes involving requirements management, workflow analysis, source data analysis, data mapping, metadata management, data quality, testing strategy and maintenance of the model
  • Used SDLC (System Development Life Cycle) methodologies such as RUP and Waterfall
  • Comprehensive knowledge and experience in Software Development Life Cycle (SDLC) with business process models: Waterfall & Agile (Scrum) methodologies, Scaled Agile Framework (SAFe)
  • Responsibilities included gathering business requirements, developing strategies for data cleansing and data migration, writing functional and technical specifications, creating source-to-target mappings, designing data profiling and data validation jobs in DataStage, and creating ETL jobs in DataStage
  • Performed analysis on enterprise data/report integration & provided functional specification to development team to build Enterprise Reporting Systems
  • Collected business requirements to set rules for proper data transfer from Data Source to Data Target in Data Mapping
  • Created data mapping documents mapping Logical Data Elements to Physical Data Elements and Source Data Elements to Destination Data Elements
  • Responsible for various data mapping activities from source systems to Teradata and assisted in overseeing compliance with enterprise data standards
  • Worked with data investigation, discovery and mapping tools to scan every single data record from many sources
  • Performed data mining on claims data using very complex SQL queries and discovered claims patterns
  • Wrote complex SQL queries to validate the data against different kinds of reports generated by Business Objects XI R2
  • Worked on importing and cleansing high-volume data from various sources such as Teradata, Oracle, flat files, and SQL Server 2005
  • Performed data analysis and data profiling using complex SQL on various source systems, including Oracle and Teradata
  • Performed data management projects and fulfilled ad-hoc requests according to user specifications using data management software and tools such as Perl, Toad, MS Access, Excel, and SQL
  • Delivered a foundation of comprehensive requirements & high-level design to transition a paper-based legacy mortgage foreclosure system into a modernized efficient application as part of project
  • Good understanding of the AS-IS and TO-BE business processes (GAP analysis) and experience in converting these requirements into technical specifications for preparing test plans
  • Extracted data from multiple structured and unstructured data sources, transformed, processed and loaded into relational and non-relational databases
  • Performed data analysis, outlier detection, anomaly detection, data profiling, trend analysis, financial/statistical analysis & reporting for new and legacy data sources
  • Automated reconciliation process between 7 different systems to validate the flow of data between them using Shell scripting, Python and Databases
  • Developed Informatica solutions for complex and large volumes of data using SQL Server, Oracle, and fixed-width and delimited files
  • Developed complex SQL code using partitioning, materialized views, stored procedures, functions, cursors, and arrays
  • Re-platformed data from RDBMS systems to the Hadoop ecosystem using Python and native Hadoop tools such as Sqoop, Hive, MapReduce, and YARN
  • Developed data pipeline using Flume, Sqoop and Spark to ingest customer behavioral data and purchase histories into HDFS for analysis
  • Responsible for importing and exporting data from different sources such as MySQL and Teradata databases into HDFS using Sqoop, saving data in Avro, JSON, and ORC file formats
  • Extensively used MS Access to pull the data from various databases and integrate the data
  • Involved in HDFS maintenance and loading of structured and unstructured data
  • Developed and ran MapReduce jobs on multi-petabyte YARN/Hadoop clusters that process billions of events every day to generate daily and monthly reports per user needs
  • Weekly meetings with technical collaborators and active participation in code review sessions with senior and junior developers
  • Managed and reviewed Hadoop log files and was responsible for managing data coming from different sources
  • Performed data processing like collecting, aggregating, moving data from various sources using Apache Flume and Kafka
  • Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed file formats
  • Delivered tuned, efficient, and error-free code for new Big Data requirements using strong technical knowledge of Hadoop and its ecosystem
  • Environment: Hadoop, MapReduce, SDLC, HDFS, Hive, Pig, Sqoop, Java, Red Hat Linux, AWS, XML, MySQL, Eclipse, Kafka, ETL, Python, PySpark, SQL, Teradata.
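
A minimal, illustrative PySpark sketch of the partitioned and bucketed Hive table pattern mentioned in this role; the HDFS path, column names, and table name are hypothetical placeholders.

    from pyspark.sql import SparkSession, functions as F

    # Hive support is required for bucketed, partitioned managed tables.
    spark = (SparkSession.builder
             .appName("hive-bucketing-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical source: customer purchase history already landed in HDFS.
    purchases = spark.read.parquet("hdfs:///data/raw/purchases/")

    enriched = purchases.withColumn("purchase_year", F.year("purchase_ts"))

    # Partition by year for pruning; bucket by customer_id to speed up joins.
    (enriched.write.mode("overwrite")
             .partitionBy("purchase_year")
             .bucketBy(16, "customer_id")
             .sortBy("customer_id")
             .saveAsTable("purchases_bucketed"))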

Data Analyst

Efftronics Systems Pvt. Ltd
06.2014 - 08.2016
  • Worked with Data Analyst for requirements gathering, business analysis and project coordination
  • Responsible for the analysis of business requirements and design implementation of the business solution
  • Translated business concepts into XML vocabularies by designing XML Schemas with UML
  • Gathered business requirements through interviews, surveys with users and Business analysts
  • Involved in designing and developing SQL server objects such as Tables, Views, Indexes (Clustered and Non-Clustered), Stored Procedures and Functions in Transact-SQL
  • Performed data analysis of existing databases to understand the data flow and business rules applied to different databases using SQL
  • Performed data analysis and data profiling using complex SQL on various source systems and answered complex business questions by providing data to business users
  • Used MS Visio and Rational Rose to represent system under development in a graphical form by defining use case diagrams, activity and workflow diagrams
  • Wrote complex SQL, PL/SQL, procedures, functions, and packages to validate data and the testing process
  • Worked in generating and documenting Metadata while designing OLTP and OLAP systems environment
  • Established a business analysis methodology around the RUP (Rational Unified Process)
  • Developed stored procedures in SQL Server to standardize DML transactions such as insert, update and delete from the database
  • Created SSIS package to load data from Flat files, Excel and Access to SQL server using connection manager
  • Developed all the required stored procedures, user defined functions and triggers using T-SQL and SQL
  • Produced various types of reports using SQL Server Reporting Services (SSRS)
  • Environment: XML, T-SQL, SQL, PL/SQL, OLTP, OLAP, SSIS, SSRS

Education

Master of Science - Data Science

Stevens Institute of Technology

Bachelor of Technology - Computer Science

SRM University

Skills

  • Advanced SQL
  • Data Warehousing
  • Hadoop Ecosystem
  • Scala Programming
  • Agile Methodologies
  • Git Version Control
  • NoSQL Databases
  • Machine Learning
  • Spark Development
  • Python Programming
  • ETL Development
  • Performance Tuning
  • Kafka Streaming
  • Continuous Integration
  • Data Visualization
  • Big Data Processing
  • Data Pipeline Design
  • Advanced Analytics
  • Data Science Research Methods
  • AWS (Redshift, SageMaker, S3, EC2 instances)
  • Azure
  • GCP
  • Databricks
