Mrigonav Saikia

Data Scientist
Mechanicsburg, PA

Summary

I specialize in designing, building, and optimizing large-scale data architectures to drive strategic decision-making and operational efficiency. My journey has spanned diverse industries and roles, where I've leveraged advanced data analytics, big data technologies, and machine learning to tackle complex business challenges and unlock data-driven insights.

Currently, I work as a Data Scientist at JPMorgan Chase & Co., where I develop predictive models, automate resiliency testing frameworks, and build ETL pipelines that support enterprise resilience and operational continuity. My expertise spans Spark, Hadoop, Snowflake, AWS, and a range of data engineering and machine learning tools, enabling me to deliver scalable, robust, and efficient solutions.

My approach combines a strong foundation in data engineering with a passion for continuous learning and innovation. I focus on emerging technologies and best practices in data science, cloud computing, and data governance. Thriving in collaborative environments, I have a proven track record of leading projects that enhance system performance, improve data accuracy, and deliver transformative insights to support business resilience and strategic goals.

Work History

Data Scientist

JPMorgan Chase
Plano, Texas
03.2020 - Current

• Machine Learning Model Development: Designed and deployed machine learning models to predict system vulnerabilities, optimizing the organization's resilience strategies.
• Automated Resilience Frameworks: Developed and implemented the ART framework to automate resiliency testing, enabling the simulation of real-world disruption scenarios.
• Risk Analysis & Mitigation: Conducted risk assessments using statistical analysis and predictive modeling to identify potential points of failure in enterprise systems.
• Anomaly Detection: Designed algorithms for real-time anomaly detection in system performance metrics, enhancing disaster recovery mechanisms (see the sketch after this list).
• Cloud Integration: Migrated resilience models to cloud platforms like AWS and Azure, enabling real-time monitoring and analysis.
• Big Data Processing: Leveraged Hadoop and Spark for processing terabytes of data to assess enterprise-wide performance during simulated disruptions.
• Visualization Dashboards: Developed interactive dashboards in Tableau and Power BI to provide actionable insights into system resilience and risk factors.
• Disaster Recovery Optimization: Provided data-driven recommendations to improve disaster recovery plans and reduce downtime during incidents.
• Real-Time Monitoring Solutions: Created systems for monitoring critical infrastructure health in real-time, ensuring rapid response to anomalies.
• Collaboration with Stakeholders: Partnered with IT infrastructure, risk management, and business continuity teams to align analytics solutions with organizational goals.
• Simulation Modeling: Built simulation models for testing the impact of infrastructure failures on enterprise operations, improving predictive accuracy.
• Regulatory Compliance Analysis: Ensured resilience testing and analytics complied with financial industry regulations and internal governance policies.
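
Illustrative sketch for the anomaly-detection bullet above: a minimal, hypothetical example of flagging outliers in system performance metrics with scikit-learn's IsolationForest. The column names and contamination rate are assumptions, not the production implementation.

    # Minimal sketch: flag anomalous system-performance metrics with an
    # unsupervised model. Column names and contamination rate are hypothetical.
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    def flag_anomalies(metrics: pd.DataFrame) -> pd.DataFrame:
        """Label each metrics row with an is_anomaly flag."""
        features = metrics[["cpu_util", "latency_ms", "error_rate"]]  # assumed columns
        model = IsolationForest(contamination=0.01, random_state=42)
        labeled = metrics.copy()
        labeled["is_anomaly"] = model.fit_predict(features) == -1  # -1 marks outliers
        return labeled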

Data Engineer

Discover
Riverwoods, IL
10.2018 - 02.2020
  • Worked on the Snowflake Shared Technology Environment, providing stable infrastructure, a secured environment, reusable generic frameworks, robust design architecture, technology expertise, best practices, and automated SCBD (Secured Database Connections, Code Review, Build Process, Deployment Process) utilities
  • Analyzed and classified multiple biological conditions using Azure Machine Learning techniques such as PCA, PLS-DA, R-SVM, and RF
  • Designed ETL processes using the Pentaho tool to load data from sources to targets with transformations
  • Developed Pentaho big data jobs to load high volumes of data into an S3 data lake and then into a Redshift data warehouse
  • Migrated data from the Redshift data warehouse to the Snowflake database
  • Built dimensional models and a data vault architecture on Snowflake
  • Built a scalable, distributed Hadoop cluster running Hortonworks Data Platform (HDP 2.6)
  • Developed Spark code using Scala and Spark SQL for faster testing and processing of data, and explored optimizations using SparkContext, Spark SQL, and pair RDDs
  • Serialized JSON data and stored it in tables using Spark SQL
  • Used Spark Streaming to collect data from Kafka in near real time, perform the necessary transformations and aggregations to build the common learner data model, and store the results in a NoSQL store (HBase); see the sketch after this list
  • Worked with the Spark framework for both batch and real-time data processing
  • Used Spark MLlib for predictive intelligence and customer segmentation, and for smooth maintenance in Spark Streaming
  • Developed Spark Streaming programs that take data from Kafka and push it to different targets
  • Loaded data from different sources (Teradata, DB2, Oracle, and flat files) into HDFS using Sqoop and then into partitioned Hive tables
  • Created Pig scripts and wrapped them as shell commands to provide aliases for common operations in the project's business flow
  • Implemented partitioning and bucketing in Hive for better organization of the data
  • Created Hive UDFs to hide or abstract complex, repetitive rules
  • Developed Oozie workflows for daily incremental loads that pull data from Teradata and import it into Hive tables
  • Developed Bash scripts to pull log files from an FTP server and process them for loading into Hive tables
  • Scheduled all Bash scripts using the Resource Manager Scheduler
  • Developed MapReduce programs to apply business rules to the data
  • Developed a NiFi workflow to pick up data from the data lake and from servers and send it to a Kafka broker
  • Loaded and transformed large sets of structured data from router locations to the EDW using an Apache NiFi data pipeline
  • Implemented a Kafka event log producer that publishes the Hadoop cluster's logs to a Kafka topic consumed by the ELK (Elasticsearch, Logstash, Kibana) stack for analysis
  • Implemented Apache Kafka as a replacement for a more traditional message broker (JMS Solace) to reduce licensing costs, decouple processing from data producers, and buffer unprocessed messages
  • Implemented a receiver-based Spark Streaming approach in Python, linking with the StreamingContext and handling proper closing and waiting stages
  • Implemented rack topology scripts for the Hadoop cluster
  • Resolved issues related to the old Hazelcast EntryProcessor API
  • Used the Akka toolkit with Scala for several builds
  • Worked with the Talend Administration Console, Talend installation, and context and global map variables in Talend
  • Used dashboard tools like Tableau
  • Used the Talend Administration Console Job Conductor to schedule ETL jobs on daily and weekly bases
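
Illustrative sketch for the Kafka-to-Spark streaming bullets above: a minimal PySpark Structured Streaming job (used here as a stand-in for the receiver-based DStream code described) that reads events from a Kafka topic, parses the JSON payload, and writes the result out. The broker address, topic, schema, and paths are hypothetical.

    # Minimal PySpark sketch of the Kafka -> transform -> sink pattern described
    # above. Broker, topic, schema, and output paths are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("learner-events").getOrCreate()

    schema = StructType([
        StructField("learner_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", LongType()),
    ])

    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
        .option("subscribe", "learner-events")              # assumed topic
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    # Writing to Parquet here; an HBase sink would need a separate connector.
    (events.writeStream
        .format("parquet")
        .option("path", "/data/common_learner_model")              # assumed path
        .option("checkpointLocation", "/chk/common_learner_model")
        .start())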

Data Engineer

Capital One
Richmond, VA
01.2017 - 09.2018
  • Developed highly optimized Spark applications to perform data cleansing, validation, transformation, and summarization activities
  • Built data pipelines consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data
  • Created Spark and Hive jobs to summarize and transform data
  • Used Spark for interactive queries, streaming data processing, and integration with popular NoSQL databases for huge volumes of data
  • Converted Hive/SQL queries into Spark transformations using Spark DataFrames and Scala
  • Used different tools for data integration with various databases and Hadoop
  • Built real-time data pipelines by developing Kafka producers and Spark Streaming applications to consume them
  • Leveraged Apache Airflow to orchestrate the execution of machine learning models for fraud detection and prevention (see the sketch after this list)
  • Scheduled Airflow workflows to trigger model training, validation, and deployment tasks based on predefined schedules or event-driven triggers
  • Integrated Airflow with model training pipelines implemented in frameworks such as TensorFlow or scikit-learn
  • Utilized Airflow's workflow templating and parameterization features to dynamically configure model training experiments and hyperparameters
  • Ingested syslog messages, parsed them, and streamed the data to Kafka
  • Imported data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and then loaded the data into HDFS
  • Exported the analyzed data to relational databases using Sqoop for further visualization and report generation by the BI team
  • Collected and aggregated large amounts of log data using Flume and staged it in HDFS for further analysis
  • Analyzed the data with Hive queries (HiveQL) to study customer behavior
  • Helped DevOps engineers deploy code and debug issues
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting
  • Developed Hive scripts in HiveQL to denormalize and aggregate the data
  • Scheduled and executed workflows in Oozie to run various jobs
  • Implemented business logic in Hive and wrote UDFs to process the data for analysis
  • Addressed issues arising from the huge volume of data and transitions
  • Tracked and documented operational problems, following standards and procedures, using JIRA
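
Illustrative sketch for the Airflow bullets above: a minimal DAG wiring a train -> validate -> deploy sequence on a daily schedule, assuming Airflow 2.x. The DAG id, schedule, and task bodies are hypothetical placeholders rather than the actual fraud-detection pipeline.

    # Minimal Airflow 2.x sketch of a scheduled train -> validate -> deploy flow.
    # DAG id, schedule, and task bodies are hypothetical.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def train_model(**context):
        pass  # e.g. fit a scikit-learn or TensorFlow fraud model here

    def validate_model(**context):
        pass  # e.g. score a holdout set and check metrics against a threshold

    def deploy_model(**context):
        pass  # e.g. promote the approved model artifact to serving

    with DAG(
        dag_id="fraud_model_pipeline",      # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        train = PythonOperator(task_id="train", python_callable=train_model)
        validate = PythonOperator(task_id="validate", python_callable=validate_model)
        deploy = PythonOperator(task_id="deploy", python_callable=deploy_model)
        train >> validate >> deploy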

ETL/Data Analyst

GGK Tech
Hyderabad, India
01.2014 - 07.2015
  • Worked on different modules of enterprise data warehousing, supply chain, and .com projects
  • Served as a business analyst, interacting with product and technical managers of the source systems to define business transformation logic
  • Participated in the design, development, testing, documentation, and implementation of the project
  • Worked across the full project life cycle: design, data modeling, requirements gathering, unit/QA testing, and production
  • Served as the onsite coordinator for offshore team communication
  • Designed parallel, partitioned Ab Initio graphs using GDE components for a high-volume data warehouse
  • Worked with continuous components and XML components
  • Used Teradata SQL Assistant to interface with Teradata
  • Implemented Control-M as the primary scheduling and automation tool for managing data warehouse ETL processes
  • Scheduled Control-M jobs to extract data from source systems, transform it according to business rules, and load it into the data warehouse
  • Orchestrated complex data workflows spanning multiple systems and environments using Control-M's workflow orchestration capabilities
  • Integrated Control-M with version control systems and deployment pipelines for seamless CI/CD (Continuous Integration/Continuous Deployment) of ETL code and configurations

Education

M.S - Information Technology & Cybersecurity

New England College, Henniker, NH
08.2024

M.S - Data Analytics

New England College, Henniker, NH
12.2022

M.S - Systems & Engineering Management

Texas Tech University, Lubbock, TX
12.2017

B.Tech -

National Institute of Technology, Silchar, India
12.2014

Skills

  • Hadoop
  • Big Data
  • HDFS
  • MapReduce
  • Yarn
  • HBase
  • Pig
  • Hive
  • Sqoop
  • Flume
  • Oozie
  • Zookeeper
  • Splunk
  • Hortonworks
  • Cloudera
  • SQL
  • Python
  • R
  • Scala
  • Spark
  • Linux shell scripts
  • RDBMS
  • MySQL
  • DB2
  • MS-SQL Server
  • Teradata
  • PostgreSQL
  • NoSQL
  • MongoDB
  • Cassandra
  • Snowflake
  • Tableau
  • Spyder
  • SSIS
  • Informatica Power Center
  • Pentaho
  • Talend
  • Microsoft Visio
  • ER Studio
  • Erwin
  • R-tidyr
  • Tidyverse
  • Dplyr
  • Reshape
  • Lubridate
  • Beautiful Soup
  • Numpy
  • Scipy
  • Matplotlib
  • Python-twitter
  • Pandas
  • Scikit-learn
  • Keras
  • Regression
  • Clustering
  • MLlib
  • Linear Regression
  • Logistic Regression
  • Decision Tree
  • SVM
  • Naive Bayes
  • KNN
  • K-Means
  • Random Forest
  • Gradient Boost
  • Adaboost
  • Neural Networks
  • Time Series Analysis
  • Machine Learning
  • Deep Learning
  • Data Warehouse
  • Data Mining
  • Data Analysis
  • Big data
  • Visualizing
  • Data Munging
  • Data Modelling
  • SnowSQL
  • Amazon Web Services
  • AWS
  • Microsoft Azure
  • Google Cloud Platform
  • GCP
  • EMR
  • EC2
  • S3
  • RDS
  • Cloud Search
  • Redshift
  • Data Pipeline
  • Lambda
  • JIRA
  • MS Excel
  • Power BI
  • QlikView
  • Qlik Sense
  • D3
  • SSRS
  • Pycharm
  • Agile
  • Scrum
  • Waterfall

Languages

English: Full Professional
Hindi: Full Professional
Assamese: Native/Bilingual

Certification

  • Data Science A-Z Hands-on Exercises and ChatGPT Prize [2024]
  • Advanced Python: Working With Data
  • Advanced Snowflake: Deep Dive Cloud Data Warehousing and Analytics
  • Machine Learning A-Z: Hands-on Python & R in Data Science
  • Amazon Web Services: Data Services
  • Apache Spark Essential Training: Big Data Engineering
  • Data Engineering Pipeline Management with Apache Airflow
  • End-to-End Data Engineer: Python for Data Science with Real Exercises

Work Preference

Work Type

Contract Work, Full Time

Work Location

Remote

Important To Me

Company Culture, Healthcare benefits, Work from home option, Stock Options / Equity / Profit Sharing, Team Building / Company Retreats, Career advancement, Personal development programs

Work Availability

Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
Morning, Afternoon, Evening

Timeline

Data Scientist - JPMorgan Chase
03.2020 - Current
Data Engineer - Discover
10.2018 - 02.2020
Data Engineer - Capital One
01.2017 - 09.2018
ETL/Data Analyst - GGK Tech
01.2014 - 07.2015
New England College - M.S, Information Technology & Cybersecurity
New England College - M.S, Data Analytics
Texas Tech University - M.S, Systems & Engineering Management
National Institute of Technology - B.Tech,