SANDHYA ISAI B

Summary

Experienced Big Data Engineer with expertise in advanced Apache Spark processing in Scala, Python, and Java. Proficient in Spark Streaming, the DataFrame API, and Spark SQL for large-scale data processing. Strong background in SQL performance tuning with Hive/Impala and in dashboarding with Elasticsearch and Kibana. Skilled in integrating Spark streaming jobs with Apache Kafka and AWS Kinesis. Expertise in building automated ETL pipelines with a focus on data flow, error handling, and recovery. Knowledgeable in setting up and tuning Spark clusters on YARN, Mesos, and standalone deployments. Experienced with AWS services such as EMR, Glue, S3, Athena, and Lambda, with a solid understanding of data warehousing, physical table design, and job scheduling tools such as Airflow and AWS Data Pipeline.
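
The summary above highlights integrating Spark streaming jobs with Apache Kafka; the snippet below is a minimal, illustrative PySpark Structured Streaming sketch of that pattern. The broker address, topic name, schema, and output paths are placeholders rather than details from any listed engagement, and the spark-sql-kafka connector is assumed to be on the Spark classpath.

    # Minimal Spark Structured Streaming sketch: read a Kafka topic, parse JSON,
    # and write micro-batches to Parquet. Broker, topic, schema, and paths are
    # placeholders; the spark-sql-kafka connector must be on the classpath.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    schema = (StructType()
              .add("event_id", StringType())
              .add("event_time", TimestampType())
              .add("payload", StringType()))

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
           .option("subscribe", "events")                        # placeholder topic
           .load())

    parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), schema).alias("data"))
              .select("data.*"))

    query = (parsed.writeStream
             .format("parquet")
             .option("path", "/tmp/events_parquet")              # placeholder sink
             .option("checkpointLocation", "/tmp/events_checkpoint")
             .start())
    query.awaitTermination()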

Overview

7 years of professional experience

Work History

Sr. Big Data Engineer

Bank of America
Chicago, IL
01.2024 - Current
  • Extract, transform, and load data from source systems to Azure data storage services using a combination of Spark SQL and PySpark (a minimal sketch of this pattern appears after this role's environment list)
  • Ingest data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL) and process it in Azure Databricks
  • Developed Big Data projects using open-source tools including Hadoop, Hive, HDP, Pig, Flume, Storm, and MapReduce
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns
  • Continuously monitored and managed the Hadoop cluster using Cloudera Manager
  • Implemented Spark using Python and Spark SQL for faster processing of data
  • Sound knowledge and hands-on experience with NLP, MapR, the IBM InfoSphere suite, Storm, Flink, Talend, ER Studio, and Ansible
  • Actively work on Azure SQL Database, Azure SQL Data Warehouse, ADF v2, Blob Storage, and PolyBase, and use SSIS within the ADF environment for scripting, calling APIs, and similar tasks
  • Experienced in the path from data discovery to data provisioning into Azure SQL Data Warehouse, leveraging SSIS to create facts and dimensions and ADF to automate and provision inbound data
  • Used the Spark-Cassandra Connector to load data to and from Cassandra
  • Implemented test scripts to support test-driven development and continuous integration
  • Moved data between HDFS and an Oracle database in both directions using Sqoop
  • Extensively involved in the Installation and configuration of Cloudera Hadoop Distribution
  • Analyzed business requirements, facilitating planning and development phases during client interactions
  • Worked on data pre-processing and cleaning to perform feature engineering, and applied data imputation techniques for missing values in the dataset using Python
  • Implemented Spring MVC architecture and the Spring bean factory using IoC and AOP concepts
  • Developed a data pipeline in Azure Databricks to load and transform text, fault, failure, and part-attribute information from a variety of text sources as needed for part identification models
  • Delivered the components that take text data from the raw layer for each source all the way to model input for the machine learning models
  • Developed a REST API endpoint for MLflow Model Serving on Azure
  • Involved in configuring the Azure platform for data pipelines (ADF, Azure Blob Storage, and Data Lakes) and building workflows to automate data flow using ADF
  • Involved in developing real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster
  • Prepared supporting documents as part of data engineering tasks for Smart Part ID Project
  • Utilized Hive partitioning and bucketing and performed various kinds of joins on Hive tables
  • Maintained thorough documentation and instructions for each component and the overall Data Engineering pipeline
  • Analyze, design, and build modern data solutions using Azure PaaS services to support data visualization
  • Understand current Production state of application and determine the impact of new implementation on existing business processes
  • Architect and implement medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, Databricks)
  • Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply MLflow Model Serving
  • Led the team in developing real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster
  • Performed several data transformations with the Apache Flink Java API
  • Responsible for estimating cluster size, monitoring, and troubleshooting the Azure Databricks cluster
  • Used Apache Spark DataFrames and Spark SQL extensively while designing and developing PoCs
  • Environment: Azure Databricks, MLflow (Model Serving, Model Enabling), Python, Pandas, PySpark, Azure Data Lake Storage, Azure SQL Server, Azure Blob Storage, scikit-learn, Azure ML cluster, Databricks Notebooks, GitLab, REST API, Azure Data Factory, Azure SQL Data Warehouse.
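
As referenced in the first bullet of this role, the following is a hedged PySpark sketch of a typical ETL flow into Azure Data Lake Storage from Databricks. The storage account, container, column names, and output format are illustrative assumptions rather than the actual pipeline.

    # Illustrative PySpark ETL into Azure Data Lake Storage Gen2 from Databricks.
    # Storage account, container, and column names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("adls-etl-sketch").getOrCreate()

    # Read raw CSV landed in the lake (abfss path is a placeholder).
    raw = (spark.read
           .option("header", "true")
           .csv("abfss://raw@examplestorage.dfs.core.windows.net/transactions/"))

    # Basic cleanup and aggregation with the DataFrame API.
    cleaned = (raw.dropDuplicates(["transaction_id"])
               .withColumn("amount", F.col("amount").cast("double"))
               .filter(F.col("amount").isNotNull()))

    daily = (cleaned.groupBy(F.to_date("transaction_ts").alias("txn_date"))
             .agg(F.sum("amount").alias("total_amount"),
                  F.count("*").alias("txn_count")))

    # Write the curated output back to the lake (Delta would be the usual choice
    # on Databricks; Parquet keeps the sketch dependency-free).
    (daily.write
     .mode("overwrite")
     .parquet("abfss://curated@examplestorage.dfs.core.windows.net/daily_totals/"))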

Sr. Data Engineer

USAA
San Antonio, Texas
09.2022 - 12.2023
  • Designed and set up an Enterprise Data Lake to support various use cases, including analytics, processing, storage, and reporting of voluminous, rapidly changing data
  • Responsible for maintaining quality reference data in the source by performing operations such as cleaning and transformation, and ensuring integrity in a relational environment while working closely with stakeholders and the solution architect
  • Designed and developed Security Framework to provide fine-grained access to objects in AWS S3 using AWS Lambda, DynamoDB
  • Set up Kerberos authentication principals to establish secure network communication on the cluster and tested HDFS, Hive, Pig, and MapReduce access for new users
  • Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, and S3
  • Implemented machine learning algorithms in Python to predict the quantity a user might order for a specific item, enabling automatic suggestions, using Kinesis Firehose and an S3 data lake
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB
  • Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs
  • Imported data from different sources such as HDFS/HBase into Spark RDDs and performed computations using PySpark to generate the output response
  • Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce EC2 costs (a minimal sketch appears after this role's environment list)
  • Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS packages)
  • Coded Teradata BTEQ scripts to load and transform data and to fix defects such as SCD Type 2 date chaining and duplicate cleanup
  • Developed reusable framework to be leveraged for future migrations that automates ETL from RDBMS systems to the Data Lake utilizing Spark Data Sources and Hive data objects
  • Conducted data blending and data preparation using Alteryx and SQL for Tableau consumption and published data sources to Tableau Server
  • Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near-real-time log analysis and end-to-end transaction monitoring
  • Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction
  • Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with tasks running on Amazon SageMaker
  • Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake, Apache Pig, Python, SSRS, Tableau.
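
As referenced in the Boto3 bullet above, this is a minimal sketch of a Lambda handler that deregisters AMIs not marked as in use. The region list and the "InUse" tag convention are assumptions for illustration; the original cleanup logic may have differed.

    # Hedged sketch of a Boto3 Lambda handler that deregisters AMIs not tagged
    # as in use. The region list and the "InUse" tag are illustrative assumptions.
    import boto3

    REGIONS = ["us-east-1", "us-west-2"]  # placeholder application regions

    def lambda_handler(event, context):
        deregistered = []
        for region in REGIONS:
            ec2 = boto3.client("ec2", region_name=region)
            # Only consider AMIs owned by this account.
            for image in ec2.describe_images(Owners=["self"])["Images"]:
                tags = {t["Key"]: t["Value"] for t in image.get("Tags", [])}
                if tags.get("InUse") != "true":  # assumption: usage tracked via a tag
                    ec2.deregister_image(ImageId=image["ImageId"])
                    deregistered.append(image["ImageId"])
        return {"deregistered": deregistered}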

Data Engineer

Humana Healthcare
Louisville, KY
06.2019 - 08.2022
  • Running Spark SQL operations on JSON, converting the data into a tabular format with DataFrames, then saving and publishing the data to Hive and HDFS (a minimal sketch appears after this role's environment list)
  • Developing and refining shell scripts for data input and validation with various parameters, as well as developing custom shell scripts to execute Spark jobs
  • Creating Spark tasks by building RDDs in Python and data frames in Spark SQL to analyze data and store it in S3 buckets
  • Working with JSON files, parsing them, saving data in external tables, and altering and improving data for future use
  • Taking part in design, code, and test inspections to discover problems throughout the life cycle
  • Explaining technical considerations and upgrades to clients at appropriate meetings
  • Creating data processing pipelines by building spark jobs in Scala for data transformation and analysis
  • Working with structured and semi-structured data to process data for ingestion, transformation, and analysis of data behavior for storage
  • Using the Agile/Scrum approach for application analysis, design, implementation, and improvement as stated by the standards
  • Creating Hive tables and dynamically loading data into them for EDW tables and historical metrics, utilizing partitioning and bucketing
  • Performed Linux actions on the HDFS server for data lookups, job changes if any commits were disabled, and data storage rescheduling
  • Using SQL queries, tested and validated database tables in relational databases and performed data validation and data integration
  • Collaborate with SA and Product Owners to gather needs and analyze them for documentation in JIRA user stories for technical and business teams to enhance the requirements
  • Installed and configured Hive, Pig, Sqoop, and Oozie on the Hadoop cluster, setting up and benchmarking Hadoop clusters for internal use
  • Installed and configured Hadoop ecosystem components such as Hive, Oozie, and Sqoop on the Cloudera Hadoop cluster to help with performance tuning and monitoring
  • Loaded the aggregated data onto Oracle from the Hadoop environment using Sqoop for reporting on the dashboard
  • Worked closely with DevOps team to understand, design, and develop end-to-end flow requirements by utilizing Oozie workflow to do Hadoop jobs
  • Developed and implemented data acquisition jobs in Scala using Sqoop, Hive, and Pig, optimizing MapReduce jobs to use HDFS efficiently through various compression mechanisms with the help of Oozie workflows
  • For the data exploration stage, used Hive to derive important insights from the processed data in HDFS
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce for loading data into HDFS, and extracted the data from MySQL into HDFS using Sqoop
  • Used UDFs to implement business logic in Hadoop by using Hive to read, write, and query the Hadoop data in HBase
  • Environment: MapReduce, Hive 2.0, Pig, Hadoop, MySQL, Cloudera Manager (CDH), Sqoop, Oozie, NoSQL, Eclipse, MobaXterm, Linux, Shell, JIRA, Confluence, Jupyter, SQL, HDFS, Spark, Python, AWS, PuTTY.
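
As referenced in the first bullet of this role, the snippet below sketches the JSON-to-Hive flow: read JSON into a DataFrame, reshape it with Spark SQL, and write a partitioned Hive table. The HDFS path, database, and column names are placeholders.

    # Minimal sketch of the JSON-to-Hive flow: read JSON into a DataFrame,
    # reshape it with Spark SQL, and write a partitioned Hive table.
    # The HDFS path, database, and column names are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("json-to-hive-sketch")
             .enableHiveSupport()
             .getOrCreate())

    events = spark.read.json("hdfs:///data/raw/events/")  # placeholder path
    events.createOrReplaceTempView("events_raw")

    flattened = spark.sql("""
        SELECT member_id,
               claim_type,
               CAST(claim_amount AS DOUBLE) AS claim_amount,
               to_date(event_ts)            AS event_date
        FROM events_raw
        WHERE member_id IS NOT NULL
    """)

    # Partition by date so downstream Hive queries can prune partitions.
    (flattened.write
     .mode("append")
     .partitionBy("event_date")
     .format("parquet")
     .saveAsTable("edw.claims_events"))  # placeholder database.table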

Data Analyst

ICICI
Hyderabad, India
04.2018 - 05.2019
  • Responsible for gathering requirements from Business Analysts and identifying the necessary data sources for requests
  • Utilized SAS Programs to efficiently convert Excel data into Teradata tables, streamlining data processing tasks
  • Worked on importing and exporting large volumes of data between files and Teradata, ensuring seamless data transfer
  • Actively reviewed extensive datasets comprising over 208 unique variables and 4,700 rows, utilizing Excel and Python for thorough analysis (a minimal sketch appears after this role's environment list)
  • Leveraged Python's diverse data science packages such as Pandas, NumPy, SciPy, Scikit-learn, and NLTK to extract valuable insights from the data
  • Dedicated to maintaining the stability of the production, testing (QA), and development environments for our distributed Data Warehouse application
  • Actively participated in code review sessions to optimize the performance of existing SSRS reports and enhance Dataset query efficiency
  • Utilized MDX scripting to execute queries on OLAP cubes, ensuring efficient data retrieval for analysis
  • Conducted detailed data analysis, examining the structure, content, and quality of the data from source systems and data samples, utilizing SQL and Python
  • Created OLAP cubes, dimensions, business KPIs, and MDX queries using Analysis Services, ensuring efficient data retrieval and analysis
  • Utilized inner join and outer join techniques while creating tables from multiple tables, ensuring accurate data representation
  • Created classes with Dimension, Detail, and Measure objects, along with developing Custom hierarchies to support drill-down reports, enhancing data visualization capabilities
  • Optimized the capabilities and performance of the universe by creating derived tables in BO Designer, ensuring efficient data processing and retrieval
  • Environment: Tableau, Teradata, BTEQ, VBA, Python, SAS, SQL, Windows, Excel, Pandas, NumPy, SciPy, Scikit-learn, NLTK, SSRS, MDX, Tabular Modeling, SSDT, OLAP cubes, Inner Join.
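
As referenced in the dataset-review bullet above, this is an illustrative pandas/scikit-learn snippet for profiling a wide Excel extract before analysis. The file name and the standardization step are assumptions, not the original workflow.

    # Illustrative pandas/scikit-learn profiling of a wide Excel extract.
    # File name and the standardization step are assumptions.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_excel("customer_extract.xlsx")  # placeholder source file

    # Quick structural and quality checks across all variables.
    print(df.shape)                                                # rows x variables
    print(df.isna().mean().sort_values(ascending=False).head(10))  # worst missing-value rates
    print(df.describe(include="all").T.head(10))                   # per-variable summary

    # Standardize numeric columns before any downstream modeling.
    numeric_cols = df.select_dtypes("number").columns
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])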

Jr. Data Analyst

Crescendo Global
Hyderabad, India
05.2017 - 03.2018
  • Developed and implemented data acquisition jobs using Scala, Sqoop, Hive, and Pig to optimize MapReduce jobs, effectively managing HDFS storage and improving data processing speed through various compression techniques with Oozie workflow automation
  • In the data pre-processing phase, leveraged Apache Spark to clean and transform large datasets by removing missing data, ensuring data integrity and readiness for analysis (a minimal sketch appears at the end of this role)
  • Utilized Hive for data exploration, executing complex queries to extract insights from processed datasets stored in HDFS, and driving key business decisions through effective data analysis
  • Imported data from multiple sources, including relational databases such as MySQL, into HDFS using Sqoop, and performed data transformations using Hive and MapReduce to prepare the data for analytical processing
  • Created and utilized User-Defined Functions (UDFs) to implement advanced business logic within Hadoop, integrating with HBase to read, write, and query large datasets, ensuring seamless data flow for analytical purposes
  • Managed and monitored Hadoop clusters using Cloudera Manager, ensuring high availability and performance by regularly updating operating systems, applying patches, and upgrading Hadoop versions as required
  • Developed data pipelines using Sqoop, Pig, and Hive to ingest and process large datasets, including customer, clinical, biometrics, lab, and claims data, optimizing the flow of data into HDFS for further analysis and reporting
  • Designed and developed Proof of Concepts (POCs) in Spark using Scala, performing comparative analysis with Hive and SQL/Oracle to validate the performance improvements and scalability offered by Spark in a Big Data environment
  • Employed Oozie workflow engine to automate the execution of multiple Hive and Pig scripts, integrating with Kafka for real-time processing of streaming data, facilitating efficient navigation and loading of log file data into HDFS
  • Worked with a variety of actions in Oozie, including Sqoop, Pig, Hive, Shell, and Java actions, to design complex workflows that enhanced automation and data processing efficiency in a distributed environment
  • Analyzed substantial amounts of structured and unstructured data to determine the optimal strategies for aggregating and reporting on datasets, improving the overall efficiency of data-driven decision-making processes.
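
As referenced in the pre-processing bullet above, the following is a minimal PySpark cleaning sketch: drop rows missing key fields, fill remaining gaps with defaults, and normalize a date column. Paths and column names are placeholders.

    # Minimal PySpark cleaning sketch: drop rows missing key fields, fill
    # remaining gaps with defaults, and normalize a date column.
    # Paths and column names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cleaning-sketch").getOrCreate()

    raw = spark.read.parquet("hdfs:///data/staging/records/")  # placeholder path

    cleaned = (raw
               .dropna(subset=["record_id", "event_date"])      # mandatory fields
               .fillna({"amount": 0.0, "status": "unknown"})    # sensible defaults
               .withColumn("event_date", F.to_date("event_date")))

    cleaned.write.mode("overwrite").parquet("hdfs:///data/clean/records/")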

Skills

Competencies:
Hadoop Cluster Setup, Hive, Pig, Sqoop, Oozie, Spark, Data Transformation, Data Analysis, Data Visualization, ETL Processes, Data Pipelines, SQL Performance Tuning, Real-time Data Processing, Workflow Automation, Big Data Analytics, Cloudera Manager

Programming Tools & Languages:
Tableau, Power BI, Python visualizations, Excel dashboards, T-SQL, Java, Scala, PL/SQL, SQL, C, XML, HTTP, MATLAB, DAX, Python, R, SAS E-Miner, SAS, SQL Server, MS Access, Oracle, Teradata, Cassandra, Neo4j, MongoDB, Git, GitHub, Anaconda Navigator, Jupyter Notebook, Azure Data Factory, Azure Databricks, Azure Analysis Services, Looker, Smart View, Nexus

Timeline

Sr. Big Data Engineer

Bank of America
01.2024 - Current

Sr. Data Engineer

USAA
09.2022 - 12.2023

Data Engineer

Humana Healthcare
06.2019 - 08.2022

Data Analyst

ICICI
04.2018 - 05.2019

Jr. Data Analyst

Crescendo Global
05.2017 - 03.2018