
DINESH SURAM

Dallas, Texas

Summary

A seasoned Data Engineer with extensive experience in ETL development, data warehousing, and cloud technologies, primarily with Snowflake, AWS, and Azure. Demonstrated expertise in building robust data infrastructure, optimizing data processing, and ensuring high data quality across multinational environments. Proficient in a variety of technical tools including Python, SQL, and Spark, with a strong background in both Agile and Scrum methodologies. Holds a Master of Science in Data Science from The University of Texas at Dallas and a Bachelor of Technology in Chemical Engineering from the National Institute of Technology, Warangal. Certified in AWS Cloud and Snowflake, with a proven track record of leading data-driven projects to successful completion. Eager to leverage analytical skills and technological proficiency to contribute to innovative data solutions.

Overview

10 years of professional experience
2 Certifications

Work History

ETL Developer/Data Scientist

Toyota Financial Services
2022.10 - 2024.08
  • Created high-level and detailed-level designs
  • Conducted design reviews and design verification
  • Created final work estimates
  • Created a new data platform in AWS using services such as EC2, S3 buckets, and IAM roles
  • Planned and charted the program, working with architects and data steward managers on program execution
  • Worked extensively in cross-functional, cross-vendor, and offshore teams
  • Formulated the plan to assess and capture usable lineage of data attributes in the existing system and templated the requirements procedure
  • Developed Data Vault Models tailored to various use cases, designing specific models for each of the five countries involved in the project.
  • Designed and developed jobs that extracted data from source databases (Mexico, Brazil, Canada, Colombia, Puerto Rico) using DB connectors, and utilized the IBM Sterling tool for the MFT process to securely transfer files from various countries for data integration
  • Implemented Snowflake's cloud data warehousing solution to consolidate data silos into a single source of truth, enhancing data accessibility and integrity for real-time analytics across the organization
  • Designed and executed data migration to Snowflake, utilizing its scalable compute and storage capabilities to optimize data processing speeds and reduce infrastructure costs
  • Leveraged Snowflake's unique architecture to perform complex SQL queries on large datasets without impacting performance, resulting in a 50% decrease in query execution time
  • Utilized Snowflake's Time Travel and Zero-Copy Cloning features to improve data recovery processes and support efficient environment management for development, testing, and production workflows
  • Developed and optimized ETL pipelines using Snowflake's native capabilities, such as Snowpipe for continuous data ingestion and Streams for real-time data processing (Colombia, Puerto Rico)
  • Implemented role-based access control within Snowflake to enhance data security and compliance with regulatory requirements, ensuring that sensitive data is protected, and access is audited
  • Conducted regular performance tuning and optimization of Snowflake environments, achieving significant cost savings by optimizing data storage and compute resources
  • Created automated data transformation scripts using Snow SQL to facilitate complex data manipulation tasks, reducing manual effort and minimizing the risk of errors
  • Integrated Snowflake with various BI tools and platforms such as Tableau and PowerBI, enabling advanced data visualization and analytics capabilities for business users
  • Provided training and support to team members on best practices for using Snowflake, enhancing team productivity and ensuring efficient use of the platform across the organization
  • Used Snowflake zero-copy cloning to clone databases for DEV and QA environments
  • Developed ingestion frameworks for generating and executing SnowSQL scripts in Python (Canada, Brazil, Mexico); a minimal sketch of this pattern follows this section
  • Deployed ETL Process for Production implementation, resolving issues, monitoring performance, performance tuning of SQL queries, Data Transformation, Data Validation, Data Modeling, mapping documentation
  • Worked on the ingestion process to load data into S3 buckets on a daily basis
  • End-to-end implementation, maintenance, optimization, and enhancement of the application
  • Created SQL queries, performed performance tuning, and used the FLATTEN table function to produce lateral views of VARIANT, OBJECT, and ARRAY columns
  • Worked in an Agile/Scrum environment
  • Created Python scripts to generate user-specific reports and email them on a schedule
  • Extensively used materialized views for designing Fact tables and Dim tables
  • Ensured that operational and analytical data warehouses can support all business requirements for business reporting
  • Developed Unix shell scripts and worked on Python scripts for controlled execution of DataStage jobs
  • Shared sample data with customers for UAT by granting access
  • Extensively worked on dimensional modeling and data loads into dimension and fact tables
  • Deployed the ETL process for UAT implementation, change management, and ETL testing
  • Led the team and ensured on-time delivery with high quality and minimal defects, in keeping with Toyota Financial Services' quality standards
  • Maintained data pipelines to support the development of risk models, including Logistic Regression, Random Forest, and XGBoost, which were used to predict customer defaults and assess financial risks
  • Collaborated with data scientists to build Random Forest models for generating risk scores, integrating data from five countries
  • Contributed to identifying key features like income level and credit history that significantly influenced risk prediction
  • Supported the implementation of XGBoost models for predicting adverse financial events
  • Provided the data infrastructure and conducted data cleaning processes, which improved model performance by 30%
  • Assisted in hyperparameter tuning of XGBoost models by analyzing the distribution of the underlying risk data, contributing to a reduction in false positives
  • Engineered datasets for Logistic Regression models that classified customers into risk categories
  • Provided clean, normalized data and managed missing values, leading to a significant improvement in classification accuracy
  • Contributed to feature selection by analyzing transactional patterns, which directly enhanced the model's ability to predict high-risk customers
  • Aggregated and pre-processed time-series data to forecast future risk trends using ARIMA models
  • Collaborated with data scientists to identify seasonal patterns and cyclical trends that were critical to accurate forecasting
  • Verified production code, supported the first three production executions, and transitioned the code and process to the maintenance/support team
  • Coordinated with business partners, analytical teams, and stakeholders to provide status reporting
  • Actively participated in team meetings, day-to-day calls, meeting reviews, status calls, and batch reviews
  • Environment: Unix/Linux, Snowflake, Jenkins, Shell, Python, Artifactory, ECR, Autosys, Git, OpenShift
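
A minimal sketch of the zero-copy cloning and FLATTEN work referenced above, run as SnowSQL from Python. It is illustrative only, not the actual ingestion framework: the account settings, database names (PROD_DB, QA_DB), table (RAW_PAYMENTS), and VARIANT column (payload) are assumptions.

# Hypothetical sketch: clone a database for QA and lateral-view a VARIANT column.
import os

import snowflake.connector  # pip install snowflake-connector-python


def get_connection():
    """Open a Snowflake connection from environment variables (hypothetical names)."""
    return snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        role="SYSADMIN",
        warehouse="ETL_WH",
    )


def clone_for_qa(cur):
    # Zero-copy clone: a metadata-only copy that adds no storage until data diverges.
    cur.execute("CREATE OR REPLACE DATABASE QA_DB CLONE PROD_DB")


def flatten_contracts(cur):
    # LATERAL FLATTEN expands each element of a VARIANT array into its own row.
    cur.execute(
        """
        SELECT t.source_country,
               f.value:contract_id::STRING    AS contract_id,
               f.value:balance::NUMBER(18, 2) AS balance
        FROM PROD_DB.RAW.RAW_PAYMENTS t,
             LATERAL FLATTEN(input => t.payload:contracts) f
        """
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = get_connection()
    try:
        cur = conn.cursor()
        clone_for_qa(cur)
        rows = flatten_contracts(cur)
        print(f"Flattened {len(rows)} contract rows")
    finally:
        conn.close()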

ETL Developer/Data Engineer

Reliance Industries Ltd.
2014.07 - 2020.10
  • Coordinated with Business Analysts (BAs) across seven petrochemical sites and two refinery clusters to gather business requirements and evaluate design scope and technical feasibility
  • Analyzed the complexity and technical impact of requirements against the existing design and discussed further refinement of requirements with Business Analysts
  • Created high-level and detailed-level designs
  • Conducted design reviews and design verification
  • Created final work estimates
  • End-to-end implementation, maintenance, optimizations, and enhancement of the application
  • Built predictive models to predict the likelihood of equipment failure, reducing O&M costs by 12% annually
  • Performed code reviews and presented code and design to the Technical Review Board
  • Designed and developed jobs that extract data from source databases (Oracle, DB2, and Teradata) using DB connectors
  • Involved in creating SQL queries, performance tuning and creation of indexes
  • Hands on experience in installing, configuring, and using Hadoop ecosystem components like Hadoop MapReduce, HDFS, HBase, Hive, Spark, Sqoop, Pig, Zookeeper and Flume
  • Designed and developed data warehouse and Business Intelligence architecture
  • Designed the ETL process from various sources into Hadoop/HDFS for analysis and further processing of data modules
  • Designed and created Azure Data Factory (ADF) pipelines extensively to ingest data from relational and non-relational source systems to meet business functional requirements
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, and Azure Data Lake Analytics
  • Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks
  • Created and provisioned numerous Databricks clusters for batch and continuous streaming data processing, and installed the required libraries on the clusters
  • Developed ADF pipelines to load data from on-premises systems to Azure cloud storage and databases
  • Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks
  • Worked extensively with SparkContext, Spark SQL, RDD transformations and actions, and DataFrames
  • Developed custom ETL solutions, batch processing and real-time data ingestion pipeline to move data in and out of Hadoop using PySpark and shell scripting
  • Created Spark RDDs from data files and then performed transformations and actions to other RDDs
  • Created Hive tables with dynamic and static partitioning, including buckets, for efficiency
  • Also created external tables in Hive for staging purposes
  • Loaded Hive tables with data, wrote Hive queries that run on MapReduce, and created a customized BI tool for management teams to perform query analytics using HiveQL
  • Wrote UDFs in Scala and PySpark to meet specific business requirements
  • Developed Spark applications using Spark SQL on EMR for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming data to uncover insights into customer usage patterns
  • Utilized Spark's in-memory capabilities to handle large datasets
  • Used broadcast variables in Spark, along with effective and efficient joins, transformations, and other capabilities for data processing (a minimal sketch follows this section)
  • Experienced in working with EMR cluster and S3 in AWS cloud
  • Created Hive tables and loaded and analyzed data using Hive scripts
  • Implemented partitioning (both dynamic and static partitions) and bucketing in Hive
  • Involved in continuous integration of the application using Jenkins
  • Led the installation, integration, and configuration of Jenkins CI/CD, including installation of Jenkins plugins
  • Implemented a CI/CD pipeline with Docker, Jenkins, and GitHub, virtualizing the Dev and Test servers with Docker and automating environment configuration through containerization
  • Installed, configured, and administered the Jenkins CI tool using Chef on AWS EC2 instances
  • Performed code reviews and was responsible for design, code, and test sign-off
  • Worked on designing and developing the Real-Time Tax Computation Engine using Oracle, StreamSets, Kafka, and Spark Structured Streaming
  • Validated data transformations and performed End-to-End data validations for ETL workflows loading data from XMLs to EDW
  • Extensively utilized Informatica to create the complete ETL process and load data into the database used by Reporting Services
  • Created Tidal Job events to schedule the ETL extract workflows and to modify the tier point notifications
  • Environment: Python, SQL, Oracle, Hive, Scala, Power BI, Azure Data Factory, Data Lake, Docker, Mongo DB, Kubernetes, PySpark, SNS, Kafka, Data Warehouse, Sqoop, Pig, Zookeeper, Flume, Hadoop, Airflow, Spark, EMR, EC2, S3, Git, GCP, Lambda, Glue, ETL, Databricks, Snowflake, AWS Data Pipeline.
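
As a flavor of the Spark and Hive work listed above, here is a minimal PySpark sketch of a broadcast join followed by a dynamically partitioned Hive write. It is illustrative only; the paths, table name (analytics.equipment_readings_enriched), and columns (equipment_id, site) are assumptions rather than the actual pipeline.

# Hypothetical sketch: broadcast a small dimension against a large fact table,
# then write the result into a dynamically partitioned Hive table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder.appName("equipment-readings-etl")
    .enableHiveSupport()
    .getOrCreate()
)

# Allow Hive dynamic partitioning so partition values come from the data itself.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

# Large fact data (sensor readings) and a small lookup of equipment metadata.
readings = spark.read.parquet("hdfs:///data/raw/equipment_readings/")
equipment_dim = spark.read.parquet("hdfs:///data/dim/equipment/")

# Broadcasting the small dimension to every executor avoids shuffling the fact table.
enriched = readings.join(broadcast(equipment_dim), "equipment_id", "left")

# Each distinct value of the partition column becomes its own Hive partition.
(
    enriched.write.mode("overwrite")
    .partitionBy("site")
    .saveAsTable("analytics.equipment_readings_enriched")
)

spark.stop()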

Education

Master of Science in Data Science

The University of Texas at Dallas, Richardson, TX
12.2023
  • Scholarship: Dean's Excellence Scholarship Recipient
  • GPA: 3.71

Bachelor of Technology - Chemical Engineering

National Institute of Technology Warangal, India
12.2014

GPA: 7.0

Skills

  • Big Data Ecosystem:
    HDFS, Spark, MapReduce
    Hive, Pig, Sqoop
    Flume, HBase, Kafka Connect
    Impala.
  • Data Engineering Tools:
    Airflow, Zookeeper
    Amazon Web Services (AWS)
    Cloudera CDP, Hortonworks HDP
    Apache Hadoop 1x/2x
  • Programming Languages:
    Python, Scala
    Pig Latin, HiveQL
    Shell Scripting
    SQL Development
  • Databases:
    MySQL, PostgreSQL
    MS SQL Server, Snowflake
    MongoDB, Cassandra
    Aginity Workbench
    SQL Data Warehousing
  • ETL/BI Tools:
    Power BI, Tableau
    Informatica
    Data Modeling
    Data Lineage
    Data Transformation
    Real-time Processing
  • Version Control:
    GIT, SVN
    Bitbucket
  • Cloud Technologies:
    EC2, S3, Lambda
    SQS, SNS, EMR
    CodeBuild, CloudWatch
    Azure HDInsight, Databricks
  • Operating Systems:
    Windows (XP/7/8/10)
    Linux (Unix, Ubuntu)
    Mac OS

Certification

  • Certified - AWS Cloud Practitioner (Amazon Web Services)
  • Certified - SnowPro Core Certification

Competitions & Leadership Experience

Elected Students' Council President by 5,000 students in college elections; successfully conducted a tech & cultural fest with 10k+ footfall.

Projects

  • Machine Learning - Seoul Rental Bike Prediction (June 2014): Implemented a linear regression model to predict the rented bike count using the gradient descent algorithm with a batch update rule; built models with linear regression, logistic regression, SVM, decision trees, neural networks, K-Means, and EM, achieving 92% confidence. Experimented with various parameters for linear regression and used a neural network package for the classification problem, varying the number of layers, number of nodes, and activation functions (tanh, sigmoid). A minimal gradient-descent sketch follows this section.
  • Big Data Analysis using Hadoop (June 2014): Processed 1M rows into HDFS from a local server, created Hive and Pig tables, and ran SQL queries; imported data into Tableau using ODBC drivers for data visualization and executed regression analysis in R, attaining 85% accuracy. Implemented data integration across three platforms (Cloudera, Tableau, and R Studio/Rserve) to assess driver behavior and mitigate potential road accidents.
  • Forecasting Time Series Data (Python, Deep Learning, Machine Learning): Conducted a time-series analysis of Kaggle's web traffic dataset using a diverse ensemble of models, including SARIMAX, RNN, LSTM, AdaBoost Regressor, Gradient Boost Regressor, and Random Forest Regressor; fine-tuned and optimized the models to improve precision in forecasting web traffic for the subsequent year, with performance evaluated against RMSE.
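
The bike-count project above centered on batch gradient descent for linear regression; a minimal sketch of that training loop follows. The data here is synthetic, and the feature names, learning rate, and epoch count are illustrative assumptions rather than the original project code.

# Hypothetical sketch: linear regression fit with full-batch gradient descent.
import numpy as np


def batch_gradient_descent(X, y, lr=0.1, epochs=2000):
    """Fit weights w minimizing mean squared error with full-batch updates."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        error = X @ w - y
        gradient = (X.T @ error) / n_samples  # gradient of 0.5 * MSE
        w -= lr * gradient                    # batch update rule
    return w


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Synthetic stand-in for the bike data: a bias column plus two standardized
    # features (e.g. temperature, hour of day) with a noisy linear response.
    X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
    true_w = np.array([200.0, 35.0, -12.0])
    y = X @ true_w + rng.normal(scale=10.0, size=500)

    w_hat = batch_gradient_descent(X, y)
    print("learned weights:", np.round(w_hat, 2))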

Languages

English
Native or Bilingual
Hindi
Native or Bilingual
Telugu
Native or Bilingual

Timeline

ETL Developer/Data Scientist - Toyota Financial Services
2022.10 - 2024.08
ETL Developer/Data Engineer - Reliance Industries Ltd.
2014.07 - 2020.10
The University of Texas at Dallas - Master of Science in Data Science
National Institute of Technology Warangal - Bachelor of Technology, Chemical Engineering
  • Certified - AWS Cloud Practitioner (Amazon Web Services)
  • Certified - SnowPro Core Certification