Tejaswani Dash

Data Engineer
Fairfax, VA

Summary

Data Engineer with 5+ years of experience spanning machine learning, healthcare, automotive, customer, and media data; reporting; SAS, SAP, PyTorch, TensorFlow, Spark, Scala, NLP, and LLMs; and programming in SQL, Python, R, and shell scripting. Solid understanding of statistical analysis, machine learning algorithms, and predictive modeling. Proven track record of delivering innovative data solutions with effective project management, including hands-on experience delivering work ahead of deadlines with 98% accuracy. ETL developer practiced in helping companies through diverse transitions, including sensitive data and massive big data installations; promotes extensive simulation and testing to ensure smooth ETL execution, and is known for building quick, effective tools to automate and optimize database management tasks. Seeking a challenging role in a dynamic environment to apply these skills and knowledge effectively.

Overview

6 years of professional experience
21 years of post-secondary education

Work History

Senior Data Engineer

Intelligenie
02.2023 - Current
  • Developed highly complex yet maintainable, easy-to-use Python, SQL, and PySpark code that satisfies application requirements and drives data processing and analytics using built-in libraries
  • Designing and developing data pipelines and workflows using Ab Initio's graphical interface and scripting language
  • Experienced in ETL processes from REST APIs, encompassing data extraction, transformation, loading, integration, optimization, security, monitoring, and collaboration for data-driven decision-making
  • Skilled in managing cloud infrastructure and scaling ETL solutions to meet growing data demands using cloud services
  • Experience in data pipelines and all phases of ETL and ELT data processing, converting big data and unstructured data sets (JSON, log data) into structured data sets for product analysts and data scientists
  • Proficient in establishing and enforcing data governance policies, metadata standards, and ensuring compliance with data regulations
  • Developing and implementing efficient data structures tailored to specific data processing and storage requirements
  • Experience in collecting, processing, and aggregating large amounts of streaming data using Kafka and Spark Streaming
  • Designing and implementing ETL data integration workflows using Informatica PowerCenter and data quality tools to ensure accurate and reliable data transfer from diverse sources to Data environments
  • Proficient in designing and optimizing ETL pipelines using Java, Python, Hive, and Pig for data transformation and processing
  • Skilled in Hadoop architecture, HDFS, and NoSQL databases like Cassandra
  • Expertise in query optimization and performance tuning for complex SQL and NoSQL queries
  • Responsible for end-to-end data pipeline development, integration into CI/CD processes, automation, testing, performance optimization, security, collaboration, and documentation to ensure the seamless and reliable delivery of high-quality data solutions
  • Experienced in administering HDFS and managing data in Hadoop clusters
  • Competent in working with MySQL and NoSQL databases, including schema design and data modeling
  • Familiar with indexing and partitioning strategies
  • Adept at creating bash shell scripts to automate ETL processes and routine tasks
  • Proficient in utilizing UNIX utilities and commands for data manipulation and system administration
  • Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity
  • Built an ETL job that executes the business analytical model inside a Spark JAR
  • Developed robust and scalable data pipelines using Spark, PySpark, and Scala for healthcare data extraction, transformation, and loading, ensuring data quality and traceability with GIT integration
  • Designed and implemented scalable data architectures on Azure, ensuring performance, reliability, and security, including data modeling and storage mechanisms
  • Developed and optimized ETL processes for seamless data integration from diverse sources into Azure storage solutions, focusing on data quality and workflow efficiency
  • Established Azure-based data warehouses for advanced analytics and reporting, employing dimensional modeling techniques aligned with business intelligence requirements
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob storage, and Azure SQL Data Warehouse, including write-back
  • Constructed scalable big data solutions using Google Cloud Platform (GCP) products and services including BigQuery, Dataflow, Dataproc, Pub/Sub, Vertex AI, and Apache Spark, maximizing speed and scalability via distributed computing strategies and configuration tuning
  • Performed data migration work, ensured robust data governance and security on GCP by implementing access controls, encryption, and monitoring processes, while also addressing compliance requirements through meticulous auditing and adherence to regulatory standards
  • Utilizing GCP tools like Apache Beam and Apache Spark for data processing and transformation, optimizing performance and scalability
  • Used Azure Data Factory, the SQL API, and the MongoDB API to integrate data from MongoDB, MS SQL, and cloud storage (Blob, Azure SQL DB); strong experience leading multiple Azure Big Data and data transformation implementations in Pharmacovigilance
  • Optimized Spark and PySpark jobs for performance, employing techniques like partitioning and caching to efficiently process large healthcare datasets, while addressing data quality and compliance with regulatory standards
  • Developed data visualization dashboards using Power BI and MicroStrategy to communicate findings effectively.
  • Developed, implemented and maintained data analytics protocols, standards, and documentation.
  • Ensured data quality through rigorous testing, validation, and monitoring of all data assets, minimizing inaccuracies and inconsistencies.
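The bullets above describe converting unstructured JSON and log data into structured data sets for downstream analysts. A minimal, stdlib-only sketch of that kind of flattening step (the field names and records here are hypothetical, not taken from the actual pipelines):

```python
import json

# Hypothetical raw JSON log lines, standing in for unstructured input.
raw_lines = [
    '{"event": "claim", "patient": {"id": "p1"}, "amount": 120.5}',
    '{"event": "claim", "patient": {"id": "p2"}, "amount": 80.0}',
    'not-json garbage line',
]

def flatten(line):
    """Parse one log line into a flat record; return None if unparseable."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return None
    return {
        "event": obj.get("event"),
        "patient_id": obj.get("patient", {}).get("id"),
        "amount": obj.get("amount"),
    }

# Keep only the parseable rows; the result is a structured data set
# ready to load into a warehouse table.
records = [r for r in (flatten(ln) for ln in raw_lines) if r is not None]
```

In a production pipeline this logic would typically run inside a PySpark job over many files rather than an in-memory list, but the parse-and-flatten shape is the same.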

Senior Data Engineer - ETL Developer

Cox Automotive
  • Proficiency in Python, PyTorch, TensorFlow, SQL, ELT development (particularly in Snowflake), and ELT orchestration using Airflow
  • Extensive experience in ETL processes, especially with Informatica Power Center, and expertise in integrating data from diverse sources into Data Warehouses
  • Designed and implemented data solutions utilizing various non-relational databases and data stores
  • Evaluated suitability and selected appropriate non-relational databases and data stores (object storage, document or key-value stores, graph databases, column-family databases) for specific use cases
  • Designed optimized data models and schema designs for efficient storage and retrieval
  • Strong AWS Cloud skills, including experience with analytical services, non-relational databases, data stores and Terraform
  • Familiarity with reporting tools, production support, Python data manipulation, and version control with GIT
  • Knowledge of New Relic for monitoring and performance management, as well as experience with project management using Rally and programming in Scala
  • Ensuring data is stored in a way that maximizes efficiency and performance, utilizing appropriate data structures such as arrays, lists, trees, graphs, or hash tables
  • Proficient in migrating data from EC2 instances to AWS MWAA, encompassing data analysis, transformation, transfer, and validation with a strong focus on security and compliance
  • Collaborated with Data Scientist team, Business Process Owners to capture business and functional requirements for Scope of Data Migration
  • Implemented real-time data streaming solutions using Kinesis and Firehose to ingest and process large volumes of data
  • Developed and optimized Apache Spark jobs for efficient data processing and analysis, ensuring scalability and performance
  • Designed and maintained data pipelines to seamlessly integrate streaming data from Kinesis and Firehose into Spark for real-time analytics and insights generation
  • Worked on GCP services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver monitoring, and Cloud Deployment Manager
  • Hands-on experience with AWS services such as S3, AWS Glue, Redshift, EMR, Kinesis, Firehose, IAM roles and permissions, Lambda functions, SQS, SNS, and EC2
  • Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket
  • Selecting appropriate AWS services to design and deploy an application based on given requirements
  • Orchestrated and automated complex data workflows using Apache Airflow, ensuring efficient ETL processes, task dependencies, error handling, and scalability
  • Managed security groups on AWS, focusing on high-availability, fault-tolerance, and auto scaling using Terraform templates
  • Implemented continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline
  • Used Pandas in Python for Data Cleansing and validating the source data
  • Experienced with GCP services, including Compute Engine, Cloud Load Balancing, Cloud Storage, and Cloud SQL
  • Skilled in AWS services and technologies, including MWAA configuration and DAG orchestration
  • Able to optimize performance, troubleshoot, and ensure a smooth data migration process
  • Collaborated with data science teams to develop AI chatbots optimized using machine learning, NLP, generative AI, and LLM fine-tuning with GPT-3.5 for prompt customer responses.
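The Airflow/MWAA orchestration described above centers on declaring task dependencies and letting the scheduler resolve a valid execution order. A stdlib-only sketch of that dependency resolution (the task names are hypothetical; a real deployment would declare these as operators in an Airflow DAG file rather than a plain dict):

```python
from graphlib import TopologicalSorter

# Hypothetical ETL task graph: each task maps to its upstream dependencies,
# mirroring `extract >> transform >> [load, validate]`-style Airflow wiring.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "validate": {"transform"},
    "report": {"load", "validate"},
}

# static_order() yields every task only after all of its upstreams,
# which is the guarantee an orchestrator's scheduler provides.
order = list(TopologicalSorter(dag).static_order())
```

Airflow adds retries, scheduling intervals, and error handling on top of this ordering; the dict form just makes the dependency-resolution idea concrete.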

Research Assistant

George Mason University
08.2021 - 01.2023
  • Conducted text analysis using regression models and performed medical image processing using NLP with 92% accuracy
  • Developed computational methods and algorithms to analyze and quantify biomedical data, achieving 87% accuracy in outcome prediction by applying information analysis, healthcare interoperability tools (HL7/FHIR), visualization, and SQL
  • Worked on a Smart Watch Prediction Model, achieving a 90% success rate for health record accuracy
  • Assisted in the Empowered Communities Opioid Project, analyzing behavioral data resulting in a 95% reduction in processing time.
  • Gathered, arranged, and corrected research data to create representative graphs and charts highlighting results for presentations.

Clinical Resolution Data Analyst Intern

HTC Global
05.2022 - 08.2022
  • Analyzed real-time data using machine learning for specific clients
  • Developed prediction models and managed clients directly
  • Expertise in data computation and security on GCP.

Senior Data Engineer

APCER Life Science
01.2018 - 12.2020
  • Led end-to-end data integration and migration projects, employing Python, SQL, and ETL processes to merge Pharmacovigilance ICSR data from the ARISg and Argus databases as part of a business merger
  • Built robust data pipelines that extract, transform, and load (ETL) data from various sources into structured formats using appropriate data structures
  • Used ETL processes to combine multiple data sources, with a SQL framework serving as an intermediate database for analysis
  • Completed 6 cycles of data transfer from the legacy system to the target system
  • Collaborated with Business Process Owners to capture business and functional requirements for Scope of Data Migration
  • Led Pre-load and Post load validation report publications and follow-up actions
  • Implemented a real-time load summary report using SQL and a reporting dashboard to provide key insights on data migration status, eliminating all manual effort to update statistics after each load attempt
  • Analyzed adverse drug effect data using Python, SQL, SAS, and visualizations; performed predictions of possible positive cases and coded events
  • Responsible for data extraction, client management, and data management
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob storage, and Azure SQL Data Warehouse, including write-back
  • Used Azure Data Factory, SQL API and MongoDB API and integrated data from MongoDB, MS SQL, and cloud (Blob, Azure SQL DB)
  • Strong experience of leading multiple Azure Big Data and Data transformation implementations in Pharmacovigilance
  • Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, HDInsight, Azure SQL Server, Azure ML and Power BI
  • Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio
  • Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity
  • Involved in complete project life cycle starting from design discussion to production deployment
  • Optimized Hive queries and used Hive on top of Spark engine
  • Proficient with Azure Cloud Platform services like Azure Data Factory (ADF), Azure Data Lake, Azure Blob Storage, Azure SQL Analytics, Azure Databricks
  • Worked on Sequence files, Map side joins, Bucketing, Static and Dynamic Partitioning for Hive performance enhancement and storage improvement
  • Experience retrieving data from Oracle using PHP and Java
  • Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs
  • Proficient in designing and building data pipelines, conducting ETL processes, and integrating machine learning models into data flows to support data-driven decision-making
  • Worked closely with the business team to gather their requirements and new support features
  • Built a 16-node cluster for the Data Lake design using the Cloudera Distribution
  • Responsible for building scalable distributed data solutions using Hadoop
  • Implemented and configured High Availability Hadoop Cluster
  • Designing and developing data pipelines to ETL data from SAP S/4 HANA, SAP IBP, and other SAAS data warehouses
  • Installed and configured Hadoop Clusters with required services (HDFS, Hive, HBase, Spark, Zookeeper)
  • Enhanced Hive scripts to analyze data; PHI was categorized into segments, and promotions were offered to customers based on those segments
  • Extensive experience in writing Pig scripts to transform raw data into baseline data
  • Developed UDFs in Java as and when necessary to use in Pig and HIVE queries
  • Worked on Oozie workflow engine for job scheduling
  • Created Hive tables, partitions and loaded the data to analyze using HiveQL queries
  • Created different staging tables like ingestion tables and preparation tables in Hive environment
  • Created tables in HBase to store the variable data formats of data coming from different upstream sources
  • Experience in managing and reviewing Hadoop log files
  • Good understanding of ETL tools and how they can be applied in a Big Data environment
  • Handled data processing for 25 clients with different data sets, predicting performance and positive and negative events.
  • Optimized data pipelines by implementing advanced ETL processes and streamlining data flow.
  • Prepared documentation and analytic reports, delivering summarized results, analysis and conclusions to stakeholders.
  • Trained staff and updated training documents to meet regulations and standards.
  • Reduced workload backlog by effectively prioritizing high-priority cases based on severity level and regulatory deadlines during peak periods of volume influx.
  • Championed the adoption of agile methodologies within the team, resulting in faster delivery times and increased collaboration among team members.
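The pre-load and post-load validation reports mentioned above typically reconcile row counts and per-row checksums between the legacy and target systems. A stdlib-only sketch of that reconciliation (the record layout and values here are hypothetical):

```python
import hashlib

def row_checksum(row):
    """Stable checksum over a row's fields, for source/target comparison."""
    payload = "|".join(str(v) for v in row).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def validate_load(source_rows, target_rows):
    """Return a small validation report comparing source and target."""
    src = {row_checksum(r) for r in source_rows}
    tgt = {row_checksum(r) for r in target_rows}
    return {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "missing_in_target": len(src - tgt),
        "unexpected_in_target": len(tgt - src),
        "passed": src == tgt and len(source_rows) == len(target_rows),
    }

# Hypothetical migration batch: one row failed to land in the target.
source = [("case1", "drug_a"), ("case2", "drug_b"), ("case3", "drug_c")]
target = [("case1", "drug_a"), ("case2", "drug_b")]
report = validate_load(source, target)
```

A real migration would run this per table and per load cycle, feeding the counts into the load summary dashboard; the set-of-checksums comparison is just the simplest form of the idea.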

Education

Master of Science - Data Analytics & Health Informatics

George Mason University
Fairfax, VA
05.2001 - 05.2022

Skills

Scripting Languages: Python, R, SQL, Java, PHP, Bash, Powershell, Pyspark, Scala

Accomplishments

  • Excellence in Performance Award for business client management at APCER Life Science, 01/2020
  • Excellent performance in data engineering and recognition in cloud data migration, 01/2020
  • Best Debut Performance in Pharmacovigilance Data, 12/2018

AWS Certifications

  • AWS Certified Data Engineer, https://www.credly.com/badges/885e9f66-7c47-4fea-bb5b-3c6489f30fdc/public_url, 03/2024
  • AWS Certified Machine Learning – Specialty, https://www.credly.com/badges/15e9f69c-8ec3-4247-ab25-89f50b88af64/public_url, 04/2024

Projects

Capstone Project- Effect of Social Determinants incidents on Diabetes, 2022, https://github.com/TejaswaniDash/Effect-of-Social-Determinants-incidents-on-Diabetes- 

  • Used LASSO regression to construct a causal network explaining how variation in social determinants affects diabetes incidence, achieving an accuracy of 76%.
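A minimal sketch of the LASSO step in this project, on synthetic data (the features and coefficients are invented for illustration, and the sketch assumes NumPy and scikit-learn are available):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic stand-in for social-determinant features: only two of the
# ten features actually drive the (hypothetical) incidence signal.
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# The L1 penalty drives irrelevant coefficients exactly to zero, which
# is what makes LASSO usable for selecting edges in a causal network.
model = Lasso(alpha=0.1).fit(X, y)
selected = [i for i, c in enumerate(model.coef_) if abs(c) > 1e-6]
```

The surviving nonzero coefficients indicate which predictors the penalty kept, i.e., the candidate edges of the network.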


AI Project on Fingerphoto, 2022, https://github.com/TejaswaniDash/AIProject-Color-space-finger-photo-presentation-detection-on-background-variation-using-ResNet 

  • Addressed presentation attack issues using Convolutional Neural Networks (CNNs) for Fingerphoto-based authentication, a touchless authentication method in medical patient scenarios, incorporating GAN models.


Credit Card Fraud Prediction using ML and AI, 2022, https://github.com/TejaswaniDash/Credit-Card-Detection 

  • Achieved 96% accuracy and minimized false negatives using Python, logistic regression, and ensemble models.
  • Employed exploratory data analysis (EDA), implemented advanced classification algorithms, and fine-tuned models via hyperparameter optimization to effectively predict fraudulent credit card transactions.
  • This approach yielded reliable detection of fraudulent activity.


Machine Learning Model to predict Covid infection and death rates, 2022, https://github.com/TejaswaniDash/Covid-19-Death-Recovery-and-confirmed-Prediction-and-Analysis-using-ML-and-AI 

  • Utilized European and country-level USA datasets to predict future US infection and death rates with 90% accuracy. Compared the performance of multiple ML models (regression, random forest, etc.), LLMs, and neural networks.


Text Analysis of data to predict sentiment using NLP, LLM, 2023, https://github.com/TejaswaniDash/Text-Analysis-of-data-to-predict-sentiment 

  • Employed LASSO regression, multi-level regression, and LLM/ML tools to create a network for sentiment prediction, achieving a 95% success rate in predicting sentence sentiment.


Mental Health in Tech world using Deep Learning and ML, 2021, https://github.com/TejaswaniDash/Mental-Health-In-Tech-World

  • Analyzed and predicted the need for medical treatment for mental health issues in the tech industry using support vector machines, decision trees, and random forests.
