Seasoned Data Engineer with over 10 years of experience, starting as a Java developer and progressively expanding expertise into advanced areas such as machine learning, natural language processing (NLP), and state-of-the-art data engineering. Proven track record in designing and implementing large-scale data pipelines, integrating complex machine learning models, and building distributed systems. Adept at working with a wide range of technologies, including Apache Spark, Kafka, Python, and TensorFlow, to deliver end-to-end data solutions. Skilled at deploying and managing machine learning models in production environments, optimizing for performance, and ensuring data integrity. Strong foundation in Java-based applications with deep expertise in cloud platforms like AWS. A results-driven professional who combines a solid software engineering background with cutting-edge data science capabilities.
• Built data pipelines in Python to enable exploratory data analysis for NLP on speech-to-text data from Cloudera support calls.
• Improved the performance of EMR jobs by 50% by applying recent developments in Spark SQL and EMR clusters
• Implemented a hot-warm data architecture using AWS S3 and Apache Spark, improving the performance of call stats by ~70% and reducing storage costs
• Ingested data from disparate sources using a combination of Spark SQL, the Google Analytics API, and Python to create data views for BI tools like Tableau/Looker
• Designed and implemented highly available batch and streaming pipelines, delivering product uptime of ~99.999%
• Some highlights are:
  - Automated ETL processes, making it easier to wrangle data and reducing time by as much as 40%
  - Increased the efficiency of data fetching by using query optimization and indexing
• Constructed a data pipeline that detects entities in scientific articles, and modified the entity extraction process to prepare training data for building classifiers with state-of-the-art natural language processing techniques such as named entity recognition (NER).
• Built a Django-based web application tool
• Used Spark to implement scalable machine learning topic algorithms on large datasets.
• Improved the company nomination system, which nominates links between entities of interest (companies, chemicals, and harms)
• Detected harms from boxed warnings using a combination of semantics and natural language techniques.
This award was presented in recognition of my consistent high performance within the team over several quarters.