Experienced Data Engineer keen to help companies collect, collate and exploit digital assets. Practiced at cleansing and organizing data into new, more functional formats to drive increased efficiency and enhanced returns on investment.
Big Data Project "Twitter Sentiment Analysis" (Python, Scala) May 2017-Sep 2017
• Used Apache Kafka to stream tweets from a producer to a consumer and performed sentiment analysis on the live stream, using the StanfordNLP library to classify each tweet as 'positive', 'negative' or 'neutral'.
• Classified 50,000 tweets and used Elasticsearch and Kibana to break down the number of tweets in each sentiment category by US state.
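The per-state breakdown described above can be illustrated with a minimal Python sketch. It is not the project's code: the project did this aggregation in Kibana, and the record layout (the `state` and `sentiment` fields) and the sample values here are hypothetical, chosen only to show the idea of counting tweets per state and sentiment category.

```python
from collections import Counter

# Hypothetical classified tweets; field names and values are illustrative only.
classified_tweets = [
    {"state": "OR", "sentiment": "positive"},
    {"state": "OR", "sentiment": "negative"},
    {"state": "TX", "sentiment": "positive"},
    {"state": "OR", "sentiment": "positive"},
]


def count_by_state_and_sentiment(tweets):
    """Return a Counter keyed by (state, sentiment) pairs."""
    return Counter((t["state"], t["sentiment"]) for t in tweets)


counts = count_by_state_and_sentiment(classified_tweets)

# Print one line per (state, sentiment) pair, sorted for stable output.
for (state, sentiment), n in sorted(counts.items()):
    print(f"{state} {sentiment}: {n}")
```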
Big Data Project "Crime Rate Forecasting System" (PySpark, Scala) May 2017-Sep 2017
• Grouped the Portland crime rate dataset (829,384 rows, 19 columns) into three clusters using K-means clustering from Spark MLlib.
• Developed a time-series forecasting system on the three clusters using an ARIMA (Autoregressive Integrated Moving Average) model, which forecast the crime rate for one month with 80% accuracy.
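The metric behind the 80% accuracy figure is not stated above. One common convention for forecast accuracy is 100% minus the mean absolute percentage error (MAPE); the sketch below assumes that convention purely for illustration, and the function name and sample numbers are hypothetical rather than taken from the project.

```python
def mape_accuracy(actual, forecast):
    """Return 100 - MAPE, in percent.

    Assumes both sequences have the same non-zero length and that
    no actual value is zero (MAPE is undefined in that case).
    """
    if not actual or len(actual) != len(forecast):
        raise ValueError("actual and forecast must be non-empty and equal in length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    relative_errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    mape = 100.0 * sum(relative_errors) / len(relative_errors)
    return 100.0 - mape


# Illustrative counts only: each forecast is off by 20% of the actual value,
# so the resulting accuracy is about 80.
actual = [100, 200, 400]
forecast = [80, 240, 320]
accuracy = mape_accuracy(actual, forecast)
print(f"accuracy: {accuracy:.1f}%")
```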