Role:
Built data pipelines (AWS Kinesis, S3); performed ETL and data transformations with Apache Spark; warehoused data in AWS Redshift to power dashboards; wrote unit tests with ScalaTest, JUnit, and PHPUnit; extended Apache Solr with small plugins and algorithm changes; collected stats and engineered features; built and maintained search APIs, internal APIs, and front-end applications.
Search & optimization projects:
- Extended Apache Solr with small plugins at indexing and query time to improve search result quality. Cut indexing time for 10 million documents from 4 hours to 40 minutes by increasing the number of indexing threads within Solr and writing a program that submits documents in parallel (see the indexing sketch after this list).
- Involved in scaling the API to 12-13 million users, serving 170 million requests per day with peaks of 200k-300k requests per second. Scale was achieved with AWS Elastic Beanstalk and Docker, which ran the Apache Solr search engine on a maximum of 20 hosts with an average response latency of 100 milliseconds; web server scale was achieved by tuning PHP-FPM and Linux kernel settings.
- Developed RESTful APIs in Java Jersey (on Tomcat) and PHP (see the resource sketch after this list).
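
A minimal sketch of the parallel indexing approach, assuming SolrJ 8.x and Scala 2.13; the document source, collection name, batch size, and thread count here are placeholders, not the production values:

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient
    import org.apache.solr.common.SolrInputDocument
    import scala.jdk.CollectionConverters._

    object ParallelIndexer {
      def main(args: Array[String]): Unit = {
        // The client buffers documents in a queue and flushes them to Solr
        // over several background threads in parallel.
        val client = new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/jobs")
          .withQueueSize(10000)  // docs buffered before the caller blocks
          .withThreadCount(8)    // parallel update connections to Solr
          .build()

        // Hypothetical document source standing in for the real data store.
        val docs = Iterator.tabulate(10000)(i => Map("id" -> i.toString, "title" -> s"job $i"))

        docs.grouped(500).foreach { batch =>
          val solrDocs = batch.map { fields =>
            val doc = new SolrInputDocument()
            fields.foreach { case (k, v) => doc.addField(k, v) }
            doc
          }
          client.add(solrDocs.asJava)
        }
        client.commit()
        client.close()
      }
    }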
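
And a skeletal Jersey (JAX-RS) resource of the kind used for the search API, written here in Scala for consistency; the path, parameters, and response shape are placeholders:

    import javax.ws.rs.{GET, Path, Produces, QueryParam}
    import javax.ws.rs.core.MediaType

    // Deployed on Tomcat via the Jersey servlet container.
    @Path("/search")
    class SearchResource {
      @GET
      @Produces(Array(MediaType.APPLICATION_JSON))
      def search(@QueryParam("q") query: String,
                 @QueryParam("page") page: Int): String = {
        // The production resource delegated to the Solr-backed search service.
        s"""{"query":"$query","page":$page,"results":[]}"""
      }
    }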
Big data projects:
- Logging data project: logged all click, conversion, and impression events on the site from a PHP application, using a Flume-style Firehose agent to ship events through AWS Kinesis to AWS S3 storage and AWS Redshift (see the producer sketch after this list).
- Designed schema for Redshift.
- Worked on a big data project sending job alerts to 5 million users. Customized email job alerts for each user with Apache Spark, AWS EMR, and Redis by generating a unique profile per user (search and click history) from activity such as searched keywords and click events; wrote the job in Scala/Apache Spark and deployed it on AWS EMR (see the profile sketch after this list).
- Customized third-party (publisher) searches to increase CTR (click-through rate) and revenue per visitor by recommending jobs to users based on their click events.
- Worked on Databricks cloud for ETL and big data collection tasks using Apache Spark SQL, Parquet, and the DataFrames API (see the ETL sketch after this list).
- Created a reporting dashboard system backed by an AWS Redshift data warehouse, showing impressions, clicks, and conversions at the job level; ETL was performed with Spark, with some data loaded directly from S3 (see the rollup sketch after this list).
- Collected data for machine learning and converted it into features and generic stats to train classification, regression, and clustering models via scheduled Spark jobs (see the feature sketch after this list).
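
A sketch of the event-producer side of the logging pipeline, assuming the AWS Java SDK v1 and a hypothetical delivery stream name ("click-events"); in production the events came from PHP and were shipped by the agent rather than written directly:

    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder
    import com.amazonaws.services.kinesisfirehose.model.{PutRecordRequest, Record}
    import java.nio.ByteBuffer

    object EventLogger {
      // Firehose buffers records and delivers them to S3 / Redshift.
      private val firehose = AmazonKinesisFirehoseClientBuilder.defaultClient()

      def logEvent(json: String): Unit = {
        val record = new Record().withData(ByteBuffer.wrap((json + "\n").getBytes("UTF-8")))
        firehose.putRecord(new PutRecordRequest()
          .withDeliveryStreamName("click-events")  // hypothetical stream name
          .withRecord(record))
      }

      def main(args: Array[String]): Unit =
        logEvent("""{"event":"click","jobId":123,"userId":"u42"}""")
    }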
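
A condensed sketch of the per-user profile job, assuming JSON event logs in S3 with hypothetical column names (userId, event, keyword, jobId); the real job also consulted Redis and fed the alert emails downstream:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object UserProfiles {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("user-profiles").getOrCreate()
        import spark.implicits._

        val events = spark.read.json("s3://bucket/events/")  // hypothetical path

        // One row per user: keywords they searched and jobs they clicked.
        // collect_list drops the nulls produced by the when() filters.
        val profiles = events
          .filter($"event".isin("search", "click"))
          .groupBy($"userId")
          .agg(
            collect_list(when($"event" === "search", $"keyword")).as("searchedKeywords"),
            collect_list(when($"event" === "click", $"jobId")).as("clickedJobs"))

        profiles.write.mode("overwrite").parquet("s3://bucket/profiles/")
        spark.stop()
      }
    }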
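
The Databricks ETL work followed the usual Spark SQL + Parquet pattern; a toy version with placeholder paths and columns:

    import org.apache.spark.sql.SparkSession

    object EtlJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("etl").getOrCreate()

        // Expose the raw Parquet data to Spark SQL as a temp view.
        spark.read.parquet("s3://bucket/raw/").createOrReplaceTempView("raw_events")

        val cleaned = spark.sql(
          """SELECT userId, jobId, event, to_date(ts) AS day
            |FROM raw_events
            |WHERE event IS NOT NULL""".stripMargin)

        // Partitioning by day keeps later scans cheap.
        cleaned.write.mode("overwrite").partitionBy("day").parquet("s3://bucket/clean/")
        spark.stop()
      }
    }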
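
The job-level dashboard rollup amounted to a grouped count per event type; a sketch with assumed paths and event names, written to S3 for loading into Redshift:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object JobReport {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("job-report").getOrCreate()
        import spark.implicits._

        val events = spark.read.parquet("s3://bucket/clean/")  // hypothetical path

        // Impressions, clicks, and conversions per job per day.
        val report = events
          .groupBy($"jobId", $"day")
          .agg(
            sum(when($"event" === "impression", 1).otherwise(0)).as("impressions"),
            sum(when($"event" === "click", 1).otherwise(0)).as("clicks"),
            sum(when($"event" === "conversion", 1).otherwise(0)).as("conversions"))

        // Written as CSV so Redshift can ingest it with a COPY command.
        report.write.mode("overwrite").csv("s3://bucket/reports/job_daily/")
        spark.stop()
      }
    }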
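
Finally, a sketch of the feature-collection step, assuming the cleaned events above and Spark ML's VectorAssembler; the scheduled production jobs computed many more signals than these:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object FeatureJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("features").getOrCreate()
        import spark.implicits._

        // Generic per-job stats derived from raw events.
        val stats = spark.read.parquet("s3://bucket/clean/")  // hypothetical path
          .groupBy($"jobId")
          .agg(
            sum(when($"event" === "click", 1).otherwise(0)).as("clicks"),
            sum(when($"event" === "impression", 1).otherwise(0)).as("impressions"))
          .withColumn("ctr", $"clicks" / ($"impressions" + lit(1)))  // add-one smoothing

        // Pack the stats into the single vector column MLlib models expect.
        val features = new VectorAssembler()
          .setInputCols(Array("clicks", "impressions", "ctr"))
          .setOutputCol("features")
          .transform(stats)

        features.write.mode("overwrite").parquet("s3://bucket/features/")
        spark.stop()
      }
    }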
Web projects:
- Top Spot Job project: built the Top Spot feature to showcase high-value jobs at the top of the site.
- Built a mobile website and search API in PHP for publishers, with a jQuery/HTML/PHP front end, and pushed it to production.
- Created a small API with Java Jersey for a reporting system for a jobs2careers client, and displayed the results with a jQuery/PHP application.