- Worked on a Data Lake project in which semi-structured data was ingested into a cloud data warehouse to support course and system usage reporting, statistical analysis for predicting at-risk students in a course, and market research and analysis. Team size on this project varied from 20 to 40 people.
- Student and course activity data from different sources was streamed to AWS S3 through Kafka clusters, transformed and aggregated in the Snowflake cloud data warehouse, and then used for reporting and predictive analysis.
- The underlying process for reports for course instructors and system administrators, which originally took days to generate, was migrated to Snowflake, reducing report generation time to minutes. Developed the provisioning, deployment, and configuration process that the client would use to migrate to the new pipeline.
- The goal for the centralized data warehouse was to scale dynamically on demand for different systems serving 350+ and 1,400+ clients, respectively.
- Created a workflow for ingesting raw semi-structured course access logs and transforming them into multiple relational tables in the cloud to support Looker reports, which provided internal staff with charts and dashboards. The access log table was receiving 100+ million rows of raw data per day, so it had to be archived as one of the last ELT steps.
- Used Git to collaborate with team members on the Data Lake codebase. Used a Jira agile board for task tracking and for visualizing team activity, which was reported to product management at the end of each sprint.
- Tools and languages used on the Data Lake project: Hue, Hive, Pig, Impala, Presto, Sqoop, Oozie, Flyway, AWS (S3, EC2, EMR, RDS), jq for JSON, IntelliJ, shell scripting, Scala, JavaScript, Node.js, Highcharts, Snowflake, Looker, Python, PyCharm, Airflow
Environment: Cloudera CDH, AWS, Stash (Git), Jira Agile, SQL Server 2005/2008/2012, SSRS, SSIS, BIDS Helper 2012, ASP.NET, Visual C#, Windows Server 2008 R2 / Vista Enterprise, MS Visual Studio 2013, MS Visual Web Developer 2008, SQL Server Data Tools, MS Visio