Team-oriented Senior Cloud Data Engineer with hands-on technical and lead experience. Strong understanding of designing large data warehouse systems, including relational and dimensional data models. Successful in implementing on-premise (Hadoop/Big Data) and cloud (AWS) technologies, including data-pipelines and data-lakes.
Responsible for an on-premise Cloudera data-lake implementation and its migration to AWS, and for designing secure data-pipelines in the cloud using Terraform for infrastructure as code.
· AWS (Cloud)
o Collaborated on the design of a secure data-lake solution and led its implementation using a staged, multi-zone approach, including the creation of data-pipelines for multiple source systems.
o Leveraged AWS services for the pipelines, including Glue (ETL, Crawler, Data Catalog), Lambda, Step Functions, Lake Formation, CloudWatch, Athena, Redshift, and QuickSight.
o Developed the infrastructure using Terraform, Glue, Lambda, PySpark, and dbt (data build tool), with Jupyter notebooks, SageMaker, VS Code, Git, and shell as development tools.
o Implemented CI/CD for the data-pipeline workflow using Git and AWS CodePipeline.
o Troubleshot issues using CloudTrail, CloudWatch, and Kibana dashboards.
o Automated AWS key and password rotation using AWS Secrets Manager.
o Ensured secure access to production data, whether in the raw S3 layer or in Redshift, through appropriate IAM roles and policies, and protected the data with the appropriate KMS keys.
o Designed and implemented the historical migration of on-premise ODS data using Glue jobs and Lambda triggers for data transfers between S3 sources and targets across VPCs (illustrated in the sketch below).
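A minimal sketch of the S3-triggered transfer pattern described above; the bucket, job, and argument names are hypothetical placeholders, not the actual project configuration:

    # Hypothetical Lambda handler: start a Glue job run for each ODS extract
    # that lands in the source bucket. All names here are illustrative assumptions.
    import os
    import urllib.parse

    import boto3

    glue = boto3.client("glue")
    GLUE_JOB_NAME = os.environ.get("GLUE_JOB_NAME", "ods-history-migration")
    TARGET_BUCKET = os.environ.get("TARGET_BUCKET", "datalake-raw-zone")

    def handler(event, context):
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            # The Glue job copies/transforms the object across VPC boundaries;
            # source and target locations are passed in as job arguments.
            glue.start_job_run(
                JobName=GLUE_JOB_NAME,
                Arguments={
                    "--source_path": f"s3://{bucket}/{key}",
                    "--target_path": f"s3://{TARGET_BUCKET}/{key}",
                },
            )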
· Cloudera/HortonWorks (On-premise)
o Built on-premise Cloudera (CDP) clusters, including OS and cluster software installation on multiple Linux servers. Designed and implemented the data-lake using a secure multi-zone architecture.
o Enabled Kerberos authentication to restrict the cluster to valid access. Secured data in motion by implementing trusted certificates (TLS/SSL).
o Designed data-lake zones to support raw, trusted, and refined data, enforcing security for data at rest (HDFS encryption) by application and zone.
o Used Apache Atlas as the metadata layer for ingested data, with the option to attach a business taxonomy to schema attributes and provide visual lineage of loaded data.
o Implemented data-access policies at the Hive and HDFS levels through Apache Ranger, including auditing for policy violations. Managed the encryption keys for data-lake zones through Ranger.
o Clients included: Allina Health System, Medtronic, BI Worldwide, Children’s Hospital and Fortis
Participated in various capacities in a program to transition a major worldwide hotel chain from its legacy customer-loyalty application to a proprietary one.
o Implemented a multi-tenant data-lake for secure code/data storage and retrieval, comprising migration, ingestion, integration, reporting, and analytics zones built on raw, staged, and processed layers.
o Developed data-pipeline processes for client-data migration, ingestion, integration, and reporting, using a combination of NiFi, Kafka, Scala, PySpark, Hive, Airflow, shell scripts, and SSRS.
o Wrote processes to audit ingested Hive data against Kafka events from the source. Created a producer to send missing data keys back to the Kafka broker so the source could be updated for the next NiFi fetch (see the sketch after this list).
o Developed change-data-capture processes so that hard deletes and updates on the source RDBMS flowed through to Hadoop/Hive, using a combination of Kafka/NiFi, HiveQL, and Python scripts.
o Worked with the BA team to review requirements for sensitive fields and used Atlas to define the tags. Integrated with Ranger to enable tag-based policies per AD group. Enforced LDAP group-based access to data-lake folders and Hive tables for developers and QA using Apache Ranger.
o The development process followed an agile framework with sprints, using tools including JIRA, Confluence, IntelliJ, HiveRunner, Git/BitBucket, and Bamboo for CI/CD of development branches.
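A minimal sketch of the audit-and-replay step described above, assuming hypothetical broker, topic, and table names and the kafka-python client; the actual jobs used the project's own tooling:

    # Illustrative audit: compare Kafka event keys against rows landed in Hive,
    # then republish missing keys so the source re-queues them for the next
    # NiFi fetch. All names are hypothetical.
    import json

    from kafka import KafkaProducer
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    event_keys = spark.sql("SELECT record_key FROM audit.kafka_event_keys")
    hive_keys = spark.sql("SELECT record_key FROM ingest.loyalty_members")
    missing = event_keys.subtract(hive_keys)   # in the event stream, not in Hive

    producer = KafkaProducer(
        bootstrap_servers="broker:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for row in missing.toLocalIterator():
        producer.send("missing-keys-replay", {"record_key": row["record_key"]})
    producer.flush()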
Led the implementation of an Enterprise Data Warehouse (EDW) and Big Data platform. Conducted POCs, delivered prototypes, and recommended appropriate solutions for application and technology implementation teams.
o Principal lead for the EDW database, ETL, and BI, including database design, data management, system health checks, mentoring and training, and coordination with offshore teams.
o Architected the strategy and plan to migrate a 170 TB EDW to lower-cost compute/storage platforms within a 24-hour production outage window, resulting in $4M+ in savings and eliminating $400K in outside consulting costs by driving the project in-house.
o Implemented a HortonWorks Hadoop cluster to augment the EDW, storing 20 times more data.
o Improved query performance using de-normalized versions of terabyte-sized Hive tables (see the sketch after this list).
o Enabled engineers to output Hive query results for analysis in JMP and Tableau.
o Collaborated with business partners and analysts to identify data sources and define ingestion processes and storage methodology with appropriate standards and governance.
o Developed the strategic roadmap for the enterprise data warehouse and business intelligence, including key initiatives for platform migrations, product upgrades, security enhancements, and cost reduction.
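A minimal sketch of the de-normalization pattern referenced above, with hypothetical table, column, and partition names:

    # Illustrative PySpark job: pre-join dimensions onto a terabyte-scale fact
    # table and publish the result as a partitioned Hive table, so downstream
    # JMP/Tableau queries avoid repeated joins. Names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    fact = spark.table("edw.test_results")
    lot_dim = spark.table("edw.lot_dim")
    tool_dim = spark.table("edw.tool_dim")

    denorm = (
        fact.join(lot_dim, "lot_id", "left")
            .join(tool_dim, "tool_id", "left")
    )

    (denorm.write
           .mode("overwrite")
           .partitionBy("test_date")
           .format("parquet")
           .saveAsTable("analytics.test_results_denorm"))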
Cloud: AWS (S3, Lambda, Glue, Step Functions, VPC, Lake Formation, Redshift, Secrets Manager), Terraform
BigData: Hive, Cloudera Manager, Ambari, NiFi, Kafka, Sqoop, Atlas, Ranger, Airflow, Zeppelin
ETL/BI/Dev: dbt, QuickSight, Jupyter, SageMaker, Denodo, Informatica, BOBJ, VS Code, IntelliJ, Shell
Agile/CI/CD: Confluence, GitHub, JIRA, BitBucket, SourceTree, Bamboo, Docker, Kanban, Scrum
Databases: Redshift, Oracle, SQL Server, MySQL, PostgreSQL
Languages: Unix/Linux shell, Python, PySpark, C, HiveQL, Perl, SQL, PL/SQL
Platforms: AWS, Cloudera/HortonWorks, HDFS, Unix, Linux, AIX, Zaloni, PagerDuty