Senior Data Engineer
Data Pipeline and Infrastructure Development:
- Initiated and led development of a Kafka-based data streaming platform that processes millions of events daily, cutting latency for real-time data availability by 30%.
- Integrated DynamoDB Streams with AWS Lambda for real-time processing and PII handling, improving operational efficiency by 25% and enhancing data quality for analytics.
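A minimal sketch of the kind of Lambda handler this bullet describes, assuming hypothetical PII field names (`email`, `ssn`) and a simple redaction rule rather than the actual production masking policy:

```python
import copy

# Hypothetical set of PII attributes to mask; the real list would come
# from the project's data-classification policy.
PII_FIELDS = {"email", "ssn"}

def mask_pii(item: dict) -> dict:
    """Return a copy of a DynamoDB item image with PII values redacted."""
    masked = copy.deepcopy(item)
    for field in PII_FIELDS & masked.keys():
        # DynamoDB stream images wrap values in type descriptors, e.g. {"S": "..."}
        masked[field] = {"S": "***REDACTED***"}
    return masked

def handler(event: dict, context=None) -> list:
    """Lambda entry point: mask PII in each stream record's new image."""
    results = []
    for record in event.get("Records", []):
        new_image = record.get("dynamodb", {}).get("NewImage", {})
        results.append(mask_pii(new_image))
    return results
```

The `event` shape (records under `Records[].dynamodb.NewImage`) follows the standard DynamoDB Streams event format; the downstream delivery step is omitted.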
Cloud Infrastructure as Code:
- Implemented Terraform to automate provisioning of AWS resources, yielding 50% faster rollout of new environments and a 20% reduction in cloud costs through efficient resource utilization.
- Enforced best practices for infrastructure management, resulting in zero downtime during large-scale deployments.
Cloud Data Services and ETL Processes:
- Architected and maintained scalable AWS Glue ETL pipelines in PySpark for complex transformations and data processing jobs, doubling data processing capacity and supporting advanced analytics initiatives.
- Improved analytics teams' data discovery by streamlining metadata management in the Glue Data Catalog.
Data Storage and Management:
- Designed a compliant S3-based data lake incorporating Delta tables and Parquet, improving data query performance by 35% and supporting petabyte-scale storage.
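Parquet/Delta data lakes on S3 typically rely on Hive-style partitioning so query engines can prune data; a small path-building helper sketches the layout (bucket, table, and column names are illustrative, not the actual lake's):

```python
from datetime import date

def partition_key(bucket: str, table: str, event_date: date) -> str:
    """Build a Hive-style S3 key prefix (year=/month=/day=) so engines like
    Athena and Spark can skip irrelevant partitions when querying."""
    return (
        f"s3://{bucket}/{table}/"
        f"year={event_date.year}/month={event_date.month:02d}/day={event_date.day:02d}/"
    )
```

Writing daily files under such prefixes is what lets downstream queries filter on date columns without scanning the whole table.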
Data Warehousing and Querying:
- Designed and deployed an optimized Amazon Redshift data warehouse using star-schema modeling, increasing query performance five-fold for complex analytical queries.
- Expanded the use of Amazon Athena, enabling ad-hoc querying that empowered business users to explore data without IT intervention.
Workflow Automation:
- Orchestrated and automated data workflows using Apache Airflow, enabling consistent execution of batch jobs and reducing manual intervention by 90%.
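Airflow derives each run's task order from the DAG's dependency edges; a simplified stand-in for that ordering logic (Kahn's topological sort, with illustrative task names, not Airflow's actual scheduler code) looks like:

```python
from collections import deque

def execution_order(deps: dict) -> list:
    """Return a valid run order for tasks given {task: [upstream tasks]},
    mimicking how a scheduler like Airflow sequences a DAG's batch jobs."""
    # Count unmet upstream dependencies for each task.
    indegree = {task: len(upstream) for task, upstream in deps.items()}
    downstream = {task: [] for task in deps}
    for task, upstream in deps.items():
        for parent in upstream:
            downstream[parent].append(task)
    # Tasks with no upstream dependencies can run immediately.
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in downstream[task]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order
```

In Airflow itself these edges would be declared with `>>` between task objects; the sort above just makes the resulting execution guarantee concrete.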
Data Visualization and Reporting:
- Connected data pipelines to Tableau and Amazon QuickSight, delivering self-service dashboards and actionable insights that enhanced data-driven decision-making across the company.
Performance Optimization and Security:
- Tuned data jobs using PySpark's in-memory processing, contributing to a 20% increase in overall system performance.
- Implemented comprehensive security measures, including encryption and IAM policies, ensuring full compliance with internal and external security standards.