Adept Spark Scala developer with a proven track record at JPMorgan Chase, improving data migration and processing efficiency by leveraging AWS, Spark, and Scala. Expertise in ETL methodologies and big data analytics, combined with strong problem-solving skills, has driven significant performance optimizations. Skilled in API integration and cross-team collaboration, consistently delivering project milestones with precision and agility.
Responsibilities:
· Develop and deploy ETL (Extract, Transform, Load) processes for efficient data migration from on-premises systems to AWS.
· Implement and optimize data transformation logic in Spark using Scala and Spark SQL for data cleansing, standardization, and enrichment (see Sketch 1 below).
· Configure and manage Amazon EMR clusters to run Spark jobs efficiently, ensuring scalability and reliability.
· Optimize EMR cluster performance through effective resource allocation, job scheduling, and cost management practices.
· Monitor EMR clusters for performance issues and implement tuning strategies to enhance job execution.
· Utilize Amazon S3 for robust data storage solutions, ensuring efficient data access and retrieval for Spark processing.
· Develop and manage S3 bucket policies to maintain data integrity, security, and compliance with best practices.
· Implement data lifecycle management policies in S3 to optimize storage costs and performance.
· Implement and manage AWS IAM (Identity and Access Management) policies to control secure access to AWS resources and data.
· Ensure all Spark jobs and data processing activities adhere to security best practices and compliance requirements.
· Use Apache Airflow (through Astronomer) to orchestrate and manage complex data workflows, scheduling, and task dependencies.
· Develop and maintain Airflow DAGs to automate and streamline ETL processes and data pipeline operations.
· Monitor and troubleshoot Airflow workflows to ensure reliable and timely execution of data tasks.
· Utilize Amazon SQS for managing message queues and integrate them with Spark applications to ensure reliable data ingestion and processing (see Sketch 2 below).
· Perform comprehensive testing, including unit testing, integration testing, and end-to-end testing, to validate data integrity and migration success (see Sketch 3 below).
· Continuously monitor and optimize the performance of Spark jobs, EMR clusters, and overall data processing workflows.
· Implement best practices for efficient data processing and resource utilization, including tuning Spark applications and optimizing AWS resource configurations (see Sketch 4 below).
· Document technical designs, configurations, migration processes, and best practices to facilitate knowledge sharing and support future projects.
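Illustrative code sketches:

Sketch 1: a minimal example of the Spark/Scala cleansing, standardization, and enrichment pattern described above. The bucket paths, column names, and tenure rule are illustrative assumptions, not the actual production pipeline.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CustomerCleansingJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("customer-cleansing")
      .getOrCreate()

    // Hypothetical S3 input location.
    val raw = spark.read.parquet("s3://example-landing-bucket/customers/")

    // Cleansing and standardization: trim and normalize strings,
    // drop records missing a primary key, and remove duplicates.
    val cleansed = raw
      .withColumn("email", lower(trim(col("email"))))
      .withColumn("country", upper(trim(col("country"))))
      .filter(col("customer_id").isNotNull)
      .dropDuplicates("customer_id")

    // Enrichment via Spark SQL: derive a tenure band from the signup date.
    cleansed.createOrReplaceTempView("customers")
    val enriched = spark.sql(
      """SELECT *,
        |       CASE WHEN datediff(current_date(), signup_date) > 365
        |            THEN 'returning' ELSE 'new' END AS tenure_band
        |FROM customers""".stripMargin)

    enriched.write.mode("overwrite").parquet("s3://example-curated-bucket/customers/")
    spark.stop()
  }
}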
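Sketch 2: a sketch of the SQS-to-Spark ingestion pattern (long-poll a batch, process it, delete only after a successful write). The queue URL and S3 path are hypothetical; the AWS SDK v2 calls are standard.

import scala.jdk.CollectionConverters._
import software.amazon.awssdk.services.sqs.SqsClient
import software.amazon.awssdk.services.sqs.model.{DeleteMessageRequest, ReceiveMessageRequest}
import org.apache.spark.sql.SparkSession

object SqsIngestJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sqs-ingest").getOrCreate()
    import spark.implicits._

    val sqs = SqsClient.create()
    // Hypothetical queue URL.
    val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/example-ingest-queue"

    // Long-poll a batch of up to 10 messages per request.
    val request = ReceiveMessageRequest.builder()
      .queueUrl(queueUrl)
      .maxNumberOfMessages(10)
      .waitTimeSeconds(20)
      .build()
    val messages = sqs.receiveMessage(request).messages().asScala.toSeq

    if (messages.nonEmpty) {
      // Parse each message body as a JSON record and land it in S3.
      val df = spark.read.json(messages.map(_.body()).toDS())
      df.write.mode("append").parquet("s3://example-landing-bucket/events/")

      // Delete only after a successful write, so failed batches are retried.
      messages.foreach { m =>
        sqs.deleteMessage(DeleteMessageRequest.builder()
          .queueUrl(queueUrl)
          .receiptHandle(m.receiptHandle())
          .build())
      }
    }
    spark.stop()
  }
}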
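Sketch 3: a minimal ScalaTest unit test of the kind used to validate transformation logic. The standardizeEmails helper is a hypothetical stand-in for a factored-out production function.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.scalatest.funsuite.AnyFunSuite

class CleansingSpec extends AnyFunSuite {
  // Local Spark session so the test runs without a cluster.
  private val spark = SparkSession.builder()
    .master("local[2]")
    .appName("cleansing-test")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical transformation under test.
  private def standardizeEmails(df: DataFrame): DataFrame =
    df.withColumn("email", lower(trim(col("email"))))

  test("emails are trimmed and lower-cased") {
    val input = Seq("  Alice@Example.COM ", "bob@example.com").toDF("email")
    val result = standardizeEmails(input).as[String].collect()
    assert(result.sameElements(Array("alice@example.com", "bob@example.com")))
  }
}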
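Sketch 4: a sketch of commonly tuned Spark settings for EMR-hosted jobs; the specific values are assumptions that would be sized to the actual cluster and data volume.

import org.apache.spark.sql.SparkSession

object TunedSession {
  def build(): SparkSession =
    SparkSession.builder()
      .appName("tuned-etl")
      // Size shuffles to the cluster rather than the 200-partition default.
      .config("spark.sql.shuffle.partitions", "400")
      // Let Spark coalesce small shuffle partitions at runtime.
      .config("spark.sql.adaptive.enabled", "true")
      // Scale executors up and down with the workload to control EMR cost.
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.maxExecutors", "50")
      // Use Kryo for faster, more compact shuffle serialization.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()
}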
Tools: AWS EMR, S3, SQS, IAM, Spark, Scala, Python, Airflow, Cassandra, Control-M, JIRA, Kafka.