Senior Data Engineer
Company Name
- Led requirements gathering, conducted system analysis, and provided development and testing effort estimates
- Contributed to the design of various system components, including Sqoop, Hadoop MapReduce processes, Hive, Spark, and FTP integration for downstream systems
- Implemented optimized Hive and Spark queries using techniques such as window functions and customized Hadoop shuffle and sort parameters
- Developed ETL processes using PySpark, utilizing both the DataFrame API and the Spark SQL API for transformations and actions
- Stored the resulting data in HDFS and transferred it to the Snowflake database
- Migrated an on-premises application to AWS, using services such as EC2 and S3 to process and store small datasets
- Maintained Hadoop clusters on AWS EMR
- Expertise in real-time data analytics using Spark Streaming, Kafka, and Flume
- Configured Spark Streaming to consume streaming data from Kafka and store it in HDFS
- Designed and developed ETL processes in AWS Glue to migrate Campaign data from external sources to AWS Redshift, employing various Spark transformations and actions for data cleansing
- Utilized Jira for issue tracking and Jenkins for continuous integration and deployment
- Enforced data catalog and governance standards
- Created DataStage jobs incorporating different stages for ETL processes, including Transformer, Aggregator, Sort, Join, Merge, Lookup, and more
- Created, debugged, scheduled, and monitored ETL batch processing jobs for Snowflake using Airflow
- Built ETL pipelines for data ingestion, transformation, and validation on AWS, collaborating with data stewards for data compliance
- Scheduled jobs using Airflow scripts with Python, adding tasks to DAGs and managing dependencies between tasks
- Employed PySpark for data extraction, filtering, and transformation in data pipelines
- Monitored servers using Nagios, CloudWatch, and the ELK Stack (Elasticsearch, Logstash, Kibana)
- Used dbt (data build tool) for ETL transformations, along with AWS Lambda and AWS SQS
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats
- Estimated cluster sizes and monitored and troubleshot Spark clusters on Databricks
- Automated data load processes to the target Data Warehouse using Unix Shell scripts
- Implemented monitoring solutions using Ansible, Terraform, Docker, and Jenkins
- Environment: Python, Power BI, AWS Glue, Athena, SSRS, SSIS, AWS S3, AWS Redshift, ETL, AWS EMR, AWS RDS, DynamoDB, SQL, Tableau, Distributed Computing, Snowflake, Spark, Kafka, MongoDB, Hadoop, Linux Command Line, Data Structures, PySpark, Oozie, HDFS, MapReduce, Cloudera, HBase, Hive, Pig, and Docker.
