• Translated existing Golang logic into PySpark code.
• Developed data ingestion workflows using AWS S3 as data storage and Spark's built-in capabilities to efficiently process large-scale datasets.
• Migrated data processing workflows from AWS EMR to Databricks, utilizing Databricks notebooks and clusters for interactive data exploration, prototyping, and job scheduling.
• Utilized Databricks Delta Lake, an optimized data lake solution, to efficiently store and manage large volumes of structured and semi-structured data, ensuring data integrity, reliability, and ACID compliance.
• Collaborated with data scientists and analysts to provide them with reliable and curated datasets in Databricks for advanced analytics, machine learning, and AI model development.
• Implemented security measures in Databricks, such as data encryption, role-based access control, and network isolation, to ensure data privacy and compliance with regulatory requirements.
• Optimized Spark SQL queries in Databricks by analyzing query execution plans, identifying inefficient operations or unnecessary shuffling, and applying appropriate optimizations such as predicate pushdown or join reordering.
• Created external tables in Athena, pointing to data stored in S3, using Glue Data Catalog for metadata management, enabling seamless query access to structured, semi-structured, and unstructured data.
• Developed an AWS Kinesis Data Firehose, Lambda, and S3 pipeline to fetch live data from an API and store it in S3.
• Developed a Spark application to process data stored in S3 and write the output back to S3.
• Designed and executed data quality jobs in Databricks using SQL queries, Python, or Scala, leveraging Databricks' distributed computing capabilities to process large volumes of data efficiently.
• Led a successful migration project from AWS Redshift to Snowflake, ensuring a seamless transition of data and analytics processes to the new platform.
• Conducted a thorough assessment of the existing AWS Redshift infrastructure and identified opportunities for optimization and improvement in the migration process.
• Designed and executed comprehensive migration plan, including data extraction from AWS Redshift, data transformation, and loading into Snowflake, while ensuring data integrity and minimal downtime.
• Developed and executed data migration scripts and processes, utilizing Snowflake's data loading capabilities, such as the COPY command and Snowpipe, to efficiently transfer data from AWS Redshift to Snowflake.
• Performed data validation and reconciliation to ensure accuracy and consistency of data after migration, identifying and resolving any discrepancies or anomalies.
• Advanced knowledge of the GCP ecosystem with a focus on BigQuery.
• Designed and implemented complex data processing workflows in Snowflake, leveraging its powerful SQL capabilities and scalable architecture to handle large volumes of data.
• Analyzed user needs to determine how software should be built or whether existing software should be modified.
• Designed and coded BigQuery queries to analyze data collections.
• Used Apache Spark and Python libraries to perform advanced data processing tasks, including machine learning algorithms, natural language processing, and graph analysis.
• Implemented Snowflake's Time Travel and Fail-safe features to manage and recover from data processing errors, ensuring data integrity and maintaining a reliable and consistent data processing environment.
• Optimized query performance in Snowflake for complex data processing scenarios by analyzing query execution plans, leveraging query hints, and applying optimization techniques such as clustering and partitioning.
• Designed and implemented end-to-end data solutions using AWS S3 as a data lake for storing raw and processed data, EMR for big data processing, Snowflake as data warehouse, and Tableau for data visualization and reporting.
• Developed data ingestion processes using AWS S3 and EMR, leveraging technologies such as Apache Spark to extract, transform, and load data from various sources into Snowflake for further analysis.
• Collaborated with business stakeholders and Tableau developers to understand reporting and visualization requirements, translating them into meaningful visualizations and interactive dashboards that provide actionable insights.
• Implemented data quality checks and validation using AWS Lambda and Airflow to ensure integrity and accuracy of data in S3, EMR, and Snowflake.
• Designed and implemented data archiving and backup strategies using S3 Glacier.
• Experienced in installing and configuring Databricks on AWS and Azure.
Apache Spark
Database Development
Unix Shell
AWS Big Data Stack, Azure (S3, EC2, EMR, Lambda, Glue, Athena, Redshift)
Data Warehouse (Snowflake, Redshift)
ETL - Informatica PowerCenter 9.6.1, AWS Glue
RDBMS - Microsoft SQL Server, Oracle 11g
Databricks
Visualization (Tableau, Python libraries)
Hadoop Ecosystem (MapReduce, Hive, Sqoop)
Programming Languages (Scala, Python, Java, Golang)
Version Control (GitHub)
Qlik Sense
Amazon Web Services Architect, covering resources like S3, EC2, IAM, Databases (DynamoDB, Redshift), VPC, Lambda, Glue, Athena, SQS, SNS, SES, API Gateway, Kinesis
Teamwork and Collaboration
Multitasking Abilities
Data Analysis