•Proficient in designing and building multiple data pipelines, overseeing complete ETL and ELT processes for data ingestion and transformation within Google Cloud Platform (GCP)
•Successfully set up Continuous Delivery pipeline using Docker and GitHub, streamlining deployment process
•Developed, deployed, and managed Spark and Scala jobs within a Hadoop cluster hosted on GCP
•Practical familiarity with Google Cloud Functions, employing Python for transferring data from CSV files stored in Google Cloud Storage (GCS) buckets into BigQuery (see the Cloud Function sketch after this section)
•Proficient in processing and loading both bounded and unbounded data from Google Pub/Sub topics to BigQuery via Cloud Dataflow (see the Beam pipeline sketch after this section)
•Leveraged Spark and Scala APIs to assess performance of Spark in comparison to Hive and SQL
•Successfully deployed applications to GCP using Spinnaker, leveraging rpm-based packages
•Architected several Directed Acyclic Graphs (DAGs) to automate ETL pipelines for seamless data processing (see the Airflow DAG sketch after this section)
•Developed pipeline for Proof of Concept (POC) to assess performance and efficiency of pipeline execution, comparing Google Cloud Dataproc clusters with Google Cloud Dataflow
•Automated feature engineering using Python scripts and deployed them on Google Cloud Platform (GCP) and BigQuery
•Responsible for implementing monitoring solutions using Terraform, Docker, and Jenkins
•Automated Datadog dashboards using Terraform scripts
•Proficient in architecting ETL transformation layers and writing Spark jobs to facilitate data processing.
•Proficient in collecting and processing large-scale raw data through scripting, web scraping, API calls, SQL queries, and application development
•Experienced in fact-dimensional modeling, including Star schema, Snowflake schema, transactional modeling, and Slowly Changing Dimensions (SCD)
•Involved in building ETL processes within Kubernetes, employing tools like Apache Airflow and Spark on GCP
•Proficient in machine learning techniques such as Decision Trees, Linear/Logistic Regression, and Statistical Modeling
•Experience in implementing machine learning back-end pipelines, particularly with Pandas and NumPy.
•Crafted data pipelines within Google Cloud's managed Airflow (Cloud Composer) to streamline ETL tasks using Airflow operators
•Proficient in a wide range of Google Cloud Platform (GCP) services including Dataproc, Google Cloud Storage (GCS), Cloud Functions, and BigQuery
•Developed real-time analytics pipeline on Google Cloud Platform (GCP), leveraging Apache Kafka for management and analysis of extensive streaming data stored in Google Cloud Storage (GCS), facilitating prompt insights for business decision-making
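Illustrative sketch for the GCS-to-BigQuery Cloud Functions bullet above: a minimal background Cloud Function in Python that loads a newly uploaded CSV from a GCS bucket into a BigQuery table. The project, dataset, and table names are hypothetical placeholders, not details from the engagement.

```python
# Hypothetical sketch: a background Cloud Function triggered when an object is
# finalized in a GCS bucket; it loads the new CSV into a BigQuery table.
# Project, dataset, and table names below are placeholders.
from google.cloud import bigquery

def gcs_csv_to_bigquery(event, context):
    """Entry point for a GCS-triggered (1st gen) Cloud Function."""
    bucket = event["bucket"]
    name = event["name"]
    if not name.endswith(".csv"):
        return  # ignore non-CSV objects

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # assume a header row
        autodetect=True,       # infer the schema from the file
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(
        f"gs://{bucket}/{name}",
        "my-project.my_dataset.my_table",   # placeholder destination
        job_config=job_config,
    )
    load_job.result()  # block until the load job completes
```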
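Illustrative sketch for the Pub/Sub-to-BigQuery Dataflow bullet above: a minimal streaming Apache Beam pipeline in Python. The topic, table, and schema are hypothetical; Dataflow runner, project, and region flags would be supplied when launching on GCP.

```python
# Hypothetical sketch: a streaming Apache Beam pipeline that reads JSON messages
# from a Pub/Sub topic and appends rows to a BigQuery table. Runs on the
# DirectRunner as written; Dataflow options would be added at launch time.
# Topic, table, and schema are placeholders.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions(streaming=True)  # Pub/Sub is an unbounded source
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/events")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.events",
                schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

if __name__ == "__main__":
    run()
```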
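Illustrative sketch for the ETL-automation DAG bullet above: a minimal Airflow DAG wiring three placeholder tasks into a daily extract-transform-load sequence. The DAG id and callables are hypothetical.

```python
# Hypothetical sketch: a daily Airflow DAG chaining three placeholder tasks
# into an extract -> transform -> load sequence.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass  # pull source data

def transform():
    pass  # apply business rules

def load():
    pass  # write to the warehouse

with DAG(
    dag_id="etl_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load  # dependencies define the DAG edges
```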
• Designed and implemented data migration pipelines using Google Cloud's suite of services such as Cloud Storage, BigQuery, and Dataflow to transfer data seamlessly from Azure to GCP
• Integrated Datadog into continuous integration and continuous deployment (CI/CD) pipelines to monitor performance impact of code changes, track deployments, and ensure reliability of applications throughout the software development lifecycle
• Actively participated in migrating on-premises Hadoop systems to GCP (Google Cloud Platform)
• Conducted in-depth analysis of data from diverse domains to enable seamless integration into a Data Marketplace
• Developed PySpark programs, established data frames, and executed data transformations (see the PySpark sketch after this section)
• Proficiently employed a variety of GCP services, including Cloud Storage, Dataproc, Dataflow, BigQuery, Compute Engine, and GKE
• Configured Snowflake to directly ingest data from GCP storage services like Google Cloud Storage using storage integrations
• Leveraged GCP's managed services, including Cloud Dataflow with the Apache Beam SDK, to orchestrate complex data processing tasks and perform batch and stream processing on data stored in Snowflake and other GCP services
• Developed a Continuous Delivery pipeline incorporating Maven, Ant, Jenkins, and GCP
• Engineered multi-cloud strategies, leveraging strengths of GCP, especially its Platform as a Service (PaaS) offerings
• Crafted and implemented automated remediation workflows utilizing Datadog's integrations and APIs for finance data management; these workflows effectively addressed monitoring alerts, executed self-healing actions, and mitigated incidents in real time
• Stored daily data files in Google Cloud Storage buckets, effectively harnessing Dataproc and BigQuery to maintain cloud-based solutions
• Collaborated with various business units to steer design and development strategy
• Produced functional specifications and technical design documentation
• Coordinated with teams such as cloud security, Identity Access Management, Platform, and Network to secure necessary accreditations and intake processes
• Leveraged cloud and GPU computing technologies for automation of machine learning and analytics pipelines, with primary focus on GCP
• Actively engaged in Proof of Concept (POC) to assess different cloud offerings, including Google Cloud Platform (GCP)
• Conducted comparative analysis between self-hosted Hadoop and GCP's Dataproc, while also exploring Bigtable (managed HBase) use cases and evaluating performance improvements.
• Leveraged Spark RDD, Data Frame API, Data Set API, Data Source API, Spark SQL, and Spark Streaming alongside SQL and DynamoDB for comprehensive data processing
• Developed Spark applications using both Python and R, including implementing Apache Spark data processing projects to handle data from various RDBMS and streaming sources
• Employed Apache Spark's data frames, Spark-SQL, and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL, and MLlib libraries
• Pioneered the deployment of AWS CloudFormation templates to streamline provisioning and managing infrastructure resources, ensuring scalability and resilience in multi-tier application environments
• Efficiently extracted data from SQL Server, Amazon S3 buckets, and internal SFTP, loading it into AWS S3 buckets in a data warehouse context
• Developed Spark jobs for data processing and orchestrated instances and clusters to load data into AWS S3 buckets, thereby creating a DataMart
• Leveraged AWS EMR for processing and transforming data to assist the Data Science team based on business requirements
• Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources, such as S3, ORC/Parquet/Text files, into AWS Redshift (see the Glue job sketch after this section)
• Engaged in both batch processing and real-time data processing using Spark Streaming with a Lambda architecture
• Developed Python code for various tasks, dependencies, and time sensors in the context of workflow management and automation using the Airflow tool
• Collaborated with the DevOps team to implement Nifi Pipelines on EC2 nodes, integrated with Spark, Kafka, and Postgres running on other instances, using SSL handshakes in QA and Production Environments.
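Illustrative sketch for the PySpark bullets above: a minimal job that builds a DataFrame from an RDBMS table over JDBC, applies an aggregation, and writes the result out as a DataMart-style Parquet dataset. The JDBC URL, credentials, columns, and output path are hypothetical placeholders.

```python
# Hypothetical sketch: build a DataFrame from an RDBMS table over JDBC, apply a
# simple aggregation, and write the result to S3 as Parquet. Connection details,
# columns, and the output path are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_datamart").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

daily_revenue = (
    orders
    .withColumn("order_date", F.to_date("order_ts"))   # derive a date column
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.write.mode("overwrite").parquet("s3a://my-bucket/marts/daily_revenue/")
```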
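Illustrative sketch for the AWS Glue bullet above: a minimal Glue (PySpark) job script that reads Parquet campaign files from S3 and loads them into Redshift through a catalog connection. It assumes the Glue runtime environment; the paths, connection name, and table names are hypothetical.

```python
# Hypothetical sketch of an AWS Glue (PySpark) job: read Parquet campaign files
# from S3, rename/cast columns, and load them into Redshift through a Glue
# catalog connection. Requires the Glue runtime; names below are placeholders.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw campaign data from S3 as a DynamicFrame
campaigns = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/campaigns/"]},
    format="parquet",
)

# Map source columns onto the Redshift table's column names and types
mapped = ApplyMapping.apply(
    frame=campaigns,
    mappings=[
        ("id", "string", "campaign_id", "string"),
        ("spend", "double", "spend_usd", "double"),
    ],
)

# Write into Redshift via a pre-defined Glue connection, staging through S3
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "analytics.campaigns", "database": "dw"},
    redshift_tmp_dir="s3://my-bucket/tmp/",
)
job.commit()
```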
•Orchestrated pipelines to extract, transform, and load data from diverse sources including Azure SQL, Blob storage, Azure SQL Data Warehouse, and write-back tools
•Analyzed, designed, and constructed contemporary data solutions using Azure's Platform as a Service (PaaS) to facilitate data visualization
•Extracted, transformed, and loaded data from source systems to Azure Data Storage services
•Designed and maintained data models and schemas in Azure Synapse Analytics for efficient querying and reporting, utilizing T-SQL for schema management and optimization
•Implemented a scalable data integration solution on Microsoft Azure utilizing Informatica, enabling seamless extraction, transformation, and loading (ETL) of large datasets from diverse sources into Azure data repositories for advanced analytics and reporting.
•Developed and deployed Java MapReduce jobs on Azure HDInsight, enhancing data processing capabilities
•Designed and implemented data processing and transformation logic on Azure using Spark, PySpark, and SQL
•Architected scalable and cost-effective data processing pipelines using Azure Databricks, Spark, and Delta Lake to handle large volumes of streaming and batch data
•Integrated Azure data services with other Azure platform services like Azure Active Directory, Azure VNet, and Azure Monitor
•Implemented SVN (Subversion) version control system for maintaining and tracking revisions in data pipelines, facilitating effective collaboration and versioning control among development teams.
•Proficiently analyzed the Hadoop cluster and various big data analytic tools, including the HBase database and Sqoop
•Leveraged Talend for data integration, cleansing, and transformation, while using dbt to refine raw data into structured datasets, leading to faster processing times and higher data quality
•Crafted, developed, and maintained Tableau functional reports according to user specifications, ensuring meaningful data visualization
•Deployed Hadoop and Cloudera Distribution for Hadoop (CDH) to optimize the data processing pipeline, including setup, real-time data ingestion with Flume, and Spark analytics
•Proficiency in Python and Scala, with experience creating user-defined functions (UDFs) for Hive and Pig using Python
•Integrated MongoDB with big data processing frameworks like Hadoop and Spark to build end-to-end data pipelines for batch and stream processing
•Configured HBase tables to accommodate various data formats, specifically PII data from diverse portfolios
•Developed complex Hive SQL queries to extract, transform, and load data from HDFS into Hive tables (see the Hive/Spark sketch after this section)
•Demonstrated a commitment to best practices in unit testing, continuous integration, continuous delivery (CI/CD), performance testing, capacity planning, documentation, monitoring, alerting, and incident response.
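Illustrative sketch for the HDFS-to-Hive bullet above: HiveQL executed through a Hive-enabled SparkSession that exposes raw HDFS files as an external table and loads a curated table. The databases, tables, and paths are hypothetical, and the same statements could equally be run directly in Hive.

```python
# Hypothetical sketch: HiveQL run through a Hive-enabled SparkSession. An
# external table is declared over raw delimited files on HDFS, then a curated
# Hive table is loaded from it. Databases, tables, and paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs_to_hive")
    .enableHiveSupport()   # connect to the Hive metastore
    .getOrCreate()
)

# External table over raw tab-delimited files landed on HDFS
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.web_logs_raw (
        user_id STRING, url STRING, event_ts STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION 'hdfs:///data/raw/web_logs/'
""")

# Transform and load into the curated Hive table
spark.sql("""
    INSERT OVERWRITE TABLE curated.web_logs
    SELECT user_id, url, to_date(event_ts) AS event_date
    FROM staging.web_logs_raw
    WHERE user_id IS NOT NULL
""")
```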
Azure Services: Azure SQL, Blob storage, Azure Data Storage, Azure Synapse Analytics, Azure Databricks, HDInsight