Data Engineer with over 10 years of experience building, optimizing, and managing scalable data infrastructure, enabling data-driven decision-making across diverse domains through data engineering best practices and cloud-native solutions.
Proficient in SQL and NoSQL technologies including PostgreSQL, MySQL, Oracle, Cassandra, and MongoDB, delivering efficient querying and storage for both structured and unstructured datasets.
Strong programming expertise in Python, Scala, and Java, applied to design robust ETL pipelines, real-time stream processors, and automated data transformation scripts for high-performance data solutions.
Deep understanding of the Hadoop ecosystem, including HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, and Oozie, delivering distributed processing and big data analytics solutions at scale.
Comprehensive hands-on experience with the Apache Spark ecosystem, including PySpark, Spark Core, Spark SQL, and Spark Streaming, for batch and real-time processing, in-memory analytics, and machine learning pipelines (see the batch ETL sketch below).
Designed and implemented cloud-native ETL workflows using AWS Glue, Lambda, and Athena, enabling automated data extraction, transformation, and querying across distributed systems with high scalability and low operational overhead.
Built and maintained data lake architectures on AWS S3 and EMR, integrating real-time and batch data pipelines to support large-scale analytics, reduce processing time, and ensure high availability in production environments.
Proven expertise in designing and managing end-to-end data pipelines, covering data ingestion, extraction, transformation, and loading (ETL) to support enterprise-wide analytics, machine learning, and business intelligence needs.
Specialized in data quality and security practices, including data cleansing, validation, profiling, deduplication, and encryption, ensuring accuracy, integrity, and protection of sensitive data across all processing stages.
Strong knowledge of data governance frameworks and compliance standards such as GDPR and CCPA, implementing controls for secure data handling, auditing, and enterprise-level data integration, migration, and processing workflows.
Developed and orchestrated complex data pipelines using Apache Airflow, Luigi, and Dagster, ensuring reliable scheduling, monitoring, and dependency management across data workflows.
Implemented streaming data applications using Apache Kafka and Apache Flink, enabling real-time data ingestion, event processing, and analytics for time-sensitive business use cases.
Applied Infrastructure as Code (IaC) practices using Terraform, enabling consistent, automated, and scalable deployment of cloud infrastructure for data engineering platforms.
Designed and optimized large-scale data warehouses using Snowflake, AWS Redshift, Azure Synapse Analytics, and Databricks Delta Lake, improving data availability for advanced analytics and reporting.
Specialized in dimensional modeling, including the design and implementation of star and snowflake schemas, OLAP cubes, and fact and dimension tables for efficient analytical querying.
Well-versed in data serialization and file formats including Avro, Parquet, ORC, JSON, and XML, optimizing storage and exchange of large datasets across distributed systems.
Developed traditional ETL pipelines using Informatica, Talend, Apache NiFi, and SSIS, ensuring reliable data extraction, transformation, and loading from diverse sources to centralized platforms.
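For illustration, a minimal sketch of the kind of PySpark batch ETL job summarized above; the bucket paths, dataset layout, and column names are hypothetical and not taken from any specific engagement.

```python
# Hypothetical sketch of a PySpark batch ETL step: read raw JSON events from S3,
# cleanse and deduplicate them, and write partitioned Parquet for downstream analytics.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("daily_events_etl")
    .getOrCreate()
)

# Extract: raw, semi-structured events landed by an upstream ingestion job
raw = spark.read.json("s3://example-raw-zone/events/2024-01-01/")

# Transform: basic cleansing, deduplication, and typing before loading
clean = (
    raw.dropDuplicates(["event_id"])                        # deduplicate on a business key
       .filter(F.col("event_id").isNotNull())               # drop malformed records
       .withColumn("event_ts", F.to_timestamp("event_ts"))  # normalize timestamps
       .withColumn("event_date", F.to_date("event_ts"))     # derive a partition column
)

# Load: columnar, partitioned output in the curated zone of the data lake
(
    clean.write
         .mode("overwrite")
         .partitionBy("event_date")
         .parquet("s3://example-curated-zone/events/")
)
```

In practice a job like this would be scheduled as a task in an Airflow DAG, with partition-level overwrites and data-quality checks added around the load step.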
Created interactive and insightful data visualizations and dashboards using Tableau, Power BI, and QlikView, empowering business users with self-service analytics and data storytelling.
Containerized data engineering applications and deployed them using Docker and Kubernetes, ensuring scalability, portability, and efficient orchestration across environments.
Used Git, SVN, and Jira/Confluence for version control, team collaboration, and project tracking, contributing to streamlined DevOps and agile software delivery cycles.
Integrated CI/CD pipelines using AWS CodePipeline, AWS CodeDeploy, and Jenkins, automating build, test, and deployment stages in the delivery lifecycle of data engineering applications.
Focused on real-time data processing, pipeline performance tuning, and data modeling strategies to ensure high availability, reliability, and low-latency insights from complex datasets (see the streaming sketch below).
Experienced in Agile Scrum and Waterfall SDLC methodologies, collaborating cross-functionally with stakeholders to deliver high-quality data products in iterative and structured development environments.
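Similarly, a minimal sketch of the kind of real-time processing described above, using Spark Structured Streaming over Kafka; the broker address, topic name, and event schema are assumed for illustration, and the job requires the spark-sql-kafka connector package on the Spark classpath.

```python
# Hypothetical sketch of a Spark Structured Streaming job: consume clickstream events
# from Kafka, parse them against an assumed schema, and emit per-minute page-view counts.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream_stream").getOrCreate()

# Assumed event schema for the JSON payload carried in the Kafka message value
schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("event_ts", TimestampType()),
])

# Ingest: subscribe to a Kafka topic and parse the message value as JSON
events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "clickstream")
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
         .select("e.*")
)

# Aggregate: page views per one-minute window, with a watermark to bound late data
counts = (
    events.withWatermark("event_ts", "5 minutes")
          .groupBy(F.window("event_ts", "1 minute"), "page")
          .count()
)

# Sink: console output for illustration; a production job would target a lake or warehouse table
query = (
    counts.writeStream
          .outputMode("update")
          .format("console")
          .option("truncate", "false")
          .start()
)
query.awaitTermination()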
Python