Highly competent Data Engineer with a background in designing, testing, and maintaining data management systems. Strong skills in database design and data mining, coupled with experience using machine learning to improve business decision-making. Previous work optimized data retrieval processes and improved system efficiency.
• Automated infrastructure provisioning with AWS tools such as CloudFormation and Elastic Beanstalk, reducing infrastructure setup time and improving team productivity by 25%.
• Spearheaded a benchmark using Spark and Scala APIs to compare Spark against Hive and SQL, achieving a 30% performance improvement for data processing tasks.
• Used SQL queries to transform and enrich customer profiles through aggregation, filtering, and integration of external datasets.
• Designed an efficient SQL data model for storing processed and enriched data, optimizing indexes for fast querying and storage efficiency.
• Designed and implemented Hive tables to store website log data in a structured format, enabling efficient analysis with Apache Spark and reducing data processing time for marketing campaign performance analysis, which led to 35% faster identification of key insights.
• Developed efficient ETL workflows using AWS Data Pipeline for extraction, transformation, and loading, which reduced data processing time by 10%.
• Used Spark SQL to pre-process, cleanse, and join terabytes of data, improving data quality metrics, including a 20% reduction in errors.
• Led a cross-functional team to implement automated dashboards and reporting using Power BI, giving stakeholders data-driven insights and achieving a 25% improvement in workforce performance.
Programming Languages: Scala, Python, Java, SQL, PL/SQL, R
IDEs: PyCharm, Jupyter Notebook
Big Data Ecosystem: Hadoop, MapReduce, Hive, DynamoDB, BigQuery, HDFS, Apache Spark, Apache Airflow, Storm
Machine Learning: Linear Regression, Logistic Regression, Decision Tree, K-means, Naïve Bayes, Random Forest, Reconsolidation models, calculation models
Cloud Technologies: AWS (EC2, S3, Redshift, Lambda, IAM, Kinesis, EMR), Kafka, Databricks, Microsoft Azure
Packages: NumPy, Pandas, Matplotlib, SciPy, Scikit-learn, Seaborn, TensorFlow, PySpark
Reporting Tools: Tableau, Power BI, SSRS
Databases: MS SQL Server, PostgreSQL, MongoDB, MySQL, Cassandra
Operating Systems: Windows, macOS