
Vivek K

Summary

With over 11 years of experience as a Data Engineer, I specialize in implementing comprehensive Big Data solutions and architecting Hadoop ecosystems. My expertise spans Apache Hadoop components such as HDFS, MapReduce, YARN, Hive, Sqoop, and HBase, with a strong command of Spark for both ETL and real-time data processing, including the RDD, DataFrame, and Dataset APIs. I have optimized Hive performance, troubleshot complex queries, and managed large-scale data environments in both AWS (EMR, S3, Redshift) and Azure (HDInsight, Databricks, Data Lake).

I have leveraged Kafka for real-time data pipelines and AWS Glue and Athena for automated, streamlined ETL, while managing enterprise data lakes for structured and unstructured data analysis. My skill set includes NoSQL databases such as HBase and Cassandra, integrated seamlessly within Hadoop clusters, along with file formats such as Avro, ORC, and Parquet, multiple Hadoop distributions (Cloudera, Hortonworks), and workflow schedulers such as Oozie and Airflow for efficient automation.

Beyond Big Data, I have a solid foundation in core programming languages, including Java and JavaScript, which equips me to design scalable, secure, and efficient RESTful APIs. I have led several data integration efforts in Agile environments using JIRA and Confluence, and I bring strong data modeling skills for both OLTP and OLAP systems, with experience in SQL, PySpark, and backend database analysis. This well-rounded expertise enables me to drive high-performance data engineering initiatives, from cloud data migration to advanced analytics and data pipeline optimization.

Overview

11 years of professional experience
1 Certification

Work History

Senior Data Engineer

Citibank
11.2024 - Current
  • Collaborated on ETL (Extract, Transform, Load) tasks, maintaining data integrity and verifying pipeline stability.
  • Led the migration of large-scale on-premises Hadoop workloads, optimizing performance, scalability, and reducing legacy infrastructure dependencies.
  • Designed and maintained data pipelines using Hadoop ecosystem tools (HDFS, Hive, Impala, Sqoop, Spark), enabling efficient batch and streaming data processing.
  • Implemented Spark (Scala/PySpark) based ETL workflows to transform and cleanse structured and semi-structured datasets, improving downstream data quality (a minimal PySpark sketch follows this list).
  • Optimized Hive queries, partitioning strategies, and file formats (ORC, Parquet) to significantly reduce query execution time and improve reporting performance.
  • Migrated legacy shell-script based ETL jobs to modern, reusable Spark-based solutions, reducing operational overhead and improving maintainability and performance.
  • Implemented data quality checks and validation frameworks to reconcile migrated datasets against source systems, ensuring accuracy and completeness.
  • Participated in capacity planning and resource optimization for Hadoop clusters, fine-tuning YARN and Spark configurations to improve utilization.
  • Performed impact analysis and dependency mapping to ensure smooth transition of critical business applications during data migration.
  • Conducted performance tuning, root-cause analysis, and troubleshooting for production pipelines, ensuring high availability and adherence to SLAs.
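
A minimal PySpark sketch of the batch ETL pattern above, assuming hypothetical paths and column names (raw_txns, txn_id, txn_ts); the actual jobs were specific to the bank's datasets:

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("txn-etl")
             .enableHiveSupport()   # lets the job query and register Hive tables
             .getOrCreate())

    # Read a semi-structured landing dataset and cleanse it.
    raw = spark.read.json("hdfs:///landing/raw_txns/")   # hypothetical path
    clean = (raw
             .dropDuplicates(["txn_id"])
             .filter(F.col("amount").isNotNull())
             .withColumn("txn_date", F.to_date("txn_ts")))

    # Write partitioned Parquet so Hive/Impala queries can prune partitions.
    (clean.write
     .mode("overwrite")
     .partitionBy("txn_date")
     .parquet("hdfs:///curated/txns/"))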

Data Engineer

Cotiviti
11.2022 - 10.2024
  • Implemented a centralized Data Lake on AWS using S3, EMR, Redshift, and Athena, optimizing big data storage and processing solutions.
  • Built and automated ETL data pipelines using AWS Glue and PySpark, streamlining data ingestion and transformation from multiple sources (an illustrative Glue job sketch follows this list).
  • Supported continuous storage in AWS using Elastic Block Store (EBS), S3, and Glacier; created volumes and configured snapshots for EC2 instances. Created on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
  • Built a series of Spark applications and Hive scripts to produce the analytical datasets needed by digital marketing teams.
  • Created AWS Glue crawlers for crawling the source data in S3 and RDS.
  • Created multiple Glue ETL jobs in Glue Studio, processed the data with various transformations, and loaded it into S3, Redshift, and RDS.
  • Built and optimized real-time data pipelines using Kafka and Spark Streaming, enabling low-latency data ingestion from external financial feeds into Redshift for analytics.
  • Worked extensively on fine-tuning Spark applications and providing production support for the various pipelines running in production.
  • Worked closely with business and data science teams to ensure all their requirements were translated accurately into our data pipelines.
  • Wrote various RDD (Resilient Distributed Dataset) transformations and actions using Spark with Scala.
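
As a rough illustration of those Glue jobs, here is a minimal PySpark Glue script; the database, table, and bucket names are placeholders, not the production catalog:

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table that a Glue crawler has already catalogued from S3/RDS.
    src = glue_context.create_dynamic_frame.from_catalog(
        database="claims_db", table_name="raw_claims")   # placeholder names

    # Rename/retype columns, then land the result in S3 as Parquet.
    mapped = src.apply_mapping([
        ("claim_id", "string", "claim_id", "string"),
        ("paid_amt", "double", "paid_amount", "double"),
    ])
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://curated-bucket/claims/"},
        format="parquet",
    )
    job.commit()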

Data Engineer

Ebay
03.2021 - 11.2022
  • Designed and developed ETL pipelines using AWS services (Glue, Redshift, S3), transforming raw data into curated datasets for business intelligence.
  • Leveraged Terraform for the deployment and management of AWS infrastructure, ensuring scalability and cost-efficiency.
  • Created VPCs, private and public subnets, and NAT gateways in a multi-region, multi-zone infrastructure landscape to support worldwide operations.
  • Managed Amazon Web Services (AWS) infrastructure with orchestration tools such as CloudFormation (CFT), Terraform, and Jenkins pipelines. Created Terraform scripts to automate deployment of EC2 instances, S3, EFS, EBS, IAM roles, snapshots, and the Jenkins server.
  • Built cloud data stores in S3 with logical layers for raw, curated, and transformed data. Created data ingestion modules using AWS Glue to load data into the various S3 layers, with reporting through Athena and QuickSight.
  • Created and managed bucket policies and lifecycle rules for S3 storage per organizational and compliance guidelines. Created parameters and SSM documents using AWS Systems Manager.
  • Established CI/CD tools such as Jenkins and Bitbucket for code repository, build, and deployment of the Python code base.
  • Built Glue jobs for technical data cleansing such as deduplication, NULL-value imputation, and removal of redundant columns, as well as Glue jobs for standard data transformations (date/string and math operations) and business transformations required by business users.
  • Integrated AWS Kinesis and Databricks streaming workloads to process user activity events in near real-time, supporting fraud detection and personalization use cases.
  • Created Athena data sources on S3 buckets for ad hoc querying and business dashboarding using the QuickSight and Tableau reporting tools.
  • Copied fact/dimension and aggregate output from S3 to Redshift for historical data analysis using Tableau and QuickSight.
  • Used Lambda functions and Step Functions to trigger Glue jobs and orchestrate the data pipeline (an illustrative handler sketch follows this list); used the PyCharm IDE for Python/PySpark development and Git for version control and repository management.
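
A sketch of that Lambda-triggered orchestration using Boto3; the job name and argument are hypothetical:

    import boto3

    glue = boto3.client("glue")

    def handler(event, context):
        # Kick off the (hypothetical) curation job; the Step Functions state
        # machine polls GetJobRun until the run succeeds before moving on.
        run = glue.start_job_run(
            JobName="curate-daily-facts",
            Arguments={"--ingest_date": event.get("ingest_date", "")},
        )
        return {"JobRunId": run["JobRunId"]}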

Data Engineer

WebbMason Analytics
01.2020 - 02.2021
  • Implemented big data pipelines using AWS Step Functions, Glue, and PySpark to process and analyze large volumes of behavioral data.
  • Developed Scala-based Spark applications for performing data cleansing, event enrichment, data aggregation, de-normalization, and data preparation needed for machine learning and reporting teams to consume.
  • Designed and developed ETL jobs in AWS Glue to extract data from S3 objects and load it into a data mart in Redshift.
  • Worked on fine-tuning Spark applications to improve overall processing time for the pipelines.
  • Built Kafka producers to stream data from external APIs into Kafka topics, ensuring real-time data availability for downstream processing.
  • Designed structured streaming applications in Spark to deliver real-time behavioral analytics dashboards, improving client marketing campaign responsiveness (a streaming sketch follows this list).
  • Experienced in handling large datasets using Spark's in-memory capabilities, broadcast variables, and effective and efficient joins and transformations.
  • Experience working with EMR clusters in the AWS cloud and with S3.
  • Involved in creating Hive tables and loading and analyzing data using Hive scripts.
  • Designed, developed, and maintained data integration applications that worked with both standard and non-traditional source systems, using RDBMS and NoSQL storage for data access and analysis in Hadoop and RDBMS contexts. Used Spark's in-memory computing for advanced tasks such as text analytics, writing Spark SQL queries over RDDs and DataFrames that mixed Hive queries with Scala and Python data operations.
  • Good experience with continuous integration of applications using Jenkins.
  • Used reporting tools such as Tableau, connected to Athena, to generate daily data reports.
  • Collaborated with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
  • Identified and documented operational problems by following standards and procedures, using JIRA.
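
A minimal sketch of such a structured streaming job; the broker address, topic, and JSON field are assumptions, and it requires the spark-sql-kafka package on the classpath:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("behavior-stream").getOrCreate()

    # Read click events from Kafka as they arrive (broker/topic are placeholders).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "behavior-events")
              .load())

    # Pull a field out of the JSON payload and count events per user
    # in 5-minute windows, tolerating 10 minutes of late data.
    parsed = events.select(
        F.get_json_object(F.col("value").cast("string"), "$.user_id").alias("user_id"),
        F.col("timestamp"))
    counts = (parsed
              .withWatermark("timestamp", "10 minutes")
              .groupBy(F.window("timestamp", "5 minutes"), "user_id")
              .count())

    # Console sink for the sketch; a real job would feed the dashboards.
    query = (counts.writeStream
             .outputMode("update")
             .format("console")
             .start())
    query.awaitTermination()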

Hadoop Developer

IMG Solutions
10.2017 - 12.2019
  • Designed scalable, reusable data pipelines using Hadoop and Spark, automating data ingestion and transformation for both batch and streaming data.
  • Migrated on-premises Hadoop workloads to AWS EMR, significantly improving scalability and reducing infrastructure maintenance.
  • Utilized Spark for in-memory processing, optimizing performance and speeding up data transformations on large datasets.
  • Automated the scheduling of data workflows using Oozie and migrated jobs to AWS Simple Workflow (SWF), ensuring seamless cloud integration.
  • Worked with various Hadoop ecosystem components such as Hive, Pig, and HDFS for efficient data processing and management.
  • Built and implemented automated procedures to split large files into smaller batches to facilitate FTP transfer, reducing execution time by 60%.
  • Wrote multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction, transformation, and aggregation from multiple file formats, including Parquet, Avro, XML, JSON, CSV, and ORC, with compression codecs such as Gzip, Snappy, and LZO.
  • Developed various custom UDFs in Spark for transformations on date fields, complex string columns, encrypting PII fields, etc. (a minimal UDF sketch follows this list).
  • Wrote complex Hive scripts for various data analyses and reports requested by business stakeholders.
  • Used Oozie and Oozie Coordinators for automating and scheduling our data pipelines.
  • Worked extensively on migrating our existing on-premises data pipelines to the AWS cloud for better scalability and easier infrastructure maintenance.
  • Worked extensively in automating the creation/termination of EMR clusters as part of starting the data pipelines.
  • Worked extensively on migrating/rewriting existing Oozie jobs to AWS Simple Workflow (SWF).
  • Loaded the processed data into Redshift tables for allowing downstream ETL and Reporting teams to consume the processed data.
  • Good experience working with analysis tools such as Tableau and Splunk for regression analysis, pie charts, and bar graphs.
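
A minimal sketch of the kind of Spark UDF described above, with a SHA-256 hash standing in for the actual encryption and hypothetical column names:

    import hashlib
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-demo").getOrCreate()

    @F.udf(returnType=StringType())
    def mask_pii(value):
        # One-way hash stands in for the real encryption of PII fields.
        return hashlib.sha256(value.encode()).hexdigest() if value else None

    df = spark.createDataFrame([("1985-3-7", "123-45-6789")], ["dob", "ssn"])
    out = (df
           .withColumn("dob", F.to_date("dob", "yyyy-M-d"))  # normalize date fields
           .withColumn("ssn", mask_pii("ssn")))              # mask the PII column
    out.show()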

Software Engineer

Qentelli Solutions
07.2014 - 09.2017
  • Involved in preparing high-level design documents, coding, analyzing business requirements, and enhancing my programming skills.
  • Developed serverless Python AWS Lambda functions with concurrency and multi-threading to speed up processing by executing callables asynchronously.
  • Developed Python automation scripts to facilitate quality testing.
  • Automated the nightly build to run quality control using Python with the Boto3 library, ensuring the pipeline does not fail and reducing effort by 70%.
  • Wrote Python modules to extract/load asset data from the MySQL source database.
  • Experience using PL/SQL to write stored procedures, functions, and triggers.
  • Worked with a backend team to design, build, and implement RESTful APIs for various services.
  • Analyzed business process workflows and assisted in the development of ETL procedures for mapping data from source to target systems.
  • Created a web application using Python scripting for data processing, MySQL for the database, and HTML, CSS, jQuery, and Highcharts for data visualization of the served pages.
  • Developed Python automation scripts to streamline ETL processes, reducing execution time by 70% through efficient multi-threading and concurrency (see the concurrency sketch after this list).
  • Managed AWS Lambda functions to trigger automated data workflows, ensuring seamless data ingestion and processing in near real-time.
  • Integrated MySQL with AWS services, using Python to build RESTful APIs for data access and manipulation.
  • Automated infrastructure management and data processing workflows using Jenkins and Boto3, improving reliability and reducing manual intervention.
  • Debugged and troubleshot web applications, using Git as a version-control tool to collaborate and coordinate with team members.
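
A small sketch of that concurrency pattern, fanning Boto3 calls across threads inside a Lambda handler; the bucket name and event shape are assumptions:

    from concurrent.futures import ThreadPoolExecutor
    import boto3

    s3 = boto3.client("s3")   # Boto3 clients are safe to share across threads

    def fetch(key):
        # Each worker thread downloads one object independently.
        body = s3.get_object(Bucket="asset-data", Key=key)["Body"].read()
        return key, len(body)

    def handler(event, context):
        # Fan the per-object work out across threads instead of looping serially.
        with ThreadPoolExecutor(max_workers=8) as pool:
            results = list(pool.map(fetch, event["keys"]))
        return dict(results)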

Education

Bachelor of Technology - Electronics and Communication

Amrita School of Engineering

Skills

  • AWS Kinesis, S3, EMR, Redshift, Athena, Glue
  • PySpark, Core Java, Scala, Python, Unix shell scripting, Node.js
  • Cassandra, MongoDB, Snowflake
  • MySQL, Teradata, Oracle, PostgreSQL
  • Hadoop, HDFS, Airflow, Terraform, Sqoop, MapReduce, Hive, HBase, Hue, Cloudera, Hortonworks, Spark, Kafka

Certification

AWS Certified Solutions Architect – Associate
