Kevin (Kaijun) Li

Issaquah, WA

Summary

Experienced software engineer with 8+ years in ML infrastructure, cloud technologies, and distributed systems. Proven track record of optimizing ML workflows at NVIDIA, Oracle Cloud, and AWS. Skilled in Kubernetes, Terraform, Ansible, and multiple programming languages. Effective leader, adept at driving innovation and collaboration across teams to tackle complex challenges.

Overview

12 years of professional experience

Work History

Software Development Engineer

Amazon
03.2022 - Current
  • Role: Led the design and development of a new CloudWatch Publisher for the Route53 Health Check service.
  • Implemented a middleware component that validates CloudWatch metrics to ensure consistent metrics publishing.
  • Engaged internal customers (NLB, CloudMap, AGA) to hear their concerns about the migration and answer their questions.
  • Keywords: AWS Route53, CloudWatch Publishing, Sharding, Java

Software Engineer

Oracle
02.2020 - 03.2022
  • Role: Oversaw the CI/CD deployment workflow and contributed to the development of Kafka agents.
  • Developed shell scripts for SSH configuration, managing inventory files, and automating token rotation processes.
  • Implemented a Kafka agent in Go and deployed it to AWS, Azure, DigitalOcean, and Rackspace.
  • Launched the IPM solution after successful A/B testing, replacing the Cisco ThousandEyes monitoring solution.
  • Keywords: Go, Ansible, Shell, Terraform, Kafka, Cloud Object Store (token/secrets), S3 (Terraform states)

Machine Learning Infrastructure Engineer

NVIDIA
07.2016 - 02.2020
  • Role: Primary owner for onboarding the automated vehicles (AV) team to a new Kubernetes cluster.
  • Enhanced the logging architecture, resulting in a 20% improvement in training speed.
  • Implemented GPU affinity functionality to reduce partial failures in distributed training jobs within the Kubernetes cluster.
  • Keywords: Go, Kubernetes, DGX, Prometheus, Grafana, Jenkins

Test Engineer

Broadcom
02.2014 - 07.2016
  • Role: Automated test engineer (ATE) for a next-gen high-speed Ethernet solution.
  • Gathered requirements from the application team and delivered a testing framework for the company's first automotive Ethernet application.
  • Implemented over 50 features in Python for the ATE framework, working with cross-functional teams.
  • Keywords: Python, ATE, Test framework

Quality Assurance Engineer

Intel
10.2011 - 01.2014
  • Role: QA engineer for PCIe/SATA and post-silicon automation workflows.
  • Coordinated between the analog design and packaging teams to deliver releases and features.
  • Developed risk monitoring in Python to identify potentially missed processes that could lead to system failure.
  • Keywords: Python, Workflow Automation, QA

Education

Master of Science - Electrical and Computer Engineering

Boise State University
Boise, ID
05.2010

Master of Science - Automotive Engineering

Tsinghua University
Beijing, China
01.2007

Bachelor of Science - Automation

Harbin Institute of Technology
Harbin, China
07.2004

Skills

  • Python/Go/Java/Rust/JavaScript
  • Docker/Kubernetes
  • Terraform/Ansible/Jenkins
  • AWS/Azure/GCP
  • Prometheus/Grafana/Thanos/InfluxDB
  • Kafka/SQS/Kinesis
