Hao Zhang

Seattle, WA

Overview

7 years of professional experience

Work History

Senior Software Engineer

Snap Inc.
Seattle, WA
03.2021 - Current

Tech lead of the Snap Observability team, a lean team responsible for real-time metrics, monitoring and dashboards across all of Snap.

Leading the development of, and migration to, the next-generation real-time metrics framework for Snap

  • Set the strategy to build a modernized, label-based managed Prometheus stack that unifies the fragmented observability solutions across the company.
  • Drove performance and stress testing, led contract negotiations with multiple cloud vendors and set the company-wide metrics migration plan.
  • Designed and implemented the Prometheus collection and ingestion layers that serve thousands of Kubernetes clusters and hundreds of millions of Snap users.
  • Driving the query migration of hundreds of thousands of monitoring rules, tens of thousands of Grafana dashboards and hundreds of microservices.

Developing and operating one of the world's largest OSS metrics systems with high scalability, stability, cost-efficiency and performance

  • Designed and implemented the metrics aggregation system, which processes metrics with a cardinality in the hundreds of millions at 99.95% availability.
  • Designed and implemented a cost-efficient spam-metrics detection system to prevent abuse of the metrics system.
  • Designed, implemented and rolled out a CPU- and memory-efficient metrics collection sidecar running on hundreds of thousands of pods across thousands of Kubernetes clusters.
  • Handled operations and feature development to accommodate rapidly evolving requirements from product teams.

Software Engineer

Amazon Web Services Inc.
Seattle, WA
02.2017 - 02.2021

Founding member of the Timestream storage team, building a state-of-the-art distributed time series database.

Deterministic Failure Handling: Built failure-hint strategies to handle failures in Timestream's distributed microservices deterministically.

  • Designed and implemented strategies to handle insert 5xx failure hints, auto scaling hints and lease change hints to support 99.99% insert availability.

Ingestion Auto Scaling: Implemented strategies to automatically scale up tiles (Timestream's internal storage units for customers' data) to manage GB/s insert traffic at hundreds of milliseconds of latency per customer table.

  • Designed an Aurora MySQL table schema to store insert traffic statistics in real time with tens of milliseconds of latency, and built distributed tile auto scaling strategies on top of those statistics.
  • Designed and implemented data movement strategies that move over 3 GB of customer data to another Aurora cluster within two minutes, which allows auto scaling throughout the whole ingestion period.

Kinesis GetRecords Enhanced Fan-out: Led, designed and implemented 20-consumer Enhanced Fan-out, raising Kinesis shard read capacity from 10 MB/s to 40 MB/s with 99.9% availability.

  • Rebuilt the Kinesis storage load management service with a producer-consumer model to make it react 10 times faster.
  • Designed and implemented a long-running performance testing service to simulate heavy customer traffic.

Education

Master of Science - Computer Science

Georgia Institute of Technology
12.2016

Bachelor of Science - Computer Science

Wuhan University
07.2015

Skills

  • Technologies: Kubernetes, Envoy, Spinnaker, AWS and GCP services, Chronosphere
  • Programming Languages: Java, Golang, Ruby, Python, JS, Bash/Zsh, SQL
  • Knowledge and Interests: Large Scale Distributed Systems, Streaming Systems, Observability Systems

Timeline

Senior Software Engineer

Snap Inc.
03.2021 - Current

Software Engineer

Amazon Web Services Inc.
02.2017 - 02.2021

Master of Science - Computer Science

Georgia Institute of Technology

Bachelor of Science - Computer Science

Wuhan University