Fahad Najam

AI Infrastructure Architect & Strategist
Austin, TX

Summary

Versatile and visionary technical leader specializing in high-performance computing (HPC), AI/ML acceleration, embedded systems, and distributed infrastructure across networking, cloud, and enterprise platforms. Proven expertise in architecting low-latency, high-throughput systems integrating hardware and software (C, C++, Python) across GPUs, vGPUs, CPUs, NICs, SmartNICs, and DPUs, with deployments spanning AWS, Azure, edge, and on-prem environments. Skilled in GPU/FPGA/NPU/DPU acceleration integration, parallel programming, HPC orchestration, and performance optimization for large-scale, mission-critical workloads.

Overview

17 years of professional experience

Work History

Principal System Engineer (Research & Innovation)

Self Employed
06.2025 - Current
  • Conducting independent R&D in HPC, AI/ML acceleration, and networking, focused on distributed compute platforms, congestion control, and system scalability.
  • Designed and evaluated simulation clusters for AI workloads (object detection and perception algorithms), optimizing performance across GPUs, CPUs, SmartNICs, and NVSwitch interconnects.
  • Researched GPU–memory bottlenecks, congestion control, and NVSwitch-based scaling to address challenges in large-scale AI training and inference.
  • Investigating multi-cloud orchestration strategies (AWS, Azure) and interoperability bottlenecks to benchmark portability, resilience, and cost-efficiency of HPC/AI pipelines.
  • Explored next-gen NIC and DPU architectures for high-throughput data movement, applying OSI stack protocols (TCP/IP, HTTPS, SSL/TLS, RDMA) to low-latency workloads.
  • Publishing technical insights and white papers from ongoing research, contributing thought leadership on congestion control, GPU/CPU/NIC scaling, and distributed AI architectures.

Principal Platform Architect & Engineer

Parry Labs
03.2025 - 06.2025
  • Led distributed edge computing for a proprietary defense platform (classified), integrating HPC/AI workloads at the tactical edge.
  • Architected distributed compute and AI inference pipelines for object detection and perception, aligned with mission and ecosystem constraints.
  • Conducted industry and ecosystem analysis to identify opportunity theses, enablers, risks, and competitive advantages in edge AI adoption.
  • Partnered with internal and external stakeholders to co-develop platform strategy, informing investment recommendations and roadmap planning.

Software Development Manager (Hybrid Robot Fleet Server Platform)

Amazon
06.2021 - 09.2024
  • Led architecture and development of low-latency, high-throughput HPC platforms for robotic fleet systems, integrating GPU, FPGA, NPU, and DPU acceleration for AI inference across AWS cloud, edge, and embedded domains.
  • Architected mission-critical edge ML pipelines (C++) leveraging PCIe, RDMA (RoCE), and Kubernetes-based HPC clusters, reducing latency by 30% and increasing accuracy by 20%.
  • Enhanced inter-GPU communication efficiency by 30% via NCCL and Open MPI tuning for parallel AI workloads in secure environments.
  • Boosted throughput by 25% by optimizing PCIe memory coherence pathways for -grade data movement.
  • Implemented zero-copy data transfers using RoCE, minimizing CPU overhead and enabling real-time decision support.

Principal System Software Engineer (Medical Infection Detection Platform)

Bio-Rad Laboratories
01.2018 - 01.2021
  • Spearheaded secure AI-driven diagnostic platforms using NVIDIA GPUs (TensorFlow, CUDA, TensorRT) for real-time pathogen detection.
  • Optimized low-power CPU, GPU, and FPGA workflows under IEC 61508 safety standards, reducing detection latency by 30%.
  • Enhanced diagnostic throughput by 25% using DSP, ARM SoCs, and AI accelerators for transformer-based models.
  • Reduced network latency by 25% with DPDK-based kernel bypass for time-sensitive diagnostics.
  • Deployed mission-aligned bioinformatics pipelines on AWS, Azure, and GCP, achieving -level scalability.

Staff Software Engineer (Network and PCIe Switches Platform)

Broadcom
01.2016 - 01.2018
  • Designed PCIe-based embedded platforms integrating ML co-processors, SmartNICs, and accelerators for high-performance, low-latency applications.
  • Developed RDMA over PCIe fabric for ultra-low latency and zero-copy data movement in tactical HPC workloads.
  • Collaborated with industry leaders (Netflix, Google, NVIDIA) to optimize GPU-accelerated NVMe-oF workloads.
  • Managed BMC firmware development ensuring secure remote management capabilities for mission-critical platforms.

Staff Software Engineer (Medical Sterilization Product)

Johnson & Johnson
02.2014 - 01.2016
  • Developed FDA-approved IoT-based sterilization devices with ARM, FPGA, DSP, Secure Boot, TEE, ARM TrustZone, and TPM-based security, ensuring compliance with IEC 62304 and ISO 26262.
  • Reduced contamination detection time by 25% and accelerated sterilization cycle verification by 20% through DSP/FPGA signal optimization under safety-critical constraints.
  • Led cross-functional teams of engineers, biologists, and regulatory specialists to deliver prototype validation, advanced sensor calibration, and FDA clearance ahead of schedule.
  • Directed full device lifecycle management from embedded firmware to cloud integration, ensuring interoperability with hospital IT systems and improving detection accuracy by 15%.

Staff/Senior Firmware Engineer (Hybrid SSD, HDD Products)

Western Digital
12.2008 - 02.2014
  • Designed and optimized firmware for Hybrid Disk, NAND Flash, and HDD platforms (C/C++, Perl, ARM, QNX, DSP) to achieve -grade performance in data center and aerospace applications.
  • Architected NAND Flash Manager with advanced Error Recovery, Data Relocation, and Wear Leveling, extending drive lifespan and reliability under mission-critical workloads.
  • Implemented Dynamic Power Management, TLER, storage service optimizations, and HDD robust algorithms for media-related failure reduction, boosting manufacturing yield and operational readiness.
  • Developed drivers connecting storage devices to in-house tools and tuned performance during testing with Iometer and perf tools; led scrum meetings and championed the use of in-circuit emulators, oscilloscopes, and logic analyzers to resolve complex issues and ensure seamless integration.

Education

BS - Electrical Engineering

California State Polytechnic University
09.2007

Certification - Artificial Intelligence

Stanford University
11.2024

Certification - Design Controls

AAMI Foundation
04.2018

Skills

  • HPC Architecture, Optimization & Research
  • AI/ML Acceleration
  • Low-Latency Networking (PCIe, RDMA, RoCE)
  • Networking (TCP/IP, UDP, HTTPS)
  • Energy-Aware & Microgrid-Aware Computing
  • Embedded Systems (QNX, Linux, Zephyr)
  • Secure Boot, TPM, TEE
  • GPU/FPGA/NPU/DPU Integration
  • Parallel Programming (MPI, OpenMP, CUDA)
  • HPC Cluster Orchestration
  • Mission-Critical & Compliance
  • Cloud/Edge/Embedded Interoperability
  • Cyber Resilience & Secure Frameworks

Personal Information

  • Citizenship: US Citizen
  • Title: Principal HPC Architect – High-Performance Networking, Cloud, and AI/ML Acceleration
  • Nationality: US Citizen
  • Availability: Onsite Available
  • Visa Status: US Citizen
