Hey there 👋 I'm Abhishek

Obsessed with making models go brrr, from training to real-time inference at scale


⚡ About Me

  • 🔥 I live and breathe AI inference: optimizing models to run faster, cheaper, and at massive scale
  • 🧠 Deep in the NVIDIA inference stack: TensorRT, Triton Inference Server, CUDA, TensorRT-LLM, and NIM
  • 🚀 Passionate about squeezing every last TFLOP out of GPUs, from A100s to H100s to Blackwell
  • 🏗️ Building and scaling inference pipelines that serve millions of requests with minimal latency
  • 🌐 Background in cloud-native architecture across AWS, Azure, and GCP; now laser-focused on GPU-accelerated inference infrastructure
  • 🤝 Open to collaborating on open-source inference tooling, model optimization, and high-performance serving systems

GitHub Streak

Top Langs


πŸ› οΈ Inference & AI Stack:

TensorRT · Triton · CUDA · TensorRT-LLM · NIM · Python · C++ · Go · Rust · PyTorch · vLLM · Docker · Kubernetes

☁️ Cloud & Infra:

AWS · Azure · GCP · Docker · Kubernetes · Git


  • 💬 Ask me about GPU-accelerated inference, model optimization, batching strategies, and scaling LLM serving (a toy batching sketch follows this list)
  • 👯 Looking to collaborate on inference engines, model compilers, and open-source AI infrastructure
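
To make "batching strategies" concrete, here is a minimal sketch of the dynamic-batching idea rather than any particular framework's implementation: requests queue up and get flushed either when the batch fills or when a small wait budget expires. The model call, `MAX_BATCH_SIZE`, and `MAX_WAIT_MS` are illustrative placeholders.

```python
import asyncio
import time

MAX_BATCH_SIZE = 8   # illustrative knob: largest batch handed to the model at once
MAX_WAIT_MS = 5      # illustrative knob: longest a request waits to be grouped

async def fake_model(prompts):
    """Stand-in for a GPU forward pass; swap in a real inference call."""
    await asyncio.sleep(0.01)
    return [f"output for {p}" for p in prompts]

async def batcher(queue: asyncio.Queue):
    """Collect requests until the batch is full or the wait budget expires."""
    while True:
        batch = [await queue.get()]
        deadline = time.monotonic() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        outputs = await fake_model([prompt for prompt, _ in batch])
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)

async def submit(queue, prompt):
    """Client side: enqueue a prompt and wait for its result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future

async def main():
    queue = asyncio.Queue()
    asyncio.create_task(batcher(queue))
    results = await asyncio.gather(*(submit(queue, f"req-{i}") for i in range(20)))
    print(results[:3])

asyncio.run(main())
```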
⚡ What I'm focused on in 2025–2026
- Optimizing LLM inference: KV-cache management, speculative decoding, continuous batching
- TensorRT-LLM and TensorRT for maximum throughput on NVIDIA GPUs
- Triton Inference Server: model ensembles, dynamic batching, multi-GPU serving
- NVIDIA NIM microservices for production-grade AI deployment
- CUDA kernel optimization and custom inference operators
- vLLM, SGLang, and other open-source LLM serving frameworks (a minimal vLLM sketch follows this list)
- Multi-node inference on H100 / Blackwell clusters with NVLink & NVSwitch
- Quantization (FP8, INT4, AWQ, GPTQ) for efficient model deployment
- Go, Rust, and C++ for high-performance inference infrastructure
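
As a concrete example of the vLLM side, a minimal offline-batching run: the engine handles continuous batching and KV-cache management internally. The model name and the AWQ quantization flag below are assumptions; point it at whatever checkpoint you actually have.

```python
from vllm import LLM, SamplingParams

prompts = [
    "Explain KV-cache reuse in one sentence.",
    "Why does continuous batching improve GPU utilization?",
]
sampling = SamplingParams(temperature=0.7, max_tokens=64)

# Assumes an AWQ-quantized checkpoint; drop quantization= for a full-precision model.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

# vLLM schedules these prompts with continuous batching under the hood.
for request_output in llm.generate(prompts, sampling):
    print(request_output.outputs[0].text)
```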
🧠 Technologies I know
- Inference: TensorRT, TensorRT-LLM, Triton Inference Server, NVIDIA NIM, vLLM, ONNX Runtime (a Triton client sketch follows this list)
- GPU/Compute: CUDA, cuDNN, NCCL, NVLink, Multi-Instance GPU (MIG)
- ML Frameworks: PyTorch, JAX, ONNX
- Cloud: AWS (SageMaker, EKS, EC2 P/G instances), Azure (AKS, NC/ND VMs), GCP (GKE, A3/A2 VMs)
- Containers & Orchestration: Docker, Kubernetes, Helm, NVIDIA GPU Operator
- Languages: Python, C++, Go, Rust, C#, Java
- IaC: Terraform, Pulumi, AWS CloudFormation, Azure ARM
- Monitoring: Prometheus, Grafana, Splunk, Elastic Stack
- Streaming: Apache Kafka, Apache Flink, Spark Streaming
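
And on the serving side, a rough Triton HTTP client call (requires `tritonclient[http]`). The model name and the input/output tensor names below are assumptions; they depend entirely on the deployed model's configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Assumes a Triton server on localhost:8000 serving a model named "resnet50"
# with one FP32 input "INPUT__0" and one output "OUTPUT__0" (all model-specific).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a single-image batch of random data shaped like an ImageNet tensor.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("INPUT__0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

result = client.infer(model_name="resnet50", inputs=[infer_input])
logits = result.as_numpy("OUTPUT__0")
print(logits.shape)
```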
📚 Previously
- Cloud-native architecture and distributed systems across AWS & Azure
- Serverless and modular monolithic architectures
- Full-stack development with C#/.NET, Java/Spring Boot, React
- GoLang microservices (GoORM, Fiber, Chi, Mux)
- Distributed Application Runtime (DAPR)
- Cross-platform development with Xamarin/MAUI

Pinned

  1. ML-for-Dot-Net-developers (C#): code used in the blog posts on ML.NET

  2. vllm-project/vllm (Python): a high-throughput and memory-efficient inference and serving engine for LLMs

  3. ai-dynamo/dynamo (Rust): a datacenter-scale distributed inference serving framework