./shivansh

$ whoami

ML and HPC engineer specializing in computer vision, C++, and GPU performance. I like working across the whole ML stack, from the highest level to the lowest.

// PROJECT PORTFOLIO

FUSED ATTN KERNEL

CUDA implementation of tiled multi-head self-attention using online softmax to avoid materializing the full N×N score matrix. Shared memory blocking reduces HBM round trips; benchmarked against PyTorch unfused attention on sequences up to 2k tokens.

> VIEW SOURCE ↗
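The online-softmax trick behind the fused attention kernel above can be sketched in plain Python for a single query row. This is a reference illustration only, not the CUDA kernel: the function names, the `tile` parameter, and the omission of the 1/sqrt(d) scaling are assumptions made for brevity.

```python
import math

def online_softmax_attention(q, keys, values, tile=2):
    """Attention output for one query, visiting keys/values one tile at a time.

    Only a running max, a running normalizer, and a running weighted sum are
    kept, so the full row of scores is never stored.
    """
    m = -math.inf                     # running max of the scores seen so far
    l = 0.0                           # running sum of exp(score - m)
    acc = [0.0] * len(values[0])      # running sum of exp(score - m) * value
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)   # rescale old state to the new max
        l *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * x for a, x in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]

def naive_attention(q, keys, values):
    """Reference version that materializes every score before normalizing."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    mx = max(scores)
    w = [math.exp(s - mx) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]
```

Both functions should agree to within floating-point error for any tile size, which makes the naive version a convenient correctness check for a tiled kernel.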

inference_engine.py

LLM INFERENCE ENGINE

Continuous-batching inference runtime with an iteration-level scheduler that separates prefill and decode phases to maximize GPU utilization. Backed by a Rust paged KV cache allocator with O(1) block allocation and deallocation. Exposes a gRPC serving API. The stack spans Python orchestration, CUDA compute kernels, a C++ runtime, and Rust memory management, built layer by layer from scratch.

PYTHON · CUDA · C++ · RUST · gRPC

> VIEW SOURCE ↗
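One way to picture the iteration-level scheduling described for the inference engine above is the toy step function below. It is a sketch under stated assumptions, not the engine's actual policy: `schedule_step`, the `(request_id, prompt_len)` tuples, the `max_batch_tokens` budget, and the decode-first FIFO admission rule are all hypothetical.

```python
from collections import deque

def schedule_step(waiting, running, max_batch_tokens):
    """Choose the work for one model iteration.

    waiting: deque of (request_id, prompt_len) not yet prefilled (FIFO).
    running: list of request ids currently decoding.
    Returns (prefill_ids, decode_ids) as two separate phases.
    """
    decode = list(running)                  # each running request needs 1 token
    budget = max_batch_tokens - len(decode)
    prefill = []
    # Admit waiting prompts in arrival order while they fit in the token budget.
    while waiting and waiting[0][1] <= budget:
        req_id, prompt_len = waiting.popleft()
        budget -= prompt_len
        prefill.append(req_id)
    running.extend(prefill)                 # admitted requests decode next step
    return prefill, decode
```

Because the decision is made every iteration rather than per batch, a finished request frees its slot immediately and a waiting prompt can join on the next step.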

TRITON FUSED OPS

RMSNorm, SwiGLU, and cross-entropy appear sequentially in every Llama-style forward pass. Running them as separate PyTorch ops writes intermediate results to HBM between each step. This project fuses all three into a single Triton kernel to eliminate those round-trips. Benchmarked memory bandwidth utilization with ncu before and after, and compared throughput against the unfused baseline across a range of batch sizes and sequence lengths.

TRITON · CUDA · PYTORCH

> VIEW SOURCE ↗
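For reference, the three ops named in the Triton project above reduce to the per-row math below. This is a plain-Python sketch meant as a correctness oracle, not the Triton kernel itself; the function names and the `eps` default are assumptions.

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def swiglu(gate, up):
    """SiLU(gate) * up, elementwise. SiLU(g) = g * sigmoid(g)."""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]

def cross_entropy(logits, target):
    """Negative log-likelihood of `target`, using a stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return lse - logits[target]
```

Each function reads its input once and writes one output, which is the memory traffic a fused kernel tries to keep on-chip instead of round-tripping through HBM.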

PagedAttention-inspired memory allocator written in Rust. Achieves O(1) block allocation and deallocation with LRU eviction for cache pressure management in long-context inference. Exposed to Python via PyO3 bindings; serves as the memory backend for the inference runtime.

> VIEW SOURCE ↗
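The O(1) allocation and LRU eviction described for the allocator above can be sketched with a free list plus an ordered map. This is an illustrative Python model, not the Rust implementation; the `BlockAllocator` class and its method names are hypothetical.

```python
from collections import OrderedDict

class BlockAllocator:
    """Fixed pool of KV-cache blocks with O(1) allocate, release, and touch."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # stack of free block ids
        self.in_use = OrderedDict()           # block id -> owner, least recent first

    def allocate(self, owner):
        if self.free:
            block = self.free.pop()           # O(1) pop from the free stack
        else:
            # Pool exhausted: evict the least recently used block.
            block, _evicted_owner = self.in_use.popitem(last=False)
        self.in_use[block] = owner            # newest entry goes to the end
        return block

    def touch(self, block):
        """Mark a block as recently used so it is evicted last."""
        self.in_use.move_to_end(block)

    def release(self, block):
        del self.in_use[block]
        self.free.append(block)
```

A real allocator would also need to notify the evicted owner and handle reference counts for shared prefixes; both are left out here.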

simd_kernels.cpp

SIMD TRANSFORMER KERNELS

Hand-vectorized CPU kernels for softmax, RoPE positional encoding, and RMSNorm using AVX2 and AVX-512 intrinsics. Built as the CPU-side compute layer of a broader inference stack. Benchmarked latency and throughput against a scalar baseline and Intel oneDNN; exposed to Python via pybind11.

C++ · AVX-512 · pybind11

> VIEW SOURCE ↗
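A scalar baseline like the one the SIMD kernels above are benchmarked against might look as follows for RoPE. This is a sketch with assumptions: it uses the interleaved-pair layout (elements 2i and 2i+1 rotated together) and a base of 10000, and the project may use the half-split layout instead.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by a position-dependent angle.

    Pair i uses angle pos * base ** (-2i / d), where d = len(x) is even.
    """
    d = len(x)
    out = [0.0] * d
    for i in range(0, d, 2):                 # i is already 2 * pair_index
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

Each pair is an independent 2D rotation, which is why the loop maps naturally onto vector lanes: the sin and cos values can be loaded as vectors and applied to many pairs per instruction.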

MOE EXPERT ROUTER

Token routing module for Mixture-of-Experts models with both top-k and expert-choice routing strategies. Multi-GPU expert dispatch over NCCL. Async collective pipelining overlaps communication with compute to reduce per-token latency at inference time.

PYTORCH · CUDA · NCCL

> VIEW SOURCE ↗
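The top-k strategy mentioned for the MoE router above can be sketched for a single token as below. The function name is hypothetical, and normalizing with a softmax over only the selected logits is an assumption; some routers apply the softmax over all experts before selecting.

```python
import math

def top_k_route(logits, k):
    """Return [(expert_index, weight)] for one token's router logits."""
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Softmax restricted to the selected experts, so the weights sum to 1.
    mx = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - mx) for e in top]
    z = sum(exps)
    return [(e, w / z) for e, w in zip(top, exps)]
```

Expert-choice routing inverts this selection: each expert picks its top tokens rather than each token picking its top experts, which balances load by construction.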

precision_profiler.py

MIXED PRECISION PROFILER

Profiled a 3-layer GPT training loop across fp32, fp16 AMP, and bf16 AMP at system level with Nsight Systems (CPU/GPU timeline, kernel launch gaps, host-device sync) and kernel level with Nsight Compute (Tensor Core utilization, memory throughput). Output: a side-by-side comparison of step time, GPU memory usage, and loss stability across all three precision modes.

PYTORCH · CUDA · nsys/ncu

> VIEW SOURCE ↗

pytorch_job.yaml

K8S PYTORCH DDP

Distributed PyTorch DDP training job deployed on a local Kubernetes cluster via minikube and the Kubeflow Training Operator (PyTorchJob CRD). Built the Docker image, wrote the 2-worker job manifest, and traced the full lifecycle: MASTER_ADDR/PORT rendezvous, restart policy semantics, and single-worker failure behavior. A ground-up look at how ML infra teams run training jobs on-prem and in the cloud.

KUBERNETES · DOCKER · PYTORCH

> VIEW SOURCE ↗

Open source contributor to Scalene (PLASMA Lab, UMass). Adding GPU kernel profiling via CUPTI Callback and Activity APIs with per-line attribution, roofline model bottleneck diagnosis, and multi-GPU load imbalance detection.

PYTHON · C++ · CUPTI · CUDA

> VIEW SOURCE ↗
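The roofline diagnosis mentioned in the Scalene work above comes down to comparing a kernel's arithmetic intensity with the hardware's ridge point. The sketch below shows that arithmetic; the function name and the peak figures in the usage note are hypothetical, not taken from Scalene or any specific GPU.

```python
def roofline(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Classify a kernel as memory-bound or compute-bound.

    flops and bytes_moved describe the kernel; peak_flops (FLOP/s) and
    peak_bandwidth (bytes/s) describe the device.
    """
    intensity = flops / bytes_moved              # FLOP per byte
    ridge = peak_flops / peak_bandwidth          # intensity where the roofs meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    bound = "memory" if intensity < ridge else "compute"
    return intensity, attainable, bound
```

For example, on a made-up device with 10 TFLOP/s peak compute and 1 TB/s bandwidth, a kernel doing 1 FLOP per byte sits well left of the ridge point of 10 FLOP/byte, so it is classified as memory-bound.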

Platform for deploying networks of collaborating AI agents from a single prompt. Go microservices backend with gRPC for inter-service communication and Kafka for event-driven agent task dispatch across distributed execution environments. Kubernetes deployment with horizontal pod autoscaling, namespace-based multi-tenant isolation, and zero-downtime rolling updates. Tiered data layer — Redis for sub-5ms session caching, PostgreSQL for durable agent memory. Recruited GAP's CTO as advisor and acquired the first enterprise customer running production pilots.

GO · gRPC · KAFKA · KUBERNETES · REDIS

> VIEW SOURCE ↗

Marketplace connecting skilled contractors with local businesses for technical work. Contractor reputations are backed by cryptographic work attestations — verified, portable records that eliminate the cold-start problem on every new platform.

NEXT.JS · POSTGRES · TYPESCRIPT

> VIEW SOURCE ↗

Legal help is expensive and inaccessible — Clause lets Massachusetts tenants identify security deposit violations and generate demand letters in minutes. RAG pipeline on Snowflake Cortex grounded against MA housing statutes, Gemini 2.0 Flash Thinking for legal reasoning, ElevenLabs voice for multilingual and low-literacy users, Solana for tamper-proof violation records. Won Best Use of AI, HackUMass 2025.

FASTAPI · REACT · RAG · GEMINI · SOLANA

> VIEW SOURCE ↗