LLM Inference Optimization using Speculative Decoding
Graduate AI Research Showcase | CMU
Optimized inference pipelines for LLaMA2 and GPT-2 using speculative decoding, pruning, and quantization in PyTorch to reduce latency and memory usage. Built modular benchmarking workflows to evaluate accuracy–performance tradeoffs across CPU and GPU environments, targeting real-time and edge deployment scenarios.
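Below is a minimal sketch of the speculative decoding loop at the core of the pipeline. It is illustrative only: the GPT-2 checkpoints ("gpt2" as the draft model, "gpt2-large" as the target), the draft length k, and the greedy accept-on-agreement rule are assumptions standing in for the actual model pair and the probabilistic acceptance rule; KV caching and batching are omitted for brevity.

```python
# Hypothetical sketch of speculative decoding: a small draft model proposes k
# tokens, the larger target model verifies them in one forward pass, and the
# longest agreeing prefix is accepted. Model names and greedy acceptance are
# assumptions for illustration, not the exact production configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
draft = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
target = AutoModelForCausalLM.from_pretrained("gpt2-large").to(device).eval()
tok = AutoTokenizer.from_pretrained("gpt2")


@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 64, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids.to(device)
    start = ids.shape[1]
    while ids.shape[1] - start < max_new_tokens:
        prompt_len = ids.shape[1]

        # 1) Draft model proposes k tokens autoregressively (cheap, no KV cache here).
        draft_ids = ids
        for _ in range(k):
            logits = draft(draft_ids).logits[:, -1, :]
            next_tok = logits.argmax(dim=-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
        proposed = draft_ids[:, prompt_len:]

        # 2) Target model scores the context plus all k proposals in a single pass.
        tgt_logits = target(draft_ids).logits
        tgt_pred = tgt_logits[:, prompt_len - 1 : -1, :].argmax(dim=-1)

        # 3) Accept the longest prefix where draft and target agree, then append
        #    one token from the target so progress is guaranteed every iteration.
        matches = (proposed == tgt_pred).squeeze(0).long()
        n_accept = int(matches.cumprod(dim=0).sum())
        accepted = proposed[:, :n_accept]
        correction = tgt_logits[:, prompt_len - 1 + n_accept, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, accepted, correction], dim=-1)
    return tok.decode(ids[0, start:], skip_special_tokens=True)


print(speculative_generate("Speculative decoding reduces latency by"))
```

Because the target model verifies k drafted tokens in one forward pass, each loop iteration emits between 1 and k+1 tokens while preserving the target model's greedy output, which is where the latency reduction comes from.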