In Phase 2, I want to focus on the inference and serving side of things, mostly KV caching and specialised attention mechanisms tailored for inference. Some core experiments I wish to run here:

  1. KV-Caching

    • cache growth ablation across MHA/GQA/MQA, at different depths and context lengths.
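Before running the ablation, the expected cache growth is worth sketching on paper. A minimal back-of-envelope calculator, assuming Llama-style shapes (32 layers, 32 query heads, head_dim 128, fp16) purely for illustration:

```python
# KV cache stores K and V per layer: seq_len * n_kv_heads * head_dim elements each.
# MHA keeps a KV head per query head; GQA shares groups; MQA keeps a single KV head.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

for name, kv_heads in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(seq_len=8192, n_kv_heads=kv_heads) / 2**30
    print(f"{name:6s} @ 8k ctx: {gib:.3f} GiB")
```

With these assumed shapes, MHA at 8k context costs 4 GiB while MQA costs 1/32 of that; the ablation should recover exactly this linear-in-`n_kv_heads` scaling.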
  2. Build a Pre-fill vs Decode Benchmark

    • Compare across:
      • attention types
      • input prompt lengths
    • Measure tokens/sec for each phase
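The benchmark harness itself can stay model-agnostic: prefill is one batched forward over the whole prompt, decode is one forward per generated token against the cache. A skeleton, where `model_forward` and `model_step` are hypothetical callables to be swapped for the real model:

```python
import time

def bench_prefill(model_forward, prompt_len):
    # Prefill: a single forward pass over all prompt_len tokens at once.
    t0 = time.perf_counter()
    model_forward(prompt_len)
    return prompt_len / (time.perf_counter() - t0)   # tokens/sec

def bench_decode(model_step, n_new_tokens):
    # Decode: one forward per new token, reusing the KV cache each step.
    t0 = time.perf_counter()
    for _ in range(n_new_tokens):
        model_step()
    return n_new_tokens / (time.perf_counter() - t0)  # tokens/sec
```

Reporting both numbers separately matters because prefill is compute-bound and decode is typically memory-bandwidth-bound, so the two throughputs diverge sharply as prompt length grows.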
  3. Flash Attention

    • Compare training speeds (torch SDPA already ships a FlashAttention backend)
    • memory efficiency gains
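The memory-efficiency gain has a simple closed form worth pre-computing before profiling: standard attention materializes the full S x S score matrix per head, while FlashAttention streams it in tiles and keeps only O(tile) working memory plus O(S) softmax statistics. A rough estimate, with fp16 and an assumed 32-head / head_dim-128 / tile-128 config:

```python
def scores_bytes_standard(seq_len, n_heads=32, bytes_per_elem=2):
    # Full attention-score matrix, per head, held in memory at once.
    return n_heads * seq_len * seq_len * bytes_per_elem

def workspace_bytes_flash(seq_len, n_heads=32, head_dim=128, tile=128, bytes_per_elem=2):
    # Tiled K/V working set is independent of seq_len; the online softmax
    # keeps only per-row running max/sum statistics (O(S) elements).
    per_head = 2 * tile * head_dim + seq_len
    return n_heads * per_head * bytes_per_elem

for s in (2048, 8192, 32768):
    print(f"S={s}: standard {scores_bytes_standard(s)/2**30:.2f} GiB, "
          f"flash workspace {workspace_bytes_flash(s)/2**20:.2f} MiB")
```

This is an estimate of activation memory only (quadratic vs. roughly constant), not of total model memory; the profiler run should confirm the shape of the curve rather than the exact constants.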
  4. Paged Attention

    • Possibly use vLLM and do a model “serving” study.
    • memory efficiency during inference
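The core idea behind PagedAttention is worth having as a mental model before the vLLM study: the KV cache is split into fixed-size blocks, and each sequence holds a block table instead of one contiguous pre-reserved region, so memory is allocated on demand and returned block-by-block. A toy sketch (the 16-token block size mirrors vLLM's default; everything else here is illustrative, not vLLM's actual API):

```python
BLOCK_TOKENS = 16

class PagedKVAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.block_tables = {}          # seq_id -> list of physical block ids

    def append_token(self, seq_id, pos):
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_TOKENS == 0:     # current block full -> grab a fresh one
            table.append(self.free.pop())
        return table[-1], pos % BLOCK_TOKENS   # (physical block, slot in block)

    def free_seq(self, seq_id):
        # Blocks go straight back to the pool for other requests.
        self.free.extend(self.block_tables.pop(seq_id, []))

alloc = PagedKVAllocator(num_blocks=64)
for pos in range(40):                   # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("req0", pos)
print(len(alloc.block_tables["req0"]))  # 3
```

The metric to chase in the serving study is exactly this: waste is bounded by one partial block per sequence, versus reserving max-context-length memory up front.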
  5. Bonus:

    • MLA (DeepSeek Multi-head Latent Attention): interesting KV-compression angle
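The KV-compression angle is easy to quantify on paper: instead of caching per-head K and V, MLA caches one compressed latent per token (plus a small decoupled RoPE key) and up-projects at attention time. A quick ratio, using dimensions in the spirit of DeepSeek-V2's reported config; the exact numbers here are assumptions for illustration:

```python
n_heads, head_dim = 128, 128        # assumed MHA-equivalent config
d_latent, d_rope = 512, 64          # assumed MLA compressed-KV and RoPE-key dims

mha_per_token = 2 * n_heads * head_dim      # K + V across all heads
mla_per_token = d_latent + d_rope           # one shared latent + RoPE key

print(f"elements/token: MHA={mha_per_token}, MLA={mla_per_token}, "
      f"compression ~{mha_per_token / mla_per_token:.1f}x")
```

Even granting slack in the assumed dimensions, the per-token cache shrinks by well over an order of magnitude, which is why MLA is worth folding into the cache-growth ablation from experiment 1.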

Note: Add inference- and cache-related metrics before running these experiments.

