During this phase I want to focus purely on MoEs. Start with a paper trail of the idea, then perform a simple model surgery to replace the FFN of the model from the previous phase with a router-experts mechanism. Once that is in place, we can study and experiment with the following:

  • MoE FFN block (top-1 routing + aux load-balancing loss): Replace FFN, track expert utilization histogram, router entropy
  • Routing stability study: Collapse detection, expert starvation, aux loss weight sensitivity
  • Top-1 vs top-2 routing ablation: Quality, compute, utilization, routing smoothness
  • Capacity factor + token dropping: Overflow rate, throughput, training stability, quality impact
  • Dense vs sparse final comparison: Matched active-compute budget, fair comparison
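
As a starting point for the tracking items above, here is a minimal sketch of the routing diagnostics, written in plain Python so it runs without a framework (in the real model these would be tensor ops on the router output). Everything in it is an assumption for illustration: the function name `routing_diagnostics`, the input layout (one list of logits per token), the Switch-Transformer-style form of the aux loss (N * sum_i f_i * P_i, to be multiplied by the aux loss weight), and rounding the expert capacity up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_diagnostics(router_logits, capacity_factor=1.25):
    """Top-1 routing statistics for one batch (hypothetical helper).

    router_logits: list of per-token lists, one logit per expert.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # Top-1 routing: each token goes to its highest-probability expert
    # (ties resolve to the lowest expert index).
    assignments = [max(range(num_experts), key=row.__getitem__) for row in probs]

    # Expert utilization histogram: token count per expert.
    counts = [0] * num_experts
    for expert in assignments:
        counts[expert] += 1

    # f_i: fraction of tokens dispatched to expert i.
    # P_i: mean router probability assigned to expert i.
    frac_tokens = [c / num_tokens for c in counts]
    mean_prob = [sum(row[i] for row in probs) / num_tokens
                 for i in range(num_experts)]

    # Assumed aux loss form: N * sum_i f_i * P_i (unweighted).
    # It is 1.0 when routing is perfectly balanced and grows as routing collapses.
    aux_loss = num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

    # Mean per-token router entropy in nats (max is log(num_experts)).
    entropy = -sum(p * math.log(p) for row in probs for p in row if p > 0) / num_tokens

    # Capacity and token dropping: tokens beyond an expert's capacity overflow.
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)
    dropped = sum(max(0, c - capacity) for c in counts)

    return {
        "counts": counts,
        "starved_experts": [i for i, c in enumerate(counts) if c == 0],
        "aux_loss": aux_loss,
        "router_entropy": entropy,
        "capacity": capacity,
        "overflow_rate": dropped / num_tokens,
    }


# Toy batches: 8 tokens, 4 experts.
balanced_logits = [[5.0 if i == t % 4 else 0.0 for i in range(4)] for t in range(8)]
collapsed_logits = [[5.0, 0.0, 0.0, 0.0] for _ in range(8)]

balanced = routing_diagnostics(balanced_logits)
collapsed = routing_diagnostics(collapsed_logits)
print(balanced["counts"], collapsed["counts"])
```

In the balanced toy batch every expert receives two tokens and nothing overflows; in the collapsed batch all tokens land on expert 0, three experts are starved, and the tokens beyond the capacity count as overflow. Logging these per step should be enough for the collapse detection and capacity-factor studies, though the thresholds for calling a run "collapsed" still need to be picked empirically.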
