During this phase I want to focus purely on MoEs. Start off with a paper trail of the idea, then perform a simple model surgery to replace the FFN of the model from the previous phase with a router-experts mechanism. Once we have that, we can start to study and experiment with the following:
- MoE FFN block: Top-1 routing + aux load-balancing loss; replace FFN, track expert utilization histogram, router entropy
- Routing stability study: Collapse detection, expert starvation, aux loss weight sensitivity
- Top-1 vs top-2 routing ablation: Quality, compute, utilization, routing smoothness
- Capacity factor + token dropping: Overflow rate, throughput, training stability, quality impact
- Dense vs sparse final comparison: Matched active-compute budget, fair comparison
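
The diagnostics named in the list above (expert utilization histogram, router entropy, aux load-balancing loss, overflow rate under a capacity factor) can be sketched in a few lines. This is a minimal pure-Python sketch, not the actual training code: the function names (`softmax`, `routing_stats`), the default `capacity_factor=1.25`, and the use of plain lists instead of tensors are all assumptions made for illustration. The aux term follows the Switch-Transformer-style form `N * sum_i f_i * P_i`, without the loss weight that the sensitivity study would sweep.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_stats(router_logits, capacity_factor=1.25):
    """Top-1 routing diagnostics for a batch of tokens.

    router_logits: list of per-token lists, one logit per expert.
    capacity_factor: hypothetical default; each expert accepts at most
    ceil(capacity_factor * num_tokens / num_experts) tokens.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])

    probs = [softmax(row) for row in router_logits]
    # Top-1 routing: each token goes to its highest-probability expert
    # (ties resolve to the lowest expert index).
    choices = [max(range(num_experts), key=p.__getitem__) for p in probs]
    counts = Counter(choices)

    # f_i: fraction of tokens dispatched to expert i (utilization histogram).
    utilization = [counts.get(e, 0) / num_tokens for e in range(num_experts)]
    # P_i: mean router probability assigned to expert i.
    mean_prob = [sum(p[e] for p in probs) / num_tokens for e in range(num_experts)]

    # Unweighted aux load-balancing term: N * sum_i f_i * P_i.
    aux_loss = num_experts * sum(f * p for f, p in zip(utilization, mean_prob))

    # Mean per-token router entropy in nats; low values mean confident routing.
    entropy = -sum(q * math.log(q) for p in probs for q in p if q > 0) / num_tokens

    # Tokens beyond an expert's capacity are counted as dropped.
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)
    dropped = sum(max(0, counts.get(e, 0) - capacity) for e in range(num_experts))

    return {
        "utilization": utilization,
        "aux_loss": aux_loss,
        "router_entropy": entropy,
        "capacity": capacity,
        "overflow_rate": dropped / num_tokens,
    }


# Toy batches: 4 tokens, 2 experts.
balanced = routing_stats([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0]])
collapsed = routing_stats([[2.0, 0.0]] * 4)
```

In the balanced batch each expert receives half the tokens and the aux term sits at its balanced value of 1.0; in the collapsed batch every token picks expert 0, the aux term rises, and one expert starves. Logging these per layer and per step is one simple way to catch collapse early in the routing stability study.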