An Analogy to Help Understand Mixture of Experts

If you're having a hard time understanding how MoE models stack up against dense models, and roughly where each might land in a comparison, think about this super oversimplified analogy. I'm hoping it makes sense:

The Scenario

Imagine a paid trivia competition, but all the questions are about carpentry regulations: you're given a piece of paper, you fill out the paper and then hand it in.

There are two "teams" competing with each other, except one team just has a single dude on it. Both teams need a place to sit in the building while the competition is going on.

Team 1 (10b Dense Model)

Team 1 is just one fairly seasoned carpenter with 10 years of experience. He gets the paper, works through every question himself, and turns it in.

He really likes his personal space, so he reserved 10 seats all to himself. (Bear with me...)

Total experience on the team: 10 years
Experience applied to each question: 10 years
Total Seats Needed: 10 seats

Team 2 (40b a10b MoE Model)

Team 2 is a large crew of 40 first-year apprentices. None of them know the full trade; each one has only learned a few specific things about carpentry during their year.

Each question has multiple parts to it, and for each part, the 10 apprentices with the most relevant knowledge for that specific part are picked. Once a part is answered, those ten return to the group, and the process repeats for the next part. By the time a single question is fully answered, dozens of different apprentices may have contributed.

When answering, each set of ten apprentices that gets called up isn't huddling up and collaborating; they each independently write their own answer to the question part on a small piece of paper, and then all of those answers get blended together to create one combined response. The final answer written on the trivia paper for that part of the question will be a mix of what they all came up with.

Once all of the questions have been answered in this fashion, they turn it in.
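For anyone who wants to see the "pick ten, then blend" step outside the analogy, here is a minimal sketch of top-k routing for a single question part (a token). Everything in it is an assumption for illustration: the names `route_token`, `router_scores`, and `expert_outputs` are hypothetical, the scores are random stand-ins for a learned router, and treating each apprentice as one expert is the analogy's simplification rather than how any particular model is laid out.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, expert_outputs, k):
    """Pick the k highest-scoring experts for one token and blend their outputs."""
    # "Hand-pick the best ten": indices of the k largest router scores
    top = sorted(range(len(router_scores)),
                 key=lambda i: router_scores[i], reverse=True)[:k]
    # Turn the chosen experts' scores into blend weights that sum to 1
    weights = softmax([router_scores[i] for i in top])
    # Each expert answered independently; the result is a weighted mix
    dim = len(expert_outputs[0])
    blended = [0.0] * dim
    for w, i in zip(weights, top):
        for d in range(dim):
            blended[d] += w * expert_outputs[i][d]
    return top, weights, blended

# Hypothetical setup mirroring Team 2: 40 apprentices, 10 called up per part
random.seed(0)
NUM_EXPERTS, TOP_K, DIM = 40, 10, 4
router_scores = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
expert_outputs = [[random.gauss(0, 1) for _ in range(DIM)]
                  for _ in range(NUM_EXPERTS)]

chosen, weights, blended = route_token(router_scores, expert_outputs, TOP_K)
print("apprentices called up:", sorted(chosen))
print("blended answer:", [round(x, 3) for x in blended])
```

Running `route_token` again with different scores picks a different ten, which is the "dozens of different apprentices may have contributed" part of the story.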

Total experience on the team: 40 years
Experience applied to each question: 10 years (10 apprentices x 1 year each)
Total Seats Needed: 40 seats

Comparing the Teams

Now, technically you could say that each team is applying the same number of years of experience to each question, even though the way the teams are structured is totally different. For each question, they are bringing an aggregate total of 10 years of experience.

But beyond that: Team 2's combined knowledge and experience of 40 years is much larger.

Team 2's setup is so powerful because even though their team is full of apprentices who each only know a slice of the trade, they are hand-picking the best ten people for each question part. Depending on what all the different apprentices studied, you could end up with Team 2's total knowledge including information Team 1's carpenter doesn't know; and they may reason through things that the carpenter struggled with alone.

The downside to Team 2's setup is that they need 40 seats, while Team 1 only needs 10 seats. Team 2 takes up a LOT more space than Team 1.

Socg's note: The seats are memory. In case you missed that lol. I couldn't figure out a better way to shoehorn that into the analogy.

Team 3 (40b Dense Model)

Now, imagine if there was a third team with a master carpenter that had 40 years of experience; the same number of years of experience as all of Team 2 combined. And he absolutely loves his space, so he also got 40 seats. But it's one really, really experienced and smart carpenter doing all the work.

Even though Team 2 has a combined total of 40 years of experience, and the master carpenter has 40 years, and even though both teams required 40 seats: the quality difference is going to be significant. The master carpenter will likely have 'seen it all' and experienced it, too, while the apprentices are only ever applying 10 aggregate years of experience at a time.

This means that not only is that master carpenter likely going to make better use of their overall knowledge, but they will understand the questions much better and be able to really comprehend what is being asked at a level the apprentices likely won't.

Total experience on the team: 40 years
Experience applied to each question: 40 years
Total Seats Needed: 40 seats
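To put rough numbers on the seats for all three teams, here is a back-of-the-envelope sketch. It assumes 16-bit weights (2 bytes per parameter), which is my assumption and not something the analogy specifies, and it only counts the weights themselves; real memory use also includes context/cache and runtime overhead. The helper name `weight_memory_gb` is hypothetical.

```python
BYTES_PER_PARAM = 2  # assumption: 16-bit weights; quantized models use fewer bytes

# Parameter counts in billions, taken from the three teams in the analogy
teams = {
    "Team 1 (10b dense)":    {"total_b": 10, "active_b": 10},
    "Team 2 (40b a10b MoE)": {"total_b": 40, "active_b": 10},
    "Team 3 (40b dense)":    {"total_b": 40, "active_b": 40},
}

def weight_memory_gb(total_billion_params, bytes_per_param=BYTES_PER_PARAM):
    # (billions of params * 1e9) * bytes per param / 1e9 bytes per GB
    return total_billion_params * bytes_per_param

for name, t in teams.items():
    seats = weight_memory_gb(t["total_b"])
    print(f"{name}: ~{seats} GB of weights, {t['active_b']}b params used per token")
```

The point of the comparison: Team 2 pays Team 3's seat bill while doing Team 1's amount of work per question part.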

The Takeaway

When comparing models, it's pretty safe to say:

  • All things being equal, an MoE will likely outperform a dense model whose size matches the MoE's active parameter count. So a 30b a3b MoE (30b total, but only 3b active) will likely beat out a 3b dense model.
  • All things being equal, an MoE will likely have worse overall comprehension than a dense model whose size matches the MoE's total parameter count. Even if their knowledge might be similar, the dense model will simply "get" things better than the MoE. For example, a 120b a5b MoE will likely misunderstand statements far more often than a 120b dense model, which will "read between the lines" on what you want far better and pick up on implied meaning better.

Anyhow, that's majorly over-simplified, but hopefully it helps paint a better picture.