New Research Reveals How to Build Better AI Teams

A new paper finds that the common advice to mix AI models from different "families" for diversity, which comes from offline studies, may not hold up in real-time interactive settings, the very environments where multi-LLM systems are deployed. The finding could change how multi-AI systems are designed.

The authors of a new arXiv preprint studied how multiple language models interact when working together in real time, for example when debating or evaluating each other's answers. Prior offline research suggested that using models from different families (e.g., one from the GPT family and one from the Llama family) ensures behavioral diversity. The paper tested that assumption in interactive multi-LLM systems, the setting in which such systems are actually deployed, and found that the family label alone does not reliably predict behavioral differences.

The key finding is that a model's "post-training recipe"—the fine-tuning and alignment process it went through—shapes its conversational behavior more than which model family it belongs to. This means that simply picking one model per family may not create the diversity needed for productive multi-agent collaboration. Instead, developers should consider the specific training methods applied to each model.

This matters because teams of AI models are becoming more common in everyday tools, from customer service chatbots to complex problem-solving systems. When the models in a team behave too similarly, they tend to agree too much, which limits creativity and problem-solving. By understanding what really drives behavioral diversity, namely the post-training recipe, engineers can build more innovative and reliable multi-LLM systems.
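
For readers who want to check whether a candidate team of models agrees too much, one simple proxy is the pairwise agreement rate: the fraction of prompts on which two models give the same answer. The sketch below is only an illustration under stated assumptions. It is not the metric used in the paper, the function names are hypothetical, and the answers are invented placeholders rather than real model outputs.

```python
from itertools import combinations


def agreement_rate(answers_a, answers_b):
    """Fraction of prompts on which two models gave the same answer."""
    if len(answers_a) != len(answers_b):
        raise ValueError("answer lists must cover the same prompts")
    if not answers_a:
        raise ValueError("need at least one prompt")
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    return matches / len(answers_a)


def pairwise_agreement(answers_by_model):
    """Map each unordered pair of model names to its agreement rate."""
    return {
        (m1, m2): agreement_rate(answers_by_model[m1], answers_by_model[m2])
        for m1, m2 in combinations(sorted(answers_by_model), 2)
    }


# Placeholder answers to four multiple-choice prompts (invented for illustration).
answers = {
    "model_a": ["A", "B", "C", "D"],
    "model_b": ["A", "B", "C", "A"],
    "model_c": ["B", "B", "D", "A"],
}

rates = pairwise_agreement(answers)
for pair, rate in rates.items():
    print(pair, rate)
```

In practice the answers would come from running each model on the same prompts. Comparing the rates for pairs that share a family against pairs that share a post-training recipe would show which label better predicts similar behavior in a given setup.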