Multimodal models and world models are emerging as promising frameworks for extending language-based AI beyond text, towards ...