Core Concepts

Multimodal AI

Also known as: multi-modal

In one line

A model that understands more than one type of input — e.g. text, images, and audio together.

What does Multimodal AI mean?

Multimodal models can, for example, look at a photo of a receipt and answer questions about it, or listen to a meeting and produce structured minutes.

A real-world example

Uploading a screenshot to ChatGPT and asking "why is my code broken?"

Related terms

Large Language Model (LLM)

A neural network trained on huge text collections to predict the next word — the engine behind ChatGPT, Claude and Gemini.