Chapter 31 sections from Deep Learning with PyTorch.
A vision-language model learns a joint representation of images and text.
Audio-visual learning studies models that jointly process sound and visual information. The goal is to learn representations that combine what is seen with what is heard.
A unified foundation model is a neural network trained across many modalities, tasks, and domains using a shared architecture and shared representations.
A retrieval system finds relevant information from an external memory source.
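As a rough illustration of the idea, the sketch below ranks a small in-memory list of documents by cosine similarity to a query. It is a minimal sketch, not the chapter's implementation: the names `embed`, `cosine`, `retrieve`, and `memory` are hypothetical, and the bag-of-words `embed` function stands in for the learned encoder a real system would use.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): lowercase token counts, not a learned encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, memory, k=2):
    """Return the k memory entries most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(doc)), doc) for doc in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


# Hypothetical external memory: a list of short text entries.
memory = [
    "convolutional networks process images",
    "transformers use attention over tokens",
    "spectrograms represent audio as images",
]

top = retrieve("attention in transformers", memory, k=1)
print(top)
```

The memory here is a plain Python list scanned in full on every query. Larger collections typically replace the scan with an index, but the interface (query in, ranked entries out) stays the same.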
A long-horizon agent is a model-driven system that pursues goals over many steps. It observes the environment, chooses actions, records intermediate state, uses tools, and adjusts its plan as new information arrives.
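The loop described above can be sketched in a few lines. This is a toy under stated assumptions, not a definitive agent design: the environment is a single integer, the "tools" are two hypothetical functions (`add_five`, `add_one`), and the greedy policy inside the hypothetical `run_agent` function stands in for a learned model that would choose actions.

```python
def run_agent(goal, state, tools, max_steps=20):
    """Pursue `goal` step by step, re-planning from each new observation."""
    history = []  # intermediate state: (observation, chosen tool) pairs
    for _ in range(max_steps):
        if state == goal:
            break
        # Choose an action: pick the tool whose result lands closest to the goal.
        # This greedy rule is an assumption standing in for a model's decision.
        name = min(tools, key=lambda n: abs(goal - tools[n](state)))
        history.append((state, name))
        # Use the tool and observe the new state.
        state = tools[name](state)
    return state, history


# Hypothetical tools acting on an integer environment.
tools = {
    "add_five": lambda x: x + 5,
    "add_one": lambda x: x + 1,
}

final_state, history = run_agent(goal=12, state=0, tools=tools)
print(final_state, [action for _, action in history])
```

Because the action is recomputed from the current state on every step, the agent switches from `add_five` to `add_one` once a large step would overshoot, which is a small instance of adjusting the plan as new information arrives. The `max_steps` bound keeps the loop from running forever when the goal is unreachable.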