Generate text and speech from image or video inputs
Generate detailed descriptions from images and videos