Combines visual inputs, such as images and videos, with a natural language question about the input and generates a natural ...