Model learns to pick out objects within an image, using spoken descriptions.
MIT computer scientists have developed a system that learns to identify objects within an image, based on a spoken description of the image. Given an image and an audio caption, the model highlights, in real time, the regions of the image being described.
Unlike current speech-recognition technologies, the model doesn’t require manual transcriptions and annotations of the examples it’s trained on. Instead, it learns words directly from recorded speech clips and objects in raw images, and associates them with one another.
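To make the idea of associating speech with image regions concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the researchers' implementation: it assumes the image has already been split into a grid of regions, that each region and each speech segment has been turned into a small vector, and that a simple dot product stands in for whatever similarity measure the real model learns. The function name `highlight_regions`, the threshold, and the toy vectors are all hypothetical.

```python
# Hypothetical sketch: score each image region against one speech segment
# and "highlight" the regions whose score clears a threshold.
# The embeddings, threshold, and dot-product similarity are assumptions
# made for illustration; the article does not specify them.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def highlight_regions(region_embeddings, audio_embedding, threshold=0.5):
    """Return (row, col) grid cells whose similarity to the speech
    segment is strictly greater than the threshold."""
    hits = []
    for r, row in enumerate(region_embeddings):
        for c, region_vec in enumerate(row):
            if dot(region_vec, audio_embedding) > threshold:
                hits.append((r, c))
    return hits


# Toy 2x2 grid of region vectors (made-up numbers).
grid = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.2]],
]

# Toy vector for one spoken word (made-up numbers).
word = [1.0, 0.0]

highlighted = highlight_regions(grid, word)
print(highlighted)
```

In this toy setup, only the cells whose vectors point in roughly the same direction as the spoken word's vector are returned. A trained model would learn both sets of vectors from paired images and audio captions rather than having them supplied by hand.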