# Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog

> Research article (IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021) · cited 66× · AI/ML

**Wikidata**: [openalex:W3137384391](https://www.wikidata.org/wiki/openalex:W3137384391)  
**Source**: https://4ort.xyz/entity/bridging-text-and-video-a-universal-multimodal-transformer-for-audio-visual-scene-aware-dialog
