Efficient In-Context Learning in Vision-Language Models for Egocentric Videos

University of Michigan
[Teaser image for the paper]

\(\mathbb{EILEV}\) elicits in-context learning capabilities in VLMs for egocentric videos by enabling the model to handle data that interleaves videos and text, and by training it on data with clusters of similar verbs and nouns, a long-tail distribution of infrequent items, and both homonyms and synonyms. The resulting model not only generates more accurate action narrations than other VLMs with more parameters and training data, but also generalizes to novel, rare actions via in-context learning.
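To make "data that interleaves videos and text" concrete, the sketch below shows one way an interleaved in-context prompt for action narration could be assembled at inference time: a few (clip, narration) examples followed by a query clip. This is a hypothetical illustration, not the released \(\mathbb{EILEV}\) API; the class and function names, file paths, and prompt wording are all assumptions for demonstration purposes.

# Hypothetical sketch of building an interleaved video-text prompt for
# in-context action narration. None of these names come from the EILEV codebase.

from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class VideoClip:
    """Placeholder for an egocentric video clip (e.g., a file path or tensor)."""
    path: str


# A prompt is an ordered sequence of video clips interleaved with text,
# mirroring the interleaved video-text data the model is trained to handle.
Segment = Union[VideoClip, str]


def build_interleaved_prompt(
    context: List[Tuple[VideoClip, str]], query: VideoClip
) -> List[Segment]:
    """Interleave (clip, narration) in-context examples, then append the
    query clip whose narration the model should generate."""
    segments: List[Segment] = []
    for clip, narration in context:
        segments.append(clip)
        segments.append(f"Question: What is the camera wearer doing? Answer: {narration}")
    segments.append(query)
    segments.append("Question: What is the camera wearer doing? Answer:")
    return segments


if __name__ == "__main__":
    # Two in-context examples of a rare action, followed by a query clip.
    context = [
        (VideoClip("clip_0001.mp4"), "The camera wearer kneads the dough."),
        (VideoClip("clip_0002.mp4"), "The camera wearer flattens the dough."),
    ]
    for segment in build_interleaved_prompt(context, VideoClip("clip_0003.mp4")):
        print(segment)

In this sketch, the in-context examples demonstrate a rare action so that the model can narrate the query clip by analogy, which is the generalization behavior described above.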

BibTeX

@article{yu2023efficient,
  title={Efficient In-Context Learning in Vision-Language Models for Egocentric Videos},
  author={Yu, Keunwoo Peter and Zhang, Zheyuan and Hu, Fengyuan and Chai, Joyce},
  journal={arXiv preprint arXiv:2311.17041},
  year={2023}
}

Acknowledgement

We thank Shane Storks for his valuable insight and comments. This work has been supported by the Defense Advanced Research Projects Agency (DARPA) under the PTG Program, Contract No. HR00112220003. The views expressed are those of the authors and do not necessarily reflect the views of DARPA.