VILA: On Pre-training for Visual Language Models

Code License Model License Python 3.10+

VILA arxiv / VILA Demo / VILA Huggingface

💡 Introduction

VILA is a visual language model (VLM) pretrained with interleaved image-text data at scale, enabling video understanding and multi-image understanding capabilities. VILA is deployable on the edge by AWQ 4bit quantization and TinyChat framework. We find: (1) image-text pairs are not enough, interleaved image-text is essential; (2) unfreezing LLM during interleaved image-text pre-training enables in-context learning; (3)re-blending text-only instruction data is crucial to boost both VLM and text-only performance; (4) token compression extends #video frames. VILA unveils appealing capabilities, including: video reasoning, in-context learning, visual chain-of-thought, and better world knowledge.

See this repository for more details on VILA and its applications.