Instill AI changelog

Introducing LLaVA 🌋: Your Multimodal Assistant


📣 The latest LLaVA model, LLaVA-v1.6-7B, is now accessible on Instill Cloud!

What's LLaVA?

LLaVA, short for Large Language and Vision Assistant, is an open-source multimodal model fine-tuned on multimodal instruction-following data. Despite being trained on a relatively small dataset, LLaVA shows remarkable proficiency in understanding images and answering questions about them, with capabilities comparable to multimodal models such as GPT-4 with Vision (GPT-4V) from OpenAI.

What's New in LLaVA 1.6?

According to the original blog post, LLaVA-v1.6 boasts several enhancements compared to LLaVA-v1.5:

  • Enhanced Visual Perception: LLaVA now supports images with up to 4x more pixels, allowing it to capture finer visual details. It accommodates three aspect ratios, with resolutions of up to 672x672, 336x1344, and 1344x336.

  • Improved Visual Reasoning and OCR: LLaVA's visual reasoning and Optical Character Recognition (OCR) capabilities have been significantly enhanced, thanks to an improved mixture of visual instruction tuning data.

  • Better Visual Conversations: LLaVA now handles a broader range of visual conversation scenarios and applications, and demonstrates improved world knowledge and logical reasoning.

  • Efficient Deployment: LLaVA supports efficient deployment and inference, using SGLang for fast, streamlined serving.

👉 Dive into our tutorial to learn how to leverage LLaVA's capabilities effectively.
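To give a concrete sense of how a hosted LLaVA model could be queried, here is a minimal Python sketch that sends an image and a question to a vision-language endpoint. The endpoint URL, payload fields, and header names below are illustrative assumptions, not the documented Instill Cloud API; refer to the tutorial above for the exact request format.

import base64
import requests

API_TOKEN = "YOUR_API_TOKEN"  # placeholder; obtain a real token from Instill Cloud
ENDPOINT = "https://example.instill.tech/llava-v1.6-7b/trigger"  # hypothetical URL for illustration only

# Encode a local image so it can travel inside a JSON payload.
with open("street_scene.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Assumed payload shape: one input carrying a text instruction and a base64-encoded image.
payload = {
    "inputs": [
        {
            "prompt": "What is unusual about this image?",
            "image": image_b64,
        }
    ]
}

response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json())  # the model's answer about the image

Only the endpoint and field names from the tutorial should need to change; the overall pattern of sending an image plus an instruction and receiving a text answer is exactly the interaction LLaVA's instruction tuning is designed for.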