The LLaVA-NeXT model was introduced by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee as part of research aimed at improving reasoning, OCR (Optical Character Recognition), and world knowledge. Building on the original LLaVA model, LLaVA-NeXT (also known as LLaVA-1.6) substantially improves OCR and commonsense reasoning by increasing the input image resolution and training on an improved visual instruction tuning dataset.
The core innovation of the model is its ability to handle higher-resolution images, which gives it stronger performance on visual understanding tasks. LLaVA-NeXT's goal extends beyond text processing: it also incorporates more real-world knowledge, broadening its applicability to complex tasks.
Note that the model card for this model was not written by the team that released LLaVA-NeXT; it was created by the Hugging Face team to provide users with detailed information and usage guidelines.