Apple has trained an LLM to efficiently understand long-form video


Apple researchers have developed an adapted version of the SlowFast-LLaVA model that beats larger models at long-form video analysis and understanding. Here’s what that means.

The nerdy bits

Very basically, when an LLM is trained to also understand video, it learns to split videos into frames, apply computer vision to extract visual features, analyze how those features change over time, and align all of that with language so it can describe or reason about the video in text form.
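To make that concrete, here is a minimal sketch of such a pipeline in Python. The encoder and projector below are random-number stand-ins for real components, and none of the names reflect Apple’s actual implementation:

```python
import numpy as np

# Hypothetical stand-ins for the parts of a generic video LLM pipeline.

def sample_frames(video: np.ndarray, num_frames: int) -> np.ndarray:
    """Pick uniformly spaced frames from a (T, H, W, 3) video array."""
    indices = np.linspace(0, len(video) - 1, num_frames).astype(int)
    return video[indices]

def encode_frames(frames: np.ndarray, tokens_per_frame: int = 196, dim: int = 768) -> np.ndarray:
    """Stand-in for a vision encoder: each frame becomes a grid of feature vectors."""
    return np.random.randn(len(frames), tokens_per_frame, dim)

def project_to_llm_space(features: np.ndarray, llm_dim: int = 4096) -> np.ndarray:
    """Stand-in for the projector that maps visual features into the LLM's token space."""
    projection = np.random.randn(features.shape[-1], llm_dim)
    return features @ projection

# A 10-second clip at 30 fps, reduced to 16 sampled frames before encoding.
video = np.zeros((300, 224, 224, 3))
frames = sample_frames(video, num_frames=16)
visual_tokens = project_to_llm_space(encode_frames(frames))
print(visual_tokens.shape)  # (16, 196, 4096): 16 frames x 196 visual tokens each
```

The key point is simply that every sampled frame turns into a chunk of “visual tokens” that the language model then has to fit alongside the text prompt.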

A very inefficient way of doing this is to analyze every single frame of a video, which creates an overwhelming amount of duplicated information, since most frames rarely contain significant changes from one to the next.

With that overwhelming amount of duplicated information at hand, it is very easy to blow past the LLM’s context window, which is the maximum amount of information it can keep in mind at once. Once an LLM exceeds its context window, it stops taking older tokens into account to make room for new ones as it predicts each new token, so the conversation can keep going.
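For a rough sense of scale, here is a back-of-the-envelope calculation; the tokens-per-frame and context-window figures are assumptions for illustration, not numbers from Apple’s paper:

```python
# Illustrative token budget for naive per-frame processing (all figures assumed).
fps = 30
duration_s = 10 * 60           # a 10-minute video
tokens_per_frame = 196         # e.g. a 14x14 patch grid per frame
context_window = 32_768        # a typical LLM context size

total_frames = fps * duration_s                  # 18,000 frames
total_tokens = total_frames * tokens_per_frame   # 3,528,000 visual tokens
print(total_tokens / context_window)             # ~108x over the context window
```

Even a modest ten-minute clip, processed frame by frame, would need on the order of a hundred times more room than a typical context window offers.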

Of course, there are more efficient ways to train video LLMs (Nvidia recently published an interesting paper on this), but that is the general idea to keep in mind for Apple’s study.

Apple’s study

As Apple researchers explain in the paper SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding:

“Video large language models (LLMs) integrate video perception into pre-trained LLMs to process videos and generate responses to user commands. Although significant progress has been made, notable limitations remain in existing video LLMs.”

The limitations, according to them, are threefold:

  • Existing models tend to rely heavily on long context windows and large numbers of frames, which is inefficient and not easily transferable to smaller models;
  • Most of them require complex multi-stage training pipelines (often using private datasets) that are difficult to reproduce;
  • Many are optimized only for video tasks, which limits their usefulness as general-purpose models that can also handle images.

To address these limitations, Apple first looked at SlowFast-LLaVA, an open-source model that had already shown promising results by combining spatial and temporal cues through a two-stream setup: a slow stream that looks at fewer frames in higher detail to capture what is in the scene, and a fast stream that looks at more frames in lower detail to track how things change over time.
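Here is a rough sketch of that two-stream idea; the frame and token counts are placeholders chosen just to show the trade-off, and the crude slicing stands in for whatever pooling the real model actually uses:

```python
import numpy as np

def two_stream_tokens(frame_features: np.ndarray,
                      slow_frames: int = 32, slow_tokens: int = 196,
                      fast_frames: int = 96, fast_tokens: int = 16) -> np.ndarray:
    """Illustrative SlowFast-style split over per-frame features of shape (T, grid, dim).

    Slow stream: few frames, many tokens each (spatial detail).
    Fast stream: many frames, few tokens each (temporal coverage).
    """
    T = frame_features.shape[0]

    slow_idx = np.linspace(0, T - 1, slow_frames).astype(int)
    slow = frame_features[slow_idx][:, :slow_tokens]      # keep most spatial detail

    fast_idx = np.linspace(0, T - 1, fast_frames).astype(int)
    fast = frame_features[fast_idx][:, :fast_tokens]      # crude stand-in for spatial pooling

    # Flatten both streams into one visual token sequence for the LLM.
    return np.concatenate([slow.reshape(-1, slow.shape[-1]),
                           fast.reshape(-1, fast.shape[-1])])

features = np.random.randn(1_000, 196, 768)  # 1,000 encoded frames (placeholder)
tokens = two_stream_tokens(features)
print(tokens.shape)  # (7808, 768): 32*196 + 96*16 tokens instead of 1,000*196
```

The design choice is the trade-off itself: spend the token budget on detail for a few frames and on coverage for many, rather than on full detail for every frame.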

First, Apple fine-tuned SlowFast-LLaVA on images, to build up general visual reasoning capabilities. Then, it was jointly trained on images and videos (from public datasets) to learn temporal structure without sacrificing image understanding.

The result was SlowFast-LLaVA-1.5 (or SF-LLaVA-1.5), a family of models at the 1B, 3B, and 7B parameter scales, which manages to outperform much larger models on a range of video tasks, sometimes “by significant margins,” as the researchers themselves noted.

In fact, on long-form video benchmarks like LongVideoBench and MLVU, Apple’s model sets new state-of-the-art results across all model sizes, including its smallest 1B version.

What’s more, the model also overcomes one of the three shortcomings the researchers noted, performing well on image tasks too, including benchmarks for knowledge, math reasoning, OCR, and text-rich scenarios.

The team even tested several video compression strategies, but found that their setup struck the best balance between speed, accuracy, and token count.

There are limits, though

With SF-LLaVA-1.5, Apple’s researchers decided that the model would have a maximum input frame length of 128.

This means that, whether it is analyzing a clip that lasts a few minutes or a video that runs for hours, it always maxes out at 128 frames, with 96 uniformly spaced frames selected for the fast stream and 32 uniformly spaced frames selected for the slow stream.
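A quick sketch of what uniform sampling under a fixed frame budget looks like, and why it can skip over brief events in long videos; the frame rate and clip lengths below are made up for illustration:

```python
import numpy as np

def uniform_indices(total_frames: int, budget: int) -> np.ndarray:
    """Uniformly spaced frame indices under a fixed frame budget."""
    return np.linspace(0, total_frames - 1, budget).astype(int)

fps = 30
for minutes in (2, 120):                      # a 2-minute clip vs. a 2-hour video
    total = minutes * 60 * fps
    fast = uniform_indices(total, 96)         # fast stream: 96 frames
    slow = uniform_indices(total, 32)         # slow stream: 32 frames
    gap_s = (fast[1] - fast[0]) / fps         # seconds between sampled fast-stream frames
    print(f"{minutes} min -> one fast-stream frame every ~{gap_s:.1f} s")
# For the 2-hour video, roughly 75 s pass between sampled frames,
# so anything that happens briefly in between is simply never seen.
```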

With that in mind, the researchers say:

“This approach may miss some key frames in long videos and mislead the model about the playback speed of a video. (…) The performance of SF-LLaVA-1.5 could be further improved by tuning all parameters, including the visual encoder. However, we found that this is not trivial for long video LLMs due to the high GPU memory cost of caching activation values. Future studies could explore the integration of memory-saving techniques, such as Stochastic BP.”

That said, Apple’s approach resulted in a state-of-the-art model, with the added benefit of being trained exclusively on public datasets. SF-LLaVA-1.5 is now an open-source model available on GitHub and Hugging Face, and you can find the full study on arXiv.

Here are some examples of the model in action:
