Meta unveils V-JEPA AI model that improves training by learning from video

Meta Platforms Inc.'s AI research division today released a new artificial intelligence model that takes a crucial step in AI training: it learns by interpreting video information in a way similar to how humans understand the world.

The model, named V-JEPA, for Video Joint Embedding Predictive Architecture, works differently from large language models in that it learns from video rather than words, and it is a non-generative model, meaning it doesn't attempt to reconstruct raw pixels. A generative model would try to predict every pixel in every frame, whereas V-JEPA learns in an abstract representation space, using concepts such as trees, people, animals and objects, along with their relationships to one another.

The project was spearheaded by Meta’s vice president and chief AI scientist Yann LeCun, who proposed the original JEPA model in 2022.

“V-JEPA is a step toward a more grounded understanding of the world so machines can achieve more generalized reasoning and planning,” LeCun said. “Our goal is to build advanced machine intelligence that can learn more like humans do, forming internal models of the world around them to learn, adapt, and forge plans efficiently in the service of completing complex tasks.”

Because V-JEPA doesn't need to ingest and analyze every pixel of every frame of a video, the model can improve training efficiency by a factor of 1.5 to six, the researchers said. It can also be trained entirely on unlabeled data: labels are only required to adapt the model to a particular task after pretraining. That means it can be pretrained on raw video before anyone curates or labels the objects and subjects in the data, so labeling isn't a bottleneck.
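The workflow the researchers describe, unlabeled pretraining followed by a small labeled step for a specific task, can be sketched as a frozen encoder plus a linear probe. Everything below is a toy stand-in, not Meta's actual model: the "encoder" is a fixed random projection, and the task and labels are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a pretrained, frozen self-supervised backbone (illustrative
# only): a fixed random projection followed by a nonlinearity.
W = rng.normal(size=(16, 8)) / 4.0
def encode(x):
    return np.tanh(x @ W)

# The unlabeled pretraining is assumed done; now a small labeled set
# suffices to fit a task head (a least-squares linear probe on frozen
# features), which is where labels first enter the process.
X = rng.normal(size=(100, 16))
y = (X[:, 0] > 0).astype(float)  # toy binary label for the downstream task

feats = encode(X)                 # frozen features; no backbone updates
w, *_ = np.linalg.lstsq(feats, y - 0.5, rcond=None)
acc = np.mean(((feats @ w) > 0) == (y > 0.5))
print(f"linear-probe accuracy on toy task: {acc:.2f}")
```

The point of the sketch is the division of labor: the expensive representation learning happens without labels, and only the cheap final probe needs annotated data.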

As part of training, large portions of a video are "masked out," or hidden from the model, which then has to predict what's happening in the hidden sections. This is similar to the way a human infant might learn when a person leaves their field of view to retrieve a ball and then returns. Predicting through such gaps allows the model to develop a logical understanding of how objects interact.
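The masking objective described above can be sketched in a few lines. This is a minimal illustration under made-up shapes, not Meta's architecture: the "encoder" is a fixed random projection, the "predictor" is a trivial placeholder, and the key point is only that the loss is computed in embedding space rather than over pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": 8 frames of 4x4 patches, each patch a 16-dim feature vector.
n_patches, d_in, d_emb = 8 * 4 * 4, 16, 32
patches = rng.normal(size=(n_patches, d_in))

# Stand-in "encoder": a fixed random linear projection into embedding space.
W_enc = rng.normal(size=(d_in, d_emb)) / np.sqrt(d_in)
def encode(x):
    return x @ W_enc

# Mask out a large contiguous block of patches, loosely mimicking the
# spatiotemporal regions hidden from the model during training.
mask = np.zeros(n_patches, dtype=bool)
mask[40:104] = True  # the hidden region the predictor must account for

context_emb = encode(patches[~mask])  # what the model is allowed to see
target_emb = encode(patches[mask])    # representations it must predict

# Stand-in "predictor": the mean context embedding broadcast to every
# masked position; a real predictor is a trained network.
pred = np.tile(context_emb.mean(axis=0), (mask.sum(), 1))

# The loss compares predicted and target *embeddings*, not raw pixels,
# which is what makes the objective non-generative.
loss = np.mean((pred - target_emb) ** 2)
print(f"masked patches: {mask.sum()}, embedding-space loss: {loss:.3f}")
```

Training would then adjust the encoder and predictor to drive this embedding-space loss down, so that the model's representation of the visible context is enough to account for what's hidden.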

The researchers said the model works best with "fine-grained object interactions and distinguishing detailed object-to-object interactions that happen over time." For example, it can tell the difference between someone putting down a pen, picking it up, or pretending to put it down without actually doing so. However, it's only effective at short time scales of around 10 seconds; extending the horizon over which the model can make predictions is the next step.

Right now, the "V" in V-JEPA stands only for "video," meaning the model can address only the visual content of videos; it's not capable of understanding the audio spoken in them. The researchers said they are considering adding that capability in the future to give the model additional context.

The Meta researchers said that the model's ability to observe and abstract visual activity would open up opportunities for future embodied AI, such as much smarter AI agents that behave even more like people.

As mixed and augmented reality becomes more mainstream and AI agents are capable of seeing what people are doing and appear as lifelike characters in people’s living rooms, being able to watch someone chop vegetables or interact with their smart TV could make them superior helpers. For example, if someone is attempting to cook their favorite meal and picks the wrong ingredient from the cupboard, V-JEPA could quickly tell and help find the right one.

To help researchers and developers build on the new model and start developing quickly, the Meta researchers said they're releasing it under a Creative Commons license so that it can be extended. It is available today in a public GitHub repository.

Image: Pixabay
