The Future of Multimodal AI: From GPT-4o to Meta’s ImageBind

hans haar
3 min read · Jun 12, 2024

I was surprised by the announcement and demo of GPT-4o, especially its audio-processing ability. Sure, you can chain Whisper for speech-to-text with a separate text-to-speech model, but that adds a big increase in latency, which translates into slow responses and an awful user experience.
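To see where that latency comes from, here is a minimal sketch of the cascaded pipeline GPT-4o sidesteps, using the OpenAI Python SDK. The model names are real OpenAI models, but the file names are placeholders and error handling is omitted:

```python
# Sketch of the cascaded speech pipeline that native audio models replace.
# Each stage is a separate network round trip, which is where latency stacks up.
from openai import OpenAI

client = OpenAI()

# 1. Speech -> text with Whisper (first round trip).
with open("question.wav", "rb") as audio_file:  # placeholder input file
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Text -> text with a chat model (second round trip).
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Text -> speech with a TTS model (third round trip).
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
speech.write_to_file("answer.mp3")  # placeholder output file
```

Three separate round trips before the user hears a single word; collapsing them into one natively multimodal model call is why the GPT-4o demo felt so responsive.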

Nowadays we are used to fast results. If I went back in time five or ten years, I would appreciate the old internet, but I’m pretty sure I would have a hard time waiting for things to load. That’s why, from a user’s perspective, the new GPT-4o is a game changer. I can picture myself talking all day with this new model, finally improving my German. Danke, OpenAI, ich werde Deutsch sprechen (“Thanks, OpenAI, I will speak German”).

This is me, you know, learning some “German”

Now GPT-4o can process images, audio, text, and video. It’s clear that the trend for the next iterations of this model is directly linked to multimodality.

That’s why I’d like to draw attention to a not-so-new Meta model: ImageBind.

According to Meta AI:

“Introducing ImageBind, the first AI model capable of binding data from six modalities at once.”

This new open-source model (that’s right, open source) brings more accessibility and humanity to technology. Right now, most of its outputs have a quality comparable to what GPT-3 had back in the day. However, it didn’t take long for GPT models to reach impressive capabilities and extended context windows, and I expect a similar trajectory here.
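To make “binding six modalities” concrete, here is a minimal sketch adapted from the usage example in the facebookresearch/ImageBind repository. The file paths and text labels are placeholders, and the import paths may differ slightly depending on the repo version:

```python
# Sketch: embed text, images, and audio into ImageBind's shared space,
# adapted from the example in the facebookresearch/ImageBind README.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind model.
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Placeholder inputs for three of the six supported modalities.
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["a dog", "a car"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(["dog.jpg"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["bark.wav"], device),
}

with torch.no_grad():
    embeddings = model(inputs)

# Because everything lands in one embedding space, cross-modal similarity
# is just a matrix product: e.g., which text best matches the audio clip.
scores = torch.softmax(
    embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1
)
print(scores)
```

Because text, images, and audio all land in the same embedding space, cross-modal retrieval reduces to a similarity lookup, and that property is what makes the accessibility applications below plausible.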

Envision the new UX/UI patterns that can emerge from these capabilities, along with the accessibility applications. For instance, streamlined interfaces could greatly enhance user experiences, making technology more intuitive and user-friendly. When I worked in tech support for a big corporation, I sometimes had to deal with complex, hard-to-use software for people with disabilities, such as vision impairments. Drawing from that experience, I can easily imagine tools that help them interact with the world, installed on their phone via a simple voice command.

I advocate that we need to heal our relationship with technology and make it more accessible and more human. Now that LLMs can process language more effectively, a huge window of opportunity for innovative applications will open.

Sam Altman, CEO of OpenAI, said in a podcast that people complain about how ChatGPT’s new capabilities are killing their startups. He argued that this happens because people build without considering that ChatGPT will keep improving. If you are building a side project or an app, or just have an idea in mind, staying future-proof means paying attention to industry trends and the direction of the next iterations of these models.
