Kyutai Labs on Wednesday launched Moshi AI, an artificial intelligence (AI) chatbot that responds verbally in real time. The French AI firm says Moshi’s entire audio language model was developed in-house. The chatbot can also modulate its voice to express emotions and respond in various speaking styles. The AI model can be accessed by the public for free, though conversations are currently restricted to five minutes. Interestingly, OpenAI announced similar speech features with the release of GPT-4o, but that capability has yet to be rolled out.
Moshi AI features
The company states that the AI model was developed in six months by a team of eight people. While unveiling it at an event in Paris, Kyutai Labs said that Moshi is not an AI assistant but a prototype that can be used to develop tools for different use cases. The firm has also made the chatbot publicly available here. Users can enter their email and join the queue, but Gadgets 360 staff members were able to get immediate access to the platform without any wait time.
Yesterday we introduced Moshi, the lowest latency conversational AI ever released. Moshi can perform small talk, explain various concepts, engage in roleplay in many emotions and speaking styles. Talk to Moshi here https://t.co/a4EbAQiih7 and learn more about the method below 🧵. pic.twitter.com/NkJRybTRLQ
— kyutai (@kyutai_labs) July 4, 2024
The platform's interface is quite minimalistic. A simple visual indicator lets users check the loudness of their voice as they speak, and a text box displays only the AI's responses. Another box near the top shows technical details such as audio duration, latency, and missed audio.
At the very top, there is a button to disconnect the call. Currently, the maximum call duration is capped at five minutes. The description page highlights that Moshi can think, speak, and listen at the same time to maximise the flow of conversation.
In its testing, Gadgets 360 found the latency to be extremely low, with the AI often responding instantly. However, there were a few instances where the lag in response time exceeded 10-15 seconds, which could be due to heavy server load. At times, verbal prompts were not registered at all, even when three-fourths of the volume meter was filled.
Gadgets 360 also found that the AI model can respond in an emotive voice and speak in different styles with various voice modulations. It is also connected to the Internet and can answer queries that require looking up the web. Notably, the chatbot does not accept text prompts; voice is the only way to interact with it.
Kyutai Labs has stated that the AI model will be open-sourced. However, the firm has yet to host the model weights and code on a portal. Once available, users will be able to download the model, install it locally, and run it on a device without an internet connection.