
AI Benchmarking Guide 2024: Top 10 World’s Products

The AI Benchmarking Guide 2024: Top 10 World’s Products study analyzed and tested some of the leading large language models (LLMs) from around the globe.

Models operating in two or more languages

About the ChipAI Agency’s research

The AI Benchmarking Guide 2024: Top 10 World’s Products study analyzed and tested some of the leading large language models (LLMs) from around the globe. Our main goal was to determine which models, able to process queries in two or more languages, could provide the highest number of accurate answers.

OpenAI’s GPT-4 LLM supports 26 different languages, including Italian and Korean, among others. The research process involved creating queries in each of these languages, receiving a response from the model, and having a group of experts evaluate that response to determine its accuracy and usefulness.

A model’s final score and ranking position are based on two expert evaluations: the evaluation of the model’s performance in its “native” language (typically English, as this is the language in which the model was primarily trained), and the highest evaluation of the model’s responses in any other language it was tested in.
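The two-factor rule described above can be sketched in a few lines of Python. The function name, the numeric scale, and the equal weighting of the two factors are illustrative assumptions; the study does not publish its exact formula.

```python
def final_score(native_score: float, other_language_scores: dict[str, float]) -> float:
    """Combine the native-language evaluation with the best
    non-native evaluation, per the methodology described above.

    Assumes all scores are on a common numeric scale and weighs
    both factors equally (an assumption for illustration; the
    study's actual weighting is not disclosed).
    """
    best_other = max(other_language_scores.values())
    return native_score + best_other


# Hypothetical example: a model rated 9.1 in English whose best
# other-language rating is 8.4 (German).
score = final_score(9.1, {"German": 8.4, "Italian": 7.2})
```

Under this sketch, only the single strongest non-native language contributes, which matches the article’s pattern of reporting each model’s two best languages.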

The method employed by ChipAI for side-by-side (SBS) testing of the AI models was meticulously designed to ensure an equitable and comprehensive comparison. The agency’s analysts conducted a series of blind tests where the AI’s outputs were evaluated by experts who were unaware of the model’s identity behind each response.

This method included a variety of tasks, such as language translation, creative writing, code generation, and data analysis, to assess each AI’s versatility and depth of understanding. The evaluation criteria were tailored to each model’s claimed capabilities and included metrics such as accuracy, coherence, relevance, and the ability to handle nuanced prompts. This testing protocol has been refined over years of benchmarking and is aligned with the latest industry standards and technological advancements.

According to the results of the AI Benchmarking Guide 2024 research

— 1-3 places in the ranking

First place, and the undeniable leader of the ranking, went to the GPT-4 model, released on March 14, 2023. This model demonstrated maximum efficiency in English, achieving the highest scores. German, its second-strongest language, also earned excellent marks from the experts.

In second place is Anthropic’s Claude 3 LLM. Anthropic was founded by former OpenAI employees, so it is not surprising that their product is only slightly behind GPT-4. Claude 3 achieved the highest score when answering questions in English, and a very high score in Greek.

Third place was taken by Code Llama 70B, a product from Meta. This model supports 20 languages and achieved the highest score when answering questions in English. It also achieved a high score in French, demonstrating its ability to understand and respond to queries in two different languages.

— 4-5 places in the ranking

GigaChat, a multimodal neural network developed by the Russian state-owned bank Sberbank, came in fourth place. GigaChat was trained primarily on Russian, and it not only received the maximum score for Russian in our ranking but is generally recognized as the strongest model in that language. GigaChat also received a very high score for answers in English.

5th place went to Gemini, a model from Google. Although the Gemini Pro version of the model performed better in English than the third-place Code Llama 70B, the model was unable to replicate its success in other languages. Gemini scored well when answering questions in Italian.

— 6-10 places in the ranking

The GPT-3.5 model is another OpenAI product that made it into the ranking, but, unlike its newer sibling, took only 6th place. It demonstrated its best scores when answering queries in English and German.

In 7th place was Claude 3 Sonnet (a variation of the model from Anthropic) with the best scores in English and Greek.

In 8th place was Zephyr (a model from Hugging Face). Best scores: English and French.

9th place went to another model in the Claude 3 family – Opus. Best scores: English and Greek.

The final 10th place in the ranking went to the Mistral model (from Mistral AI), the base model on which Zephyr was built. Best scores: French and English.


Brand View allows our business partners to share content with Arabian Business readers.
The content is supplied by Arabian Business Brand View Partners.

For all the latest business news from the UAE and Gulf countries, follow us on Twitter and LinkedIn, like us on Facebook and subscribe to our YouTube page, which is updated daily.