Generative AI

Selecting the Right LLM for Your Task: Quality Benchmarking of Two Popular LLMs - Mistral 7B and Llama 2 7B
Srivatsava Daruru
December 19, 2023

Looking for the perfect large language model (LLM) for your use case, but not sure where to start? You're not alone. With thousands of open-source LLMs available, it can be overwhelming to choose the right one for your needs. That's where we come in!

In our previous blog, we focused on the performance characteristics of LLMs, studying how latency, throughput, and concurrency vary as we increase the number of cores and tokens. But what truly sets an LLM apart is its quality. In this blog, we'll dive deeper into the world of LLMs and compare the quality of the Mistral 7B and Llama 2 7B models on popular benchmarks.

Our LLM AIQ engine is built with a powerful catalog of multiple large language models. Whether you're looking for a model that delivers high performance, exceptional quality, or the perfect balance between the two, we've got you covered. With access to a range of top-performing LLMs, we'll help you navigate this vast array of options and select the model that's perfect for your needs.

Experiment Setup

Different benchmarks test different abilities of a model. We used the EleutherAI evaluation harness to test the Mistral 7B Instruct and Llama 2 7B Chat models on various benchmarks, running the experiments on 1 GPU with Predera’s AIQ engine. These benchmarks test a language model’s reasoning, generation, question-answering, and instruction-following capabilities. The benchmarks we test can be broadly placed into three categories:

  1. Intrinsic knowledge: Benchmarks testing the model’s knowledge of various topics along with its problem-solving capabilities.
  2. Reasoning: Benchmarks testing the model’s reasoning capabilities.
  3. Question Answering: Benchmarks testing the model’s ability to answer a question based on a given context.

For each category, the model also needs to be good at language understanding and generation in addition to the core problem it is trying to solve.
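For reference, the sketch below shows roughly how such an evaluation can be launched through the lm-evaluation-harness Python API. The model identifier, task names, and function signature are assumptions that depend on the harness version installed, so treat this as an illustration rather than the exact commands we ran.

```python
# Minimal sketch: evaluating an instruction-tuned 7B model with the
# EleutherAI lm-evaluation-harness (pip install lm-eval). The model id,
# task names, and batch size are illustrative assumptions; check your
# installed harness version for the exact API and task registry.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                              # HuggingFace-backed causal LM
    model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.1,dtype=float16",
    tasks=["mmlu", "truthfulqa_mc2", "hellaswag", "arc_challenge"],
    num_fewshot=0,                                           # zero-shot; MMLU is often run 5-shot
    batch_size=8,
    device="cuda:0",                                         # single GPU, as in our setup
)

for task, metrics in results["results"].items():
    print(task, metrics)
```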

Intrinsic Knowledge Benchmarks

MMLU (Massive Multitask Language Understanding): A benchmark designed to measure the knowledge acquired by models during pretraining, evaluated in zero- and few-shot settings. The test covers ~16k questions across 57 tasks, including elementary mathematics, US history, computer science, law, astronomy, and more. It ranges in difficulty from elementary to advanced professional level, testing both world knowledge and problem-solving ability.
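To make the zero- and few-shot settings concrete, the snippet below sketches how a k-shot multiple-choice prompt is typically assembled: k solved examples are prepended before the question being scored. The questions shown are dummy placeholders rather than actual MMLU items, and the exact template varies by benchmark and harness version.

```python
# Illustrative sketch of few-shot prompt construction for a multiple-choice
# benchmark. The questions below are dummy placeholders, not MMLU items.
def format_example(question, choices, answer=None):
    letters = "ABCD"
    lines = [question] + [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

fewshot_examples = [  # k solved examples prepended to the prompt
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("Which planet is closest to the Sun?", ["Venus", "Earth", "Mercury", "Mars"], "C"),
]
test_question = ("What is 3 * 3?", ["6", "9", "12", "27"])

prompt = "The following are multiple choice questions (with answers).\n\n"
prompt += "\n\n".join(format_example(q, c, a) for q, c, a in fewshot_examples)
prompt += "\n\n" + format_example(*test_question)
print(prompt)  # the model is scored on which answer letter it ranks highest
```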

CMMLU (Massive Multitask Language Understanding in Chinese): Similar to MMLU, but it measures the model’s ability to answer questions on various subjects in Chinese, including China-specific subjects such as Chinese literature and language. It has around 11,528 questions across 67 subjects.

Arithmetic: A small battery of 10 tests that ask language models to solve simple arithmetic problems posed in natural language.

MathQA: A large-scale dataset of 37k English multiple-choice math word problems covering multiple math domain categories.

OpenBookQA: A question-answering dataset modeled after open-book exams for assessing human understanding of a subject. It has 5,957 multiple-choice elementary-level science questions (4,957 train, 500 dev, 500 test) that probe the understanding of a small “book” of 1,326 core science facts and the application of those facts to novel situations.

The AI2 Reasoning Challenge (ARC): The ARC dataset consists of 7,787 grade-school-level multiple-choice science exam questions drawn from a variety of sources.

TruthfulQA: Evaluates a model's ability to provide truthful and factual responses. The benchmark comprises 817 questions spanning 38 categories, including health, law, finance, and politics. To perform well, models must avoid generating false answers learned from imitating human texts.

Aligning AI With Shared Human Values (Hendrycks ethics): The ETHICS dataset is a benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. It measures a language model’s knowledge of basic concepts of morality. It has over 130k examples divided into train, dev, and test sets.

Result Analysis

We create a spider plot comparing the quality of the Llama 2 7B Chat and Mistral 7B Instruct models on the intrinsic knowledge benchmarks. From the figure it is evident that Mistral is generally better than Llama on most of these tasks. It is significantly better on TruthfulQA and Arithmetic, although these are much smaller datasets than MMLU, CMMLU, and MathQA. More comprehensive and higher-quality pretraining usually leads to better accuracy across these tasks, and Mistral appears to be the stronger model here.
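For readers who want to reproduce the comparison visually, a spider (radar) chart like ours can be drawn with matplotlib along the lines below. The score arrays are placeholders to be filled in from the harness output, not our measured numbers.

```python
# Sketch of a spider/radar plot comparing two models across benchmarks.
# The scores below are placeholders -- substitute the accuracy values
# reported by the evaluation harness for each model.
import numpy as np
import matplotlib.pyplot as plt

benchmarks = ["MMLU", "CMMLU", "Arithmetic", "MathQA", "OpenBookQA", "ARC", "TruthfulQA", "Ethics"]
mistral_scores = [0.0] * len(benchmarks)   # placeholder: fill in Mistral 7B Instruct results
llama_scores = [0.0] * len(benchmarks)     # placeholder: fill in Llama 2 7B Chat results

angles = np.linspace(0, 2 * np.pi, len(benchmarks), endpoint=False).tolist()
angles += angles[:1]                       # close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for label, scores in [("Mistral 7B Instruct", mistral_scores), ("Llama 2 7B Chat", llama_scores)]:
    values = scores + scores[:1]
    ax.plot(angles, values, label=label)
    ax.fill(angles, values, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(benchmarks)
ax.set_ylim(0, 1)
ax.legend(loc="upper right")
plt.savefig("intrinsic_knowledge_spider.png", bbox_inches="tight")
```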

Reasoning Benchmarks

HellaSwag (Can a Machine Really Finish Your Sentence?): Evaluates a model's common-sense reasoning through multiple-choice sentence-completion questions. These are trivial for humans, but models can struggle with them. It has around 70k multiple-choice questions.

BIG-Bench Hard (BBH): A suite of 23 challenging tasks from BIG-Bench (Beyond the Imitation Game Benchmark), selected because prior language model evaluations did not outperform the average human rater on them. It contains around 6k hard multiple-choice reasoning questions that test the model’s reasoning capabilities.

PIQA (Reasoning about Physical Commonsense in Natural Language): Designed to investigate the physical knowledge of existing models through questions about physical common sense, such as “To apply eyeshadow without a brush, should I use a cotton swab or a toothpick?”. It has around 16k train, ~2k dev, and ~3k test question-answer pairs.

ANLI (Adversarial NLI): ANLI is a dataset collected via an iterative, adversarial human-and-model-in-the-loop procedure. It consists of three rounds that progressively increase in difficulty and complexity, and each example includes annotator-provided explanations. The task contains 162,865 / 3,200 / 3,200 train/dev/test entailment questions that require reasoning to answer well.

WinoGrande: A collection of 44k pronoun-resolution problems testing common-sense reasoning, inspired by the original Winograd Schema Challenge (a set of 273 expert-crafted problems) but scaled up and adjusted to be robust against dataset-specific bias.

SWAG (Situations With Adversarial Generations): Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate what might come next ("then, she examined the engine"). SWAG is a large-scale dataset for this task of grounded commonsense inference, unifying natural language inference and physically grounded reasoning. The dataset consists of 113k multiple-choice questions about grounded situations. Each question is a video caption from LSMDC or ActivityNet Captions, with four answer choices about what might happen next in the scene. The correct answer is the (real) video caption for the next event in the video; the three incorrect answers are adversarially generated and human-verified to fool machines but not humans.
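As an aside, most of the multiple-choice benchmarks described here are scored by comparing the log-likelihood the model assigns to each candidate answer and picking the highest-scoring one. The sketch below illustrates the idea using the PIQA example question from above; the tokenization boundary handling is simplified, and the HuggingFace model identifier is an assumption.

```python
# Rough sketch of log-likelihood scoring for multiple-choice questions:
# score each candidate answer given the question and pick the argmax.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed id; swap in the model under test
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

def answer_logprob(context: str, answer: str) -> float:
    """Sum of token log-probs of `answer` given `context` (boundary handling simplified)."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + answer, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    return sum(
        logprobs[i, full_ids[0, i + 1]].item()
        for i in range(ctx_len - 1, full_ids.shape[1] - 1)
    )

# PIQA-style example question from the description above
question = "To apply eyeshadow without a brush, should I use a cotton swab or a toothpick?\nAnswer:"
choices = [" a cotton swab", " a toothpick"]
scores = [answer_logprob(question, c) for c in choices]
print(choices[scores.index(max(scores))])
```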

Result Analysis

We create the same kind of spider plot for the reasoning benchmarks from the EleutherAI harness results. On these benchmarks, Llama and Mistral are very close to each other, except for BIG-Bench Hard, where Llama is better. Both models struggle on this particular dataset.

Question Answering Benchmarks

RACE (ReAding Comprehension dataset from Examinations): RACE is a large-scale reading comprehension dataset with more than 28,000 passages and 98,000 multiple-choice questions from English exams targeting Chinese students in middle school and high school.

BoolQ: A task that evaluates an LLM’s ability to answer boolean (yes/no) questions. Each example consists of a question and a passage of text, and the LLM must determine whether the answer to the question can be inferred from the passage. It has around 16,000 question-passage-answer triples, gathered from anonymized Google yes/no queries.

DROP (Discrete Reasoning Over Paragraphs): A crowdsourced, adversarially created 96k-question benchmark in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of paragraph content than prior datasets demanded. The passages are extracted from Wikipedia articles. The dataset is split into a training set of about 77,000 questions, a development set of around 9,500 questions, and a hidden test set similar in size to the development set.

Result Analysis

On the question-answering datasets, the model is forced to answer from the given context rather than from its own intrinsic knowledge. On these datasets, Llama 2 7B and Mistral 7B perform similarly. Both struggle on the adversarially created DROP dataset under the exact-match metric.
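For context, the exact-match metric credits a prediction only if it equals a gold answer string after light normalization. The sketch below shows a typical normalization; DROP's official scorer is more elaborate (it also handles numbers and multi-span answers), so this is only illustrative.

```python
# Sketch of a simple exact-match (EM) check after light normalization.
# DROP's official evaluator is stricter (number handling, multi-span
# answers); this only illustrates the idea.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
    return " ".join(text.split())                # collapse whitespace

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))  # True after normalization
```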

Summary

In this blog we compared the Mistral 7B Instruct and Llama 2 7B Chat models on various quality benchmarks. We divided the benchmarks into intrinsic knowledge tasks, reasoning tasks, and question-answering tasks. Mistral was consistently better than Llama on the intrinsic knowledge tasks. Llama was better on the BIG-Bench Hard reasoning dataset but close to Mistral 7B on all the other reasoning and question-answering tasks. Having a powerful platform like AIQ has enabled us to run various benchmarks quickly on different models. If you would like to test various large language models and deploy them in production for your use case, please contact info@predera.com.


We hope you found our blog post informative. If you have any project inquiries or would like to discuss your data and analytics needs, please don't hesitate to contact us at info@predera.com. We're here to help! Thank you for reading.