Large Language Model Benchmarking: Open Source Models? Licensed Models?

Uğur Özker
13 min read · May 2, 2024

--

As the market for AI models consolidates around OpenAI and a handful of other proprietary players, some companies aim to compete by offering AI models for free. The open source approach of distributing technology freely for the public to use, share, and modify helped create the modern internet, cloud computing, and billion-dollar companies. There is no guarantee it will work for open-source large language models, but some of the tech industry's biggest players believe it could help them break OpenAI's dominance. Market research firm Valuates Reports estimated that OpenAI would account for nearly 80% of the global generative AI market in 2023. Meta's ambitious Llama family is currently the leading line of open source models. For AI practitioners, one of the key factors, beyond all other criteria, when choosing the right solution is the integration cost of generative AI applications. There are two basic options for developing an application backed by a large language model:

1- API Access Solution (Licensed Models): Using closed source models, such as OpenAI's latest GPT-4 model, through API access. With this method there is no need to provision or manage any infrastructure for the large language model; the model is generally served to users through SaaS and similar delivery models.
2- In-house Deployment (Open Source Models): Building on a pre-trained open source model and hosting it in your own IT infrastructure. In this method, all of the model's files and data must be hosted in a private or public cloud environment provided by the institution. The infrastructure is allocated entirely from the institution's own resources, and the maintenance and management costs are borne by the institution, while the marginal cost of model consumption or training remains minimal.

Open source AI models look very attractive to businesses that prioritize data security and have their own data centers or GPU clusters, as they offer a way to use large language models without paying a vendor like Microsoft or sharing data. And because they are shared publicly, open source models often give companies a foundation on which to build their own models. So which one should we use? Which method is best for our company? What are the different components that contribute to the cost of a large language model? What should be considered when evaluating each component? While searching for answers to these questions, it is necessary to approach the issue from three different perspectives:

1- Project Setup and Usage Cost
2- Maintenance Cost
3- Additional Costs and Other Environmental Factors

Project Setup and Usage Cost
The cost of the initial setup includes expenses associated with storing the model and making predictions for query requests.

In terms of Licensed Models:
The most frequently used providers of licensed models are OpenAI, Cohere, Alphabet, and Anthropic. Each provider has different pricing metrics that depend on the use case (summarization, classification, embedding, etc.) and the chosen model. For each request, you are billed based on the total number of input (query) tokens (100 tokens correspond to roughly 75 words) and the number of output (answer) tokens generated. In some use cases, you may also pay a per-request price.

Let's say we want to create an API for an assistant application powered by large language model technology. Assume that, on average, 200 tokens (150 words) are input per end-to-end dialogue and 200 tokens (150 words) are output throughout the dialogue provided by the model, and that an average of 100 dialogue requests are handled per day. The total number of tokens used per day is therefore 100 × (200 + 200) = 40,000 tokens/day (30,000 words). GPT-3.5 Turbo, for example, costs about $0.0001 per 200 input tokens and $0.0003 per 200 output tokens. The total daily cost is $0.04, or roughly $15 per year.
If demand volume increases to X times its current level, the cost also increases linearly to X times. As a rule of thumb, if the number of chats increases to 500 chats/day, the annual cost rises to roughly $73, and for 5,000 chats per day it rises to roughly $730.
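As a rough sanity check, this arithmetic can be captured in a few lines of Python; the prices and traffic figures below are the illustrative assumptions from the example above, not official list prices:

```python
# Rough annual cost estimate for an API-based (licensed) LLM assistant.
# All figures are illustrative assumptions, not official list prices.

def annual_api_cost(dialogues_per_day: int,
                    input_tokens: int = 200,          # tokens per dialogue (input)
                    output_tokens: int = 200,         # tokens per dialogue (output)
                    price_in_per_1k: float = 0.0005,  # USD per 1K input tokens
                    price_out_per_1k: float = 0.0015  # USD per 1K output tokens
                    ) -> float:
    cost_per_dialogue = (input_tokens / 1000) * price_in_per_1k \
                      + (output_tokens / 1000) * price_out_per_1k
    return cost_per_dialogue * dialogues_per_day * 365

for volume in (100, 500, 5000):
    print(f"{volume:>5} dialogues/day -> ~${annual_api_cost(volume):.2f} per year")
# 100 -> ~$14.60, 500 -> ~$73.00, 5000 -> ~$730.00
```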

In the table below, you can examine the inference costs of the main licensed models as of the first quarter of 2024. The query and response fees in the table are shown as unit prices per an average of 750 words, while the per-API-call pricing in the rightmost column is calculated assuming an average of 150 words of input and the same number of output words.

For Open Source Models:
Hosting an open source model can be challenging for your IT infrastructure because the model parameters are very large. The cost of the initial setup primarily includes the costs associated with setting up an appropriate IT infrastructure to host the model. The models we need to host in our infrastructures can be divided into two groups:

a- Small size models suitable for running locally on your personal computers
b- Large-sized models that require hosting on cloud servers such as AWS, Google Cloud Platform or internal GPU-supported servers.

In this context, pricing is determined by the hardware, and users are billed on an hourly basis. Costs vary depending on the infrastructure needed: the most affordable option starts at about $0.60 per hour for an NVIDIA T4 (16 GB), and for the most commonly used high-end configuration the cost can reach about $45.00 per hour for 8 NVIDIA A100 GPUs (640 GB in total); the largest-parameter version of the recently released Llama 3, for example, requires roughly this level of infrastructure. We can expect these costs to rise further as more advanced alternatives such as the H200 or Blackwell (B100) begin to be offered by cloud providers.

If we want to purchase GPU hardware such as the A100 or H100 outright, we need to invest between $15,000 and $30,000 per unit, and on top of that we must also budget for the server systems, database resources, and disks that will host this hardware, as well as the management services provided by expert personnel.

For an on-premises solution, we need to allocate a larger IT infrastructure, or build our existing systems on scalable resources, to keep latency acceptable as the number of requests increases. However, this cost grows more slowly than the strictly linear increase of an API solution. As a result, when the number of requests is low and remains below the break-even point, licensed models are more cost-effective than open source large language models deployed in environments such as AWS, Azure, or OpenShift. The same scaling of cost with usage also applies to maintenance costs, especially fine-tuning, when dealing with very large datasets.
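A minimal sketch of this break-even reasoning, reusing the illustrative per-dialogue API price and the T4 hourly rental rate mentioned above (both assumptions, not quotes from any provider):

```python
# Break-even between a pay-per-token API and an always-on, self-hosted GPU server.
# All prices are illustrative assumptions taken from the examples in this article.

API_COST_PER_DIALOGUE = 0.0004            # USD per dialogue (200 input + 200 output tokens)
GPU_HOURLY_RATE = 0.60                    # USD/hour, e.g. a rented NVIDIA T4
HOSTING_COST_PER_DAY = GPU_HOURLY_RATE * 24   # server runs around the clock

# Below this daily volume the API is cheaper; above it, self-hosting starts to win.
break_even = HOSTING_COST_PER_DAY / API_COST_PER_DIALOGUE
print(f"Break-even: ~{break_even:,.0f} dialogues/day")  # ~36,000 with these assumptions
```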

Maintenance Cost:
When the model's performance degrades due to changes in data distribution, it becomes necessary to fine-tune it with new datasets. The maintenance cost therefore covers labeling your training dataset (labeling is not mandatory; fine-tuning can also be performed with unstructured data), fine-tuning the model, and deploying the new version.

For Licensed Models:
Some providers, such as OpenAI or Alphabet, offer fine-tuning services, with pricing covering data loading, model training, and deployment of the newly trained model. Below is an example of pricing for two different licensed models:

Example GPT-3.5 Turbo Fine Tuning Cost:
Model Training Cost: $0.0080 per 1,000 tokens (training) × 10,000 = $80 + $0.0120 per 1,000 tokens (query usage) × 10,000 = $120 + $0.0160 per 1,000 tokens (answer usage) × 10,000 = $160, assuming 10,000 units of 1,000 tokens (10 million tokens); Total = $360

Computing Power Cost: $0.5 per hour × 15 days × 24 hours = $180
Total Cost: Model Training Cost + Computing Power Cost = $360 + $180 = $540

Example Claude 2 Fine-Tuning Cost:
Model Training Cost: $0.00163 per 1,000 tokens (query usage) × 10,000 = $16.30 + $0.00551 per 1,000 tokens (answer usage) × 10,000 = $55.10; Total = $71.40

Computing Power Cost: $0.5 per hour × 15 days × 24 hours = $180
Total Cost: Model Training Cost + Computing Power Cost = $71.40 + $180 = $251.40

It is possible to compare the maintenance costs between different language models as we have illustrated above.
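The same breakdown can be wrapped in a small helper so different providers can be compared side by side; the rates passed in below simply repeat the illustrative figures above:

```python
# Fine-tuning cost estimate for a licensed model, mirroring the breakdown above.
# Rates are illustrative; check the provider's current price list before relying on them.

def licensed_finetune_cost(k_token_units: int,           # data size in units of 1K tokens
                           train_rate: float,            # USD per 1K training tokens
                           input_rate: float,            # USD per 1K input (query) tokens
                           output_rate: float,           # USD per 1K output (answer) tokens
                           compute_hourly: float = 0.5,  # USD/hour for the training job
                           compute_hours: int = 15 * 24) -> float:
    model_training = k_token_units * (train_rate + input_rate + output_rate)
    compute_power = compute_hourly * compute_hours
    return model_training + compute_power

# The two examples above: 10,000 x 1K tokens (10 million tokens) of data each
print(round(licensed_finetune_cost(10_000, 0.0080, 0.0120, 0.0160), 2))  # 540.0
print(round(licensed_finetune_cost(10_000, 0.0, 0.00163, 0.00551), 2))   # 251.4
```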

For Open Source Models:
The cost of fine-tuning open source models consists primarily of expenses associated with running IT infrastructure to retrain a language model. The cost you will incur is directly proportional to the time you rent the server to fine-tune the model. The time required to complete fine-tuning can be estimated based on the complexity of the task, such as the size of your pre-trained model (number of model parameters), the size of the training dataset, and so on. The training process will take longer the more complex the planned task. For a training dataset containing several thousand tokens, fine-tuning may take several hours to complete, depending on the infrastructure resource you have. The total cost can be calculated as follows:
Total Cost = Hourly cost of the rented infrastructure system x Number of hours required for training the model.

For institutions that operate their own data centers and GPU clusters, this cost consists only of the electrical energy consumed while the servers are running; beyond energy consumption, no additional costs are incurred.
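A minimal sketch of the two self-hosting cases, where the power draw, electricity price, and PUE values are placeholder assumptions rather than measured figures:

```python
# Self-hosted fine-tuning cost: rented cloud GPUs vs. owned hardware (energy only).
# Power draw, electricity price, and PUE are placeholder assumptions.

def rented_finetune_cost(hourly_rate: float, hours: float) -> float:
    """Total cost = hourly cost of the rented infrastructure x training hours."""
    return hourly_rate * hours

def owned_finetune_energy_cost(gpu_count: int, hours: float,
                               watts_per_gpu: float = 400.0,  # average draw per GPU
                               pue: float = 1.5,              # data center overhead
                               usd_per_kwh: float = 0.15) -> float:
    """Electricity-only cost of fine-tuning on an institution's own GPU cluster."""
    kwh = gpu_count * watts_per_gpu * hours * pue / 1000.0
    return kwh * usd_per_kwh

print(rented_finetune_cost(45.0, 6))        # 270.0 -> 8x A100 node rented for 6 hours
print(owned_finetune_energy_cost(8, 6))     # ~4.32 -> same job, energy cost only
```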

Additional Costs and Other Environmental Factors
When using large language models for inference or retraining, AI practitioners need to choose an adaptable IT infrastructure carefully to reduce the CO2 cost. Several important factors determine the level of CO2 impact of an IT infrastructure:
Computation: The number of FLOPs (floating point operations) required to complete a task. The number of FLOPs depends on the model's parameters and the size of the data used in training.
Infrastructure — Hardware: Represents all of the server and GPU hardware that will be used to train a language model. The more powerful or numerous your GPU hardware is, the more energy will be consumed. The GPU requirement you will need is proportional to the language model you will use and the number of parameters the model has.
Data Center Location and Efficiency: Each data center runs on locally produced energy, so CO2 emissions and energy efficiency vary with the data center's location. How efficiently a data center operates depends on the energy costs and infrastructure of that region. Data center efficiency is expressed in terms of PUE (Power Usage Effectiveness).

As you can see in the bottom two rows of the table, the carbon emissions and energy costs of training large language models are generally high. Notably, as new models such as GPT-4 emerge, the carbon emissions and energy consumed are also rising rapidly.

Some studies show that a larger model does not always result in better performance. It is pointed out that finding a balance between model size and performance and choosing the model that best suits your use case is the most convenient and optimum cost-effective method of use.

We can directly estimate the CO2 impact of our own data centers or rented servers by calculating the overall energy or resource consumption of the infrastructure hosting large language model applications. One useful tool for tracking CO2 emissions is CodeCarbon.io.
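These three factors can be combined into a back-of-the-envelope CO2 estimate; the power draw, PUE, and grid carbon intensity below are placeholder assumptions, which a tracker such as CodeCarbon would replace with measured values:

```python
# Back-of-the-envelope CO2 estimate for an LLM workload.
# Power draw, PUE, and grid carbon intensity are placeholder assumptions;
# a tool like CodeCarbon measures these values instead of assuming them.

def co2_kg(gpu_count: int, hours: float,
           watts_per_gpu: float = 400.0,      # average GPU power draw
           pue: float = 1.5,                  # data center efficiency (PUE)
           grid_kg_co2_per_kwh: float = 0.4   # carbon intensity of the local grid
           ) -> float:
    energy_kwh = gpu_count * watts_per_gpu * hours * pue / 1000.0
    return energy_kwh * grid_kg_co2_per_kwh

# e.g. fine-tuning on 8 GPUs for 24 hours
print(f"{co2_kg(8, 24):.1f} kg CO2")  # ~46.1 kg with these assumptions
```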

Effects of Model Selection on Personnel
Another critical resource to consider is the human factor. How well the team can acquire the new knowledge and skills required to maintain the service is vital. In this regard, open source solutions tend to be more costly in terms of maintenance and management, as they require specialized personnel to train and maintain large language models. In contrast, with API access solutions, these tasks can be performed by the provider’s engineering team and infrastructure.

Licensed models offer convenience and scalability but may incur higher costs for inference. On-premises deployment, on the other hand, provides greater control over data privacy but requires meticulous infrastructure and resource management.

The cost structure in this area may change rapidly in the near future. It is important to note that these estimates are rough, as additional factors come into play when running an entire large language model implementation in a real-life scenario. To obtain more accurate figures, benchmarking efforts such as proof-of-concept (POC) projects should be conducted to estimate the average cost of the entire project pipeline.

For strategies that tend to be expensive, especially when dealing with large collections of queries and text, certain techniques can be applied to reduce costs. Considering the studies carried out and the costs incurred, hybrid architectures, which combine relatively small-scale or open source models trained on a specific domain or specialized data with licensed models that carry broader knowledge, come closest to the optimum, cost-effective solution. For this type of hybrid use, the human factor again comes to the fore. In particular, establishing the integrated and synchronized operation of these models, designing architectures that cover security needs, matching the right models to the right business needs, and maintaining them according to LLMOps principles are the main management needs that require expertise.

Addressing this issue, a research paper from Stanford University titled "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance" proposes a flexible approach to cost optimization. The paper introduces FrugalGPT, a flexible framework for deciding which combinations of large language models to use in order to reduce cost and increase accuracy. The technique can reduce query costs by up to 98% while matching or even exceeding the performance of the best individual large language models. It achieves this with three main concepts:
Query Adaptation: This strategy looks for ways to identify effective, often shorter, queries to reduce costs.
Task/Domain Specific Model Use: Aims to use simpler and more cost-effective models that can match the performance of powerful but expensive models on specific tasks.
Query — Model Relationship: This method focuses on adaptively choosing which models to use for different queries.
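As a toy illustration of the third concept, the sketch below tries cheaper models first and escalates only when a (hypothetical) confidence score falls below a threshold. The model names, prices, and scoring function are placeholders, not the paper's actual implementation:

```python
# Toy sketch of an LLM cascade in the spirit of FrugalGPT: try cheap models first,
# escalate to more expensive ones only when confidence is too low.
# Model names, prices, and the scorer are placeholders, not the paper's implementation.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    name: str
    cost_per_call: float                  # USD per query, illustrative
    answer: Callable[[str], str]          # calls the underlying LLM API

def cascade(query: str, models: list[Model],
            score: Callable[[str, str], float],   # confidence scorer: (query, answer) -> [0, 1]
            threshold: float = 0.8):
    spent = 0.0
    answer, used = None, None
    for model in models:                  # ordered from cheapest to most expensive
        answer = model.answer(query)
        spent += model.cost_per_call
        used = model.name
        if score(query, answer) >= threshold:
            break                         # cheap answer is good enough; stop here
    return answer, used, spent

# Usage sketch with stub models standing in for real API calls:
models = [
    Model("small-open-model", 0.0001, lambda q: "draft answer"),
    Model("large-licensed-model", 0.0100, lambda q: "detailed answer"),
]
print(cascade("What is PUE?", models, score=lambda q, a: 0.9))
```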

It is expected that new approaches focusing on issues that require expertise, such as the FrugalGPT example, will increase day by day and will lighten the burden of qualified personnel by setting standards.

Large Language Model Production and Usage Policies of Enterprises
Startups that produce open-source large language models say they can turn a profit by selling enterprise-level services and applications alongside their open models. The willingness of businesses to pay for ready-to-use applications such as enterprise features or chatbots, and the ease of outsourcing the creation of custom models to AI companies, are the main forces encouraging both producers and consumers of open source models. HuggingFace, which has raised approximately $400 million from technology giants such as Alphabet and Nvidia, helped develop the open source BLOOM and StarCoder models for this purpose. HuggingFace product manager Jeff Boudier said that after Meta and other major players rolled out their open source policies, the organization shifted to selling enterprise support, providing the management platform and computing power for other open models rather than focusing on model development. Boudier added that, unlike the open source software developed decades ago, the commercialization of open source AI is "a very unexplored area," and that the open source trend, especially in large language models, requires large amounts of capital and infrastructure ownership.

As Boudier said, AI models require more capital than software. Large tech firms like Microsoft, Meta, and Alphabet have large-scale data centers and the capital to purchase GPUs and invest billions of dollars into building their AI. When the advantages of the open source ecosystem and the economic potential of licensed model sales came together, they led the technology giants to a third approach: hybrid model production. On the one hand, the giants compete with each other and move toward general-purpose artificial intelligence by producing licensed models with ever larger parameter counts; on the other hand, they release open source models aimed at distributed, smaller-scale data processing, often called edge computing. To give a few examples: Microsoft keeps improving its GPT models through its collaboration with OpenAI, while also offering small-scale models such as Phi-3 as open source through HuggingFace, providing large language models of different scales for different needs under licensing models that comply with its policies. Similarly, Alphabet offers the Gemini model as a licensed rival to the GPT series, while the Gemma series, a smaller version of the model, can be used as open source.

While Meta does not yet offer a licensed (closed) version of the Llama model series, its open source usage model deserves a closer look. Lawrence Lessig, a professor at Harvard Law School who studies technology and policy, notes that companies like Meta restrict certain uses while calling their models open source. Llama 2's community license allows developers to use it commercially, but requires Meta's permission to use it in a product with more than 700 million monthly users. A Meta representative says the company's open source approach is to help resource-constrained companies and developers access AI, allowing the AI community to "advance the cutting edge in security and overall model performance while balancing liability concerns."

While creating your large language model pool, you should question which language models you need for your usage scenarios and which model performs best among the suitable candidates, while also thinking about how to minimize the cost of the resulting model pool. When calculating setup, maintenance, and other costs, it is a sound approach to work through the following issues step by step (a rough projection sketch follows the list):

1. Use Case: Understand the purpose of using a large language model, because this is the key factor that affects resource requirements and, subsequently, costs. There are different model alternatives specific to each scenario.
2. Model Type: Different large language models have varying complexities and resource demands. Infrastructure and resource consumption depend on the type of model used.
3. Expected Traffic Volume: The expected traffic volume plays an important role in determining the required infrastructure capacity, which in turn affects overhead costs. The number of concurrent queries and the workload you need to serve are directly related to the infrastructure you have.
4. Query (Prompt) Structure (Input/Output): Whether licensed or open source, resource consumption and the related billing always depend on how detailed and long the questions and answers are.
5. Data Size: Data volume affects storage requirements and therefore contributes positively or negatively to the cost projection. In terms of maintenance costs, the most critical issue to calculate is how much data will be used, in what formats, and in what way. As the amount of data increases, the training time and the resources consumed increase at the same rate.
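As a closing sketch, the five considerations above can feed a very rough first-pass projection; every figure below is a placeholder to be replaced with values from your own scenario:

```python
# Very rough first-pass annual cost projection built from the five considerations above.
# Every figure below is a placeholder; replace it with values from your own scenario.

def rough_annual_projection(dialogues_per_day: int,       # expected traffic volume
                            tokens_per_dialogue: int,     # prompt structure (input + output)
                            usd_per_1k_tokens: float,     # model type / provider pricing
                            finetunes_per_year: int = 0,  # driven by data size and drift
                            usd_per_finetune: float = 0.0) -> float:
    inference = dialogues_per_day * (tokens_per_dialogue / 1000) * usd_per_1k_tokens * 365
    maintenance = finetunes_per_year * usd_per_finetune
    return inference + maintenance

print(rough_annual_projection(500, 400, 0.001, finetunes_per_year=4, usd_per_finetune=540))
# -> 73.0 (inference) + 2160.0 (fine-tuning) = 2233.0
```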

--

Uğur Özker

Computer Engineer, MSc, MBA, PMP®, Senior Solution Architect, IBM