Cost Scenarios for Large Language Models in the SHIRE
The following write-up describes a series of experiments run by members of the SHIRE team to estimate the cost of using various Large Language Models (LLMs) within the SHIRE. We hope this information serves as helpful guidance to study teams wanting to price out their LLM options.
Bottom Line Up Front: Conclusions from Testing
- Creating and using provisioned throughput endpoints is generally much more expensive than using the ready-made pay-per-token endpoints, so users should default to the pay-per-token endpoints whenever possible. In addition to costing much less, pay-per-token endpoints carry no risk of extra charges from accidentally leaving an endpoint running when not in use.
- Using the new scale-from-zero option for applicable provisioned throughput endpoints seems to dramatically cut down their costs, but their performance becomes much less reliable. Thus, we would not recommend using this feature at this time.
- Even knowing the DBU rate, throughput rate, time of use, number of input tokens, etc., there are clearly still factors impacting cost for which we don’t have visibility (i.e., the costs don’t scale in fully consistent ways across different tests), making it impossible to predict LLM use costs with more than moderate precision.
Testing Models
A few caveats
- This information accurately reflects the results of our testing, but your research use cases will likely produce very different costs. The best way to use the costs below is to compare the relative cost scale of each model, rather than relying on the exact dollar amounts shown.
- This information is current as of October 2025. Prices will likely change over time.
- This write-up is designed with a technical audience in mind. The results will be most useful to readers who have some experience using LLM endpoints.
- Several model tests resulted in errors. They are included in the table below to indicate how unanticipated errors or complications during usage can affect cost.
General procedures for tests
- Tester started a Windows VM in the SHIRE
- Spun up personal compute in Databricks to run the testing code notebook
- When necessary (i.e., provisioned throughput), created the model serving endpoint to be ready around the same time the personal compute was ready
- Utilized the model continuously for 10 minutes via an infinite code loop OR made intermittent calls to the endpoint over a 10-minute period OR see log below
- Upon completion of testing, immediately deleted model serving endpoint if created, spun down personal compute, and stopped VM
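For reference, the continuous-use tests above can be sketched as a simple timed loop. This is a minimal illustration, not the exact notebook the team used; `query_endpoint` is a placeholder for whatever client call (e.g., a Databricks model serving request) your workspace uses, and the prompt text is hypothetical.

```python
import time

def run_continuous_test(query_endpoint, duration_s=600):
    """Call the model endpoint in a loop for duration_s seconds (600 s = 10 min)."""
    start = time.monotonic()
    calls = 0
    while time.monotonic() - start < duration_s:
        # query_endpoint stands in for your actual serving call
        query_endpoint("Summarize the following clinical note: ...")
        calls += 1
    return calls
```

The intermittent tests differed only in adding a `time.sleep(...)` between calls rather than looping continuously.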
| Large Language Model and Version | Max. Throughput Rate | Length/Action of Test | Successful Test | Cost* |
|---|---|---|---|---|
| Llama 3.3 70B | Low (9,500 tokens/second; 343 DBU) | 10 minutes continuously | yes | $10.49 |
| Llama 3.3 70B | Doubled (19,000 tokens/second; 686 DBU) | 10 minutes continuously | yes | $9.97 |
| GPT OSS 20B | Low (100 model units; 107 DBU) | 10 minutes continuously | yes | $2.43 |
| GPT OSS 20B | Doubled (200 model units; 214 DBU) | 10 minutes continuously | yes | $4.62 |
| Llama 3.3 70B | Low (9,500 tokens/second; 343 DBU) | 10 minutes intermittently | yes | $10.02 |
| Llama 3.3 70B | Low (9,500 tokens/second; 343 DBU) | 10 minutes continuously; 60 minutes of no use; 10 minutes continuously¹ | yes | $26.86 |
| GPT OSS 20B | Low (100 model units; 107 DBU) | 30 minutes continuously | yes | $5.93 |
| GPT OSS 20B | Unspecified | 10 minutes of no use; 10 minutes continuous use; 60 minutes no use; 10 minutes continuous use | no² | $0.24 |
| Llama 3.3 70B | pay per token³ | 10 minutes continuously | yes | $0.17 |
| Llama 3.3 70B | pay per token³ | 10 minutes intermittently | yes | $0.10 |
| Claude Sonnet 4.5 | pay per token⁴ | 10 minutes continuously | yes | $0.67 |
| Claude Sonnet 4.5 | pay per token⁴ | 10 minutes intermittently | yes | $0.16 |
| GPT OSS 20B | pay per token⁵ | Classify 100 patients from notes | yes | $0.07 |
| GPT OSS 20B | Low (100 model units; 107 DBU) | Classify 100 patients from notes | yes | $1.07 |
*This is the specific cost of running the LLM only. It does not include the cost of running the VM or the compute for the code notebook.
¹This tested a “Scale to Zero” option
²This tested a “Scale from Zero” option, which does not specify a throughput rate. However, when trying to initiate 10 minutes of continuous use via the infinite code loop, the tester got an error saying “the workload exceeded the model unit rate limit for the endpoint, please try again later.” A moment later, the tester initiated the loop a second time, which was successful. After around 8 minutes of running the loop, the same error appeared again. The tester spun down the personal compute, waited 60 minutes, spun the personal compute back up, and tried to run the loop for another 10 minutes. The tester again received the same error initially and had to initiate the loop a second time. The error reappeared around 7 minutes in. The tester then stopped and shut everything down as usual.
³Pay per token Llama 3.3 70B was 7.143 DBU/1 million input tokens and 21.429 DBU/1 million output tokens
⁴Pay per token Claude Sonnet 4.5 was 47.143 DBU/1 million input tokens and 235.715 DBU/1 million output tokens
⁵Pay per token GPT OSS 20B was 1 DBU/1 million input tokens and 4.286 DBU/1 million output tokens
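Given these pay-per-token DBU rates, a rough dollar estimate can be computed from expected token counts. A minimal sketch follows; the `usd_per_dbu` value and the token counts are hypothetical placeholders for illustration only — check your workspace's actual DBU dollar rate before budgeting.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      dbu_per_m_input, dbu_per_m_output,
                      usd_per_dbu):
    """Convert token counts to DBUs, then DBUs to dollars."""
    dbus = (input_tokens / 1e6) * dbu_per_m_input \
         + (output_tokens / 1e6) * dbu_per_m_output
    return dbus * usd_per_dbu

# Example: Llama 3.3 70B pay-per-token rates from the table above,
# with a HYPOTHETICAL $0.07/DBU rate and illustrative token counts.
cost = estimate_cost_usd(500_000, 100_000, 7.143, 21.429, usd_per_dbu=0.07)
print(f"${cost:.2f}")  # prints "$0.40"
```

Because pay-per-token billing depends only on tokens processed, this kind of estimate is far more predictable than the provisioned throughput costs observed above.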