Cost Scenarios for Large Language Models in the SHIRE
The following write-up describes a series of experiments run by members of the SHIRE team to estimate the cost of using various Large Language Models (LLMs) within the SHIRE. We hope this information serves as helpful guidance to study teams wanting to price out their LLM options.
Bottom Line Up Front: Conclusions from Testing
- Creating and using provisioned throughput endpoints is generally much more expensive than using the ready-made pay-per-token endpoints, so users should default to the pay-per-token endpoints whenever possible. In addition to costing much less, pay-per-token endpoints carry no risk of extra charges from accidentally leaving an endpoint running when not in use.
- Using the new scale-from-zero option for applicable provisioned throughput endpoints seems to dramatically cut down their costs, but their performance becomes much less reliable. Thus, we would not recommend using this feature at this time.
- Even knowing the DBU rate, throughput rate, time of use, number of input tokens, etc., there are clearly still factors impacting cost for which we don’t have visibility (i.e., the costs don’t scale in fully consistent ways across different tests), making it impossible to predict LLM use costs with more than moderate precision.
Testing Models
A few caveats
- This information accurately reflects the results of our testing, but your research use cases will likely produce very different costs. The best way to use the costs below is to compare the relative cost scale of each model, rather than relying on the exact dollar amounts shown.
- This information is current as of October 2025. Prices will likely change over time.
- This write-up is designed with a technical audience in mind. The results will be most useful to readers who have some experience using LLM endpoints.
- Several model tests resulted in errors. They are included in the table below to indicate how unanticipated errors or complications during usage can affect cost.
General procedures for tests
- Tester started a Windows VM in the SHIRE
- Spun up personal compute in Databricks to run the testing code notebook
- When necessary (i.e., provisioned throughput), created the model serving endpoint to be ready around the same time the personal compute was ready
- Utilized the model continuously for 10 minutes via an infinite code loop OR made intermittent calls to the endpoint over a 10-minute period OR see log below
- Upon completion of testing, immediately deleted model serving endpoint if created, spun down personal compute, and stopped VM
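For reference, the continuous-use tests above can be sketched as a simple timed loop. This is a minimal illustration, not the exact notebook the team used; `query_endpoint` is a placeholder for whatever client call (e.g., a Databricks model serving request) your workspace uses, and the prompt text is hypothetical.

```python
import time

def run_continuous_test(query_endpoint, duration_s=600):
    """Call the model endpoint in a loop for duration_s seconds (600 s = 10 min)."""
    start = time.monotonic()
    calls = 0
    while time.monotonic() - start < duration_s:
        # query_endpoint stands in for your actual serving call
        query_endpoint("Summarize the following clinical note: ...")
        calls += 1
    return calls
```

The intermittent tests differed only in adding a `time.sleep(...)` between calls rather than looping continuously.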
| Large Language Model and Version | Max. Throughput Rate | Length/Action of Test | Successful Test | Cost* |
|---|---|---|---|---|
| Llama 3.3 70B | Low (9,500 tokens/second; 343 DBU) | 10 minutes continuously | yes | $10.49 |
| Llama 3.3 70B | Doubled (19,000 tokens/second; 686 DBU) | 10 minutes continuously | yes | $9.97 |
| GPT OSS 20B | Low (100 model units; 107 DBU) | 10 minutes continuously | yes | $2.43 |
| GPT OSS 20B | Doubled (200 model units; 214 DBU) | 10 minutes continuously | yes | $4.62 |
| Llama 3.3 70B | Low (9,500 tokens/second; 343 DBU) | 10 minutes intermittently | yes | $10.02 |
| Llama 3.3 70B | Low (9,500 tokens/second; 343 DBU) | 10 minutes continuously; 60 minutes of no use; 10 minutes continuously¹ | yes | $26.86 |
| GPT OSS 20B | Low (100 model units; 107 DBU) | 30 minutes continuously | yes | $5.93 |
| GPT OSS 20B | Unspecified | 10 minutes of no use; 10 minutes continuous use; 60 minutes no use; 10 minutes continuous use | no² | $0.24 |
| Llama 3.3 70B | pay per token³ | 10 minutes continuously | yes | $0.17 |
| Llama 3.3 70B | pay per token³ | 10 minutes intermittently | yes | $0.10 |
| Claude Sonnet 4.5 | pay per token⁴ | 10 minutes continuously | yes | $0.67 |
| Claude Sonnet 4.5 | pay per token⁴ | 10 minutes intermittently | yes | $0.16 |
| GPT OSS 20B | pay per token⁵ | Classify 100 patients from notes | yes | $0.07 |
| GPT OSS 20B | Low (100 model units; 107 DBU) | Classify 100 patients from notes | yes | $1.07 |
*This is the specific cost of running the LLM only. It does not include the cost of running the VM or the compute for the code notebook.
¹This tested a “Scale to Zero” option
²This tested a “Scale from Zero” option, which does not specify a throughput rate. However, when trying to initiate 10 minutes of continuous use via the infinite code loop, the tester got an error saying “the workload exceeded the model unit rate limit for the endpoint, please try again later.” A moment later, the tester initiated the loop a second time, which was successful. After around 8 minutes of running the loop, the same error appeared again. The tester spun down the personal compute, waited 60 minutes, spun the personal compute back up, and tried to run the loop for another 10 minutes. The tester again received the same error initially and had to initiate the loop a second time. The error reappeared around 7 minutes in. The tester then stopped and shut everything down as usual.
³Pay per token Llama 3.3 70B was 7.143 DBU/1 million input tokens and 21.429 DBU/1 million output tokens
⁴Pay per token Claude Sonnet 4.5 was 47.143 DBU/1 million input tokens and 235.715 DBU/1 million output tokens
⁵Pay per token GPT OSS 20B was 1 DBU/1 million input tokens and 4.286 DBU/1 million output tokens
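Given these pay-per-token DBU rates, a rough dollar estimate can be computed from expected token counts. A minimal sketch follows; the `usd_per_dbu` value and the token counts are hypothetical placeholders for illustration only — check your workspace's actual DBU dollar rate before budgeting.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      dbu_per_m_input, dbu_per_m_output,
                      usd_per_dbu):
    """Convert token counts to DBUs, then DBUs to dollars."""
    dbus = (input_tokens / 1e6) * dbu_per_m_input \
         + (output_tokens / 1e6) * dbu_per_m_output
    return dbus * usd_per_dbu

# Example: Llama 3.3 70B pay-per-token rates from the table above,
# with a HYPOTHETICAL $0.07/DBU rate and illustrative token counts.
cost = estimate_cost_usd(500_000, 100_000, 7.143, 21.429, usd_per_dbu=0.07)
print(f"${cost:.2f}")  # prints "$0.40"
```

Because pay-per-token billing depends only on tokens processed, this kind of estimate is far more predictable than the provisioned throughput costs observed above.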