Manually validating compatibility and running NVIDIA NIM container images

NVIDIA NIMs are pre-packaged, containerized models that are ready to run. The NIMs and their included models are available in a variety of profiles that support different compute hardware configurations. You can run a NIM in a query mode that reports which profiles are compatible with your GPU hardware, and then run the NIM with one of those profiles.

Sometimes there are still problems, and we have to add tuning parameters to fit the model in memory or change data types. In my case, the data type change works around what looks like a bug in the NIM startup detection code.

This article requires additional polish.  It has more than a few rough edges.  

NVIDIA NIMs are semi-opaque: you cannot build your own NIM, and NVIDIA does not document how they are constructed.

Examining NVIDIA Model Container Images

The first step is to select models that we think can fit and run on our NVIDIA GPU hardware. Investigate the different model types by visiting the appropriate NVIDIA NIM docs; the large language models page has instructions on how to run the NIM containers locally.

Test platform and plan

Our basic plan is to:
  • Run this test on Ubuntu Linux.
  • Host the models locally on my single NVIDIA Titan RTX 24GB card system.
  • Use the CLI for all testing.
Because of hardware limitations, I will be using what the support matrix page describes as non-optimized models. A quick pre-flight check to confirm Docker can see the GPU is shown below.
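
Before pulling any of the NIM images it is worth verifying that Docker can actually see the GPU. This is just a sanity check, assuming the NVIDIA Container Toolkit is already installed:

# Confirm the driver sees the card on the host
nvidia-smi

# Confirm Docker can pass the GPU into a container.
# The NVIDIA Container Toolkit injects nvidia-smi into the container at run time.
docker run --rm --gpus all ubuntu:22.04 nvidia-smi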

Some Models

Model | Works for me | Additional details
meta-llama-3-70b-instruct | No | Support matrix: meta llama 3 70b instruct. It requires a minimum of 240GB of VRAM.
meta-llama-3-8b-instruct | Yes | This is the model used in the NIM Anywhere project. Support matrix: meta llama 3 8b instruct. It can run on a single card in 24GB of VRAM with fp16 precision.
meta-llama-3.1-8b-base | Yes | This is the model used in the NIM Anywhere project. Support matrix: meta llama 3.1 8b base. It can run on a single card in 24GB of VRAM with fp16 precision.
meta-llama-3.1-70b-instruct | No | Support matrix: meta llama 3.1 70b instruct.
meta-llama-3.1-405b-instruct | No | Support matrix: meta llama 3.1 405b instruct.
meta-llama-3.1-8b-instruct | Yes | This is the model used in the NIM Anywhere project. Support matrix: meta llama 3.1 8b instruct. It can run on a single card in 24GB of VRAM with fp16 precision.
mistral-7b-instruct-v0.3 | Yes | Support matrix: mistral 7b instruct v0.3. It can run in 24GB of VRAM with fp16 precision.
mixtral-8x7b-instruct-v0.1 | No | Support matrix: mixtral 8x7b v0.1.
mixtral-8x22b-instruct-v0.1 | No | Support matrix: mixtral 8x22b v0.1.
other | n/a | n/a

A Video demonstration

https://youtu.be/JtufRBLFwxM

Validating compatibility

We can ask each NIM to validate itself against our hardware. Note that each of these checks downloads the model container image to your machine. The downloads can be large and take up a lot of disk space; the intermediate layers alone were 29.8GB.


REPOSITORY                                          SIZE
project-hybrid-rag                                  29.7GB
nvcr.io/nim/meta/llama-3.1-405b-instruct            12.9GB
nvcr.io/nim/meta/llama-3.1-8b-instruct              12.9GB
nvcr.io/nim/meta/llama-3.1-70b-instruct             12.9GB
nvcr.io/nim/meta/llama-3.1-8b-base                  12.9GB
project-nim-anywhere                                4.59GB
nvcr.io/nim/nvidia/megatron-1b-nmt                  14.3GB
nvcr.io/nim/nvidia/parakeet-ctc-1.1b-asr            14.3GB
nvcr.io/nim/nvidia/fastpitch-hifigan-tts            14.3GB
nvcr.io/nim/snowflake/arctic-embed-l                15.7GB
nvcr.io/nim/nvidia/nv-embedqa-e5-v5                 15.7GB
nvcr.io/nim/nvidia/nv-embedqa-mistral-7b-v2         15.7GB
nvcr.io/nim/snowflake/arctic-embed-l                15.7GB
nvcr.io/nim/nvidia/nv-embedqa-mistral-7b-v2         15.7GB
nvcr.io/nim/nvidia/nv-embedqa-e5-v5                 15.7GB
<none>                                              4.34GB
nvcr.io/nim/mistralai/mixtral-8x22b-instruct-v01    12.5GB
nvcr.io/nim/mistralai/mixtral-8x7b-instruct-v01     12.5GB
project-gpu-sample                                  22.1GB
nvcr.io/nim/mistralai/mistral-7b-instruct-v03       12.5GB
covid-vaccinations-python                           1.88GB
project-covid-vaccinations-python                   1.23GB
project-dogfood                                     1.23GB
nvcr.io/nim/meta/llama3-70b-instruct                12.5GB
nvcr.io/nim/meta/llama3-8b-instruct                 12.5GB
redis                                               116MB
milvusdb/milvus                                     1.71GB
traefik                                             153MB


The above output shows the various images I downloaded during testing. All of the nvcr.io/nim images were pulled as a side effect of running
docker run ... list-model-profiles
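
If you want to reproduce a listing like the one above, the docker CLI can print the repository and size of every local image. The format string below is my own choice, not anything NIM-specific:

# Repository and size for every local image
docker images --format "table {{.Repository}}\t{{.Size}}"

# Total disk space consumed by images, containers, and volumes
docker system df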

Local RTX 3060 TI 8GB

None of the models will fit on this card.

nvcr.io/nim/meta/llama3-70b-instruct:1.0.0

Note that this model is not compatible with this system. I'm not sure why it shows up as "incompatible with system" rather than as needing more memory.

$ docker run --gpus all nvcr.io/nim/meta/llama3-70b-instruct:1.0.0 list-model-profiles

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-70b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

SYSTEM INFO
- Free GPUs: <None>
- Non-free GPUs:
  -  [2489:10de] (0) NVIDIA GeForce RTX 3060 Ti [current utilization: 5%]
MODEL PROFILES
- Compatible with system and runnable: <None>
- Incompatible with system:
  - 93782c337ddaf3f6b442ef7baebd0a732e433b7b93ec06bfd3db27945de560b6 (tensorrt_llm-h100-fp8-tp8-latency)
  - 2e9b29c44b3d82821e7f4facd1a652ec5e0c4e7e473ee33451f6d046f616c3e5 (tensorrt_llm-l40s-fp8-tp8-latency)
  - 8bb3b94e326789381bbf287e54057005a9250e3abbad0b1702a70222529fcd17 (tensorrt_llm-h100-fp8-tp4-throughput)
  - 8b8e03de8630626b904b37910e3d82a26cebb99634a921a0e5c59cb84125efe8 (tensorrt_llm-l40s-fp8-tp4-throughput)
  - a90b2c0217492b1020bead4e7453199c9f965ea53e9506c007f4ff7b58ee01ff (tensorrt_llm-h100-fp16-tp8-latency)
  - 96b70da1414c7beb5cf671b3e7cf835078740764064c453cd86e84cf0491aac0 (tensorrt_llm-l40s-fp16-tp8-throughput)
  - b811296367317f5097ed9f71b8f08d2688b2411c852978ae49e8a0d5c3a30739 (tensorrt_llm-a100-fp16-tp4-throughput)
  - abcff5042bfc3fa9f4d1e715b2e016c11c6735855edfe2093e9c24e83733788e (tensorrt_llm-h100-fp16-tp4-throughput)
  - 7f8bb4a2b97cf07faf6fb930ba67f33671492b7653f2a06fe522c7de65b544ca (tensorrt_llm-a100-bf16-tp8-latency)
  - 7e8f6cc0d0fde672073a20d5423977a0a02a9b0693f0f3d4ffc2ec8ac35474d4 (vllm-fp16-tp8)
  - df45ca2c979e5c64798908815381c59159c1d08066407d402f00c6d4abd5b108 (vllm-fp16-tp4)
  - 03fdb4d11f01be10c31b00e7c0540e2835e89a0079b483ad2dd3c25c8cc29b61 (tensorrt_llm-l40s-fp16-tp8-throughput-lora)
  - 7ba9fbd93c41a28358215f3e94e79a2545ab44e39df016eb4c7d7cadc384bde7 (tensorrt_llm-a100-fp16-tp4-throughput-lora)
  - 36fc1fa4fc35c1d54da115a39323080b08d7937dceb8ba47be44f4da0ec720ff (tensorrt_llm-h100-fp16-tp4-throughput-lora)
  - 0f3de1afe11d355e01657424a267fbaad19bfea3143a9879307c49aed8299db0 (vllm-fp16-tp8-lora)
  - a30aae0ed459082efed26a7f3bc21aa97cccad35509498b58449a52c27698544 (vllm-fp16-tp4-lora)
PS C:\Users\joe>


nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

Used by NIM Anywhere as of 2024/07

There is a profile of this model that is compatible with the RTX 3060 Ti but cannot run because there is not enough free VRAM.

$ docker run --gpus all nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 list-model-profiles

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

SYSTEM INFO
- Free GPUs: <None>
- Non-free GPUs:
  -  [2489:10de] (0) NVIDIA GeForce RTX 3060 Ti [current utilization: 5%]
MODEL PROFILES
- Compatible with system and runnable: <None>
- Compatible with system but not runnable due to low GPU free memory
  - 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
  - With LoRA support:
    - 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)
- Incompatible with system:
  - dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)
  - f59d52b0715ee1ecf01e6759dea23655b93ed26b12e57126d9ec43b397ea2b87 (tensorrt_llm-l40s-fp8-tp2-latency)
  - 30b562864b5b1e3b236f7b6d6a0998efbed491e4917323d04590f715aa9897dc (tensorrt_llm-h100-fp8-tp1-throughput)
  - 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b (tensorrt_llm-l40s-fp8-tp1-throughput)
  - a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-a100-fp16-tp2-latency)
  - e0f4a47844733eb57f9f9c3566432acb8d20482a1d06ec1c0d71ece448e21086 (tensorrt_llm-a10g-fp16-tp2-latency)
  - 879b05541189ce8f6323656b25b7dff1930faca2abe552431848e62b7e767080 (tensorrt_llm-h100-fp16-tp2-latency)
  - 24199f79a562b187c52e644489177b6a4eae0c9fdad6f7d0a8cb3677f5b1bc89 (tensorrt_llm-l40s-fp16-tp2-latency)
  - 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-a100-fp16-tp1-throughput)
  - c334b76d50783655bdf62b8138511456f7b23083553d310268d0d05f254c012b (tensorrt_llm-a10g-fp16-tp1-throughput)
  - cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1 (tensorrt_llm-h100-fp16-tp1-throughput)
  - d8dd8af82e0035d7ca50b994d85a3740dbd84ddb4ed330e30c509e041ba79f80 (tensorrt_llm-l40s-fp16-tp1-throughput)
  - 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
  - 9137f4d51dadb93c6b5864a19fd7c035bf0b718f3e15ae9474233ebd6468c359 (tensorrt_llm-a10g-fp16-tp2-throughput-lora)
  - cce57ae50c3af15625c1668d5ac4ccbe82f40fa2e8379cc7b842cc6c976fd334 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
  - 3bdf6456ff21c19d5c7cc37010790448a4be613a1fd12916655dfab5a0dd9b8e (tensorrt_llm-h100-fp16-tp1-throughput-lora)
  - 388140213ee9615e643bda09d85082a21f51622c07bde3d0811d7c6998873a0b (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
  - c5ffce8f82de1ce607df62a4b983e29347908fb9274a0b7a24537d6ff8390eb9 (vllm-fp16-tp2-lora)


Validating compatibility with local Titan RTX 24GB

nvcr.io/nim/meta/llama3-70b-instruct:1.0.0

This system is not compatible with the model.  I'm not sure why.

$ docker run --gpus all nvcr.io/nim/meta/llama3-70b-instruct:1.0.0 list-model-profiles

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-70b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

SYSTEM INFO
- Free GPUs:
  -  [1e02:10de] (0) NVIDIA TITAN RTX [current utilization: 2%]
MODEL PROFILES
- Compatible with system and runnable: <None>
- Incompatible with system:
  - 93782c337ddaf3f6b442ef7baebd0a732e433b7b93ec06bfd3db27945de560b6 (tensorrt_llm-h100-fp8-tp8-latency)
  - 2e9b29c44b3d82821e7f4facd1a652ec5e0c4e7e473ee33451f6d046f616c3e5 (tensorrt_llm-l40s-fp8-tp8-latency)
  - 8bb3b94e326789381bbf287e54057005a9250e3abbad0b1702a70222529fcd17 (tensorrt_llm-h100-fp8-tp4-throughput)
  - 8b8e03de8630626b904b37910e3d82a26cebb99634a921a0e5c59cb84125efe8 (tensorrt_llm-l40s-fp8-tp4-throughput)
  - a90b2c0217492b1020bead4e7453199c9f965ea53e9506c007f4ff7b58ee01ff (tensorrt_llm-h100-fp16-tp8-latency)
  - 96b70da1414c7beb5cf671b3e7cf835078740764064c453cd86e84cf0491aac0 (tensorrt_llm-l40s-fp16-tp8-throughput)
  - b811296367317f5097ed9f71b8f08d2688b2411c852978ae49e8a0d5c3a30739 (tensorrt_llm-a100-fp16-tp4-throughput)
  - abcff5042bfc3fa9f4d1e715b2e016c11c6735855edfe2093e9c24e83733788e (tensorrt_llm-h100-fp16-tp4-throughput)
  - 7f8bb4a2b97cf07faf6fb930ba67f33671492b7653f2a06fe522c7de65b544ca (tensorrt_llm-a100-bf16-tp8-latency)
  - 7e8f6cc0d0fde672073a20d5423977a0a02a9b0693f0f3d4ffc2ec8ac35474d4 (vllm-fp16-tp8)
  - df45ca2c979e5c64798908815381c59159c1d08066407d402f00c6d4abd5b108 (vllm-fp16-tp4)
  - 03fdb4d11f01be10c31b00e7c0540e2835e89a0079b483ad2dd3c25c8cc29b61 (tensorrt_llm-l40s-fp16-tp8-throughput-lora)
  - 7ba9fbd93c41a28358215f3e94e79a2545ab44e39df016eb4c7d7cadc384bde7 (tensorrt_llm-a100-fp16-tp4-throughput-lora)
  - 36fc1fa4fc35c1d54da115a39323080b08d7937dceb8ba47be44f4da0ec720ff (tensorrt_llm-h100-fp16-tp4-throughput-lora)
  - 0f3de1afe11d355e01657424a267fbaad19bfea3143a9879307c49aed8299db0 (vllm-fp16-tp8-lora)
  - a30aae0ed459082efed26a7f3bc21aa97cccad35509498b58449a52c27698544 (vllm-fp16-tp4-lora)


nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

Used by NIM Anywhere as of 2024/07

There is a profile of this model that is compatible and that fits inside our 24GB of VRAM.  

$ docker run --gpus all nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 list-model-profiles

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

SYSTEM INFO
- Free GPUs:
  -  [1e02:10de] (0) NVIDIA TITAN RTX [current utilization: 2%]
MODEL PROFILES
- Compatible with system and runnable:
  - 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
  - With LoRA support:
    - 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)
- Incompatible with system:
  - dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)
  - f59d52b0715ee1ecf01e6759dea23655b93ed26b12e57126d9ec43b397ea2b87 (tensorrt_llm-l40s-fp8-tp2-latency)
  - 30b562864b5b1e3b236f7b6d6a0998efbed491e4917323d04590f715aa9897dc (tensorrt_llm-h100-fp8-tp1-throughput)
  - 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b (tensorrt_llm-l40s-fp8-tp1-throughput)
  - a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-a100-fp16-tp2-latency)
  - e0f4a47844733eb57f9f9c3566432acb8d20482a1d06ec1c0d71ece448e21086 (tensorrt_llm-a10g-fp16-tp2-latency)
  - 879b05541189ce8f6323656b25b7dff1930faca2abe552431848e62b7e767080 (tensorrt_llm-h100-fp16-tp2-latency)
  - 24199f79a562b187c52e644489177b6a4eae0c9fdad6f7d0a8cb3677f5b1bc89 (tensorrt_llm-l40s-fp16-tp2-latency)
  - 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-a100-fp16-tp1-throughput)
  - c334b76d50783655bdf62b8138511456f7b23083553d310268d0d05f254c012b (tensorrt_llm-a10g-fp16-tp1-throughput)
  - cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1 (tensorrt_llm-h100-fp16-tp1-throughput)
  - d8dd8af82e0035d7ca50b994d85a3740dbd84ddb4ed330e30c509e041ba79f80 (tensorrt_llm-l40s-fp16-tp1-throughput)
  - 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
  - 9137f4d51dadb93c6b5864a19fd7c035bf0b718f3e15ae9474233ebd6468c359 (tensorrt_llm-a10g-fp16-tp2-throughput-lora)
  - cce57ae50c3af15625c1668d5ac4ccbe82f40fa2e8379cc7b842cc6c976fd334 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
  - 3bdf6456ff21c19d5c7cc37010790448a4be613a1fd12916655dfab5a0dd9b8e (tensorrt_llm-h100-fp16-tp1-throughput-lora)
  - 388140213ee9615e643bda09d85082a21f51622c07bde3d0811d7c6998873a0b (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
  - c5ffce8f82de1ce607df62a4b983e29347908fb9274a0b7a24537d6ff8390eb9 (vllm-fp16-tp2-lora)


mistral-7B-instruct-v0.3

There is a profile that will run on my Titan RTX.

$ docker run --gpus all nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest list-model-profiles

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/

A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0).

SYSTEM INFO
- Free GPUs:
  -  [1e02:10de] (0) NVIDIA TITAN RTX [current utilization: 2%]
MODEL PROFILES
- Compatible with system and runnable:
  - 7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 (vllm-fp16-tp1)
  - With LoRA support:
    - 114fc68ad2c150e37eb03a911152f342e4e7423d5efb769393d30fa0b0cd1f9e (vllm-fp16-tp1-lora)
- Incompatible with system:
  - 48004baf4f45ca177aa94abfd3c5c54858808ad728914b1626c3cf038ea85bc4 (tensorrt_llm-h100-fp8-tp2-latency)
  - 5c17c27186b232e834aee9c61d1f5db388874da40053d70b84fd1386421ff577 (tensorrt_llm-l40s-fp8-tp2-latency)
  - 08ab4363f225c19e3785b58408fa4dcac472459cca1febcfaffb43f873557e87 (tensorrt_llm-h100-fp8-tp1-throughput)
  - cc18942f40e770aa27a0b02c1f5bf1458a6fedd10a1ed377630d30d71a1b36db (tensorrt_llm-l40s-fp8-tp1-throughput)
  - dea9af90d5311ff2d651db8c16f752d014053d3b1c550474cbeda241f81c96bd (tensorrt_llm-a100-fp16-tp2-latency)
  - 6064ab4c33a1c6da8058422b8cb0347e72141d203c77ba309ce5c5533f548188 (tensorrt_llm-h100-fp16-tp2-latency)
  - ef22c7cecbcf2c8b3889bd58a48095e47a8cc0394d221acda1b4087b46c6f3e9 (tensorrt_llm-l40s-fp16-tp2-latency)
  - c79561a74f97b157de12066b7a137702a4b09f71f4273ff747efe060881fca92 (tensorrt_llm-a100-fp16-tp1-throughput)
  - 8833b9eba1bd4fbed4f764e64797227adca32e3c1f630c2722a8a52fee2fd1fa (tensorrt_llm-h100-fp16-tp1-throughput)
  - 95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a (tensorrt_llm-l40s-fp16-tp1-throughput)
  - 7387979dae9c209b33010e5da9aae4a94f75d928639ba462201e88a5dd4ac185 (vllm-fp16-tp2)
  - 2c57f0135f9c6de0c556ba37f43f55f6a6c0a25fe0506df73e189aedfbd8b333 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
  - 8f9730e45a88fb2ac16ce2ce21d7460479da1fd8747ba32d2b92fc4f6140ba83 (tensorrt_llm-h100-fp16-tp1-throughput-lora)
  - eb445d1e451ed3987ca36da9be6bb4cdd41e498344cbf477a1600198753883ff (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
  - 5797a519e300612f87f8a4a50a496a840fa747f7801b2dcd0cc9a3b4b949dd92 (vllm-fp16-tp2-lora)

mixtral-8x7b-instruct-v0.1

This will not run on my hardware.

$ docker run --gpus all nvcr.io/nim/mistralai/mixtral-8x7b-instruct-v01:latest list-model-profiles

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mixtral-8x7b-instruct-v0.1

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/

A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0).

SYSTEM INFO
- Free GPUs:
  -  [1e02:10de] (0) NVIDIA TITAN RTX [current utilization: 2%]
MODEL PROFILES
- Compatible with system and runnable: <None>
- Incompatible with system:
  - d37580fa5deabc5a4cb17a2337e8cc672b19eaf2791cf319fd16582065e40816 (tensorrt_llm-h100-fp8-tp4-latency)
  - 00056b81c2e41eb9b847342ed553ae88614f450f3f15eebfd2ae56174484bacd (tensorrt_llm-h100-fp8-tp2-throughput)
  - e249e70e3ee390e606782eab19e7a9cf2aeb865bdbc638aaf0fc580901492841 (tensorrt_llm-a100-fp16-tp4-latency)
  - 9972482479f39ecacc3f470aaa7d0de7b982a1b18f907aafdb8517db5643e05a (tensorrt_llm-h100-fp16-tp4-latency)
  - ad3c46c1c8d71bb481205732787f2c157a9cfc9b6babef5860518a047e155639 (tensorrt_llm-l40s-fp16-tp4-throughput)
  - 9865374899b6ac3a1e25e47644f3d66753288e9d949d883b14c3f55b98fb2ebc (tensorrt_llm-a100-fp16-tp2-throughput)
  - 1f859af2be6c57528dc6d32b6062c9852605d8f2d68bbe76a43b65ebc5ac738d (tensorrt_llm-h100-fp16-tp2-throughput)
  - ee616a54bea8e869009748eefb0d905b168d2095d0cdf66d40f3a5612194d170 (tensorrt_llm-h100-int8wo-tp4-latency)
  - 01f1ad019f55abb76f10f1687f76ea8e5d2f3d51d6831ddc582d979ff210b4cb (tensorrt_llm-h100-int8wo-tp2-throughput)
  - da767e18d66e067f2c5c2c2257171b8b8331801fffdea98fc8e48b8958549388 (vllm-fp16-tp4)
  - 37d1a6210357770f7f6fe5fcdb5f8da11e3863a7274ccde8ff97e4ffc7d17006 (vllm-fp16-tp2)
  - 2289b29507d987154efb5ff12b41378323147e28ba2660490737cb8d8544d039 (vllm-fp16-tp4-lora)
  - a0aff0e3bf2062cc42f13556b28eb66a0764d8d57c42c20dd8814f919118a127 (vllm-fp16-tp2-lora)


Running the model containers

I want to run the models on my Turing generation 24GB card. 

You can find the model container image run commands on the large language models page. The container run command uses the image we previously downloaded as part of validation.

Note: Some of the images download additional model information on startup.  In my experience, this happens once.

The following command runs a model with a specific profile. It fails on my machine because it tries to allocate 32GB of video memory; the profile is compatible, but the card is slightly too small.

Note: MY_API_KEY is the "nvapi-..." key you generated for the model.
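
Pulling the image also requires being logged into nvcr.io with that key. The login looks roughly like this; note that NGC expects the literal string $oauthtoken as the username, so it is single-quoted to keep the shell from expanding it:

export MY_API_KEY="nvapi-YOUR-KEY"

# The username really is the literal string $oauthtoken; the API key is the password
echo "$MY_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin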

 docker run -it --rm --gpus all --shm-size=16GB \
    -e NGC_API_KEY=$MY_API_KEY \
    -e NIM_MODEL_PROFILE=7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 \
    -v "/home/joe/.cache/nim:/opt/nim/.cache" -u $(id -u) -p 8000:8000 \
    nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest

The following command runs a model with a specific profile, overrides the data type, and caps the maximum model length so that it fits on the 24GB card.

 docker run -it --rm --gpus all --shm-size=16GB \
    -e NGC_API_KEY=$MY_API_KEY \
    -e NIM_MODEL_PROFILE=7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 \
    -v "/home/joe/.cache/nim:/opt/nim/.cache" -u $(id -u) -p 8000:8000 \
    nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest \
    python3 -m vllm_nvext.entrypoints.openai.api_server --dtype half --max-model-len 26688


Why 26688?  I first tried 26000 and then 28000. 28000 failed and the error told me that 26688 was the maximum length I could use. It looks like I could set it at 18000 as well. With this setting the model uses 19.5GB of GPU memory.
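
If you want to watch how close the running model gets to the 24GB limit, the host's nvidia-smi will report it. This is just a convenience check, not something the NIM requires:

# Per-GPU memory usage while the NIM is serving
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv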

Impact of shrinking the model length

From an NVIDIA forum post:

Shrinking the sequence length is a good way of decreasing the memory requirements – basically, it means that the size of the KV cache is limited, which can be a very large portion of the memory usage. The downside is that you won’t be able to send/generate messages that are quite as long. Otherwise, the model accuracy shouldn’t be affected.
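
As a rough sanity check on why the maximum model length matters so much, the fp16 KV cache grows linearly with it. Assuming Mistral 7B v0.3's published shape (32 layers, 8 KV heads, head dimension 128), a back-of-envelope estimate for a single 26688-token sequence is:

# key + value (x2), fp16 (2 bytes), 32 layers, 8 KV heads, head dim 128, 26688 tokens
echo "$(( 2 * 2 * 32 * 8 * 128 * 26688 / 1024 / 1024 )) MB"   # roughly 3.3GB of KV cache

That is on top of the roughly 14.5GB the fp16 weights themselves need, which is why the 24GB card has so little headroom.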

Testing

A simple call to the deployed NIM endpoint

$ curl -X 'POST'   'http://localhost:8000/v1/chat/completions'   \
  -H 'accept: application/json'   \
  -H 'Content-Type: application/json'   \
  -d '{
    "model": "mistralai/mistral-7b-instruct-v0.3",
    "messages": [
      {
        "role":"user",
        "content":"Hello! How are you?"
      },
      {
        "role":"assistant",
        "content":"Hi! I am quite well, how can I help you today?"
      },
      {
        "role":"user",
        "content":"Can you write me a song?"
      }
    ],
    "top_p": 1,
    "n": 1,
    "max_tokens": 15,
    "stream": true,
    "frequency_penalty": 1.0,
    "stop": ["hello"]
  }'
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"role":"assistant"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"S"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"ur"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"e! I "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"will w"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"ri"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"te a s"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"hort"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" and si"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"mple "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"song"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" for"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"y"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"ou:\n\n"},"finish_reason":"length"}],"usage":{"prompt_tokens":36,"total_tokens":51,"completion_tokens":15}}


Running embedding and reranker models locally

We can run the NIM Anywhere text embedding and re-ranker models locally if we have enough GPU space. We have to run the Docker containers manually. Both of these containers fetch model components on startup, and that fetch requires an API key that is not the "nvapi-..." key you used to pull the containers or log into the image registry. There are two keys.

Get a key from https://org.ngc.nvidia.com/setup/api-key by pressing Generate API Key. It is confusing that you need the registry ("nvapi-...") key to fetch the image and then need to re-run the command with the other API key in the same variable so the container can download its model components.

Model                              GPU Memory   Reference
nvidia/nv-embedqa-e5-v5            1.3GB        https://build.nvidia.com/nvidia/nv-embedqa-e5-v5
nvidia/nv-rerankqa-mistral-4b-v3   8.5GB        https://build.nvidia.com/nvidia/nv-rerankqa-mistral-4b-v3

NVIDIA Embedding QA E5 Embedding Model

Purpose: GPU-accelerated generation of text embeddings used for question-answering retrieval.

nv-embedqa-e5-v5: Used in the NIM Anywhere project for converting text into vector embeddings. See https://build.nvidia.com/nvidia/nv-embedqa-e5-v5?snippet_tab=Docker. This model uses 1.3GB of GPU memory.


export NGC_API_KEY=<an api key>

export NIM_MODEL_NAME=nvidia/nv-embedqa-e5-v5
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:1.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
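
Once the container logs show it is ready, a quick smoke test of the embeddings endpoint looks roughly like this. The route and the input_type field follow the snippet on the model's build.nvidia.com page; treat them as assumptions if your image version differs:

curl -X 'POST' 'http://localhost:8000/v1/embeddings' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/nv-embedqa-e5-v5",
    "input": ["What NIM models run in 24GB of VRAM?"],
    "input_type": "query"
  }'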


Make the following changes when deploying rather than testing. We want to run detached, and the container will only be used from within the container cluster (a sketch follows the list).
  1. Remove the line with "-it" and add "-d" so the container runs detached.
  2. Remove the line with "-p 8000:8000".
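
A minimal sketch of the deployment variant, assuming the other containers reach it over a user-defined Docker network that I am calling nim-net here (the network name is mine, not NVIDIA's):

# Create the shared network once (ignore the error if it already exists)
docker network create nim-net 2>/dev/null || true

docker run -d --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  --network nim-net \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  $IMG_NAME

Other containers on nim-net can then reach it at http://nv-embedqa-e5-v5:8000 without publishing the port on the host.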

NVIDIA rerankqa-mistral-4b-v3

Purpose: GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question.

nvidia/nv-rerankqa-mistral-4b-v3: Used in the NIM Anywhere project for re-ranking retrieved passages. You can find it at https://build.nvidia.com/nvidia/nv-rerankqa-mistral-4b-v3. This model uses 8.5GB of GPU memory.



export NGC_API_KEY=<an api key>

export NIM_MODEL_NAME=nvidia/nv-rerankqa-mistral-4b-v3
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:1.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
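
After the reranker starts, a request like the one below exercises it. The /v1/ranking route and the query/passages payload follow NVIDIA's reranking NIM examples; adjust if your image version expects something different:

curl -X 'POST' 'http://localhost:8000/v1/ranking' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/nv-rerankqa-mistral-4b-v3",
    "query": {"text": "which way should I go?"},
    "passages": [
      {"text": "two roads diverged in a yellow wood"},
      {"text": "the quick brown fox jumps over the lazy dog"}
    ]
  }'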


Make the following changes when deploying rather than testing. We want to run detached, and the container will only be used from within the container cluster (the same changes as for the embedding model above).
  1. Remove the line with "-it" and add "-d" so the container runs detached.
  2. Remove the line with "-p 8000:8000".

A script that uses compatible profiles

This is a script I used to try different compatible profiles on my NVIDIA Turing vintage hardware. The profiles were identified using the list-model-profiles runs above.


export MY_API_KEY="nvapi-YOUR-KEY"

MODEL="nvcr.io/nim/meta/llama-3.1-8b-instruct:latest"
MODEL_PROFILE="3bb4e8fe78e5037b05dd618cebb1053347325ad6a1e709e0eb18bb8558362ac5"
MODEL_MAX_LEN=19456

# MODEL="nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest"
# MODEL_PROFILE="7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789"
# MODEL_MAX_LEN=26688

# MODEL="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0"
# MODEL_PROFILE="8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d"
# MODEL_MAX_LEN=8192

#docker login --username '$oauthtoken' --password "$MY_API_KEY" nvcr.io
docker login nvcr.io


docker run  --rm --gpus all --shm-size=16GB \
    -e NGC_API_KEY=$MY_API_KEY \
    -e NIM_MODEL_PROFILE=$MODEL_PROFILE \
    -v "/home/joe/.cache/nim:/opt/nim/.cache" -u $(id -u) -p 8000:8000 \
    $MODEL \
    python3 -m vllm_nvext.entrypoints.openai.api_server \
    --dtype half \
    --max-model-len $MODEL_MAX_LEN
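
The model can take a few minutes to load, so I wait for the health endpoint from another terminal before sending any requests. The /v1/health/ready route is the one documented for the LLM NIMs; treat it as an assumption for other NIM types:

# Poll until the NIM reports it is ready to serve requests
until curl -sf http://localhost:8000/v1/health/ready > /dev/null; do
  sleep 5
done
echo "NIM is ready"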


Links


Revision History

Created 2024/08
Added text and Riva support matrix links 2024/08
