Manually validating compatibility and running NVIDIA NIM container images

NVIDIA NIMs are ready-to-run, pre-packaged containerized models. Each NIM and its included model are published with a variety of profiles that support different compute hardware configurations. You can run a NIM in a query mode that tells you which model profiles are compatible with your GPU hardware, and then run the NIM with the associated profile.

Sometimes there are still problems, and we have to add tuning parameters to fit the model in memory or change data types. In my case, the data type change works around what looks like a bug in the NIM startup detection code.

This article requires additional polish.  It has more than a few rough edges.  

NVIDIA NIMs are semi-opaque. You cannot build your own NIM, and NVIDIA does not describe how the NIMs are constructed.

Examining NVIDIA Model Container Images

The first step is to select models we think can fit and run on our NVIDIA GPU hardware. Investigate the different model types by visiting the appropriate NVIDIA NIM docs, and look at the large language models page for instructions on how to run NIM containers locally.

Test platform and plan

Our basic plan is:
  • Run this test on Ubuntu Linux
  • Host the models locally in my single NVIDIA Titan RTX 24GB card system.  
  • Use the CLI for all testing
Because of hardware limitations, I will be using non-optimized models, as described on the support matrix page.
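
Before pulling any images, it is worth a quick sanity check that the driver sees the card and that Docker can reach the GPU. This is a minimal sketch using standard nvidia-smi queries and the Docker --gpus flag; exact output varies by driver version.

# Show the GPU name, driver version, and total/free VRAM
nvidia-smi --query-gpu=name,driver_version,memory.total,memory.free --format=csv

# Confirm the NVIDIA container toolkit is wired into Docker
docker run --rm --gpus all ubuntu nvidia-smi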

Some Models

Model | Works for Me | Additional details
meta-llama-3-70b-instruct | No | Support matrix: meta llama 3 70b instruct. It requires a minimum of 240GB of VRAM.
meta-llama-3-8b-instruct | Yes | This is the model used in the NIM Anywhere project. Support matrix: meta llama 3 8b instruct. It can run on a single card in 24GB of VRAM with fp16 precision.
meta-llama-3.1-8b-base | Yes | This is the model used in the NIM Anywhere project. Support matrix: meta llama 3.1 8b base. It can run on a single card in 24GB of VRAM with fp16 precision.
meta-llama-3.1-70b-instruct | No | Support matrix: meta llama 3.1 70b instruct.
meta-llama-3.1-405b-instruct | No | Support matrix: meta llama 3.1 405b instruct.
meta-llama-3.1-8b-instruct | Yes | This is the model used in the NIM Anywhere project. Support matrix: meta llama 3.1 8b instruct. It can run on a single card in 24GB of VRAM with fp16 precision.
mistral-7B-instruct-v0.3 | Yes | Support matrix: mistral 7b instruct v0.3. It can run in 24GB of VRAM with fp16 precision.
mixtral-8x7b-instruct-v0.1 | No | Support matrix: mixtral 8x7b v0.1.
mixtral-8x22b-instruct-v0.1 | No | Support matrix: mixtral 8x22b v0.1.
other | n/a | n/a

Prerequisites

You need to create credentials that can be used by the various commands. You will get a 403 Forbidden when trying to pull a container image if you don't have credentials.
  1. This assumes you have an nvapi- type key that gives you access.
  2. Log into nvcr.io using the Docker CLI. You may see these logins in some of the script logs below.

(base) joe@rocks:~$ docker login nvcr.io
Username: $oauthtoken
Password: <nvapi key here>
WARNING! Your password will be stored unencrypted in /home/joe/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
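
If you would rather not type the key interactively, you can pipe it to docker login instead. The username for nvcr.io is the literal string $oauthtoken (single-quoted so the shell does not expand it); the environment variable name here is just an example.

export NGC_API_KEY="nvapi-<your key>"
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin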

A Video demonstration

https://youtu.be/JtufRBLFwxM

Validating compatibility

We can ask the models to validate themselves against our hardware. Note that each of these tests downloads the model's container image to your machine. The downloads can be large and consume a lot of disk space; the intermediate layers were 29.8GB.


REPOSITORY                                          SIZE
project-hybrid-rag                                  29.7GB
nvcr.io/nim/meta/llama-3.1-405b-instruct            12.9GB
nvcr.io/nim/meta/llama-3.1-8b-instruct              12.9GB
nvcr.io/nim/meta/llama-3.1-70b-instruct             12.9GB
nvcr.io/nim/meta/llama-3.1-8b-base                  12.9GB
project-nim-anywhere                                4.59GB
nvcr.io/nim/nvidia/megatron-1b-nmt                  14.3GB
nvcr.io/nim/nvidia/parakeet-ctc-1.1b-asr            14.3GB
nvcr.io/nim/nvidia/fastpitch-hifigan-tts            14.3GB
nvcr.io/nim/snowflake/arctic-embed-l                15.7GB
nvcr.io/nim/nvidia/nv-embedqa-e5-v5                 15.7GB
nvcr.io/nim/nvidia/nv-embedqa-mistral-7b-v2         15.7GB
nvcr.io/nim/snowflake/arctic-embed-l                15.7GB
nvcr.io/nim/nvidia/nv-embedqa-mistral-7b-v2         15.7GB
nvcr.io/nim/nvidia/nv-embedqa-e5-v5                 15.7GB
<none>                                              4.34GB
nvcr.io/nim/mistralai/mixtral-8x22b-instruct-v01    12.5GB
nvcr.io/nim/mistralai/mixtral-8x7b-instruct-v01     12.5GB
project-gpu-sample                                  22.1GB
nvcr.io/nim/mistralai/mistral-7b-instruct-v03       12.5GB
covid-vaccinations-python                           1.88GB
project-covid-vaccinations-python                   1.23GB
project-dogfood                                     1.23GB
nvcr.io/nim/meta/llama3-70b-instruct                12.5GB
nvcr.io/nim/meta/llama3-8b-instruct                 12.5GB
redis                                               116MB
milvusdb/milvus                                     1.71GB
traefik                                             153MB


The output above shows the various images I downloaded during testing. All of the NIM images were pulled by running
docker run ... list-model-profiles
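
If disk space becomes a problem, you can reproduce a listing like the one above and delete images you are finished with. The image named in the removal example below is just one of the entries from my listing.

# List images with their sizes
docker images --format 'table {{.Repository}}\t{{.Tag}}\t{{.Size}}'

# Remove a NIM image you no longer need to reclaim space
docker rmi nvcr.io/nim/meta/llama3-70b-instruct:1.0.0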

Local RTX 3060 TI 8GB

None of the models will fit on this card.

nvcr.io/nim/meta/llama3-70b-instruct:1.0.0

Note that this image is reported as not compatible with this system. I'm not sure why it shows up as "not compatible" rather than "needs more memory".

$ docker run --gpus all nvcr.io/nim/meta/llama3-70b-instruct:1.0.0 list-model-profiles

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-70b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

SYSTEM INFO
- Free GPUs: <None>
- Non-free GPUs:
  -  [2489:10de] (0) NVIDIA GeForce RTX 3060 Ti [current utilization: 5%]
MODEL PROFILES
- Compatible with system and runnable: <None>
- Incompatible with system:
  - 93782c337ddaf3f6b442ef7baebd0a732e433b7b93ec06bfd3db27945de560b6 (tensorrt_llm-h100-fp8-tp8-latency)
  - 2e9b29c44b3d82821e7f4facd1a652ec5e0c4e7e473ee33451f6d046f616c3e5 (tensorrt_llm-l40s-fp8-tp8-latency)
  - 8bb3b94e326789381bbf287e54057005a9250e3abbad0b1702a70222529fcd17 (tensorrt_llm-h100-fp8-tp4-throughput)
  - 8b8e03de8630626b904b37910e3d82a26cebb99634a921a0e5c59cb84125efe8 (tensorrt_llm-l40s-fp8-tp4-throughput)
  - a90b2c0217492b1020bead4e7453199c9f965ea53e9506c007f4ff7b58ee01ff (tensorrt_llm-h100-fp16-tp8-latency)
  - 96b70da1414c7beb5cf671b3e7cf835078740764064c453cd86e84cf0491aac0 (tensorrt_llm-l40s-fp16-tp8-throughput)
  - b811296367317f5097ed9f71b8f08d2688b2411c852978ae49e8a0d5c3a30739 (tensorrt_llm-a100-fp16-tp4-throughput)
  - abcff5042bfc3fa9f4d1e715b2e016c11c6735855edfe2093e9c24e83733788e (tensorrt_llm-h100-fp16-tp4-throughput)
  - 7f8bb4a2b97cf07faf6fb930ba67f33671492b7653f2a06fe522c7de65b544ca (tensorrt_llm-a100-bf16-tp8-latency)
  - 7e8f6cc0d0fde672073a20d5423977a0a02a9b0693f0f3d4ffc2ec8ac35474d4 (vllm-fp16-tp8)
  - df45ca2c979e5c64798908815381c59159c1d08066407d402f00c6d4abd5b108 (vllm-fp16-tp4)
  - 03fdb4d11f01be10c31b00e7c0540e2835e89a0079b483ad2dd3c25c8cc29b61 (tensorrt_llm-l40s-fp16-tp8-throughput-lora)
  - 7ba9fbd93c41a28358215f3e94e79a2545ab44e39df016eb4c7d7cadc384bde7 (tensorrt_llm-a100-fp16-tp4-throughput-lora)
  - 36fc1fa4fc35c1d54da115a39323080b08d7937dceb8ba47be44f4da0ec720ff (tensorrt_llm-h100-fp16-tp4-throughput-lora)
  - 0f3de1afe11d355e01657424a267fbaad19bfea3143a9879307c49aed8299db0 (vllm-fp16-tp8-lora)
  - a30aae0ed459082efed26a7f3bc21aa97cccad35509498b58449a52c27698544 (vllm-fp16-tp4-lora)


nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

Used by NIM Anywhere as of 2024/07

There is a profile of this model that is compatible with the RTX 3060 Ti, but it can't be run because there is not enough free VRAM.

$ docker run --gpus all nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 list-model-profiles

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

SYSTEM INFO
- Free GPUs: <None>
- Non-free GPUs:
  -  [2489:10de] (0) NVIDIA GeForce RTX 3060 Ti [current utilization: 5%]
MODEL PROFILES
- Compatible with system and runnable: <None>
- Compatible with system but not runnable due to low GPU free memory
  - 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
  - With LoRA support:
    - 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)
- Incompatible with system:
  - dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)
  - f59d52b0715ee1ecf01e6759dea23655b93ed26b12e57126d9ec43b397ea2b87 (tensorrt_llm-l40s-fp8-tp2-latency)
  - 30b562864b5b1e3b236f7b6d6a0998efbed491e4917323d04590f715aa9897dc (tensorrt_llm-h100-fp8-tp1-throughput)
  - 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b (tensorrt_llm-l40s-fp8-tp1-throughput)
  - a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-a100-fp16-tp2-latency)
  - e0f4a47844733eb57f9f9c3566432acb8d20482a1d06ec1c0d71ece448e21086 (tensorrt_llm-a10g-fp16-tp2-latency)
  - 879b05541189ce8f6323656b25b7dff1930faca2abe552431848e62b7e767080 (tensorrt_llm-h100-fp16-tp2-latency)
  - 24199f79a562b187c52e644489177b6a4eae0c9fdad6f7d0a8cb3677f5b1bc89 (tensorrt_llm-l40s-fp16-tp2-latency)
  - 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-a100-fp16-tp1-throughput)
  - c334b76d50783655bdf62b8138511456f7b23083553d310268d0d05f254c012b (tensorrt_llm-a10g-fp16-tp1-throughput)
  - cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1 (tensorrt_llm-h100-fp16-tp1-throughput)
  - d8dd8af82e0035d7ca50b994d85a3740dbd84ddb4ed330e30c509e041ba79f80 (tensorrt_llm-l40s-fp16-tp1-throughput)
  - 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
  - 9137f4d51dadb93c6b5864a19fd7c035bf0b718f3e15ae9474233ebd6468c359 (tensorrt_llm-a10g-fp16-tp2-throughput-lora)
  - cce57ae50c3af15625c1668d5ac4ccbe82f40fa2e8379cc7b842cc6c976fd334 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
  - 3bdf6456ff21c19d5c7cc37010790448a4be613a1fd12916655dfab5a0dd9b8e (tensorrt_llm-h100-fp16-tp1-throughput-lora)
  - 388140213ee9615e643bda09d85082a21f51622c07bde3d0811d7c6998873a0b (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
  - c5ffce8f82de1ce607df62a4b983e29347908fb9274a0b7a24537d6ff8390eb9 (vllm-fp16-tp2-lora)


Validating compatibility with local Titan RTX 24GB

nvcr.io/nim/meta/llama3-70b-instruct:1.0.0

This system is not compatible with the model.  I'm not sure why.

$ docker run --gpus all nvcr.io/nim/meta/llama3-70b-instruct:1.0.0 list-model-profiles

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-70b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

SYSTEM INFO
- Free GPUs:
  -  [1e02:10de] (0) NVIDIA TITAN RTX [current utilization: 2%]
MODEL PROFILES
- Compatible with system and runnable: <None>
- Incompatible with system:
  - 93782c337ddaf3f6b442ef7baebd0a732e433b7b93ec06bfd3db27945de560b6 (tensorrt_llm-h100-fp8-tp8-latency)
  - 2e9b29c44b3d82821e7f4facd1a652ec5e0c4e7e473ee33451f6d046f616c3e5 (tensorrt_llm-l40s-fp8-tp8-latency)
  - 8bb3b94e326789381bbf287e54057005a9250e3abbad0b1702a70222529fcd17 (tensorrt_llm-h100-fp8-tp4-throughput)
  - 8b8e03de8630626b904b37910e3d82a26cebb99634a921a0e5c59cb84125efe8 (tensorrt_llm-l40s-fp8-tp4-throughput)
  - a90b2c0217492b1020bead4e7453199c9f965ea53e9506c007f4ff7b58ee01ff (tensorrt_llm-h100-fp16-tp8-latency)
  - 96b70da1414c7beb5cf671b3e7cf835078740764064c453cd86e84cf0491aac0 (tensorrt_llm-l40s-fp16-tp8-throughput)
  - b811296367317f5097ed9f71b8f08d2688b2411c852978ae49e8a0d5c3a30739 (tensorrt_llm-a100-fp16-tp4-throughput)
  - abcff5042bfc3fa9f4d1e715b2e016c11c6735855edfe2093e9c24e83733788e (tensorrt_llm-h100-fp16-tp4-throughput)
  - 7f8bb4a2b97cf07faf6fb930ba67f33671492b7653f2a06fe522c7de65b544ca (tensorrt_llm-a100-bf16-tp8-latency)
  - 7e8f6cc0d0fde672073a20d5423977a0a02a9b0693f0f3d4ffc2ec8ac35474d4 (vllm-fp16-tp8)
  - df45ca2c979e5c64798908815381c59159c1d08066407d402f00c6d4abd5b108 (vllm-fp16-tp4)
  - 03fdb4d11f01be10c31b00e7c0540e2835e89a0079b483ad2dd3c25c8cc29b61 (tensorrt_llm-l40s-fp16-tp8-throughput-lora)
  - 7ba9fbd93c41a28358215f3e94e79a2545ab44e39df016eb4c7d7cadc384bde7 (tensorrt_llm-a100-fp16-tp4-throughput-lora)
  - 36fc1fa4fc35c1d54da115a39323080b08d7937dceb8ba47be44f4da0ec720ff (tensorrt_llm-h100-fp16-tp4-throughput-lora)
  - 0f3de1afe11d355e01657424a267fbaad19bfea3143a9879307c49aed8299db0 (vllm-fp16-tp8-lora)
  - a30aae0ed459082efed26a7f3bc21aa97cccad35509498b58449a52c27698544 (vllm-fp16-tp4-lora)


nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

Used by NIM Anywhere as of 2024/07

There is a profile of this model that is compatible and that fits inside our 24GB of VRAM.  

$ docker run --gpus all nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 list-model-profiles

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

SYSTEM INFO
- Free GPUs:
  -  [1e02:10de] (0) NVIDIA TITAN RTX [current utilization: 2%]
MODEL PROFILES
- Compatible with system and runnable:
  - 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
  - With LoRA support:
    - 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)
- Incompatible with system:
  - dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)
  - f59d52b0715ee1ecf01e6759dea23655b93ed26b12e57126d9ec43b397ea2b87 (tensorrt_llm-l40s-fp8-tp2-latency)
  - 30b562864b5b1e3b236f7b6d6a0998efbed491e4917323d04590f715aa9897dc (tensorrt_llm-h100-fp8-tp1-throughput)
  - 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b (tensorrt_llm-l40s-fp8-tp1-throughput)
  - a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-a100-fp16-tp2-latency)
  - e0f4a47844733eb57f9f9c3566432acb8d20482a1d06ec1c0d71ece448e21086 (tensorrt_llm-a10g-fp16-tp2-latency)
  - 879b05541189ce8f6323656b25b7dff1930faca2abe552431848e62b7e767080 (tensorrt_llm-h100-fp16-tp2-latency)
  - 24199f79a562b187c52e644489177b6a4eae0c9fdad6f7d0a8cb3677f5b1bc89 (tensorrt_llm-l40s-fp16-tp2-latency)
  - 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-a100-fp16-tp1-throughput)
  - c334b76d50783655bdf62b8138511456f7b23083553d310268d0d05f254c012b (tensorrt_llm-a10g-fp16-tp1-throughput)
  - cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1 (tensorrt_llm-h100-fp16-tp1-throughput)
  - d8dd8af82e0035d7ca50b994d85a3740dbd84ddb4ed330e30c509e041ba79f80 (tensorrt_llm-l40s-fp16-tp1-throughput)
  - 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
  - 9137f4d51dadb93c6b5864a19fd7c035bf0b718f3e15ae9474233ebd6468c359 (tensorrt_llm-a10g-fp16-tp2-throughput-lora)
  - cce57ae50c3af15625c1668d5ac4ccbe82f40fa2e8379cc7b842cc6c976fd334 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
  - 3bdf6456ff21c19d5c7cc37010790448a4be613a1fd12916655dfab5a0dd9b8e (tensorrt_llm-h100-fp16-tp1-throughput-lora)
  - 388140213ee9615e643bda09d85082a21f51622c07bde3d0811d7c6998873a0b (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
  - c5ffce8f82de1ce607df62a4b983e29347908fb9274a0b7a24537d6ff8390eb9 (vllm-fp16-tp2-lora)


mistral-7B-instruct-v0.3

There is a profile that will run on my Titan RTX.

$ docker run --gpus all nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest list-model-profiles

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/

A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0).

SYSTEM INFO
- Free GPUs:
  -  [1e02:10de] (0) NVIDIA TITAN RTX [current utilization: 2%]
MODEL PROFILES
- Compatible with system and runnable:
  - 7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 (vllm-fp16-tp1)
  - With LoRA support:
    - 114fc68ad2c150e37eb03a911152f342e4e7423d5efb769393d30fa0b0cd1f9e (vllm-fp16-tp1-lora)
- Incompatible with system:
  - 48004baf4f45ca177aa94abfd3c5c54858808ad728914b1626c3cf038ea85bc4 (tensorrt_llm-h100-fp8-tp2-latency)
  - 5c17c27186b232e834aee9c61d1f5db388874da40053d70b84fd1386421ff577 (tensorrt_llm-l40s-fp8-tp2-latency)
  - 08ab4363f225c19e3785b58408fa4dcac472459cca1febcfaffb43f873557e87 (tensorrt_llm-h100-fp8-tp1-throughput)
  - cc18942f40e770aa27a0b02c1f5bf1458a6fedd10a1ed377630d30d71a1b36db (tensorrt_llm-l40s-fp8-tp1-throughput)
  - dea9af90d5311ff2d651db8c16f752d014053d3b1c550474cbeda241f81c96bd (tensorrt_llm-a100-fp16-tp2-latency)
  - 6064ab4c33a1c6da8058422b8cb0347e72141d203c77ba309ce5c5533f548188 (tensorrt_llm-h100-fp16-tp2-latency)
  - ef22c7cecbcf2c8b3889bd58a48095e47a8cc0394d221acda1b4087b46c6f3e9 (tensorrt_llm-l40s-fp16-tp2-latency)
  - c79561a74f97b157de12066b7a137702a4b09f71f4273ff747efe060881fca92 (tensorrt_llm-a100-fp16-tp1-throughput)
  - 8833b9eba1bd4fbed4f764e64797227adca32e3c1f630c2722a8a52fee2fd1fa (tensorrt_llm-h100-fp16-tp1-throughput)
  - 95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a (tensorrt_llm-l40s-fp16-tp1-throughput)
  - 7387979dae9c209b33010e5da9aae4a94f75d928639ba462201e88a5dd4ac185 (vllm-fp16-tp2)
  - 2c57f0135f9c6de0c556ba37f43f55f6a6c0a25fe0506df73e189aedfbd8b333 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
  - 8f9730e45a88fb2ac16ce2ce21d7460479da1fd8747ba32d2b92fc4f6140ba83 (tensorrt_llm-h100-fp16-tp1-throughput-lora)
  - eb445d1e451ed3987ca36da9be6bb4cdd41e498344cbf477a1600198753883ff (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
  - 5797a519e300612f87f8a4a50a496a840fa747f7801b2dcd0cc9a3b4b949dd92 (vllm-fp16-tp2-lora)

mixtral-8x7b-instruct-v0.1

This will not run on my hardware.

$ docker run --gpus all nvcr.io/nim/mistralai/mixtral-8x7b-instruct-v01:latest list-model-profiles

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mixtral-8x7b-instruct-v0.1

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/

A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0).

SYSTEM INFO
- Free GPUs:
  -  [1e02:10de] (0) NVIDIA TITAN RTX [current utilization: 2%]
MODEL PROFILES
- Compatible with system and runnable: <None>
- Incompatible with system:
  - d37580fa5deabc5a4cb17a2337e8cc672b19eaf2791cf319fd16582065e40816 (tensorrt_llm-h100-fp8-tp4-latency)
  - 00056b81c2e41eb9b847342ed553ae88614f450f3f15eebfd2ae56174484bacd (tensorrt_llm-h100-fp8-tp2-throughput)
  - e249e70e3ee390e606782eab19e7a9cf2aeb865bdbc638aaf0fc580901492841 (tensorrt_llm-a100-fp16-tp4-latency)
  - 9972482479f39ecacc3f470aaa7d0de7b982a1b18f907aafdb8517db5643e05a (tensorrt_llm-h100-fp16-tp4-latency)
  - ad3c46c1c8d71bb481205732787f2c157a9cfc9b6babef5860518a047e155639 (tensorrt_llm-l40s-fp16-tp4-throughput)
  - 9865374899b6ac3a1e25e47644f3d66753288e9d949d883b14c3f55b98fb2ebc (tensorrt_llm-a100-fp16-tp2-throughput)
  - 1f859af2be6c57528dc6d32b6062c9852605d8f2d68bbe76a43b65ebc5ac738d (tensorrt_llm-h100-fp16-tp2-throughput)
  - ee616a54bea8e869009748eefb0d905b168d2095d0cdf66d40f3a5612194d170 (tensorrt_llm-h100-int8wo-tp4-latency)
  - 01f1ad019f55abb76f10f1687f76ea8e5d2f3d51d6831ddc582d979ff210b4cb (tensorrt_llm-h100-int8wo-tp2-throughput)
  - da767e18d66e067f2c5c2c2257171b8b8331801fffdea98fc8e48b8958549388 (vllm-fp16-tp4)
  - 37d1a6210357770f7f6fe5fcdb5f8da11e3863a7274ccde8ff97e4ffc7d17006 (vllm-fp16-tp2)
  - 2289b29507d987154efb5ff12b41378323147e28ba2660490737cb8d8544d039 (vllm-fp16-tp4-lora)
  - a0aff0e3bf2062cc42f13556b28eb66a0764d8d57c42c20dd8814f919118a127 (vllm-fp16-tp2-lora)


Running the model containers

I want to run the models on my Turing generation 24GB card. 

You can find the model container run commands on the large language models page. The container run command uses the image we previously downloaded as part of validation.

Note: Some of the images download additional model information on startup.  In my experience, this happens once.
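
The downloaded pieces persist because of the cache volume mounted in the commands below. Assuming the same /home/joe/.cache/nim path used in those commands, you can see what has already been fetched with:

# Check what the NIM has cached locally so startup downloads are not repeated
du -sh /home/joe/.cache/nim
ls /home/joe/.cache/nim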

The following command runs a model with a specific profile. It fails on my machine because it tries to grab 32GB of video memory; the profile is compatible, but the card is slightly too small.

Note: MY_API_KEY is the nvapi- key you got for the model

 docker run -it --rm --gpus all --shm-size=16GB \
    -e NGC_API_KEY=$MY_API_KEY \
    -e NIM_MODEL_PROFILE=7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 \
    -v "/home/joe/.cache/nim:/opt/nim/.cache" -u $(id -u) -p 8000:8000 \
    nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest

The following command runs a model with a specific profile, overrides the data type, and limits the maximum model (context) length so it fits the 24GB card.

 docker run -it --rm --gpus all --shm-size=16GB \
    -e NGC_API_KEY=$MY_API_KEY \
    -e NIM_MODEL_PROFILE=7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 \
    -v "/home/joe/.cache/nim:/opt/nim/.cache" -u $(id -u) -p 8000:8000 \
    nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest \
    python3 -m vllm_nvext.entrypoints.openai.api_server --dtype half --max-model-len 26688


Why 26688? I first tried 26000 and then 28000. The 28000 run failed and told me that 26688 was the maximum value I could use. It looks like I could set it at 18000. With this setting the model uses 19.5GB of GPU memory.
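
One way to confirm how much VRAM the model actually consumes (the 19.5GB figure above) is to watch the card while the container starts. This is just a monitoring sketch with nvidia-smi:

# Sample GPU memory use every 5 seconds while the NIM loads
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5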

Impact of shrinking the model length

From an NVIDIA forum post:

Shrinking the sequence length is a good way of decreasing the memory requirements – basically, it means that the size of the KV cache is limited, which can be a very large portion of the memory usage. The downside is that you won’t be able to send/generate messages that are quite as long. Otherwise, the model accuracy shouldn’t be affected.
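
As a back-of-envelope illustration of why the KV cache dominates, here is a rough estimate for mistral-7b-instruct-v0.3 using its published configuration (32 layers, 8 KV heads, head dimension 128, fp16). These figures are assumptions taken from the open model card, not anything reported by the NIM.

# keys + values (x2) * layers * kv_heads * head_dim * 2 bytes (fp16) = bytes per token
echo $(( 2 * 32 * 8 * 128 * 2 ))            # 131072 bytes, about 0.13 MB per token
# at --max-model-len 26688 the cache for one full-length sequence is roughly 3.5 GB
echo $(( 2 * 32 * 8 * 128 * 2 * 26688 ))    # 3498049536 bytes

Lowering --max-model-len shrinks the largest possible KV cache without touching the model weights, which is why it helps the model fit on a smaller card.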

Testing

A simple call to the deployed NIM endpoint

$ curl -X 'POST'   'http://localhost:8000/v1/chat/completions'   \
  -H 'accept: application/json'   \
  -H 'Content-Type: application/json'   \
  -d '{
    "model": "mistralai/mistral-7b-instruct-v0.3",
    "messages": [
      {
        "role":"user",
        "content":"Hello! How are you?"
      },
      {
        "role":"assistant",
        "content":"Hi! I am quite well, how can I help you today?"
      },
      {
        "role":"user",
        "content":"Can you write me a song?"
      }
    ],
    "top_p": 1,
    "n": 1,
    "max_tokens": 15,
    "stream": true,
    "frequency_penalty": 1.0,
    "stop": ["hello"]
  }'
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"role":"assistant"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"S"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"ur"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"e! I "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"will w"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"ri"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"te a s"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"hort"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" and si"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"mple "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"song"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" for"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"y"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"ou:\n\n"},"finish_reason":"length"}],"usage":{"prompt_tokens":36,"total_tokens":51,"completion_tokens":15}}


Running embedding and reranker models locally

We can run the NIM Anywhere text embedding and reranker models locally if we have enough GPU memory. We'll have to run the Docker containers manually. Both of these containers fetch model components on startup, and that fetch requires an API key. It is not the "nvapi-..." key you used to create the containers or log into the image repository; there are two keys.

Generate the second key at https://org.ngc.nvidia.com/setup/api-key by pressing Generate API Key. It is confusing that you need the repository key (nvapi-...) to fetch the image and then need to re-run the command with the other API key in the same variable so the container can download its model components.

Model | GPU Memory | Reference
nvidia/nv-embedqa-e5-v5 | 1.3GB | https://build.nvidia.com/nvidia/nv-embedqa-e5-v5
nvidia/nv-rerankqa-mistral-4b-v3 | 8.5GB | https://build.nvidia.com/nvidia/nv-rerankqa-mistral-4b-v3

NVIDIA Embedding QA E5 Embedding Model

Purpose: GPU-accelerated generation of text embeddings used for question-answering retrieval.

nv-embedqa-e5-v5: Used in the NIM Anywhere project for converting text into vector embeddings. See https://build.nvidia.com/nvidia/nv-embedqa-e5-v5?snippet_tab=Docker. This model uses 1.3GB of GPU memory.


export NGC_API_KEY=<an api key>

export NIM_MODEL_NAME=nvidia/nv-embedqa-e5-v5
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:1.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
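
Once the container reports it is ready, a quick way to exercise it is the OpenAI-style embeddings endpoint. The request shape below, including the input_type field, follows the published nv-embedqa-e5-v5 examples, but treat it as a sketch and check the model page above if the payload is rejected.

curl -X 'POST' 'http://localhost:8000/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["What is the capital of France?"],
    "model": "nvidia/nv-embedqa-e5-v5",
    "input_type": "query"
  }'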


Make the following changes when deploying rather than testing. We want to run detached and will use the container only from within the container cluster:
  1. Remove the line with "-it"
  2. Remove the line with "-p 8000:8000"

NVIDIA rerankqa-mistral-4b-v3

Purpose: GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question.

nvidia/nv-rerankqa-mistral-4b-v3: Used in the NIM Anywhere project for <something>. You can find it at https://build.nvidia.com/nvidia/nv-rerankqa-mistral-4b-v3. This model uses 8.5GB of GPU memory.



export NGC_API_KEY=<an api key>

export NIM_MODEL_NAME=nvidia/nv-rerankqa-mistral-4b-v3
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:1.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
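
You can sanity-check the reranker the same way. The /v1/ranking request below mirrors the example on the model page linked above; the query and passages are placeholders, and the field names should be confirmed against that page.

curl -X 'POST' 'http://localhost:8000/v1/ranking' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/nv-rerankqa-mistral-4b-v3",
    "query": {"text": "which way should I go?"},
    "passages": [
      {"text": "two roads diverged in a yellow wood"},
      {"text": "the quick brown fox jumps over the lazy dog"}
    ]
  }'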


Make the following changes when deploying rather than testing. We want to run detached and will use the container only from within the container cluster:
  1. Remove the line with "-it"
  2. Remove the line with "-p 8000:8000"

A script that uses compatible profiles

This is a script I used to verify different profiles on my NVIDIA Turing-vintage hardware. The profiles were verified using the commands above.


export MY_API_KEY="nvapi-YOUR-KEY"

MODEL="nvcr.io/nim/meta/llama-3.1-8b-instruct:latest"
MODEL_PROFILE="3bb4e8fe78e5037b05dd618cebb1053347325ad6a1e709e0eb18bb8558362ac5"
MODEL_MAX_LEN=19456

# MODEL="nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest"
# MODEL_PROFILE="7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789"
# MODEL_MAX_LEN=26688

# MODEL="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0"
# MODEL_PROFILE="8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d"
# MODEL_MAX_LEN=8192

#docker login  --username $oauthtoken --password $MY_API_KEY  nvcr.io
docker login    nvcr.io


docker run  --rm --gpus all --shm-size=16GB \
    -e NGC_API_KEY=$MY_API_KEY \
    -e NIM_MODEL_PROFILE=$MODEL_PROFILE \
    -v "/home/joe/.cache/nim:/opt/nim/.cache" -u $(id -u) -p 8000:8000 \
    $MODEL \
    python3 -m vllm_nvext.entrypoints.openai.api_server \
    --dtype half \
    --max-model-len $MODEL_MAX_LEN


Links


Revision History

Created 2024/08
Added text and Riva support matrix links 2024/08
