Manually validating compatibility and running NVIDIA (NIM) container images
NVIDIA NIMs are ready-to-run, pre-packaged containerized models. NIMs and their included models are available in a variety of profiles supporting different compute hardware configurations. You can run a NIM in an interrogatory mode that reports which model profiles are compatible with your GPU hardware, and then run the NIM with the associated profile.
Sometimes there are still problems, and we have to add tuning parameters to fit the model in memory or change data types. In my case, the data type change was needed because of a bug in the NIM startup detection code.
This article requires additional polish. It has more than a few rough edges.
NVIDIA NIMs are semi-opaque. You cannot build your own NIM. NIM construction details are not described by NVIDIA.
Examining NVIDIA Model Container Images
The first step is to select models that we think can fit and run on our NVIDIA GPU hardware. Investigate the candidate models of each type by visiting the appropriate NVIDIA NIM docs.
This is the model used in the NIM Anywhere project. Support matrix: meta llama 3.1 8b base. It can run on a single card in 24GB of VRAM with fp16 precision.
You need to create credentials that can be used by the various commands. You will get a 403 Forbidden when trying to pull a container image if you don't have credentials.
This assumes you have an nvapi- type key that gives you access.
Log into nvcr.io using the Docker CLI. You may see these logins in some of the script logs below.
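A login along these lines works; the username is the literal string $oauthtoken and the password is the nvapi- key (held in NGC_API_KEY here):

# Log in to the NVIDIA container registry with the nvapi- key.
export NGC_API_KEY="nvapi-..."        # your key
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin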
We can ask the models to validate themselves against our hardware. Note that each one of these tests downloads the model image to your machine. The downloads can be large and take up a lot of disk space; the intermediate layers were 29.8GB in my case.
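A compatibility check along these lines asks each container to list the model profiles it ships and flag which ones are runnable on the local GPUs (assuming the image name is in $IMG_NAME and your NIM version supports the list-model-profiles utility). The license notices below are printed by the containers when they run.

# Ask the NIM to list its model profiles and flag the ones compatible with the local GPUs.
docker run --rm --runtime=nvidia --gpus all \
  -e NGC_API_KEY \
  $IMG_NAME list-model-profiles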
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
I want to run the models on my Turing generation 24GB card.
You can find the model container image run commands on the large language models page. The container run command will use the image we previously downloaded as part of validation.
Note: Some of the images download additional model information on startup. In my experience, this happens once.
The following command runs a model with a specific profile selected (a sketch appears after this note). It fails on my machine because it tries to grab 32GB of video memory. The profile is compatible, but the card is slightly too small.
Note: MY_API_KEY is the nvapi- key you got for the model.
Why 26688? I tried 26000, then 28000. 28000 failed and told me that 26688 was the maximum value I could use. It looks like I could have set it as low as 18000. With this setting the model uses 19.5GB of GPU memory.
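A minimal sketch of that run, assuming the profile ID is in $PROFILE and that NIM_MODEL_PROFILE and NIM_MAX_MODEL_LEN are the configuration variables for profile selection and maximum sequence length in your NIM version (check the NIM configuration docs for the exact names):

# Run the LLM NIM with an explicit profile and a reduced maximum model length.
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY=$MY_API_KEY \
  -e NIM_MODEL_PROFILE=$PROFILE \
  -e NIM_MAX_MODEL_LEN=26688 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME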
Impact of shrinking the model length
From an NVIDIA forum post:
Shrinking the sequence length is a good way of decreasing the memory requirements – basically, it means that the size of the KV cache is limited, which can be a very large portion of the memory usage. The downside is that you won’t be able to send/generate messages that are quite as long. Otherwise, the model accuracy shouldn’t be affected.
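The chunks below are sample streaming output from the locally running NIM. A request along these lines against the OpenAI-compatible endpoint produces them (assuming the NIM is listening on port 8000; the prompt is just an example):

# Stream a chat completion from the local NIM; -N disables curl output buffering.
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/mistral-7b-instruct-v0.3",
       "messages": [{"role": "user", "content": "Write a short story"}],
       "stream": true}'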
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"e! I "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"te a s"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" and si"},"logprobs":null,"finish_reason":null}]}
We can run the NIM Anywhere text embedding and re-ranker models locally if we have enough GPU space. We'll have to run the Docker containers manually. Both of these containers fetch model components on startup, and those model components require an API key. It is not the "nvapi-..." key you used to pull the containers or log into the image repository. There are two keys.
Get an API key at https://org.ngc.nvidia.com/setup/api-key by pressing Generate API Key. It is confusing that you need the repository key (nvapi-...) to fetch the image and then need to re-run the command with the NGC API key in the same variable so the container can download its model components.
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
Make the following changes when deploying rather than testing. We want to run detached and will use the container only from within the container cluster; a sketch of that variant follows this list.
Remove the line with "-it" (add "-d" to actually run detached)
Remove the line with "-p 8000:8000"
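A sketch of the deployment variant under those assumptions; other containers reach it by container name on port 8000 when they share a Docker network:

# Deployment variant: detached, no interactive TTY, no published host port.
docker run -d --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  $IMG_NAME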
NVIDIA rerankqa-mistral-4b-v3
Purpose: GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question.
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
Make the following changes when deploying rather than testing, just as with the embedding container above. We want to run detached and will use the container only from within the container cluster.
Remove the line with "-it" (add "-d" to actually run detached)
Remove the line with "-p 8000:8000"
A script that uses compatible profiles
This is a script I used to verify different profiles on my NVIDIA Turing vintage hardware. The profiles were verified using the scripts above.
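The original script is not reproduced here; a minimal sketch of the approach, assuming NIM_MODEL_PROFILE selects the profile, /v1/health/ready is the readiness endpoint in your NIM version, and the hypothetical PROFILES list holds IDs reported by list-model-profiles, might look like this:

#!/bin/bash
# Hypothetical sketch: start the NIM with each candidate profile and check whether it becomes ready.
PROFILES="profile-id-1 profile-id-2"   # replace with IDs from list-model-profiles

for PROFILE in $PROFILES; do
  echo "Testing profile $PROFILE"
  docker run -d --rm --name=nim-profile-test \
    --runtime=nvidia --gpus all --shm-size=16GB \
    -e NGC_API_KEY \
    -e NIM_MODEL_PROFILE=$PROFILE \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) -p 8000:8000 \
    $IMG_NAME
  # Give the model time to load, then poll the readiness endpoint.
  sleep 120
  if curl -sf http://localhost:8000/v1/health/ready > /dev/null; then
    echo "Profile $PROFILE started"
  else
    echo "Profile $PROFILE failed"
  fi
  docker stop nim-profile-test
done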