Cloud and Software Architecture, Soft skills, IOT and embedded
Manually validating compatibility and running NVIDIA (NIM) container images
NVIDIA NIMs are ready-to-run, pre-packaged containerized models. The NIMs and their included models are available in a variety of profiles supporting different compute hardware configurations. You can run a NIM in an interrogatory mode that will tell you which profiles are compatible with your GPU hardware. You can then run the NIM with the associated profile.
Sometimes problems remain and we have to add tuning parameters to fit the model in memory or to change data types. In my case, the data type change was needed because of a bug in the NIM startup detection code.
This article requires additional polish. It has more than a few rough edges.
NVIDIA NIMs are semi-opaque. You cannot build your own NIM. NIM construction details are not described by NVIDIA.
Examining NVIDIA Model Container Images
The first step is to select models we think can fit and run on our NVIDIA GPU hardware.
Investigate models of the different types by visiting the appropriate NVIDIA NIM docs.
This is the model used in the NIM Anywhere project. Support matrix: Meta Llama 3.1 8B Base. It can run on a single card with 24GB of VRAM at fp16 precision.
You need to create credentials that can be used by the various commands. You will get a 403 forbidden when trying to pull down a container if you don't have credentials.
This assumes you have an nvapi- type key that gives you access.
Log into nvcr.io using the Docker CLI. You may see logins in some of the script logs below.
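A minimal sketch of the login step, assuming the key has been exported as NGC_API_KEY (a variable name picked here for illustration). For nvcr.io the username is the literal string $oauthtoken and the key is the password. The function name nim_login is hypothetical.

```shell
# Log in to nvcr.io with an NGC key.
# Assumption: the key is held in NGC_API_KEY (illustrative variable name).
nim_login() {
  # The username is the literal string $oauthtoken, so keep the single quotes.
  printf '%s' "$NGC_API_KEY" |
    docker login nvcr.io --username '$oauthtoken' --password-stdin
}
```

Usage: export NGC_API_KEY with your nvapi- key, then call nim_login. Piping the key through --password-stdin keeps it out of the shell history and the process list.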
We can ask the models to validate themselves against our hardware. Note that each one of these tests downloads the model image to your machine. The downloads can be large and take up bunches of disk space. The intermediate layers were 29.8GB.
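A sketch of the validation step, assuming the NIM image ships the list-model-profiles utility that the NIM LLM images provide. The function name nim_list_profiles is hypothetical, and the image name is whatever you picked from the NVIDIA catalog.

```shell
# Ask a NIM image which profiles are compatible with the local GPUs.
# Assumptions: NGC_API_KEY is exported in the calling shell and the image
# name is passed in by the caller.
nim_list_profiles() {
  img_name="$1"
  docker run --rm --runtime=nvidia --gpus all \
    -e NGC_API_KEY \
    "$img_name" list-model-profiles
}
```

Usage: nim_list_profiles followed by the full image name, for example an image under nvcr.io/nim/. The output groups the profiles into those compatible with the detected hardware and those that are not.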
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
I want to run the models on my Turing generation 24GB card.
You can find the model container image run commands on the large language models page. The container run command will use the image we previously downloaded as part of validation.
Note: Some of the images download additional model information on startup. In my experience, this happens once.
The following command runs a model picking a specific profile. It fails on my machine because it tries to grab 32GB of video memory. The profile is compatible but the card is slightly too small.
Note: MY_API_KEY is the nvapi- key you got for the model.
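A sketch of what such a run could look like, not the exact command used here. It assumes the profile ID printed during validation is passed through the NIM_MODEL_PROFILE environment variable and that the context length is capped with NIM_MAX_MODEL_LEN so the model fits on a smaller card. The function name, the cache path, and the argument order are illustrative.

```shell
# Sketch: run a NIM with an explicit profile and a capped context length.
# Assumptions: MY_API_KEY holds the nvapi- key; the image name and profile ID
# are passed in by the caller; NIM_MODEL_PROFILE selects the profile and
# NIM_MAX_MODEL_LEN caps the maximum model length.
nim_run_profile() {
  img_name="$1"
  profile_id="$2"
  max_len="${3:-26688}"   # default is the cap discussed in this article
  docker run -it --rm --runtime=nvidia --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY="$MY_API_KEY" \
    -e NIM_MODEL_PROFILE="$profile_id" \
    -e NIM_MAX_MODEL_LEN="$max_len" \
    -v "$HOME/.cache/nim:/opt/nim/.cache" \
    -u "$(id -u)" \
    -p 8000:8000 \
    "$img_name"
}
```

Usage: nim_run_profile IMAGE PROFILE_ID [MAX_LEN]. Leaving off the third argument uses 26688.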
Why 26688? I tried 26000. Then I tried 28000. 28000 failed and told me that 26688 was the maximum number I could use. Looks like I could set it at 18000. This model uses 19.5GB of GPU memory.
Impact of shrinking the model length
From an NVIDIA Forum post:
Shrinking the sequence length is a good way of decreasing the memory requirements – basically, it means that the size of the KV cache is limited, which can be a very large portion of the memory usage. The downside is that you won’t be able to send/generate messages that are quite as long. Otherwise, the model accuracy shouldn’t be affected.
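To get a feel for the numbers, here is a rough back-of-the-envelope calculation of the KV cache needed for one sequence. The layer count, KV head count, and head dimension are assumptions taken from the published Llama 3.1 8B / Mistral 7B style configurations, not values read from the NIM, and the function name kv_cache_mib is hypothetical.

```shell
# Rough KV-cache size in MiB for a single sequence of a given length.
# Assumptions (Llama 3.1 8B / Mistral 7B style config, not read from the NIM):
# 32 layers, 8 KV heads, head dimension 128, 2 bytes per value (fp16).
kv_cache_mib() {
  seq_len="$1"
  layers=32; kv_heads=8; head_dim=128; bytes_per_value=2
  # The factor 2 covers the separate key and value tensors.
  bytes_per_token=$((2 * layers * kv_heads * head_dim * bytes_per_value))
  echo $((seq_len * bytes_per_token / 1024 / 1024))
}

kv_cache_mib 32768   # prints 4096
kv_cache_mib 26688   # prints 3336
kv_cache_mib 18000   # prints 2250
```

Under these assumptions each token costs 128 KiB of cache, so dropping the maximum length from 32768 to 18000 saves a bit under 2GB per full-length sequence. The server's real allocation also depends on batching and how much memory it reserves up front, so treat this as an estimate only.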
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"e! I "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"te a s"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" and si"},"logprobs":null,"finish_reason":null}]}
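Chunks like the ones above come back from a streaming chat request. A sketch of such a request, assuming the container was started with -p 8000:8000 and serves the OpenAI-compatible /v1/chat/completions endpoint. The function name, the prompt, and the max_tokens value are illustrative.

```shell
# Send a streaming chat request to a NIM listening on localhost:8000.
# Assumptions: port 8000 is published and the model name matches the one
# reported in the response chunks. The prompt must not contain double quotes
# or backslashes because it is pasted into the JSON body as-is.
nim_chat_stream() {
  prompt="$1"
  curl -sN http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"mistralai/mistral-7b-instruct-v0.3\",\"stream\":true,\"max_tokens\":64,\"messages\":[{\"role\":\"user\",\"content\":\"$prompt\"}]}"
}
```

Usage: nim_chat_stream "Write a limerick about GPUs". The -N flag turns off curl's output buffering so each data: line shows up as soon as the server sends it.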
We can run the NIM Anywhere text embedding and re-ranker locally if we have enough GPU space. We'll have to run the docker containers manually. Both of these containers fetch model components on startup. Those model components require an API key. It is not the "nvapi-..." key you used to create the containers or log into the image repository. There are two keys.
Get the second key at https://org.ngc.nvidia.com/setup/api-key by pressing Generate API Key. It is messed up that you need the repository "nvapi-..." key to fetch the image and then need to re-run the command with the API key in the same variable to get the container to update its internals.
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
Make the following changes when deploying rather than testing. We want to run detached and will use the container only from within the container cluster.
Remove "-it" from the "docker run" line (add "-d" to run detached)
Remove the line with "-p 8000:8000"
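With those changes applied, the deployment form of the command could look like the sketch below. It assumes CONTAINER_NAME, IMG_NAME, LOCAL_NIM_CACHE, and NGC_API_KEY are already set as in the test command. The function name nim_run_detached is hypothetical, and the -d flag is my addition to get the detached behavior.

```shell
# Detached variant of the run command, with no port published to the host.
# Assumptions: CONTAINER_NAME, IMG_NAME, LOCAL_NIM_CACHE and NGC_API_KEY are
# already set in the calling shell.
nim_run_detached() {
  docker run -d --rm --name="$CONTAINER_NAME" \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u "$(id -u)" \
    "$IMG_NAME"
}
```

Because nothing is published on the host, other containers have to reach the NIM over a shared container network on port 8000.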
NVIDIA rerankqa-mistral-4b-v3
Purpose: GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question.
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
Make the following changes when deploying rather than testing. We want to run detached and will use the container only from within the container cluster.
Remove "-it" from the "docker run" line (add "-d" to run detached)
Remove the line with "-p 8000:8000"
A script that uses compatible profiles
This is a script I used to verify different profiles on my Turing-generation NVIDIA hardware. The profiles were verified using the scripts above.