Manually validating compatibility and running NVIDIA NIM container images
NVIDIA NIMs are ready-to-run, pre-packaged, containerized models. Each NIM and its included model ships with a variety of profiles supporting different compute hardware configurations. You can run a NIM in an interrogatory mode that tells you which of its profiles are compatible with your GPU hardware, and then run the NIM with the matching profile.
Sometimes there are still problems, and we have to add tuning parameters to fit the model in memory or change data types. In my case, the data type change works around a bug in the NIM startup detection code.
This article requires additional polish. It has more than a few rough edges.
NVIDIA NIMs are semi-opaque. You cannot build your own NIM. NIM construction details are not described by NVIDIA.
Examining NVIDIA Model Container Images
The first step is to select models we think can fit and run on our NVIDIA GPU hardware. Investigate the different model types by visiting the appropriate NVIDIA NIM docs.
This is the model used in the NIM Anywhere project. Support matrix: Meta Llama 3.1 8B Base. It can run on a single card with 24GB of VRAM at fp16 precision.
You need to create credentials that can be used by the various commands. You will get a 403 Forbidden error when trying to pull a container image if you don't have them.
This assumes you have an nvapi- style key that gives you access.
Log into nvcr.io using the Docker CLI. You may see these logins in some of the script logs below.
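A minimal login sketch, assuming the key is exported as NGC_API_KEY (the user name for nvcr.io is the literal string $oauthtoken):

# Log in to the NVIDIA container registry using the nvapi- key
export NGC_API_KEY="nvapi-<your key>"
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin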
We can ask the models to validate themselves against our hardware. Note that each of these tests downloads the model image to your machine. The downloads can be large and consume a lot of disk space; the intermediate layers alone were 29.8GB.
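A sketch of that compatibility check, assuming the list-model-profiles utility described in the NIM documentation and the image name exported as IMG_NAME:

# List this NIM's profiles and report which ones are compatible with the local GPUs
docker run --rm --runtime=nvidia --gpus all \
  -e NGC_API_KEY \
  $IMG_NAME \
  list-model-profiles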
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
I want to run the models on my Turing generation 24GB card.
You can find the model container image run commands on the large language models page. The container run command uses the image we previously downloaded as part of validation.
Note: Some of the images download additional model information on startup. In my experience, this happens once.
The following command runs a model while pinning a specific profile. It fails on my machine because it tries to grab 32GB of video memory; the profile is compatible, but the card is slightly too small.
Note: MY_API_KEY is the nvapi- key you got for the model.
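A sketch of that kind of run command, assuming the NIM_MODEL_PROFILE and NIM_MAX_MODEL_LEN environment variables described in the NIM documentation (the profile id is a placeholder). Without the NIM_MAX_MODEL_LEN line this is roughly the command that failed; capping the sequence length let it fit on the 24GB card.

# Run the NIM pinned to a specific profile, capping the sequence length to fit in 24GB
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY=$MY_API_KEY \
  -e NIM_MODEL_PROFILE=<profile-id-from-the-compatibility-check> \
  -e NIM_MAX_MODEL_LEN=26688 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME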
Why 26688? I tried 26000, then 28000. 28000 failed and reported that 26688 was the maximum value I could use. It looks like I could also have set it at 18000. With this setting the model uses 19.5GB of GPU memory.
Impact of shrinking the model length
From an NVIDIA forum post:
Shrinking the sequence length is a good way of decreasing the memory requirements – basically, it means that the size of the KV cache is limited, which can be a very large portion of the memory usage. The downside is that you won’t be able to send/generate messages that are quite as long. Otherwise, the model accuracy shouldn’t be affected.
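For context, a streaming chat-completions request against the locally running NIM looks roughly like the sketch below (the prompt is made up, and the model name matches the response excerpts that follow). The data: lines after the sketch are fragments of the streamed response.

# Stream a chat completion from the locally running NIM
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/mistral-7b-instruct-v0.3",
        "messages": [{"role": "user", "content": "Write a short story"}],
        "stream": true
      }'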
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"e! I "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"te a s"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" and si"},"logprobs":null,"finish_reason":null}]}
We can run the NIM Anywhere text embedding and re-ranker locally if we have enough GPU space. We'll have to run the Docker containers manually. Both of these containers fetch model components on startup, and those components require an API key. It is not the "nvapi-..." key you used to pull the container images or log into the image repository; there are two keys.
Generate an API key at https://org.ngc.nvidia.com/setup/api-key by pressing Generate API Key. It is confusing: you need the repository key (nvapi-...) to fetch the image, and then you need to re-run the command with this API key in the same environment variable so the container can download its model components on startup.
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
Make the following changes when deploying rather than testing. We want to run detached and will use the container only from within the container cluster:
Remove the line with "-it"
Remove the line with "-p 8000:8000"
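With the test settings above (port 8000 published), a quick smoke test can go against the embedding endpoint. This is a sketch assuming the OpenAI-style /v1/embeddings route and the input_type field described in the NeMo Retriever embedding NIM docs; the model name is a placeholder you can read from the container's /v1/models response.

# Sanity-check the embedding NIM from the host
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<embedding-model-name-from-/v1/models>",
        "input": ["What is the capital of France?"],
        "input_type": "query"
      }'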
NVIDIA rerankqa-mistral-4b-v3
Purpose: GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question.
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
Make the following changes when deploying rather than testing. We want to run detached and will use the container only from within the container cluster:
Remove the line with "-it"
Remove the line with "-p 8000:8000"
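Similarly, with the test settings (port 8000 published) the reranker can be exercised directly. The /v1/ranking path and payload shape here are assumptions based on the reranking NIM documentation; confirm them against the container's own docs.

# Sanity-check the reranking NIM from the host (path and payload assumed from the NIM docs)
curl http://localhost:8000/v1/ranking \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/nv-rerankqa-mistral-4b-v3",
        "query": {"text": "Which passage answers the question?"},
        "passages": [
          {"text": "Paris is the capital of France."},
          {"text": "Mitochondria are the powerhouse of the cell."}
        ]
      }'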
A script that uses compatible profiles
This is a script I used to verify different profiles on my NVIDIA Turing-vintage hardware. The profiles themselves were verified as compatible using the commands above; a rough sketch of such a script follows.
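A minimal sketch, assuming profile ids copied from the compatibility check output, the NIM_MODEL_PROFILE and NIM_MAX_MODEL_LEN environment variables used above, and the /v1/health/ready endpoint the NIM docs describe:

#!/bin/bash
# Try each candidate profile and report whether the NIM comes up on this hardware.
# The profile ids below are placeholders; copy real ones from the list-model-profiles output.
PROFILES=(
  "<profile-id-1>"
  "<profile-id-2>"
)

for PROFILE in "${PROFILES[@]}"; do
  echo "=== Testing profile $PROFILE ==="
  docker run -d --rm --name=profile-test \
    --runtime=nvidia --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -e NIM_MODEL_PROFILE="$PROFILE" \
    -e NIM_MAX_MODEL_LEN=26688 \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    "$IMG_NAME"
  sleep 180   # give the server time to load the model into GPU memory
  if curl -sf http://localhost:8000/v1/health/ready > /dev/null; then
    echo "Profile $PROFILE started on this GPU"
  else
    echo "Profile $PROFILE did not come up"
  fi
  docker stop profile-test > /dev/null
done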