Coercing NVIDIA bfloat16 LLM models to run on Tesla GPUs that only support float16

Newer LLMs are built around the bfloat16 data type, which trades mantissa precision for a wider exponent range than the older float16. My Tesla-vintage GPU supports the older float16, not bfloat16. In many cases you can coerce the model from bfloat16 to float16.
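If you are not sure what your card supports, PyTorch can report it directly. A minimal sketch, assuming only that PyTorch with CUDA support is installed:

import torch

# Pre-Ampere cards (compute capability below 8.0) have no native bfloat16
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
print("bfloat16 supported:", torch.cuda.is_bf16_supported())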

Disclaimer: The difference in precision can introduce errors. Your mileage and accuracy may vary depending on the model.

Hacking config.json for FP16

The basic steps are:
  1. Download the model, either manually or as part of an attempted run.
  2. Find the location of the model on disk. It is typically under one of these model cache paths:
    1. ~/.cache/nvidia/nvidia-nims/ngc/<some_path>/config.json
    2. ~/.cache/nim/ngc/<some_path>/config.json
    3. ~/.cache/nvidia-nims/ngc/hub/<some-model>/snapshots/config.json
  3. Edit the config.json found for the model.
  4. Find the line "torch_dtype": "bfloat16".
  5. Change the value of torch_dtype to one supported by the card, float16 in my case: "torch_dtype": "float16".
The property is spelled torch.dtype on the command line but torch_dtype in the JSON files.
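The change is a one-line edit in any text editor, but it is easy to script if the cache gets re-pulled. A minimal Python sketch, using the Mistral snapshot path from my cache as an example; substitute whichever config.json you found in step 2:

import json
from pathlib import Path

# Substitute the config.json located in step 2
config_path = (Path.home() / ".cache/nim/ngc/hub"
               / "models--nim--mistralai--mistral-7b-instruct-v03/snapshots/hf/config.json")

config = json.loads(config_path.read_text())
print("torch_dtype before:", config.get("torch_dtype"))   # expect "bfloat16"

# Coerce the declared dtype to one the Tesla card supports
config["torch_dtype"] = "float16"
config_path.write_text(json.dumps(config, indent=2) + "\n")
print("torch_dtype after:", config["torch_dtype"])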

Sample config.json

You can see "torch_dtype": "bfloat16" partway down the file.

(base) joe@hp-z820:~$ cat ./.cache/nim/ngc/hub/models--nim--mistralai--mistral-7b-instruct-v03/snapshots/hf/config.json
{
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.42.0.dev0",
  "use_cache": true,
  "vocab_size": 32768
}
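Editing the cached file is what the NIM container reads, but if you load the same checkpoint yourself with Hugging Face transformers you can override the declared dtype at load time instead of touching config.json. A minimal sketch, not the NIM path, assuming the transformers, torch, and accelerate packages and using the Mistral checkpoint as an example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
# torch_dtype here overrides the "torch_dtype" value baked into config.json
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # coerce to fp16 for pre-Ampere cards
    device_map="auto",          # requires the accelerate package
)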


Now we can walk through a few model definition cards to see whether they fit on our card and are candidates for dtype coercion. Some of the model cards say float16 but download with bfloat16 configuration files. We can overcome that with the cache edit described above.

NIM Anywhere Models

The NIM Anywhere project ships with a sample service that uses a model that expects to run in bfloat16.
I was able to load the LLM model by changing the torch_dtype and validated the configuration with a few simple queries. I didn't do any precision verification testing.
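A chat completion against the NIM endpoint is enough for that kind of smoke test. A minimal sketch with Python requests, assuming the container listens on the default port 8000 and advertises the model name shown; adjust the URL and model name for your deployment:

import requests

# Assumed defaults for a local NIM container
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "meta/llama3-8b-instruct",  # model name as the NIM advertises it
    "messages": [{"role": "user", "content": "In one sentence, what is bfloat16?"}],
    "max_tokens": 100,
}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])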

Model meta-llama/Meta-Llama-3-8B-Instruct

  • Model: meta-llama/Meta-Llama-3-8B-Instruct
  • config.json location: ~/.cache/nvidia-nims/ngc/hub/models--nim--meta--llama3-8b-instruct/snapshots/hf/config.json
  • torch_dtype: bfloat16
  • model weight size: 15GB

Modifications to run on Tesla GPUs
  • torch_dtype: float16




Hybrid RAG Ungated Models

These are the six local models supported by the NVIDIA AI Workbench Hybrid RAG example. Two are immediately available. The other four are gated behind a Hugging Face token wall.

nvidia/Llama3-ChatQA-1.5-8b

  • Model: nvidia/Llama3-ChatQA-1.5-8b
  • Model Location: /data/models--nvidia--Llama3-ChatQA-1.5-8B
  • config.json location: /data/models--nvidia--Llama3-ChatQA-1.5-8B/snapshots/3b98162e3f97550d62aeeb19ea50208f968c678a/config.json
  • torch_dtype: float16


Phi-3-mini-128k-instruct

  • Model: Phi-3-mini-128k-instruct
  • Model Location: /data/models--microsoft--Phi-3-mini-128k-instruct/
  • config.json location: /data/models--microsoft--Phi-3-mini-128k-instruct/snapshots/d548c233192db00165d842bf8edff054bb3212f8/config.json
  • torch_dtype in config.json: bfloat16




Hybrid RAG Gated Models Supported in AI Workbench

Most of these models are bfloat16. The first one is FP16, so it would be runnable without modification on a Tesla GPU. This currently doesn't work in AI Workbench; I think this is because NVIDIA erroneously sets the torch_dtype to bf16 no matter what the model card says.
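If you are pulling these gated models yourself with the Hugging Face tooling, you authenticate with a read token first. A minimal sketch using the huggingface_hub package; the token value is a placeholder:

from huggingface_hub import login

# Placeholder - create a read token at https://huggingface.co/settings/tokens
login(token="hf_xxxxxxxxxxxxxxxxxxxxxxxx")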

Llama-2-7b-chat-hf

* Model: Llama-2-7b-chat-hf
* torch_dtype in config.json: not investigated



Meta-Llama-3-8B-Instruct

* Model: Meta-Llama-3-8B-Instruct
* torch_dtype in config.json: bfloat16


Mistral-7B-Instruct-v0.1

* Model: Mistral-7B-Instruct-v0.1
* torch_dtype in config.json: bfloat16



Mistral-7B-Instruct-v0.2

* Model: Mistral-7B-Instruct-v0.2
* torch_dtype in config.json: bfloat16




Other

Revision History

Created 2024-08
