Rocking an older Titan RTX 24GB as my local AI Code assist on Windows 11, Ollama and VS Code
This is about using a Turing-generation NVIDIA Titan RTX GPU to run code-assist LLMs locally for use in VS Code. This slightly older card has 24GB of VRAM, making it a great local LLM host. The Titan RTX is a two-slot, dual-fan card that currently sells for about the same price as a refurbished Ampere NVIDIA RTX 3090 Ti 24GB.
There are a bunch of ways to host code-assist LLMs. We are using an early release of Ollama as our LLM service and the continue.dev extension as the language service inside VS Code.
This was tested on an 8-core AMD Ryzen with 64GB of RAM and the Titan RTX.
Related blog articles and videos
Several related blogs and videos that cover VSCode and local LLMs
- Blog Get AI code assist VSCode with local LLMs using Ollama and the Continue.dev extension - Mac
- Get AI code assist VSCode with local LLMs using LM Studio and the Continue.dev extension - Windows
- Rocking an older Titan RTX 24GB as my local AI Code assist on Windows 11, Ollama and VS Code
- YouTube Video Using local Large Language Models for AI code assist in Visual Studio Code
- YouTube Video See tabAutoComplete AI assist and Chat AI assist relying on local LLMs in 4 minutes
Install Ollama
Download and install Ollama. The early-release Windows version of Ollama runs from the command line as a daemon and adds a widget to the Windows 11 tray.
Ollama is a simple tool that lets you run models locally, assuming you have the required hardware. It can run in server mode, providing local API endpoints for tools like the VS Code AI-assist extensions. I needed to run three models for the full-featured experience. The 24GB of VRAM gives me some room to play.
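If you want a quick check that the install worked before pulling anything, the CLI responds from any PowerShell prompt; something like:

# Confirm the CLI is installed and reachable from the shell
ollama --version

# List the available subcommands (pull, list, serve, run, ...)
ollama help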
Pulling Models
Browse the Ollama site to find a couple of models you want, then get them with ollama pull <model>. I wanted to try the most popular models with a couple of languages and, after some experimenting, ended up with:
ollama pull nomic-embed-text
ollama pull deepseek-coder-v2
ollama pull codellama
ollama pull starcoder2
ollama pull codeqwen
ollama pull gemma2
ollama pull codegemma
ollama pull codestral
ollama pull phi3
ollama pull phi3.5
Selecting Models
Our VSCode plugin needs models for three functions: tabAutoComplete, Chat, and embedding. We have lots of choices.
PS C:\Users\joe> ollama ls
NAME                        ID              SIZE      MODIFIED
phi3.5:latest               3b387c8dd9b7    2.2 GB    14 minutes ago
gemma2:latest               ff02c3702f32    5.4 GB    15 minutes ago
phi3:latest                 4f2222927938    2.2 GB    15 minutes ago
codestral:latest            fcc0019dcee9    12 GB     16 minutes ago
codeqwen:latest             df352abf55b1    4.2 GB    16 minutes ago
starcoder2:latest           f67ae0f64584    1.7 GB    18 minutes ago
codellama:latest            8fdf8f752f6e    3.8 GB    19 minutes ago
codegemma:latest            0c96700aaada    5.0 GB    19 minutes ago
deepseek-coder-v2:latest    8577f96d693e    8.9 GB    21 minutes ago
nomic-embed-text:latest     0a109f422b47    274 MB    23 minutes ago
This will look bad because Blogger doesn't support long lines and doesn't have scrolling areas for code:-(
Run the ollama server
Ollama can be run as a server, making all of the downloaded models available, or it can be run with a single model. Note that the first ollama serve attempt below fails with a bind error; that usually means an Ollama server (for example the tray app) is already listening on port 11434, so either quit it first or just use the instance that is already running.
PS C:\Users\joe> ollama serve
Error: listen tcp 127.0.0.1:11434: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
PS C:\Users\joe> ollama serve
2024/08/22 20:20:00 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\joe\\.ollama\\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\joe\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-08-22T20:20:00.136-04:00 level=INFO source=images.go:782 msg="total blobs: 47"
time=2024-08-22T20:20:00.139-04:00 level=INFO source=images.go:790 msg="total unused blobs removed: 0"
time=2024-08-22T20:20:00.142-04:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.6)"
time=2024-08-22T20:20:00.143-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11.3 rocm_v6.1 cpu cpu_avx]"
time=2024-08-22T20:20:00.143-04:00 level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
time=2024-08-22T20:20:00.312-04:00 level=INFO source=gpu.go:288 msg="detected OS VRAM overhead" id=GPU-c7ac10e0-547d-6d91-75d5-cdd64259f9f2 library=cuda compute=7.5 driver=12.6 name="NVIDIA TITAN RTX" overhead="928.0 MiB"
time=2024-08-22T20:20:00.314-04:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-c7ac10e0-547d-6d91-75d5-cdd64259f9f2 library=cuda compute=7.5 driver=12.6 name="NVIDIA TITAN RTX" total="24.0 GiB" available="22.8 GiB"
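With the server listening, you can poke the local REST API directly to confirm it is up. Here is a rough sketch using PowerShell's Invoke-RestMethod against the default 127.0.0.1:11434 endpoint; substitute any model you actually pulled for codeqwen:

# List the models the server knows about
Invoke-RestMethod -Uri http://127.0.0.1:11434/api/tags

# Request a quick non-streaming completion from one of the pulled models
Invoke-RestMethod -Uri http://127.0.0.1:11434/api/generate -Method Post -ContentType "application/json" -Body '{"model": "codeqwen", "prompt": "Write hello world in Dart", "stream": false}'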
VSCode Extension: Chat and auto-complete functions
The continue.dev extension has two paradigms for interacting with the LLM, chat and auto-complete, plus an embeddings model used for codebase indexing. Each of the three can use a different LLM if you have enough GPU VRAM. continue.dev supports remote and local endpoints, which means you can use a mix of local and remote models. I'm only going to show locally deployed models going forward.
You want the models to fit into VRAM because you don't want them to fall back to the CPU, which is slow. A quick way to verify where a model is running is shown at the end of this section.
- Chat: Chat happens in a dedicated pane where you can type or copy/paste your question. The extension lets you select any of the configured LLMs. Each conversation in the chat window is bound to a single LLM definition; you can change the LLM for a new conversation with the drop-down list in the chat pane. The following image was captured from a locally running Ollama instance with four available models. The extension's chat function understands that Ollama can host multiple downloaded models, and its UI lets you select which model you wish to chat with for any given interaction.
- Autocomplete: Tab-based auto-complete that works inline while you are coding. It just works once it is configured correctly. There can be only one auto-complete LLM, and it has its own Ollama configuration section.
I have enough VRAM that I can locally run separate tabAutocomplete and chat LLMs.
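Because performance falls off badly when a model spills out of VRAM, it is worth checking where Ollama placed each loaded model. A quick sketch of the check, run while a chat or completion request is in flight:

# Show the models currently loaded by the server; the PROCESSOR column reports GPU vs CPU placement
ollama ps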
Install and Configure the Continue.dev extension
Install the continue extension. Edit the configuration by clicking on the gear item at the bottom of the extension pane.
~/.continue/config.json
This is the configuration file for the Continue Visual Studio Code extension. We must edit a few different areas to enable and configure chat, auto-complete, and embeddings.
models: The models for chat
We can change models across chat sessions. This configuration says to use the local Ollama and auto-detect all the models currently available to that Ollama instance. You can see the model selector above.
"models": [
{
"title": "Ollama",
"provider": "ollama",
"model": "AUTODETECT"
}
],
tabAutocompleteModel: The model for inline suggestions and auto-complete
This configuration says to use this specific model for tab auto-complete. The model is available on the local Ollama server just like the models in the models list.
"tabAutocompleteModel": {
"title": "Ollama",
"provider": "ollama",
"model": "codeqwen"
},
Embeddings
I explicitly configured the embedding model, which Continue uses when indexing your codebase for context.
"embeddingsProvider": {
"provider": "ollama",
"model": "nomic-embed-text"
}
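If you want to confirm the embedding model answers before Continue starts indexing, you can call Ollama's embeddings endpoint directly; a minimal sketch against the default local port:

# Ask the local server for an embedding vector for a test string
Invoke-RestMethod -Uri http://127.0.0.1:11434/api/embeddings -Method Post -ContentType "application/json" -Body '{"model": "nomic-embed-text", "prompt": "hello world"}'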
Complete sample config.json file
This is my complete config.json for my Windows machine with 24GB of graphics VRAM.
{
  "models": [
    {
      "title": "Ollama",
      "provider": "ollama",
      "model": "AUTODETECT"
    }
  ],
  "customCommands": [
    {
      "name": "test",
      "prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
      "description": "Write unit tests for highlighted code"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Ollama",
    "provider": "ollama",
    "model": "codeqwen"
  },
  "contextProviders": [
    { "name": "code", "params": {} },
    { "name": "docs", "params": {} },
    { "name": "diff", "params": {} },
    { "name": "terminal", "params": {} },
    { "name": "problems", "params": {} },
    { "name": "folder", "params": {} },
    { "name": "codebase", "params": {} }
  ],
  "slashCommands": [
    { "name": "edit", "description": "Edit selected code" },
    { "name": "comment", "description": "Write comments for the selected code" },
    { "name": "share", "description": "Export the current chat session to markdown" },
    { "name": "cmd", "description": "Generate a shell command" },
    { "name": "commit", "description": "Generate a git commit message" }
  ],
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text"
  }
}
Refresh the workspace after reconfiguring the extension
Reload the VSCode Workspace after editing the config file. This forces an immediate configuration refresh.
Ctrl+Shift+P
Developer: Reload Window
The reload command is available in the command palette after pressing Ctrl+Shift+P.
Examples
A chat session
This chat is about a unit test. Zero prompt engineering has been done to get better answers.
Autocomplete in action
Autocomplete will provide suggestions while you are typing in a code window. The light grey text in this image was provided by the LLM when I clicked on a Dart unit test file.
Memory usage with three separate models loaded
nvidia-smi lets us check the status of VRAM and the GPU. We can see here that the loaded models consume almost 22GB of VRAM.
PS C:\Windows\System32> nvidia-smi
Thu Aug 22 20:04:14 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.70 Driver Version: 560.70 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA TITAN RTX WDDM | 00000000:08:00.0 Off | N/A |
| 40% 38C P8 5W / 280W | 21824MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 4484 C ...\cuda_v11.3\ollama_llama_server.exe N/A |
| 0 N/A N/A 10752 C ...\cuda_v11.3\ollama_llama_server.exe N/A |
| 0 N/A N/A 23996 C ...\cuda_v11.3\ollama_llama_server.exe N/A |
+-----------------------------------------------------------------------------------------+
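If you want to watch VRAM move as models load and then idle out (the server unloads them after the OLLAMA_KEEP_ALIVE timeout, 5 minutes by default in the log above), nvidia-smi can refresh itself on an interval; for example:

# Refresh the GPU and VRAM report every 5 seconds while exercising chat and auto-complete
nvidia-smi -l 5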
Don't grab models by the tag "latest"
The latest tag can bring down a completely different version or quantization on different days because it is simply the "latest", whatever was uploaded most recently.
Most of my work above was done using the latest version of a model. That's a mistake because you don't know which tuning or quantization you are actually using. I removed the models tagged latest and pulled explicitly by parameter count/size (an example follows the list below).
codegemma:2b                      926331004170    1.6 GB
codegemma:7b                      0c96700aaada    5.0 GB
codellama:13b                     9f438cb9cd58    7.4 GB
codellama:7b                      8fdf8f752f6e    3.8 GB
codeqwen:7b                       df352abf55b1    4.2 GB
codestral:22b                     fcc0019dcee9    12 GB
deepseek-coder-v2:16b             8577f96d693e    8.9 GB
gemma2:27b                        53261bc9c192    15 GB
gemma2:2b                         8ccf136fdd52    1.6 GB
gemma2:9b                         ff02c3702f32    5.4 GB
granite-code:20b                  31d8bc61e506    11 GB
granite-code:20b-instruct         31d8bc61e506    11 GB
granite-code:3b                   63bedbdffbf0    2.0 GB
granite-code:3b-base-f16          22067b08f26e    7.0 GB
granite-code:3b-instruct          63bedbdffbf0    2.0 GB
granite-code:8b                   998bce918de0    4.6 GB
granite-code:8b-base-f16          7c8fde9dfb87    16 GB
granite-code:8b-instruct          998bce918de0    4.6 GB
phi3.5:3.8b                       3b387c8dd9b7    2.2 GB
phi3.5:3.8b-mini-instruct-fp16    570b68409ede    7.6 GB
starcoder2:15b                    20cdb0f709c2    9.1 GB
starcoder2:3b                     f67ae0f64584    1.7 GB
starcoder2:7b                     0679cedc1189    4.0 GB
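The swap itself is just a remove and a re-pull; for example, using codeqwen (any of the models above works the same way):

# Drop the ambiguous latest tag and pull a specific parameter-count variant instead
ollama rm codeqwen:latest
ollama pull codeqwen:7b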
Fini
That's it. At this point, you should have AI assist in your editors and in the chat pane. All of this capability without leaving the friendly environment of your local network!
Revision History
Created: 2024 08