Get AI code assist in VS Code with local LLMs using LM Studio and the Continue.dev extension - Windows

This article is about running VS Code AI code assist locally, replacing Copilot or another hosted service. You may want to run local models to guarantee none of your code ends up on external servers, or you may not want to maintain an ongoing AI subscription.

We are going to use LM Studio and VS Code. This was tested on Windows 11 with an RTX 3060 Ti with 8GB of VRAM. 8GB really limits the number and size of the models we can use. LM Studio's simple hosting model of one LLM plus one embedding model works for us in this situation.

You want a big card. 8GB is a tiny card.

Related blog articles and videos

There are several related blog articles and videos that cover VS Code and local LLMs.

Installing LM Studio

LM Studio is a GUI program that lets you run models locally, assuming you have the required hardware. Unlike Ollama, which is a command-line program, LM Studio gives you a graphical interface. I'm using LM Studio because Ollama is still in beta on Windows.

Download and install LM Studio. Run the installer and then the program.

Selecting Models

We want models that are tuned for coding. LM Studio has many coding-oriented models in its catalog. The search results will tell you which ones are appropriate for your GPU / CPU.
 



You should be judicious about the size of the models. You want them to run on the GPU and fit in video memory. You can use the same model for auto-completion and chat; if you can only load one, select the model best suited to your normal use case.

On Windows, you will be constrained by your NVIDIA GPU's VRAM. You will need at least 8GB of VRAM for a single model plus an embedding model. With 12-16GB you can run better models, or separate chat and auto-complete models. On a Mac, the GPU can access a percentage of system memory; you will want at least 32GB of RAM because main memory serves as VRAM on an ARM Mac.
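
As a rough rule of thumb, a model quantized to 4 bits needs a bit over half a gigabyte of VRAM per billion parameters, plus headroom for the context window. A 7B model at Q4 lands around 4-5GB, which is why it fits next to a small embedding model on an 8GB card, while a 13B model generally does not.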


Note: I used a single model for both auto-complete and chat because of the VRAM limitations on my Windows machine.

Download models and run them in LM Studio

The LM Studio catalog included 21 different coding-oriented models at the time of this article. I selected CodeLlama-7B-Instruct solely because it had the most downloads. The My Models tab shows that I have one model downloaded. I also tried a couple of the DeepSeek models.

CodeLlama is not optimized for auto-complete according to the VS Code extension; you will get a warning when using it for auto-complete.


We run the model by clicking on the Local Server tab and then selecting the model in the drop-down list at the top center. You can see the current LM Studio model highlighted in purple in this screenshot.

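Before pointing VS Code at the server, it is worth confirming that the server is actually listening. LM Studio's local server exposes an OpenAI-compatible API, by default on port 1234 (the Local Server tab shows the port if you changed it). This quick check is a sketch that assumes the default port:

# List the models the local LM Studio server is currently serving
Invoke-RestMethod -Uri http://localhost:1234/v1/models | ConvertTo-Json -Depth 5

The response should list the loaded model. This is the same kind of model listing the Continue extension relies on when it auto-detects models later in this article.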

Verify using the GPU

Use the nvidia-smi command to verify GPU usage.  

Before Loading the model

PS C:\Users\joe> nvidia-smi
Thu Aug 22 07:53:02 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.70                 Driver Version: 560.70         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 Ti   WDDM  |   00000000:08:00.0 Off |                  N/A |
|  0%   28C    P8              4W /  240W |     129MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     22764      C   ...\LM-Studio\app-0.2.31\LM Studio.exe      N/A      |
+-----------------------------------------------------------------------------------------+

After Loading the model

The DeepSeek-Coder-V2-Lite-Instruct model consumed about 4-5GB of VRAM, as shown below. The embedding model requires another 400MB or so.

PS C:\Users\joe> nvidia-smi
Thu Aug 22 08:02:26 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.70                 Driver Version: 560.70         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 Ti   WDDM  |   00000000:08:00.0 Off |                  N/A |
|  0%   29C    P8              4W /  240W |    4286MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      7096      C   ...\LM-Studio\app-0.2.31\LM Studio.exe      N/A      |
|    0   N/A  N/A     22764      C   ...\LM-Studio\app-0.2.31\LM Studio.exe      N/A      |
+-----------------------------------------------------------------------------------------+

VSCode Extension: Chat and auto-complete functions

The extension supports two modes for interacting with the LLM: chat and auto-complete. Ollama supports multiple simultaneous models; LM Studio supports a single active model. The VS Code extension handles this because it discovers the available models by querying the server's API endpoint.
  1. Chat: Chat happens in a dedicated pane where you can type or copy/paste your question. The extension lets you select any of the configured LLMs, but each conversation in the chat window is bound to a single LLM definition.
  2. Autocomplete: Tab-based auto-complete that works inline as you type. It just works once it is configured correctly. The auto-complete LLM has its own configuration section. You can use the same LLM for both chat and auto-complete, and you will have to if you only have enough VRAM for one.
If you only have enough VRAM for one model, decide whether you want to optimize for auto-complete or for chat and pick the appropriate model.
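
Under the hood, both modes talk to the same OpenAI-compatible chat completions endpoint on the LM Studio server. If you want to see roughly what the extension sends, you can post a request yourself. This is a minimal sketch; the port and the model identifier are assumptions based on my setup, and LM Studio serves whichever model is currently loaded:

# Minimal chat completion request against the local LM Studio server
$body = '{
  "model": "bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF",
  "messages": [
    { "role": "user", "content": "Write a Python function that reverses a string." }
  ],
  "temperature": 0.2
}'
Invoke-RestMethod -Uri http://localhost:1234/v1/chat/completions -Method Post -ContentType "application/json" -Body $body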

Install and Configure the Continue.dev extension

Install the Continue extension.

The extension will recommend moving the chat pane to the right-hand side, away from the primary sidebar. This makes the chat window the secondary sidebar. That worked for me.
The Continue extension may run a configuration wizard for you on the first install. If you miss this or need to reconfigure the extension later, you will need to edit config.json by hand.

I just go straight to the JSON editor.

Click on the gear icon at the bottom of the Continue extension view, or open the config file directly at ~/.continue/config.json (on Windows that resolves to %USERPROFILE%\.continue\config.json).
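
On Windows, a quick way to open that file, assuming the code command-line launcher is on your PATH:

# Open the Continue configuration file in VS Code
code "$env:USERPROFILE\.continue\config.json"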


Embeddings in LM Studio

You will want to enable embeddings in LM Studio and point the Continue configuration at that model. Download the embedding model and enable it in LM Studio on the Local Server tab. The embedding model requires about 300MB of VRAM.
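
If you want to confirm the embedding model is actually being served before touching the Continue configuration, you can hit the embeddings endpoint directly. Again a sketch that assumes the default port and the nomic embedding model used later in this article:

# Request an embedding vector from the local LM Studio server
$body = '{ "model": "nomic-ai/nomic-embed-text-v1.5-GGUF", "input": "hello world" }'
Invoke-RestMethod -Uri http://localhost:1234/v1/embeddings -Method Post -ContentType "application/json" -Body $body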



~/.continue/config.json

This is the configuration file for the Continue Visual Studio Code extension. We must edit three sections to enable and configure chat, auto-complete, and embeddings.

models: The models for chat

This configuration says to use the local LM Studio and auto-detect all the models currently available to that LM Studio instance. We could specify a single model, but auto-detection lets us pick a specific model for any individual chat session.


  "models": [
    {
      "title": "LM Studio",
      "provider": "lmstudio",
      "model": "AUTODETECT"
    }
  ],


You can see the models provided by this instance of LM Studio in the model selector in this screenshot.



tabAutoCompleteModel: The model for inline suggestions and auto-complete

This configuration says to use this specific model for tab auto-complete. The model is served by the local LM Studio server just like the models in the models list. Note that title is only a display label; mine still says Codellama even though the configured model is DeepSeek-Coder.


  "tabAutocompleteModel": {
    "title": "LM Studio Codellama",
    "provider": "lmstudio",
    "model": "bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF"
  },

embeddingsProvider: The model for embeddings

We need to configure the embeddingsProvider section because we are using something other than Ollama.


  "embeddingsProvider": {
    "provider": "lmstudio",
    "model": "nomic-ai/nomic-embed-text-v1.5-GGUF"
  }

Complete sample config.json file

This is my complete config.json with the three entries shown above. Everything else is exactly as it was created, without modification.


{
  "models": [
    {
      "title": "LM Studio",
      "provider": "lmstudio",
      "model": "AUTODETECT"
    }
  ],
  "customCommands": [
    {
      "name": "test",
      "prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
      "description": "Write unit tests for highlighted code"
    }
  ],
  "tabAutocompleteModel": {
    "title": "LM Studio Codellama",
    "provider": "lmstudio",
    "model": "bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF"
  },
  "contextProviders": [
    {
      "name": "code",
      "params": {}
    },
    {
      "name": "docs",
      "params": {}
    },
    {
      "name": "diff",
      "params": {}
    },
    {
      "name": "terminal",
      "params": {}
    },
    {
      "name": "problems",
      "params": {}
    },
    {
      "name": "folder",
      "params": {}
    },
    {
      "name": "codebase",
      "params": {}
    }
  ],
  "slashCommands": [
    {
      "name": "edit",
      "description": "Edit selected code"
    },
    {
      "name": "comment",
      "description": "Write comments for the selected code"
    },
    {
      "name": "share",
      "description": "Export the current chat session to markdown"
    },
    {
      "name": "cmd",
      "description": "Generate a shell command"
    },
    {
      "name": "commit",
      "description": "Generate a git commit message"
    }
  ],
  "embeddingsProvider": {
    "provider": "lmstudio",
    "model": "nomic-ai/nomic-embed-text-v1.5-GGUF"
  }
}


Refresh the workspace after reconfiguring the extension

Reload the VS Code window after editing the config file. This forces an immediate configuration refresh.


Ctrl-Shift-P
Developer: Reload Window


The reload command is available here after pressing Ctrl-Shift-P.


Fini

That's it. At this point, you should have AI assist in your editor and in the chat pane. All of this capability without leaving the friendly environment of your local network!

Revision History

Created: 2024 08
