Deployment Overview
The VisibleThread LLM is a general-purpose large language model that can be deployed internally for use with the VT Writer AI application. The service uses Ollama to run the Llama 3 8B language model.
Deployment Options
The VT LLM service has been tested on both RHEL 8 and Ubuntu 20.04. Deployment options for Windows Server are currently under evaluation.
VisibleThread LLM requires infrastructure with access to a GPU that is compatible with NVIDIA CUDA drivers. Below are the high-level steps for deploying the LLM, followed by specific examples for Azure and AWS environments.
Prerequisites
You must have a running RHEL 8 or Ubuntu server with a compatible GPU. The server must have internet access in order to install Ollama, as described here: https://github.com/ollama/ollama?tab=readme-ov-file
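Before installing anything, it can help to confirm that the operating system sees the GPU. A minimal check (lspci is provided by the pciutils package; nvidia-smi only becomes available once the NVIDIA drivers described below are installed):
# Confirm an NVIDIA GPU is present on the PCI bus
lspci | grep -i nvidia
# Once the NVIDIA drivers are installed, this should report the driver version and the GPU
nvidia-smi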
General Deployment Steps
1. Download and install the correct NVIDIA drivers for your OS/hardware. Detailed steps for installing NVIDIA drivers are here: https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html
2. Download and configure Ollama and the Mistral model.
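On a server that meets the prerequisites, step 2 typically reduces to the commands below; the Azure and AWS sections that follow walk through the same steps in more detail. The model name here is an assumption (mistral, as in the AWS example; the Azure example uses llama3):
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Download and start the model (the first run downloads several GB)
ollama run mistral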
Deployment on Azure
Deployment on Azure was tested on an Azure Standard_NC4as_T4_v3 Virtual Machine.
Steps to Deploy on Azure
1. Provision an Azure Standard_NC4as_T4_v3 instance running RHEL 8.8 or Ubuntu 20.04.
2. Ensure access to the instance on the following ports (an Azure CLI sketch for opening port 11434 is shown after this list):
- Port 22 (SSH)
- Port 11434 (HTTP for Ollama)
3. Install the required NVIDIA drivers and Ollama.
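If the network security group attached to the VM does not already allow inbound traffic on port 11434, one way to open it is with the Azure CLI. This is a sketch only; the resource group and VM names are placeholders for your own values:
# Open the Ollama HTTP port on the VM's network security group (placeholder names)
az vm open-port --resource-group <your-resource-group> --name <your-vm-name> --port 11434 --priority 1010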
Installing GPU Drivers on RHEL 8
# Step 1: Configure GPU drivers
# Enable the EPEL repository and install dkms (needed to build the NVIDIA kernel modules)
sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
sudo yum install -y dkms
# Add the NVIDIA CUDA repository and install the drivers
sudo wget https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo -O /etc/yum.repos.d/cuda-rhel8.repo
sudo yum install -y cuda-drivers
# Verify drivers are installed (nvidia-smi reports the driver version and GPU; nvcc is only present if the full CUDA toolkit is installed)
nvidia-smi
# Step 2: Install and configure Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3 # this will download the model, which will take some time as it is 4+ GB
# Step 3: Configure Ollama service
sudo mkdir -p /etc/systemd/system/ollama.service.d
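# The drop-in file created below sets OLLAMA_HOST so the service listens on all interfaces (0.0.0.0:11434) rather than only on localhost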
sudo bash -c 'echo -e "[Service]\nEnvironment=\"OLLAMA_HOST=0.0.0.0:11434\"" > /etc/systemd/system/ollama.service.d/environment.conf'
sudo systemctl daemon-reload
sudo systemctl restart ollama
journalctl -f -u ollama -n 1000 --no-pager
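Before testing from another machine, it can be useful to confirm locally that the service is up and the model is available. A quick check, assuming the steps above completed without errors:
# Confirm the Ollama service is running and listening on 0.0.0.0:11434
systemctl status ollama
# List the downloaded models (llama3 should appear if the run above completed)
ollama list
# Query the API on the local port
curl http://localhost:11434/api/tags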
Deployment on AWS
1. Provision an EC2 instance from the Amazon-provided Deep Learning OSS Nvidia Driver AMI GPU TensorFlow 2.15 (Amazon Linux 2) 20240213 AMI.
2. The instance type should be g4dn.2xlarge.
3. Ensure you have access to the instance on port 22 (SSH) and port 11434 (HTTP port for Ollama); an AWS CLI sketch for opening port 11434 is shown after the commands below.
4. SSH into the instance and run the following:
# GPU drivers are installed automatically on the AMI; to ensure they are up to date, perform an update and reboot
sudo yum update -y
sudo reboot
# verify drivers are running
nvcc --version
# install and configure ollama
curl -fsSL https://ollama.com/install.sh | sh
# configure Ollama
ollama run mistral # this will download the model, which will take some time as it is 4+ GB
# create the service
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo bash -c 'echo -e "[Service]\nEnvironment=\"OLLAMA_HOST=0.0.0.0:11434\"" > /etc/systemd/system/ollama.service.d/environment.conf'
sudo systemctl daemon-reload
# restart the service
sudo systemctl restart ollama
# check logs
journalctl -f -u ollama -n 1000 --no-pager
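If the instance's security group does not already allow inbound traffic on port 11434 (step 3 above), one way to add the rule is with the AWS CLI. This is a sketch only; the security group ID and CIDR range are placeholders and should be restricted to the network that hosts VT Writer:
# Allow inbound traffic to Ollama on port 11434 (placeholder security group ID and CIDR)
aws ec2 authorize-security-group-ingress --group-id <your-security-group-id> --protocol tcp --port 11434 --cidr 10.0.0.0/16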
Deployment on Windows
Deployment on Windows is via the Ollama Windows installer: https://ollama.com/blog/windows-preview
This is currently being evaluated by the VisibleThread engineering team. Contact our support team at support@visiblethread.com if you wish to deploy the VisibleThread LLM on Windows.
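If you are evaluating the Windows preview yourself, the Ollama documentation indicates the listen address is controlled by the same OLLAMA_HOST setting used on Linux. A sketch, assuming you want Ollama to listen on all interfaces (run from a command prompt, then restart the Ollama application):
# Set the Ollama listen address for the current user (Windows)
setx OLLAMA_HOST "0.0.0.0:11434"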
Verifying the install
To verify the deployment was successful, run the following from a different machine, substituting the model you pulled (the Azure example above uses llama3, the AWS example uses mistral):
# on Linux
curl http://<ip address>:11434/api/generate -d '{"model":"mistral","system":"","prompt":"","template":""}'
You should receive a response similar to:
{"model":"mistral","created_at":"2024-03- 05T14:47:21.370491701Z","response":"","done":true}