JT
July 4 2025
Updated July 4 2025

Deploying AI/LLM on your computer

Cloud automation alone does not always cover local needs: running a language model on your own machine both simplifies day-to-day work and can serve as a ready-made module in a larger information system. In this article, we will look at how to deploy a local LLM (well-known open models) via Docker using Ollama. The instructions are suitable for Ubuntu, Windows (WSL2), or macOS.

If your local hardware is not enough for the task, Serverspace provides the latest GPT models, which are just as easy to integrate via an API or use in the web dashboard.

Below is a step-by-step process for launching a language model via Docker on the CPU.

1. Installing Docker

Linux (Ubuntu/Debian)

sudo apt update
sudo apt install docker.io
sudo systemctl enable --now docker

Windows/macOS

Download Docker Desktop from the official website and install it. On Windows, it can also be installed from the command line:

winget install Docker.DockerDesktop

After installation, verify that Docker works by running:

docker run hello-world

2. Running Ollama in Docker (on CPU)

Ollama is a local server for running LLMs (for example, LLaMA, Mistral, etc.).

docker run -d \
--name ollama \
-p 11434:11434 \
-v ollama_data:/root/.ollama \
ollama/ollama

The -v ollama_data:/root/.ollama parameter persists downloaded models between container restarts. By default, Ollama uses the CPU when no GPU drivers are available inside the container.

Check that the container is running, then pull and list models:

docker ps
docker exec -it ollama ollama pull llama3
docker exec -it ollama ollama list

3. Connecting the web interface — Open WebUI

docker run -d \
--name open-webui \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
ghcr.io/open-webui/open-webui:ollama

The interface will be available at: http://localhost:3000

4. Alternative: Text Generation Web UI (on CPU)

Create a Dockerfile:

FROM python:3.11-slim
RUN apt-get update && apt-get install -y git build-essential
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
RUN git clone https://github.com/oobabooga/text-generation-webui
WORKDIR /text-generation-webui
RUN pip install -r requirements_cpu_only.txt
COPY start.sh /start.sh
RUN chmod +x /start.sh
ENTRYPOINT ["/start.sh"]

The startup script start.sh:

#!/bin/bash
rm -rf models && ln -s /models models
python server.py --listen --cpu

docker-compose.yml:

version: '3.8'
services:
  textgen:
    build: .
    container_name: textgen_cpu
    ports:
      - "7860:7860"
    volumes:
      - ./models:/models

Launch:

docker compose up -d

You can also interact with the model through its HTTP API, which makes it easy to integrate as a module into any information system. The endpoint for text generation is:

POST http://localhost:11434/api/generate

An API request can be sent with any HTTP tool or library, for example curl:

curl http://localhost:11434/api/generate \
-d '{
"model": "llama3",
"prompt": "Tell me what quantum physics is in simple words",
"stream": false
}'
  • model - the name of the model, for example llama3, mistral, gemma, phi, etc.
  • prompt - the text of the request
  • stream - if `true`, the response arrives in parts (convenient for chat); if `false`, the entire response arrives at once
  • options - additional generation parameters
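With "stream": true, the endpoint returns one JSON object per line (NDJSON), each carrying a fragment of the answer in its "response" field. As a minimal sketch, the fragments can be extracted in plain shell; the extract_response helper is our own, and the sample lines below only illustrate the format:

```shell
#!/bin/sh
# extract_response is our own helper, not part of Ollama: it pulls the
# "response" field out of each streamed NDJSON line with sed.
# jq would be more robust if it is available.
extract_response() {
  sed -n 's/.*"response":"\([^"]*\)".*/\1/p'
}

# Illustrative sample of the line format the server streams back;
# in practice you would pipe the output of curl (with "stream": true) here.
printf '%s\n' \
  '{"model":"llama3","response":"Hello","done":false}' \
  '{"model":"llama3","response":" world","done":true}' \
  | extract_response
```

The object with "done": true marks the final chunk; it typically also carries timing statistics for the whole generation.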

Example with generation settings:

curl http://localhost:11434/api/generate \
-d '{
"model": "llama3",
"prompt": "Write a short story about a robot.",
"stream": false,
"options": {
"temperature": 0.8,
"top_k": 40,
"top_p": 0.9,
"num_predict": 200
}
}'
  • temperature - randomness control; higher values give more creative output
  • top_k - limits how many tokens are considered when choosing the next one
  • top_p - probability mass control (nucleus sampling)
  • num_predict - the maximum number of tokens to generate
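The same request can be wrapped in a small shell helper that assembles the JSON body. This is only a sketch under our own naming (generate_payload is not part of Ollama); it prints the payload, which you would pipe to curl against a running container:

```shell
#!/bin/sh
# generate_payload is our own helper name: it builds a /api/generate
# request body with a configurable temperature (default 0.8).
generate_payload() {
  model="$1"; prompt="$2"; temperature="${3:-0.8}"
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"temperature":%s}}' \
    "$model" "$prompt" "$temperature"
}

generate_payload llama3 "Write a short story about a robot." 0.8
# To actually send the request to a running container:
# generate_payload llama3 "Write a short story about a robot." \
#   | curl -s http://localhost:11434/api/generate -d @-
```

Note that the prompt is inserted verbatim, so quotes inside it would need JSON escaping; jq or a scripting language is safer for untrusted input.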

5. Optimization for CPU

Limit the Docker container's CPU and memory:

docker run --cpus=4 --memory=16g ollama/ollama

Also use quantized models (GGUF/ggml) to reduce CPU and memory load, and note that the CLI interface (via curl) responds faster than the WebUI.
