I have been using large models for over a year now, from the early ChatGPT 3.5/4 to today's increasingly capable domestic and open-source models such as Llama and SD. In this post I will introduce two tools I have been using for a long time: Ollama and Open-WebUI.

Ollama

Ollama is an open-source framework designed for conveniently deploying and running large language models (LLMs) on local machines. It bundles model weights, configuration, and data into a single package defined by a Modelfile, and provides a complete toolchain for pulling, running, customizing, and serving models.

Project Address: https://github.com/ollama/ollama

Main Features of Ollama:

  • Local Deployment and Running: One of Ollama's main goals is to simplify deploying and running large language models locally, using a Docker-like packaging workflow. This makes it easy for non-professional users to manage and run these complex models.
  • Lightweight and Scalable: As a lightweight framework, Ollama maintains a small resource footprint while offering good scalability. This allows users to adjust configurations based on the scale of their projects and hardware conditions.
  • API Support: Ollama provides a simple API, enabling developers to easily create, run, and manage instances of large language models. This lowers the technical barrier for interacting with the models.
  • Pre-built Model Library: Ollama includes a range of pre-trained large language models that users can directly use in their applications without having to train or find the models themselves.
  • Model Import and Customization:
    • Import from GGUF: Supports importing existing large language models distributed in the GGUF file format.
    • Import from PyTorch or Safetensors: Models saved in PyTorch or Safetensors format can also be converted and imported, allowing users to bring models trained elsewhere into Ollama.
    • Custom Prompts: Allows users to add or modify prompts for the models to guide them in generating specific types or styles of text output.
  • Cross-platform Support: Provides installation guides for macOS, Windows (preview), Linux, and Docker, ensuring that users can deploy and use Ollama smoothly across multiple operating system environments.
  • Command-line Tools and Environment Variables:
    • Command-line Start: The Ollama service is started via the command ollama serve (alias: start).
    • Environment Variable Configuration: Variables such as OLLAMA_HOST specify the host address and port the service listens on; users can modify these as needed (see the example after this list).
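
As a quick illustration of OLLAMA_HOST, here is a minimal sketch of starting the service on a custom listen address (the address and port here are placeholders, not values from the original setup):

# Bind the Ollama service to all interfaces on the default port 11434
OLLAMA_HOST=0.0.0.0:11434 ollama serve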

In addition, Ollama offers a rich API, community support, and documentation, making it a powerful tool for developers and individual users to run and manage large language models locally.

Quick Installation of Ollama

Linux installation command for Ollama:

curl -fsSL https://ollama.com/install.sh | sh
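
On Linux, the install script also registers Ollama as a systemd service. A quick sanity check after installation (an optional step, not part of the original post):

ollama --version          # Print the installed Ollama version
systemctl status ollama   # Confirm the Ollama service is running (systemd-based distributions)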

Basic Ollama Commands

ollama pull llama3 # Pull a large model
ollama run llama3  # Run a large model directly
ollama rm llama3   # Remove the local llama3 large model
ollama list        # List local large models

Dialogue with the Chinese version of Llama3:

# ollama pull wangshenzhi/llama3-8b-chinese-chat-ollama-q8:latest
# ollama run wangshenzhi/llama3-8b-chinese-chat-ollama-q8:latest
>>> Who are you?
I am the Llama3-8B Chinese chat robot, fine-tuned by the ORPO alignment algorithm on Llama3-8B-Instruct and a mix of Chinese-English data. My developer is Wang Shen Zhi (Chinese name). I am a helpful, respectful, honest, and reliable assistant, dedicated to providing you with safe and accurate information and answers, while promoting meaningful conversations.

>>> How to fine-tune the large model Llama3 and what tools are used for fine-tuning large models, please provide a detailed introduction.
LLaMA 3 is a powerful text generation model developed by OpenAI, based on the transformer architecture. It performs exceptionally well in various natural language processing (NLP) tasks, such as text classification, machine translation, and dialogue systems. Fine-tuning large pre-trained models like LLaMA 3 can significantly improve their performance and adapt them to specific tasks or domains.

The fine-tuning process involves additional training on the pre-trained model to better adapt it to a new target dataset. This can be achieved by adding new data, adjusting hyperparameters, or changing the optimization algorithm. The primary goal of fine-tuning is to enable the model to learn from tasks with specific structures and contexts and perform well on these tasks.

Fine-tuning LLaMA 3 typically involves the following steps:

1. **Data Preparation**: Collect and preprocess new data for fine-tuning. The data should be relevant to the target task, specifically for a chatbot task.
2. **Model Selection**: Choose an appropriate architecture, such as LLaMA 3. Adjustments can be made based on the dataset and task requirements, such as increasing or decreasing the number of layers or hidden units.
3. **Hyperparameter Setting**: Adjust the hyperparameters for the fine-tuning process, such as learning rate, batch size, and number of training iterations, to optimize model performance.
4. **Training**: Fine-tune the model using the new dataset. This is usually done on a GPU, which can significantly speed up the fine-tuning process.

Benefits of fine-tuning include:

* **Task-specific Performance**: Fine-tuning can significantly improve the model's performance on the target task, especially when the target dataset has unique structures and contexts.
* **Better Generalization**: Fine-tuning helps the model learn patterns in the new dataset, enabling it to make more accurate predictions on previously unseen inputs.
* **Resource Efficiency**: Since only a small amount of additional training is required, fine-tuning can be performed with fewer computational resources, saving time and cost.

In summary, fine-tuning large pre-trained models like LLaMA 3 can significantly enhance their performance on target tasks and adapt them to specific domains. By carefully adjusting hyperparameters, datasets, and architectures, fine-tuning can help create more targeted models, leading to better results.
>>> /?
Available Commands:
  /set            Set session variables
  /show           Show model information
  /load <model>   Load a session or model
  /save <model>   Save your current session
  /bye            Exit
  /?, /help       Help for a command
  /? shortcuts    Help for keyboard shortcuts

Use """ to begin a multi-line message.

>>> /bye

Using Ollama to Adjust Large Model Prompts and Parameters

Here, we create a Modelfile based on Alibaba’s open-source Qwen large model:

FROM qwen:7b

# Set the temperature (the higher, the more creative; the lower, the more coherent)
PARAMETER temperature 0.8

# Set context token size
PARAMETER num_ctx 8192

# Set system message
SYSTEM """
You are an AI assistant named Wanzi, developed and provided by wnote company.
You are proficient in speaking English and telling jokes.
"""
# ollama create wanzi -f ./Modelfile
transferring model data
reading model metadata
creating system layer
creating parameters layer
creating config layer
using already created layer sha256:87f26aae09c7f052de93ff98a2282f05822cc6de4af1a2a159c5bd1acbd10ec4
using already created layer sha256:7c7b8e244f6aa1ac8c32b74f56d42c41a0364dd2dabed8d9c6030a862e805b54
using already created layer sha256:1da0581fd4ce92dcf5a66b1da737cf215d8dcf25aa1b98b44443aaf7173155f5
writing layer sha256:1744dbff14b6e7fb2a846c118720f4627cdece7180bdb9d6f176c44bc6308718
writing layer sha256:02d401d3225083c307561b27d81137360bf39c01259dddbb267897cdcde6efec
writing layer sha256:d052490c542f2b7c49e4a770cc3239a58ca7261f75bc1730dd2b4ce5999de0b0
writing manifest
success
# ollama run wanzi
>>> Who are you?
I am Wanzi, an AI assistant developed by wnote company. I am proficient in communicating in English and occasionally tell humorous jokes to liven up the atmosphere. Is there anything I can help you with?
>>> Tell me a very funny joke.
Sure, here’s a funny one:

One day, a penguin walked into a bar, dressed in a formal tuxedo, looking both comical and mysterious.

The bartender looked at the penguin and exclaimed, "Wow, are you here for a party?"

The penguin thought for a moment, shrugged his round shoulders, and said in his distinctive deep voice, "No, I just want to order a 'Zero Degree Loneliness.'"

Everyone was stunned, and then burst into laughter. From then on, the bar menu had a real "joke" — Zero Degree Loneliness.

I hope this joke made you laugh!

REST API Requests

Generate response:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen:7b",
  "prompt":"Why are large models so popular now?"
}'
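
By default, /api/generate streams the answer back as a series of JSON objects. To receive a single JSON response instead, add "stream": false to the request (a small variation on the example above):

curl http://localhost:11434/api/generate -d '{
  "model": "qwen:7b",
  "prompt": "Why are large models so popular now?",
  "stream": false
}'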

Model conversation:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen:7b",
  "messages": [
    { "role": "user", "content": "Why are large models so popular now?" }
  ]
}'
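
The chat endpoint also accepts a system message and an options object for runtime parameters such as temperature; here is a minimal sketch (the system prompt and temperature value are illustrative, not from the original post):

curl http://localhost:11434/api/chat -d '{
  "model": "qwen:7b",
  "messages": [
    { "role": "system", "content": "You are a concise technical assistant." },
    { "role": "user", "content": "Why are large models so popular now?" }
  ],
  "options": { "temperature": 0.8 },
  "stream": false
}'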

Ollama supports many more large models; to explore them, visit: https://ollama.com/library

Open-WebUI

Open-WebUI is an extensible, feature-rich, and user-friendly self-hosted web UI, developed using the Svelte front-end framework, designed to run completely offline. It supports various LLM runners, including APIs compatible with Ollama and OpenAI.

Previously, operations were done via Ollama command line, but now they can be fully managed through the web interface.

Open-WebUI Features

  • Easy Installation: Seamless installation using Docker or Kubernetes (kubectl, kustomize, or helm). Supports :ollama and :cuda tagged images for a hassle-free experience.

  • Ollama/OpenAI API Integration: Easily integrate OpenAI-compatible APIs for versatile conversations with Ollama models. Customize OpenAI API URLs to link to LMStudio, GroqCloud, Mistral, OpenRouter, etc.

  • Pipelines, Open WebUI Plugin Support: Seamlessly integrate custom logic and Python libraries into Open WebUI using the Pipelines Plugin Framework. Launch a pipelines instance, set the OpenAI URL to the pipelines URL, and explore endless possibilities. Examples include function calling, user rate limiting to control access, usage monitoring with tools like Langfuse, real-time translation with LibreTranslate for multilingual support, toxic message filtering, and more.

  • Responsive Design: Enjoy a seamless experience on desktops, laptops, and mobile devices.

  • Mobile Progressive Web App (PWA): Use our PWA to enjoy a native app-like experience on your mobile device, providing offline access and a seamless user interface on localhost.

  • Full Markdown and LaTeX Support: Enhance your LLM experience with comprehensive Markdown and LaTeX features for rich interactions.

  • Model Builder: Easily create Ollama models through the Web UI. Create and add custom roles/agents, customize chat elements, and import models effortlessly via Open WebUI Community integration.

  • Local RAG Integration: Dive into the future of chat interactions with groundbreaking Retrieval-Augmented Generation (RAG) support. This feature seamlessly integrates document interactions into your chat experience. You can directly load documents into the chat or add files to the document library and easily access them with the # command before querying.

  • Web Search for RAG: Perform web searches using providers like SearXNG, Google PSE, Brave Search, serpstack, and serper, and inject the results directly into your chat experience.

  • Web Browsing Functionality: Integrate websites seamlessly into your chat experience by using the # command followed by a URL. This feature allows you to merge web content directly into your conversations, enhancing the richness and depth of interactions.

  • Image Generation Integration: Seamlessly integrate image generation capabilities using options like AUTOMATIC1111 API or ComfyUI (local) and OpenAI’s DALL-E (external) to enrich your chat experience with dynamic visual content.

  • Multi-Model Conversations: Effortlessly engage with several models simultaneously, leveraging their unique strengths for the best responses. Enhance your experience by utilizing a diverse set of models in parallel.

  • Role-Based Access Control (RBAC): Ensure secure access with restricted permissions; only authorized individuals can access your Ollama instance, and model creation/pulling permissions are reserved for administrators.

  • Multilingual Support: Experience Open WebUI in your preferred language with our internationalization (i18n) support. Join us and expand the languages we support! We are actively seeking contributors!

  • Continuous Updates: We are committed to improving Open WebUI through regular updates, fixes, and new features.

Deploying Open-WebUI

To use it with Ollama, quickly deploy Open-WebUI using docker-compose:

Refer to: https://github.com/valiantlynx/ollama-docker

Since I am using an Nvidia Tesla T4 card locally, I will use the GPU configuration (the Nvidia drivers must be installed beforehand, which is not covered here).

docker-compose-ollama-gpu.yaml

version: '3.8'

services:
  app:
    build: .
    ports:
      - 8000:8000
      - 5678:5678
    volumes:
      - .:/code
    command: uvicorn src.main:app --host 0.0.0.0 --port 8000 --reload
    restart: always
    depends_on:
      - ollama
      - ollama-webui
    networks:
      - ollama-docker

  ollama:
    volumes:
      - ./ollama/ollama:/root/.ollama
    container_name: ollama
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11434:11434
    networks:
      - ollama-docker
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  ollama-webui:
    image: ghcr.io/ollama-webui/ollama-webui:main
    container_name: ollama-webui
    volumes:
      - ./ollama/ollama-webui:/app/backend/data
    depends_on:
      - ollama
    ports:
      - 8080:8080
    environment:
      - 'OLLAMA_API_BASE_URL=http://ollama:11434/api'
    extra_hosts:
      - host.docker.internal:host-gateway
    restart: unless-stopped
    networks:
      - ollama-docker

networks:
  ollama-docker:
    external: false

Start the services:

docker-compose -f docker-compose-ollama-gpu.yaml up -d
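
Once the containers are up, it is worth confirming that the Ollama API is reachable and that the GPU is visible inside the container (assuming the ports and container name from the compose file above):

curl http://localhost:11434/api/tags          # List models available to the Ollama container
docker exec -it ollama nvidia-smi             # Verify the GPU is visible inside the container
docker exec -it ollama ollama pull qwen:7b    # Pull a first model inside the container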

Access http://xxx.xxx.xxx.xxx:8080/ and register an account to start using it; the first account registered becomes the administrator.

(Screenshot: model001.png)

Summary

The combination of Ollama and Open-WebUI brings many advantages. First, through an intuitive UI, even non-technical users can easily deploy large language models locally and chat with them. Second, Open-WebUI layers rich features on top of Ollama, such as conversation management, model management, multi-model chat, and RAG, making it easier to work with local models day to day. Finally, the open-source nature of Ollama and Open-WebUI allows users to customize and extend them according to their needs.