Run Ollama Portable Zip on Intel GPU with IPEX-LLM

< English | 中文 >

This guide demonstrates how to use the Ollama portable zip to directly run Ollama on Intel GPU with ipex-llm (without the need for manual installation).

Note

Ollama portable zip has been verified on:

  • Intel Core Ultra processors
  • Intel Core 11th - 14th gen processors
  • Intel Arc A-Series GPU
  • Intel Arc B-Series GPU

Table of Contents

  • Windows Quickstart
  • Linux Quickstart
  • Tips & Troubleshooting

Windows Quickstart

Note

We recommend using Windows 11.

Prerequisites

Check your GPU driver version, and update it if needed:

  • For Intel Core Ultra processors (Series 2) or Intel Arc B-Series GPU, we recommend updating your GPU driver to the latest version

  • For other Intel iGPU/dGPU, we recommend using GPU driver version 32.0.101.6078

Step 1: Download and Unzip

Download the IPEX-LLM Ollama portable zip for Windows users from the link.

Then, extract the zip file to a folder.

Step 2: Start Ollama Serve

Start Ollama serve as follows:

  • Open "Command Prompt" (cmd), and enter the extracted folder through cd /d PATH\TO\EXTRACTED\FOLDER
  • Run start-ollama.bat in the "Command Prompt". A separate window running Ollama serve will then pop up.
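
For example, with PATH\TO\EXTRACTED\FOLDER replaced by your own extraction path, the combined commands in the "Command Prompt" are:

cd /d PATH\TO\EXTRACTED\FOLDER
start-ollama.bat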

Step 3: Run Ollama

You could then use Ollama to run LLMs on Intel GPUs by running ollama run deepseek-r1:7b in the same "Command Prompt" (not the pop-up window). You may use any other model.
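
For example, in the same "Command Prompt":

ollama run deepseek-r1:7b

After you exit the interactive session, ollama list shows the models that have been downloaded.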

Linux Quickstart

Prerequisites

Check your GPU driver version, and update it if needed; we recommend following the Intel client GPU driver installation guide to install your GPU driver.

Step 1: Download and Extract

Download the IPEX-LLM Ollama portable tgz for Ubuntu users from the link.

Then open a terminal and extract the tgz file to a folder:

tar -xvf [Downloaded tgz file path]

Step 2: Start Ollama Serve

Enter the extracted folder, and run start-ollama.sh to start the Ollama service.

cd PATH/TO/EXTRACTED/FOLDER
./start-ollama.sh

Step 3: Run Ollama

You could then use Ollama to run LLMs on Intel GPUs as follows:

  • Open another terminal, and enter the extracted folder through cd PATH/TO/EXTRACTED/FOLDER
  • Run ./ollama run deepseek-r1:7b (you may use any other model)
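
Putting the two steps together:

cd PATH/TO/EXTRACTED/FOLDER
./ollama run deepseek-r1:7b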

Tips & Troubleshooting

Speed up model download using alternative sources

Ollama by default downloads models from the Ollama library. By setting the environment variable IPEX_LLM_MODEL_SOURCE to modelscope or ollama before running Ollama, you could switch the source from which the model is downloaded.

For example, if you would like to run deepseek-r1:7b but the download speed from the Ollama library is slow, you could download the model from ModelScope as follows:

  • For Windows users:

    • In the "Command Prompt", navigate to the extracted folder by cd /d PATH\TO\EXTRACTED\FOLDER
    • Run set IPEX_LLM_MODEL_SOURCE=modelscope in "Command Prompt"
    • Run ollama run deepseek-r1:7b
  • For Linux users:

    • In a terminal other than the one for Ollama serve, navigate to the extracted folder by cd PATH/TO/EXTRACTED/FOLDER
    • Run export IPEX_LLM_MODEL_SOURCE=modelscope in the terminal
    • Run ./ollama run deepseek-r1:7b
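
For example, the complete sequence on Linux (in a terminal other than the one running Ollama serve) would be:

cd PATH/TO/EXTRACTED/FOLDER
export IPEX_LLM_MODEL_SOURCE=modelscope    # download from ModelScope instead of the Ollama library
./ollama run deepseek-r1:7b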

Tip

A model downloaded with IPEX_LLM_MODEL_SOURCE=modelscope will still show its actual model id in ollama list, e.g.

NAME                                                             ID              SIZE      MODIFIED
modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M    f482d5af6aec    4.7 GB    About a minute ago

For commands other than ollama run and ollama pull, the model should be identified by its actual id, e.g. ollama rm modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M

Increase context length in Ollama

By default, Ollama runs models with a context window of 2048 tokens. That is, the model can "remember" at most 2048 tokens of context.

To increase the context length, you could set the environment variable IPEX_LLM_NUM_CTX before starting Ollama serve, as shown below (if Ollama serve is already running, please make sure to stop it first):

  • For Windows users:

    • Open "Command Prompt", and navigate to the extracted folder through cd /d PATH\TO\EXTRACTED\FOLDER
    • Set IPEX_LLM_NUM_CTX to the desired length in the "Command Prompt", e.g. set IPEX_LLM_NUM_CTX=16384
    • Start Ollama serve through start-ollama.bat
  • For Linux users:

    • In a terminal, navigate to the extracted folder through cd PATH/TO/EXTRACTED/FOLDER
    • Set IPEX_LLM_NUM_CTX to the desired length in the terminal, e.g. export IPEX_LLM_NUM_CTX=16384
    • Start Ollama serve through ./start-ollama.sh
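
For example, to start Ollama serve with a 16384-token context window on Linux:

cd PATH/TO/EXTRACTED/FOLDER
export IPEX_LLM_NUM_CTX=16384    # desired context length in tokens
./start-ollama.sh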

Tip

IPEX_LLM_NUM_CTX has a higher priority than the num_ctx settings in a model's Modelfile.

Select specific GPU(s) to run Ollama when multiple ones are available

If your machine has multiple Intel GPUs, Ollama will by default run on all of them.

To specify which Intel GPU(s) you would like Ollama to use, you could set the environment variable ONEAPI_DEVICE_SELECTOR before starting Ollama serve, as follows (if Ollama serve is already running, please make sure to stop it first):

  • Identify the ids (e.g. 0, 1, etc.) of your GPUs. You could find them in the logs of Ollama serve when loading any model.

  • For Windows users:

    • Open "Command Prompt", and navigate to the extracted folder by cd /d PATH\TO\EXTRACTED\FOLDER
    • In the "Command Prompt", set ONEAPI_DEVICE_SELECTOR to define the Intel GPU(s) you want to use, e.g. set ONEAPI_DEVICE_SELECTOR=level_zero:0 (on single Intel GPU), or set ONEAPI_DEVICE_SELECTOR=level_zero:0;level_zero:1 (on multiple Intel GPUs), in which 0, 1 should be changed to your desired GPU id
    • Start Ollama serve through start-ollama.bat
  • For Linux users:

    • In a terminal, navigate to the extracted folder by cd PATH/TO/EXTRACTED/FOLDER
    • Set ONEAPI_DEVICE_SELECTOR to define the Intel GPU(s) you want to use, e.g. export ONEAPI_DEVICE_SELECTOR=level_zero:0 (for a single Intel GPU), or export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1" (for multiple Intel GPUs), in which 0 and 1 should be changed to your desired GPU ids
    • Start Ollama serve through ./start-ollama.sh
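
For example, to run Ollama on the first two Intel GPUs on Linux (with 0 and 1 changed to your desired GPU ids):

cd PATH/TO/EXTRACTED/FOLDER
export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"    # restrict Ollama to GPU 0 and GPU 1
./start-ollama.sh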

Tune performance

Here are some settings you could try to tune the performance:

Environment variable SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS

The environment variable SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS controls whether immediate command lists are used when submitting tasks to the GPU. You could experiment with SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 or 0 to find the setting that performs best on your machine.

To enable SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS, set it before starting Ollama Serve, as shown below (if Ollama serve is already running, please make sure to stop it first):

  • For Windows users:

    • Open "Command Prompt", and navigate to the extracted folder through cd /d PATH\TO\EXTRACTED\FOLDER
    • Run set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 in "Command Prompt"
    • Start Ollama serve through start-ollama.bat
  • For Linux users:

    • In a terminal, navigate to the extracted folder through cd PATH\TO\EXTRACTED\FOLDER
    • Run export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 in the terminal
    • Start Ollama serve through ./start-ollama.sh
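
For example, on Linux:

cd PATH/TO/EXTRACTED/FOLDER
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1    # try both 1 and 0 to see which performs better
./start-ollama.sh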

Tip

You could refer to here for more information about Level Zero Immediate Command Lists.

Additional models supported after Ollama v0.5.4

The current Ollama Portable Zip is based on Ollama v0.5.4; in addition, the following new models are also supported in the Ollama Portable Zip:

| Model | Download (Windows) | Download (Linux) | Model Link |
|-------|--------------------|------------------|------------|
| DeepSeek-R1 | ollama run deepseek-r1 | ./ollama run deepseek-r1 | deepseek-r1 |
| Openthinker | ollama run openthinker | ./ollama run openthinker | openthinker |
| DeepScaleR | ollama run deepscaler | ./ollama run deepscaler | deepscaler |
| Phi-4 | ollama run phi4 | ./ollama run phi4 | phi4 |
| Dolphin 3.0 | ollama run dolphin3 | ./ollama run dolphin3 | dolphin3 |
| Smallthinker | ollama run smallthinker | ./ollama run smallthinker | smallthinker |
| Granite3.1-Dense | ollama run granite3-dense | ./ollama run granite3-dense | granite3.1-dense |
| Granite3.1-Moe-3B | ollama run granite3-moe | ./ollama run granite3-moe | granite3.1-moe |