Create a REST API for the Microsoft BitNet b1.58 model and integrate it with Open WebUI
This post is part write-up, part mild complaint lol. Normally I use models from Ollama, but I happened to come across this post on X (Twitter) about a Microsoft model that people say runs just fine on CPU, and even faster on something like an Apple M2.
Microsoft just dropped a 1-bit LLM with 2B parameters that can run on CPUs like Apple M2. BitNet b1.58 2B4T outperforms fp LLaMA 3.2 1B while using only 0.4GB memory versus 2GB and processes tokens 40% faster.
100% opensource. pic.twitter.com/kTeqTs6PHd
— Shubham Saboo (@Saboo_Shubham_) April 18, 2025
And that model is microsoft/BitNet b1.58 2B4T. After seeing the news in APR-2025, I waited to see if anyone would try to implement it in Ollama. I saw people asking about it too, but there was still no update.
I kept waiting until June 2025 and still nothing. Oh well, time to find a way to run it myself from the code then. I set a simple initial goal: put the model in a container and find something that can expose an endpoint that works with Open WebUI (a web interface for chatting, like ChatGPT). This blog documents my experience with that experiment.
If you want the Thai version, you can read it here.
Table of Contents
- Getting Ready to Run microsoft/BitNet
  - Linux (I actually tried this in Docker)
  - Windows: This one has quite a few steps, for those who like challenges
- Writing Code to Use Model from Hugging Face
- Reference
Getting Ready to Run microsoft/BitNet
- Linux (I actually tried this in Docker)
I start from a Python image and install dependencies with the following Dockerfile:
```dockerfile
# Use official Python 3.12 image
FROM python:3.12-slim

# Install system dependencies for PyTorch and build tools
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    cmake \
    git \
    curl \
    ca-certificates \
    libopenblas-dev \
    libomp-dev \
    libssl-dev \
    libffi-dev \
    wget \
    && rm -rf /var/lib/apt/lists/*

# (Optional) Set a working directory
WORKDIR /app

# Copy your requirements.txt if you have one
COPY requirements.txt .
RUN pip install --upgrade pip && pip install -r requirements.txt
```
Create a requirements.txt file:
```
fastapi==0.110.2
uvicorn[standard]==0.29.0
transformers==4.52.4
torch==2.7.0
numpy==1.26.4
accelerate==0.29.0
```
Run it normally (standard execution)
```bash
# Build the image
docker build -t python-bitNet .

# Run the container with port forwarding and mounting your code
docker run -it -p 8888:8888 -v "$PWD":/app python-bitNet /bin/bash
```
Use a DevContainer (this is the method I tried)
- Windows: This one has quite a few steps, for those who like challenges
I say it's challenging because I tried it and got stuck for 2 weeks lol. On Linux, it's done in a flash. For anyone who wants to try, you need to have the following:
- For Visual Studio, you need to install the additional C++ components.
- Running the build from regular PowerShell won't work; you have to run it in the Developer Command Prompt for VS 2022 or Developer PowerShell for VS 2022.
The regular terminal doesn't set all the variables (like PATH) properly, so you'll run into errors like
Error C1083: Cannot open include file: 'algorithm': No such file or directory
Even if you run vcvarsall.bat x64 yourself, it's hit or miss: sometimes it works, sometimes it doesn't.
Note: vcvarsall.bat is located at "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvarsall.bat"
- Add the Python lib directory (the folder containing python312.lib) to the linker path; if you don't include it, the build crashes with
fatal error LNK1104: cannot open file 'python312.lib'
The Python lib folder has to be on the path so the C++ libraries used for AI inference can link; if you're not sure where that folder is, see the snippet below.
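A quick way to print the folder you need to add is a small sketch like this (an assumption on my part: on a standard Windows CPython install, the import libraries live under the "libs" subfolder of the Python prefix):

```python
# Print the folder that holds python312.lib so it can be added to the
# linker search path (Windows CPython keeps its import libraries in
# <prefix>\libs; adjust if your install is non-standard).
import sys
from pathlib import Path

libs_dir = Path(sys.base_prefix) / "libs"
print(libs_dir, "(exists)" if libs_dir.exists() else "(NOT FOUND)")
```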
And once the environment is ready:
- Set up a virtual environment
```bash
# Set ENV
python3 -m venv bitnet-env
# or
python -m venv bitnet-env
```
- Activate the virtual environment
```bash
# Linux
source bitnet-env/bin/activate

# Windows - PowerShell
.\bitnet-env\Scripts\Activate.ps1

# Windows - CMD
.\bitnet-env\Scripts\activate.bat
```
- Install required libraries according to requirements.txt - if you're using the Linux/Docker approach, these dependencies are already included in the Container image.
```
fastapi==0.110.2
uvicorn[standard]==0.29.0
transformers==4.52.4
torch==2.7.0
numpy==1.26.4
accelerate==0.29.0
```

```bash
pip install --upgrade pip && pip install -r requirements.txt
pip install git+github.com/huggingface/transfo…
```
Writing Code to Use Model from Hugging Face
After resolving all the environment issues, let's get to the code. As I mentioned, I wanted to connect it to Open WebUI, so I made two versions: a command-line version and an API version.
- Command Line
I tried writing code using:
- Transformers: For loading pre-trained models from Hugging Face
- PyTorch: For running inference on the model loaded through Transformers (my machine's specs are only good enough for inference, not for training/fine-tuning). PyTorch shows up in several places, such as:
- bfloat16: Uses half the memory of FP32, with less precision (fewer mantissa bits) than FP16
- return_tensors="pt": Specifies PyTorch tensor format
- to(model.device): Moves the inputs to the model's device, enabling GPU acceleration if CUDA is available
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/bitnet-b1.58-2B-4T"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    force_download=True,
)

# Apply the chat template + role
messages = [
    {"role": "system", "content": "You are a Senior Programmer."},
    {"role": "user", "content": "Can you help me with a coding problem?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
chat_input = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response
chat_outputs = model.generate(**chat_input, max_new_tokens=50)
response = tokenizer.decode(chat_outputs[0][chat_input['input_ids'].shape[-1]:], skip_special_tokens=True)
print("\nAssistant Response:", response)
```

The command-line version - you'll see that Windows has many limitations.
Another version adds a loop and keeps prompting until you type "Thank you BITNET" - you can see the source code here.
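I won't paste the loop version in full, but the idea is roughly this (a minimal sketch assuming tokenizer and model are already loaded as in the script above, with the same "Thank you BITNET" exit phrase):

```python
# Simple chat loop: keep prompting until the user types the exit phrase
history = [{"role": "system", "content": "You are a Senior Programmer."}]

while True:
    user_input = input("You: ")
    if user_input.strip() == "Thank you BITNET":
        break
    history.append({"role": "user", "content": user_input})
    prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
    chat_input = tokenizer(prompt, return_tensors="pt").to(model.device)
    chat_outputs = model.generate(**chat_input, max_new_tokens=200)
    reply = tokenizer.decode(
        chat_outputs[0][chat_input["input_ids"].shape[-1]:],
        skip_special_tokens=True,
    )
    history.append({"role": "assistant", "content": reply})
    print("Assistant:", reply)
```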
- API
Note: I didn't research which libraries can make our API connect directly to Open WebUI. Initially, I checked which connection standards Open WebUI supports; for the text prompt part, it has OpenAI and Ollama sections.
I chose to go with the OpenAI API because, when I played with the dotnet Semantic Kernel before, it used the /v1/chat/completions pattern, so I started from there and added the connection in Open WebUI to see which paths it hits in our code.
From what I tested, there are at least 3 API endpoints that Open WebUI calls on our side:
- /v1/chat/completions
- /v1/models
- /health
For /v1/chat/completions, I just kept adding fields based on what it complained about (plus asking AI) until I had all 3 endpoints working, like this:
```python
import time
import uuid
from datetime import datetime
from typing import List, Dict, Optional

import torch
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

# Load model and tokenizer at startup
model_id = "microsoft/bitnet-b1.58-2B-4T"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    force_download=True,
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    messages: List[Message]
    max_new_tokens: Optional[int] = 700

class Choice(BaseModel):
    index: int
    message: Dict[str, str]
    finish_reason: str

class ChatResponse(BaseModel):
    id: str
    object: str
    created: int
    model: str
    choices: List[Choice]

@app.post("/v1/chat/completions", response_model=ChatResponse)
async def chat_completions(request: ChatRequest):
    # Prepare prompt using chat template
    prompt = tokenizer.apply_chat_template(
        [msg.dict() for msg in request.messages],
        tokenize=False,
        add_generation_prompt=True
    )
    chat_input = tokenizer(prompt, return_tensors="pt").to(model.device)
    chat_outputs = model.generate(**chat_input, max_new_tokens=request.max_new_tokens)
    response = tokenizer.decode(
        chat_outputs[0][chat_input['input_ids'].shape[-1]:],
        skip_special_tokens=True
    )
    # Return response in OpenAI-compatible format
    return ChatResponse(
        id=f"chatcmpl-{uuid.uuid4().hex[:12]}",
        object="chat.completion",
        created=int(time.time()),
        model=model_id,
        choices=[
            Choice(
                index=0,
                message={"role": "assistant", "content": response},
                finish_reason="stop"
            )
        ]
    )

@app.get("/")
def root():
    """Root endpoint with API info"""
    return JSONResponse({
        "message": "OpenAI-Compatible API for Open WebUI",
        "version": "1.0.0",
        "endpoints": {
            "models": "/v1/models",
            "chat": "/v1/chat/completions",
            "health": "/health"
        }
    })

@app.get("/health")
def health_check():
    """Health check endpoint"""
    return JSONResponse({"status": "healthy", "timestamp": datetime.now().isoformat()})

@app.get("/v1/models")
def list_models():
    """List available models"""
    return JSONResponse({
        "data": [
            {
                "id": model_id,
                "object": "model",
                "created": datetime.now().isoformat(),
                "owned_by": "microsoft",
                "permission": []
            }
        ]
    })
```
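To actually serve this, the app has to be run with uvicorn, either from the CLI (uvicorn main:app --host 0.0.0.0 --port 8888) or with a small entry point like the sketch below. The filename main.py and port 8888 are my assumptions, chosen to match the docker run port mapping earlier.

```python
# Entry point for running the API directly with: python main.py
# (port 8888 is an assumption matching the docker run -p 8888:8888 mapping above)
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8888)
```

In Open WebUI, the server then gets registered as an OpenAI-style connection pointing at the API's /v1 base URL.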
When it came time to use it, I packaged it into Docker. During the build, I was shocked by the image size: almost 10 GB.
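Before pointing Open WebUI at it, a quick sanity check against the chat endpoint is handy. This is just a sketch; it assumes the container is exposed on localhost:8888 as in the earlier docker run command:

```python
# Smoke test for /v1/chat/completions (assumes the API is on localhost:8888)
import requests

payload = {
    "messages": [
        {"role": "system", "content": "You are a Senior Programmer."},
        {"role": "user", "content": "Say hello in one sentence."},
    ],
    "max_new_tokens": 100,
}
resp = requests.post("http://localhost:8888/v1/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```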
Tried using it for real and connecting it with Open WebUI - sometimes it gives okay answers, sometimes it hallucinates lol
But what's for sure is the CPU usage shoots up lol
That concludes my rough trial of running the Model, and if I find something better, I'll write another blog post. Feel free to reach out with suggestions. Oh, and don't force it on Windows - mine got squeezed by WSL2. Taking an old notebook and installing Linux to make a Local AI Inference Engine is still faster.
For all the code, I've uploaded it to Git: github.com/pingkunga/python_mi…