Use ik_llama.cpp from 4D
llama.cpp is an open-source project that allows you to run Meta’s LLaMA language models locally on CPUs without heavy frameworks like PyTorch or TensorFlow. Essentially, it’s a lightweight C++ implementation optimized for inference.
ik_llama.cpp is a fork with additional SOTA quants and improved performance. Most of the optimisations by Ivan Kawrakow (ik) have been merged into the standard version of llama.cpp. This version is no longer relevant, especially for macOS and Windows.
Instantiate cs.llama.llama in your On Startup database method:
var $llama : cs.llama.llama
If (True)
$llama:=cs.llama.llama.new() //default
Else
var $modelsFolder : 4D.Folder
$modelsFolder:=Folder(fk home folder).folder(".llama-cpp")
var $lang; $URL : Text
var $file : 4D.File
$lang:=Get database localization(Current localization)
Case of
: ($lang="ja")
$file:=$modelsFolder.file("Llama-3-ELYZA-JP-8B-q4_k_m.gguf")
$URL:="https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B-GGUF/resolve/main/Llama-3-ELYZA-JP-8B-q4_k_m.gguf"
Else
$file:=$modelsFolder.file("nomic-embed-text-v1.5.f16.gguf")
$URL:="https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF/resolve/main/nomic-embed-text-v1.5.f16.gguf"
End case
var $port : Integer
$port:=8080
$llama:=cs.llama.llama.new($port; $file; $URL; {\
ctx_size: 2048; \
batch_size: 2048; \
threads: 4; \
threads_batch: 4; \
threads_http: 4; \
temp: 0.7; \
top_k: 40; \
top_p: 0.9; \
repeat_penalty: 1.1}; Formula(ALERT(This.file.name+($1.success ? " started!" : " did not start..."))))
End if
Unless the server is already running (in which case the costructor does nothing), the following procedure runs in the background:
llama-server program is startedNow you can test the server:
curl -X POST http://127.0.0.1:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input":"The quick brown fox jumps over the lazy dog."}'
Or, use AI Kit:
var $AIClient : cs.AIKit.OpenAI
$AIClient:=cs.AIKit.OpenAI.new()
$AIClient.baseURL:="http://127.0.0.1:8080/v1"
var $text : Text
$text:="The quick brown fox jumps over the lazy dog."
var $responseEmbeddings : cs.AIKit.OpenAIEmbeddingsResult
$responseEmbeddings:=$AIClient.embeddings.create($text)
Finally to terminate the server:
var $llama : cs.llama.llama
$llama:=cs.llama.llama.new()
$llama.terminate()
The API is compatibile with Open AI.
| Class | API | Availability |
|---|---|---|
| Models | /v1/models |
✅ |
| Chat | /v1/chat/completions |
✅ |
| Images | /v1/images/generations |
|
| Moderations | /v1/moderations |
|
| Embeddings | /v1/embeddings |
✅ |
| Files | /v1/files |