Use Ai00 from 4D
Ai00 is an inference server specifically designed for the RWKV language model. Unlike Transformer-based models such as Llama, Mistral, or GPT, RWKV (Receptance Weighted Key Value) is a recurrent neural network. Its state has a fixed size, so memory use stays constant regardless of the number of tokens processed, which allows a theoretically unbounded context on low RAM.
Instantiate cs.Ai00.Ai00 in your On Startup database method:
var $Ai00 : cs.Ai00.Ai00
If (False)
$Ai00:=cs.Ai00.Ai00.new() //default
Else
var $homeFolder : 4D.Folder
$homeFolder:=Folder(fk home folder).folder(".Ai00")
var $file : 4D.File
$file:=$homeFolder.file("rwkv7-g1a-0.4b-20250905-ctx4096.st")
var $URL : Text
$URL:="https://github.com/miyako/ai00/releases/download/models/rwkv7-g1a-0.4b-20250905-ctx4096.st"
var $port : Integer
$port:=8087
var $event : cs.event.event
$event:=cs.event.event.new()
/*
Function onError($params : Object; $error : cs.event.error)
Function onSuccess($params : Object; $models : cs.event.models)
Function onData($request : 4D.HTTPRequest; $event : Object)
Function onResponse($request : 4D.HTTPRequest; $event : Object)
Function onTerminate($worker : 4D.SystemWorker; $params : Object)
*/
$event.onError:=Formula(ALERT($2.message))
$event.onSuccess:=Formula(ALERT($2.models.extract("name").join(",")+" loaded!"))
$event.onData:=Formula(LOG EVENT(Into 4D debug message; "download:"+String((This.range.end/This.range.length)*100; "###.00%")))
$event.onResponse:=Formula(LOG EVENT(Into 4D debug message; "download complete"))
$event.onTerminate:=Formula(LOG EVENT(Into 4D debug message; (["process"; $1.pid; "terminated!"].join(" "))))
$Ai00:=cs.Ai00.Ai00.new($port; $file; $URL; {\
max_batch: 1; \
quant_type: "Int8"; \
precision: "Fp32"}; $event)
End if
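Unless the server is already running (in which case the constructor does nothing), the following procedure runs in the background:
The ai00-server program is started.
Now you can test the server. The commands below use port 8080; if you passed a different port to the constructor (8087 in the example above), substitute it: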
curl -X GET http://127.0.0.1:8080/api/oai/v1/models
curl -X 'POST' \
'http://127.0.0.1:8080/api/oai/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"max_tokens": 1000,
"messages": [
{
"content": "Hi!",
"role": "user"
},
{
"content": "Hello, I am your AI assistant. If you have any questions or instructions, please let me know!",
"role": "assistant"
},
{
"content": "Tell me about water.",
"role": "user"
}
],
"names": {
"assistant": "Assistant",
"user": "User"
},
"sampler": {
"frequency_penalty": 0.3,
"penalty_decay": 0.99654026,
"presence_penalty": 0.3,
"temperature": 1,
"top_k": 128,
"top_p": 0.5,
"type": "Nucleus"
},
"state": "00000000-0000-0000-0000-000000000000",
"stop": [
"\n\nUser:"
],
"stream": false,
"template": {
"prefix": "{assistant}:",
"record": "{role}: {content}",
"sep": "\n\n"
}
}'
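The same two requests can also be built from a script. The following is a minimal Python sketch using only the standard library. The function names (`build_models_request`, `build_chat_request`, `chat`) are hypothetical helpers, not part of Ai00, and the base URL assumes port 8080 as in the curl commands. The payload is reduced to a few fields from the curl example; whether the server accepts it without `names`, `sampler`, `state`, and `template` is an assumption, so pass them through `extra` if needed.

```python
import json
import urllib.request

# Assumption: same host and port as the curl examples above.
BASE_URL = "http://127.0.0.1:8080"


def build_models_request(base_url=BASE_URL):
    """Build (but do not send) the GET request that lists models."""
    return urllib.request.Request(base_url + "/api/oai/v1/models", method="GET")


def build_chat_request(messages, max_tokens=1000, extra=None, base_url=BASE_URL):
    """Build (but do not send) the POST request for chat completions."""
    payload = {
        "max_tokens": max_tokens,
        "messages": messages,
        "stop": ["\n\nUser:"],
        "stream": False,
    }
    # Optional fields from the curl example (sampler, names, template, ...).
    payload.update(extra or {})
    return urllib.request.Request(
        base_url + "/api/oai/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def chat(messages, **kwargs):
    """Send the chat request and return the decoded JSON response."""
    with urllib.request.urlopen(build_chat_request(messages, **kwargs)) as response:
        return json.load(response)

# Usage (requires a running server):
# print(chat([{"role": "user", "content": "Tell me about water."}]))
```

Separating request construction from sending makes it possible to inspect the URL and body without a running server.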
The full list of endpoints is available at http://127.0.0.1:8080/api-docs/.
Finally, to terminate the server:
var $Ai00 : cs.Ai00.Ai00
$Ai00:=cs.Ai00.Ai00.new()
$Ai00.terminate()
Models in .pth format can be downloaded from huggingface.co or modelscope.cn.
Ai00 can’t use a .pth model directly. You can use Python to convert the model from .pth to .st.
You can also use the converter tool in /RESOURCES/.
converter --input model.pth --output model.st
Ai00 is designed for RWKV models, which differ from LLaMA-style Transformer models and have distinct strengths.
The groundbreaking paper “Attention is All You Need” (2017) by researchers at Google introduced the Transformer architecture, which enabled parallel processing, a key differentiator from prior sequential models like RNNs.
Before 2017, most language models processed text sequentially from start to end. That recurrent architecture had a major flaw: training couldn’t be parallelised, since you can’t process Word 3 until you have finished processing Word 2.
The Transformer architecture allowed the computer to look at an entire sentence at once, rather than one word at a time. This unlocked the ability to use massive GPU clusters to train on internet-scale text, giving birth to GPT, BERT, and Claude.
Attention, or self-attention, allows the model to weigh the relevance of every word against every other word in a sentence, regardless of how far apart they are.
To calculate attention, the model must compare every token to every other token, so the number of comparisons grows with the square of the prompt length. In a straightforward implementation, the memory required for the attention scores increases 4x if you double the length of the prompt and 9x if you triple it. Running very long conversations on standard Transformers (like Llama 3 or Mistral) is therefore memory-heavy.
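The 4x and 9x figures follow from simple arithmetic. The sketch below counts the entries of a full token-by-token score matrix; it is an illustration of the scaling only, assuming a naive implementation, and the function names are made up for this example.

```python
def attention_score_entries(num_tokens):
    """Entries in a full score matrix: every token compared to every token."""
    return num_tokens * num_tokens


def growth_factor(base_tokens, multiplier):
    """How much the score matrix grows when the prompt gets `multiplier` times longer."""
    longer = attention_score_entries(base_tokens * multiplier)
    return longer / attention_score_entries(base_tokens)


print(growth_factor(1000, 2))  # doubling the prompt length
print(growth_factor(1000, 3))  # tripling the prompt length
```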
Ai00 runs the RWKV model, which uses a “Linear Attention” approach. It trains like a Transformer (fast and parallel) but runs inference like an RNN (sequential, one token at a time). The memory needed for the model state is fixed, whether the conversation is 10 words long or 100,000 words long.
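To make the fixed-memory idea concrete, here is a toy recurrence in plain Python. It is not the actual RWKV formula: it is a simplified linear-attention-style update with a constant decay, and the vectors, dimension, and function names are invented for illustration. The point is only that each token is folded into a state whose size never changes.

```python
def update_state(state, key, value, decay=0.9):
    """Fold one token into the state; the shape stays len(key) x len(value)."""
    return [
        [decay * state[i][j] + key[i] * value[j] for j in range(len(value))]
        for i in range(len(key))
    ]


def read_state(state, query):
    """Read an output vector from the state for the current token."""
    return [
        sum(query[i] * state[i][j] for i in range(len(query)))
        for j in range(len(state[0]))
    ]


def process(num_tokens, dim=4):
    """Run the recurrence over num_tokens toy tokens and return the final state."""
    state = [[0.0] * dim for _ in range(dim)]
    for t in range(num_tokens):
        key = [float((t + i) % 3) for i in range(dim)]  # toy key vector
        value = [1.0] * dim                             # toy value vector
        state = update_state(state, key, value)
    return state


def state_entries(state):
    """Number of numbers held in the state."""
    return sum(len(row) for row in state)


print(state_entries(process(10)), state_entries(process(10_000)))
```

Both runs end with a state of the same size, whereas a full attention score matrix would be a million times larger for the second run than for the first.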
The API is compatible with OpenAI.
| Class | API | Availability |
|---|---|---|
| Models | /v1/models | ✅ |
| Chat | /v1/chat/completions | ✅ |
| Images | /v1/images/generations | |
| Moderations | /v1/moderations | |
| Embeddings | /v1/embeddings | |
| Files | /v1/files | |