
Use Ai00 from 4D

Abstract

Ai00 is an inference server specifically designed for the RWKV language model. Unlike Transformer-based models such as Llama, Mistral, or GPT, RWKV (Receptance Weighted Key Value) is a recurrent neural network. Its state has a fixed size, so in principle it can handle an unbounded context on little RAM: memory use stays the same regardless of the number of tokens.

Usage

Instantiate cs.Ai00.Ai00 in your On Startup database method:

var $Ai00 : cs.Ai00.Ai00

If (False)
    $Ai00:=cs.Ai00.Ai00.new()  //default
Else 
    var $homeFolder : 4D.Folder
    $homeFolder:=Folder(fk home folder).folder(".Ai00")
    var $file : 4D.File
    $file:=$homeFolder.file("rwkv7-g1a-0.4b-20250905-ctx4096.st")
    var $URL : Text
    $URL:="https://github.com/miyako/ai00/releases/download/models/rwkv7-g1a-0.4b-20250905-ctx4096.st"
    var $port : Integer
    $port:=8087
    
    var $event : cs.event.event
    $event:=cs.event.event.new()
    /*
        Function onError($params : Object; $error : cs.event.error)
        Function onSuccess($params : Object; $models : cs.event.models)
        Function onData($request : 4D.HTTPRequest; $event : Object)
        Function onResponse($request : 4D.HTTPRequest; $event : Object)
        Function onTerminate($worker : 4D.SystemWorker; $params : Object)
    */
    
    $event.onError:=Formula(ALERT($2.message))
    $event.onSuccess:=Formula(ALERT($2.models.extract("name").join(",")+" loaded!"))
    $event.onData:=Formula(LOG EVENT(Into 4D debug message; "download:"+String((This.range.end/This.range.length)*100; "###.00%")))
    $event.onResponse:=Formula(LOG EVENT(Into 4D debug message; "download complete"))
    $event.onTerminate:=Formula(LOG EVENT(Into 4D debug message; (["process"; $1.pid; "terminated!"].join(" "))))
    
    $Ai00:=cs.Ai00.Ai00.new($port; $file; $URL; {\
    max_batch: 1; \
    quant_type: "Int8"; \
    precision: "Fp32"}; $event)
End if  

Unless the server is already running (in which case the constructor does nothing), the following procedure runs in the background:

The specified model is downloaded via HTTP
The ai00-server program is started

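Now you can test the server. The examples below use port 8080; substitute the port you passed to the constructor (8087 in the example above):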

curl -X GET http://127.0.0.1:8080/api/oai/v1/models

curl -X 'POST' \
  'http://127.0.0.1:8080/api/oai/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "max_tokens": 1000,
  "messages": [
    {
      "content": "Hi!",
      "role": "user"
    },
    {
      "content": "Hello, I am your AI assistant. If you have any questions or instructions, please let me know!",
      "role": "assistant"
    },
    {
      "content": "Tell me about water.",
      "role": "user"
    }
  ],
  "names": {
    "assistant": "Assistant",
    "user": "User"
  },
  "sampler": {
    "frequency_penalty": 0.3,
    "penalty_decay": 0.99654026,
    "presence_penalty": 0.3,
    "temperature": 1,
    "top_k": 128,
    "top_p": 0.5,
    "type": "Nucleus"
  },
  "state": "00000000-0000-0000-0000-000000000000",
  "stop": [
    "\n\nUser:"
  ],
  "stream": false,
  "template": {
    "prefix": "{assistant}:",
    "record": "{role}: {content}",
    "sep": "\n\n"
  }
}'
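The same request can be issued from a script. The following is a minimal Python sketch using only the standard library, not part of Ai00 itself. The helper names `build_chat_request` and `send_chat_request` are hypothetical, the port is assumed to be 8080 as in the curl example, and the server is assumed to fall back to its own defaults for fields that are omitted (such as `sampler` or `template`); pass them through `extra` if that assumption does not hold.

```python
import json
import urllib.request

# Assumption: the server listens on the port used in the curl example.
BASE_URL = "http://127.0.0.1:8080/api/oai/v1"


def build_chat_request(messages, max_tokens=1000, stream=False, extra=None):
    """Build the JSON body for /chat/completions (hypothetical helper)."""
    body = {
        "max_tokens": max_tokens,
        "messages": messages,
        "stop": ["\n\nUser:"],
        "stream": stream,
    }
    # Optional fields such as "sampler", "names" or "template" go here.
    body.update(extra or {})
    return body


def send_chat_request(body, base_url=BASE_URL, timeout=60):
    """POST the body to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send_chat_request(build_chat_request([{"role": "user", "content": "Tell me about water."}]))` should return the parsed JSON reply; building the body separately makes it easy to inspect the payload before sending it.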

The full list of endpoints is available at http://127.0.0.1:8080/api-docs/.

Finally, to terminate the server:

var $Ai00 : cs.Ai00.Ai00
$Ai00:=cs.Ai00.Ai00.new()
$Ai00.terminate()

Models

Models in .pth format can be downloaded from huggingface.co or modelscope.cn.

Ai00 can’t use a .pth model directly. You can use Python to convert the model from .pth to .st.

You can also use the converter tool in /RESOURCES/.

converter --input model.pth --output model.st

Why Ai00

Ai00 is designed for RWKV models, which differ from LLaMA-style Transformer models and have distinct strengths.

The groundbreaking paper “Attention Is All You Need” (2017) by researchers at Google introduced the Transformer architecture, which enabled parallel processing, a key differentiator from prior sequential models like RNNs.

Before 2017, language models typically processed text sequentially from start to end. This approach had a major flaw: it couldn’t be parallelised, since you can’t process word 3 until you have finished processing word 2.

The Transformer architecture allowed the computer to look at an entire sentence at once, rather than one word at a time. This unlocked the ability to use massive GPU clusters to train on internet-scale text corpora, leading to models such as GPT, BERT, and Claude.

Attention, or self-attention, allows the model to weigh the relevance of every word against every other word in a sentence, regardless of how far apart they are.

To calculate attention, the model must compare every token to every other token, so the cost grows quadratically with prompt length: doubling the prompt requires roughly 4x the attention work (and, in a naive implementation, 4x the memory for attention scores), and tripling it requires roughly 9x. Running very long conversations on standard Transformers (like Llama 3 or Mistral) is therefore extremely memory-heavy.
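The 4x and 9x figures follow directly from counting token pairs. A small Python illustration (the function name `attention_pairs` is hypothetical, and the count ignores optimisations such as causal masking, which changes the constant but not the quadratic growth):

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of token-to-token comparisons in full self-attention."""
    return num_tokens * num_tokens


base = attention_pairs(1000)
for factor in (1, 2, 3):
    ratio = attention_pairs(1000 * factor) / base
    print(f"{factor}x tokens -> {ratio:.0f}x comparisons")
# prints:
# 1x tokens -> 1x comparisons
# 2x tokens -> 4x comparisons
# 3x tokens -> 9x comparisons
```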

Ai00 runs the RWKV model, which uses a “Linear Attention” approach. It trains like a Transformer (fast/parallel) but runs like an RNN (sequential). The amount of RAM is fixed regardless of whether the conversation is 10 words long or 100,000 words long.

LLaMA (Transformer)

Uses self-attention
Needs to store a KV cache for every token
Memory grows with context

RWKV (Recurrent Transformer Hybrid)

No attention at inference
Keeps only a small hidden state
Memory does not grow with context
Good for text-generation with long memory
Not a good embedding model
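The difference between the two memory profiles can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not RWKV's actual update rule or any real model's configuration: the layer count and hidden size are made up, `kv_cache_bytes` assumes one key and one value vector per token per layer, and `recurrent_step` is a simple exponential blend chosen only to show that the state keeps its size.

```python
def kv_cache_bytes(num_tokens, num_layers, hidden_size, bytes_per_value=2):
    """Keys plus values for every token in every layer (hence the factor 2)."""
    return 2 * num_layers * hidden_size * bytes_per_value * num_tokens


def recurrent_step(state, token_vector, decay=0.9):
    """Toy recurrence: blend the new token into a fixed-size state."""
    return [decay * s + (1.0 - decay) * x for s, x in zip(state, token_vector)]


# Hypothetical model shape, for illustration only.
LAYERS, HIDDEN = 24, 1024

# Transformer-style cache: grows linearly with the number of tokens.
short_cache = kv_cache_bytes(10, LAYERS, HIDDEN)
long_cache = kv_cache_bytes(100_000, LAYERS, HIDDEN)

# RNN-style state: same length after 1 token or 1000 tokens.
state = [0.0] * 4
for _ in range(1000):
    state = recurrent_step(state, [1.0, 0.0, 0.5, 0.25])
```

Here `long_cache` is 10,000 times `short_cache`, while `state` still holds four numbers after a thousand steps, which is the property that lets a recurrent model run long conversations in fixed memory.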

AI Kit compatibility

The API is compatible with OpenAI.

Class	API	Availability
Models	`/v1/models`	✅
Chat	`/v1/chat/completions`	✅
Images	`/v1/images/generations`
Moderations	`/v1/moderations`
Embeddings	`/v1/embeddings`
Files	`/v1/files`