Continued Pretraining Goals

4D is a low-resource language: compared to Python or SQL, there is little publicly available text for it. The command documentation (excluding classes) consists of 1,256 files, which, with a decent frontier LLM, could be expanded to about 3.5 MB of text, or roughly 700,000 tokens.


About 2,000 billion (2 trillion) tokens were used to train a small 2-billion-parameter model like Gemma 4 E2B that runs on a laptop. For a model to learn a new vocabulary and grammar through continued pretraining (CPT), at least 1% of that amount is needed: 20 billion tokens.


To pass that threshold, 20 billion divided by 700,000, or about 28,571 different versions of the documentation, would be necessary. Even if multiple frontier LLMs could together generate 10 sets of synthetic data based on the documentation per day, producing them all would take 2,857 days, or nearly eight years.
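As a sanity check, here is the same arithmetic as a short Python sketch; the 5-bytes-per-token ratio and the 10-sets-per-day throughput are rough assumptions, not measured values.

```python
# Back-of-envelope: how many synthetic documentation sets would
# naive CPT require? Assumes ~5 bytes per token and 10 generated
# sets per day (both rough estimates).

MB = 1_000_000
doc_bytes = 3.5 * MB                      # expanded documentation size
bytes_per_token = 5                       # rough average for prose/code
tokens_per_set = doc_bytes / bytes_per_token      # ~700,000

pretraining_tokens = 2_000_000_000_000    # ~2T tokens for a 2B model
cpt_fraction = 0.01                       # 1% rule of thumb
cpt_budget = pretraining_tokens * cpt_fraction    # 20B tokens

sets_needed = cpt_budget / tokens_per_set         # ~28,571
sets_per_day = 10
days = sets_needed / sets_per_day                 # ~2,857
print(f"{sets_needed:,.0f} sets -> {days:,.0f} days "
      f"(~{days / 365:.1f} years)")
```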


However, 4D is not a completely new domain; it is better seen as a specific strand within the broader domain of programming languages. That shifts the floor down significantly: 70 million tokens, or about 100 different versions of the documentation, might suffice.
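Rerunning the calculation above with this lower in-domain floor shows how much the picture changes; the figures simply restate the estimates already given.

```python
# Same calculation with the in-domain floor of 70M tokens.
cpt_budget = 70_000_000
tokens_per_set = 700_000
sets_needed = cpt_budget / tokens_per_set   # 100 sets
days = sets_needed / 10                     # at 10 sets per day
print(f"{sets_needed:.0f} sets -> {days:.0f} days")
```

At 10 sets per day, the revised target is reachable in about 10 days rather than years.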


Once the dataset is ready, we can use Low-Rank Adaptation (LoRA) to fine-tune an off-the-shelf model. Fine-tuning the small Gemma 4 E2B model on 100 sets of 700,000 tokens would take about 100 hours on an NVIDIA A100.
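A minimal sketch of what such a run could look like with Hugging Face's `transformers` and `peft` libraries; the model ID, file paths, and hyperparameters below are illustrative placeholders, not the actual configuration.

```python
# Sketch: continued pretraining with LoRA on plain-text 4D data.
# Model ID, paths, and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base_model = "google/gemma-2-2b"  # placeholder; swap in the target model

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Attach low-rank adapters to the attention projections only;
# the base weights stay frozen, which keeps memory needs modest.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# The synthetic documentation sets, stored as plain-text files.
data = load_dataset("text", data_files={"train": "synthetic_docs/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = data["train"].map(tokenize, batched=True,
                              remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma-4d-cpt",
                           per_device_train_batch_size=4,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1,
                           learning_rate=2e-4,
                           bf16=True,
                           logging_steps=50),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("gemma-4d-cpt/adapter")
```

Because only the low-rank adapters are trained, the run fits on a single A100 and the resulting adapter can be merged into or kept alongside the base weights.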


Still, this is far from the end of training. In fact, CPT only prepares the model for the real training to come. With a decent grasp of the language, the model can read raw 4D code, such as files found on GitHub, without annotations or cheat sheets. Premature exposure to source code would only confuse the model, so it is essential to build the foundation in a controlled setting.

Claude chat log

ChatML Viewer: an online tool to browse ChatML files.