Training a TinyStories-Style GPT Model from Scratch

This project trains a small TinyStories-style GPT model from random initialization on an Apple M4 Mac mini (16GB) using MLX. Rather than calling APIs or fine-tuning an existing model, it walks through the entire pipeline: data preparation, tokenizer training, model architecture, training loop, checkpointing, and inference.
This write-up leans more toward an engineering retrospective: the focus is not on training a chat-capable model, but on verifying whether a personal machine can complete an end-to-end small-scale LLM training run.
Project repository: sergioperezcheco/llm-from-scratch
Project Results
The final trained model is a GPT with 44M parameters:
| Item | Result |
|---|---|
| Model Architecture | Decoder-only Transformer |
| Training Framework | MLX |
| Dataset | TinyStories |
| Training Steps | 10,000 |
| Parameter Count | 44,065,280 |
| Final Loss | 1.1612 |
| Lowest Loss | 1.0964 |
| Training Duration | ~9 hours 15 minutes |
| Training Device | Apple M4 Mac mini 16GB |
The trained model can generate text in the style of short English children's stories. It is not a general-purpose chat model, but it proves that the full pipeline from data to inference works end-to-end.
Judging from the generated samples, a model of this scale has already learned the basic narrative patterns of TinyStories, such as short sentences, characters, simple events, and twist endings. However, it still suffers from repetition, logical jumps, and weak factual grounding, so it is better suited to validating the training pipeline than to serving as a production model.
Who Is This For
This project is for people who already know Python, have a basic understanding of Transformers, and want to personally run through a small model training pipeline. If you just want to quickly use a large model, using an existing API, Ollama, or LM Studio would save more time.
If you want to reproduce this, you need to prepare in advance:
- An Apple Silicon Mac;
- A Python environment with MLX installed;
- Network access for downloading the TinyStories dataset;
- Sufficient local disk space for storing token data and checkpoints;
- A block of training time that can run continuously for several hours.
Why TinyStories
The TinyStories corpus is simple and structurally consistent, and its objective is clear, which makes it well suited to training small models from scratch. On a Mac mini with 16GB of unified memory, training a model on a general-purpose corpus is not realistic, whereas TinyStories lets you focus on the training process itself:
- Data can be prepared and encoded locally;
- The tokenizer vocabulary size is controllable;
- Model parameters can be compressed to tens of millions;
- Generated results are easy to manually evaluate for story structure.
The value of this type of project does not lie in producing a powerful model, but in understanding why each step of LLM training exists.
Model Architecture
The model in the project is a standard GPT-style Decoder-only Transformer with the following core configuration:
| Parameter | Value |
|---|---|
| Transformer Layers | 8 |
| Hidden Dimension | 512 |
| Attention Heads | 8 |
| FFN Dimension | 2048 |
| Vocabulary Size | 10,000 |
| Context Length | 512 |
The architecture includes token embedding, position embedding, causal self-attention, RMSNorm, SwiGLU FFN, and a final LM Head. The overall approach is similar to nanoGPT but adapted for Apple Silicon using MLX.
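As a sanity check, the configuration above can be turned into a parameter count. The sketch below assumes bias-free linear layers, three SwiGLU projection matrices per block, two RMSNorm weight vectors per block plus a final one, and an LM head that is not tied to the token embedding. These are assumptions rather than details confirmed by the repository, but under them the arithmetic reproduces the reported total.

```python
def count_params(n_layers=8, d_model=512, d_ffn=2048, vocab=10_000, context=512):
    """Count parameters under the assumptions stated above (no biases, untied head)."""
    tok_emb = vocab * d_model            # token embedding table
    pos_emb = context * d_model          # learned position embedding table
    attn = 4 * d_model * d_model         # Q, K, V and output projections
    ffn = 3 * d_model * d_ffn            # SwiGLU: gate, up and down projections
    norms = 2 * d_model                  # two RMSNorm weight vectors per block
    per_layer = attn + ffn + norms
    final_norm = d_model                 # RMSNorm before the LM head
    lm_head = vocab * d_model            # assumed NOT tied to tok_emb
    return tok_emb + pos_emb + n_layers * per_layer + final_norm + lm_head


total = count_params()
print(f"{total:,}")  # 44,065,280
```

Roughly three quarters of the parameters sit in the eight Transformer blocks; the two vocabulary-sized matrices account for most of the rest.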
Training Process
The training script does several key things:
- Uses `numpy.memmap` to read encoded binary token data, avoiding loading everything into memory at once.
- Uses gradient accumulation to simulate a larger effective batch from small batches.
- Uses warmup + cosine decay for the learning rate.
- Saves safetensors checkpoints and JSON metadata at regular step intervals.
- Logs loss, speed, memory usage, and ETA during training.
- Supports saving the current checkpoint on interruption.
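The warmup plus cosine decay schedule from the list above can be sketched as a plain function of the step number. Only the 10,000-step total comes from this write-up; `max_lr`, `min_lr`, and `warmup_steps` are hypothetical values chosen for illustration, and the project's actual schedule may differ.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=500, total_steps=10_000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + (max_lr - min_lr) * cosine


for s in (0, 250, 500, 5_000, 10_000):
    print(s, f"{lr_at(s):.2e}")
```

The warmup avoids large updates while the randomly initialized weights are still far from any useful region; the cosine tail lets the loss settle near the end of the run.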
For a 16GB machine, memory management is more important than simply stacking model size. The project ultimately chose MEDIUM_CONFIG, which is more stable than the initially planned 100M parameter default configuration and more suitable for completing a long training run.
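To illustrate the memory-mapped reading mentioned above, here is a minimal sketch. It assumes the encoded tokens are stored as a flat `uint16` file, which is sufficient because a 10,000-entry vocabulary fits below 65,536. The file name, the random stand-in data, and the `get_batch` helper are all hypothetical and not taken from the repository.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real encoded data: random token ids written as a flat uint16 file.
fake_tokens = rng.integers(0, 10_000, size=100_000, dtype=np.uint16)
path = os.path.join(tempfile.mkdtemp(), "train.bin")  # hypothetical file name
fake_tokens.tofile(path)


def get_batch(path, batch_size, context_len, rng):
    """Sample random windows; only the touched pages are read from disk."""
    data = np.memmap(path, dtype=np.uint16, mode="r")
    starts = rng.integers(0, len(data) - context_len - 1, size=batch_size)
    # Inputs are tokens [s, s + T); targets are the same window shifted by one.
    x = np.stack([data[s : s + context_len] for s in starts]).astype(np.int32)
    y = np.stack([data[s + 1 : s + 1 + context_len] for s in starts]).astype(np.int32)
    return x, y


x, y = get_batch(path, batch_size=4, context_len=512, rng=rng)
print(x.shape, y.shape)  # (4, 512) (4, 512)
```

Because the file is mapped rather than loaded, resident memory stays proportional to the batch, not to the size of the corpus.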
The most noteworthy trade-off here is the effective batch size. A single batch that is too large can easily exhaust unified memory, while a single batch that is too small leads to unstable training. Gradient accumulation provides a compromise between throughput and stability. Although each step takes longer, it makes it more feasible to complete the full training run.
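The reason gradient accumulation works is that averaging the gradients of equally sized micro-batches gives the same result as one gradient over the combined batch. The sketch below demonstrates this with a toy linear model in NumPy; it stands in for the idea only and is not the project's MLX training loop. `accum_steps` and the data shapes are arbitrary.

```python
import numpy as np


def mse_grad(w, x, y):
    """Gradient of the mean squared error for a linear model y ~ x @ w."""
    err = x @ w - y
    return 2.0 * x.T @ err / len(x)


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))
y = rng.normal(size=32)
w = np.zeros(8)

# One large batch of 32 samples.
full_grad = mse_grad(w, x, y)

# Four micro-batches of 8 samples, accumulated before a single update.
accum_steps = 4
accum = np.zeros_like(w)
for xb, yb in zip(np.split(x, accum_steps), np.split(y, accum_steps)):
    accum += mse_grad(w, xb, yb) / accum_steps

print(np.allclose(full_grad, accum))  # True
```

Peak activation memory is set by the micro-batch size, while the update the optimizer sees corresponds to the larger effective batch. The cost is wall-clock time, since the micro-batches run sequentially.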
Inference Results
After training, you can use generate.py to load checkpoints/final.safetensors and generate text. For example:
```shell
python generate.py \
  --checkpoint checkpoints/final.safetensors \
  --prompt "Lily found a tiny door under the old tree. " \
  --max-tokens 120 \
  --temperature 0.7 \
  --top-k 50 \
  --n-samples 1
```

This model is most sensitive to English story openings, especially short sentences and fairy-tale style prompts. Chinese, Q&A, code, and general conversation are not its training objectives.
When generating, it is recommended to start with a fairly explicit English story opening, providing characters, a location, and a simple event. When the prompt is too short, the model tends to fall into repetitive sentence patterns; when the temperature is too high, story coherence drops noticeably.
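For readers unfamiliar with the `--temperature` and `--top-k` flags, the following is a minimal NumPy sketch of how temperature scaling and top-k filtering are commonly combined when picking the next token. The function name and its details are illustrative assumptions; the project's `generate.py` may implement sampling differently.

```python
import numpy as np


def sample_next_token(logits, temperature=0.7, top_k=50, rng=None):
    """Sample one token id from the top_k highest-scoring logits."""
    if rng is None:
        rng = np.random.default_rng()
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    k = min(top_k, scaled.shape[-1])
    # Indices of the k largest logits (in no particular order).
    top_idx = np.argpartition(scaled, -k)[-k:]
    top_logits = scaled[top_idx]
    # Numerically stable softmax over the surviving candidates only.
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))


rng = np.random.default_rng(0)
logits = rng.normal(size=10_000)  # stand-in for one step of model output
print(sample_next_token(logits, rng=rng))
```

Raising the temperature spreads probability toward lower-ranked tokens, which is consistent with the drop in coherence described above; top-k caps how far down the ranking a sample can reach.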
Pitfalls and Trade-offs
The most practical trade-off in this project is: don't chase parameter count from the start.
On a Mac mini M4 16GB, although unified memory makes CPU/GPU data exchange more convenient, the total memory is still only 16GB. Model parameters, intermediate activations, optimizer states, the system, and other applications all compete for the same memory pool. Jumping straight to a larger model can easily cause training to be interrupted by memory pressure.
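A back-of-the-envelope estimate helps show where the memory goes. The sketch below assumes 32-bit values and an optimizer that keeps two moment buffers per parameter (as Adam-style optimizers do); neither assumption is confirmed by the write-up. The activation figure covers only the attention score matrices, and only if they are materialized and kept for the backward pass, so treat it as an illustration of scaling rather than a measurement.

```python
def training_state_gib(n_params=44_065_280, bytes_per_value=4, optimizer_moments=2):
    """Weights + gradients + optimizer moment buffers, in GiB."""
    copies = 1 + 1 + optimizer_moments
    return n_params * bytes_per_value * copies / 2**30


def attention_scores_mib(batch, n_layers=8, n_heads=8, context=512, bytes_per_value=4):
    """Size of all (context x context) attention score matrices, in MiB."""
    return batch * n_layers * n_heads * context * context * bytes_per_value / 2**20


print(f"{training_state_gib():.2f} GiB")       # 0.66 GiB
print(f"{attention_scores_mib(1):.0f} MiB")    # 64 MiB per sequence
print(f"{attention_scores_mib(32):.0f} MiB")   # 2048 MiB for a batch of 32
```

Under these assumptions the fixed training state is well under 1 GiB, while activation memory grows linearly with batch size. That is why the micro-batch size, more than the parameter count alone, tends to decide whether a run fits.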
Another point to watch is checkpoint recovery. The current project can restore model weights and step metadata, but the optimizer state is not fully recovered, so it is not a completely lossless training resumption in a strict sense. For production training, this area could be further optimized.
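One way to close this gap is to write the optimizer's buffers next to the weights and reload both on resume. The sketch below shows the idea with NumPy `.npz` files and a JSON sidecar instead of the project's safetensors format, purely to stay self-contained. The file names, dictionary keys, and helper functions are hypothetical.

```python
import json
import os
import tempfile

import numpy as np


def save_checkpoint(directory, step, weights, opt_state):
    """Persist weights, optimizer buffers and the step counter together."""
    os.makedirs(directory, exist_ok=True)
    np.savez(os.path.join(directory, "weights.npz"), **weights)
    np.savez(os.path.join(directory, "optimizer.npz"), **opt_state)
    with open(os.path.join(directory, "meta.json"), "w") as f:
        json.dump({"step": step}, f)


def load_checkpoint(directory):
    """Return (step, weights, opt_state) exactly as they were saved."""
    with np.load(os.path.join(directory, "weights.npz")) as z:
        weights = {k: z[k] for k in z.files}
    with np.load(os.path.join(directory, "optimizer.npz")) as z:
        opt_state = {k: z[k] for k in z.files}
    with open(os.path.join(directory, "meta.json")) as f:
        step = json.load(f)["step"]
    return step, weights, opt_state


# Toy state: one weight matrix plus first and second moment buffers.
rng = np.random.default_rng(0)
weights = {"w": rng.normal(size=(4, 4))}
opt_state = {"m_w": rng.normal(size=(4, 4)), "v_w": rng.random(size=(4, 4))}

ckpt_dir = tempfile.mkdtemp()
save_checkpoint(ckpt_dir, step=1234, weights=weights, opt_state=opt_state)
step, w2, o2 = load_checkpoint(ckpt_dir)
print(step, np.array_equal(w2["w"], weights["w"]))  # 1234 True
```

Without the moment buffers, an adaptive optimizer restarts its running statistics from zero, which usually shows up as a short bump in the loss after resuming. The step counter matters as well, because the learning rate schedule depends on it.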
Another easily underestimated issue is the evaluation method. Looking only at training loss is insufficient, because a decrease in loss does not necessarily mean the generated results are good. Going forward, improvements could include generating samples with fixed prompts at regular intervals, or setting aside a separate validation set to observe validation loss. This makes it easier to determine whether the model is learning stably or simply overfitting to the training set.
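A minimal version of the validation idea is to hold out the tail of the token stream and compute the same cross-entropy on it that training uses. The sketch below shows the split and the loss in NumPy; `split_tokens`, `cross_entropy`, and the 1% hold-out fraction are illustrative assumptions, and the model's forward pass is replaced by placeholder logits.

```python
import numpy as np


def split_tokens(tokens, val_fraction=0.01):
    """Hold out the last val_fraction of the token stream for validation."""
    n_val = max(1, int(len(tokens) * val_fraction))
    return tokens[:-n_val], tokens[-n_val:]


def cross_entropy(logits, targets):
    """Mean next-token cross-entropy in nats; logits (N, V), targets (N,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


tokens = np.arange(1_000) % 10_000
train, val = split_tokens(tokens)
print(len(train), len(val))  # 990 10

# Placeholder logits: a model that guesses uniformly over 10,000 tokens.
uniform_logits = np.zeros((len(val), 10_000))
print(round(cross_entropy(uniform_logits, val), 4))  # 9.2103
```

The uniform baseline of ln(10,000) ≈ 9.21 gives a reference point for a 10,000-token vocabulary, assuming the reported losses are measured in nats. If validation loss stops falling or starts rising while training loss keeps dropping, that is the overfitting signal described above.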
Summary
The most valuable aspect of this project is that it breaks "training an LLM from scratch" down into a complete engineering loop that can run on a personal machine:
- Data preparation: TinyStories download, splitting, encoding;
- Tokenizer: Training a 10,000-vocabulary BPE tokenizer;
- Model: Hand-written GPT Transformer;
- Training: MLX, gradient accumulation, learning rate scheduling, checkpointing;
- Inference: Loading safetensors and sampling to generate text.
If the goal is to understand the underlying training process of LLMs, this is more meaningful than directly calling APIs; if the goal is to obtain a usable general-purpose assistant, you should use an existing large model or a fine-tuning approach.
If continued in the future, three things could be prioritized: saving and restoring optimizer state, adding validation set evaluation, and organizing training logs into curve plots. This would make the project not just "runnable," but closer to a complete small-scale training experiment.
