The dataset is described as deriving from CommonCrawl, but no filtered dataset is provided. Either the filtered dataset itself or a fully reproducible and persistent data pipeline would be preferred here.
An SBATCH script with the training code is available in a fork of Megatron-LM. However, there is no easily visible and navigable repository containing the code used to train the model. Making the repository more prominent would alleviate this.
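For illustration, a self-documenting SBATCH header of the kind that would make the training entry point easy to locate might look like the sketch below. All paths, script names, and resource values are hypothetical placeholders, not taken from the actual repository:

```shell
#!/bin/bash
# Hypothetical SLURM job script; every value below is an illustrative placeholder.
#SBATCH --job-name=pretrain-model      # placeholder job name
#SBATCH --nodes=8                      # placeholder node count
#SBATCH --gres=gpu:4                   # GPUs per node (placeholder)
#SBATCH --time=48:00:00                # wall-clock limit (placeholder)
#SBATCH --output=logs/%x-%j.out        # one log file per job

# Launch the training entry point of the Megatron-LM fork.
# pretrain_gpt.py and the config path are placeholders for the fork's actual entry point.
srun python pretrain_gpt.py --config configs/model.yaml
```

Committing such a script at the repository root, alongside a README pointing to it, would make the training setup immediately visible to reviewers.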
The README of the repository containing the training code is unchanged from the base repository. More elaborate documentation would be warranted; a good example of documentation style is the repository for the OLMo model: https://github.com/allenai/OLMo
Hardware Architecture
The preprint describes the architecture, providing details about design decisions and hyperparameters.
No datasheet with a detailed description of data collection and curation is attached to a persistent version of the model data, as would be preferred here. A persistent version of the filtered data, with the information from the data preprint (https://arxiv.org/abs/2410.08800) attached, would be sufficient.