No specific accounting or listing of training data is available. The data are described only in broad terms as 'web-crawled data and structured datasets with a total size of 7.7T, with a cutoff date 04/2023', alongside 'some additional web scraping'.
No data disclosed beyond a generic reference to 'source-available, commercially usable datasets, as well as self-created and procured proprietary datasets'.
Aleph Alpha states that Pharia was trained using its Scaling code base, which it has made available as a repository mirrored from an undisclosed source; no repository specifically documenting the training or instruction-tuning of Pharia was found.