Based on Phi-2 by Microsoft, whose pretraining data has not been fully disclosed. The Phi-2 documentation says only: '250B tokens, combination of NLP synthetic data created by AOAI GPT-3.5 and filtered web data from Falcon RefinedWeb and SlimPajama, which was assessed by AOAI GPT-4.'
End User Model Data
The finetuning of Phi-2 is not fully documented; this derivative uses a mix of Wikipedia and CulturaX data for Dutch.