BLOOM (language model)

BigScience Large Open-science Open-access Multilingual Language Model (BLOOM)[1][2] is a 176-billion-parameter transformer-based autoregressive large language model (LLM). The model, as well as the code base and the data used to train it, are distributed under free licences.[3] BLOOM was trained on approximately 366 billion tokens (1.6 TB of text) from March to July 2022.[4][5]
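Because the weights are published on the Hugging Face Hub, the model can be loaded with standard open-source tooling. The following is a minimal sketch of autoregressive generation using the Transformers library, assuming the transformers and torch packages are installed; the small public checkpoint bigscience/bloom-560m stands in here for the full 176-billion-parameter model, which requires hundreds of gigabytes of memory:

    # A minimal sketch of autoregressive generation with a BLOOM checkpoint.
    # The small "bigscience/bloom-560m" checkpoint stands in for the full
    # "bigscience/bloom" model, which is far too large for most machines.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "bigscience/bloom-560m"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    prompt = "BLOOM is a multilingual language model that"
    inputs = tokenizer(prompt, return_tensors="pt")
    # Autoregressive decoding: each new token is predicted conditioned
    # on the prompt plus all previously generated tokens.
    outputs = model.generate(**inputs, max_new_tokens=30)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))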

BLOOM is the main outcome of the BigScience collaborative initiative,[6] a one-year-long research workshop that took place between May 2021 and May 2022. BigScience was led by Hugging Face and involved several hundred researchers and engineers from France and abroad, representing both academia and the private sector. BigScience was supported by a large-scale public compute grant on the French public supercomputer Jean Zay, managed by GENCI and IDRIS (CNRS), on which BLOOM was trained.

BLOOM's training corpus, named ROOTS, combines data extracted from the then-latest version of the web-based OSCAR corpus (38% of ROOTS) and newly collected data extracted from a manually selected and documented list of language data sources. It encompasses 46 natural languages (in amounts ranging from 30% of the whole dataset for English to 0.00002% for Chi Tumbuka) and 13 programming languages.[7]
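To put these proportions in context, the following back-of-the-envelope sketch converts a language's share of ROOTS into an approximate data volume; only the 1.6 TB corpus size and the two percentages quoted above are taken from the sources, and any other language would be filled in from the ROOTS paper:

    # Illustrative arithmetic only: the corpus size and the two shares
    # below are the figures quoted in the text above.
    TOTAL_BYTES = 1.6e12  # ROOTS is roughly 1.6 TB of text

    shares_pct = {
        "English": 30.0,         # largest natural-language share
        "Chi Tumbuka": 0.00002,  # smallest natural-language share
    }

    for language, pct in shares_pct.items():
        approx_gb = TOTAL_BYTES * pct / 100 / 1e9
        print(f"{language}: ~{approx_gb:g} GB")

This yields roughly 480 GB of English text versus about 320 KB for Chi Tumbuka, illustrating the extreme skew of the language distribution.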

References

  1. ^"BigScience Large Open-science Open-access Multilingual Language Model".Retrieved2022-10-01.
  2. ^Le Scao T, Fan A, Akiki C, Pavlick E, Ilić S, Hesslow D, Castagné R, Luccioni A, Yvon F, Gallé M, Tow J, Rush AM, Biderman S, Webson A, Sasanka Ammanamanchi P, Wang T, Sagot B, Muennighoff N, Villanova del Moral A, Ruwase O, Bawden R, Bekman S, McMillan-Major A, Beltagy I, Nguyen H, Saulnier L, Tan S, Ortiz Suarez P, Sanh V, Laurençon H, Jernite Y, Launay J, Mitchell M, Raffel C, et al. (2022). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model".arXiv:2211.05100[cs.CL].
  3. ^"The BigScience RAIL license".Retrieved2024-01-10.
  4. ^Heikkilä, Melissa (2022-07-12)."BLOOM: Inside the radical new project to democratize AI".MIT Technology Review.Retrieved2023-12-26.
  5. ^"Release of largest trained open-science multilingual language model ever".French National Centre for Scientific Research.2022-07-12.Retrieved2023-12-26.
  6. ^"BigScience".Retrieved2024-01-10.
  7. ^Laurençon H, Saulnier L, Wang T, Akiki C, Villanova del Moral A, Le Scao T, Von Werra L, Mou C, González Ponferrada C, Nguyen H, Frohberg J, Šaško M, Lhoest Q, McMillan-Major A, Dupont G, Biderman S, Rogers A, Ben allal L, De Toni F, Pistilli G, Nguyen O, Nikpoor S, Masoud M, Colombo P, de la Rosa J, Villegas P, Thrush T, Longpre S, Nagel S, Weber L, Muñoz M, Zhu J, Van Strien D, Alyafeai Z, Almubarak K, Vu MC, Gonzalez-Dios I, Soroa A, Lo K, Dey M, Ortiz Suarez P, Gokaslan A, Bose S, Adelani D, Phan L, Tran H, Yu I, Pai S, Chim J, Lepercq V, Ilic S, Mitchell M, Luccioni S, Jernite Y (2022). "The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset".arXiv:2303.03915[cs.CL].