Heaps' law
Inlinguistics,Heaps' law(also calledHerdan's law) is anempirical lawwhich describes the number of distinct words in a document (or set of documents) as a function of the document length (so called type-token relation). It can be formulated as
whereVRis the number of distinct words in an instance text of sizen.Kand β are free parameters determined empirically. With Englishtext corpora,typicallyKis between 10 and 100, and β is between 0.4 and 0.6.
The law is frequently attributed toHarold Stanley Heaps,but was originally discovered by Gustav Herdan (1960).[1]Under mild assumptions, the Herdan–Heaps law is asymptotically equivalent toZipf's lawconcerning the frequencies of individual words within a text.[2]This is a consequence of the fact that the type-token relation (in general) of a homogenous text can be derived from the distribution of its types.[3]
Empirically, Heaps' law is preserved even when the document is randomly shuffled,[4]meaning that it does not depend on the ordering of words, but only the frequency of words.[5]This is used as evidence for deriving Heaps' law from Zipf's law.[4]
Heaps' law means that as more instance text is gathered, there will be diminishing returns in terms of discovery of the full vocabulary from which the distinct terms are drawn.
Deviations from Heaps' law, as typically observed in English text corpora, have been identified in corpora generated with large language models.[6]
Heaps' law also applies to situations in which the "vocabulary" is just some set of distinct types which are attributes of some collection of objects. For example, the objects could be people, and the types could be country of origin of the person. If persons are selected randomly (that is, we are not selecting based on country of origin), then Heaps' law says we will quickly have representatives from most countries (in proportion to their population) but it will become increasingly difficult to cover the entire set of countries by continuing this method of sampling. Heaps' law has been observed also in single-celltranscriptomes[7]consideringgenesas the distinct objects in the "vocabulary".
See also
[edit]- Zipf's law– Probability distribution
- Brevity law– Linguistics law
- Menzerath's law– Linguistic law
- Bradford's law– Pattern of references in science journals
- Benford's law– Observation that in many real-life datasets, the leading digit is likely to be small
- Pareto distribution– Probability distribution
- Principle of least effort– Idea that agents prefer to do what's easiest
- Rank-size distribution– distribution of size by rank
References
[edit]Citations
[edit]- ^Egghe (2007):"Herdan's law in linguistics and Heaps' law in information retrieval are different formulations of the same phenomenon".
- ^Kornai (1999);Baeza-Yates & Navarro (2000);van Leijenhorst & van der Weide (2005).
- ^Milička (2009)
- ^abSano, Yukie; Takayasu, Hideki; Takayasu, Misako (2012)."Zipf's Law and Heaps' Law Can Predict the Size of Potential Words".Progress of Theoretical Physics Supplement.194:202–209.Bibcode:2012PThPS.194..202S.doi:10.1143/PTPS.194.202.ISSN0375-9687.
- ^Najafi, Elham; Darooneh, Amir H. (2015-06-19). Esteban, Francisco J. (ed.)."The Fractal Patterns of Words in a Text: A Method for Automatic Keyword Extraction".PLOS ONE.10(6): e0130617.Bibcode:2015PLoSO..1030617N.doi:10.1371/journal.pone.0130617.ISSN1932-6203.PMC4474631.PMID26091207.
- ^Lai, Uyen; Randhawa, Gurjit; Sheridan, Paul (2023-12-12)."Heaps' Law in GPT-Neo Large Language Model Emulated Corpora".Proceedings of the Tenth International Workshop on Evaluating Information Access (EVIA 2023), a Satellite Workshop of the NTCIR-17 Conference.Tokyo, Japan. pp. 20–23.doi:10.20736/0002001352.
- ^Lazzardi, Silvia; Valle, Filippo; Mazzolini, Andrea; Scialdone, Antonio; Caselle, Michele; Osella, Matteo (2021-06-17)."Emergent Statistical Laws in Single-Cell Transcriptomic Data".bioRxiv:2021–06.16.448706.doi:10.1101/2021.06.16.448706.S2CID235482777.Retrieved2021-06-18.
Sources
[edit]- Baeza-Yates, Ricardo; Navarro, Gonzalo (2000), "Block addressing indices for approximate text retrieval",Journal of the American Society for Information Science,51(1): 69–82,CiteSeerX10.1.1.31.4832,doi:10.1002/(sici)1097-4571(2000)51:1<69::aid-asi10>3.0.co;2-c.
- Egghe, L. (2007), "Untangling Herdan's law and Heaps' law: Mathematical and informetric arguments",Journal of the American Society for Information Science and Technology,58(5): 702–709,doi:10.1002/asi.20524.
- Heaps, Harold Stanley (1978),Information Retrieval: Computational and Theoretical Aspects,Academic Press.Heaps' law is proposed in Section 7.5 (pp. 206–208).
- Herdan, Gustav (1960),Type-token mathematics,The Hague: Mouton.
- Kornai, Andras (1999), "Zipf's law outside the middle range", in Rogers, James (ed.),Proceedings of the Sixth Meeting on Mathematics of Language,University of Central Florida, pp. 347–356.
- Milička, Jiří (2009), "Type-token & Hapax-token Relation: A Combinatorial Model",Glottotheory. International Journal of Theoretical Linguistics,1(2): 99–110,doi:10.1515/glot-2009-0009,S2CID124490442.
- van Leijenhorst, D. C; van der Weide, Th. P. (2005), "A formal derivation of Heaps' Law",Information Sciences,170(2–4): 263–272,doi:10.1016/j.ins.2004.03.006.
- This article incorporates material from Heaps' law onPlanetMath,which is licensed under theCreative Commons Attribution/Share-Alike License.
External links
[edit]- Media related toHeaps' lawat Wikimedia Commons