An *n*-gram is a sequence of *n* adjacent symbols in a particular order.[1] The symbols may be *n* adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome. They are collected from a text corpus or speech corpus.

*N*-gram is actually an umbrella term for a family of names whose members, depending on the numeral *n*, are 1-gram, 2-gram, etc., or the same using spoken numeral prefixes.
If Latin numerical prefixes are used, then an *n*-gram of size 1 is called a "unigram", size 2 a "bigram" (or, less commonly, a "digram"), etc. If English cardinal numbers are used instead of the Latin prefixes, they are called "four-gram", "five-gram", etc. Similarly, Greek numerical prefixes such as "monomer", "dimer", "trimer", "tetramer", "pentamer", etc., or English cardinal numbers, "one-mer", "two-mer", "three-mer", etc., are used in computational biology for polymers or oligomers of a known size, called *k*-mers. When the items are words, *n*-grams may also be called shingles.[2]
In the context of natural language processing (NLP), the use of *n*-grams allows bag-of-words models to capture information such as word order, which would not be possible in the traditional bag-of-words setting.
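As a rough sketch of the definition above, *n*-grams can be extracted by sliding a window of width *n* over a token sequence. The following Python illustrates this for both word and character units (the `ngrams` helper name is illustrative, not a standard library function):

```python
def ngrams(tokens, n):
    """Slide a window of width n over the sequence and collect every window."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "to be or not to be"
print(ngrams(sentence.split(), 2))            # word 2-grams: ('to', 'be'), ('be', 'or'), ...
print(ngrams(sentence.replace(" ", "_"), 3))  # character 3-grams: ('t', 'o', '_'), ...
```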
Examples
Shannon (1951)[3] discussed *n*-gram models of English. For example (a minimal sampling sketch follows this list):
- 3-gram character model (random draw based on the probabilities of each trigram): in no ist lat whey cratict froure birs grocid pondenome of demonstures of the retagin is regiactiona of cre
- 2-gram word model (random draw of words taking into account their transition probabilities): the head and in frontal attack on an english writer that the character of this point is therefore another method for the letters that the time of who ever told the problem for an unexpected
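A minimal sketch of how such a random draw can be implemented: estimate the transition probabilities P(next symbol | previous *n*−1 symbols) by counting windows in a corpus, then sample one symbol at a time. Here it is done at the character level with trigrams (the `sample_text` helper and the toy corpus are illustrative assumptions, not Shannon's original setup):

```python
import random
from collections import Counter, defaultdict

def sample_text(corpus, n=3, length=80):
    """Draw characters one at a time from the empirical n-gram distribution:
    P(next char | previous n-1 chars), estimated by counting windows in corpus."""
    counts = defaultdict(Counter)
    for i in range(len(corpus) - n + 1):
        history, nxt = corpus[i:i + n - 1], corpus[i + n - 1]
        counts[history][nxt] += 1

    history = corpus[:n - 1]          # seed with the opening n-1 characters
    out = list(history)
    for _ in range(length):
        options = counts.get(history)
        if not options:               # history never seen in the corpus: stop
            break
        chars, weights = zip(*options.items())
        nxt = random.choices(chars, weights=weights)[0]
        out.append(nxt)
        history = history[1:] + nxt
    return "".join(out)

print(sample_text("to be or not to be that is the question ", n=3))
```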
Field | Unit | Sample sequence | 1-gram sequence | 2-gram sequence | 3-gram sequence |
---|---|---|---|---|---|
Vernacular name | | | unigram | bigram | trigram |
Order of resulting Markov model | | | 0 | 1 | 2 |
Protein sequencing | amino acid | ... Cys-Gly-Leu-Ser-Trp ... | ..., Cys, Gly, Leu, Ser, Trp, ... | ..., Cys-Gly, Gly-Leu, Leu-Ser, Ser-Trp, ... | ..., Cys-Gly-Leu, Gly-Leu-Ser, Leu-Ser-Trp, ... |
DNA sequencing | base pair | ...AGCTTCGA... | ..., A, G, C, T, T, C, G, A, ... | ..., AG, GC, CT, TT, TC, CG, GA, ... | ..., AGC, GCT, CTT, TTC, TCG, CGA, ... |
Language model | character | ...to_be_or_not_to_be... | ..., t, o, _, b, e, _, o, r, _, n, o, t, _, t, o, _, b, e, ... | ..., to, o_, _b, be, e_, _o, or, r_, _n, no, ot, t_, _t, to, o_, _b, be, ... | ..., to_, o_b, _be, be_, e_o, _or, or_, r_n, _no, not, ot_, t_t, _to, to_, o_b, _be, ... |
Word *n*-gram language model | word | ... to be or not to be ... | ..., to, be, or, not, to, be, ... | ..., to be, be or, or not, not to, to be, ... | ..., to be or, be or not, or not to, not to be, ... |
The table above (Figure 1) shows several example sequences and the corresponding 1-gram, 2-gram and 3-gram sequences.
Here are further examples; these are word-level 3-grams and 4-grams (and counts of the number of times they appeared) from the Google *n*-gram corpus;[4] a counting sketch follows the lists.
3-grams
- ceramics collectables collectibles (55)
- ceramics collectables fine (130)
- ceramics collected by (52)
- ceramics collectible pottery (50)
- ceramics collectibles cooking (45)
4-grams
- serve as the incoming (92)
- serve as the incubator (99)
- serve as the independent (794)
- serve as the index (223)
- serve as the indication (72)
- serve as the indicator (120)
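Counts like the ones above come from a straightforward tabulation over the corpus: extract every word-level *n*-gram and count how often each occurs. A toy sketch in Python (the `ngram_counts` helper and the tiny corpus are illustrative; the Google corpus operates at a far larger scale with its own frequency thresholds):

```python
from collections import Counter

def ngram_counts(text, n):
    """Tabulate word-level n-grams and how many times each appears."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

tiny_corpus = "to be or not to be that is the question"
for gram, count in ngram_counts(tiny_corpus, 2).most_common(3):
    print(" ".join(gram), count)  # "to be" appears twice; every other bigram once
```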
References
edit- ^"n-gram language model - an overview | ScienceDirect Topics".sciencedirect.Retrieved12 December2024.
- ^Broder, Andrei Z.; Glassman, Steven C.; Manasse, Mark S.; Zweig, Geoffrey (1997). "Syntactic clustering of the web".Computer Networks and ISDN Systems.29(8):1157–1166.doi:10.1016/s0169-7552(97)00031-7.S2CID9022773.
- ^Shannon, Claude E. "The redundancy of English."Cybernetics; Transactions of the 7th Conference, New York: Josiah Macy, Jr. Foundation.1951.
- ^Franz, Alex; Brants, Thorsten (2006)."All OurN-gram are Belong to You ".Google Research Blog.Archivedfrom the original on 17 October 2006.Retrieved16 December2011.
Further reading
- Manning, Christopher D.; Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. MIT Press. ISBN 0-262-13360-1.
- White, Owen; Dunning, Ted; Sutton, Granger; Adams, Mark; Venter, J. Craig; Fields, Chris (1993). "A quality control algorithm for DNA sequencing projects". Nucleic Acids Research. 21 (16): 3829–3838. doi:10.1093/nar/21.16.3829. PMC 309901. PMID 8367301.
- Damerau, Frederick J. (1971). Markov Models and Linguistic Theory. The Hague: Mouton.
- Figueroa, Alejandro; Atkinson, John (2012). "Contextual Language Models for Ranking Answers to Natural Language Definition Questions". Computational Intelligence. 28 (4): 528–548. doi:10.1111/j.1467-8640.2012.00426.x. S2CID 27378409.
- Brocardo, Marcelo Luiz; Traore, Issa; Saad, Sherif; Woungang, Isaac (2013). "Authorship Verification for Short Messages Using Stylometry". IEEE International Conference on Computer, Information and Telecommunication Systems (CITS).
External links
- Ngram Extractor: gives weights of *n*-grams based on their frequency.
- Google's Google Books *n*-gram viewer and Web *n*-grams database (September 2006)
- STATOPERATOR N-grams Project: weighted *n*-gram viewer for every domain in the Alexa Top 1M
- 1,000,000 most frequent 2-, 3-, 4- and 5-grams from the 425-million-word Corpus of Contemporary American English
- Peachnote's music ngram viewer
- Stochastic Language Models (N-Gram) Specification (W3C)
- Michael Collins's notes on *n*-gram Language Models
- OpenRefine: Clustering In Depth