Collation

Collation is the assembly of written information into a standard order. Many systems of collation are based on numerical order or alphabetical order, or extensions and combinations thereof. Collation is a fundamental element of most office filing systems, library catalogs, and reference books.

Collation differs from classification in that, in classification, the classes themselves are not necessarily ordered. However, even if the order of the classes is irrelevant, the identifiers of the classes may be members of an ordered set, allowing a sorting algorithm to arrange the items by class.

Formally speaking, a collation method typically defines a total order on a set of possible identifiers, called sort keys, which consequently produces a total preorder on the set of items of information (items with the same identifier are not placed in any defined order).
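The distinction between a total order on keys and a total preorder on items can be illustrated with a minimal sketch. The records and key values below are hypothetical, chosen only to show two items sharing one sort key.

```python
from itertools import groupby

# Hypothetical records: (sort key, item). Two items share the key "smith".
records = [("smith", "Smith, J."), ("jones", "Jones, A."), ("smith", "Smith, K.")]

# The keys are totally ordered, so sorting by key is always possible.
by_key = sorted(records, key=lambda r: r[0])

# Items with equal keys end up in one tied group; the collation itself
# does not say which of them comes first (Python's stable sort simply
# keeps their input order).
groups = [(k, [item for _, item in g]) for k, g in groupby(by_key, key=lambda r: r[0])]
print(groups)
```

Here the group for "smith" contains two items whose relative order is not determined by the collation, which is exactly what makes the ordering of items a preorder rather than a total order.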

A collation algorithm such as the Unicode collation algorithm defines an order through the process of comparing two given character strings and deciding which should come before the other. When an order has been defined in this way, a sorting algorithm can be used to put a list of any number of items into that order.

The main advantage of collation is that it makes it fast and easy for a user to find an element in the list, or to confirm that it is absent from the list. In automatic systems this can be done using a binary search algorithm or interpolation search; manual searching may be performed using a roughly similar procedure, though this will often be done unconsciously. Other advantages are that one can easily find the first or last elements on the list (most likely to be useful in the case of numerically sorted data), or elements in a given range (useful again in the case of numerical data, and also with alphabetically ordered data when one may be sure of only the first few letters of the sought item or items).
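As a rough sketch of these lookups, the following uses Python's standard bisect module on a list that is already sorted in code-point order. The function names and the word list are illustrative assumptions, not part of any particular system.

```python
from bisect import bisect_left

def contains(sorted_items, target):
    """Binary search: report whether target occurs in an already sorted list."""
    i = bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target

def with_prefix(sorted_items, prefix):
    """Return the contiguous run of items that start with prefix.

    Assumes plain code-point ordering; the upper bound appends the highest
    code point to the prefix, which is enough for ordinary text.
    """
    lo = bisect_left(sorted_items, prefix)
    hi = bisect_left(sorted_items, prefix + "\U0010FFFF")
    return sorted_items[lo:hi]

words = sorted(["dog", "carp", "carthorse", "carbon", "cart"])
found = contains(words, "carp")        # True
missing = contains(words, "cat")       # False
cart_words = with_prefix(words, "cart")  # ['cart', 'carthorse']
first, last = words[0], words[-1]
```

Both lookups take a number of comparisons proportional to the logarithm of the list length, which is the practical benefit of keeping the list collated.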

Ordering

Numerical and chronological

Strings representing numbers may be sorted based on the values of the numbers that they represent. For example, "−4", "2.5", "10", "89", "30,000". Pure application of this method may leave the order of some strings undetermined, since different strings can represent the same number (as with "2" and "2.0" or, when scientific notation is used, "2e3" and "2000").
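A minimal sketch of value-based sorting follows. The helper name numeric_key is hypothetical, and the code assumes that "," is a thousands separator and that a Unicode minus sign may appear in place of the ASCII hyphen.

```python
from decimal import Decimal

def numeric_key(s):
    """Map a numeric string to its value (assumes ',' groups thousands)."""
    cleaned = s.replace("\u2212", "-").replace(",", "")
    return Decimal(cleaned)

values = ["89", "2.5", "30,000", "−4", "10"]
ordered = sorted(values, key=numeric_key)

# Different strings can map to the same key, so their order is a tie.
same_value = numeric_key("2e3") == numeric_key("2000")
also_same = numeric_key("2") == numeric_key("2.0")
```

The two final comparisons both evaluate to true, showing why the numeric value alone cannot decide the order of such pairs; a secondary rule (for example, comparing the raw strings) would be needed to break the tie.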

A similar approach may be taken with strings representing dates or other items that can be ordered chronologically or in some other natural fashion.

Alphabetical

Alphabetical order is the basis for many systems of collation where items of information are identified by strings consisting principally of letters from an alphabet. The ordering of the strings relies on the existence of a standard ordering for the letters of the alphabet in question. (The system is not limited to alphabets in the strict technical sense; languages that use a syllabary or abugida, for example Cherokee, can use the same ordering principle provided there is a set ordering for the symbols used.)

To decide which of two strings comes first in alphabetical order, initially their first letters are compared. The string whose first letter appears earlier in the alphabet comes first in alphabetical order. If the first letters are the same, then the second letters are compared, and so on, until the order is decided. (If one string runs out of letters to compare, then it is deemed to come first; for example, "cart" comes before "carthorse".) The result of arranging a set of strings in alphabetical order is that words with the same first letter are grouped together, and within such a group words with the same first two letters are grouped together, and so on.
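The comparison procedure just described can be sketched as follows. The function name and the default alphabet are illustrative assumptions; the sketch handles only lowercase strings made of letters from the given alphabet.

```python
from functools import cmp_to_key

def alphabetical_compare(a, b, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Return -1 if a comes first, 1 if b comes first, 0 if they are equal."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    # Compare letter by letter until a difference is found.
    for x, y in zip(a, b):
        if rank[x] != rank[y]:
            return -1 if rank[x] < rank[y] else 1
    # One string ran out of letters (or both did): the shorter comes first.
    return (len(a) > len(b)) - (len(a) < len(b))

result = alphabetical_compare("cart", "carthorse")  # -1: "cart" comes first
ordered = sorted(["carthorse", "cab", "cart", "bat"],
                 key=cmp_to_key(alphabetical_compare))
```

Passing a different alphabet string changes the letter order without changing the procedure, which is why the same principle carries over to other scripts with a fixed symbol order.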

Capital letters are typically treated as equivalent to their corresponding lowercase letters. (For alternative treatments in computerized systems, see Automation, below.)

Certain limitations, complications, and special conventions may apply when alphabetical order is used:

When strings contain spaces or other word dividers, the decision must be taken whether to ignore these dividers or to treat them as symbols preceding all other letters of the alphabet. For example, if the first approach is taken then "car park" will come after "carbon" and "carp" (as it would if it were written "carpark"), whereas in the second approach "car park" will come before those two words. The first rule is used in many (but not all) dictionaries, the second in telephone directories (so that Wilson, Jim K appears with other people named Wilson, Jim and not after Wilson, Jimbo).
Abbreviations may be treated as if they were spelt out in full. For example, names containing "St." (short for the English word Saint) are often ordered as if they were written out as "Saint". There is also a traditional convention in English that surnames beginning Mc and M' are listed as if those prefixes were written Mac.
Strings that represent personal names will often be listed by alphabetical order of surname, even if the given name comes first. For example, Juan Hernandes and Brian O'Leary should be sorted as "Hernandes, Juan" and "O'Leary, Brian" even if they are not written this way.
Very common initial words, such as The in English, are often ignored for sorting purposes. So The Shining would be sorted as just "Shining" or "Shining, The".
When some of the strings contain numerals (or other non-letter characters), various approaches are possible. Sometimes such characters are treated as if they came before or after all the letters of the alphabet. Another method is for numbers to be sorted alphabetically as they would be spelled: for example 1776 would be sorted as if spelled out "seventeen seventy-six", and 24 heures du Mans as if spelled "vingt-quatre..." (French for "twenty-four"). When numerals or other symbols are used as special graphical forms of letters, as in 1337 for leet or Se7en for the movie title Seven, they may be sorted as if they were those letters.
Languages have different conventions for treating modified letters and certain letter combinations. For example, in Spanish the letter ñ is treated as a basic letter following n, and the digraphs ch and ll were formerly (until 1994) treated as basic letters following c and l, although they are now alphabetized as two-letter combinations. A list of such conventions for various languages can be found at Alphabetical order § Language-specific conventions.
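The two treatments of spaces described in the first convention above can be sketched with two key functions. The names letter_by_letter and word_by_word are labels chosen for this illustration, and the sketch relies on the space character having a lower code than any letter.

```python
def letter_by_letter(s):
    """Ignore spaces entirely, as many dictionaries do."""
    return s.lower().replace(" ", "")

def word_by_word(s):
    """Keep spaces; in code order a space precedes every letter."""
    return s.lower()

words = ["carbon", "car park", "carp"]
dictionary_style = sorted(words, key=letter_by_letter)
directory_style = sorted(words, key=word_by_word)

names = ["Wilson, Jimbo", "Wilson, Jim K"]
directory_names = sorted(names, key=word_by_word)
```

With the first key, "car park" sorts after "carbon" and "carp"; with the second it sorts before both, and "Wilson, Jim K" stays ahead of "Wilson, Jimbo".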

In several languages the rules have changed over time, and so older dictionaries may use a different order than modern ones. Furthermore, collation may depend on use. For example, German dictionaries and telephone directories use different approaches.

Root sorting

Some Arabic dictionaries, such as Hans Wehr's bilingual A Dictionary of Modern Written Arabic, group and sort Arabic words by Semitic root.^[1] For example, the words kitāba (كتابة 'writing'), kitāb (كتاب 'book'), kātib (كاتب 'writer'), maktaba (مكتبة 'library'), maktab (مكتب 'office'), and maktūb (مكتوب 'fate' or 'written') are agglomerated under the triliteral root k-t-b (ك ت ب), which denotes 'writing'.^[2]

Radical-and-stroke sorting

See also: Chinese characters and Chinese character orders

Another form of collation is radical-and-stroke sorting, used for non-alphabetic writing systems such as the hanzi of Chinese and the kanji of Japanese, whose thousands of symbols defy ordering by convention. In this system, common components of characters are identified; these are called radicals in Chinese and logographic systems derived from Chinese. Characters are then grouped by their primary radical, then ordered by number of pen strokes within radicals. When there is no obvious radical or more than one radical, convention governs which is used for collation. For example, the Chinese character 妈 (meaning "mother") is sorted as a six-stroke character under the three-stroke primary radical 女.
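A minimal sketch of radical-and-stroke ordering is given below. The lookup table is a small hand-made illustration, not data from any real dictionary file; it assumes Kangxi radical numbers (38 for 女, 75 for 木) and counts the strokes a character has beyond its radical.

```python
# Illustrative table: character -> (radical number, strokes beyond the radical).
RADICAL_TABLE = {
    "妈": (38, 3),  # radical 女 (3 strokes) plus 3 more: six strokes in total
    "女": (38, 0),
    "姐": (38, 5),
    "林": (75, 4),  # radical 木 plus 4 more strokes
    "木": (75, 0),
}

def radical_stroke_key(ch):
    """Group by primary radical first, then order by stroke count."""
    return RADICAL_TABLE[ch]

ordered = sorted("林姐木妈女", key=radical_stroke_key)
```

Characters under 女 come out first, in order of increasing stroke count, followed by those under 木. A real implementation would also need the conventions mentioned above for characters whose primary radical is not obvious.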

The radical-and-stroke system is cumbersome compared to an alphabetical system in which there are only a few characters, all unambiguous. The choice of which components of a logograph constitute separate radicals and which radical is primary is not clear-cut. As a result, logographic languages often supplement radical-and-stroke ordering with alphabetic sorting of a phonetic conversion of the logographs. For example, the kanji word Tōkyō (東京) can be sorted as if it were spelled out in the Japanese characters of the hiragana syllabary as "to-u-ki-yo-u" (とうきょう), using the conventional sorting order for these characters.

Chinese characters can also be sorted by stroke-based sorting. In Greater China, surname stroke ordering is a convention in some official documents where people's names are listed without hierarchy.

Automation

When information is stored in digital systems, collation may become an automated process. It is then necessary to implement an appropriate collation algorithm that allows the information to be sorted in a satisfactory manner for the application in question. Often the aim will be to achieve an alphabetical or numerical ordering that follows the standard criteria as described in the preceding sections. However, not all of these criteria are easy to automate.^[3]

The simplest kind of automated collation is based on the numerical codes of the symbols in a character set, such as ASCII coding (or any of its supersets such as Unicode), with the symbols being ordered in increasing numerical order of their codes, and this ordering being extended to strings in accordance with the basic principles of alphabetical ordering (mathematically speaking, lexicographical ordering). So a computer program might treat the characters a, b, C, d, and $ as being ordered $, C, a, b, d (the corresponding ASCII codes are $ = 36, a = 97, b = 98, C = 67, and d = 100). Therefore, strings beginning with C, M, or Z would be sorted before strings with lower-case a, b, etc. This is sometimes called ASCIIbetical order. This deviates from the standard alphabetical order, particularly due to the ordering of capital letters before all lower-case ones (and possibly the treatment of spaces and other non-letter characters). It is therefore often applied with certain alterations, the most obvious being case conversion (often to uppercase, for historical reasons^[note 1]) before comparison of ASCII values.
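Code-based ordering and the case-conversion adjustment can be reproduced directly in Python, whose default string comparison uses code points. The word list is an arbitrary illustration.

```python
chars = ["a", "b", "C", "d", "$"]

# Default string sorting compares numerical character codes.
code_order = sorted(chars)              # ['$', 'C', 'a', 'b', 'd']
codes = [ord(c) for c in code_order]    # [36, 67, 97, 98, 100]

words = ["apple", "Zebra", "banana", "Cherry"]
asciibetical = sorted(words)            # capitals sort before all lowercase
# Converting to uppercase before comparison restores the usual order.
case_insensitive = sorted(words, key=str.upper)
```

The first sort places "Cherry" and "Zebra" ahead of "apple", while the second gives the order a reader would normally expect.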

In many collation algorithms, the comparison is based not on the numerical codes of the characters but on the collating sequence (a sequence in which the characters are assumed to come for the purpose of collation), together with other ordering rules appropriate to the given application. This can serve to apply the correct conventions used for alphabetical ordering in the language in question, dealing properly with differently cased letters, modified letters, digraphs, particular abbreviations, and so on, as mentioned above under Alphabetical, and in detail in the Alphabetical order article. Such algorithms are potentially quite complex, possibly requiring several passes through the text.^[3]

Problems are nonetheless still common when the algorithm has to encompass more than one language. For example, in German dictionaries the word ökonomisch comes between offenbar and olfaktorisch, while Turkish dictionaries treat o and ö as different letters, placing oyun before öbür.
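The conflict can be made concrete with a simplified sketch of two collating sequences. The helper make_key and both tables are assumptions made for this illustration: the German table treats umlauted vowels as their base letters (dictionary style) and ignores ß, and the Turkish table lists the Turkish alphabet with ö as a separate letter after o. Real systems use much richer rules.

```python
def make_key(sequence, equivalents=None):
    """Build a sort key from a collating sequence and optional letter folding."""
    equivalents = equivalents or {}
    rank = {ch: i for i, ch in enumerate(sequence)}

    def key(word):
        # Fold equivalent letters, then replace each letter by its rank.
        return [rank[equivalents.get(ch, ch)] for ch in word.lower()]

    return key

german_key = make_key("abcdefghijklmnopqrstuvwxyz",
                      {"ä": "a", "ö": "o", "ü": "u"})
turkish_key = make_key("abcçdefgğhıijklmnoöprsştuüvyz")

words = ["olfaktorisch", "öbür", "offenbar", "oyun", "ökonomisch"]
german_order = sorted(words, key=german_key)
turkish_order = sorted(words, key=turkish_key)
```

Under the German key, ökonomisch falls between offenbar and olfaktorisch; under the Turkish key, every word beginning with o (including oyun) precedes öbür and ökonomisch. No single sequence satisfies both conventions.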

A standard algorithm for collating any collection of strings composed of any standard Unicode symbols is the Unicode Collation Algorithm. This can be adapted to use the appropriate collation sequence for a given language by tailoring its default collation table. Several such tailorings are collected in the Common Locale Data Repository.

Sort keys

In some applications, the strings by which items are collated may differ from the identifiers that are displayed. For example, The Shining might be sorted as Shining, The (see Alphabetical above), but it may still be desired to display it as The Shining. In this case two sets of strings can be stored, one for display purposes, and another for collation purposes. Strings used for collation in this way are called sort keys.
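A minimal sketch of keeping a separate sort key alongside the display string follows. The function name sort_key, the single ignored word "the", and the title list are assumptions made for illustration.

```python
IGNORED_INITIAL_WORDS = ("the ",)  # hypothetical list; extend as needed

def sort_key(title):
    """Derive the collation string: lowercase, without a leading 'The'."""
    lowered = title.lower()
    for word in IGNORED_INITIAL_WORDS:
        if lowered.startswith(word):
            return lowered[len(word):]
    return lowered

titles = ["The Shining", "Misery", "Carrie"]

# Store both strings: the key drives the order, the title is what is shown.
catalogue = sorted((sort_key(t), t) for t in titles)
display = [title for _, title in catalogue]
```

The displayed list reads Carrie, Misery, The Shining, even though the last title was ordered under "shining".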

Issues with numbers

Sometimes, it is desired to order text with embedded numbers using proper numerical order. For example, "Figure 7b" goes before "Figure 11a", even though '7' comes after '1' in Unicode. This can be extended to Roman numerals. This behavior is not particularly difficult to produce as long as only integers are to be sorted, although it can slow down sorting significantly. For example, Microsoft Windows does this when sorting file names.
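One common way to obtain this behavior for integers is to split each string into text and digit runs and compare the digit runs as numbers. The sketch below is one such approach, with a hypothetical helper name; it is not a description of how any particular operating system implements it.

```python
import re

def natural_key(s):
    """Split into alternating text and digit runs; compare digits as integers."""
    parts = re.split(r"(\d+)", s)
    # With a capturing group, odd positions always hold the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

labels = ["Figure 11a", "Figure 7b", "Figure 2"]
plain = sorted(labels)                      # code order: "11a" before "2"
natural = sorted(labels, key=natural_key)   # numeric order: 2, 7b, 11a
```

Because text and number parts always alternate in the same positions, the keys of any two strings can be compared element by element without mixing types.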

Sorting decimals properly is a bit more difficult, because different locales use different symbols for a decimal point, and sometimes the same character used as a decimal point is also used as a separator, for example "Section 3.2.5". There is no universal answer for how to sort such strings; any rules are application dependent.

Labeling of ordered items

In some contexts, numbers and letters are used not so much as a basis for establishing an ordering, but as a means of labeling items that are already ordered. For example, pages, sections, chapters, and the like, as well as the items of lists, are frequently "numbered" in this way. Labeling series that may be used include ordinary Arabic numerals (1, 2, 3, ...), Roman numerals (I, II, III, ... or i, ii, iii, ...), or letters (A, B, C, ... or a, b, c, ...). (An alternative method for indicating list items, without numbering them, is to use a bulleted list.)
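As a small illustration, labels in each of these series can be generated from an item's position. The function names are hypothetical, and the letter series here assumes the 26-letter English alphabet.

```python
ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
         (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
         (5, "V"), (4, "IV"), (1, "I")]

def to_roman(n):
    """Convert a positive integer to an uppercase Roman numeral."""
    out = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)

def to_letter(n):
    """Label positions 1..26 as A..Z."""
    if not 1 <= n <= 26:
        raise ValueError("only positions 1 to 26 have a single-letter label")
    return chr(ord("A") + n - 1)

items = ["pens", "paper", "ink", "stamps"]
arabic = [f"{i}. {x}" for i, x in enumerate(items, start=1)]
roman = [f"{to_roman(i).lower()}. {x}" for i, x in enumerate(items, start=1)]
lettered = [f"{to_letter(i)}. {x}" for i, x in enumerate(items, start=1)]
```

In each case the label is derived from the existing order of the items rather than being used to establish it.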

When letters of an alphabet are used for this purpose of enumeration, there are certain language-specific conventions as to which letters are used. For example, the Russian letters Ъ and Ь (which in writing are only used for modifying the preceding consonant), and usually also Ы, Й, and Ё, are omitted. Also in many languages that use extended Latin script, the modified letters are often not used in enumeration.

Notes

[note 1] Historically, computers only handled text in uppercase (this dates back to telegraph conventions).

References

[1] Abu-Haidar, J. A. (1983). "Review of A Dictionary of Modern Written Arabic (Arabic-English)". Bulletin of the School of Oriental and African Studies, University of London. 46 (2): 351–353. ISSN 0041-977X. JSTOR 615409.
[2] "Hans Wehr Arabic-English Dictionary". ejtaal.net. Retrieved 2023-06-04.
[3] M Programming: A Comprehensive Guide, Richard F. Walters, Digital Press, 1997.

External links

Unicode Collation Algorithm: Unicode Technical Standard #10
Collation in Spanish. Archived 2006-08-13 at the Wayback Machine
Collation of the names of the member states of the United Nations. Archived August 30, 2005, at the Wayback Machine
Typographical collation for many languages, as proposed in the List module of Cascading Style Sheets.
Collation Charts: Charts demonstrating language-specific sorting orders in various operating systems and DBMS
ICU Locale Explorer. Archived 2008-05-11 at the Wayback Machine: An online demonstration of sorting in different languages that uses the Unicode Collation Algorithm with International Components for Unicode
