Delimiter

From Wikipedia, the free encyclopedia

[Figure: A stylistic depiction of values inside a comma-separated values (CSV) text file. The commas (shown in red) are used as field delimiters.]

A delimiter is a sequence of one or more characters for specifying the boundary between separate, independent regions in plain text, mathematical expressions or other data streams.[1][2] An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values. Another example of a delimiter is the time gap used to separate letters and words in the transmission of Morse code.[citation needed]

In mathematics, delimiters are often used to specify the scope of an operation, and can occur both as isolated symbols (e.g., a colon) and as a pair of opposing-looking symbols (e.g., angle brackets).

Delimiters represent one of various means of specifying boundaries in a data stream. Declarative notation, for example, is an alternative method (without the use of delimiters) that uses a length field at the start of a data stream to specify the number of characters that the data stream contains.[3]
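
The length-field approach can be sketched in Python as follows. This is only an illustration in the spirit of Fortran's Hollerith notation; the function names `encode_hollerith` and `decode_hollerith` are hypothetical and do not come from any library. Each item is written as its character count, the letter H, and then the characters themselves, so the content may contain any character, including H and commas.

```python
def encode_hollerith(text: str) -> str:
    """Prefix text with its length and 'H', e.g. 'HELLO' -> '5HHELLO'."""
    return f"{len(text)}H{text}"


def decode_hollerith(stream: str) -> list[str]:
    """Split a concatenation of length-prefixed items back into strings."""
    items = []
    pos = 0
    while pos < len(stream):
        marker = stream.index("H", pos)    # end of the decimal length field
        length = int(stream[pos:marker])   # number of characters that follow
        start = marker + 1
        items.append(stream[start:start + length])
        pos = start + length
    return items


stream = encode_hollerith("a,b") + encode_hollerith("HH") + encode_hollerith("")
print(stream)                    # 3Ha,b2HHH0H
print(decode_hollerith(stream))  # ['a,b', 'HH', '']
```

Because the reader counts characters instead of searching for a marker, no character in the content can be mistaken for a boundary.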

Overview

Delimiters may be characterized as field and record delimiters, or as bracket delimiters.

Field and record delimiters

Field delimiters separate data fields. Record delimiters separate groups of fields.[4]

For example, the CSV format uses a comma as the delimiter between fields, and an end-of-line indicator as the delimiter between records:

fname,lname,age,salary
nancy,davolio,33,$30000
erin,borakova,28,$25250
tony,raphael,35,$28700

This specifies a simple flat-file database table using the CSV file format.
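
As a minimal sketch, the table above can be read with Python's standard `csv` module, which splits the text into records at line ends and into fields at commas. The variable names are illustrative only.

```python
import csv
import io

data = """fname,lname,age,salary
nancy,davolio,33,$30000
erin,borakova,28,$25250
tony,raphael,35,$28700
"""

# csv.reader yields one list of fields per record (line).
rows = list(csv.reader(io.StringIO(data, newline="")))
header, records = rows[0], rows[1:]

print(header)        # ['fname', 'lname', 'age', 'salary']
print(records[0])    # ['nancy', 'davolio', '33', '$30000']
print(len(records))  # 3
```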

Bracket delimiters

Bracket delimiters, also called block delimiters, region delimiters, or balanced delimiters, mark both the start and end of a region of text.[5][6]

Common examples of bracket delimiters include:[7]

Delimiters  Description
( )         Parentheses. The Lisp programming language syntax is cited as recognizable primarily by its use of parentheses.[8]
{ }         Braces (also called curly brackets[9]).
[ ]         Brackets (commonly used to denote a subscript).
< >         Angle brackets.[10]
" "         Commonly used to denote string literals.[11]
' '         Commonly used to denote character literals.[11]
<? ?>       Used to indicate XML processing instructions.[12]
/* */       Used to denote comments in some programming languages.[13]
<% %>       Used in some web templates to specify language boundaries.[14]

Conventions

Historically, computing platforms have used certain delimiters by convention.[15][16] The following tables depict a few examples for comparison.

Programming languages (see also: Comparison of programming languages (syntax)).

Language  String literal              End of statement
Pascal    single quote                semicolon
Python    double quote, single quote  end of line (EOL)

Field and record delimiters (see also: ASCII, Control character).

Platform                                    End of field                          End of record                           End of file
Unix-like systems including macOS, AmigaOS  Tab                                   LF                                      none
Windows, MS-DOS, OS/2, CP/M                 Tab                                   CRLF                                    none (except in CP/M), Control-Z[17]
Classic Mac OS, Apple DOS, ProDOS, GS/OS    Tab                                   CR                                      none
ASCII/Unicode                               UNIT SEPARATOR, position 31 (U+001F)  RECORD SEPARATOR, position 30 (U+001E)  FILE SEPARATOR, position 28 (U+001C)

Delimiter collision

Delimiter collision is a problem that occurs when an author or programmer introduces delimiters into text without actually intending them to be interpreted as boundaries between separate regions.[4][18] In the case of XML, for example, this can occur whenever an author attempts to specify an angle bracket character.

In most file types there is both a field delimiter and a record delimiter, both of which are subject to collision. In the case of comma-separated values files, for example, field delimiter collision can occur whenever an author attempts to include a comma as part of a field value (e.g., salary = "$30,000"), and record delimiter collision would occur whenever a field contained multiple lines. Both record and field delimiter collision occur frequently in text files.
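
The field collision described above can be reproduced in a few lines of Python. The sketch assumes the common CSV convention of wrapping a field in double quotes, which is the default of Python's `csv` module and not something the text above prescribes.

```python
import csv
import io

record = 'nancy,davolio,33,"$30,000"'

# Naive splitting treats the comma inside the salary as a field delimiter.
naive_fields = record.split(",")
print(naive_fields)   # ['nancy', 'davolio', '33', '"$30', '000"']

# A CSV-aware parser honours the quotes and keeps the salary intact.
parsed_fields = next(csv.reader(io.StringIO(record)))
print(parsed_fields)  # ['nancy', 'davolio', '33', '$30,000']
```

The naive split yields five fields instead of four, which is exactly the misinterpretation that delimiter collision refers to.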

In some contexts, a malicious user or attacker may seek to exploit this problem intentionally. Consequently, delimiter collision can be the source of security vulnerabilities and exploits. Malicious users can take advantage of delimiter collision in languages such as SQL and HTML to deploy such well-known attacks as SQL injection and cross-site scripting, respectively.
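
A small, self-contained illustration of such an exploit uses Python's standard `sqlite3` module with an in-memory database. The table name, column names and attacker input are invented for this sketch. The first query builds SQL text by string interpolation, so the quote characters in the input act as string delimiters. The second passes the input as a bound parameter, so it is treated purely as data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (fname TEXT, salary INTEGER)")
conn.executemany("INSERT INTO staff VALUES (?, ?)",
                 [("nancy", 30000), ("erin", 25250)])

user_input = "x' OR '1'='1"   # hypothetical attacker-supplied value

# Unsafe: the quotes in user_input end the string literal early,
# turning the condition into  fname = 'x' OR '1'='1'.
unsafe_sql = f"SELECT fname FROM staff WHERE fname = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safer: a bound parameter is never parsed as SQL text.
safe_rows = conn.execute("SELECT fname FROM staff WHERE fname = ?",
                         (user_input,)).fetchall()

print(len(unsafe_rows))  # 2 (every row matches)
print(len(safe_rows))    # 0 (no employee has that literal name)
```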

Solutions

Because delimiter collision is a very common problem, various methods for avoiding it have been invented. Some authors may attempt to avoid the problem by choosing a delimiter character (or sequence of characters) that is not likely to appear in the data stream itself. This ad hoc approach may be suitable, but it necessarily depends on a correct guess of what will appear in the data stream, and offers no security against malicious collisions. Other, more formal conventions are therefore applied as well.

ASCII delimited text

The ASCII and Unicode character sets were designed to solve this problem by the provision of non-printing characters that can be used as delimiters. These are the characters in the range from ASCII 28 to 31.

ASCII (dec)  Unicode name                 Common name       Usage
28           INFORMATION SEPARATOR FOUR   file separator    End of file, or between a concatenation of what might otherwise be separate files.
29           INFORMATION SEPARATOR THREE  group separator   Between sections of data. Not needed in simple data files.
30           INFORMATION SEPARATOR TWO    record separator  End of a record or row.
31           INFORMATION SEPARATOR ONE    unit separator    Between fields of a record, or members of a row.

The use of ASCII 31 (unit separator) as a field separator and ASCII 30 (record separator) as a record separator solves the problem of both field and record delimiters that appear in a text data stream.[19]
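
A minimal Python sketch of this scheme follows. The helper names `dumps_asv` and `loads_asv` are hypothetical, and the sketch assumes that the data itself never contains the two control characters.

```python
UNIT_SEP = "\x1f"    # ASCII 31, placed between fields
RECORD_SEP = "\x1e"  # ASCII 30, placed between records


def dumps_asv(rows):
    """Join fields with the unit separator and records with the record separator."""
    return RECORD_SEP.join(UNIT_SEP.join(fields) for fields in rows)


def loads_asv(text):
    """Inverse of dumps_asv for data that contains neither separator."""
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


rows = [["nancy", "davolio", "$30,000"],
        ["erin", "borakova", "line one\nline two"]]
encoded = dumps_asv(rows)
print(loads_asv(encoded) == rows)  # True
```

Commas and line breaks inside field values pass through unchanged because neither acts as a delimiter in this format.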

Escape character

One method for avoiding delimiter collision is to use escape characters. From a language design standpoint, these are adequate, but they have drawbacks:

  • text can be rendered unreadable when littered with numerous escape characters, a problem referred to as leaning toothpick syndrome (due to use of \ to escape / in Perl regular expressions, leading to sequences such as "\/\/");
  • text becomes difficult to parse with regular expressions;
  • they require a mechanism to "escape the escapes" when not intended as escape characters;
  • although easy to type, they can be cryptic to someone unfamiliar with the language;[20] and
  • they do not protect against injection attacks.[citation needed]
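
The escape-character approach, including the need to "escape the escapes", can be sketched in Python as follows. The backslash as escape character, the comma as delimiter, and the function names `join_escaped` and `split_escaped` are choices made for this illustration only.

```python
ESCAPE = "\\"
DELIM = ","


def join_escaped(fields):
    out = []
    for field in fields:
        # Escape the escape character first, then the delimiter.
        field = field.replace(ESCAPE, ESCAPE + ESCAPE)
        field = field.replace(DELIM, ESCAPE + DELIM)
        out.append(field)
    return DELIM.join(out)


def split_escaped(line):
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == ESCAPE:
            # Take the next character literally, whatever it is.
            current.append(next(chars, ""))
        elif ch == DELIM:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


fields = ["davolio", "$30,000", "C:\\temp"]
line = join_escaped(fields)
print(line)                           # davolio,$30\,000,C:\\temp
print(split_escaped(line) == fields)  # True
```

The splitter has to walk the text character by character, because a plain split on the delimiter cannot tell an escaped comma from a real boundary.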

Escape sequence

Escape sequences are similar to escape characters, except they usually consist of some kind of mnemonic instead of just a single character. One use is in string literals that include a double quote (") character. For example in Perl, the code:

print "Nancy said \x22Hello World!\x22 to the crowd.";    ### use \x22

produces the same output as:

print "Nancy said \"Hello World!\" to the crowd.";    ### use escape char

One drawback of escape sequences, when used by people, is the need to memorize the codes that represent individual characters (see also: character entity reference, numeric character reference).
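
The same equivalence holds in Python, which is used here only as a second illustration. `\x22` is the hexadecimal escape sequence for the double quote character, and the variable names are arbitrary.

```python
with_hex = "Nancy said \x22Hello World!\x22 to the crowd."     # escape sequence
with_backslash = "Nancy said \"Hello World!\" to the crowd."   # escape character

print(with_hex == with_backslash)  # True
print(with_hex)                    # Nancy said "Hello World!" to the crowd.
```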

Dual quoting delimiters

In contrast to escape sequences and escape characters, dual delimiters provide yet another way to avoid delimiter collision. Some languages, for example, allow the use of either a single quote (') or a double quote (") to specify a string literal. For example, in Perl:

print 'Nancy said "Hello World!" to the crowd.';

produces the desired output without requiring escapes. This approach, however, only works when the string does not contain both types of quotation marks.
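
Python also accepts either quote character as a string delimiter, so the technique and its limitation can be shown side by side. This is a short sketch with arbitrary variable names.

```python
# Double quotes inside a single-quoted literal need no escaping.
single_quoted = 'Nancy said "Hello World!" to the crowd.'

# The same text in a double-quoted literal needs backslash escapes.
double_quoted = "Nancy said \"Hello World!\" to the crowd."

# When both kinds of quote occur, one of them still has to be escaped.
both_kinds = 'Nancy doesn\'t want to say "Hello World!" anymore.'

print(single_quoted == double_quoted)  # True
print(both_kinds)
```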

Padding quoting delimiters

In contrast to escape sequences and escape characters, padding delimiters provide yet another way to avoid delimiter collision. Visual Basic, for example, uses double quotes as string delimiters and represents a double quote inside a string by writing it twice. This is similar to escaping the delimiter.

print "Nancy said ""Hello World!"" to the crowd."

produces the desired output without requiring escape characters. Like regular escaping it can, however, become confusing when many quotes are used. The code to print the above source code would look more confusing:

print "print ""Nancy said """"Hello World!"""" to the crowd."""
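
The doubling rule is mechanical, as the following Python sketch shows. The helper names `quote_by_doubling` and `unquote_doubled` are hypothetical, and `unquote_doubled` assumes its argument is a well-formed literal.

```python
def quote_by_doubling(text: str) -> str:
    """Wrap text in double quotes, doubling every embedded double quote."""
    return '"' + text.replace('"', '""') + '"'


def unquote_doubled(literal: str) -> str:
    """Inverse of quote_by_doubling for a well-formed literal."""
    return literal[1:-1].replace('""', '"')


message = 'Nancy said "Hello World!" to the crowd.'
once = quote_by_doubling(message)
print(once)    # "Nancy said ""Hello World!"" to the crowd."

# Quoting the whole print statement doubles the already doubled quotes.
twice = quote_by_doubling("print " + once)
print(twice)   # "print ""Nancy said """"Hello World!"""" to the crowd."""

print(unquote_doubled(once) == message)  # True
```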

Configurable alternative quoting delimiters

In contrast to dual delimiters, multiple delimiters are even more flexible for avoiding delimiter collision.[7]: 63 

For example, in Perl:

print qq^Nancy doesn't want to say "Hello World!" anymore.^;
print qq@Nancy doesn't want to say "Hello World!" anymore.@;
print qq(Nancy doesn't want to say "Hello World!" anymore.);

all produce the desired output through use of quote operators, which allow any convenient character to act as a delimiter. Although this method is more flexible, few languages support it. Perl and Ruby are two that do.[7]: 62 [21]

Content boundary

A content boundary is a special type of delimiter that is specifically designed to resist delimiter collision. It works by allowing the author to specify a sequence of characters that is guaranteed to always indicate a boundary between parts in a multi-part message, with no other possible interpretation.[22]

The delimiter is frequently generated from a random sequence of characters that is statistically improbable to occur in the content. This may be followed by an identifying mark such as a UUID, a timestamp, or some other distinguishing mark. Alternatively, the content may be scanned to guarantee that a delimiter does not appear in the text. This may allow the delimiter to be shorter or simpler, and increase the human readability of the document. (See e.g., MIME, Here documents).
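
Both ideas, a random delimiter and a scan of the content, can be combined in a short Python sketch. The boundary format and the function names are assumptions made for this illustration and do not follow the MIME specification exactly.

```python
import secrets


def choose_boundary(parts):
    """Return a random marker that occurs in none of the parts."""
    while True:
        boundary = "----boundary-" + secrets.token_hex(16)
        if not any(boundary in part for part in parts):
            return boundary


def join_parts(parts):
    """Build a multi-part message; returns (boundary, message)."""
    boundary = choose_boundary(parts)
    body = "".join(f"{boundary}\n{part}\n" for part in parts)
    return boundary, body + boundary + "--\n"   # trailing "--" marks the end


def split_parts(boundary, message):
    chunks = message.split(boundary)
    # chunks[0] is the empty preamble and chunks[-1] is the closing "--\n".
    return [chunk[1:-1] for chunk in chunks[1:-1]]


parts = ["first part, with commas", 'second part\nwith "quotes" and newlines']
boundary, message = join_parts(parts)
print(split_parts(boundary, message) == parts)  # True
```

The receiver must learn the boundary out of band. Here it is returned alongside the message, whereas MIME carries it in a header.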

Whitespace or indentation

Some programming and computer languages allow the use of whitespace delimiters or indentation as a means of specifying boundaries between independent regions in text.[23]

Regular expression syntax

In specifying a regular expression, alternate delimiters may also be used to simplify the syntax for match and substitution operations in Perl.[24]

For example, a simple match operation may be specified in Perl with the following syntax:

$string1 = 'Nancy said "Hello World!" to the crowd.';    # specify a target string
print $string1 =~ m/[aeiou]+/;                           # match one or more vowels

The syntax is flexible enough to specify match operations with alternate delimiters, making it easy to avoid delimiter collision:

$string1 = 'Nancy said "http://Hello/World.htm" is not a valid address.';    # target string

print $string1 =~ m@http://@;    # match using alternate regular expression delimiter
print $string1 =~ m{http://};    # same as previous, but different delimiter
print $string1 =~ m!http://!;    # same as previous, but different delimiter

Here document

A here document allows the inclusion of arbitrary content by describing a special end sequence. Many languages support this, including PHP, bash scripts, Ruby and Perl. A here document starts by describing what the end sequence will be and continues until that sequence is seen at the start of a new line.[25]

Here is an example in Perl:

print <<ENDOFHEREDOC;
It's very hard to encode a string with "certain characters".

Newlines, commas, and other characters can cause delimiter collisions.
ENDOFHEREDOC

This code would print:

It's very hard to encode a string with "certain characters".

Newlines, commas, and other characters can cause delimiter collisions.

By using a special end sequence, all manner of characters are allowed in the string, provided that the end sequence itself does not occur at the start of a line within the content.

ASCII armor

Although principally used as a mechanism for text encoding of binary data, ASCII armoring is a programming and systems administration technique that also helps to avoid delimiter collision in some circumstances.[26][27] This technique differs from the other approaches described above in that it is more complicated, and therefore not suitable for small applications and simple data storage formats. The technique employs a special encoding scheme, such as base64, to ensure that delimiters or other significant characters do not appear in transmitted data. The purpose is to prevent multilayered escaping, for example of double quotes.

This technique is used, for example, in Microsoft's ASP.NET web development technology, and is closely associated with the "VIEWSTATE" component of that system.[28]

Example

The following simplified example demonstrates how this technique works in practice.

The first code fragment shows a simple HTML tag in which the VIEWSTATE value contains characters that are incompatible with the delimiters of the HTML tag itself:

<input type="hidden" name="__VIEWSTATE" value="BookTitle:Nancy doesn't say "Hello World!" anymore." />

This first code fragment is not well-formed, and would therefore not work properly in a "real world" deployed system.

To store arbitrary text in an HTML attribute, HTML entities can be used. In this case "&quot;" stands in for the double-quote:

<input type="hidden" name="__VIEWSTATE" value="BookTitle:Nancy doesn't say &quot;Hello World!&quot; anymore." />

Alternatively, any encoding could be used that doesn't include characters that have special meaning in the context, such as base64:

<input type="hidden" name="__VIEWSTATE" value="Qm9va1RpdGxlOk5hbmN5IGRvZXNuJ3Qgc2F5ICJIZWxsbyBXb3JsZCEiIGFueW1vcmUu" />

Or percent-encoding:

<input type="hidden" name="__VIEWSTATE" value="BookTitle:Nancy%20doesn%27t%20say%20%22Hello%20World!%22%20anymore." />

This prevents delimiter collision and ensures that incompatible characters will not appear inside the HTML code, regardless of what characters appear in the original (decoded) text.[28]
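
The three encodings can be produced with Python's standard library, as in the following sketch. The `safe=":!"` argument is chosen here only so that the percent-encoded form leaves the colon and exclamation mark unescaped. Note that `html.escape` also replaces the apostrophe, so its output differs slightly from the entity example above.

```python
import base64
import html
from urllib.parse import quote

text = 'BookTitle:Nancy doesn\'t say "Hello World!" anymore.'

as_entities = html.escape(text)        # replaces &, <, >, " and '
as_base64 = base64.b64encode(text.encode("utf-8")).decode("ascii")
as_percent = quote(text, safe=":!")    # leave ':' and '!' unescaped

# None of the encoded forms contains the attribute delimiter.
for encoded in (as_entities, as_base64, as_percent):
    assert '"' not in encoded

print(as_base64)
# Qm9va1RpdGxlOk5hbmN5IGRvZXNuJ3Qgc2F5ICJIZWxsbyBXb3JsZCEiIGFueW1vcmUu
print(as_percent)
# BookTitle:Nancy%20doesn%27t%20say%20%22Hello%20World!%22%20anymore.
```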

References

  1. "Definition: delimiter". Federal Standard 1037C - Telecommunications: Glossary of Telecommunication Terms. Archived from the original on 2013-03-05. Retrieved 2019-11-25.
  2. "What is a Delimiter?". computerhope. Retrieved 2020-08-09.
  3. Rohl, Jeffrey S. (1973). Programming in Fortran. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-7190-0555-8. Describes the method in Hollerith notation under the Fortran programming language.
  4. de Moor, Georges J. (1993). Progress in Standardization in Health Care Informatics. IOS Press. ISBN 90-5199-114-2. p. 141.
  5. Friedl, Jeffrey E. F. (2002). Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools. O'Reilly. ISBN 0-596-00289-0. p. 319.
  6. Scott, Michael Lee (1999). Programming Language Pragmatics. Morgan Kaufmann. ISBN 1-55860-442-1.
  7. Wall, Larry; Orwant, Jon (July 2000). Programming Perl (Third ed.). O'Reilly. ISBN 0-596-00027-8.
  8. Kaufmann, Matt (2000). Computer-Aided Reasoning: An Approach. Springer. ISBN 0-7923-7744-3. p. 3.
  9. Meyer, Mark (2005). Explorations in Computer Science. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-7637-3832-7. References C-style programming languages prominently featuring curly brackets and semicolons.
  10. Dilligan, Robert (1998). Computing in the Web Age. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-306-45972-6. Describes syntax and delimiters used in HTML.
  11. Schwartz, Randal (2005). Learning Perl. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-596-10105-3. Describes string literals.
  12. Watt, Andrew (2003). Sams Teach Yourself Xml in 10 Minutes. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-672-32471-0. Describes XML processing instruction. p. 21.
  13. Cabrera, Harold (2002). C# for Java Programmers. Oxford Oxfordshire: Oxford University Press. ISBN 978-1-931836-54-8. Describes single-line and multi-line comments. p. 72.
  14. "Jakarta Server Pages Specification, Version 4.0". GitHub. Retrieved 2023-02-10.
  15. ISO/TC 97/SC 2 (December 1, 1975). The set of control characters for ISO 646 (PDF). ITSCJ/IPSJ. ISO-IR-1.
  16. American National Standards Institute (December 1, 1975). ASCII graphic character set (PDF). ITSCJ/IPSJ. ISO-IR-6.
  17. Lewine, Donald (1991). Posix Programmer's Guide. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-937175-73-6. Describes use of control-z. p. 156.
  18. Friedl, Jeffrey (2006). Mastering Regular Expressions. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-596-52812-6. Describes solutions for embedded-delimiter problems. p. 472.
  19. Discussion on ASCII Delimited Text vs CSV and Tab Delimited.
  20. Kahrel, Peter (2006). Automating InDesign with Regular Expressions. O'Reilly. p. 11. ISBN 0-596-52937-6.
  21. Yukihiro, Matsumoto (2001). Ruby in a Nutshell. O'Reilly. ISBN 0-596-00214-9. In Ruby, these are indicated as general delimited strings. p. 11.
  22. Network Protocols Handbook. Javvin Technologies Inc. 2005. ISBN 0-9740945-2-8. p. 26.
  23. Computational Linguistics and Intelligent Text Processing. Oxford Oxfordshire: Oxford University Press. 2001. ISBN 978-3-540-41687-6. Describes whitespace delimiters. p. 258.
  24. Friedl, Jeffrey (2006). Mastering Regular Expressions. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-596-52812-6. p. 472.
  25. Perl operators and precedence.
  26. Rhee, Man (2003). Internet Security: Cryptographic Principles, Algorithms and Protocols. John Wiley and Sons. ISBN 0-470-85285-2. (An example usage of ASCII armoring in encryption applications.)
  27. Gross, Christian (2005). Open Source for Windows Administrators. Charles River Media. ISBN 1-58450-347-5. (An example usage of ASCII armoring in encryption applications.)
  28. Kalani, Amit (2004). Developing and Implementing Web Applications with Visual C# .NET and Visual Studio .NET. Que. ISBN 0-7897-2901-6. (Describes the use of Base64 encoding and VIEWSTATE inside HTML source code.)