Burrows–Wheeler transform

The Burrows–Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding. More importantly, the transformation is reversible, without needing to store any additional data except the position of the first original character. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation. The Burrows–Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at DEC Systems Research Center in Palo Alto, California. It is based on a previously unpublished transformation discovered by Wheeler in 1983. The algorithm can be implemented efficiently using a suffix array, thus reaching linear time complexity.[1]

Burrows–Wheeler transform
Class: preprocessing for lossless compression
Data structure: string
Worst-case performance: O(n)
Worst-case space complexity: O(n)

Description

When a character string is transformed by the BWT, the transformation permutes the order of the characters. If the original string had several substrings that occurred often, then the transformed string will have several places where a single character is repeated multiple times in a row.

For example:

Input SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT[2]

The output is easier to compress because it has many repeated characters. In this example the transformed string contains six runs of identical characters: XX, SS, PP, .., II, and III, which together make 13 out of the 44 characters.
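The run count quoted above can be checked mechanically. The following is a minimal Python sketch; the helper name `runs` is hypothetical and only serves this illustration.

```python
from itertools import groupby


def runs(text: str, min_length: int = 2) -> list[str]:
    """Return the maximal runs of identical characters with at least min_length characters."""
    grouped = ("".join(group) for _, group in groupby(text))
    return [run for run in grouped if len(run) >= min_length]


original = "SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES"
transformed = "TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT"

found = runs(transformed)
print(found)                                    # ['XX', 'SS', 'PP', '..', 'II', 'III']
print(sum(map(len, found)), len(transformed))   # 13 44
print(runs(original))                           # [] (the input has no repeated neighbours)
```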

Example

The transform is done by sorting all the circular shifts of a text in lexicographic order and by extracting the last column and the index of the original string in the set of sorted rotations of S.

Given an input string S = ^BANANA$ (step 1 in the table below), rotate it N times (step 2), where N = 8 is the length of the string S, counting also the ^ character representing the start of the string and the $ character representing the 'EOF' pointer; these rotations, or circular shifts, are then sorted lexicographically (step 3). In the table, letters sort before ^, which in turn sorts before $. The output of the encoding phase is the last column L = BNN^AA$A after step 3, and the index (0-based) I of the row containing the original string S, in this case I = 6.

It is not necessary to use both $ and ^, but at least one must be used, else we cannot invert the transform, since all circular permutations of a string have the same Burrows–Wheeler transform.
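The claim that all circular permutations of a string share one transform can be checked with a naive rotation sort. This is a small sketch under the assumption of Python's default character ordering; `last_column` is an illustrative helper name, and `$` is used as the end marker.

```python
def last_column(s: str) -> str:
    """Sort all rotations of s and return the last character of each row."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


# Without a marker, every rotation of BANANA yields the same last column,
# so the decoder cannot tell which rotation was the input.
print(last_column("BANANA"))  # NNBAAA
print(last_column("ANANAB"))  # NNBAAA
print(last_column("NANABA"))  # NNBAAA

# With an end marker appended, different inputs give different outputs.
print(last_column("BANANA$") != last_column("ANANAB$"))  # True
```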

Transformation
1. Input    2. All rotations    3. Sort into lexical order    4. Take the last column    5. Output
^BANANA$    ^BANANA$            ANANA$^B                      ANANA$^B                   BNN^AA$A
            $^BANANA            ANA$^BAN                      ANA$^BAN
            A$^BANAN            A$^BANAN                      A$^BANAN
            NA$^BANA            BANANA$^                      BANANA$^
            ANA$^BAN            NANA$^BA                      NANA$^BA
            NANA$^BA            NA$^BANA                      NA$^BANA
            ANANA$^B            ^BANANA$                      ^BANANA$
            BANANA$^            $^BANANA                      $^BANANA

The following pseudocode gives a simple (though inefficient) way to calculate the BWT and its inverse. It assumes that the input string s contains a special character 'EOF' which is the last character and occurs nowhere else in the text.

function BWT (string s)
    create a table, where the rows are all possible rotations of s
    sort rows alphabetically
    return (last column of the table)

function inverseBWT (string s)
    create empty table
    repeat length(s) times
        // first insert creates first column
        insert s as a column of table before first column of the table
        sort rows of the table alphabetically
    return (row that ends with the 'EOF' character)
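The worked table can be reproduced with a short Python sketch of the forward pseudocode. It assumes, as the table's sorted column suggests, that letters sort before `^` and `^` before `$`; the names `sort_key` and `bwt_with_index` are hypothetical.

```python
def sort_key(row: str) -> list[int]:
    """Rank characters so that '^' and '$' sort after every other character.

    Assumption: this mirrors the ordering used in the table (letters < ^ < $).
    """
    special = {"^": 0x110000, "$": 0x110001}  # above any Unicode code point
    return [special.get(ch, ord(ch)) for ch in row]


def bwt_with_index(s: str) -> tuple[str, int]:
    """Return the last column and the 0-based row index of the original string."""
    rotations = sorted((s[i:] + s[:i] for i in range(len(s))), key=sort_key)
    last = "".join(row[-1] for row in rotations)
    return last, rotations.index(s)


print(bwt_with_index("^BANANA$"))  # ('BNN^AA$A', 6)
```

A different character ordering gives a different but equally valid transform, as long as the decoder sorts the same way.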

Explanation

To understand why this creates more-easily-compressible data, consider transforming a long English text frequently containing the word "the". Sorting the rotations of this text will group rotations starting with "he" together, and the last character of each such rotation (which is also the character before the "he") will usually be "t", so the result of the transform would contain a number of "t" characters along with the perhaps less-common exceptions (such as if it contains "ache") mixed in. The success of this transform therefore depends upon one value having a high probability of occurring before a given sequence, so in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).
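The effect is easy to observe on a toy input. This sketch counts which character cyclically precedes each occurrence of a context; the function name and the sample sentence are made up for illustration.

```python
from collections import Counter


def preceding_chars(text: str, context: str) -> Counter:
    """Count the characters that (cyclically) precede each occurrence of context."""
    return Counter(
        text[i - 1]  # for i == 0 this wraps around to the last character
        for i in range(len(text))
        if text.startswith(context, i)
    )


print(preceding_chars("the ache then the", "he"))  # Counter({'t': 3, 'c': 1})
```

In the sorted rotation table these four rows sit next to each other, so their last-column characters form a cluster dominated by "t".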

The remarkable thing about the BWT is not that it generates a more easily encoded output (an ordinary sort would do that), but that it does this reversibly, allowing the original document to be re-generated from the last column data.

The inverse can be understood this way. Take the final table in the BWT algorithm, and erase all but the last column. Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters alphabetically to get the first column. Then, the last and first columns (of each row) together give you all pairs of successive characters in the document, where pairs are taken cyclically so that the last and first character form a pair. Sorting the list of pairs gives the first and second columns. Continuing in this manner, you can reconstruct the entire list. Then, the row with the "end of file" character at the end is the original text. Reversing the example above is done like this:

Inverse transformation
Input: BNN^AA$A

Add 1       Sort 1      Add 2       Sort 2
B           A           BA          AN
N           A           NA          AN
N           A           NA          A$
^           B           ^B          BA
A           N           AN          NA
A           N           AN          NA
$           ^           $^          ^B
A           $           A$          $^

Add 3       Sort 3      Add 4       Sort 4
BAN         ANA         BANA        ANAN
NAN         ANA         NANA        ANA$
NA$         A$^         NA$^        A$^B
^BA         BAN         ^BAN        BANA
ANA         NAN         ANAN        NANA
ANA         NA$         ANA$        NA$^
$^B         ^BA         $^BA        ^BAN
A$^         $^B         A$^B        $^BA

Add 5       Sort 5      Add 6       Sort 6
BANAN       ANANA       BANANA      ANANA$
NANA$       ANA$^       NANA$^      ANA$^B
NA$^B       A$^BA       NA$^BA      A$^BAN
^BANA       BANAN       ^BANAN      BANANA
ANANA       NANA$       ANANA$      NANA$^
ANA$^       NA$^B       ANA$^B      NA$^BA
$^BAN       ^BANA       $^BANA      ^BANAN
A$^BA       $^BAN       A$^BAN      $^BANA

Add 7       Sort 7      Add 8       Sort 8
BANANA$     ANANA$^     BANANA$^    ANANA$^B
NANA$^B     ANA$^BA     NANA$^BA    ANA$^BAN
NA$^BAN     A$^BANA     NA$^BANA    A$^BANAN
^BANANA     BANANA$     ^BANANA$    BANANA$^
ANANA$^     NANA$^B     ANANA$^B    NANA$^BA
ANA$^BA     NA$^BAN     ANA$^BAN    NA$^BANA
$^BANAN     ^BANANA     $^BANANA    ^BANANA$
A$^BANA     $^BANAN     A$^BANAN    $^BANANA

Output: ^BANANA$

Optimization

A number of optimizations can make these algorithms run more efficiently without changing the output. There is no need to represent the table in either the encoder or decoder. In the encoder, each row of the table can be represented by a single pointer into the string s, and the sort performed using the indices. In the decoder, there is also no need to store the table, and in fact no sort is needed at all. In time proportional to the alphabet size and string length, the decoded string may be generated one character at a time from right to left. A "character" in the algorithm can be a byte, or a bit, or any other convenient size.
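One common way to realize a table-free decoder is the last-to-first (LF) mapping, which sends each row to the row of the rotation that starts with that row's last character. The sketch below is illustrative rather than a reference implementation; it assumes the transformed string contains a unique end marker (`$` by default, an assumption) that was the final character of the text, and that the encoder sorted rotations by Python's default character order.

```python
from collections import Counter


def inverse_bwt_lf(last: str, eof: str = "$") -> str:
    """Invert a BWT without building or sorting the rotation table."""
    # first_pos[c]: first row of the sorted table whose row starts with c
    counts = Counter(last)
    first_pos, total = {}, 0
    for ch in sorted(counts):  # only the distinct characters are sorted
        first_pos[ch] = total
        total += counts[ch]

    # rank[i]: how many times last[i] already occurred in last[:i]
    seen = Counter()
    rank = []
    for ch in last:
        rank.append(seen[ch])
        seen[ch] += 1

    # The row ending in eof is the original text; walk it from right to left.
    row = last.index(eof)
    reversed_text = []
    for _ in range(len(last)):
        ch = last[row]
        reversed_text.append(ch)
        row = first_pos[ch] + rank[row]  # LF mapping
    return "".join(reversed(reversed_text))


print(inverse_bwt_lf("ANNB^AA$"))  # ^BANANA$
```

Only the distinct characters are sorted here, so the work is dominated by two passes over the string plus a small per-alphabet setup.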

One may also make the observation that mathematically, the encoded string can be computed as a simple modification of the suffix array, and suffix arrays can be computed with linear time and memory. The BWT can be defined with regards to the suffix array SA of text T, where T ends with the end marker $, as (1-based indexing):

\[ BWT[i] = \begin{cases} T[SA[i] - 1], & \text{if } SA[i] > 1 \\ \$, & \text{otherwise} \end{cases} \] [3]
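The relation between the suffix array and the BWT can be sketched in a few lines of Python using 0-based indices. The suffix array here is built naively by sorting suffixes, which is far from the linear-time constructions mentioned above and is only meant to show the definition; the function names are hypothetical, and the text is assumed to end with a unique marker (`$`) that sorts before every other character.

```python
def suffix_array(text: str) -> list[int]:
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str) -> str:
    """Take the character preceding each sorted suffix.

    For the suffix starting at 0, index -1 wraps around to the end marker.
    """
    return "".join(text[i - 1] for i in suffix_array(text))


print(suffix_array("BANANA$"))           # [6, 5, 3, 1, 0, 4, 2]
print(bwt_from_suffix_array("BANANA$"))  # ANNB$AA
```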

There is no need to have an actual 'EOF' character. Instead, a pointer can be used that remembers where in a string the 'EOF' would be if it existed. In this approach, the output of the BWT must include both the transformed string, and the final value of the pointer. The inverse transform then shrinks it back down to the original size: it is given a string and a pointer, and returns just a string.
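A minimal sketch of the marker-free variant follows. It makes one specific choice that the text leaves open: the stored pointer is the row index of the original string in the sorted table. The function names are hypothetical, the input is assumed to be non-empty, and the inverse uses the simple table-building method for brevity.

```python
def bwt_with_pointer(s: str) -> tuple[str, int]:
    """Return (last column, row index of the original string); no EOF is added."""
    # Each rotation is represented by its start index; sort the indices.
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    last = "".join(s[i - 1] for i in order)  # character before each rotation start
    return last, order.index(0)


def inverse_bwt_with_pointer(last: str, index: int) -> str:
    """Rebuild the sorted rotation table and return the row named by the pointer."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(ch + row for ch, row in zip(last, table))
    return table[index]


encoded = bwt_with_pointer("BANANA")
print(encoded)                             # ('NNBAAA', 3)
print(inverse_bwt_with_pointer(*encoded))  # BANANA
```

The transformed string has the same length as the input; the pointer is the only extra data.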

A complete description of the algorithms can be found in Burrows and Wheeler's paper, or in a number of online sources.[1] The algorithms vary somewhat by whether EOF is used, and in which direction the sorting was done. In fact, the original formulation did not use an EOF marker.[4]

Bijective variant

Since any rotation of the input string will lead to the same transformed string, the BWT cannot be inverted without adding an EOF marker to the end of the input or doing something equivalent, making it possible to distinguish the input string from all its rotations. Increasing the size of the alphabet (by appending the EOF character) makes later compression steps awkward.

There is a bijective version of the transform, by which the transformed string uniquely identifies the original, and the two have the same length and contain exactly the same characters, just in a different order.[5][6]

The bijective transform is computed by factoring the input into a non-increasing sequence of Lyndon words; such a factorization exists and is unique by the Chen–Fox–Lyndon theorem,[7] and may be found in linear time and constant space.[8] The algorithm sorts the rotations of all the words; as in the Burrows–Wheeler transform, this produces a sorted sequence of n strings. The transformed string is then obtained by picking the final character of each string in this sorted list. The one important caveat here is that strings of different lengths are not ordered in the usual way; the two strings are repeated forever, and the infinite repeats are sorted. For example, "ORO" precedes "OR" because "OROORO..." precedes "OROROR...".

For example, the text "^BANANA$" is transformed into "ANNBAA^$" through these steps (the $ character indicates the EOF pointer in the original string). The EOF character is unneeded in the bijective transform, so it is dropped during the transform and re-added to its proper place in the file.

The string is broken into Lyndon words so the words in the sequence are decreasing using the comparison method above. (Note that we're sorting '^' as succeeding other characters.) "^BANANA" becomes (^) (B) (AN) (AN) (A).

Bijective transformation
Input       All rotations        Sorted alphabetically    Last column of rotated Lyndon word    Output
^BANANA$    ^^^^^^^^... (^)      AAAAAAAA... (A)          AAAAAAAA... (A)                       ANNBAA^$
            BBBBBBBB... (B)      ANANANAN... (AN)         ANANANAN... (AN)
            ANANANAN... (AN)     ANANANAN... (AN)         ANANANAN... (AN)
            NANANANA... (NA)     BBBBBBBB... (B)          BBBBBBBB... (B)
            ANANANAN... (AN)     NANANANA... (NA)         NANANANA... (NA)
            NANANANA... (NA)     NANANANA... (NA)         NANANANA... (NA)
            AAAAAAAA... (A)      ^^^^^^^^... (^)          ^^^^^^^^... (^)
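The forward bijective transform can be sketched as follows. The factorization uses Duval's algorithm; the comparison of "infinite repeats" is approximated by comparing prefixes of length twice the longest word, which by the Fine–Wilf periodicity lemma is enough to separate any two distinct infinite repeats. Function names are hypothetical, and Python's default character order is assumed (in which `^` happens to sort after the uppercase letters, as in the example).

```python
def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm: split s into a non-increasing sequence of Lyndon words."""
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:  # emit every repetition of the Lyndon word found
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def bijective_bwt(s: str) -> str:
    """Sort the rotations of all Lyndon factors by their infinite repeats."""
    words = lyndon_factorization(s)
    rotations = [w[i:] + w[:i] for w in words for i in range(len(w))]
    width = 2 * max((len(w) for w in words), default=0)

    def repeated_prefix(rotation: str) -> str:
        copies = -(-width // len(rotation))  # ceiling division
        return (rotation * copies)[:width]

    rotations.sort(key=repeated_prefix)
    return "".join(rotation[-1] for rotation in rotations)


print(lyndon_factorization("^BANANA"))  # ['^', 'B', 'AN', 'AN', 'A']
print(bijective_bwt("^BANANA"))         # ANNBAA^
```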
Inverse bijective transform
Input: ANNBAA^

Add 1       Sort 1      Add 2       Sort 2
A           A           AA          AA
N           A           NA          AN
N           A           NA          AN
B           B           BB          BB
A           N           AN          NA
A           N           AN          NA
^           ^           ^^          ^^

Add 3       Sort 3      Add 4       Sort 4
AAA         AAA         AAAA        AAAA
NAN         ANA         NANA        ANAN
NAN         ANA         NANA        ANAN
BBB         BBB         BBBB        BBBB
ANA         NAN         ANAN        NANA
ANA         NAN         ANAN        NANA
^^^         ^^^         ^^^^        ^^^^

Output: ^BANANA

Up until the last step, the process is identical to the inverse Burrows–Wheeler process, but here it will not necessarily give rotations of a single sequence; it instead gives rotations of Lyndon words (which will start to repeat as the process is continued). Here, we can see (repetitions of) four distinct Lyndon words: (A), (AN) (twice), (B), and (^). (NANA... doesn't represent a distinct word, as it is a cycle of ANAN....) At this point, these words are sorted into reverse order: (^), (B), (AN), (AN), (A). These are then concatenated to get

^BANANA
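The same inversion can be carried out without building the table, by following the cycles of the last-to-first mapping: each cycle, started from its smallest row, spells one Lyndon word backwards, and the words are discovered in increasing order. The following is a sketch under the assumption of Python's default character order; the function name is hypothetical.

```python
from collections import Counter


def inverse_bijective_bwt(last: str) -> str:
    """Invert the bijective transform by walking the cycles of the LF mapping."""
    counts = Counter(last)
    first_pos, total = {}, 0
    for ch in sorted(counts):
        first_pos[ch] = total
        total += counts[ch]

    # lf[i]: row of the rotation obtained by moving last[i] to the front of row i
    seen = Counter()
    lf = []
    for ch in last:
        lf.append(first_pos[ch] + seen[ch])
        seen[ch] += 1

    visited = [False] * len(last)
    reversed_text = []
    for start in range(len(last)):  # smallest unvisited row opens the next cycle
        row = start
        while not visited[row]:
            visited[row] = True
            reversed_text.append(last[row])  # reads one Lyndon word backwards
            row = lf[row]
    # Words were found smallest first and read backwards, so reverse everything.
    return "".join(reversed(reversed_text))


print(inverse_bijective_bwt("ANNBAA^"))  # ^BANANA
```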

The Burrows–Wheeler transform can indeed be viewed as a special case of this bijective transform; instead of the traditional introduction of a new letter from outside our alphabet to denote the end of the string, we can introduce a new letter that compares as preceding all existing letters that is put at the beginning of the string. The whole string is now a Lyndon word, and running it through the bijective process will therefore result in a transformed result that, when inverted, gives back the Lyndon word, with no need for reassembling at the end.

Relatedly, the transformed text will only differ from the result of BWT by one character per Lyndon word; for example, if the input is decomposed into six Lyndon words, the output will only differ in six characters. For example, applying the bijective transform gives:

Input SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Lyndon words SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output STEYDST.E.IXXIIXXSMPPXS.B..EE..SUSFXDIOIIIIT

The bijective transform includes eight runs of identical characters. These runs are, in order: XX, II, XX, PP, .., EE, .., and IIII.

In total, 18 characters are used in these runs.

Dynamic Burrows–Wheeler transform

When a text is edited, its Burrows–Wheeler transform will change. Salson et al.[9] propose an algorithm that deduces the Burrows–Wheeler transform of an edited text from that of the original text, doing a limited number of local reorderings in the original Burrows–Wheeler transform, which can be faster than constructing the Burrows–Wheeler transform of the edited text directly.

Sample implementation

This Python implementation sacrifices speed for simplicity: the program is short, but takes more than the linear time that would be desired in a practical implementation. It essentially does what the pseudocode section does.

Using the STX/ETX control codes to mark the start and end of the text, and using s[i:] + s[:i] to construct the ith rotation of s, the forward transform takes the last character of each of the sorted rows:

from curses.ascii import STX, ETX


def bwt(s: str, start=chr(STX), end=chr(ETX)) -> str:
    r"""
    Apply Burrows–Wheeler transform to input string.

    >>> bwt('BANANA')
    '\x03ANNB\x02AA'
    >>> bwt('BANANA', start='^', end='$')
    'ANNB^AA$'
    >>> bwt('BANANA', start='%', end='$')
    'A$NNB%AA'
    """
    assert (
        start not in s and end not in s
    ), "Input string cannot contain STX and ETX characters"
    s = f"{start}{s}{end}"  # Add start and end of text marker

    # Table of rotations of string
    table = sorted(f"{s[i:]}{s[:i]}" for i in range(len(s)))
    last_column = [row[-1:] for row in table]  # Last characters of each row
    return "".join(last_column)  # Convert list of characters into string

The inverse transform repeatedly inserts r as the left column of the table and sorts the table. After the whole table is built, it returns the row that ends with ETX, minus the STX and ETX.

def inverse_bwt(r: str, start=chr(STX), end=chr(ETX)) -> str:
    r"""
    Apply inverse Burrows–Wheeler transform.

    >>> inverse_bwt('\x03ANNB\x02AA')
    'BANANA'
    >>> inverse_bwt('ANNB^AA$', start='^', end='$')
    'BANANA'
    >>> inverse_bwt('A$NNB%AA', start='%', end='$')
    'BANANA'
    """
    str_len = len(r)
    table = [""] * str_len  # Make empty table
    for _ in range(str_len):
        table = sorted(rc + tc for rc, tc in zip(r, table))  # Add a column of r

    # Find the row that ends with the end marker (empty string if there is none)
    s = next((row for row in table if row.endswith(end)), "")

    # Retrieve data from array and get rid of start and end markers
    return s.rstrip(end).strip(start)

Following implementation notes from Manzini, it is equivalent to use a simple null character suffix instead. The sorting should be done in colexicographic order (string read right-to-left), i.e. sorted(..., key=lambda s: s[::-1]) in Python.[4] (The above control codes actually fail to satisfy EOF being the last character; the two codes are actually the first. The rotation holds nevertheless.)

BWT applications

As a lossless compression algorithm, the Burrows–Wheeler transform offers the important quality that its encoding is reversible, and hence the original data may be recovered from the resulting compression. This lossless quality has led to a range of algorithms built with different purposes in mind. To name a few, the Burrows–Wheeler transform is used in algorithms for sequence alignment, image compression, data compression, etc. The following is a compilation of some uses given to the Burrows–Wheeler transform.

BWT for sequence alignment

The advent of next-generation sequencing (NGS) techniques at the end of the 2000s decade has led to another application of the Burrows–Wheeler transformation. In NGS, DNA is fragmented into small pieces, of which the first few bases are sequenced, yielding several millions of "reads", each 30 to 500 base pairs ("DNA characters") long. In many experiments, e.g., in ChIP-Seq, the task is now to align these reads to a reference genome, i.e., to the known, nearly complete sequence of the organism in question (which may be up to several billion base pairs long). A number of alignment programs, specialized for this task, were published, which initially relied on hashing (e.g., Eland, SOAP,[10] or Maq[11]). In an effort to reduce the memory requirement for sequence alignment, several alignment programs were developed (Bowtie,[12] BWA,[13] and SOAP2[14]) that use the Burrows–Wheeler transform.
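BWT-based indexes are commonly queried with a backward search, which narrows a range of rows in the sorted rotation table one pattern character at a time, from the last character to the first. The following is a simplified sketch of exact-match counting and is not the implementation of any of the tools named above; the function names, the `$` end marker, and the uncompressed occurrence table are assumptions made for illustration.

```python
from collections import Counter


def bwt(text: str, eof: str = "$") -> str:
    """Naive BWT of text followed by an end marker (illustrative only)."""
    s = text + eof
    return "".join(row[-1] for row in sorted(s[i:] + s[:i] for i in range(len(s))))


def count_occurrences(last: str, pattern: str) -> int:
    """Count exact occurrences of pattern in the indexed text via backward search."""
    counts = Counter(last)
    first_pos, total = {}, 0
    for ch in sorted(counts):
        first_pos[ch] = total
        total += counts[ch]

    # occ[c][i]: number of occurrences of c in last[:i]
    occ = {c: [0] * (len(last) + 1) for c in counts}
    for i, ch in enumerate(last):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (ch == c)

    lo, hi = 0, len(last)  # half-open range of rows prefixed by the current suffix
    for ch in reversed(pattern):
        if ch not in first_pos:
            return 0
        lo = first_pos[ch] + occ[ch][lo]
        hi = first_pos[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("BANANA")
print(index)                            # ANNB$AA
print(count_occurrences(index, "ANA"))  # 2
print(count_occurrences(index, "NAB"))  # 0
```

The number of steps depends on the pattern length rather than on the length of the indexed text, which is what makes the approach attractive for short reads against a large reference.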

BWT for image compression

The Burrows–Wheeler transformation has proved to be fundamental for image compression applications. For example, Collin et al.[15] showed a compression pipeline based on the application of the Burrows–Wheeler transformation followed by inversion, run-length, and arithmetic encoders. The pipeline developed in this case is known as Burrows–Wheeler transform with an inversion encoder (BWIC). BWIC is reported to outperform well-known and widely used algorithms like Lossless JPEG and JPEG 2000, reducing the final compressed size of radiography medical images on the order of 5.1% and 4.1% respectively. The improvements are achieved by combining BWIC and a pre-BWIC scan of the image in a vertical snake order fashion. More recently, additional works such as that of Devadoss and Sankaragomathi[16] have shown that the Burrows–Wheeler transform, in conjunction with the known move-to-front transform (MTF), achieves near lossless compression of images.

BWT for compression of genomic databases

Cox et al.[17] presented a genomic compression scheme that uses BWT as the algorithm applied during the first stage of compression of several genomic datasets, including the human genomic information. Their work proposed that BWT compression could be enhanced by including a second stage compression mechanism called same-as-previous encoding ("SAP"), which makes use of the fact that suffixes of two or more prefix letters could be equal. With the compression mechanism BWT-SAP, Cox et al. showed that the genomic database ERA015743, 135.5 GB in size, is compressed by around 94%, to 8.2 GB.

BWT for sequence prediction

BWT has also proved useful for sequence prediction, which is a common area of study in machine learning and natural-language processing. In particular, Ktistakis et al.[18] proposed a sequence prediction scheme called SuBSeq that exploits the lossless compression of data of the Burrows–Wheeler transform. SuBSeq exploits BWT by extracting the FM-index and then performing a series of operations called backwardSearch, forwardSearch, neighbourExpansion, and getConsequents in order to search for predictions given a suffix. The predictions are then classified based on a weight and put into an array from which the element with the highest weight is given as the prediction from the SuBSeq algorithm. SuBSeq has been shown to outperform state of the art algorithms for sequence prediction both in terms of training time and accuracy.

References

  1. Burrows, Michael; Wheeler, David J. (May 10, 1994), A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation, archived from the original on January 5, 2003
  2. "adrien-mogenet/scala-bwt". GitHub. Retrieved 19 April 2018.
  3. Simpson, Jared T.; Durbin, Richard (2010-06-15). "Efficient construction of an assembly string graph using the FM-index". Bioinformatics. 26 (12): i367–i373. doi:10.1093/bioinformatics/btq217. ISSN 1367-4803. PMC 2881401. PMID 20529929.
  4. Manzini, Giovanni (1999-08-18). "The Burrows–Wheeler Transform: Theory and Practice" (PDF). Mathematical Foundations of Computer Science 1999: 24th International Symposium, MFCS'99 Szklarska Poreba, Poland, September 6-10, 1999 Proceedings. Springer Science & Business Media. ISBN 9783540664086. Archived (PDF) from the original on 2022-10-09.
  5. Gil, J.; Scott, D. A. (2009), A bijective string sorting transform (PDF), archived from the original (PDF) on 2011-10-08, retrieved 2009-07-09
  6. Kufleitner, Manfred (2009), "On bijective variants of the Burrows–Wheeler transform", in Holub, Jan; Žďárek, Jan (eds.), Prague Stringology Conference, pp. 65–69, arXiv:0908.0239, Bibcode:2009arXiv0908.0239K.
  7. Lothaire, M. (1997), Combinatorics on words, Encyclopedia of Mathematics and Its Applications, vol. 17, Perrin, D.; Reutenauer, C.; Berstel, J.; Pin, J. E.; Pirillo, G.; Foata, D.; Sakarovitch, J.; Simon, I.; Schützenberger, M. P.; Choffrut, C.; Cori, R.; Lyndon, Roger; Rota, Gian-Carlo. Foreword by Roger Lyndon (2nd ed.), Cambridge University Press, p. 67, ISBN 978-0-521-59924-5, Zbl 0874.20040
  8. Duval, Jean-Pierre (1983), "Factorizing words over an ordered alphabet", Journal of Algorithms, 4 (4): 363–381, doi:10.1016/0196-6774(83)90017-2, ISSN 0196-6774, Zbl 0532.68061.
  9. Salson M, Lecroq T, Léonard M, Mouchard L (2009). "A Four-Stage Algorithm for Updating a Burrows–Wheeler Transform". Theoretical Computer Science. 410 (43): 4350–4359. doi:10.1016/j.tcs.2009.07.016.
  10. Li R; et al. (2008). "SOAP: short oligonucleotide alignment program". Bioinformatics. 24 (5): 713–714. doi:10.1093/bioinformatics/btn025. PMID 18227114.
  11. Li H, Ruan J, Durbin R (2008-08-19). "Mapping short DNA sequencing reads and calling variants using mapping quality scores". Genome Research. 18 (11): 1851–1858. doi:10.1101/gr.078212.108. PMC 2577856. PMID 18714091.
  12. Langmead B, Trapnell C, Pop M, Salzberg SL (2009). "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome". Genome Biology. 10 (3): R25. doi:10.1186/gb-2009-10-3-r25. PMC 2690996. PMID 19261174.
  13. Li H, Durbin R (2009). "Fast and accurate short read alignment with Burrows–Wheeler Transform". Bioinformatics. 25 (14): 1754–1760. doi:10.1093/bioinformatics/btp324. PMC 2705234. PMID 19451168.
  14. Li R; et al. (2009). "SOAP2: an improved ultrafast tool for short read alignment". Bioinformatics. 25 (15): 1966–1967. doi:10.1093/bioinformatics/btp336. PMID 19497933.
  15. Collin P, Arnavut Z, Koc B (2015). "Lossless compression of medical images using Burrows–Wheeler Transformation with Inversion Coder". 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). Vol. 2015. pp. 2956–2959. doi:10.1109/EMBC.2015.7319012. ISBN 978-1-4244-9271-8. PMID 26736912. S2CID 4460328.
  16. Devadoss CP, Sankaragomathi B (2019). "Near lossless medical image compression using block BWT–MTF and hybrid fractal compression techniques". Cluster Computing. 22: 12929–12937. doi:10.1007/s10586-018-1801-3. S2CID 33687086.
  17. Cox AJ, Bauer MJ, Jakobi T, Rosone G (2012). "Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform". Bioinformatics. 28 (11). Oxford University Press: 1415–1419. arXiv:1205.0192. doi:10.1093/bioinformatics/bts173. PMID 22556365.
  18. Ktistakis R, Fournier-Viger P, Puglisi SJ, Raman R (2019). "Succinct BWT-Based Sequence Prediction". Database and Expert Systems Applications. Lecture Notes in Computer Science. Vol. 11707. pp. 91–101. doi:10.1007/978-3-030-27618-8_7. ISBN 978-3-030-27617-1. S2CID 201058996.