---
title: "SequenceAlignment"
language: "en"
type: "Symbol"
summary: "SequenceAlignment[s1, s2] finds an optimal alignment of sequences of elements in the strings, lists or biomolecular sequences s1 and s2, and yields a list of successive matching and differing sequences."
keywords: 
- bioinformatics
- string alignment
- list alignment
- DNA
- RNA
- protein sequence
- Needleman
- Wunsch
- Smith
- Waterman
- Hirschberg
- global
- local
- BLOSUM
- PAM
- dynamic programming
- unix diff
- diff
- patch files
- file diffing
- file differences
- code differences
canonical_url: "https://reference.wolfram.com/language/ref/SequenceAlignment.html"
source: "Wolfram Language Documentation"
related_guides: 
  - 
    title: "Sequence Alignment & Comparison"
    link: "https://reference.wolfram.com/language/guide/SequenceAlignmentAndComparison.en.md"
  - 
    title: "String Manipulation"
    link: "https://reference.wolfram.com/language/guide/StringManipulation.en.md"
  - 
    title: "Biomolecular Sequences"
    link: "https://reference.wolfram.com/language/guide/BiomolecularSequences.en.md"
  - 
    title: "Text Manipulation"
    link: "https://reference.wolfram.com/language/guide/ProcessingTextualData.en.md"
  - 
    title: "Scientific Data Analysis"
    link: "https://reference.wolfram.com/language/guide/ScientificDataAnalysis.en.md"
  - 
    title: "Life Sciences & Medicine: Data & Computation"
    link: "https://reference.wolfram.com/language/guide/LifeSciencesAndMedicineDataAndComputation.en.md"
  - 
    title: "Distance and Similarity Measures"
    link: "https://reference.wolfram.com/language/guide/DistanceAndSimilarityMeasures.en.md"
  - 
    title: "Text Analysis"
    link: "https://reference.wolfram.com/language/guide/TextAnalysis.en.md"
  - 
    title: "Math & Counting Operations on Lists"
    link: "https://reference.wolfram.com/language/guide/MathematicalAndCountingOperationsOnLists.en.md"
  - 
    title: "Natural Language Processing"
    link: "https://reference.wolfram.com/language/guide/NaturalLanguageProcessing.en.md"
---
# SequenceAlignment

SequenceAlignment[s1, s2] finds an optimal alignment of sequences of elements in the strings, lists or biomolecular sequences s1 and s2, and yields a list of successive matching and differing sequences.

## Details and Options

* ``SequenceAlignment[s1, s2]`` gives a list of the form ``{seg1, seg2, …}`` where each ``segi`` is either a single string or sequence of list elements ``u``, representing a matching segment, or a pair ``{u1, u2}``, representing segments that differ between the ``si``.

* The following options can be given:

| Option            | Default   | Description                                  |
| ----------------- | --------- | -------------------------------------------- |
| GapPenalty        | 0         | additional cost for each alignment gap       |
| IgnoreCase        | False     | whether to ignore case of letters in strings |
| MergeDifferences  | True      | whether to combine adjacent differences      |
| Method            | "Global"  | alignment algorithm to be used               |
| SimilarityRules   | Automatic | rules for similarities between elements      |

* ``SequenceAlignment`` attempts to find an alignment that maximizes the total similarity score.

* ``SequenceAlignment`` by default finds a global Needleman–Wunsch alignment of the complete strings or lists ``s1`` and ``s2``.

* With the option setting ``Method -> "Local"``, it finds a local Smith–Waterman alignment.

* For sufficiently similar strings or lists, local and global alignment methods give the same result.

* ``SequenceAlignment`` also supports methods ``"AlignByLongestCommonSequence"`` and ``"AlignByLongestSubsequences"``, provided ``GapPenalty``, ``MergeDifferences`` and ``SimilarityRules`` are all set to their respective defaults.

* Whereas the ``"Global"`` and ``"Local"`` methods both maximize a similarity score, ``"AlignByLongestCommonSequence"`` maximizes the number of characters or list elements common to both sequences.

* ``"AlignByLongestSubsequences"`` is effectively a divide-and-conquer heuristic approximation to aligning by the longest common (not necessarily contiguous) sequence, trading accuracy for speed. When the sequences are fairly close, the alignment quality remains good, and the method can run up to two orders of magnitude faster than the other methods.

* With the default setting ``SimilarityRules -> Automatic``, each match between two elements contributes 1 to the total similarity score, while each mismatch, insertion, or deletion contributes -1.

* Various named similarity matrices are supported, as specified in the notes for ``SimilarityRules``.
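
To make the default scoring concrete, the following minimal sketch recomputes a total score from an alignment of two strings. ``alignmentScore`` is a hypothetical helper, not a built-in function, and it assumes the default settings ``SimilarityRules -> Automatic`` and ``GapPenalty -> 0``: every matched character counts 1, and a differing pair ``{u1, u2}`` is charged the length of its longer side, which is the cheapest mix of mismatches, insertions and deletions at -1 each.

```wl
In[1]:= alignmentScore[align_List] := Total[Replace[align, {
     s_String :> StringLength[s], (* matching segment: +1 per character *)
     {u_String, v_String} :> -Max[StringLength[u], StringLength[v]] (* differing pair: -1 per aligned column *)
    }, {1}]]

In[2]:= alignmentScore[SequenceAlignment["abcXabcXabc", "abcYabcYabc"]]

Out[2]= 7
```

Here the nine matched characters contribute 9 and the two single-character replacements contribute -2.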

---

## Examples (16)

### Basic Examples (2)

Globally align two similar strings:

```wl
In[1]:= SequenceAlignment["abcXabcXabc", "abcYabcYabc"]

Out[1]= {"abc", {"X", "Y"}, "abc", {"X", "Y"}, "abc"}
```
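
As a minimal sketch of how to read this output form, both inputs can be rebuilt from the alignment by keeping the matching segments and picking one side of every differing pair. ``reconstruct`` is a hypothetical helper introduced only for illustration, not a built-in function, and it assumes string input:

```wl
In[2]:= (* i = 1 rebuilds the first input, i = 2 the second *)
reconstruct[align_List, i_Integer] := StringJoin[Replace[align, pair_List :> pair[[i]], {1}]]

In[3]:= align = SequenceAlignment["abcXabcXabc", "abcYabcYabc"];

In[4]:= reconstruct[align, 1]

Out[4]= "abcXabcXabc"

In[5]:= reconstruct[align, 2]

Out[5]= "abcYabcYabc"
```

``Replace`` is restricted to level 1, so only the differing pairs are rewritten and the matching strings pass through unchanged.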

---

Global alignment of two instances of ``BioSequence``:

```wl
In[1]:= SequenceAlignment[BioSequence["DNA", "CGGAGT"], BioSequence["DNA", "CGTAGT"]]

Out[1]= {"CG", {"G", "T"}, "AGT"}
```

### Options (8)

#### GapPenalty (1)

By default, an alignment is found with two gaps:

```wl
In[1]:= SequenceAlignment["ac", "abcd"]

Out[1]= {"a", {"", "b"}, "c", {"", "d"}}
```

Increasing the penalty for gaps forces another alignment with fewer gaps:

```wl
In[2]:= SequenceAlignment["ac", "abcd", GapPenalty -> 2]

Out[2]= {"a", {"c", "bcd"}}
```

#### IgnoreCase (1)

By default, ``SequenceAlignment`` treats string input as case sensitive:

```wl
In[1]:= SequenceAlignment["abcdefgHIJKlmn", "abCDEfgHIjklmn"]

Out[1]= {"ab", {"cde", "CDE"}, "fgHI", {"JK", "jk"}, "lmn"}
```

With ``IgnoreCase -> True``, ``SequenceAlignment`` will convert both strings to lowercase before aligning:

```wl
In[2]:= SequenceAlignment["abcdefgHIJKlmn", "abCDEfgHIjklmn", IgnoreCase -> True]

Out[2]= {"abcdefghijklmn"}
```

#### MergeDifferences (1)

With ``MergeDifferences -> False``, insertions, deletions, and replacements are given as separate differences:

```wl
In[1]:= SequenceAlignment["abcXXabcXabc", "abcabcYYYabc", MergeDifferences -> False]

Out[1]= {"abc", {"XX", ""}, "abc", {"", "YY"}, {"X", "Y"}, "abc"}
```

#### Method (3)

Default global alignment of two strings:

```wl
In[1]:= SequenceAlignment["abcXXabcXabc", "abcabcYYYabc"]

Out[1]= {"abc", {"XX", ""}, "abc", {"X", "YYY"}, "abc"}

In[2]:= SequenceAlignment["abcXXabcXabc", "abcabcYYYabc", Method -> "Global"]

Out[2]= {"abc", {"XX", ""}, "abc", {"X", "YYY"}, "abc"}
```

Local alignment of the same strings:

```wl
In[3]:= SequenceAlignment["abcXXabcXabc", "abcabcYYYabc", Method -> "Local"]

Out[3]= {{"abcXX", ""}, "abc", {"X", ""}, "abc", {"", "YYYabc"}}
```
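
In the local result, the first and last entries are differing pairs that appear to hold the parts of each string lying outside the locally aligned region. Assuming that reading of the output, a minimal sketch for keeping only the aligned core is to strip differing pairs from both ends. ``trimFlanks`` is a hypothetical helper, not a built-in function:

```wl
In[4]:= (* repeatedly drop a leading or trailing differing pair until none remains *)
trimFlanks[align_List] := FixedPoint[Replace[#, {{_List, rest___} :> {rest}, {rest___, _List} :> {rest}}] &, align]

In[5]:= trimFlanks[SequenceAlignment["abcXXabcXabc", "abcabcYYYabc", Method -> "Local"]]

Out[5]= {"abc", {"X", ""}, "abc"}
```

``Replace`` without a level specification acts only on the whole list, so differing pairs in the interior of the alignment are left in place.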

---

Take the sequence strings of two biosequences:

```wl
In[1]:=
str1 = BioSequence[Entity["Gene", {"HBA1", {"Species" -> "HomoSapiens"}}]]["SequenceString"];
str2 = BioSequence[Entity["Gene", {"HBA2", {"Species" -> "HomoSapiens"}}]]["SequenceString"];
```

The ``"AlignByLongestCommonSequence"`` method maximizes the number of characters or list elements common to both sequences:

```wl
In[2]:= matchCount[align_] := StringLength[StringJoin@@Cases[align, _String]]

In[3]:= matchCount@SequenceAlignment[str1, str2]

Out[3]= 818

In[4]:= matchCount@SequenceAlignment[str1, str2, Method -> "AlignByLongestCommonSequence"]

Out[4]= 819
```

---

Take two texts, remove their diacritics and convert to lowercase:

```wl
In[1]:=
textA = ExampleData[{"Text", "UNHumanRightsIrish"}]//RemoveDiacritics//ToLowerCase;
textB = ExampleData[{"Text", "UNHumanRightsScottishGaelic"}]//RemoveDiacritics//ToLowerCase;
```

The ``"AlignByLongestSubsequences"`` method can be significantly faster for similar sequences, but it can give a notably smaller set of matching characters:

```wl
In[2]:= matchCount[align_] := StringLength[StringJoin@@Cases[align, _String]]

In[3]:= matchCount@SequenceAlignment[textA, textB]//AbsoluteTiming

Out[3]= {2.50196, 5838}

In[4]:= matchCount@SequenceAlignment[textA, textB, Method -> "AlignByLongestSubsequences"]//AbsoluteTiming

Out[4]= {0.074987, 3165}
```

#### SimilarityRules (2)

Align two short protein sequences:

```wl
In[1]:= SequenceAlignment["FTFTALILLAVAV", "FTALLLAAV"]

Out[1]= {{"FT", ""}, "FTAL", {"I", ""}, "LLA", {"V", ""}, "AV"}
```

Assigning a negative score to the deletion of ``"V"`` gives a different alignment:

```wl
In[2]:= SequenceAlignment["FTFTALILLAVAV", "FTALLLAAV", SimilarityRules -> {{"V", ""} -> -10}]

Out[2]= {{"FT", ""}, "FTAL", {"I", ""}, "LL", {"AV", "A"}, "AV"}
```

---

Align with type-specific similarity rules that align degenerate letters:

```wl
In[1]:= SequenceAlignment[BioSequence["DNA", "AAATTCCAAANNTNCCAAAA"], BioSequence["DNA", "GGTTCC"], SimilarityRules -> "SimilarDegenerateBases"]

Out[1]= {{"AAATTCCAAANN", "GG"}, "T", {"N", "T"}, "CC", {"AAAA", ""}}
```

Without the degenerate similarity rules, a perfect degenerate alignment is missed:

```wl
In[2]:= SequenceAlignment[BioSequence["DNA", "AAATTCCAAANNTNCCAAAA"], BioSequence["DNA", "GGTTCC"]]

Out[2]= {{"AAA", "GG"}, "TTCC", {"AAANNTNCCAAAA", ""}}
```

### Applications (4)

This gives the global alignment of two similar strings:

```wl
In[1]:= SequenceAlignment["That's one small step for man", "That's one small step for a man"]

Out[1]= {"That's one small step for", {"", " a"}, " man"}
```
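
For diff-style use, the differing pairs can be pulled out of the alignment or rendered inline. The following is a minimal sketch; the ``[-…-]{+…+}`` markers are an arbitrary notation chosen for illustration, not something ``SequenceAlignment`` produces:

```wl
In[2]:= align = SequenceAlignment["That's one small step for man", "That's one small step for a man"];

In[3]:= Cases[align, {_, _}] (* keep only the differing pairs *)

Out[3]= {{"", " a"}}

In[4]:= (* mark text removed from the first string and text added in the second *)
StringJoin[Replace[align, {u_, v_} :> "[-" <> u <> "-]{+" <> v <> "+}", {1}]]

Out[4]= "That's one small step for[--]{+ a+} man"
```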

---

This shows the difference between global and local string alignment:

```wl
In[1]:= SequenceAlignment["One fish two fish", "One fish two fish red fish blue fish"]

Out[1]= {"One fish two", {"", " fish red fish blue"}, " fish"}

In[2]:= SequenceAlignment["One fish two fish", "One fish two fish red fish blue fish", Method -> "Local"]

Out[2]= {"One fish two fish", {"", " red fish blue fish"}}
```

---

Obtain reference BRCA1 gene sequences for a human and a chimpanzee:

```wl
In[1]:=
human = Entity["Gene", {"BRCA1", {"Species" -> "HomoSapiens"}}]["ReferenceSequence"];
chimp = Entity["Gene", {"BRCA1", {"Species" -> "PanTroglodytes"}}]["ReferenceSequence"];
```

Check that their lengths are similar:

```wl
In[2]:= StringLength /@ {human, chimp}

Out[2]= {81189, 82169}
```

Align them using the default (``"Global"``) method, using ``ByteCount`` to check the size of the result:

```wl
In[3]:= ByteCount[align1 = SequenceAlignment[human, chimp]]//AbsoluteTiming

Out[3]= {104.224, 342624}
```

The ``"Local"`` method is slower, though it gives a more concise result:

```wl
In[4]:= ByteCount[align2 = SequenceAlignment[human, chimp, Method -> "Local"]]//AbsoluteTiming

Out[4]= {188.735, 286040}
```

Align using the longest sequence common to the pair:

```wl
In[5]:= ByteCount[align3 = SequenceAlignment[human, chimp, Method -> "AlignByLongestCommonSequence"]]//AbsoluteTiming

Out[5]= {104.71, 432504}
```

Method ``"AlignByLongestSubsequences"`` is the fastest in this case and gives the smallest result:

```wl
In[6]:= ByteCount[align4 = SequenceAlignment[human, chimp, Method -> "AlignByLongestSubsequences"]]//AbsoluteTiming

Out[6]= {0.175843, 250256}
```

Matching segments are close in total length, with the alignment using the longest common sequence having the largest matching part:

```wl
In[7]:= matchCount[align_] := StringLength[StringJoin@@Cases[align, _String]]

In[8]:= Map[matchCount, {align1, align2, align3, align4}]

Out[8]= {77315, 77308, 77316, 77161}
```

---

Obtain two Scandinavian language versions of the UN Universal Declaration of Human Rights:

```wl
In[1]:=
UNHRD = ExampleData[{"Text", "UNHumanRightsDanish"}]//RemoveDiacritics//ToLowerCase;
UNHRS = ExampleData[{"Text", "UNHumanRightsSwedish"}]//RemoveDiacritics//ToLowerCase;
Map[StringLength, {UNHRD, UNHRS}]

Out[1]= {11022, 11011}
```

Align using both the default method and the ``"AlignByLongestSubsequences"`` method, and compare timings and byte counts:

```wl
In[2]:= ByteCount[align1 = SequenceAlignment[UNHRD, UNHRS]]//AbsoluteTiming

Out[2]= {2.13724, 398688}

In[3]:= ByteCount[align2 = SequenceAlignment[UNHRD, UNHRS, Method -> "AlignByLongestSubsequences"]]//AbsoluteTiming

Out[3]= {0.092913, 294312}
```

With the global method, around 60% of the characters fall in matching segments:

```wl
In[4]:= matchCount[align_] := StringLength[StringJoin@@Cases[align, _String]]

In[5]:= matchCount[align1]

Out[5]= 6625
```

The faster heuristic method still places nearly 57% of the characters in matching segments:

```wl
In[6]:= matchCount[align2]

Out[6]= 6237
```
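
The two percentages can be checked directly from the match counts. This sketch reuses ``matchCount``, ``align1``, ``align2`` and ``UNHRD`` from the inputs above, and takes the length of the Danish text as the denominator, which is an assumption; dividing by the Swedish length changes the values only slightly:

```wl
In[7]:= (* fraction of characters that lie in matching segments, per method *)
N[{matchCount[align1], matchCount[align2]}/StringLength[UNHRD]]

Out[7]= {0.601071, 0.565868}
```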

### Possible Issues (1)

When aligning nested lists, a list at level one can be a common element of the input lists:

```wl
In[1]:= a = {{1}, {}}; SequenceAlignment[a, a]

Out[1]= {{{1}, {}}}
```

Or a list at level one may denote a difference between the two input lists:

```wl
In[2]:= b = {1}; c = {}; SequenceAlignment[b, c]

Out[2]= {{{1}, {}}}
```

As the two outputs are identical, the output cannot be used to disambiguate the two cases:

```wl
In[3]:= % === %%

Out[3]= True
```
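
One possible workaround, offered as a sketch rather than a documented technique, is to wrap every element in an inert head before aligning. ``tag``, ``taggedAlignment`` and ``matchingSegmentQ`` are hypothetical names introduced here, and the approach assumes that wrapped elements are aligned like any other list elements. A matching segment is then a list whose elements all have head ``tag``, while a differing pair is a list of two lists:

```wl
In[4]:= (* tag has no definition, so tag[x] stays unevaluated *)
taggedAlignment[x_List, y_List] := SequenceAlignment[tag /@ x, tag /@ y]

In[5]:= (* True only for a nonempty list made entirely of tagged elements *)
matchingSegmentQ[seg_] := MatchQ[seg, {__tag}]

In[6]:= matchingSegmentQ /@ taggedAlignment[{{1}, {}}, {{1}, {}}]

In[7]:= matchingSegmentQ /@ taggedAlignment[{1}, {}]
```

Under that assumption, the first call is expected to flag its single segment as matching and the second call to flag its single segment as a difference. The wrappers can be removed afterward with a rule such as ``tag[x_] :> x``.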

### Neat Examples (1)

Compare two very similar genes:

```wl
In[1]:= SequenceAlignment[BioSequence[Entity["Gene", {"HBA1", {"Species" -> "HomoSapiens"}}]], BioSequence[Entity["Gene", {"HBA2", {"Species" -> "HomoSapiens"}}]]]

Out[1]= {{"", "CATAAACCCTGGCGCGCTCGCGGGCCGGC"}, "ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGTGAGGCTCCCTCCCCTGCTCCGACCCGGGCTCCTCGCCCGCCCGGACCCACAGGCCACCCTCAACCGTCCTGGC ... , ""}, "TGGGCCTCCC", {"CC", "AA"}, "C", {"A", "G"}, "G", {"C", "G"}, "CCCTCCTCCCC", {"T", ""}, "TCC", {"", "T"}, "TGCA", {"C", ""}, "CCG", {"TA", "G"}, "CCC", {"", "TT"}, "CC", {"G", ""}, "TGGTCTTTGAATAAAGTCTGAGTGGGC", {"G", "A"}, "GC", {"", "A"}}
```

Use ``Diff`` to see the difference graphically:

```wl
In[2]:= Diff[BioSequence[Entity["Gene", {"HBA1", {"Species" -> "HomoSapiens"}}]], BioSequence[Entity["Gene", {"HBA2", {"Species" -> "HomoSapiens"}}]]]

Out[2]= DynamicModule[«3»]
```

## See Also

* [`Diff`](https://reference.wolfram.com/language/ref/Diff.en.md)
* [`LongestCommonSequence`](https://reference.wolfram.com/language/ref/LongestCommonSequence.en.md)
* [`LongestCommonSubsequence`](https://reference.wolfram.com/language/ref/LongestCommonSubsequence.en.md)
* [`SmithWatermanSimilarity`](https://reference.wolfram.com/language/ref/SmithWatermanSimilarity.en.md)
* [`NeedlemanWunschSimilarity`](https://reference.wolfram.com/language/ref/NeedlemanWunschSimilarity.en.md)
* [`LongestCommonSequencePositions`](https://reference.wolfram.com/language/ref/LongestCommonSequencePositions.en.md)
* [`LongestCommonSubsequencePositions`](https://reference.wolfram.com/language/ref/LongestCommonSubsequencePositions.en.md)
* [`SparseArray`](https://reference.wolfram.com/language/ref/SparseArray.en.md)
* [`SequenceCases`](https://reference.wolfram.com/language/ref/SequenceCases.en.md)
* [`SequencePosition`](https://reference.wolfram.com/language/ref/SequencePosition.en.md)
* [`SequenceSplit`](https://reference.wolfram.com/language/ref/SequenceSplit.en.md)
* [`StringCases`](https://reference.wolfram.com/language/ref/StringCases.en.md)
* [`StringPosition`](https://reference.wolfram.com/language/ref/StringPosition.en.md)
* [`BitXor`](https://reference.wolfram.com/language/ref/BitXor.en.md)
* [`WarpingCorrespondence`](https://reference.wolfram.com/language/ref/WarpingCorrespondence.en.md)
* [`BioSequence`](https://reference.wolfram.com/language/ref/BioSequence.en.md)
* [`Gene`](https://reference.wolfram.com/language/ref/entity/Gene.en.md)
* [`Protein`](https://reference.wolfram.com/language/ref/entity/Protein.en.md)
* [`FASTA`](https://reference.wolfram.com/language/ref/format/FASTA.en.md)
* [`GenBank`](https://reference.wolfram.com/language/ref/format/GenBank.en.md)
* [`PDB`](https://reference.wolfram.com/language/ref/format/PDB.en.md)

## Related Guides

* [Sequence Alignment & Comparison](https://reference.wolfram.com/language/guide/SequenceAlignmentAndComparison.en.md)
* [String Manipulation](https://reference.wolfram.com/language/guide/StringManipulation.en.md)
* [Biomolecular Sequences](https://reference.wolfram.com/language/guide/BiomolecularSequences.en.md)
* [Text Manipulation](https://reference.wolfram.com/language/guide/ProcessingTextualData.en.md)
* [Scientific Data Analysis](https://reference.wolfram.com/language/guide/ScientificDataAnalysis.en.md)
* [Life Sciences & Medicine: Data & Computation](https://reference.wolfram.com/language/guide/LifeSciencesAndMedicineDataAndComputation.en.md)
* [Distance and Similarity Measures](https://reference.wolfram.com/language/guide/DistanceAndSimilarityMeasures.en.md)
* [Text Analysis](https://reference.wolfram.com/language/guide/TextAnalysis.en.md)
* [Math & Counting Operations on Lists](https://reference.wolfram.com/language/guide/MathematicalAndCountingOperationsOnLists.en.md)
* [Natural Language Processing](https://reference.wolfram.com/language/guide/NaturalLanguageProcessing.en.md)

## Related Links

* [An Elementary Introduction to the Wolfram Language: String Patterns and Templates](https://www.wolfram.com/language/elementary-introduction/42-string-patterns-and-templates.html)

## History

* [Introduced in 2008 (7.0)](https://reference.wolfram.com/language/guide/SummaryOfNewFeaturesIn70.en.md) \| [Updated in 2020 (12.2)](https://reference.wolfram.com/language/guide/SummaryOfNewFeaturesIn122.en.md) ▪ [2024 (14.1)](https://reference.wolfram.com/language/guide/SummaryOfNewFeaturesIn141.en.md)