Sequence Alignment

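Determining how similar or different two sequences are is a common approach for inferring structural, functional, or evolutionary relationships between them. VectorBuilder's Sequence Alignment tool lets you compare two sequences directly at the DNA or protein level, and it can also compare two DNA sequences based on their translation.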

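An alignment provides both a global view, the percent identity/similarity across the entire sequence, and a focused view that compares individual nucleotides or amino acids. The tool uses gaps and gap penalties to maximize the likelihood that two nucleotides or amino acids match while preserving data integrity. Gaps account for insertions or deletions in the aligned sequences, and gap penalties assign a negative score to the alignment based on the frequency and length of the gaps.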

Sequence alignment basics

When studying differences between genes, proteins, or organisms, sequence alignments can help to predict structural relationships, functions, and evolutionary changes. Two or more DNA or protein sequences can be compared for similarity at the local and global level. Each sequence is compared nucleotide by nucleotide, and matches are highlighted and designated with bar symbols. In the sequence below, there is 67% similarity (6/9 nucleotides), with 3 mismatches in total (a Hamming distance of 3).
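
The 6/9 calculation can be reproduced with a few lines of Python. This is only a minimal sketch: the two sequences are made-up examples rather than the ones in the original figure, and the function names are hypothetical.

```python
def hamming_distance(seq1: str, seq2: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(seq1, seq2))


def percent_similarity(seq1: str, seq2: str) -> float:
    """Percentage of positions that match in an ungapped comparison."""
    if not seq1:
        raise ValueError("sequences must not be empty")
    matches = len(seq1) - hamming_distance(seq1, seq2)
    return 100.0 * matches / len(seq1)


# Hypothetical example sequences (not taken from the original figure).
top = "ATGCCGTAA"
bottom = "ATGACGTCG"

print(hamming_distance(top, bottom))           # 3
print(round(percent_similarity(top, bottom)))  # 67
```

Because the Hamming distance is only defined for sequences of equal length, this approach stops working as soon as an insertion or deletion is involved.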

However, aligning sequences is often complicated by the presence of substitutions as well as indels (insertions and deletions). Alignment algorithms can account for these events with gaps, where a space (-) can be placed to optimize alignment. In the alignment below, there is slightly lower similarity (60%) due to the insertion in the second sequence. Percent similarity counts only matches; it does not distinguish whether a non-matching position is a mismatch, a gap, or part of an extended gap.
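
As a rough illustration of how a gapped alignment is tallied, the sketch below counts matches, mismatches, and gap columns in two pre-aligned rows. The rows are invented for this example (they are not the ones in the original figure), and `alignment_stats` is a hypothetical helper name.

```python
def alignment_stats(row1: str, row2: str, gap: str = "-") -> dict:
    """Tally the columns of a pairwise alignment given as two gapped rows."""
    if len(row1) != len(row2) or not row1:
        raise ValueError("aligned rows must be non-empty and of equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            gaps += 1          # insertion or deletion column
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return {
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        # Only matches count toward percent similarity.
        "percent_similarity": 100.0 * matches / len(row1),
    }


# Hypothetical aligned rows: the second sequence carries one inserted base.
stats = alignment_stats("ATGC-CGTAA", "ATGAACGTCG")
print(stats)
# {'matches': 6, 'mismatches': 3, 'gaps': 1, 'percent_similarity': 60.0}
```

Note that the denominator here is the number of alignment columns (10), which is why one inserted base lowers the percentage from 6/9 to 6/10.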

With larger and more complex sequence comparisons, it quickly becomes untenable to perform alignments by hand. The algorithm used in VectorBuilder’s Sequence Alignment tool determines the best alignment by optimizing the alignment score, which takes into account matches, mismatches, gaps, and extended gaps with individual scores for each event at each nucleotide.
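
The exact algorithm and score values used by the tool are not stated here, so the following is only a sketch of the general idea. It scores one given alignment using assumed values for a match, a mismatch, opening a gap, and extending a gap. All four numbers and the function name `score_alignment` are illustrative assumptions.

```python
def score_alignment(row1: str, row2: str, match: int = 2, mismatch: int = -1,
                    gap_open: int = -3, gap_extend: int = -1,
                    gap: str = "-") -> int:
    """Score two pre-aligned rows with an affine gap penalty.

    The first position of a gap costs `gap_open`; each further consecutive
    gap position in the same row costs `gap_extend`. All default values are
    assumptions chosen for illustration only.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    prev_gap_row = None  # which row (1 or 2) had a gap in the previous column
    for a, b in zip(row1, row2):
        if a == gap and b == gap:
            raise ValueError("a column cannot contain two gaps")
        if a == gap or b == gap:
            gap_row = 1 if a == gap else 2
            score += gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score


# 6 matches, 3 mismatches, 1 newly opened gap: 12 - 3 - 3
print(score_alignment("ATGC-CGTAA", "ATGAACGTCG"))  # 6
# 4 matches, 1 gap opened and extended once: 8 - 3 - 1
print(score_alignment("AC--GT", "ACTTGT"))          # 4
```

An alignment program would search over the possible gap placements (typically with dynamic programming) and report the arrangement with the highest score; this sketch only evaluates a single candidate.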

Once you have the alignment for your sequences, you can examine the alignment score, the length of the alignment (how many nucleotides match in total), and the locations of high similarity. Aligning DNA from two different species can help identify regions that are more homologous and/or under higher selective pressure. When aligning a protein sequence with that of a well-characterized protein, you can predict secondary structures as well as function.

Aligning sequences with translation

Bridging the gap between the DNA and protein sequence can be extremely valuable in cloning efforts, particularly when cloning a gene in another species (heterologous expression). Changing the DNA sequence may or may not change the resultant protein sequence, because of redundancy in the genetic code. Most amino acids are coded by more than one codon sequence (Figure 1), so a mutation that changes GGA to GGC will still produce glycine.
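
The redundancy can be checked directly against the standard genetic code. The sketch below builds the 64-codon table from the compact layout used for the standard code (bases in T, C, A, G order) and confirms that GGA and GGC both translate to glycine. The helper name `translate` and the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence codon by codon ('*' marks a stop)."""
    dna = dna.upper()
    # Any incomplete trailing codon is ignored; unknown codons become 'X'.
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


print(CODON_TABLE["GGA"], CODON_TABLE["GGC"])  # G G
# All codons that encode glycine (G):
print(sorted(c for c, aa in CODON_TABLE.items() if aa == "G"))
# ['GGA', 'GGC', 'GGG', 'GGT']

# A synonymous change (GGA -> GGC) leaves the protein unchanged.
print(translate("ATGGGATAA"), translate("ATGGGCTAA"))  # MG* MG*
```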

Figure 1. Each three-letter nucleotide sequence (codon) corresponds to an amino acid or to a start/stop signal.

To determine how DNA alignment translates to protein alignment, VectorBuilder offers an option to align based on translated DNA. Below, the Sox2 coding sequences in mouse and human are aligned. These sequences exhibit about 93% similarity (Figure 2A).

However, when the same sequences are used to view similarity of the translated protein (by selecting “DNA alignment based on translated protein sequence”), the resultant alignment shows 97% similarity, highlighting mutations to base pairs that have not influenced protein sequence or function (Figure 2B). Determining the similarity/difference between DNA or protein sequences as well as translated DNA sequences provides a powerful tool for examining relationships between proteins or organisms.

Figure 2. Alignment between coding sequences for Sox2 in human and mouse (A), and alignment between translated amino acid sequences (B).

Sequences in both GenBank and FASTA formats can be recognized.
For alignment based on translated sequences, you may optimize alignments by adjusting the frame for either sequence.
The maximum sequence length is 10,000 bases or 10,000 amino acids.
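
To see why the frame matters for translation-based alignment, the sketch below translates one DNA sequence in each of its three forward reading frames. It is an illustration only and does not reflect how the tool is implemented. The example sequence and the `frame` parameter are assumptions made for this sketch.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate DNA starting at offset `frame` (0, 1, or 2)."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1, or 2")
    dna = dna.upper()
    # Incomplete trailing codons are skipped; unknown codons become 'X'.
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )


# Hypothetical sequence with one extra base in front of the start codon.
seq = "CATGGGATAA"
for frame in range(3):
    print(frame, translate(seq, frame))
# 0 HGI
# 1 MG*
# 2 WD
```

Only frame 1 recovers the intended start codon and stop codon in this example, so two sequences translated in mismatched frames would yield unrelated proteins and a poor alignment.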