且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

如何获得成对的“序列相似度得分"约 1000 种蛋白质?

更新时间:2021-12-28 09:50:54

根据 Chase 的建议,bioconductor 确实是要走的路,尤其是 Biostrings 包.要安装后者,我建议安装核心 bioconductor 库:

As per Chase's suggestion, bioconductor is indeed the way to go and in particular the Biostrings package. To install the latter I would suggest installing the core bioconductor library as such:

source("http://bioconductor.org/biocLite.R")
biocLite()

通过这种方式,您将涵盖所有依赖项.现在,要对齐 2 个蛋白质序列或任何两个序列,您需要使用 Biostrings 中的 pairwiseAlignment.给定一个由 2 个序列组成的 fasta protseq.fasta 文件,如下所示:

This way you will cover all dependencies. Now, to align 2 protein sequences or any two sequences for that matter you will need to use pairwiseAlignment from Biostrings. Given a fasta protseq.fasta file of 2 sequences that looks like this:

>protein1
MYRALRLLARSRPLVRAPAAALASAPGLGGAAVPSFWPPNAAR
MASQNSFRIEYDTFGELKVPNDKYYGAQTVRSTMNFKIGGVTE
RMPTPVIKAFGILKRAAAEVNQDYGLDPKIANAIMKAADEVAE
GKLNDHFPLVVWQTGSGTQTNMNVNEVISNRAIEMLGGELGSK
IPVHPNDHVNKSQ
>protein2
MRSRPAGPALLLLLLFLGAAESVRRAQPPRRYTPDWPSLDSRP
LPAWFDEAKFGVFIHWGVFSVPAWGSEWFWWHWQGEGRPYQRF
MRDNYPPGFSYADFGPQFTARFFHPEEWADLFQAAGAKYVVLT
TKHHEGFTNW*

如果您想使用 BLOSUM100 作为替换矩阵来全局对齐这 2 个序列,则打开间隙的惩罚为 0,扩展一个的惩罚为 -5:

If you want to globally align these 2 sequences using lets say BLOSUM100 as your substitution matrix, 0 penalty for opening a gap and -5 for extending one then:

require("Biostrings")
data(BLOSUM100)
seqs <- readFASTA("~/Desktop/protseq.fasta", strip.descs=TRUE)
alm <- pairwiseAlignment(seqs[[1]]$seq, seqs[[2]]$seq, substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5)

这样做的结果是(去掉了一些对齐以节省空间):

The result of this is (removed some of the alignment to save space):

> alm
Global PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] MYRALRLLARSRPLVRA-PAAALAS....
subject: [1] M-R-------SRP---AGPALLLLL.... 
score: -91

仅提取每个对齐的分数:

To only extract the score for each alignment:

> score(alm)
[1] -91

鉴于此,您现在可以使用一些非常简单的循环逻辑轻松完成所有成对对齐.为了更好地使用 bioconductor 进行成对对齐,我建议您阅读 这个.

Given this you can easily now do all pairwise alignments with some very simple looping logic. To get a better hang of pairwise alignment using bioconductor I suggest you read this.

另一种方法是进行多序列比对而不是成对.您可以使用 bio3d 并从那里使用 seqaln 功能对齐 fasta 文件中的所有序列.

An alternative approach would be to do a multiple sequence alignment instead of pairwise. You could use bio3d and from there the seqaln function to align all sequences in your fasta file.