更新时间:2022-10-24 18:47:38
与 @Jimbou 相同,但没有tidyverse :
库(biomaRt)# 数据d<-read.table(text ="1 2665697 4665777 MIR2011 10391435 12391516 MIR5001 15106831 17106911 MIR1221 23436535 25436616 MIR2341 23436575 25436656 MIR488)#指定数据库ensembl = useMart("ensembl",数据集="hsapiens_gene_ensembl")#遍历行,获取基因,然后粘贴以折叠,#最后与数据d绑定在一起.res<-cbind(d,基因= apply(d,1,function(i){x<-getBM(attributes = c("external_gene_name"),过滤器= c("chromosome_name","start","end"),值= list(i [1],i [2],i [3]),mart =合奏)#仅保留3个基因,因为输出太长.#在您的情况下,删除以下行x<-头(x,3)#返回基因,逗号分隔粘贴(x $ external_gene_name,折叠=,")}))资源#V1 V2 V3 V4基因#1 1 2665697 4665777 MIR201 TTC34,AC242022.1,AL592464.2#2 1 10391435 12391516 MIR500 AL139424.2,PGD,AL139424.1#3 1 15106831 17106911 MIR122 KAZN,TMEM51-AS1,TMEM51#4 1 23436535 25436616 MIR234 ASAP3,E2F2,AL021154.1#5 1 23436575 25436656 MIR488 ASAP3,E2F2,AL021154.1
I want to retrieve the genes that are present within a series of regions. Say, I have a bed file with query positions such like:
1 2665697 4665777 MIR201
1 10391435 12391516 MIR500
1 15106831 17106911 MIR122
1 23436535 25436616 MIR234
1 23436575 25436656 MIR488
I would like to get the genes that fall within those regions.
I have tried using biomaRt, and bedtools intersect, but the output I get, is a list of genes corresponding to all the regions, not one by one, as the desired output I would like to get would be the genes within each row, but in separate rows, a if I did one query region at a time. Basically I want to know what genes fall within each region, but still being able to identify which genes fall in which regions.
What I am doing is, from a region of detected miRNA, I am expanding the genome region upwards and downwards, so that I get the neighboring genes from this miRNA. I am using a 1 million bases windows up and down. This would work for just one query, but, how to do many queries with biomaRt or many intersections with bedtools, so that I get somewhat like:
1 2665697 4665777 MIR201 GENEX, GENEY, GENEZ...
1 10391435 12391516 MIR500 GENEA, GENEB, GENEC...
1 15106831 17106911 MIR122
1 23436535 25436616 MIR234
1 23436575 25436656 MIR488
Meaning that GENEX, GENEY and GENEZ fall within 1:2665697-4665777, with MIR201, placed in the middle, as this region is calculated subtracting 1 million bp to sart, and adding 1 million bp to end position.
I am somewhat determining the neighboring genes from each miRNA, to compare within species, but I do not get how to query multiple regions individually using biomaRt or bedtools.
Any help?
Same approach as @Jimbou without tidyverse:
library(biomaRt)
# data
d <- read.table(text = "1 2665697 4665777 MIR201
1 10391435 12391516 MIR500
1 15106831 17106911 MIR122
1 23436535 25436616 MIR234
1 23436575 25436656 MIR488")
# specify the database
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# loop through rows, get genes, then paste with collapse,
# and finally bind back with data d.
res <- cbind(
d,
genes = apply(d, 1, function(i){
x <- getBM(attributes=c("external_gene_name"),
filters = c("chromosome_name" , "start", "end"),
values = list(i[1], i[2], i[3]),
mart = ensembl)
# keeping only 3 genes, as output is too long.
# In your case remove below line
x <- head(x, 3)
# return genes, comma separated
paste(x$external_gene_name, collapse = ",")
})
)
res
# V1 V2 V3 V4 genes
# 1 1 2665697 4665777 MIR201 TTC34,AC242022.1,AL592464.2
# 2 1 10391435 12391516 MIR500 AL139424.2,PGD,AL139424.1
# 3 1 15106831 17106911 MIR122 KAZN,TMEM51-AS1,TMEM51
# 4 1 23436535 25436616 MIR234 ASAP3,E2F2,AL021154.1
# 5 1 23436575 25436656 MIR488 ASAP3,E2F2,AL021154.1