下面这个流程是下载这个网站公开数据的方法,使用到的工具是TCGAbiolinks(https://github.com/BioinformaticsFMRP/TCGAbiolinks),
主要是两种RNA表达谱数据和基因突变maf数据
下载的所有文件获取方法
- 站长已经把maf和表达谱文件已经上传到百度云,加入小站vip群里的小伙伴已经获得;
- 下面是下载所用到的方法,也可以自己下载,注意下载所有文件需要至少50G空间。
创建R 4.0环境
代码语言:javascript
复制
conda create -n R4 -c conda-forge -y r-essentials r-base r-devtools
conda activate R4
R
进入R语言环境
下载R包
代码语言:javascript
复制
install.packages("BiocManager")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks") ## 致敬开发者
批量下载代码
代码语言:javascript
复制
library(TCGAbiolinks) projects <- getGDCprojects() projects <- projects$project_id TCGA_dowload<-function(x,dirpath){ #转录组数据 query.exp <-GDCquery( project = x, data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts"" ) GDCdownload(query.exp) Exp <- GDCprepare(query = query.exp) #SNV数据 query.maf <- GDCquery( project = x, data.category = "Simple Nucleotide Variation", access = "open" ) Maf <- GDCprepare(query = query.maf) saveRDS(Maf,file = paste0(dirpath,x,"_maf.rds")) #甲基化数据 for (i in c("450","27")) { query_met.hg38 <- GDCquery( project = x, data.category = "DNA Methylation", platform = paste0("Illumina Human Methylation ",i), data.type = "Methylation Beta Value" ) Met <- GDCprepare(query = query_met.hg38) saveRDS(Met,file = paste0(dirpath,x,"_met_Ill",i,".rds")) } #miRNA数据 query.mirna <- GDCquery( project = x, experimental.strategy = "miRNA-Seq", data.category = "Transcriptome Profiling", data.type = "miRNA Expression Quantification" ) GDCdownload(query.mirna) Mirna <- GDCprepare(query = query.mirna) saveRDS(Mirna,file = paste0(dirpath,x,"_miRNA.rds")) #蛋白表达量 query.rppa <- GDCquery( project = x, data.category = "Proteome Profiling", data.type = "Protein Expression Quantification" ) GDCdownload(query.rppa) Proteins <- GDCprepare(query.rppa) saveRDS(Proteins,file = paste0(dirpath,x,"_protein.rds")) #CNV数据 CNV.type<-c("Allele-specific Copy Number Segment", "Gene Level Copy Number", "Masked Copy Number Segment") for (ii in CNV.type) { query.CNV <- GDCquery( project = project, data.category = "Copy Number Variation", data.type = ii ) GDCdownload(query.CNV) CNV <- GDCprepare(query = query.CNV) saveRDS(CNV,file = paste0(dirpath,project,"_CNV_", stringr::str_replace_all(i," ","_"), ".rds")) } }
批量下载数据
for (j in projects) {
print(j)
try(TCGA_dowload(j,dirpath = "./TCGAbiolinks_data/"),silent = T)
}
下载数据说明
文件使用
- 下载文件保存格式是rds,使用下面方法可以加载
代码语言:javascript
复制
TCGA_ACC_Exp<-readRDA("TCGA-ACC_exp.rds") ##注意文件路径要正确
- 表达谱数据 表达谱数据包括:
代码语言:javascript
复制
TCGA_ACC_Exp_unstrand<-SummarizedExperiment::assay(TCGA_ACC_Exp,1)
- 临床信息 表达谱中整合了临床信息可以用下面方法提取
代码语言:javascript
复制
TCGA_ACC_clinData<-SummarizedExperiment::colData(TCGA_ACC_Exp)
- 关于maf 下载的SNV_maf文件没有临床信息需要自己整理一下才能使用maftools