ANNOVAR进行突变注释

ANNOVAR简介

ANNOVAR Documentation ANNOVAR其网页版wANNOVAR

Comparison between ANNOVAR and other command line tools for variant annotation

例如对于人的外显子组测序，可以获得4百万个SNVs(单核苷酸突变)和50万个indels(insertion or deletions)突变。对于基因间区的突变，我们对其侧翼基因感兴趣，并想知道两者之间距离；对于外显子区域的变异，我们关系其对于氨基酸的改变。

ANNOVAR是一个perl编写的命令行工具，能对根据hg18,hg19,hg38不同版本基因组call 出来的遗传变异进行功能注释。允许多种输入文件格式，包括最常被使用的VCF格式，只需要给出chromosome, start position, end position, reference nucleotide and observed nucleotides这6个字段。输出文件也有多种格式，包括注释过的VCF文件、用tab或者逗号分隔的text文件。

ANNOVAR支持三种不同形式的注释： gene-based, region-based 和filter-based. 这三种注释分别针对于每一个variant的不同方面：

基于基因的注释（gene-based annotation）：SNPs或者CNV变异是否造成编码蛋白氨基酸的改变或者影响；
基于区域的注释（region-based annotation）揭示variant 与不同基因组特定区域的关系，例如：它是否与以下区域有overlap: conserved genomic elements，cytogenetic bands, microRNA target sites和Encyclopedia of DNA Elements (ENCODE)-annotated regions等
基于过滤的注释（ filter-based annotation ）则给出这个variant的一系列信息，例如是否该变异在dbSNP数据库报道过，该变异在1000genome项目，外显子组项目中的频率等，以及不同的计算机算法对变异注释的结果（SIFT/PolyPhen/LRT/MutationTaster/MutationAssessor/FATHMM/MetaSVM/MetaLR scores）等

下载（无需安转，但需要安装perl环境）

常用注释数据库下载

注释文件在ANNOVER分For gene-based annotation和For filter-based annotation以列表的形式详细介绍，链接

perl annotate_variation.pl –downdb –webfrom annovar –buildver hg19 refGene humandb/
perl annotate_variation.pl –downdb –webfrom annovar –buildver hg19 1000g2015aug humandb/
perl annotate_variation.pl –downdb –webfrom annovar –buildver hg19 exac03 humandb/
perl annotate_variation.pl –downdb –webfrom annovar –buildver hg19 clinvar_20170905 humandb/
perl annotate_variation.pl –downdb –webfrom annovar –buildver hg19 avsnp150 humandb/
perl annotate_variation.pl –downdb –webfrom annovar –buildver hg19 cosmic70 humandb/
perl annotate_variation.pl –downdb –webfrom annovar –buildver hg19 icgc21 humandb/
perl annotate_variation.pl –downdb –webfrom annovar –buildver hg19 nci60 humandb/

这里下载的是几个通常用到的数据库的最新版：

Build	Table Name	Explanation	Date
hg19	refGene	FASTA sequences for all annotated transcripts in RefSeq Gene	20170601
hg19	1000g2015aug (6 data sets)	The 1000G team fixed a bug in chrX frequency calculation. Based on 201508 collection v5b (based on 201305 alignment)	20150824
hg19	exac03	ExAC 65000 exome allele frequency data for ALL, AFR (African), AMR (Admixed American), EAS (East Asian), FIN (Finnish), NFE (Non-finnish European), OTH (other), SAS (South Asian)). version 0.3. Left normalization done.	20151129
hg19	clinvar_20170905	Clinvar version 20170905 with separate columns (CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID)	20171003
hg19	avsnp150	dbSNP150 with allelic splitting and left-normalization	20170929
hg19	cosmic70	COSMIC database version 70	20140911
hg19	icgc21	International Cancer Genome Consortium version 21	20160622
hg19	nci60	NCI-60 human tumor cell line panel exome sequencing allele frequency data	20130724

‘1000g2015aug’(version August 2015), 是2015年版，包含6个 data sets，提供千人基因组项目的替换等位基因频率等信息。
‘exac03’(version 0.3),是0.3版外显子Exome Aggregation Consortium中报道过的variants。
‘clinvar_20140929’ for the variants reported in the ClinVar database (version 20140929)，ClinVar数据库整合了十多个不同类型数据库、通过标准的命名法来描述疾病，同时支持科研人员将数据下载到本地中，开展更为个性化的研究。ClinVar数据库的目的在于整合这些分散的数据、将变异、临床表型、实证数据以及功能注解与分析等四个方面的信息，通过专家评审，逐步形成一个标准的、可信的、稳定的遗传变异-临床表型相关的数据库。

基于保守基因组元件注释数据文件，每一个在ANNOVAR说明文档都有详细介绍，链接地址，这里我们也只下载常用的即可。

perl annotate_variation.pl --downdb --buildver hg19 cytoBand humandb/ 
perl annotate_variation.pl -build hg19 -downdb gwasCatalog humandb/

‘cytoBand’ 是每个细胞间band（cytogenetic band）的染色体坐标信息 ,

注意：

1、第一个命令中不包含 ‘–webfrom annovar’ 选项, 因此是从the UCSC Genome Browser annotation database下载文件的；

2、 ‘–buildver hg19’ 选项是针对hg19这一版的基因组的；

3、运行上面命令后，在 ‘humandb/’ 目录下会多几个以 ‘hg19’为前缀的文件。

变异注释

ANNOVER input file

ANNOVER的输入文件要求VCF 4.0及以上，这一要求对于绝大多数call 变异的软件，如GATK何SAMtools都是满足的。VCF (Variant Call Format) specifications （包括 VCF4.0，4.1，4.2，以及BCF v2.1）

对于ANNOVAR新手，最方便使用table_annovar.pl程序

table_annovar.pl example/ex1.avinput humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation gx,r,f,f,f -nastring . -csvout -polish -xref example/gene_xref.txt

参数解释

-buildver hg19 ANNOVAR默认注释是根据hg18的，所以这里需要指定hg19。
-out myanno 输出注释结果文件myanno.hg19_multianno.txt
-operation g表示gene-based；-operation gx 表示 gene-based with cross-reference annotation (from -xref argument)；-operation r 表示region-based； -operation f 表示 filter-based。如果不提供xref文件就只能指定-operation g。
-csvout使用逗号输出，如果喜欢使用tab键输出，则只需要去除 -csvout

annotate_variation.pl是 ANNOVAR的核心程序

使用例子：

annotate_variation.pl -geneanno -dbtyep refGene -out ex1 -buildver hg19 example/ex1.avinput humandb/
annotate_variation.pl -regionanno -dbtype cytoBand -buildver hg19 example/ex1.avinput humandb/ 
annotate_variation.pl -filter -dbtype exac03 -buildver hg19 example/ex1.avinput humandb/

参数解释及技术问题

–geneanno -dbtype refGene是默认的，可以省略
ncRNA above refers to RNA without coding annotation. It does not mean that this is a RNA that will never be translated. For example, BC039000 is regarded as ncRNA by ANNOVAR when using UCSC Known Gene annotation, but it is regarded as a protein-coding gene by ANNOVAR when using ENSEMBL annotation
By default, the gene name is printed in the second column in the variant_function file. Sometimes, a user may want to see transcript name instead. The –transcript_function argument can be used to specify this behavior. Note that it is very likely that multiple transcript names will be printed in the output separated by comma, as each gene name typically corresponds to several transcript names

注释结果解释

表头含义

各列内容及含义

Func.refGene, Gene.refGene, GeneDetail.refGene, ExonicFunc.refGene, AAChange.refGene列包含变异对基因结构的影响。其中如果Func.refGene给出注释显示该变异位于exonic/intronic/ncRNA，那么Gene.refGene列给出相应的基因名（如果是多个基因，中间用逗号分割）；如果不是，则给出该变异侧翼两个基因及与其距离。
ExAC* 等列表示在外显子组项目（Exome Aggregation Consortium）数据集中所有样本，包括亚人种中的等位基因频率。
avsnp138 列该变异在dbSNP（version 138）数据集中的SNP 的ID号
其他列包含使用多个变异功能预测软件得到的预测分数，包括SIFT scores, PolyPhen2 HDIV scores, PolyPhen2 HVAR scores, LRT scores, MutationTaster scores, MutationAssessor score, FATHMM scores, GERP++ scores, CADD scores, DANN scores, PhyloP scores and SiPhy scores等。

Func.refGene列具体含义

Value	Default precedence	Explanation	Sequence Ontology
exonic	1	variant overlaps a coding	exon_variant (SO:0001791)
splicing	1	variant is within 2-bp of a splicing junction (use -splicing_threshold to change this)	splicing_variant (SO:0001568)
ncRNA	2	variant overlaps a transcript without coding annotation in the gene definition (see Notes below for more explanation)	non_coding_transcript_variant (SO:0001619)
UTR5	3	variant overlaps a 5’ untranslated region	5_prime_UTR_variant (SO:0001623)
UTR3	3	variant overlaps a 3’ untranslated region	3_prime_UTR_variant (SO:0001624)
intronic	4	variant overlaps an intron	intron_variant (SO:0001627)
upstream	5	variant overlaps 1-kb region upstream of transcription start site	upstream_gene_variant (SO:0001631)
downstream	5	variant overlaps 1-kb region downtream of transcription end site (use -neargene to change this)	downstream_gene_variant (SO:0001632)
intergenic	6	variant is in intergenic region	intergenic_variant (SO:0001628)

The precedence defined above is used to decide what function to print out when a variant fit multiple functional categories

ExonicFunc.refGene列具体含义

Value	Default precedence	Explanation	Sequence Ontology
frameshift insertion	1	an insertion of one or more nucleotides that cause frameshift changes in protein coding sequence	frameshift_elongation (SO:0001909)
frameshift deletion	2	a deletion of one or more nucleotides that cause frameshift changes in protein coding sequence	frameshift_truncation (SO:0001910)
frameshift block substitution	3	a block substitution of one or more nucleotides that cause frameshift changes in protein coding sequence	frameshift_variant (SO:0001589)
stopgain	4	a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate creation of stop codon at the variant site. For frameshift mutations, the creation of stop codon downstream of the variant will not be counted as “stopgain”!	stop_gained (SO:0001587)
stoploss	5	a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate elimination of stop codon at the variant site	stop_lost (SO:0001578)
nonframeshift insertion	6	an insertion of 3 or multiples of 3 nucleotides that do not cause frameshift changes in protein coding sequence	inframe_insertion (SO:0001821)
nonframeshift deletion	7	a deletion of 3 or mutliples of 3 nucleotides that do not cause frameshift changes in protein coding sequence	inframe_deletion (SO:0001822)
nonframeshift block substitution	8	a block substitution of one or more nucleotides that do not cause frameshift changes in protein	coding sequence inframe_variant (SO:0001650)
nonsynonymous SNV	9	a single nucleotide change that cause an amino acid change	missense_variant (SO:0001583)
synonymous SNV	10	a single nucleotide change that does not cause an amino acid change	synonymous_variant (SO:0001819)
unknown	11	unknown function (due to various errors in the gene structure definition in the database file)	sequence_variant (SO:0001060)

A unified annotation format

不同的注释软件对突变注释的表示方式不统一，不利于交流，为此，多个注释软件的作者协定了一份统一的注释格式，文档链接

突变有害性注释结果选择建议（ANNOVER作者）

一般而言，建议使用one popular prediction score (such as SIFT) 和 one meta-score (such as metaSVM)预测来判断突变是否是有害的
对于 noncoding variants的有害性预测推荐使用one popular prediction score (such as PhyloP) 和 one meta-score (such as CADD)

补充

vcf文件经过Annovar注释后突变个数应该相同，但对于原vcf文件类似如下情况，Annovar则会分别注释成两行

在Annovar官网的FAQ已经给出解释：

这里需要重新了解下vcf文件格式，（Annovar VCF Processing Guide）：

GT：样品的基因型（genotype）。两个数字中间用’/‘分开，这两个数字表示双倍体的sample的基因型。0 表示样品中有ref的allele； 1 表示样品中variant的allele； 2表示有第二个variant的allele
VCF文件是描述位点的格式文件。尽管称作变异检出格式，但并不是为了描述变异或者基因型检出。其可以作为包含基因型检出（甚至变异）的一种选择，并且对于很多肺二倍体物种或者喝多情况（例如线粒体或者人的肿瘤），其有时候甚至不能说是包含基因型检出。但是，很多变异检出软件的目的是为了产生基因型检出；它们使用VCF作为输出文件，但是这并不意味着VCF文件是未存储变异检出而专门设计的。
因为VCF是一个位点描述文件，具有很多的结果。首先，并不是一行对应着一个变异。因为多个变异能在相同的位点，VCF文件的一行原则上可以描述多个突变（包含野生型非突变allele），和多种类型的基因型检出。例如，看下面这来自VCF文件的一行、其有8个tab键分割的列。在ALT列，具有数个逗号分割的改变的alleles。因此在单行上，数个插入和删除，以及一个单核苷酸（SNV）突变同时存在。

1 112240038 . CTTT CTTTT,CTTTTT,CTTA,CTT,CT,C . PASS AC=986,3,1249,3,127,3;AF=0.196885,0.000599042,0.249401,0.000599042,0.0253594,0.000599042

很多Annovar使用者喜欢保留VCF文件中的突变注释信息（通过INFO列）。因此对于上述这种情况，我们需要在相同行的INFO列加注释信息到所有的6个allels，并且确保使用者知道相应的注释对应相应的allele

VCF可以hijack你的变异，使得SNVs变为多核苷酸突变，并且使得简单的indels变的复杂。这将对注释造成困难，举上述的 CTTT->CTTA 改变作为例子，其应该是简单的T>A （SNV），但是因为deletion/insertion hijack the locus，其被写为CTTT->CTTA 而不是T->A 。考虑到单个allele频率数据库（例如千人基因组项目频率数据库）将只具有 T->A 但没有 CTTT->CTTA，然后这个变异通过注释软件注释后将被遗漏，甚至其确实在1000G中被观察到过。

同样的，CTTT->CTTTT 变异可能在数据库中没有记录，因为C>CT可能是更合适的记录这个变异的方式。

目前为止还没有描述indel的普遍认同的特定方式。很多使用者偏好做left-normalization（变异开始的位置应该尽可能向左，直至不能再左，这样数目越少越好）。但是HGVS明确指明了left-normalization将被用在cDNA（mRNA）的坐标上,这意味着 right-normalization被要求使用在人的基因组一半的基因上。
在阅读了这些事实后，现在的问题时我们应该如何注释VCF文件来确保最准确的结果呢？

因为left-normalization变得越来越流行，我的建议是是就使用left-normalization。我的第二个建议是各个VCF行仅描述一个变异，以便indel不能隐藏SNPs，以此确保1对1的和数据库匹配。

我作为ANNOVAR开发者决定去预处理所有的1000G文件，ESP6500si文件和dbSNP文件，以便一行包含一个变异，使得每个变异都采取 left-normalized。已经更新的数据再2014年的12月可以获得了。

作为使用者我们应该这样做：（1）分割VCF文件确保每一行只包含一个变异；（2）left-normalize所有的VCF文件；（3）使用ANNOVAR注释

所以使用命令：

bcftools norm -m-both -o ex1.step1.vcf ex1.vcf.gz

bcftools norm -f human_g1k_v37.fasta -o ex1.step2.vcf ex1.step1.vcf

第一个命令分割多allels变异检出为单独的行，第二个命令运行真正的 left-normalization。（有时候第一个命令可能出现没有变异能被分解，尽管在文件中存在这些变异，这种情况下，你可以使用vt program 代替）

目前使用left-normalized的数据集包括：avsnp138，avsnp142，clinvar_20150330
，1000g2014oct，exac03，esp6500siv2

PS:更新（2018年12月25）


perl annotate_variation.pl -downdb -buildver hg19 -webfrom annovar dbscsnv11 humandb_new/  
##剪切位点预测（splicing impact by Ada Boost and Random Forest）
perl annotate_variation.pl -downdb -buildver hg19 -webfrom annovar cg46 humandb_new/
##  在一个全基因组测序项目(包含46个无血缘关系的样本)中的等位基因频率

创新：