I hope this has not already been asked before. Here is an example VCF file:
##fileformat=VCFv4.1
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=20,length=63025520>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
20 20202020 . C G,T 0 PASS . GT:AD 1/2:0,48,61
If I split this file with the command "bcftools norm -m -any" I obtain:
##fileformat=VCFv4.1
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=20,length=63025520>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
20 20202020 . C G 0 PASS . GT:AD 1/0:0,48
20 20202020 . C T 0 PASS . GT:AD 0/1:0,61
However, I am now in the uncomfortable situation where each record is called heterozygous ("1/0" and "0/1") even though the allelic depths, with zero reads for the reference allele, support "1/1" calls rather than "0/1" calls. I am sure people will have different opinions about this, but part of the reason many want to split multi-allelic sites is to evaluate each alternate allele on its own against all the other alleles combined. It would be great to have at least an option to re-format the AD field so that the sum of the AD values is maintained after splitting, so that instead of splitting:
1/2:AD[0],AD[1],AD[2] -> 1/0:AD[0],AD[1] and 0/1:AD[0],AD[2]
It gets split instead as:
1/2:AD[0],AD[1],AD[2] -> 1/0:AD[0]+AD[2],AD[1] and 0/1:AD[0]+AD[1],AD[2]
I hope this makes sense.
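To make the request concrete, here is a minimal Python sketch of the two AD splitting rules above. It is only an illustration, not bcftools code: the function names `split_ad_current` and `split_ad_proposed` are hypothetical, and extending the proposed rule to more than two alternate alleles (by folding all other alleles into the first AD value) is my assumption.

```python
def split_ad_current(ad):
    """Rule seen in the output above: keep AD[0] and AD[i] for each ALT i."""
    return [(ad[0], ad[i]) for i in range(1, len(ad))]


def split_ad_proposed(ad):
    """Proposed rule: fold every other allele's depth into the first value,
    so each split record keeps the same total depth as the original.
    (Generalisation beyond two ALT alleles is an assumption.)"""
    total = sum(ad)
    return [(total - ad[i], ad[i]) for i in range(1, len(ad))]


ad = [0, 48, 61]  # AD from the example record (REF=C, ALT=G,T)
print(split_ad_current(ad))   # [(0, 48), (0, 61)]
print(split_ad_proposed(ad))  # [(61, 48), (48, 61)]
```

With the proposed rule, both split records sum to 109, the same total as the original AD field, and the depths agree with the heterozygous "1/0" and "0/1" genotypes.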