-
Notifications
You must be signed in to change notification settings - Fork 6
Tutorials setting examples
Here we find settings files for 4 simple test-case scenarios. They are available
in ngsphy/data/settings (see Escalona et al. submitted for further details).
They correspond to the 4 possible input modes. The trees, sequences, reference alleles and INDELible
control files needed for their execution can be found, respectively, in:
ngsphy/data/treesngsphy/data/sequences/ngsphy/data/reference_alleles/ngsphy/data/indelible/
You can use them to check the proper installation of the pipeline and adapt them
to your particular case. In addition, there is a script file for each case scenario in: ngsphy/test
-
Generating read counts from a single gene tree
- Corresponds to
inputmode=1 - Script:
ngsphy.test.1.sh
- Corresponds to
-
Generating Illumina reads from a single gene tree, using an ancestral sequence
- Corresponds to
inputmode=2 - Script:
ngsphy.test.2.sh
- Corresponds to
-
Generating read counts from a single gene tree, using an anchor sequence
- Corresponds to
inputmode=3 - Script:
ngsphy.test.3.sh
- Corresponds to
-
Generating Illumina reads from gene tree distribution
- Corresponds to
inputmode=4 - Script:
ngsphy.test.4.sh
- Corresponds to
NOTES:
- All setting files here described assume that:
- executables are accessible from any folder, and are properly (re)named
- all the requested files are in the same directory, otherwise you will have to change the path related options accordingly.
- Current Linux version reports better logs thatn OSX (debug log option).
Here, we will generate read counts for the tips of a single gene tree. The
read counts will have 0.1% of sequencing error and the expected coverage is 100x.
Reference allele file is not given, thus the one with the label 1_0_0 is used (default).
Output will be stored in the current working directory. Settings file looks as
below and is named ngsphy.settings.1.txt
[general]
path=.
output_folder_name=NGSphy_output
ploidy=1
[data]
inputmode=1
gene_tree_file=t1.tree
indelible_control_file=control.1.txt
[coverage]
experiment=F:100
[ngs-read-counts]
read_counts_error=0.1
[execution]
environment=bash
running_times=off
threads=2
FILE: ngsphy.settings.1.txt
For the generation of sequences with INDELible, the control file:
[TYPE] NUCLEOTIDE 1 // nucleotide simulation using algorithm from method 1
[SETTINGS]
[ancestralprint] NEW // generates a file with the ancestral sequence
[output] FASTA
[MODEL] m1 // no insertions, no gamma
[submodel] HKY 2.5 // HKY with kappa=2.5
[statefreq] 0.1 0.2 0.3 0.4 // frequencies for T C A G
[NGSPHYPARTITION] t1 m1 1000 // t1 with model m1 to generate 1000 bp long sequences
FILE: control.1.txt
The tree has 4 tips, one per individual.

((1_0_0:1.0, 2_0_0:1.0):1.0,(3_0_0:1.0, 4_0_0:1.0):1.0);
FILE: t1.tree
To run this example, use:
ngsphy -s ngsphy.settings.1.txt
Several output folders/files are produced under the main directory:
-
alignments: this contains the data used and generated by INDELible. -
coverage: the exact coverage for each loci (L) of each individual (I). This is written into a table with dimensions (I x L). In this case, this value was fixed at100xfor all individuals. -
individuals: where the sequence files (FASTA) for all loci and individuals are written. There is a subfolder structure reflecting the number of replicates and gene trees. In this case, the single gene-tree t1 has4tips, corresponding to4haploid individuals, thus we will have4FASTA files, each file containing a single sequence corresponding to an individual. Here, an example of a single individual file (see more in Section 6.2.7. Individual assignment).:
$ cat individuals/1/1/NGSphy_1_1_ngsphydata_0.fasta
>NGSphy:1:1:ngsphydata:0:1_0_0
GGCCGTGGCCCGGGGTGGGAAACGGCCGACGAAAATGGGGGAATCCAACCAGTGG....-
ind_labels: it has as many files as replicates, and stores the relation replicate/individual/species/locus/sequence. In this case:
cat ind_labels/NGSphy.1.individuals.csv
indexREP,indID,speciesID,locusID,geneID
1,0,1,0,0
1,1,2,0,0
1,2,3,0,0
1,3,4,0,0-
reads: stores VCF with the simulated read counts. If sequencing error is introduced, the VCF files without errors will also be written.
|__reads/
|__no_error/ # VCF files without sequencing error
|__1/ # replicate identifier
|__ngsphydata_1_1_NOERROR.vcf
|__with_error/ # VCF files with sequencing error
|__1/ # replicate identifier
|__ngsphydata_1_1.vcf
-
ref_alleles: FASTA files with the sequences of the reference alleles used for the read count process.
|__ref_alleles/
|__1/ # replicate identifier
|__NGSphy_REF_1_1.fasta # FASTA file with reference allele sequence for replicate 1, locus 1.
Here we will simulate Illumina reads from diploid individuals evolving under a single gene tree with a known ancestral sequence. The Illumina reads will have the following characteristics:
- Machine:
HiSeq2000. -
100bpPE reads. - Fragments will have mean length of
250bp(standard deviation50bp). - Expected coverage of
50x.
Settings file looks as below and is named ngsphy.settings.2.txt.
[general]
path=.
output_folder_name=NGSphy_output
ploidy=2
[data]
inputmode=2
gene_tree_file=t2.tree
ancestral_sequence_file=my_ancestral.fasta
indelible_control_file=control.2.txt
[coverage]
experiment=F:50
[ngs-reads-art]
amp=true
l=100
m=250
p=true
q=true
s=50
sam=true
ss=HS20
[execution]
environment = bash
runART=off
threads=2
FILE: ngsphy.settings.2.txt
INDELible control file:
[TYPE] NUCLEOTIDE 1
[SETTINGS]
[output] FASTA
[ancestralprint] NEW
[MODEL] m1 // no insertions, no gamma
[submodel] HKY 0.5 // HKY with kappa=0.5
[NGSPHYPARTITION] t2 m1 500
FILE: control.2.txt
The tree, in this case, has 8 tips belonging to 4 individuals of 4 species.

( ((1_0_1:1.0,1_0_0:1.0):1.0, (2_0_1:1.0,2_0_0:1.0):1.0):1.0,((3_0_1:1.0,3_0_0:1.0):1.0, (4_0_1:1.0,4_0_0:1.0):1.0):1.0);
FILE: t2.tree
To run this example, use:
ngsphy -s ngsphy.settings.2.txt
Several output files are produced under the main directory:
-
alignments: this contains the data used and generated by INDELible. -
coverage: the exact coverage for each loci (L) of each individual (I). This is written into a table with dimensions (I x L). In this case, this value was fixed at50xfor all individuals. -
individuals: where the sequence files (FASTA) for all loci and individuals are written. There is a subfolder structure reflecting the number of replicates and gene trees. In this case, the single gene-treet2has8tips, corresponding to4diploid individuals, thus we will have4FASTA files, each file containing a2sequences corresponding to an individual. -
ind_labels: relation sequence/locus/gene/individual/species. In this case:
$cat NGSphy.1.individuals.csv
indexREP,indID,speciesID,locusID,mateID1,mateID2
1,0,2,0,0,1
1,1,3,0,0,1
1,2,4,0,1,0
1,3,1,0,1,0-
reads: stores the FASTQ files generated by ART. In this case the execution has been turned off (runART=off) and we obtain an empty hierarchical folder structure -
scripts: File with all needed command lines to generate the Illumina reads from all the diploid individuals in ART. For medium-big datasets it is convenient to use this feature and run ART separately. The file looks like this:
$ cat NGSphy_output/scripts/NGSphy.sh
art_illumina -amp -l 100 -m 250 -p -q -s 50 -sam -ss HS20 --fcov 50.0 --in /home/user/git/test-ngsphy/test2/NGSphy_output/individuals/1/1/NGSphy_1_1_ngsphydata_0.fasta --out /home/user/git/test-ngsphy/test2/NGSphy_output/reads/1/1/NGSphy_1_1_ngsphydata_0_R
art_illumina -amp -l 100 -m 250 -p -q -s 50 -sam -ss HS20 --fcov 50.0 --in /home/user/git/test-ngsphy/test2/NGSphy_output/individuals/1/1/NGSphy_1_1_ngsphydata_1.fasta --out /home/user/git/test-ngsphy/test2/NGSphy_output/reads/1/1/NGSphy_1_1_ngsphydata_1_R
art_illumina -amp -l 100 -m 250 -p -q -s 50 -sam -ss HS20 --fcov 50.0 --in /home/user/git/test-ngsphy/test2/NGSphy_output/individuals/1/1/NGSphy_1_1_ngsphydata_2.fasta --out /home/user/git/test-ngsphy/test2/NGSphy_output/reads/1/1/NGSphy_1_1_ngsphydata_2_R
art_illumina -amp -l 100 -m 250 -p -q -s 50 -sam -ss HS20 --fcov 50.0 --in /home/user/git/test-ngsphy/test2/NGSphy_output/individuals/1/1/NGSphy_1_1_ngsphydata_3.fasta --out /home/user/git/test-ngsphy/test2/NGSphy_output/reads/1/1/NGSphy_1_1_ngsphydata_3_RIn this example we use a sequence from a specific tip of the tree called anchor
sequence - to root the tree and start the simulation. We will obtain sequences
from the rest of the gene tree tips keeping their relationships, and then read
counts for all tips with 0.1% sequencing error and 100x expected coverage. The
allele used as reference for the read counts is the anchor sequence (here 2_0_0;
in my_anchor_sequence.fasta), but a different one can be specified.
[general]
path=.
output_folder_name=NGSphy_output
ploidy=1
[data]
inputmode=3
gene_tree_file=t3.tree
anchor_sequence_file=my_anchor_sequence.fasta
anchor_tip_label=2_0_0
indelible_control_file=control.3.txt
[coverage]
experiment=F:100
[ngs-read-counts]
read_counts_error=0.1
reference_alleles_file=my_reference_allele_file.txt
[execution]
environment=bash
running_times=off
threads=2
ngsphy.settings.3.txt
The tree:

((((1_0_0:1.0, 2_0_0:1.0):1.0),(3_0_0:1.0, 4_0_0:1.0):1.0),5_0_0:3.0);
t3.tree
Information on how to evolve the sequences is in the control file that INDELible will use:
[TYPE] NUCLEOTIDE 1
[SETTINGS]
[ancestralprint] NEW
[output] FASTA
[MODEL] m1
[submodel] HKY 0.1
[NGSPHYPARTITION] t3 m1 100
control.3.txt
To run this example, use:
ngsphy -s ngsphy.settings.3.txt
-
alignments: this contains the data used and generated by INDELible. -
coverage: he exact coverage for each loci (L) of each individual (I). This is written into a table with dimensions (I x L). In this case, this value was fixed at100xfor all individuals. -
individuals: where the sequence files (FASTA) for all loci and individuals are written. There is a subfolder structure reflecting the number of replicates and gene trees. In this case, the single gene-treet3has5tips, corresponding to5haploid individuals, thus we will have5FASTA files, each file containing a single sequence corresponding to an individual. -
ind_labels: relation sequence/locus/gene/individual/species. -
reads: stores VCF with the simulated read counts. -
ref_alleles: FASTA files with the sequences of the reference alleles used for the read counts process. These files will be generated whether the reference allele matches the anchor sequence or not.
In this more complex example, we will simulate Illumina reads from a gene tree distribution, which needs to be first obtained with SimPhy.
For this example we will simulate 2 species tree replicates with a variable height
between 200.000 years and 20.000.000 years (u:200000,20000000). These trees will
have 5 ingroup + 1 outgroup species. The ingroup species will have 6 individuals per
species, while the outgroup is a single individual. Each replicate will have 10 gene trees.
The effective population size (10.000) of each species and the substitution rate (0.00001)
of each gene tree are fixed. Finally will add heterogeneity at different levels (-h parameters).
To run the simulation we use:
simphy -rs 2 -rl f:10
-sb ln:-15,1 -st u:200000,20000000 -sl f:5
-so f:1 -sp f:100000 -su f:0.00001 -si f:6
-hh ln:1.2,1 -hl ln:1.4,1 -hg f:200
-v 1 -o testwsimphy -cs 6656
-om 1 -od 1 -op 1 -oc 1 -on 1
In detail:
| Parameter | Value | Description |
|---|---|---|
-rs |
2 | number of species tree replicates |
-rl |
f:10 | number of locus tree per replicate |
-rg |
f:1 | number of gene trees per replicate |
-sb |
ln:-15,1 | speciation rate (events/time unit) |
-st |
u:200000,20000000 | species tree height (time units) |
-sl |
f:5 | number of taxa |
-so |
f:1 | Ratio between ingroup height and the branch from the root to the ingroup |
-sp |
f:100000 | tree-wide effective population size |
-su |
f:0.00001 | tree-wide substitution rate |
-si |
f:6 | number of individuals per species |
-hh |
ln:1.2,1 | Gene-by-lineage-specific locus tree parameter |
-hl |
ln:1.4,1 | Gene-family-specific rate heterogeneity modifiers |
-hg |
f:200 | Gene-by-lineage-specific rate heterogeneity modifiers |
-v |
1 | verbosity: Global settings summary, simulation progress per replicate (number of simulated gene trees), warnings and errors |
-o |
testwsimphy | Common output prefix-name (for folders and files) |
-cs |
6656 | Random number generator seed |
-od |
1 | Activates the SQLite database output |
-op |
1 | Activates logging of sampled options |
-on |
1 | Activates the output of the bounded locus subtrees file |
To simulate the DNA alignment for every gene tree simulated we now use the script provided
with SimPhy (INDELIble_wrapper.pl) and the following INDELible control file.
[TYPE] NUCLEOTIDE 1
[SETTINGS]
[output] FASTA
[fastaextension] fasta
[MODEL] complex_common
[submodel] GTR $(rd:6,16,2,8,20,4)
[statefreq] $(d:1,1,1,1)
[rates] 0 $(e:2) 0
[SIMPHY-UNLINKED-MODEL] simple_unlinked
[submodel] HKY $(e:1)
[statefreq] $(d:1,1,1,1)
[SIMPHY-PARTITIONS] simple [1 simple_unlinked 500] //// The first half of the gene families will evolve under the model "simple_unlinked". Their sequence lengths are sampled from a Normal with mean=1000 and sd=100.
[SIMPHY-EVOLVE] 1 data // One sequence alignment for each gene tree, saved in files with "dataset" as common prefix (it will generate dataset_1, dataset_2, etc.)
FILE: control.4.txt. This file is based on a SimPhy example case. For more information on this example please go to SimPhy wiki
To run we use:
# perl INDELIble_wrapper.pl <simphy_folder> <simphy's_indelible_control_file> <seed_for_random_number_generation> <num_threads>
perl INDELIble_wrapper.pl testwsimphy/ ngsphy/data/indelible/control.4.txt $RANDOM 2
Here, we use the Linux evironment variable $RANDOM, which returns a different random integer in the range [0,32767].
For more information, go here.
After this, we will obtain a folder hierarchical structure that looks like this:
|__testwsimphy/
|__testwsimphy.command # SimPhy log files
|__testwsimphy.db # SimPhy log files
|__testwsimphy.params # SimPhy log files
|__1/ # Species tree replicate
|__data_[1-10].fasta # INDELIble output
|__data_[1-10]_TRUE.phy # INDELIble output
|__g_trees[1-10].trees # Simphy output
|__control.txt # INDELIble_wrapper.pl output
|__bounded_locus_subtrees.out # Simphy output
|__LOG.txt # INDELIble output
|__l_trees.trees # Simphy output
|__s_tree.trees # Simphy output
|__trees.txt # INDELIble output
|__2/ # Species tree replicate
|__data_[1-10].fasta # INDELIble output
|__data_[1-10]_TRUE.phy # INDELIble output
|__g_trees[1-10].trees # SimPhy output
|__control.txt # INDELIble_wrapper.pl output
|__bounded_locus_subtrees.out # SimPhy output
|__LOG.txt # INDELIble output
|__l_trees.trees # SimPhy output
|__s_tree.trees # SimPhy output
|__trees.txt # INDELIble output
Now we use NGSphy to generate the Illumina reads from the tips of these gene-trees
with the settings file ngsphy.settings.4.txt (see below). As we want to generate
diploid individuals in this case, NGSphy will randomly “mate” tips (gene-copies)
within each taxa/species. Outgroup taxa has always one sequence (when it is generated
in SimPhy) and is assumed to be homozygous (so its sequence is duplicated before
generating reads/read counts). As the species trees generated have 6 taxa
(5 ingroup + 1 outgroup), each ingroup having 6 tips, we wil have 16 individuals
in total:
-
6gene-copies *5ingroup taxa =30sequences =15diploid individuals. -
1outgroup *2(homozygous) =2sequences =1diploid individual.
Also, we want the base coverage to be 100x for both replicates (experiment=F:100),
but we want to add some variation among loci and individuals. This variation is
in this example modeled with a Log Normal distribution with mean 1.2 and standard
deviation 1 for the individuals and a Log Normal distribution with mean 1.3 and
standard deviation 1 for the loci. These distributions will model the underlying
Gamma distribution that will sample the rate multipliers for the specific individual
and locus. For example, the multipliers for the individual variation, for each replicate,
might be sampled from distributions like these, first (to the left) the shapes of the
Gamma distributions, and finally from the Gamma distribution the multipliers:
Example of possible distributions of the rate multipliers according the settings file ngsphy.settings.4.txt to model the coverage variation among individuals.
[general]
path=.
output_folder_name=NGSphy_output
ploidy=2
[data]
inputmode=4
simphy_folder_path=./testwsimphy
simphy_data_prefix=data
simphy_filter=true
[coverage]
experiment=F:100
individual=LN:1.2,1
locus=LN:1.3,1
offtarget=0.25, 0.01
notcaptured=0.5
taxon= 1,0.5;2,0.25
[ngs-reads-art]
l=100
m=250
p=true
q=true
s=50
sam=true
ss=HS20
[execution]
environment = bash
runART = on
threads=2
ngsphy.settings.4.txt
To run this example, use:
ngsphy -s ngsphy.settings.4.txt
-
coverage: the exact coverage for each loci (L) of each individual (I). This is written into a table with dimensions (I x L). In this case, we include variation in coverage among loci and individuals as in targeted-sequencing experiments. It is also introduced the simulation of off-target loci (thus with less coverage) and some of the loci are assumed to be not captured/sequenced at all. We have2files, one per replicate. From the simulated species trees we get16individuals per replicate and10gene-trees (loci) each. Hence, the coverage tables have16 x 10dimensions (I x L). Content of the coverage file (testwsimphy.1.coverage.csv):
$ cat testwsimphy.1.coverage.csv
indID ,L.01 ,L.02 ,L.03 ,L.04 ,L.05 ,L.06 ,L.07 ,L.08 ,L.09 ,L.10
0 ,4.317, 401.540, 467.337, 0.000, 0.0, 222.453, 0.0, 0.0, 190.841, 286.270
1 ,7.646, 711.199, 827.737, 0.0, 0.0, 394.004, 0.0, 0.0, 338.013, 507.035
2 ,6.401, 595.411, 692.976, 0.0, 0.0, 329.857, 0.0, 0.0, 282.982, 424.486
3 ,10.602, 986.243, 1147.850, 0.0, 0.0, 546.378, 0.0, 0.0, 468.733, 703.121
4 ,15.316, 1424.710, 1658.165, 0.0, 0.0, 789.289, 0.0, 0.0, 677.124, 1015.717
5 ,8.363, 777.926, 905.398, 0.0, 0.0, 430.971, 0.0, 0.0, 369.726, 554.606
6 ,3.749, 348.727, 405.870, 0.0, 0.0, 193.195, 0.0, 0.0, 165.740, 248.618
7 ,6.680, 621.410, 723.236, 0.0, 0.0, 344.261, 0.0, 0.0, 295.339, 443.022
8 ,7.538, 701.225, 816.129, 0.0, 0.0, 388.478, 0.0, 0.0, 333.272, 499.924
9 ,3.724, 346.376, 403.134, 0.0, 0.0, 191.892, 0.0, 0.0, 164.623, 246.942
10 ,6.686, 621.916, 723.824, 0.0, 0.0, 344.541, 0.0, 0.0, 295.579, 443.382
11 ,3.334, 310.150, 360.971, 0.0, 0.0, 171.823, 0.0, 0.0, 147.405, 221.115
12 ,18.427, 1714.101, 1994.976, 0.0, 0.0, 949.611, 0.0, 0.0, 814.664, 1222.033
13 ,7.502, 697.831, 812.178, 0.0, 0.0, 386.598, 0.0, 0.0, 331.659, 497.504
14 ,6.712, 624.364, 726.674, 0.0, 0.0, 345.898, 0.0, 0.0, 296.743, 445.128
15 ,0.015, 1.401, 1.631, 0.0, 0.0, 0.776, 0.0, 0.0, 0.666, 0.999
-
individuals: we will have the sequence files for the individuals generated. This case, we asked to generate diploid individuals. We will have16individuals per locus, and each file contains2sequences. -
ind_labels: relation correspondence replicate/individual/species/locus/sequences. -
reads: this folder follows a hierarchical structure like the one obtained in SimPhy, and it stores the FASTQ files generated by ART. -
scripts: this folder will be empty since ART is, in this example, ran within NGSphy.
- About NGSphy
- Citation
- Input/output files
- NGS overage heterogeneity
- Installation
- Usage
- The settings file
- Output
- Additional information
- Getting help
- References
- Generating read counts from a single gene tree (inputmode=1)
- Generating Illumina reads from a single gene tree, using an ancestral sequence (inputmode=2)
- Generating read counts from a single gene tree, using an anchor sequence (inputmode=3)
- Generating Illumina reads from gene tree distribution (inputmode=4)