Tutorials setting examples

Available tutorials and settings files

Here we find settings files for 4 simple test-case scenarios. They are available in ngsphy/data/settings (see Escalona et al. submitted for further details). They correspond to the 4 possible input modes. The trees, sequences, reference alleles and INDELible control files needed for their execution can be found, respectively, in:

ngsphy/data/trees
ngsphy/data/sequences/
ngsphy/data/reference_alleles/
ngsphy/data/indelible/

You can use them to check the proper installation of the pipeline and adapt them to your particular case. In addition, there is a script file for each case scenario in: ngsphy/test

Generating read counts from a single gene tree
- Corresponds to inputmode=1
- Script: ngsphy.test.1.sh
Generating Illumina reads from a single gene tree, using an ancestral sequence
- Corresponds to inputmode=2
- Script: ngsphy.test.2.sh
Generating read counts from a single gene tree, using an anchor sequence
- Corresponds to inputmode=3
- Script: ngsphy.test.3.sh
Generating Illumina reads from gene tree distribution
- Corresponds to inputmode=4
- Script: ngsphy.test.4.sh

NOTES:

All setting files here described assume that:

executables are accessible from any folder, and are properly (re)named

all the requested files are in the same directory, otherwise you will have to change the path related options accordingly.

Current Linux version reports better logs thatn OSX (debug log option).

1. Generating read counts from a single gene tree

Here, we will generate read counts for the tips of a single gene tree. The read counts will have 0.1% of sequencing error and the expected coverage is 100x. Reference allele file is not given, thus the one with the label 1_0_0 is used (default). Output will be stored in the current working directory. Settings file looks as below and is named ngsphy.settings.1.txt

[general]
path=.
output_folder_name=NGSphy_output
ploidy=1
[data]
inputmode=1
gene_tree_file=t1.tree
indelible_control_file=control.1.txt
[coverage]
experiment=F:100
[ngs-read-counts]
read_counts_error=0.1
[execution]
environment=bash
running_times=off
threads=2

FILE: ngsphy.settings.1.txt

For the generation of sequences with INDELible, the control file:

[TYPE] NUCLEOTIDE 1 // nucleotide simulation using algorithm from method 1
[SETTINGS]
    [ancestralprint] NEW // generates a file with the ancestral sequence
    [output] FASTA
[MODEL] m1 // no insertions, no gamma
  [submodel] HKY 2.5 // HKY with kappa=2.5
  [statefreq] 0.1 0.2 0.3 0.4 // frequencies for T C A G
[NGSPHYPARTITION] t1 m1 1000 // t1 with model m1 to generate 1000 bp long sequences

FILE: control.1.txt

The tree has 4 tips, one per individual.

t1.tree

((1_0_0:1.0, 2_0_0:1.0):1.0,(3_0_0:1.0, 4_0_0:1.0):1.0);

FILE: t1.tree

Execution

To run this example, use:

ngsphy -s ngsphy.settings.1.txt

Output

Several output folders/files are produced under the main directory:

alignments: this contains the data used and generated by INDELible.
coverage: the exact coverage for each loci (L) of each individual (I). This is written into a table with dimensions (I x L). In this case, this value was fixed at 100x for all individuals.
individuals: where the sequence files (FASTA) for all loci and individuals are written. There is a subfolder structure reflecting the number of replicates and gene trees. In this case, the single gene-tree t1 has 4 tips, corresponding to 4 haploid individuals, thus we will have 4 FASTA files, each file containing a single sequence corresponding to an individual. Here, an example of a single individual file (see more in Section 6.2.7. Individual assignment).:

$ cat individuals/1/1/NGSphy_1_1_ngsphydata_0.fasta

>NGSphy:1:1:ngsphydata:0:1_0_0
GGCCGTGGCCCGGGGTGGGAAACGGCCGACGAAAATGGGGGAATCCAACCAGTGG....

ind_labels: it has as many files as replicates, and stores the relation replicate/individual/species/locus/sequence. In this case:

cat ind_labels/NGSphy.1.individuals.csv
indexREP,indID,speciesID,locusID,geneID
1,0,1,0,0
1,1,2,0,0
1,2,3,0,0
1,3,4,0,0

reads: stores VCF with the simulated read counts. If sequencing error is introduced, the VCF files without errors will also be written.

|__reads/
    |__no_error/ # VCF files without sequencing error
        |__1/ # replicate identifier
            |__ngsphydata_1_1_NOERROR.vcf
    |__with_error/ # VCF files with sequencing error
        |__1/  # replicate identifier
            |__ngsphydata_1_1.vcf

ref_alleles: FASTA files with the sequences of the reference alleles used for the read count process.

|__ref_alleles/
    |__1/ # replicate identifier
        |__NGSphy_REF_1_1.fasta # FASTA file with reference allele sequence for replicate 1, locus 1.

2. Generating Illumina reads from a single gene tree, using an ancestral sequence

Here we will simulate Illumina reads from diploid individuals evolving under a single gene tree with a known ancestral sequence. The Illumina reads will have the following characteristics:

Machine: HiSeq2000.
100bp PE reads.
Fragments will have mean length of 250bp (standard deviation 50bp).
Expected coverage of 50x.

Settings file looks as below and is named ngsphy.settings.2.txt.

[general]
path=.
output_folder_name=NGSphy_output
ploidy=2
[data]
inputmode=2
gene_tree_file=t2.tree
ancestral_sequence_file=my_ancestral.fasta
indelible_control_file=control.2.txt
[coverage]
experiment=F:50
[ngs-reads-art]
amp=true
l=100
m=250
p=true
q=true
s=50
sam=true
ss=HS20
[execution]
environment = bash
runART=off
threads=2

FILE: ngsphy.settings.2.txt

INDELible control file:

[TYPE] NUCLEOTIDE 1
[SETTINGS]
  [output] FASTA
  [ancestralprint] NEW
[MODEL] m1 // no insertions, no gamma
  [submodel] HKY 0.5 // HKY with kappa=0.5
[NGSPHYPARTITION] t2 m1 500

FILE: control.2.txt

The tree, in this case, has 8 tips belonging to 4 individuals of 4 species.

t2.tree

( ((1_0_1:1.0,1_0_0:1.0):1.0, (2_0_1:1.0,2_0_0:1.0):1.0):1.0,((3_0_1:1.0,3_0_0:1.0):1.0, (4_0_1:1.0,4_0_0:1.0):1.0):1.0);

FILE: t2.tree

Execution

To run this example, use:

ngsphy -s ngsphy.settings.2.txt

Output

Several output files are produced under the main directory:

alignments: this contains the data used and generated by INDELible.
coverage: the exact coverage for each loci (L) of each individual (I). This is written into a table with dimensions (I x L). In this case, this value was fixed at 50x for all individuals.
individuals: where the sequence files (FASTA) for all loci and individuals are written. There is a subfolder structure reflecting the number of replicates and gene trees. In this case, the single gene-tree t2 has 8 tips, corresponding to 4 diploid individuals, thus we will have 4 FASTA files, each file containing a 2 sequences corresponding to an individual.
ind_labels: relation sequence/locus/gene/individual/species. In this case:

$cat NGSphy.1.individuals.csv

indexREP,indID,speciesID,locusID,mateID1,mateID2
1,0,2,0,0,1
1,1,3,0,0,1
1,2,4,0,1,0
1,3,1,0,1,0

reads: stores the FASTQ files generated by ART. In this case the execution has been turned off (runART=off) and we obtain an empty hierarchical folder structure
scripts: File with all needed command lines to generate the Illumina reads from all the diploid individuals in ART. For medium-big datasets it is convenient to use this feature and run ART separately. The file looks like this:

$ cat NGSphy_output/scripts/NGSphy.sh
art_illumina -amp -l 100 -m 250 -p -q -s 50 -sam -ss HS20 --fcov 50.0 --in /home/user/git/test-ngsphy/test2/NGSphy_output/individuals/1/1/NGSphy_1_1_ngsphydata_0.fasta --out /home/user/git/test-ngsphy/test2/NGSphy_output/reads/1/1/NGSphy_1_1_ngsphydata_0_R
art_illumina -amp -l 100 -m 250 -p -q -s 50 -sam -ss HS20 --fcov 50.0 --in /home/user/git/test-ngsphy/test2/NGSphy_output/individuals/1/1/NGSphy_1_1_ngsphydata_1.fasta --out /home/user/git/test-ngsphy/test2/NGSphy_output/reads/1/1/NGSphy_1_1_ngsphydata_1_R
art_illumina -amp -l 100 -m 250 -p -q -s 50 -sam -ss HS20 --fcov 50.0 --in /home/user/git/test-ngsphy/test2/NGSphy_output/individuals/1/1/NGSphy_1_1_ngsphydata_2.fasta --out /home/user/git/test-ngsphy/test2/NGSphy_output/reads/1/1/NGSphy_1_1_ngsphydata_2_R
art_illumina -amp -l 100 -m 250 -p -q -s 50 -sam -ss HS20 --fcov 50.0 --in /home/user/git/test-ngsphy/test2/NGSphy_output/individuals/1/1/NGSphy_1_1_ngsphydata_3.fasta --out /home/user/git/test-ngsphy/test2/NGSphy_output/reads/1/1/NGSphy_1_1_ngsphydata_3_R

3. Generating read counts from a single gene tree, using an anchor sequence

In this example we use a sequence from a specific tip of the tree called anchor sequence - to root the tree and start the simulation. We will obtain sequences from the rest of the gene tree tips keeping their relationships, and then read counts for all tips with 0.1% sequencing error and 100x expected coverage. The allele used as reference for the read counts is the anchor sequence (here 2_0_0; in my_anchor_sequence.fasta), but a different one can be specified.

[general]
path=.
output_folder_name=NGSphy_output
ploidy=1
[data]
inputmode=3
gene_tree_file=t3.tree
anchor_sequence_file=my_anchor_sequence.fasta
anchor_tip_label=2_0_0
indelible_control_file=control.3.txt
[coverage]
experiment=F:100
[ngs-read-counts]
read_counts_error=0.1
reference_alleles_file=my_reference_allele_file.txt
[execution]
environment=bash
running_times=off
threads=2

ngsphy.settings.3.txt

The tree:

t3.tree

((((1_0_0:1.0, 2_0_0:1.0):1.0),(3_0_0:1.0, 4_0_0:1.0):1.0),5_0_0:3.0);

t3.tree

Information on how to evolve the sequences is in the control file that INDELible will use:

[TYPE] NUCLEOTIDE 1
[SETTINGS]
  [ancestralprint] NEW
  [output] FASTA
[MODEL] m1
  [submodel] HKY 0.1
[NGSPHYPARTITION] t3 m1 100

control.3.txt

Execution

To run this example, use:

ngsphy -s ngsphy.settings.3.txt

Output

alignments: this contains the data used and generated by INDELible.
coverage: he exact coverage for each loci (L) of each individual (I). This is written into a table with dimensions (I x L). In this case, this value was fixed at 100x for all individuals.
individuals: where the sequence files (FASTA) for all loci and individuals are written. There is a subfolder structure reflecting the number of replicates and gene trees. In this case, the single gene-tree t3 has 5 tips, corresponding to 5 haploid individuals, thus we will have 5 FASTA files, each file containing a single sequence corresponding to an individual.
ind_labels: relation sequence/locus/gene/individual/species.
reads: stores VCF with the simulated read counts.
ref_alleles: FASTA files with the sequences of the reference alleles used for the read counts process. These files will be generated whether the reference allele matches the anchor sequence or not.

4. Generating Illumina reads from gene tree distribution

In this more complex example, we will simulate Illumina reads from a gene tree distribution, which needs to be first obtained with SimPhy.

SimPhy run

For this example we will simulate 2 species tree replicates with a variable height between 200.000 years and 20.000.000 years (u:200000,20000000). These trees will have 5 ingroup + 1 outgroup species. The ingroup species will have 6 individuals per species, while the outgroup is a single individual. Each replicate will have 10 gene trees. The effective population size (10.000) of each species and the substitution rate (0.00001) of each gene tree are fixed. Finally will add heterogeneity at different levels (-h parameters). To run the simulation we use:

simphy  -rs 2 -rl f:10
        -sb ln:-15,1 -st u:200000,20000000 -sl f:5
        -so f:1 -sp f:100000 -su f:0.00001 -si f:6
        -hh ln:1.2,1 -hl ln:1.4,1 -hg f:200
        -v 1 -o testwsimphy -cs 6656
        -om 1 -od 1 -op 1 -oc 1 -on 1

In detail:

Parameter	Value	Description
`-rs`	2	number of species tree replicates
`-rl`	f:10	number of locus tree per replicate
`-rg`	f:1	number of gene trees per replicate
`-sb`	ln:-15,1	speciation rate (events/time unit)
`-st`	u:200000,20000000	species tree height (time units)
`-sl`	f:5	number of taxa
`-so`	f:1	Ratio between ingroup height and the branch from the root to the ingroup
`-sp`	f:100000	tree-wide effective population size
`-su`	f:0.00001	tree-wide substitution rate
`-si`	f:6	number of individuals per species
`-hh`	ln:1.2,1	Gene-by-lineage-specific locus tree parameter
`-hl`	ln:1.4,1	Gene-family-specific rate heterogeneity modifiers
`-hg`	f:200	Gene-by-lineage-specific rate heterogeneity modifiers
`-v`	1	verbosity: Global settings summary, simulation progress per replicate (number of simulated gene trees), warnings and errors
`-o`	testwsimphy	Common output prefix-name (for folders and files)
`-cs`	6656	Random number generator seed
`-od`	1	Activates the SQLite database output
`-op`	1	Activates logging of sampled options
`-on`	1	Activates the output of the bounded locus subtrees file

To simulate the DNA alignment for every gene tree simulated we now use the script provided with SimPhy (INDELIble_wrapper.pl) and the following INDELible control file.

[TYPE] NUCLEOTIDE 1
[SETTINGS]
	[output]	FASTA
	[fastaextension]	fasta
[MODEL] complex_common
	[submodel]	GTR $(rd:6,16,2,8,20,4)
	[statefreq] $(d:1,1,1,1)
	[rates] 0 $(e:2) 0

[SIMPHY-UNLINKED-MODEL]	simple_unlinked
	[submodel]    HKY $(e:1)
  	[statefreq]   $(d:1,1,1,1)

[SIMPHY-PARTITIONS] simple [1 simple_unlinked 500] //// The first half of the gene families will evolve under the model "simple_unlinked". Their sequence lengths are sampled from a Normal with mean=1000 and sd=100.
[SIMPHY-EVOLVE] 1 data // One sequence alignment for each gene tree, saved in files with "dataset" as common prefix (it will generate dataset_1, dataset_2, etc.)

FILE: control.4.txt. This file is based on a SimPhy example case. For more information on this example please go to SimPhy wiki

To run we use:

# perl INDELIble_wrapper.pl <simphy_folder> <simphy's_indelible_control_file> <seed_for_random_number_generation> <num_threads>
perl INDELIble_wrapper.pl testwsimphy/ ngsphy/data/indelible/control.4.txt $RANDOM 2

Here, we use the Linux evironment variable $RANDOM, which returns a different random integer in the range [0,32767]. For more information, go here.

After this, we will obtain a folder hierarchical structure that looks like this:

|__testwsimphy/
    |__testwsimphy.command    # SimPhy log files
    |__testwsimphy.db    # SimPhy log files
    |__testwsimphy.params  # SimPhy log files
    |__1/ # Species tree replicate
        |__data_[1-10].fasta  # INDELIble output
        |__data_[1-10]_TRUE.phy  # INDELIble output
        |__g_trees[1-10].trees # Simphy output
        |__control.txt # INDELIble_wrapper.pl output
        |__bounded_locus_subtrees.out # Simphy output
        |__LOG.txt  # INDELIble output
        |__l_trees.trees # Simphy output
        |__s_tree.trees # Simphy output
        |__trees.txt # INDELIble output
    |__2/ # Species tree replicate
        |__data_[1-10].fasta  # INDELIble output
        |__data_[1-10]_TRUE.phy  # INDELIble output
        |__g_trees[1-10].trees # SimPhy output
        |__control.txt # INDELIble_wrapper.pl output
        |__bounded_locus_subtrees.out # SimPhy output
        |__LOG.txt  # INDELIble output
        |__l_trees.trees # SimPhy output
        |__s_tree.trees # SimPhy output
        |__trees.txt # INDELIble output

Running NGSphy

Now we use NGSphy to generate the Illumina reads from the tips of these gene-trees with the settings file ngsphy.settings.4.txt (see below). As we want to generate diploid individuals in this case, NGSphy will randomly “mate” tips (gene-copies) within each taxa/species. Outgroup taxa has always one sequence (when it is generated in SimPhy) and is assumed to be homozygous (so its sequence is duplicated before generating reads/read counts). As the species trees generated have 6 taxa (5 ingroup + 1 outgroup), each ingroup having 6 tips, we wil have 16 individuals in total:

6 gene-copies * 5 ingroup taxa = 30 sequences = 15 diploid individuals.
1 outgroup * 2 (homozygous) = 2 sequences = 1 diploid individual.

Also, we want the base coverage to be 100x for both replicates (experiment=F:100), but we want to add some variation among loci and individuals. This variation is in this example modeled with a Log Normal distribution with mean 1.2 and standard deviation 1 for the individuals and a Log Normal distribution with mean 1.3 and standard deviation 1 for the loci. These distributions will model the underlying Gamma distribution that will sample the rate multipliers for the specific individual and locus. For example, the multipliers for the individual variation, for each replicate, might be sampled from distributions like these, first (to the left) the shapes of the Gamma distributions, and finally from the Gamma distribution the multipliers:

Possible distributions for rate multipliers Example of possible distributions of the rate multipliers according the settings file ngsphy.settings.4.txt to model the coverage variation among individuals.

[general]
path=.
output_folder_name=NGSphy_output
ploidy=2
[data]
inputmode=4
simphy_folder_path=./testwsimphy
simphy_data_prefix=data
simphy_filter=true
[coverage]
experiment=F:100
individual=LN:1.2,1
locus=LN:1.3,1
offtarget=0.25, 0.01
notcaptured=0.5
taxon= 1,0.5;2,0.25
[ngs-reads-art]
l=100
m=250
p=true
q=true
s=50
sam=true
ss=HS20
[execution]
environment = bash
runART = on
threads=2

ngsphy.settings.4.txt

Execution

To run this example, use:

ngsphy -s ngsphy.settings.4.txt

Output

coverage: the exact coverage for each loci (L) of each individual (I). This is written into a table with dimensions (I x L). In this case, we include variation in coverage among loci and individuals as in targeted-sequencing experiments. It is also introduced the simulation of off-target loci (thus with less coverage) and some of the loci are assumed to be not captured/sequenced at all. We have 2 files, one per replicate. From the simulated species trees we get 16 individuals per replicate and 10 gene-trees (loci) each. Hence, the coverage tables have 16 x 10 dimensions (I x L). Content of the coverage file (testwsimphy.1.coverage.csv):

$ cat testwsimphy.1.coverage.csv
indID ,L.01 ,L.02 ,L.03 ,L.04 ,L.05 ,L.06 ,L.07 ,L.08 ,L.09 ,L.10
0 ,4.317, 401.540, 467.337, 0.000, 0.0, 222.453, 0.0, 0.0, 190.841, 286.270
1 ,7.646, 711.199, 827.737, 0.0, 0.0, 394.004, 0.0, 0.0, 338.013, 507.035
2 ,6.401, 595.411, 692.976, 0.0, 0.0, 329.857, 0.0, 0.0, 282.982, 424.486
3 ,10.602, 986.243, 1147.850, 0.0, 0.0, 546.378, 0.0, 0.0, 468.733, 703.121
4 ,15.316, 1424.710, 1658.165, 0.0, 0.0, 789.289, 0.0, 0.0, 677.124, 1015.717
5 ,8.363, 777.926, 905.398, 0.0, 0.0, 430.971, 0.0, 0.0, 369.726, 554.606
6 ,3.749, 348.727, 405.870, 0.0, 0.0, 193.195, 0.0, 0.0, 165.740, 248.618
7 ,6.680, 621.410, 723.236, 0.0, 0.0, 344.261, 0.0, 0.0, 295.339, 443.022
8 ,7.538, 701.225, 816.129, 0.0, 0.0, 388.478, 0.0, 0.0, 333.272, 499.924
9 ,3.724, 346.376, 403.134, 0.0, 0.0, 191.892, 0.0, 0.0, 164.623, 246.942
10 ,6.686, 621.916, 723.824, 0.0, 0.0, 344.541, 0.0, 0.0, 295.579, 443.382
11 ,3.334, 310.150, 360.971, 0.0, 0.0, 171.823, 0.0, 0.0, 147.405, 221.115
12 ,18.427, 1714.101, 1994.976, 0.0, 0.0, 949.611, 0.0, 0.0, 814.664, 1222.033
13 ,7.502, 697.831, 812.178, 0.0, 0.0, 386.598, 0.0, 0.0, 331.659, 497.504
14 ,6.712, 624.364, 726.674, 0.0, 0.0, 345.898, 0.0, 0.0, 296.743, 445.128
15 ,0.015, 1.401, 1.631, 0.0, 0.0, 0.776, 0.0, 0.0, 0.666, 0.999

individuals: we will have the sequence files for the individuals generated. This case, we asked to generate diploid individuals. We will have 16 individuals per locus, and each file contains 2 sequences.
ind_labels: relation correspondence replicate/individual/species/locus/sequences.
reads: this folder follows a hierarchical structure like the one obtained in SimPhy, and it stores the FASTQ files generated by ART.
scripts: this folder will be empty since ART is, in this example, ran within NGSphy.