One of the most cost-effective things bioinformatics can do is to simulate giving. These simulations are an underestimated research tool. We often simulate data to benchmark different algorithms or to perform unit tests on our analysis pipelines. From my personal point of view, it’s possible to go further: if we understand a biological problem, then we should be able to simulate data. If the entire problem is correctly understood, then the analysis of simulated and experimental data should lead to the same conclusion. This validates that the mechanism is well understood.
In this post, I’ll describe how to create a small synthetic genomes, assign mutations to it and simulate mutations at a spcific frequencies. The aim is to create a minimal dataset to facilitate unit testing of my pipelines.
Creation of a synthetic genomes
There are many ways of doing this. When working on a single model, it’s best to start with the genome of interest and introduce mutations. If, on the other hand, you want something completly artificial, you can generate a sequence with a probability of 0.25 for each letter (A, T, C and G). A lot of tools exist, like random_dna from Sequence Manipulation Suit, all bioinformatics libraries, makenucseq from EMBOSS.. But a long time ago I made a very basic package inSilicoGenome which allows you to generate simple sequences and mutations, I’m going to use it. - Get the code first, install a correct envs in a linux terminal having conda/mamba
# 1) Clone the repo
git clone git@github.com:propan2one/inSilicoGenome.git
# 2) create a conda env where all tools work together
# in my case, for ease of use, I'm going to use biopython
conda create -y -p ~/envs/insilicogenome \
--channel conda-forge python=3.11.11 Poetry
conda activate ~/envs/insilicogenome
poetry install
- Then use python’s dedicated functions to create the sequences corresponding to the different haplotypes
# In python --------------------------------------------------------------------
# 3) Import function available in the package
from insilicogenome import insilicogenome
from insilicogenome import insilicodata
# 4) Declare all variables
= 5000
size = "haplo_01.fasta"
output =10
range_start=4500
range_end
# 5) Make a reference sequence
= insilicogenome.random_dnasequence(size)
sequence = '0.8')
insilicogenome.write_fasta_genome(output, sequence, description
# 6) Make a haplotype
= insilicodata.generate_table_small_variation(output,
vcf =range_start, range_end=range_end)
range_start
insilicodata.create_variants(output, vcf, =range_start, range_end=range_end) range_start
Once you’ve created the second haplotype, you should have a new fasta sequence and a pseudo VCF that keeps track of all the variations. So with the generate_table_small_variation
function you’ll have, taking the reference: a small insertion and deletion (InDel) of one base, an insertion and deletion of 5 bases, 1 SNV (also called: SNP) and 1 Multi-nucleotide variants (MNVs) decomposed into 5 SNVs making a total of 6 variations.
Verification of variations present in the two haplotypes
A very quick way to check that all variations are present between the two FASTAs is to use Needle from EMBOSS to perform a pairwise alignment. As you can see, there is 99.6% of identity between the two sequence.
# haplo_01 916 . T A 45 PASS 1SNP -----------------------------------
haplo_01 901 ACGCACTATACTTGATAATGGCTGCCGCAGGCGCCGAGCCTTAGGAGTTG 950
|||||||||||||||.||||||||||||||||||||||||||||||||||
haplo_01_10_4 901 ACGCACTATACTTGAAAATGGCTGCCGCAGGCGCCGAGCCTTAGGAGTTG 950
# haplo_01 1967 . CG C 45 PASS small_DEL
# haplo_01 1992 . AAATCT A 45 PASS DEL ----------------------------
haplo_01 1951 TCGCTTTTGCGGTCCGCGAACGTCATCCCCGACCAGGTGGTAAATCTGTC 2000
||||||||||||||||| |||||||||||||||||||||||| |||
haplo_01_10_4 1951 TCGCTTTTGCGGTCCGC-AACGTCATCCCCGACCAGGTGGTA-----GTC 1994
# haplo_01 2925 . T TAGTGA 45 PASS INS ----------------------------
haplo_01 2901 GGCGAGTTTGTTGGGATATTGATAT-----CCGGACTAGACCCTTAACAC 2945
||||||||||||||||||||||||| ||||||||||||||||||||
haplo_01_10_4 2895 GGCGAGTTTGTTGGGATATTGATATAGTGACCGGACTAGACCCTTAACAC 2944
# haplo_01 3123 . A AC 45 PASS small_INS --------------------------
haplo_01 3096 ATAATTCGTGCCTCGAATATCGCTCGCA-CGCGCGCGTATTCGGGAGCAA 3144
|||||||||||||||||||||||||||| |||||||||||||||||||||
haplo_01_10_4 3095 ATAATTCGTGCCTCGAATATCGCTCGCACCGCGCGCGTATTCGGGAGCAA 3144
# haplo_01 3977 . GCAGTT GGGCCA 45 PASS 5MNP -----------------------
haplo_01 3945 TCCCTGTTCCCAATTCCAGCACGCCCGTCTTTGCAGTTAGATACCTGATT 3994
|||||||||||||||||||||||||||||||||.....||||||||||||
haplo_01_10_4 3945 TCCCTGTTCCCAATTCCAGCACGCCCGTCTTTGGGCCAAGATACCTGATT 3994
The variations are taken from the ‘VCF’ file created by insilicogenome. Here, only the lines presenting the variations have been represented. You’ll notice that the MNV is not represented in the same way as in the VCF FILE. In the VCF, it’s decompose into 5 SNVs, which is normal.
Simulation of Illumina data for both haplotypes
To generate the illumina sequencing data, I’ll use InSilicoSeq but other alternative existe, like art_illumina or wgsim. To mimic the presence of two haplotypes in my dataset, I’ll use the relative abundance of the two fasta files previously generated. This way I can ensure that the allelic frequencies of each variation are well defined. This is a simple example, but in theory it’s possible to design more complex phenomena by creating multiple haplotypes of the same genome (perhaps we’ll explore this later).
# 1) Installation of insilicoseq -----------------------------------------------
conda create -y -p ~/envs/insilicoseq \
--channel conda-forge --channel bioconda insilicoseq=2.0.1
conda activate ~/envs/insilicoseq
# 2) Generate the abundance file -----------------------------------------------
cat haplo*.fasta >> haplotypes.fna # Combine the FASTA files
grep -e ">" haplotypes.fna | sed "s/>//g" >> abundance.txt
sed -i "s/_variant/_variant\t0.2/g" abundance.txt # AF of 0.2 for haplotype 2
sed -i "s/ /\t/g" abundance.txt # Correct the space introduce from 'write_fasta_genome'
# 3) Simulate the reads (125) --------------------------------------------------
# Because FASTA is ~5k nt and we want to genrate 50X cov, we want 250k
# nucleotides. So it's ~2000 paired reads
iss generate --cpus 4 -g haplotypes.fna --abundance_file abundance.txt \
-m HiSeq --n_reads 2k --compress -o 20241227JD1_haploreads
rm *tmp* # remove not necessary tmp files
Very well, we now have a synthetic dataset with which we can evaluate bioinfo tools such as variant calling, or even approaches such as genome-wide approaches. We’ll keep the following files:
haplo_01.fasta
the reference file at AF=0.80 andhaplo_01_10_4500_variant.fasta
the alternative reference at AF=0.20.20241227JD1_haploreads_R1.fastq.gz
&20241227JD1_haploreads_R2.fastq.gz
the Illumina paired reads simulated data (length ~125nt).haplo_01_10-4500.vcf
the corresponding VCF file which the differents variations between the two haplotypes.
The dataset includes a number of things to bear in mind when analyzing our simulated data. For example, the fact that the sequences are linear, so we expect falls in coverage at the extremities, or the use of a very small reference. However, this will enable us to use the dataset in pipeline design, whether for unit testing or rough benchmarking.
What did we learn in the process?
- Insert genomic variations into a reference while keeping track of them (VCF & FASTA file).
- Compare two sequences to verify the presence of the introduced variations.
- Simulate illumina data to analyze the synthetic reference.