Synthetic data generation

Bioinformatics
Synthetic
Simulation
Python
Author

Jean DELMOTTE

Published

December 27, 2024

Image from TheOrsna in Pixabay

One of the most cost-effective things bioinformatics can do is to simulate giving. These simulations are an underestimated research tool. We often simulate data to benchmark different algorithms or to perform unit tests on our analysis pipelines. From my personal point of view, it’s possible to go further: if we understand a biological problem, then we should be able to simulate data. If the entire problem is correctly understood, then the analysis of simulated and experimental data should lead to the same conclusion. This validates that the mechanism is well understood.

In this post, I’ll describe how to create a small synthetic genomes, assign mutations to it and simulate mutations at a spcific frequencies. The aim is to create a minimal dataset to facilitate unit testing of my pipelines.

Creation of a synthetic genomes

There are many ways of doing this. When working on a single model, it’s best to start with the genome of interest and introduce mutations. If, on the other hand, you want something completly artificial, you can generate a sequence with a probability of 0.25 for each letter (A, T, C and G). A lot of tools exist, like random_dna from Sequence Manipulation Suit, all bioinformatics libraries, makenucseq from EMBOSS.. But a long time ago I made a very basic package inSilicoGenome which allows you to generate simple sequences and mutations, I’m going to use it. - Get the code first, install a correct envs in a linux terminal having conda/mamba

# 1) Clone the repo 
git clone git@github.com:propan2one/inSilicoGenome.git
 
# 2) create a conda env where all tools work together
#    in my case, for ease of use, I'm going to use biopython
conda create -y -p ~/envs/insilicogenome \
    --channel conda-forge python=3.11.11 Poetry
conda activate ~/envs/insilicogenome
poetry install
  • Then use python’s dedicated functions to create the sequences corresponding to the different haplotypes
# In python --------------------------------------------------------------------
# 3) Import function available in the package
from insilicogenome import insilicogenome
from insilicogenome import insilicodata

# 4) Declare all variables
size = 5000
output = "haplo_01.fasta"
range_start=10
range_end=4500

# 5) Make a reference sequence
sequence = insilicogenome.random_dnasequence(size)
insilicogenome.write_fasta_genome(output, sequence, description = '0.8')

# 6) Make a haplotype
vcf = insilicodata.generate_table_small_variation(output,
    range_start=range_start, range_end=range_end)
insilicodata.create_variants(output, vcf, 
    range_start=range_start, range_end=range_end)

Once you’ve created the second haplotype, you should have a new fasta sequence and a pseudo VCF that keeps track of all the variations. So with the generate_table_small_variation function you’ll have, taking the reference: a small insertion and deletion (InDel) of one base, an insertion and deletion of 5 bases, 1 SNV (also called: SNP) and 1 Multi-nucleotide variants (MNVs) decomposed into 5 SNVs making a total of 6 variations.

Verification of variations present in the two haplotypes

A very quick way to check that all variations are present between the two FASTAs is to use Needle from EMBOSS to perform a pairwise alignment. As you can see, there is 99.6% of identity between the two sequence.

# haplo_01  916 .   T   A   45  PASS    1SNP -----------------------------------
haplo_01         901 ACGCACTATACTTGATAATGGCTGCCGCAGGCGCCGAGCCTTAGGAGTTG    950
                     |||||||||||||||.||||||||||||||||||||||||||||||||||
haplo_01_10_4    901 ACGCACTATACTTGAAAATGGCTGCCGCAGGCGCCGAGCCTTAGGAGTTG    950

# haplo_01  1967    .   CG  C   45  PASS    small_DEL
# haplo_01  1992    .   AAATCT  A   45  PASS    DEL ----------------------------
haplo_01        1951 TCGCTTTTGCGGTCCGCGAACGTCATCCCCGACCAGGTGGTAAATCTGTC   2000
                     ||||||||||||||||| ||||||||||||||||||||||||     |||
haplo_01_10_4   1951 TCGCTTTTGCGGTCCGC-AACGTCATCCCCGACCAGGTGGTA-----GTC   1994

# haplo_01  2925    .   T   TAGTGA  45  PASS    INS ----------------------------
haplo_01        2901 GGCGAGTTTGTTGGGATATTGATAT-----CCGGACTAGACCCTTAACAC   2945
                     |||||||||||||||||||||||||     ||||||||||||||||||||
haplo_01_10_4   2895 GGCGAGTTTGTTGGGATATTGATATAGTGACCGGACTAGACCCTTAACAC   2944

# haplo_01  3123    .   A   AC  45  PASS    small_INS --------------------------
haplo_01        3096 ATAATTCGTGCCTCGAATATCGCTCGCA-CGCGCGCGTATTCGGGAGCAA   3144
                     |||||||||||||||||||||||||||| |||||||||||||||||||||
haplo_01_10_4   3095 ATAATTCGTGCCTCGAATATCGCTCGCACCGCGCGCGTATTCGGGAGCAA   3144

# haplo_01  3977    .   GCAGTT  GGGCCA  45  PASS    5MNP -----------------------
haplo_01        3945 TCCCTGTTCCCAATTCCAGCACGCCCGTCTTTGCAGTTAGATACCTGATT   3994
                     |||||||||||||||||||||||||||||||||.....||||||||||||
haplo_01_10_4   3945 TCCCTGTTCCCAATTCCAGCACGCCCGTCTTTGGGCCAAGATACCTGATT   3994

The variations are taken from the ‘VCF’ file created by insilicogenome. Here, only the lines presenting the variations have been represented. You’ll notice that the MNV is not represented in the same way as in the VCF FILE. In the VCF, it’s decompose into 5 SNVs, which is normal.

Simulation of Illumina data for both haplotypes

To generate the illumina sequencing data, I’ll use InSilicoSeq but other alternative existe, like art_illumina or wgsim. To mimic the presence of two haplotypes in my dataset, I’ll use the relative abundance of the two fasta files previously generated. This way I can ensure that the allelic frequencies of each variation are well defined. This is a simple example, but in theory it’s possible to design more complex phenomena by creating multiple haplotypes of the same genome (perhaps we’ll explore this later).

# 1) Installation of insilicoseq -----------------------------------------------
conda create -y -p ~/envs/insilicoseq \
    --channel conda-forge --channel bioconda insilicoseq=2.0.1
conda activate ~/envs/insilicoseq

# 2) Generate the abundance file -----------------------------------------------
cat haplo*.fasta >> haplotypes.fna # Combine the FASTA files
grep -e ">" haplotypes.fna  | sed "s/>//g" >> abundance.txt
sed -i "s/_variant/_variant\t0.2/g" abundance.txt # AF of 0.2 for haplotype 2
sed -i "s/ /\t/g" abundance.txt # Correct the space introduce from 'write_fasta_genome'

# 3) Simulate the reads (125) --------------------------------------------------
#   Because FASTA is ~5k nt and we want to genrate 50X cov, we want 250k
#   nucleotides. So it's ~2000 paired reads 
iss generate --cpus 4 -g haplotypes.fna --abundance_file abundance.txt \
    -m HiSeq --n_reads 2k --compress -o 20241227JD1_haploreads
rm *tmp* # remove not necessary tmp files

Very well, we now have a synthetic dataset with which we can evaluate bioinfo tools such as variant calling, or even approaches such as genome-wide approaches. We’ll keep the following files:

  • haplo_01.fasta the reference file at AF=0.80 and haplo_01_10_4500_variant.fasta the alternative reference at AF=0.20.

  • 20241227JD1_haploreads_R1.fastq.gz & 20241227JD1_haploreads_R2.fastq.gz the Illumina paired reads simulated data (length ~125nt).

  • haplo_01_10-4500.vcf the corresponding VCF file which the differents variations between the two haplotypes.

The dataset includes a number of things to bear in mind when analyzing our simulated data. For example, the fact that the sequences are linear, so we expect falls in coverage at the extremities, or the use of a very small reference. However, this will enable us to use the dataset in pipeline design, whether for unit testing or rough benchmarking.

What did we learn in the process?

  • Insert genomic variations into a reference while keeping track of them (VCF & FASTA file).
  • Compare two sequences to verify the presence of the introduced variations.
  • Simulate illumina data to analyze the synthetic reference.