The BIGSdb-Pasteur website policy, covering the platform and data use agreement and the privacy notice, was updated on March 25, 2024. Please consult it before using the platform and the data. If you have any questions, contact us.

Genome Quality Check and the “QC status” field

QC criteria:

Starting from January 2023, genomes must comply with the following QC criteria (Table 1), defined in the scope of the KlebNET-GSP consortium, in order to be imported into the BIGSdb Klebsiella database.

Table 1. KlebNET-GSP quality criteria

| Criteria | Method | Accepted criteria | Rejected criteria |
|---|---|---|---|
| Contamination check | conFindr, kmerFinder, kraken2… | < 5% contamination | > 5% contamination |
| Species identification | Kleborate | Acceptable identity | Weak identity |
| Genome quality - Number of contigs | Arbitrary threshold | ≤ 500 contigs | > 500 contigs |
| Genome quality - Genome size | Mean ± 2SD | [4,969,898; 6,132,846] bp | < 4,969,898 or > 6,132,846 bp |
| Genome quality - GC content | Mean ± 2SD | [56.35; 57.98] %GC | < 56.35% or > 57.98% GC |
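
The genome-quality thresholds in Table 1 can be checked programmatically. The following is a minimal Python sketch, not the curators' actual pipeline: the function name `check_genome_quality` and the dictionary keys are hypothetical, and only the numeric thresholds are taken from Table 1. The contamination and species checks are left out because they rely on external tools.

```python
# Thresholds copied from Table 1 (KlebNET-GSP quality criteria).
MAX_CONTIGS = 500
GENOME_SIZE_RANGE = (4_969_898, 6_132_846)  # bp, mean ± 2SD
GC_RANGE = (56.35, 57.98)                   # %GC, mean ± 2SD


def check_genome_quality(n_contigs, genome_size, gc_percent):
    """Map each genome-quality criterion to True (accepted) or False (rejected).

    Hypothetical helper: the argument names are illustrative assumptions.
    """
    return {
        "contigs": n_contigs <= MAX_CONTIGS,
        "genome_size": GENOME_SIZE_RANGE[0] <= genome_size <= GENOME_SIZE_RANGE[1],
        "gc_content": GC_RANGE[0] <= gc_percent <= GC_RANGE[1],
    }


# Example with made-up assembly metrics.
print(check_genome_quality(n_contigs=120, genome_size=5_400_000, gc_percent=57.2))
```

The interval bounds are treated as inclusive, matching the bracket notation used in the table.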

QC status:

The “QC status” field was added to the isolate fields (Figure 1) to record the QC metrics of genomes, in particular for those deposited into the database before the systematic application of the KlebNET QC criteria. The QC criteria proposed by the KlebNET-GSP consortium were applied to all genomes in the database in January 2023, and a QC status code was defined for each genome.

Figure 1. Example of isolate SB20 (id 10) with a QC status

QC status

The QC status is encoded as a 4-character code (e.g., 0000), with each position corresponding to a specific metric in this order: species, number of contigs, genome size and %GC. To build the QC status code, each metric is assigned a score indicating whether its criterion is valid, rejected, or inconclusive:

Code explanations:

0: valid criteria
1: rejected criteria (not a Klebsiella, too many contigs, or genome size or %GC below the lower limit)
2: rejected criteria (genome size or %GC above the upper limit)
x: inconclusive criteria (this occurs for the species check, as rMLST species identification can return multiple results, e.g., due to genome contamination)
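
The scoring rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used by the database: the names `score_range` and `qc_status` are hypothetical, the thresholds are those of Table 1, and the species check is passed in as a precomputed value (`True` for a Klebsiella, `False` otherwise, `None` when inconclusive) because it depends on external tools.

```python
def score_range(value, low, high):
    """Score a metric against an accepted interval: 0 valid, 1 below, 2 above."""
    if value < low:
        return "1"
    if value > high:
        return "2"
    return "0"


def qc_status(species_ok, n_contigs, genome_size, gc_percent):
    """Build the 4-character QC status: species, contigs, genome size, %GC.

    Hypothetical helper; species_ok is True, False, or None (inconclusive).
    """
    if species_ok is None:
        species = "x"
    else:
        species = "0" if species_ok else "1"
    contigs = "0" if n_contigs <= 500 else "1"
    size = score_range(genome_size, 4_969_898, 6_132_846)  # bp, Table 1
    gc = score_range(gc_percent, 56.35, 57.98)             # %GC, Table 1
    return species + contigs + size + gc


# Example with made-up metrics for a genome passing every check.
print(qc_status(True, 120, 5_400_000, 57.2))
```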

Table 2 provides example QC codes and their interpretations.

NB. Only good-quality genomes (QC status: 0000) are used by curators to designate novel alleles, profiles and LIN codes. Genome submissions that fail the QC may be rejected entirely.

Table 2. Example of QC codes and their interpretations

| QC status | Species | Number of contigs | Genome size | %GC | Interpretation |
|---|---|---|---|---|---|
| 0000 | 0 | 0 | 0 | 0 | Valid genome (KlebNET-GSP QC-passed) |
| 0011 | 0 | 0 | 1 | 1 | Genome size too small and low %GC |
| 1122 | 1 | 1 | 2 | 2 | Not a Klebsiella, too many contigs, genome size too big and high %GC |
| 1111 | 1 | 1 | 1 | 1 | Not a Klebsiella, too many contigs, genome size too small and low %GC |
| x100 | x | 1 | 0 | 0 | Species not verified or validated, too many contigs |
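
Reading a QC status code back into plain language, as done in Table 2, can be sketched as below. This is a minimal Python illustration: the function `interpret_qc_status` and the `MESSAGES` lookup are hypothetical, and the wording of each message is adapted from the interpretations in Table 2.

```python
# Order of the metrics in the 4-character QC status code.
METRICS = ("species", "number of contigs", "genome size", "%GC")

# Messages for non-zero scores, adapted from the Table 2 interpretations.
MESSAGES = {
    ("species", "1"): "not a Klebsiella",
    ("species", "x"): "species not verified or validated",
    ("number of contigs", "1"): "too many contigs",
    ("genome size", "1"): "genome size too small",
    ("genome size", "2"): "genome size too big",
    ("%GC", "1"): "low %GC",
    ("%GC", "2"): "high %GC",
}


def interpret_qc_status(code):
    """Return the list of problems encoded in a QC status (empty if valid)."""
    if len(code) != 4:
        raise ValueError(f"expected a 4-character code, got {code!r}")
    problems = []
    for metric, score in zip(METRICS, code):
        if score == "0":
            continue  # valid criterion, nothing to report
        if (metric, score) not in MESSAGES:
            raise ValueError(f"unexpected score {score!r} for {metric}")
        problems.append(MESSAGES[(metric, score)])
    return problems


print(interpret_qc_status("x100"))
```

An empty list corresponds to a valid genome (QC status 0000).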

Internal assembly checks:

Since November 2022, BIGSdb includes a built-in tool to check contiguity metrics of assembly data. The assembly checks are displayed on the isolate’s information page:

Figure 2. Example of assembly check status for a high-quality genome

QC passed

Figure 3. Example of assembly check status for a low-quality genome

QC failed

rMLST species identification:

The rMLST species identification tool verifies the taxonomic designation of isolates by extracting ribosomal MLST alleles from their genomes (Bray et al., 2022, Ribosomal MLST nucleotide identity (rMLST-NI), a rapid bacterial species identification method: application to Klebsiella and Raoultella genomic species validation). The isolate’s information page displays the highest taxonomic rank that can be reliably identified (e.g. species), the taxon, and its full taxonomy. An indication of the confidence in the result is also displayed; it is based on the proportion of alleles found that are unique to a taxon.
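
As a toy illustration of a support value based on the proportion of matched alleles unique to a taxon, consider the Python sketch below. It is not the rMLST implementation: the function `taxon_support`, the input format (one set of taxa per matched allele) and the example data are all assumptions made for illustration.

```python
from collections import Counter


def taxon_support(allele_taxa):
    """Percentage of matched alleles that are unique to each taxon.

    allele_taxa: list of sets, one per matched allele, giving the taxa in
    which that allele is known (hypothetical input format).
    """
    total = len(allele_taxa)
    if total == 0:
        return {}
    unique = Counter()
    for taxa in allele_taxa:
        if len(taxa) == 1:  # allele seen in exactly one taxon
            unique[next(iter(taxa))] += 1
    return {taxon: 100 * n / total for taxon, n in unique.items()}


# Made-up example: three alleles unique to one species, one shared allele.
matches = [
    {"Klebsiella pneumoniae"},
    {"Klebsiella pneumoniae"},
    {"Klebsiella pneumoniae"},
    {"Klebsiella pneumoniae", "Klebsiella variicola"},
]
print(taxon_support(matches))
```

When alleles unique to several taxa are found, several taxa receive non-zero support, which is the kind of multiple result that leads to an "x" in the species position of the QC status.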

Figure 4. Example of rMLST species identification: K. pneumoniae

rMLST species id