DNA barcoder

Similarity cutoff calculation

How to use this page

This page calculates similarity cutoffs for a reference dataset. A similarity cutoff is the minimum percentage similarity that an unidentified sequence must share with a reference sequence to be considered a match. It can be given as a single global value or as a set of local values. A global similarity cutoff applies to the whole dataset, whereas local similarity cutoffs are calculated per taxonomic group. Local similarity cutoffs generally give more accurate classification results (see the about page).
The inputs used for the cutoff calculation are explained below.

Reference Data Input

The reference dataset must be provided as a file in FASTA format.
Make sure the description lines of the FASTA file have the following format:
>ID_NAME k__kingdom;p__phylum;c__class;o__order;f__family;g__genus;s__species
Example:
>MH854570 k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Nectriaceae;g__Fusarium;s__Fusarium_equiseti
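
The header format above can be checked with a few lines of Python. The following is only a minimal sketch: the function name `parse_header` is hypothetical and is not part of DNA barcoder, and it assumes that the sequence ID and the lineage are separated by a single space and that every lineage field uses a `prefix__name` pattern.

```python
def parse_header(line):
    """Split a FASTA description line into (sequence ID, {rank prefix: taxon name})."""
    # Drop the leading ">" and split the ID from the lineage at the first space.
    seq_id, _, lineage = line.lstrip(">").strip().partition(" ")
    taxonomy = {}
    for field in lineage.split(";"):
        prefix, sep, name = field.partition("__")
        if sep:  # skip empty or malformed fields that lack the "__" separator
            taxonomy[prefix] = name
    return seq_id, taxonomy


header = (
    ">MH854570 k__Fungi;p__Ascomycota;c__Sordariomycetes;"
    "o__Hypocreales;f__Nectriaceae;g__Fusarium;s__Fusarium_equiseti"
)
seq_id, taxonomy = parse_header(header)
print(seq_id, taxonomy["g"], taxonomy["s"])
# prints: MH854570 Fusarium Fusarium_equiseti
```

Running such a check over every description line before uploading is one way to spot headers with a missing rank or a misplaced separator.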

Similarity Matrix

To calculate the similarity cutoffs, a similarity matrix is required. If no matrix is provided, one is created automatically. Similarity matrices can also be created separately on the analysis page.

Type of Similarity Cutoff

The similarity cutoff is the minimum percentage similarity that an unidentified sequence must share with a reference sequence to be considered a match. It can be given as a single global value or as a set of local values. A global similarity cutoff applies to the whole dataset, whereas local similarity cutoffs are given per taxon. Local similarity cutoffs generally give more accurate results (see the about page).
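
The difference between the two cutoff types can be illustrated with a small Python sketch. It is not taken from DNA barcoder: the function names are hypothetical, the cutoff values are invented for illustration, and falling back to the global value when a taxon has no local cutoff is an assumption rather than documented behaviour.

```python
def cutoff_for(taxon, local_cutoffs, global_cutoff):
    """Return the local cutoff for a taxon, or the global cutoff if none exists (assumed fallback)."""
    return local_cutoffs.get(taxon, global_cutoff)


def passes_cutoff(similarity, taxon, local_cutoffs, global_cutoff):
    """Check whether a similarity percentage reaches the cutoff that applies to the taxon."""
    return similarity >= cutoff_for(taxon, local_cutoffs, global_cutoff)


# Invented example values, in percent.
global_cutoff = 97.0
local_cutoffs = {"Fusarium": 99.0}

print(passes_cutoff(98.0, "Fusarium", local_cutoffs, global_cutoff))     # False
print(passes_cutoff(98.0, "Aspergillus", local_cutoffs, global_cutoff))  # True
```

With these made-up numbers, the same 98% match is rejected for a taxon that has a stricter local cutoff and accepted where only the global value applies.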

Rank

The identification rank is the taxonomic level at which sequences will be classified.
When calculating local cutoff values, a higher rank must also be selected. One cutoff value is then calculated for each clade of that higher rank.
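
How the two ranks relate can be sketched in Python as a grouping step. This is an illustration only: the function `group_by_higher_rank` is hypothetical, the records are toy data, and the rank keys (`"g"`, `"s"`) simply mirror the prefixes of the FASTA description lines.

```python
from collections import defaultdict


def group_by_higher_rank(records, rank, higher_rank):
    """Map each higher-rank clade to the identification-rank labels of its sequences."""
    clades = defaultdict(list)
    for taxonomy in records:
        # Sequences lacking either rank cannot be assigned to a clade and are skipped.
        if rank in taxonomy and higher_rank in taxonomy:
            clades[taxonomy[higher_rank]].append(taxonomy[rank])
    return dict(clades)


# Toy records: one dictionary of rank prefix -> taxon name per sequence.
records = [
    {"g": "Fusarium", "s": "Fusarium_equiseti"},
    {"g": "Fusarium", "s": "Fusarium_oxysporum"},
    {"g": "Aspergillus", "s": "Aspergillus_niger"},
]
clades = group_by_higher_rank(records, rank="s", higher_rank="g")
print(sorted(clades))  # ['Aspergillus', 'Fusarium']
```

With species as the identification rank and genus as the higher rank, each key of `clades` corresponds to one clade for which a local cutoff would be calculated.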

Settings

Minimum sequence alignment length:
The default minimum alignment length is 400. Lowering it is recommended when working with shorter barcodes; for short sequences such as ITS1 or ITS2, a minimum sequence alignment length of 50 is recommended.
This setting is ignored when a similarity matrix is provided.

Minimum number of groups and Minimum number of sequences settings:
The minimum number of groups setting refers to the number of different clades of the identification rank within a higher-rank clade.
The minimum number of sequences setting refers to the number of sequences within a higher-rank clade.
When working with small datasets, lowering these settings is recommended.

Maximum proportion:
A cutoff is only calculated for a taxonomic group when the proportion of its largest group is smaller than the maximum proportion setting. This avoids inaccurate similarity cutoffs caused by imbalanced data.
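
The three filters described above (minimum number of groups, minimum number of sequences, and maximum proportion) can be combined into a single eligibility check. The Python sketch below is an interpretation, not DNA barcoder's code: the function name is hypothetical, the thresholds in the usage lines are invented, the two minimum settings are assumed to be inclusive, and the proportion is assumed to be the size of the largest identification-rank group divided by the number of sequences in the higher-rank clade.

```python
from collections import Counter


def clade_is_eligible(labels, min_groups, min_sequences, max_proportion):
    """Decide whether a higher-rank clade qualifies for a local cutoff calculation.

    labels holds one identification-rank label per sequence in the clade.
    """
    if not labels:
        return False
    counts = Counter(labels)
    # Too few distinct groups or too few sequences (minimums assumed inclusive).
    if len(counts) < min_groups or len(labels) < min_sequences:
        return False
    # Share of the clade's sequences that belong to its largest group.
    largest_share = max(counts.values()) / len(labels)
    return largest_share < max_proportion


balanced = ["sp_a", "sp_a", "sp_b", "sp_b", "sp_c"]
skewed = ["sp_a"] * 9 + ["sp_b"]

print(clade_is_eligible(balanced, min_groups=3, min_sequences=5, max_proportion=0.9))  # True
print(clade_is_eligible(skewed, min_groups=2, min_sequences=5, max_proportion=0.9))    # False
```

In the second call the largest group holds 9 of 10 sequences, so its share is not smaller than the maximum proportion and the clade is skipped, which mirrors the imbalance problem the setting is meant to avoid.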

Remove complexes:
If this setting is turned on, indistinguishable groups are removed before the similarity cutoff calculation. It is turned on automatically for calculations at the species level.