The MUGA genotype data is analyzed in the following way:
- S1 and S2 backgrounds are selected based on the information provided in the Strain Detail Sheet. Each call in the given Sample is compared with the corresponding call in the S1 and S2 background candidates. A call is classified as S1 Match, S2 Match, Het, Uninformative, or Unknown. Uninformative calls are those that cannot be used to distinguish between S1 and S2 (e.g. where the Sample, S1, and S2 are all the same, or where S1 or S2 is an N call).
- The classified calls are then processed using the DBSCAN clustering algorithm to determine where clusters of like calls (Het, S1, S2, and Unknown) are found on each chromosome. Given a set of like calls positioned along a chromosome, DBSCAN groups calls into a cluster wherever their density exceeds a specified threshold.
- Based on experimentation across multiple samples, the algorithm's density parameters are set so that more than 2 like calls per 1 Mb define a region/cluster.
- See https://en.wikipedia.org/wiki/DBSCAN for more detail on how this algorithm works.
- After clusters (also called regions) of like calls are identified, adjacent regions with the same classification are merged. An unclassified area between two regions is merged into a single larger region if both surrounding regions share the same classification, or split equally between them if their classifications differ.
- Using the merged region data, the calls are reanalyzed to determine the percentage of the genome that is heterozygous, the contributions of S1 and S2, and any unknown regions that could indicate the presence of a third background.
- The calls and merged region data are used to draw an interactive idiogram that clearly displays the regions and allows the user to explore the genotype data and analysis visually.
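
The classification and region-merging steps above can be sketched roughly as follows. This is an illustrative outline, not the pipeline's actual code: the function names, the single-letter call encoding (`"H"` for a heterozygous call, `"N"` for a no-call), the `(start, end, label)` region tuples, and the handling of a Sample no-call are all assumptions made for the example. It also assumes the clustered regions on a chromosome do not overlap, and it leaves out the DBSCAN step itself.

```python
HET = "H"      # assumed encoding of a heterozygous call
NO_CALL = "N"  # assumed encoding of a no-call


def classify_call(sample, s1, s2):
    """Classify one Sample call against the S1 and S2 background calls."""
    # A no-call cannot distinguish S1 from S2. The text names only S1/S2
    # no-calls; treating a Sample no-call the same way is an assumption.
    if NO_CALL in (sample, s1, s2):
        return "Uninformative"
    # Sample, S1, and S2 all agree: the marker carries no information.
    if sample == s1 == s2:
        return "Uninformative"
    if sample == HET:
        return "Het"
    if sample == s1:
        return "S1 Match"
    if sample == s2:
        return "S2 Match"
    return "Unknown"


def merge_regions(regions):
    """Merge like neighbours and resolve gaps on a single chromosome.

    regions: non-overlapping (start, end, label) tuples, positions in bp.
    """
    merged = []
    for start, end, label in sorted(regions):
        if not merged:
            merged.append((start, end, label))
            continue
        prev_start, prev_end, prev_label = merged[-1]
        if prev_label == label:
            # Same classification on both sides: absorb the gap.
            merged[-1] = (prev_start, max(prev_end, end), label)
        else:
            # Different classifications: split the gap at its midpoint.
            mid = (prev_end + start) // 2
            merged[-1] = (prev_start, mid, prev_label)
            merged.append((mid, end, label))
    return merged


def label_fractions(regions):
    """Fraction of the covered length assigned to each classification."""
    total = sum(end - start for start, end, _ in regions)
    if total == 0:
        return {}
    fractions = {}
    for start, end, label in regions:
        fractions[label] = fractions.get(label, 0.0) + (end - start) / total
    return fractions


clusters = [(0, 10, "S1"), (20, 30, "S1"), (40, 50, "S2")]
merged = merge_regions(clusters)
print(merged)  # [(0, 35, 'S1'), (35, 50, 'S2')]
print(label_fractions(merged))
```

Note that `label_fractions` divides by the length covered by the merged regions rather than by the full genome length, and that this sketch leaves any stretch before the first region or after the last one unclassified; a real implementation would need to decide how to treat both.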