recan-0.1.2/0000777000000000000000000000000013616464714010750 5ustar 00000000000000
recan-0.1.2/PKG-INFO0000666000000000000000000003020613616464714012046 0ustar 00000000000000
Metadata-Version: 2.1
Name: recan
Version: 0.1.2
Summary: recan: recombination analysis tool
Home-page: https://github.com/babinyurii/recan
Author: Yuriy Babin
Author-email: babin.yurii@gmail.com
License: MIT
Download-URL: https://github.com/babinyurii/recan/archive/v_0.1.2.tar.gz
Description: # recan

`recan` is a Python package for constructing genetic distance plots to explore and discover recombination events in viral genomes. This method has previously been implemented in the desktop tools RAT [1], Simplot [2] and RDP4 [8].

## Requirements

To use `recan`, you will need:
- Python 3
- Biopython
- plotly
- pandas
- Jupyter notebook

## Installation

To install the package via `pip`, run:

`$ pip install recan`

If you are going to use `recan` in JupyterLab, follow [the instructions to install the JupyterLab Plotly renderer](https://plot.ly/python/getting-started/#jupyterlab-support-python-35).

## Usage example

The package is intended to be used in a Jupyter notebook.
Import the `Simgen` class from the `recan` package:

```python
from recan.simgen import Simgen
```

Create an object of the `Simgen` class. To initialize the object, pass the path to your alignment in FASTA format as an argument:

```python
sim_obj = Simgen("./datasets/hbv_C_Bj_Ba.fasta")
```

The input data are taken from the article by Sugauchi et al. (2002) [3], which describes a recombination event observed in hepatitis B virus isolates.

The `Simgen` object has a `get_info()` method, which shows information about the alignment.

```python
sim_obj.get_info()
```

```
index:    sequence id:
0    AB048704.1_genotype_C_
1    AB033555.1_Ba
2    AB010291.1_Bj
alignment length:  3215
```

We have three sequences in our alignment. The `Simgen` class is based on the `MultipleSeqAlignment` class of the Biopython library.
So we treat the alignment as an array with n_samples and n_features, where the samples are the sequences themselves and the features are the columns of nucleotides in the alignment. The index corresponds to the sequence. Note that indices start at 0.

After you've created the object, you can draw the similarity plot by calling its `simgen()` method. Pass the following parameters to the method:
- `window`: the sliding window size, i.e. the number of nucleotides the sliding window spans. It is 500 by default.
- `shift`: the step by which the window slides downstream along the alignment. It is 250 by default.
- `pot_rec`: the index of the potential recombinant. All the other sequences will be plotted as a function of distance to that sequence.

Use the `get_info()` method to look up the indices, especially if your alignment has many sequences.

The isolate of genotype Ba is a recombinant between the virus of genotype C and genotype Bj. Let's plot it, setting genotype Ba as the potential recombinant:

```python
sim_obj.simgen(window=200, shift=50, pot_rec=1)
```

![hbv_1](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/HBV_1_rec_C_B_annotated.PNG)

The potential recombinant is not shown in the plot, as the distances are calculated relative to it. The higher the plotted value (i.e. the closer to 1), the closer the sequence is to the recombinant, and vice versa. We can see the typical 'crossover' of the distances, which indicates a possible recombination event. The distance of one isolate 'drops down', whereas the distance of the other remains the same or even gets closer to the potential recombinant; this abrupt drop shows that recombination could have taken place.

The picture from the article is shown below. It is simply turned upside down relative to our plot, so instead of a distance drop we see the distance rising.
Here Bj 'goes away' from genotype C, whereas Ba keeps the same distance.

![Ba_Bj_C](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hbv_C_Bj_Ba.jpg)

By default, the `simgen()` method plots the whole alignment. After initial exploration, we can take a closer look at a particular region by passing the `region` parameter, which slices the alignment. `region` must be a tuple or a list of two integers: the start and the end position of the alignment slice.

```python
region = (start, end)
```

```python
sim_obj.simgen(window=200, shift=50, pot_rec=1, region=(1000, 2700))
```

![hbv_slice_1](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hbv_slice_1.png)

To customize the plot, or just to export and store the data, use the `get_data()` method. It returns a pandas DataFrame with sequences as samples and distances at the given positions as features.

```python
sim_obj.get_data()
```

![hbv_df_example](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hbv_df_example.png)

If the optional parameter `df` is set to `False`, `get_data()` returns a tuple containing a list of ticks and a dictionary of lists. Each dictionary key is a sequence id, and the list under each key contains the corresponding distances.
```python
positions, data = sim_obj.get_data(df=False)
```

```
print(positions)
[1050, 1100, 1150, 1200, 1250, 1300, 1350, 1400, 1450, 1500, 1550, 1600, 1650, 1700, 1750, 1800, 1850, 1900, 1950, 2000, 2050, 2100, 2150, 2200, 2250, 2300, 2350, 2400, 2450, 2500, 2550, 2600, 2650, 2700]
print(data)
{'AB048704.1_genotype_C_': [0.88, 0.935, 0.925, 0.955, 0.955, 0.965, 0.95, 0.935, 0.94, 0.92, 0.9299999999999999, 0.945, 0.925, 0.945, 0.96, 0.95, 0.975, 0.9733333333333334, 0.96, 0.96], 'AB010291.1_Bj': [0.98, 0.975, 0.97, 0.97, 0.965, 0.95, 0.91, 0.88, 0.85, 0.83, 0.825, 0.865, 0.885, 0.9299999999999999, 0.98, 0.97, 0.98, 0.9733333333333334, 0.96, 0.96]}
```

Once you have the data, you can easily customize the plot with your favourite plotting library:

```python
# the DataFrame has one row per sequence and one column per tick position
df = sim_obj.get_data()

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

fig_dist1 = plt.figure(figsize=(20, 8))
plt.plot(df.loc["AB048704.1_genotype_C_", : ], lw=7, alpha=0.7, label="AB048704.1_genotype_C_")
plt.plot(df.loc["AB010291.1_Bj", : ], lw=7, alpha=0.7, label="AB010291.1_Bj")
plt.ylim(0.75, 1.05)
plt.title("similarity distance plot", fontsize=25)
plt.ylabel("distance relative to Ba", fontsize=20)
plt.xlabel("nucleotide position", fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.axvline(1750, alpha=0.5, color="red", lw=3, linestyle="dashed", label="putative recombination break points")
plt.axvline(2250, alpha=0.5, color="red", lw=3, linestyle="dashed" )
plt.legend(prop={"size":20})
plt.show()
```

![hbv_matplotlib](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hbv_matplotlib.png)

The `simgen()` method has an optional parameter `dist`, which denotes the method used to calculate pairwise distance. By default it is set to `pdist`, so `simgen()` calculates the simple pairwise distance.
To use the Kimura 2-parameter distance, set `dist` to `k2p`:

```python
sim_obj.simgen(window=200, shift=50, pot_rec=1, region=(1000, 2700), dist='k2p')
```

To save the distance data in Excel or csv format, use the `save_data()` method:

```python
sim_obj.save_data(out="excel", out_name="hbv_distance_data")
```

If there are about 20 or 30 sequences in the input file and their names are long, the legend may hide the plot. So, to analyze many sequences at once, it's better to use short, concise sequence names instead of long ones, like this:

![hbv_short_names](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/short_names.png)

To illustrate what typical breakpoints may look like, here are some examples of previously described recombination events in the genomes of different viruses. The FASTA alignments used are available in the [datasets folder](datasets).

Putative recombination events in the 145000 bp genome of lumpy skin disease virus [4]:

![lsdv](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/lsdv_rec_sar.png)

Recombination in the HIV genome [5]:

![hiv](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hiv_rec_kal153.png)

HCV intergenotype recombinant 2k/1b [6]:

![hcv](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hcv_2k_1b_rec.png)

Norovirus recombinant isolate [7]:

![norovirus](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/norovirus_rec.png)

**references**

1. Recombination Analysis Tool (RAT): a program for the high-throughput detection of recombination. Bioinformatics, Volume 21, Issue 3, 1 February 2005, Pages 278–281, https://doi.org/10.1093/bioinformatics/bth500
2. https://sray.med.som.jhmi.edu/SCRoftware/simplot/
3. Hepatitis B Virus of Genotype B with or without Recombination with Genotype C over the Precore Region plus the Core Gene. Fuminaka Sugauchi et al. JOURNAL OF VIROLOGY, June 2002, p. 5985–5992.
   10.1128/JVI.76.12.5985-5992.2002 https://jvi.asm.org/content/76/12/5985
4. Sprygin A, Babin Y, Pestova Y, Kononova S, Wallace DB, Van Schalkwyk A, et al. (2018) Analysis and insights into recombination signals in lumpy skin disease virus recovered in the field. PLoS ONE 13(12): e0207480. https://doi.org/10.1371/journal.pone.0207480
5. Liitsola, K., Holm K., Bobkov, A., Pokrovsky, V., Smolskaya,T., Leinikki,P., Osmanov,S. and Salminen,M. (2000) An AB recombinant and its parental HIV type 1 strains in the area of the former Soviet Union: low requirements for sequence identity in recombination. UNAIDS Virus Isolation Network. AIDS Res. Hum. Retroviruses, 16, 1047–1053.
6. Smith, D. B., Bukh, J., Kuiken, C., Muerhoff, A. S., Rice, C. M., Stapleton, J. T., & Simmonds, P. (2014). Expanded classification of hepatitis C virus into 7 genotypes and 67 subtypes: Updated criteria and genotype assignment web resource. Hepatology, 59(1), 318–327. https://doi.org/10.1002/hep.26744
7. Jiang,X., Espul,C., Zhong,W.M., Cuello,H. and Matson,D.O. (1999) Characterization of a novel human calicivirus that may be a naturally occurring recombinant. Arch. Virol., 144, 2377–2387.
8. Martin, D. P., Murrell, B., Golden, M., Khoosal, A., & Muhire, B. (2015). RDP4: Detection and analysis of recombination patterns in virus genomes. Virus Evolution, 1(1), 1–5.
   https://doi.org/10.1093/ve/vev003
Keywords: DNA recombination,bioinformatics,genetic distance
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Build Tools
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Description-Content-Type: text/markdown
recan-0.1.2/README.md0000666000000000000000000002335513616517342012233 0ustar 00000000000000
# recan

`recan` is a Python package for constructing genetic distance plots to explore and discover recombination events in viral genomes. This method has previously been implemented in the desktop tools RAT [1], Simplot [2] and RDP4 [8].

## Requirements

To use `recan`, you will need:
- Python 3
- Biopython
- plotly
- pandas
- Jupyter notebook

## Installation

To install the package via `pip`, run:

`$ pip install recan`

If you are going to use `recan` in JupyterLab, follow [the instructions to install the JupyterLab Plotly renderer](https://plot.ly/python/getting-started/#jupyterlab-support-python-35).

## Usage example

The package is intended to be used in a Jupyter notebook.
Import the `Simgen` class from the `recan` package:

```python
from recan.simgen import Simgen
```

Create an object of the `Simgen` class. To initialize the object, pass the path to your alignment in FASTA format as an argument:

```python
sim_obj = Simgen("./datasets/hbv_C_Bj_Ba.fasta")
```

The input data are taken from the article by Sugauchi et al. (2002) [3], which describes a recombination event observed in hepatitis B virus isolates.

The `Simgen` object has a `get_info()` method, which shows information about the alignment.
```python
sim_obj.get_info()
```

```
index:    sequence id:
0    AB048704.1_genotype_C_
1    AB033555.1_Ba
2    AB010291.1_Bj
alignment length:  3215
```

We have three sequences in our alignment. The `Simgen` class is based on the `MultipleSeqAlignment` class of the Biopython library. So we treat the alignment as an array with n_samples and n_features, where the samples are the sequences themselves and the features are the columns of nucleotides in the alignment. The index corresponds to the sequence. Note that indices start at 0.

After you've created the object, you can draw the similarity plot by calling its `simgen()` method. Pass the following parameters to the method:
- `window`: the sliding window size, i.e. the number of nucleotides the sliding window spans. It is 500 by default.
- `shift`: the step by which the window slides downstream along the alignment. It is 250 by default.
- `pot_rec`: the index of the potential recombinant. All the other sequences will be plotted as a function of distance to that sequence.

Use the `get_info()` method to look up the indices, especially if your alignment has many sequences.

The isolate of genotype Ba is a recombinant between the virus of genotype C and genotype Bj. Let's plot it, setting genotype Ba as the potential recombinant:

```python
sim_obj.simgen(window=200, shift=50, pot_rec=1)
```

![hbv_1](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/HBV_1_rec_C_B_annotated.PNG)

The potential recombinant is not shown in the plot, as the distances are calculated relative to it. The higher the plotted value (i.e. the closer to 1), the closer the sequence is to the recombinant, and vice versa. We can see the typical 'crossover' of the distances, which indicates a possible recombination event.
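
To make the sliding-window comparison behind this plot concrete, here is a minimal, self-contained sketch. It is not `recan`'s internal code: the helper names `identity` and `sliding_identity` and the toy sequences are assumptions made for illustration, and every window here spans exactly `window` columns.

```python
def identity(seq_a, seq_b):
    """Share of identical positions, skipping columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if "-" not in (a, b)]
    if not pairs:
        return None  # nothing to compare in this window
    matches = sum(a == b for a, b in pairs)
    return matches / len(pairs)


def sliding_identity(query, other, window, shift):
    """Identity of `other` to `query` in windows of `window` columns, moved by `shift`."""
    profile = []
    for start in range(0, len(query) - window + 1, shift):
        stop = start + window
        profile.append(identity(query[start:stop], other[start:stop]))
    return profile


# toy alignment (hypothetical data): the "recombinant" matches parent_1
# on the left half and parent_2 on the right half
recombinant = "AAAAAAAACCCCCCCC"
parent_1 = "AAAAAAAAGGGGGGGG"
parent_2 = "TTTTTTTTCCCCCCCC"

profile_1 = sliding_identity(recombinant, parent_1, window=4, shift=4)
profile_2 = sliding_identity(recombinant, parent_2, window=4, shift=4)
print(profile_1)  # [1.0, 1.0, 0.0, 0.0]
print(profile_2)  # [0.0, 0.0, 1.0, 1.0]
```

The two profiles swap places halfway along the toy alignment, which is the same 'crossover' pattern, in exaggerated form, that the plot above shows for the real isolates.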
The distance of one isolate 'drops down', whereas the distance of the other remains the same or even gets closer to the potential recombinant; this abrupt drop shows that recombination could have taken place.

The picture from the article is shown below. It is simply turned upside down relative to our plot, so instead of a distance drop we see the distance rising. Here Bj 'goes away' from genotype C, whereas Ba keeps the same distance.

![Ba_Bj_C](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hbv_C_Bj_Ba.jpg)

By default, the `simgen()` method plots the whole alignment. After initial exploration, we can take a closer look at a particular region by passing the `region` parameter, which slices the alignment. `region` must be a tuple or a list of two integers: the start and the end position of the alignment slice.

```python
region = (start, end)
```

```python
sim_obj.simgen(window=200, shift=50, pot_rec=1, region=(1000, 2700))
```

![hbv_slice_1](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hbv_slice_1.png)

To customize the plot, or just to export and store the data, use the `get_data()` method. It returns a pandas DataFrame with sequences as samples and distances at the given positions as features.

```python
sim_obj.get_data()
```

![hbv_df_example](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hbv_df_example.png)

If the optional parameter `df` is set to `False`, `get_data()` returns a tuple containing a list of ticks and a dictionary of lists. Each dictionary key is a sequence id, and the list under each key contains the corresponding distances.
```python
positions, data = sim_obj.get_data(df=False)
```

```
print(positions)
[1050, 1100, 1150, 1200, 1250, 1300, 1350, 1400, 1450, 1500, 1550, 1600, 1650, 1700, 1750, 1800, 1850, 1900, 1950, 2000, 2050, 2100, 2150, 2200, 2250, 2300, 2350, 2400, 2450, 2500, 2550, 2600, 2650, 2700]
print(data)
{'AB048704.1_genotype_C_': [0.88, 0.935, 0.925, 0.955, 0.955, 0.965, 0.95, 0.935, 0.94, 0.92, 0.9299999999999999, 0.945, 0.925, 0.945, 0.96, 0.95, 0.975, 0.9733333333333334, 0.96, 0.96], 'AB010291.1_Bj': [0.98, 0.975, 0.97, 0.97, 0.965, 0.95, 0.91, 0.88, 0.85, 0.83, 0.825, 0.865, 0.885, 0.9299999999999999, 0.98, 0.97, 0.98, 0.9733333333333334, 0.96, 0.96]}
```

Once you have the data, you can easily customize the plot with your favourite plotting library:

```python
# the DataFrame has one row per sequence and one column per tick position
df = sim_obj.get_data()

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

fig_dist1 = plt.figure(figsize=(20, 8))
plt.plot(df.loc["AB048704.1_genotype_C_", : ], lw=7, alpha=0.7, label="AB048704.1_genotype_C_")
plt.plot(df.loc["AB010291.1_Bj", : ], lw=7, alpha=0.7, label="AB010291.1_Bj")
plt.ylim(0.75, 1.05)
plt.title("similarity distance plot", fontsize=25)
plt.ylabel("distance relative to Ba", fontsize=20)
plt.xlabel("nucleotide position", fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.axvline(1750, alpha=0.5, color="red", lw=3, linestyle="dashed", label="putative recombination break points")
plt.axvline(2250, alpha=0.5, color="red", lw=3, linestyle="dashed" )
plt.legend(prop={"size":20})
plt.show()
```

![hbv_matplotlib](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hbv_matplotlib.png)

The `simgen()` method has an optional parameter `dist`, which denotes the method used to calculate pairwise distance. By default it is set to `pdist`, so `simgen()` calculates the simple pairwise distance.
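
As a rough illustration of what the `dist` option changes, the sketch below scores one toy pair of sequences in two ways: with the simple pairwise score and with the Kimura 2-parameter score, using the formula given in the `simgen.py` docstring, d = -0.5 ln((1 - 2p - q) sqrt(1 - 2q)). The function names `p_similarity` and `k2p_similarity` are hypothetical, the input is assumed to contain only uppercase A, C, G, T and gaps, and both functions return 1 - distance so that higher means more similar, as in the plots.

```python
from math import log, sqrt

TRANSITIONS = {"AG", "GA", "CT", "TC"}  # purine<->purine, pyrimidine<->pyrimidine


def gap_free_pairs(seq_a, seq_b):
    """Aligned columns where neither sequence has a gap."""
    return [(a, b) for a, b in zip(seq_a, seq_b) if "-" not in (a, b)]


def p_similarity(seq_a, seq_b):
    """1 - p-distance: the share of identical gap-free columns."""
    pairs = gap_free_pairs(seq_a, seq_b)
    mismatches = sum(a != b for a, b in pairs)
    return 1 - mismatches / len(pairs)


def k2p_similarity(seq_a, seq_b):
    """1 - Kimura 2-parameter distance."""
    pairs = gap_free_pairs(seq_a, seq_b)
    ts = sum(a + b in TRANSITIONS for a, b in pairs)
    # assumption: any other mismatch between A/C/G/T is a transversion
    tv = sum(a != b and a + b not in TRANSITIONS for a, b in pairs)
    p = ts / len(pairs)  # transition frequency
    q = tv / len(pairs)  # transversion frequency
    d = -0.5 * log((1 - 2 * p - q) * sqrt(1 - 2 * q))
    return 1 - d


seq_a = "ACGTACGTAC"
seq_b = "ACGTACGTAT"  # differs by one transition (C -> T) in ten columns

print(round(p_similarity(seq_a, seq_b), 3))    # 0.9
print(round(k2p_similarity(seq_a, seq_b), 3))  # 0.888
```

The Kimura score comes out slightly lower than the simple one because the model corrects for substitutions that may have hit the same site more than once.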
To use the Kimura 2-parameter distance, set `dist` to `k2p`:

```python
sim_obj.simgen(window=200, shift=50, pot_rec=1, region=(1000, 2700), dist='k2p')
```

To save the distance data in Excel or csv format, use the `save_data()` method:

```python
sim_obj.save_data(out="excel", out_name="hbv_distance_data")
```

If there are about 20 or 30 sequences in the input file and their names are long, the legend may hide the plot. So, to analyze many sequences at once, it's better to use short, concise sequence names instead of long ones, like this:

![hbv_short_names](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/short_names.png)

To illustrate what typical breakpoints may look like, here are some examples of previously described recombination events in the genomes of different viruses. The FASTA alignments used are available in the [datasets folder](datasets).

Putative recombination events in the 145000 bp genome of lumpy skin disease virus [4]:

![lsdv](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/lsdv_rec_sar.png)

Recombination in the HIV genome [5]:

![hiv](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hiv_rec_kal153.png)

HCV intergenotype recombinant 2k/1b [6]:

![hcv](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hcv_2k_1b_rec.png)

Norovirus recombinant isolate [7]:

![norovirus](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/norovirus_rec.png)

**references**

1. Recombination Analysis Tool (RAT): a program for the high-throughput detection of recombination. Bioinformatics, Volume 21, Issue 3, 1 February 2005, Pages 278–281, https://doi.org/10.1093/bioinformatics/bth500
2. https://sray.med.som.jhmi.edu/SCRoftware/simplot/
3. Hepatitis B Virus of Genotype B with or without Recombination with Genotype C over the Precore Region plus the Core Gene. Fuminaka Sugauchi et al. JOURNAL OF VIROLOGY, June 2002, p. 5985–5992.
   10.1128/JVI.76.12.5985-5992.2002 https://jvi.asm.org/content/76/12/5985
4. Sprygin A, Babin Y, Pestova Y, Kononova S, Wallace DB, Van Schalkwyk A, et al. (2018) Analysis and insights into recombination signals in lumpy skin disease virus recovered in the field. PLoS ONE 13(12): e0207480. https://doi.org/10.1371/journal.pone.0207480
5. Liitsola, K., Holm K., Bobkov, A., Pokrovsky, V., Smolskaya,T., Leinikki,P., Osmanov,S. and Salminen,M. (2000) An AB recombinant and its parental HIV type 1 strains in the area of the former Soviet Union: low requirements for sequence identity in recombination. UNAIDS Virus Isolation Network. AIDS Res. Hum. Retroviruses, 16, 1047–1053.
6. Smith, D. B., Bukh, J., Kuiken, C., Muerhoff, A. S., Rice, C. M., Stapleton, J. T., & Simmonds, P. (2014). Expanded classification of hepatitis C virus into 7 genotypes and 67 subtypes: Updated criteria and genotype assignment web resource. Hepatology, 59(1), 318–327. https://doi.org/10.1002/hep.26744
7. Jiang,X., Espul,C., Zhong,W.M., Cuello,H. and Matson,D.O. (1999) Characterization of a novel human calicivirus that may be a naturally occurring recombinant. Arch. Virol., 144, 2377–2387.
8. Martin, D. P., Murrell, B., Golden, M., Khoosal, A., & Muhire, B. (2015). RDP4: Detection and analysis of recombination patterns in virus genomes. Virus Evolution, 1(1), 1–5.
   https://doi.org/10.1093/ve/vev003
recan-0.1.2/recan/0000777000000000000000000000000013616464714012040 5ustar 00000000000000
recan-0.1.2/recan/__init__.py0000666000000000000000000000000213616144756014142 0ustar 00000000000000
recan-0.1.2/recan/simgen.py0000666000000000000000000002671613616144756013711 0ustar 00000000000000
"""
Simgen realizes the interface to manipulate alignment in Jupyter notebook,
and to explore recombination events using similarity plots
"""

from Bio.Align import MultipleSeqAlignment
from Bio.SeqRecord import SeqRecord
#import matplotlib.pyplot as plt
import pandas as pd
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go


class Simgen(MultipleSeqAlignment):
    """
    A class used to represent an alignment object
    ---
    Methods
    -------
    simgen(pot_rec, window=500, shift=250, region=False, dist="pdist")
        outputs similarity plots using sliding window approach

    get_data(df=True)
        returns distance data as a pandas DataFrame object or
        as a tuple of the tick positions and a dictionary of lists
        with key as a sequence id and distances in corresponding lists

    get_info()
        outputs information about the alignment, index (which is the row number),
        sequence names, and alignment length

    save_data(path=False, out="csv", out_name="distance_data")
        saves the data spreadsheet as a csv or excel file
    """

    def __init__(self, path):
        """
        initializing Simgen
        ---
        Parameters
        ----------
        path: str
            path to the fasta file containing alignment
        """
        from Bio import AlignIO
        recs_prepared = (x for x in AlignIO.read(path, "fasta"))  # it works without it, but i believe it's formally right
        super(Simgen, self).__init__(recs_prepared)  # you need explicitly call the __init__ of the upperclass
        self._distance = {}  # empty, until call of the simgen function
        self._align = None  # current slice or whole MultipleSeqRecord for plotting
        self._ticks = None  # ticks for plot

    def _draw_simplot(self):
        """draws similarity plot using plotly"""
        init_notebook_mode()
        data = []
        for key in self._distance.keys():
            trace = go.Scatter(y=self._distance[key],
                               x=self._ticks, name=key)
            data.append(trace)

        layout = go.Layout(
            xaxis=dict(
                title="nucleotide position"),
            yaxis=dict(
                title="sequence identity"),
            legend=dict(x=-0.1, y=1.5, orientation="h"))
        #legend=dict(x=-0.1, y=1.5))
        fig = go.Figure(data=data, layout=layout)
        iplot(fig)

    #TODO
    # tick label merge
    def _draw_simplot_mpl(self):
        """draws similarity plot using matplotlib"""
        # imported here because the module-level matplotlib import is commented out
        import matplotlib.pyplot as plt

        data = list(self._distance.values())
        ticks = self._ticks[1:]
        #labels = list(self._distance.keys())
        #print(data, ticks, labels, sep="\n")
        plt.figure(figsize=(15, 8))
        for i in data:
            plt.plot(range(1, len(i) + 1), i)
        # one tick position per tick label
        plt.xticks(list(range(1, len(ticks) + 1)), ticks, rotation='vertical')
        plt.show()

    def _get_x_labels(self, left_border, right_border, shift):
        """creates tick labels"""
        tick_container = []
        tick_container.append(left_border)
        while tick_container[-1] < right_border:
            tick_container.append(tick_container[-1] + shift)
        # clamp the last tick to the right border
        if tick_container[-1] > right_border:
            tick_container[-1] = right_border
        #self._ticks = tick_container
        return tick_container

    def _move_window(self, window, pot_rec, shift, dist):
        """moves window"""
        distance_data = {}
        parents = list(range(0, len(self._align)))
        parents.remove(pot_rec)
        align_length = len(self._align[0, :])

        for par in parents:
            dist_container = []
            # note: the first slice spans `shift` columns, later ones span `window`
            start = 0
            finish = shift
            while start < align_length:
                seq1 = self._align[pot_rec, start:finish].seq  # here is a potential recombinant sequence slice
                seq2 = self._align[par, start:finish].seq  # here's a parent's slice
                if dist == "pdist":
                    distance = self._pdistance(seq1, seq2)
                    dist_container.append(distance)  # calculate pdistance, append to container
                elif dist == "k2p":
                    distance = self._K2Pdistance(seq1, seq2)
                    dist_container.append(distance)
                #dist_container.append(self._K2Pdistance(seq1, seq2)) #calculate pdistance, append to container
                start += shift
                finish = start + window
            distance_data[self._align[par].id] = dist_container
        #self._distance = distance_data
        return distance_data

    def _K2Pdistance(self, seq1, seq2):
        """
        Kimura 2-Parameter
        distance = -0.5 log( (1 - 2p -q) * sqrt( 1 - 2q ) )
        where:
        p = transition frequency
        q = transversion frequency
        """
        from math import log, sqrt
        pairs = []

        # keep only the columns without gaps
        for x in zip(seq1, seq2):
            if '-' not in x:
                pairs.append(x)

        ts_count = 0
        tv_count = 0
        length = len(pairs)
        if length == 0:
            # the window contains only gaps, so there is nothing to compare
            return None

        transitions = ["AG", "GA", "CT", "TC"]
        transversions = ["AC", "CA", "AT", "TA", "GC", "CG", "GT", "TG"]

        for (x, y) in pairs:
            if x + y in transitions:
                ts_count += 1
            elif x + y in transversions:
                tv_count += 1

        p = float(ts_count) / length
        q = float(tv_count) / length
        try:
            d = -0.5 * log((1 - 2*p - q) * sqrt(1 - 2*q))
        except ValueError:
            print("Tried to take log of a negative number")
            return None
        return 1 - d

    def _pdistance(self, seq1, seq2):
        """calculates 1 - pairwise distance between two sequences"""
        p = 0
        pairs = []
        # keep only the columns without gaps
        for x in zip(seq1, seq2):
            if '-' not in x:
                pairs.append(x)
        for (x, y) in pairs:
            if x != y:
                p += 1
        length = len(pairs)
        #assert length > 0, "AssertionError: perhaps your alignment contains only or too many gaps"
        try:
            dist = float(1 - p / length)  # '1 - p' to take plot 'upside down'
            return dist
        except ZeroDivisionError:
            # the window contains only gaps, so there is nothing to compare
            #print(e, ": perhaps your alignment contains only gaps")
            return None

    def _get_collect_sliced_left_right_borders(self, region=False):
        if region:
            assert region[0] < region[1], "the value of the first nucleotide position should be less than the second one"
            collect_sliced = []
            for rec in self._records:  # access to seq of the SeqRecord obj inside MultipleSeqAlignment
                sliced_seq = rec.seq[region[0]:region[1]]
                collect_sliced.append(SeqRecord(sliced_seq, id=rec.id, name=rec.name, description=rec.description))
            left_border = region[0]  # border for the first tick
            right_border = region[1]  # if region, 'right_border' is actual position
        else:
            collect_sliced = []
            for rec in self._records:  # access to seq of the SeqRecord obj inside MultipleSeqAlignment
                sliced_seq = rec.seq[:]
                collect_sliced.append(SeqRecord(sliced_seq, id=rec.id, name=rec.name, description=rec.description))
            left_border = 1  # border for
            # the first tick
            right_border = self.get_alignment_length()
        return collect_sliced, left_border, right_border

    def simgen(self, pot_rec, window=500, shift=250, region=False, dist='pdist'):
        """slices the alignment, collects the distance data, outputs the plot

        Parameters:
        -----------
        pot_rec: int
            the index of the sequence under study. use get_info() to find out.
            it is the index of the row in the alignment:
            starts with 0, like the 'x' dimension in the numpy array
        window: int
            sliding window size. 500 by default
        shift: int
            the step window slides downstream the alignment. 250 by default
        region: a tuple or a list of two integers or False by default
            the region of the alignment to analyze: the start and the end nucleotide positions.
            by default takes the whole alignment length
        dist: str
            'pdist' or 'k2p'
            pairwise or Kimura methods to calculate distance
        """
        assert window >= 1, "window parameter can't be negative or zero"
        assert shift >= 1, "shift parameter can't be negative or zero"
        assert dist in ("pdist", "k2p"), "dist parameter must be 'pdist' or 'k2p'"

        collect_sliced, left_border, right_border = self._get_collect_sliced_left_right_borders(region)
        self._align = MultipleSeqAlignment(collect_sliced)

        # creating tick labels for the plot
        #self._get_x_labels(left_border, right_border, shift)
        self._ticks = self._get_x_labels(left_border, right_border, shift)
        # calculating pairwise distance
        #self._move_window(window, pot_rec, shift, dist)
        self._distance = self._move_window(window, pot_rec, shift, dist)
        self._draw_simplot()

        #TODO
        # inter=True, when tick merge will be solved
        # add this parameter to choose between plotly and mpl
        #if inter:
        #    self._draw_simplot()
        #else:
        #    self._draw_simplot_mpl()

    def get_data(self, df=True):
        """returns distance data

        Parameters
        ---------
        df: bool
            True: returns pandas DataFrame object
            False: returns a tuple of the tick positions and a dictionary
            where keys are the sequence ids and values are distance data
        """
        if df:
            return pd.DataFrame(data=self._distance, index=self._ticks[1:]).T
        else:
            return self._ticks[1:], self._distance

    def get_info(self):
        """outputs
        information about the alignment: index (which is the row number),
        sequence names, and alignment lengths"""
        print("index:", "sequence id:", sep="\t")
        for counter, value in enumerate(self):
            print(counter, value.id, sep="\t")
        print("alignment length: ", self.get_alignment_length())
        #print(self)

    def save_data(self, path=False, out="csv", out_name="distance_data"):
        """saves the data spreadsheet as a csv or excel file

        Parameters
        ---------
        path: str
            output destination
        out: str
            output file format: "csv" or "excel"
        out_name: str
            output file name
        """
        import os

        df = pd.DataFrame(data=self._distance, index=self._ticks[1:]).T
        # assumption: `path` names the output directory;
        # without it the file goes to the current directory
        target = os.path.join(path, out_name) if path else out_name
        if out == "csv":
            df.to_csv(target + ".csv")
        elif out == "excel":
            # passing the file name lets pandas open, write and close the workbook
            df.to_excel(target + ".xlsx")
        else:
            print("invalid output file format")
recan-0.1.2/recan.egg-info/0000777000000000000000000000000013616464714013532 5ustar 00000000000000
recan-0.1.2/recan.egg-info/PKG-INFO0000666000000000000000000003020613616464714014630 0ustar 00000000000000
Metadata-Version: 2.1
Name: recan
Version: 0.1.2
Summary: recan: recombination analysis tool
Home-page: https://github.com/babinyurii/recan
Author: Yuriy Babin
Author-email: babin.yurii@gmail.com
License: MIT
Download-URL: https://github.com/babinyurii/recan/archive/v_0.1.2.tar.gz
Description: # recan

`recan` is a Python package for constructing genetic distance plots to explore and discover recombination events in viral genomes. This method has previously been implemented in the desktop tools RAT [1], Simplot [2] and RDP4 [8].

## Requirements

To use `recan`, you will need:
- Python 3
- Biopython
- plotly
- pandas
- Jupyter notebook

## Installation

To install the package via `pip`, run:

`$ pip install recan`

If you are going to use `recan` in JupyterLab, follow [the instructions to install the JupyterLab Plotly renderer](https://plot.ly/python/getting-started/#jupyterlab-support-python-35).

## Usage example

The package is intended to be used in a Jupyter notebook.
Import the `Simgen` class from the `recan` package:

```python
from recan.simgen import Simgen
```

Create an object of the `Simgen` class. To initialize the object, pass the path to your alignment in FASTA format as an argument:

```python
sim_obj = Simgen("./datasets/hbv_C_Bj_Ba.fasta")
```

The input data are taken from the article by Sugauchi et al. (2002) [3], which describes a recombination event observed in hepatitis B virus isolates.

The `Simgen` object has a `get_info()` method, which shows information about the alignment.

```python
sim_obj.get_info()
```

```
index:    sequence id:
0    AB048704.1_genotype_C_
1    AB033555.1_Ba
2    AB010291.1_Bj
alignment length:  3215
```

We have three sequences in our alignment. The `Simgen` class is based on the `MultipleSeqAlignment` class of the Biopython library. So we treat the alignment as an array with n_samples and n_features, where the samples are the sequences themselves and the features are the columns of nucleotides in the alignment. The index corresponds to the sequence. Note that indices start at 0.

After you've created the object, you can draw the similarity plot by calling its `simgen()` method. Pass the following parameters to the method:
- `window`: the sliding window size, i.e. the number of nucleotides the sliding window spans. It is 500 by default.
- `shift`: the step by which the window slides downstream along the alignment. It is 250 by default.
- `pot_rec`: the index of the potential recombinant.
All the other sequences will be plotted as a function of distance to that sequence. Use the `get_info()` method to look up the indices, especially if your alignment has many sequences.

The isolate of genotype Ba is a recombinant between viruses of genotype C and genotype Bj. Let's plot it, setting genotype Ba as the potential recombinant:

```python
sim_obj.simgen(window=200, shift=50, pot_rec=1)
```

![hbv_1](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/HBV_1_rec_C_B_annotated.PNG)

The potential recombinant itself is not shown in the plot, because the distances are calculated relative to it. The higher the plotted value (i.e. the closer to 1), the closer the sequence is to the potential recombinant, and vice versa. We can see the typical 'crossover' of the distances, which indicates a possible recombination event. The distance of one isolate 'drops down', whereas the distance of the other stays the same or even gets closer to the potential recombinant. This abrupt drop shows that recombination could have taken place.

The picture from the article is shown below. It is simply turned upside down relative to our plot: instead of a distance drop we see the distance rising. Here Bj 'goes away' from genotype C, whereas Ba keeps the same distance.

![Ba_Bj_C](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hbv_C_Bj_Ba.jpg)

By default the `simgen()` method plots the whole alignment. After initial exploration, we can take a closer look at a particular region by passing the `region` parameter, which slices the alignment. `region` must be a tuple or a list with two integers: the start and the end position of the alignment slice.

```python
region = (start, end)
```

```python
sim_obj.simgen(window=200, shift=50, pot_rec=1, region=(1000, 2700))
```

![hbv_slice_1](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hbv_slice_1.png)

To customize the plot, or just to export and store the data, use the `get_data()` method. `get_data()` returns a pandas DataFrame with sequences as samples and distances at the given positions as features.

```python
sim_obj.get_data()
```

![hbv_df_example](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hbv_df_example.png)

If the optional parameter `df` is set to `False`, `get_data()` returns a tuple containing a list of ticks and a dictionary of lists. Each dictionary key is a sequence id, and the list under each key contains the corresponding distances.

```python
positions, data = sim_obj.get_data(df=False)
```

```
print(positions)
[1050, 1100, 1150, 1200, 1250, 1300, 1350, 1400, 1450, 1500, 1550, 1600, 1650, 1700, 1750, 1800, 1850, 1900, 1950, 2000, 2050, 2100, 2150, 2200, 2250, 2300, 2350, 2400, 2450, 2500, 2550, 2600, 2650, 2700]

print(data)
{'AB048704.1_genotype_C_': [0.88, 0.935, 0.925, 0.955, 0.955, 0.965, 0.95, 0.935, 0.94, 0.92, 0.9299999999999999, 0.945, 0.925, 0.945, 0.96, 0.95, 0.975, 0.9733333333333334, 0.96, 0.96],
 'AB010291.1_Bj': [0.98, 0.975, 0.97, 0.97, 0.965, 0.95, 0.91, 0.88, 0.85, 0.83, 0.825, 0.865, 0.885, 0.9299999999999999, 0.98, 0.97, 0.98, 0.9733333333333334, 0.96, 0.96]}
```

Once you have the data, you can easily customize the plot with your favourite plotting library:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# distances as a DataFrame: one row per sequence, one column per position
df = sim_obj.get_data()

sns.set()
fig_dist1 = plt.figure(figsize=(20, 8))
plt.plot(df.loc["AB048704.1_genotype_C_", :], lw=7, alpha=0.7, label="AB048704.1_genotype_C_")
plt.plot(df.loc["AB010291.1_Bj", :], lw=7, alpha=0.7, label="AB010291.1_Bj")

plt.ylim(0.75, 1.05)
plt.title("similarity distance plot", fontsize=25)
plt.ylabel("distance relative to Ba", fontsize=20)
plt.xlabel("nucleotide position", fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)

# mark the putative recombination break points
plt.axvline(1750, alpha=0.5, color="red", lw=3, linestyle="dashed", label="putative recombination break points")
plt.axvline(2250, alpha=0.5, color="red", lw=3, linestyle="dashed")
plt.legend(prop={"size": 20})
plt.show()
```

![hbv_matplotlib](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hbv_matplotlib.png)

The `simgen()` method has an optional parameter `dist`, which selects the method used to calculate pairwise distance. By default it is set to `pdist`, so `simgen()` calculates a simple pairwise distance. To use the Kimura 2-parameter distance, set it to `k2p`:

```python
sim_obj.simgen(window=200, shift=50, pot_rec=1, region=(1000, 2700), dist='k2p')
```

To save the distance data in Excel or csv format, use the `save_data()` method:

```python
sim_obj.save_data(out="excel", out_name="hbv_distance_data")
```

If there are about 20 or 30 sequences in the input file and their names are long, the legend may hide the plot. So, to analyze many sequences at once, it's better to use short, concise sequence names. Like this:

![hbv_short_names](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/short_names.png)

To illustrate what typical breakpoints may look like, here are some examples of previously described recombinations in the genomes of different viruses. The FASTA alignments used are available in the [datasets folder](datasets).
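
Before the examples, a short aside on the two `dist` options described above. The following is a minimal sketch of both measures, not recan's own implementation: the function names `p_similarity` and `k2p_distance` are hypothetical, the sequences are assumed to be aligned and non-empty, and columns with gaps or ambiguous bases are simply skipped in the Kimura calculation. Since the plot shows values where 1 means "identical to the potential recombinant", the sketch reports the simple measure as a similarity and converts the Kimura distance with `1 - d`; that conversion is an assumption about how the values are scaled.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def p_similarity(seq_a, seq_b):
    """Share of identical columns between two aligned sequences (1.0 = identical)."""
    pairs = list(zip(seq_a.upper(), seq_b.upper()))
    matches = sum(1 for a, b in pairs if a == b)
    return matches / len(pairs)


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance: d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    # keep only columns where both sequences carry an unambiguous base
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    transitions = transversions = 0
    for a, b in pairs:
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    p = transitions / len(pairs)
    q = transversions / len(pairs)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


seq_ref = "AAAAAAAAAA"
seq_mut = "AAAAAAAAAG"  # one transition in ten columns
print(p_similarity(seq_ref, seq_mut))
print(1 - k2p_distance(seq_ref, seq_mut))
```

The Kimura model weights transitions and transversions differently, so for diverged windows its similarity comes out slightly lower than the simple share of matching columns.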

Putative recombinations in the 145000 bp genome of lumpy skin disease virus [4]:

![lsdv](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/lsdv_rec_sar.png)

Recombination in the HIV genome [5]:

![hiv](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hiv_rec_kal153.png)

HCV intergenotype recombinant 2k/1b [6]:

![hcv](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/hcv_2k_1b_rec.png)

Norovirus recombinant isolate [7]:

![norovirus](https://raw.githubusercontent.com/babinyurii/recan/master/pictures/norovirus_rec.png)

**references**

1. Recombination Analysis Tool (RAT): a program for the high-throughput detection of recombination. Bioinformatics, Volume 21, Issue 3, 1 February 2005, Pages 278–281, https://doi.org/10.1093/bioinformatics/bth500
2. https://sray.med.som.jhmi.edu/SCRoftware/simplot/
3. Hepatitis B Virus of Genotype B with or without Recombination with Genotype C over the Precore Region plus the Core Gene. Fuminaka Sugauchi et al. JOURNAL OF VIROLOGY, June 2002, p. 5985–5992. 10.1128/JVI.76.12.5985-5992.2002 https://jvi.asm.org/content/76/12/5985
4. Sprygin A, Babin Y, Pestova Y, Kononova S, Wallace DB, Van Schalkwyk A, et al. (2018) Analysis and insights into recombination signals in lumpy skin disease virus recovered in the field. PLoS ONE 13(12): e0207480. https://doi.org/10.1371/journal.pone.0207480
5. Liitsola, K., Holm, K., Bobkov, A., Pokrovsky, V., Smolskaya, T., Leinikki, P., Osmanov, S. and Salminen, M. (2000) An AB recombinant and its parental HIV type 1 strains in the area of the former Soviet Union: low requirements for sequence identity in recombination. UNAIDS Virus Isolation Network. AIDS Res. Hum. Retroviruses, 16, 1047–1053.
6. Smith, D. B., Bukh, J., Kuiken, C., Muerhoff, A. S., Rice, C. M., Stapleton, J. T., & Simmonds, P. (2014). Expanded classification of hepatitis C virus into 7 genotypes and 67 subtypes: Updated criteria and genotype assignment web resource. Hepatology, 59(1), 318–327.
   https://doi.org/10.1002/hep.26744
7. Jiang, X., Espul, C., Zhong, W.M., Cuello, H. and Matson, D.O. (1999) Characterization of a novel human calicivirus that may be a naturally occurring recombinant. Arch. Virol., 144, 2377–2387.
8. Martin, D. P., Murrell, B., Golden, M., Khoosal, A., & Muhire, B. (2015). RDP4: Detection and analysis of recombination patterns in virus genomes. Virus Evolution, 1(1), 1–5. https://doi.org/10.1093/ve/vev003

Keywords: DNA recombination,bioinformatics,genetic distance
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Build Tools
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Description-Content-Type: text/markdown

recan-0.1.2/recan.egg-info/SOURCES.txt
README.md
setup.cfg
setup.py
recan/__init__.py
recan/simgen.py
recan.egg-info/PKG-INFO
recan.egg-info/SOURCES.txt
recan.egg-info/dependency_links.txt
recan.egg-info/requires.txt
recan.egg-info/top_level.txt
test/test.py
test/test_results.py

recan-0.1.2/recan.egg-info/dependency_links.txt

recan-0.1.2/recan.egg-info/requires.txt
pandas
plotly
biopython
matplotlib

recan-0.1.2/recan.egg-info/top_level.txt
recan

recan-0.1.2/setup.cfg
[metadata]
description-file = README.md

[egg_info]
tag_build =
tag_date = 0

recan-0.1.2/setup.py
from os import path
# setuptools' setup() understands install_requires and
# long_description_content_type; distutils is gone from Python 3.12 onwards
from setuptools import setup

this_directory = path.abspath(path.dirname(__file__))
with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()

setup(
    name='recan',
    long_description=long_description,  # added to package readme on pypi
    long_description_content_type="text/markdown",  # added to package readme on pypi
    packages=['recan'],
    version='0.1.2',
    license='MIT',
    description='recan: recombination analysis tool',
    author='Yuriy Babin',
    author_email='babin.yurii@gmail.com',
    url='https://github.com/babinyurii/recan',
    download_url='https://github.com/babinyurii/recan/archive/v_0.1.2.tar.gz',
    keywords=['DNA recombination', 'bioinformatics', 'genetic distance'],
    install_requires=[
        'pandas',
        'plotly',
        'biopython',
        'matplotlib'
    ],
    classifiers=[
        'Development Status :: 4 - Beta',
        'Intended Audience :: Developers',
        'Topic :: Software Development :: Build Tools',
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
    ],
)

recan-0.1.2/test/test.py
# -*- coding: utf-8 -*-
"""
Created on Tue Oct 22 15:57:30 2019

@author: babin
"""
from Bio.Align import MultipleSeqAlignment
import unittest
from test_results import dist_whole_align_ref, dist_win_250_shift_100_ref, dist_whole_align_def_params_k2p
import sys
sys.path.append("..")
from recan.simgen import Simgen


class DistTestCase(unittest.TestCase):

    def setUp(self):
        self.sequences = [("AAAAAGGGGG", "AAAAAGGGGG"),
                          ("AAAAAAAAAT", "AAAAAAAAAA"),
                          ("AAAAATTTTT", "AAAAAAAAAA"),
                          ("ATTTTTTTTT", "AGGGGGGGGGG")]
        self.pdistances = [1.0, 0.9, 0.5, 0.1]
        self.ids = ["AB048704.1_genotype_C_", "AB033555.1_Ba", "AB010291.1_Bj"]
        self.default_shift = 250
        self.default_window = 500
        self.ticks_whole_align_ref = [1, 251, 501, 751, 1001, 1251, 1501, 1751, 2001, 2251, 2501, 2751, 3001, 3215]
        self.ticks_region_ref = [250, 500, 700]
        self.pot_rec = 1
        self.dist_reg_ref = {'AB048704.1_genotype_C_': [0.9319999999999999, 0.945],
                             'AB010291.1_Bj': [0.992, 0.975]}

        # whole alignment slice
        self.sim_obj = Simgen("./hbv_C_Bj_Ba.fasta")
        self.collect_sliced, self.left_border, self.right_border = self.sim_obj._get_collect_sliced_left_right_borders(region=False)
        self.whole_align_obj = MultipleSeqAlignment(self.collect_sliced)
        self.ticks_whole_align = self.sim_obj._get_x_labels(self.left_border, self.right_border, self.default_shift)
        # store alignment in _align attr of the sim_obj, otherwise move window will throw error
        self.sim_obj._align = MultipleSeqAlignment(self.collect_sliced)
        self.distance_whole_align = self.sim_obj._move_window(self.default_window, self.pot_rec, self.default_shift, "pdist")
        # changing window and shift params
        self.dist_whole_align_win_250_shift_100 = self.sim_obj._move_window(window=250, pot_rec=self.pot_rec, shift=100, dist="pdist")
        # k2p distance with the same params
        self.distance_whole_align_k2p = self.sim_obj._move_window(self.default_window, self.pot_rec, self.default_shift, "k2p")

        # region slice
        self.sim_obj_reg = Simgen("./hbv_C_Bj_Ba.fasta")
        self.collect_sliced_reg, self.left_border_reg, self.right_border_reg = self.sim_obj_reg._get_collect_sliced_left_right_borders(region=(250, 700))
        self.sliced_align_obj = MultipleSeqAlignment(self.collect_sliced_reg)
        self.ticks_region = self.sim_obj_reg._get_x_labels(self.left_border_reg, self.right_border_reg, self.default_shift)
        self.sim_obj_reg._align = MultipleSeqAlignment(self.collect_sliced_reg)
        self.dist_reg = self.sim_obj_reg._move_window(self.default_window, self.pot_rec, self.default_shift, "pdist")

    def test_get_collect_sliced_left_right_borders_whole_alignment_ids(self):
        for counter, value in enumerate(self.collect_sliced):
            self.assertEqual(value.id, self.ids[counter])

    def test_get_collect_sliced_left_right_borders_region_ids(self):
        for counter, value in enumerate(self.collect_sliced_reg):
            self.assertEqual(value.id, self.ids[counter])

    def test_get_collect_sliced_left_right_borders_whole_alignment_len(self):
        for i in self.collect_sliced:
            self.assertEqual(len(i.seq), 3215)

    def test_get_collect_sliced_left_right_borders_region_len(self):
        for i in self.collect_sliced_reg:
            self.assertEqual(len(i.seq), 450)

    def test_get_collect_sliced_left_right_borders_whole_alignment(self):
        self.assertEqual(self.left_border, 1)
        self.assertEqual(self.right_border, 3215)

    def test_get_collect_sliced_left_right_borders_region(self):
        self.assertEqual(self.left_border_reg, 250)
        self.assertEqual(self.right_border_reg, 700)

    def test_get_x_labels_whole_align_ticks(self):
        for counter, value in enumerate(self.ticks_whole_align):
            self.assertEqual(value, self.ticks_whole_align_ref[counter])

    def test_get_x_labels_region_ticks(self):
        for counter, value in enumerate(self.ticks_region):
            self.assertEqual(value, self.ticks_region_ref[counter])

    def test_move_window_whole_align_default_params(self):
        for key in self.distance_whole_align.keys():
            for counter, value in enumerate(self.distance_whole_align[key]):
                self.assertEqual(value, dist_whole_align_ref[key][counter])

    def test_move_window_whole_align_win_250_shift_100(self):
        for key in self.dist_whole_align_win_250_shift_100.keys():
            for counter, value in enumerate(self.dist_whole_align_win_250_shift_100[key]):
                self.assertEqual(value, dist_win_250_shift_100_ref[key][counter])

    def test_move_window_whole_align_def_params_k2p(self):
        for key in self.distance_whole_align_k2p.keys():
            for counter, value in enumerate(self.distance_whole_align_k2p[key]):
                self.assertEqual(value, dist_whole_align_def_params_k2p[key][counter])

    def test_move_window_region(self):
        for key in self.dist_reg.keys():
            for counter, value in enumerate(self.dist_reg[key]):
                self.assertEqual(value, self.dist_reg_ref[key][counter])

    def test_pdistance_method(self):
        for counter, value in enumerate(self.pdistances):
            dist = self.sim_obj._pdistance(self.sequences[counter][0], self.sequences[counter][1])
            self.assertEqual(round(dist, 1), value)


if __name__ == '__main__':
    unittest.main()

recan-0.1.2/test/test_results.py
# -*- coding: utf-8 -*-
"""
Created on Tue Oct 22 15:58:44 2019

@author: babin
"""

posits_def = [251, 501, 751, 1001, 1251, 1501, 1751, 2001, 2251, 2501, 2751, 3001, 3215]

dist_whole_align_ref = {
    'AB048704.1_genotype_C_': [0.88, 0.938, 0.914, 0.886, 0.89, 0.908, 0.938, 0.948, 0.948, 0.886, 0.852, 0.8580645161290322, 0.827906976744186],
    'AB010291.1_Bj': [0.968, 0.986, 0.946, 0.92, 0.94, 0.964, 0.95, 0.892, 0.914, 0.9359999999999999, 0.924, 0.935483870967742, 0.9255813953488372]}

dist_win_250_shift_100_ref = {
    'AB048704.1_genotype_C_': [0.87, 0.9, 0.9359999999999999, 0.924, 0.944, 0.944, 0.948, 0.888, 0.868, 0.86, 0.888, 0.9, 0.908, 0.88, 0.916, 0.924, 0.94, 0.96, 0.948, 0.9319999999999999, 0.944, 0.9359999999999999, 0.96, 0.9319999999999999, 0.864, 0.8200000000000001, 0.88, 0.892, 0.88, 0.844, 0.827906976744186, 0.8608695652173913, 0.9333333333333333],
    'AB010291.1_Bj': [0.95, 0.984, 0.988, 0.984, 0.98, 0.98, 0.98, 0.92, 0.896, 0.888, 0.928, 0.94, 0.96, 0.948, 0.976, 0.976, 0.968, 0.952, 0.896, 0.844, 0.86, 0.908, 0.976, 0.948, 0.916, 0.904, 0.9359999999999999, 0.948, 0.94, 0.9359999999999999, 0.9255813953488372, 0.9217391304347826, 0.8666666666666667]}

dist_whole_align_def_params_k2p = {
    'AB048704.1_genotype_C_': [0.8681719101219889, 0.9351731626008992, 0.9083728156043438, 0.8750271283550077, 0.879929128403318, 0.9015597329057567, 0.9351297624958606, 0.9459250442159328, 0.9459717143364927, 0.8760802380420646, 0.8343273948904422, 0.841497348083017, 0.8033200314745574],
    'AB010291.1_Bj': [0.9671530980992109, 0.9858456107911616, 0.9438329817983037, 0.9150569322625627, 0.9372918193486423, 0.9630251291666885, 0.9481456308045444, 0.8823622232289046, 0.9077377632214376, 0.9325670957791264, 0.919398127767968, 0.9323907045444492, 0.9211964811945209]}
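
The reference values above are raw floating-point results (note entries such as `0.9359999999999999`), and the tests compare them with exact equality while looping over the computed lists only, so a result list shorter than its reference would go unnoticed. A tolerance-based comparison that also checks keys and lengths may be more robust. The sketch below is only an illustration: the helper name `dist_dicts_close` and the tolerances are assumptions, not part of recan or its test suite.

```python
import math


def dist_dicts_close(actual, expected, rel_tol=1e-9, abs_tol=1e-12):
    """Return True when both dicts have the same keys, equal-length lists,
    and element-wise close values."""
    if actual.keys() != expected.keys():
        return False
    for key in expected:
        if len(actual[key]) != len(expected[key]):
            return False
        for got, want in zip(actual[key], expected[key]):
            if not math.isclose(got, want, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True


reference = {"AB010291.1_Bj": [0.992, 0.975]}
computed = {"AB010291.1_Bj": [0.992, 0.9750000000000001]}
print(dist_dicts_close(computed, reference))
```

Inside a `unittest.TestCase` the same idea could be written as `self.assertTrue(dist_dicts_close(result, reference))`, or element by element with `assertAlmostEqual`.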