pax_global_header00006660000000000000000000000064133010471140014504gustar00rootroot0000000000000052 comment=8ef170cc7693d4654a40ef82db60d4554e675b4d QCumber-2.3.0/000077500000000000000000000000001330104711400130445ustar00rootroot00000000000000QCumber-2.3.0/.conda_cache/000077500000000000000000000000001330104711400153315ustar00rootroot00000000000000QCumber-2.3.0/.conda_cache/this_file_only_exists_in_order_to_track_this_directory000066400000000000000000000000001330104711400305320ustar00rootroot00000000000000QCumber-2.3.0/.gitignore000077500000000000000000000000421330104711400150330ustar00rootroot00000000000000.Rhistory .idea .svn __pycache__ QCumber-2.3.0/.gitlab-ci.yml000066400000000000000000000020711330104711400155000ustar00rootroot00000000000000image: continuumio/miniconda3:latest cache: paths: - $CI_PROJECT_DIR/.conda_cache before_script: - conda update -y -n base conda # - "export _PYTHON_SYSCONFIGDATA_NAME='_sysconfigdata_m_linux_x86_64-linux-gnu'" # - conda update -y pip # - "grep allow_softlinks $CONDA_PREFIX/.condarc || echo allow_softlinks: False >> $CONDA_PREFIX/.condarc" # - conda update -y gxx_linux-64 || conda install gxx_linux-64 # - unset _PYTHON_SYSCONFIGDATA_NAME - conda update -y gcc_linux-64 || conda install gcc_linux-64 - export _PYTHON_SYSCONFIGDATA_NAME='_sysconfigdata_m_linux_x86_64-linux-gnu' # - source activate $CXX # - gcc --version - bash gitlab-ci.sh - source activate $CI_PROJECT_DIR/.conda_cache/qcumber - bash build.sh stages: - test test: only: - development_unstable stage: test script: - cd test # - grep MemTotal /proc/meminfo - ./test_qcumber2.py # - ./test_qcumber2.py > /dev/null 2>&1 & # - cat /sys/fs/cgroup/memory/memory.usage_in_bytes # - while true; do cat /sys/fs/cgroup/memory/memory.usage_in_bytes; sleep 3; done QCumber-2.3.0/CHANGELOG000066400000000000000000000036141330104711400142620ustar00rootroot00000000000000## [Unreleased] Activation of new file input comming in 2.4 Fix for kraken will be in 2.3.1 ## [2.3.0] - 2018-16-05 ### Added - Support for continous integration - Sample sheet generation (not used) #### Test script: - Verbosity Option - intoduced diffrent run levels for tests (low to high spec) (CI Node friendly) - remote data support ### Fixes - SE MODE abort because of regex issue #19 - TrimBetter regex issue caused it to trim everything ## [2.2.1] - 2018-20-03 ### Added #### Test script: - bash completion - regex testing - mapping - local real data tests (no validation yet) ### Changes - touch ups - zipped reference input works consistently - reexecution of rules is now based on their parameters #### Test script: - utility functions for manipulation of goldstandard ### Fixes - matplotlib issue with long filenames (tight_layout() does not like them) - ## [2.2.0] - 2018-24-01 ### Added - An extended test suite is introduced. For more information please see the readme, which you will find in the test folder (soon: see #15) #4 - Insert size estimation: Distribution plots in mapping folder for each sample and text files with average, min and max fragment length. In batch report: boxplot with fragment length distribution ### Changed - When using the option - - save for the Illumina Sequence Analysis Viewer, now all standard generated .xml files are required, as well as the InterOp folder. #14 - “Couldn’t rename sample files” warnings are not displayed anymore. - In single end data the plots in the batch report are now colored. - In the kraken plots the top 10 taxonomies are displayed, instead of all over 5%. The taxanomy “root” is silenced. 
#7 ## [2.1.1] ## [2.1.0] - 2017-31-12 ### Added - unclassified out option for kraken ### Changed - report generation and format of output. Fixes #6 and concurrency issue - Permission Updates ## [2.0.4] - 2017-17-11QCumber-2.3.0/LICENSE000077500000000000000000000172101330104711400140550ustar00rootroot00000000000000 GNU LESSER GENERAL PUBLIC LICENSE Version 3, 29 June 2007 Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License, supplemented by the additional permissions listed below. 0. Additional Definitions. As used herein, "this License" refers to version 3 of the GNU Lesser General Public License, and the "GNU GPL" refers to version 3 of the GNU General Public License. "The Library" refers to a covered work governed by this License, other than an Application or a Combined Work as defined below. An "Application" is any work that makes use of an interface provided by the Library, but which is not otherwise based on the Library. Defining a subclass of a class defined by the Library is deemed a mode of using an interface provided by the Library. A "Combined Work" is a work produced by combining or linking an Application with the Library. The particular version of the Library with which the Combined Work was made is also called the "Linked Version". The "Minimal Corresponding Source" for a Combined Work means the Corresponding Source for the Combined Work, excluding any source code for portions of the Combined Work that, considered in isolation, are based on the Application, and not on the Linked Version. The "Corresponding Application Code" for a Combined Work means the object code and/or source code for the Application, including any data and utility programs needed for reproducing the Combined Work from the Application, but excluding the System Libraries of the Combined Work. 1. Exception to Section 3 of the GNU GPL. You may convey a covered work under sections 3 and 4 of this License without being bound by section 3 of the GNU GPL. 2. Conveying Modified Versions. If you modify a copy of the Library, and, in your modifications, a facility refers to a function or data to be supplied by an Application that uses the facility (other than as an argument passed when the facility is invoked), then you may convey a copy of the modified version: a) under this License, provided that you make a good faith effort to ensure that, in the event an Application does not supply the function or data, the facility still operates, and performs whatever part of its purpose remains meaningful, or b) under the GNU GPL, with none of the additional permissions of this License applicable to that copy. 3. Object Code Incorporating Material from Library Header Files. The object code form of an Application may incorporate material from a header file that is part of the Library. You may convey such object code under terms of your choice, provided that, if the incorporated material is not limited to numerical parameters, data structure layouts and accessors, or small macros, inline functions and templates (ten or fewer lines in length), you do both of the following: a) Give prominent notice with each copy of the object code that the Library is used in it and that the Library and its use are covered by this License. 
b) Accompany the object code with a copy of the GNU GPL and this license document. 4. Combined Works. You may convey a Combined Work under terms of your choice that, taken together, effectively do not restrict modification of the portions of the Library contained in the Combined Work and reverse engineering for debugging such modifications, if you also do each of the following: a) Give prominent notice with each copy of the Combined Work that the Library is used in it and that the Library and its use are covered by this License. b) Accompany the Combined Work with a copy of the GNU GPL and this license document. c) For a Combined Work that displays copyright notices during execution, include the copyright notice for the Library among these notices, as well as a reference directing the user to the copies of the GNU GPL and this license document. d) Do one of the following: 0) Convey the Minimal Corresponding Source under the terms of this License, and the Corresponding Application Code in a form suitable for, and under terms that permit, the user to recombine or relink the Application with a modified version of the Linked Version to produce a modified Combined Work, in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source. 1) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (a) uses at run time a copy of the Library already present on the user's computer system, and (b) will operate properly with a modified version of the Library that is interface-compatible with the Linked Version. e) Provide Installation Information, but only if you would otherwise be required to provide such information under section 6 of the GNU GPL, and only to the extent that such information is necessary to install and execute a modified version of the Combined Work produced by recombining or relinking the Application with a modified version of the Linked Version. (If you use option 4d0, the Installation Information must accompany the Minimal Corresponding Source and Corresponding Application Code. If you use option 4d1, you must provide the Installation Information in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source.) 5. Combined Libraries. You may place library facilities that are a work based on the Library side by side in a single library together with other library facilities that are not Applications and are not covered by this License, and convey such a combined library under terms of your choice, if you do both of the following: a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities, conveyed under the terms of this License. b) Give prominent notice with the combined library that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. 6. Revised Versions of the GNU Lesser General Public License. The Free Software Foundation may publish revised and/or new versions of the GNU Lesser General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. 
If the Library as you received it specifies that a certain numbered version of the GNU Lesser General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that published version or of any later version published by the Free Software Foundation. If the Library as you received it does not specify a version number of the GNU Lesser General Public License, you may choose any version of the GNU Lesser General Public License ever published by the Free Software Foundation. If the Library as you received it specifies that a proxy can decide whether future versions of the GNU Lesser General Public License shall apply, that proxy's public statement of acceptance of any version is permanent authorization for you to choose that version for the Library. QCumber-2.3.0/QCumber-2000077500000000000000000000737351330104711400145060ustar00rootroot00000000000000#!/usr/bin/env python3 __author__ = 'LieuV' __version__ = "2.1.1" import argparse import re import getpass import warnings import os import json import sys from itertools import groupby from collections import OrderedDict import subprocess from pandas import read_csv import snakemake import datetime import yaml import input_utils # Set paths ADAPTER_PATH = "" # Adapter path from trimmomatic. # Should be set during installation KRAKEN_DB = subprocess.check_output("echo $KRAKEN_DB_PATH", shell=True).decode("utf-8").strip() if len(KRAKEN_DB) == 0: KRAKEN_DB = " " # insert space such that snakemake can handle empty value # Pattern for Illumina readnames base_pattern = ( r"(?P.*)_(?PL\d{3})_(?P(R1|R2))_(?P\d{3}).*") # --------------------------------------< Functions >---------------------------------------------------------- def main(): parser = argparse.ArgumentParser( formatter_class=argparse.RawDescriptionHelpFormatter, description="""\ -------------------------------------------------------------------- < QCumber > < Quality control and read trimming of NGS data > https://gitlab.com/RKIBioinformaticsPipelines/QCumber/ --------------------------------------------------------------------""", epilog=("Example usage: QCumber-2 --input fastq_folder" " --reference reference.fasta")) # ------------------------------------------------< INPUT >---------------------------------------------------------# group_input = parser.add_argument_group("Input") group_input.add_argument( '--input', '-i', dest='input', help=("input sample folder. Illumina filenames should end with" "___number, e.g. Sample_12_345_R1_001.fastq," " to find the right paired set."), required=False, nargs="+") group_input.add_argument( '--read1', '-1', dest='r1', help="Read 1 file", required=False) group_input.add_argument( '--read2', '-2', dest='r2', help="Read 2 file", required=False) group_input.add_argument( '--technology', '-T', dest='technology', choices=["Illumina", "IonTorrent", "PacBio"], required=False, help=("If not set, automatically determine technology and " "search for fastq and bam files. " "Set technology to IonTorrent if all files are bam-files," " else set technology to Illumina.")) group_input.add_argument( '--adapter', '-a', dest='adapter', choices=['TruSeq2-PE', 'TruSeq2-SE', 'TruSeq3-PE', 'TruSeq3-SE', 'TruSeq3-PE-2', 'NexteraPE-PE'], help="Adapter name for trimming. Default: all") mapping_exclusion = group_input.add_mutually_exclusive_group() mapping_exclusion.add_argument( '--reference', '-r', dest='reference', required=False, help=("Map reads against reference." 
+ " Reference needs to be in fasta-format.")) mapping_exclusion.add_argument( '--index', '-I', dest='index', required=False, help="Bowtie2 index if available.") group_input.add_argument( '--kraken_db', '-d', dest='kraken_db', help=("Custom Kraken database. Default value is taken from" " environment variable KRAKEN_DB_PATH. " "Default: %(default)s."), required=False, default=KRAKEN_DB) group_input.add_argument( '--kraken_classified_out', dest='kraken_classified_out', help=("Kraken (un)classified-out option." " If set, both the --classified-out" " and --unclassified-out option are set. " "Default: %(default)s."), required=False, default=False, action='store_true') group_optional = parser.add_argument_group("Optional steps") group_optional.add_argument( '--sav', '-w', dest='sav', required=False, help=("Illumina folder for SAV. Requires RunInfo.xml, RunParameter.xml" "and Interop folder.")) group_optional.add_argument( '--trimBetter', choices=["assembly", "mapping", "default"], help=("Optimize trimming parameter using 'Per sequence base content'" + " from fastqc. Not recommended for amplicons.")) group_optional.add_argument('--nokraken', '-K', action="store_true") group_optional.add_argument('--notrimming', '-Q', action="store_true") group_params = parser.add_argument_group("Parameter settings") group_params.add_argument( '--illuminaclip', '-L', dest='illuminaclip', default="2:30:10", help=('Illuminaclip option: ' '::.' 'Default: %(default)s')) group_params.add_argument( '--only_trim_adapters', '-A', action='store_true', help='If this option is selected, only adapters will be clipped') group_params.add_argument( '--minlen', '-m', default=50, dest='minlen', help=('Minlen parameter for Trimmomatic. Drops read short than minlen.' ' Default: %(default)s'), type=int) group_params.add_argument( '--trimOption', '-O', dest="trimOption", help=('Additional Trimmomatic input.' ' Default (if trimBetter is not set): SLIDINGWINDOW:4:20'), type=str) group_params.add_argument( '--trimBetter_threshold', '-b', dest='trimBetter_threshold', help=("Set -trimBetter to use this option.Default setting" " for Illumina: 0.15 and for IonTorrent: 0.25."), required=False, type=float) group_output = parser.add_argument_group("Output") group_output.add_argument('--output', '-o', dest='output', default="") group_output.add_argument( '--rename', '-R', dest="rename", required=False, help="TSV File with two columns: ") group_output.add_argument( '--save_mapping', '-S', action="store_true", default=False) parser.add_argument('--threads', '-t', dest='threads', default=4, type=int, help="Number of threads. Default: %(default)s") parser.add_argument( '--config', '-c', dest='config', help=("Configfile to run pipeline. " "Additional parameters in the commandline " "will override arguments in configfile." 
"If not given and config/config.txt exists in" "the directory of the QCumber-2 executable," "that file will be loaded by default.")) parser.add_argument('--version', '-v', action='version', version='%(prog)s v' + __version__) arguments, unknown_args = parser.parse_known_args() arguments = vars(arguments) if len(sys.argv) == 1: parser.print_help() sys.exit(1) configfile = arguments["config"] default_configfile = os.path.join( os.path.dirname(os.path.realpath(__file__)), "config", "config.txt") if not arguments["config"] and os.path.isfile(default_configfile): configfile = default_configfile if configfile: config_args = yaml.load(open(configfile, "r")) keep_args = dict() for arg in config_args.keys(): print(arg) if (arguments[arg] is None) or arguments[arg] == " ": arguments[arg] = config_args[arg] arguments["output"] = os.path.abspath(arguments["output"]) if arguments["only_trim_adapters"]: arguments["trimBetter"] = None if not os.path.isdir(arguments["output"]): os.mkdir(arguments["output"]) if arguments["reference"]: arguments["reference"] = os.path.abspath(arguments["reference"]) if arguments["sav"]: arguments["sav"] = os.path.abspath(arguments["sav"]) if arguments["rename"]: arguments["rename"] = os.path.abspath(arguments["rename"]) parameter = yaml.load( open(os.path.join( os.path.dirname(os.path.realpath(__file__)), "config", "parameter.txt"), "r")) check_input_validity(arguments) # Load adaptive filetypes qcumber_path = os.path.dirname(os.path.realpath(__file__)) sample_file_name = os.path.join(arguments["output"], "samples.yaml") with open(os.path.join(qcumber_path, 'filenames.yaml'), 'r') as filetype_h: filename_types = yaml.load(filetype_h) all_files = [] for file_or_dir in arguments["input"]: if os.path.isdir(file_or_dir): for root, dirs, files in os.walk(file_or_dir): for file in files: if os.path.getsize(os.path.join(root, file)) != 0: all_files.append(os.path.join(root, file)) elif os.path.isfile(file_or_dir): if os.path.getsize(file_or_dir): all_files.append(file_or_dir) # Get Parsed Samples fomr input utils module formats_found, discarded = input_utils.parse_sample_info( all_files, filename_types, ['pacbio', 'illumina_fastq']) try: illumina_data = formats_found['illumina_fastq'] # print(repr(format_known.mfrs).replace('>,', '>,\n')) except KeyError: exit('No samples found or none met criteria!!\n' 'These files were discarded:\n' '%s' % '\n'.join(discarded)) try: pacbio_data = formats_found['pacbio'] print('looking for pacbio data...') pac_samples = pacbio_data.get_samples() with open(sample_file_name.replace('.yaml', '_pacbio.yaml'), 'w') as sample_file: yaml.dump(pac_samples, sample_file, default_flow_style=False) print('looking for illumina data...') except KeyError: pass sample_dict = illumina_data.flatten_naive() # flatten naive just drops read info from sampel name # if it is paired end data. 
len_known = len(sample_dict) if False: try: salvaged_dict = illumina_data.leftovers.process_leftovers( rename=True, rename_start_index=len_known+1) sample_dict.update(salvaged_dict) except input_utils.AmbigiousPairedReadsError as err: eprint('Failed parsing files with unrecognized' ' naming convention\n', 'Reason:\n', err) # Write samples to working directory with open(sample_file_name, 'w') as sample_file: yaml.dump(sample_dict, sample_file, default_flow_style=False) type, samples, joined_samples, name_dict, join_reads = ( get_input(arguments, parameter)) get_defaults(arguments, parameter) force_run_list = [] os.makedirs(arguments["output"], exist_ok=True) config_file_path = os.path.join(arguments["output"], "config.yaml") #if os.path.isfile(config_file_path): # pass #else: with open(config_file_path, 'w') as config_fh: yaml.dump(arguments, config_fh, default_flow_style=False) # additional infos general_information = OrderedDict() general_information["User"] = getpass.getuser() general_information["QCumber"] = __version__ general_information["QCumber_path"] = os.path.dirname( os.path.realpath(__file__)) general_information["Execution time"] = datetime.datetime.now().ctime() system_info = os.uname() general_information["Operating system"] = OrderedDict() general_information["Operating system"]["System"] = system_info.sysname general_information["Operating system"]["Server"] = system_info.nodename general_information["Operating system"]["Operating version"] = ( system_info.version) general_information["Operating system"]["Release"] = system_info.release general_information["Operating system"]["Machine"] = system_info.machine general_information["Tool versions"] = OrderedDict() general_information["Tool versions"]["Python"] = re.sub("\n", "", sys.version) general_information["Tool versions"]["Snakemake"] = snakemake.__version__ general_information["Tool versions"]["FastQC"] = ( get_version("fastqc --version")) if not arguments["notrimming"]: general_information["Tool versions"]["Trimmomatic"] = ( get_version("trimmomatic -version", "trimmomatic")) if arguments["technology"] == "Illumina" and not arguments["adapter"]: general_information["adapter"] = ( os.path.join(os.path.dirname(os.path.realpath(__file__)), "config", "adapters.fa")) elif arguments["technology"] == "Illumina" and arguments["adapter"]: general_information["adapter"] = os.path.join(ADAPTER_PATH, (arguments["adapter"] + ".fa")) if arguments["reference"] or arguments["index"]: general_information["Tool versions"]["Bowtie2"] = ( get_version("bowtie2 --version")) if not arguments["nokraken"]: general_information["Tool versions"]["Kraken"] = ( get_version("kraken --version")) general_information["Sample information"] = OrderedDict() general_information["Sample information"]["type"] = type general_information["Sample information"]["samples"] = samples general_information["Sample information"]["join_reads"] = join_reads general_information["Sample information"]["join_lanes"] = joined_samples general_information["Sample information"]["rename"] = name_dict os.makedirs(os.path.join(arguments["output"], "QCResults", "_data"), exist_ok=True) general_information_file = os.path.join(arguments["output"], "QCResults", "_data", "general_information.json") json.dump(general_information, open(general_information_file, "w")) # Fixed: # if QCumber is run repeatedly, the rule bowtie_mapping # will not be executed. 
# hence it is necessary to force to run the rule everytime # if (arguments["save_mapping"]): # force_run = "--forcerun bowtie_mapping" # else: force_run = "--forcerun" cmd_string = ( "snakemake " "--configfile {workdir}/config.yaml " "--snakefile {snakefile} {additional_commands} " "--directory {workdir} " "--cores {cores} {targets} {force_run} " ).format( additional_commands=" ".join(unknown_args), snakefile=os.path.join(os.path.dirname(os.path.realpath(__file__)), "Snakefile"), workdir=arguments["output"], configfile=general_information_file, cores=arguments["threads"], targets='', # "QCResults/batch_report.html", force_run=force_run) print(cmd_string) process = subprocess.Popen( ("snakemake " "--configfile {workdir}/config.yaml " "--snakefile {snakefile} {additional_commands} " "--directory {workdir} " "--cores {cores} {targets} {force_run} " ' -R $(snakemake --list-params-changes ' # '--list-input-changes --list-code-changes ' '--configfile {workdir}/config.yaml ' '--snakefile {snakefile} {additional_commands} ' '--directory {workdir} ' '--cores {cores} {targets} {force_run})' ).format( additional_commands=" ".join(unknown_args), snakefile=os.path.join(os.path.dirname(os.path.realpath(__file__)), "Snakefile"), workdir=arguments["output"], configfile=general_information_file, cores=arguments["threads"], targets='', # "QCResults/batch_report.html", force_run=force_run), shell=True) process.wait() exit(process.returncode) def get_basename(abs_name): return os.path.basename(os.path.splitext( os.path.splitext(os.path.basename(abs_name))[0])[0]) def getFilenameWithoutExtension(string, getBase=False): if getBase: string = os.path.basename(string) string = os.path.splitext(string)[0] i = 0 while os.path.splitext(string)[-1] in [ ".gz", ".gzip", ".zip", ".bz", ".fasta", ".fastq", ".bam"]: string = os.path.splitext(string)[0] return string def get_setname(filename, base=True, grouping=True): """ Get Setname Args: filename (obj::`str`): filename; Kwargs: base (bool): transform filename to basename (default = True) grouping (bool): if not grouping, only return basename of file (default = True) Returns: """ if not grouping: return get_basename(filename) try: if base: filename = get_basename(filename) sep = iter([";", "\t"]) if arguments["rename"]: rename_dict = read_csv(arguments["rename"], index_col=0, header=None) while (rename_dict.columns.__len__() == 0 or sep.__length_hint__() != 0): rename_dict = read_csv(arguments["rename"], index_col=0, header=None, sep=next(sep)) if rename_dict.columns.__len__() == 0: warnings.warn( ("Problems reading rename-file %s. " "Valid delimiters are" " ';', ',' or '\\t'.") % arguments["rename"], UserWarning, stacklevel=2) new_name = [(x, rename_dict.ix[x][1]) for x in rename_dict.index if filename.startswith(x)] if len(new_name) != 1: warnings.warn( ("Could not find unique renames." 
" Found %s for %s") % (new_name, filename), UserWarning, stacklevel=2) filename = getFilenameWithoutExtension(filename, True) else: filename = filename.replace(new_name[0][0], new_name[0][1]) except: # print("Couldnt rename sample files.") pass try: paired_reads_pattern = base_pattern setname_pattern = re.search(paired_reads_pattern, os.path.basename(filename)) if setname_pattern: return setname_pattern.group("setname") else: return filename except: print("Problems getting samplenames: %s" % filename) return filename def get_defaults(arguments, parameter): if arguments["only_trim_adapters"]: return if arguments["trimBetter"] == "assembly": # if arguments["forAssembly"]: if not arguments["trimBetter_threshold"]: arguments["trimBetter_threshold"] = ( parameter["forAssembly." + arguments['technology']] ['trimBetter_threshold']) if not arguments["trimOption"]: arguments["trimOption"] = ( parameter["forAssembly." + arguments['technology']] ["trimOption"]) elif arguments["trimBetter"] == "mapping": # elif arguments["forMapping"]: if not arguments["trimBetter_threshold"]: arguments["trimBetter_threshold"] = ( parameter["forMapping." + arguments['technology']] ["trimBetter_threshold"]) if not arguments["trimOption"]: arguments["trimOption"] = ( parameter["forMapping." + arguments['technology']] ["trimOption"]) elif arguments["trimBetter"] == "default": arguments["trimBetter_threshold"] = (parameter["Trimmomatic"] ["trimBetter_threshold"]) if not arguments["trimOption"]: arguments["trimOption"] = parameter["Trimmomatic"]["trimOption"] if arguments["trimBetter_threshold"]: arguments["trimBetter_threshold"] = (parameter["Trimmomatic"] ["trimBetter_threshold"]) def get_version(cmd, jar=None): try: return re.match(r"(?P.*)\n", subprocess.check_output( cmd, shell=True, stderr=subprocess.PIPE ).decode("utf-8")).group("version") except: try: try: return subprocess.check_output( cmd, shell=True, stderr=subprocess.PIPE ).decode("utf-8") except: from debpackageinfo import get_upstream_version return get_upstream_version(jar) except: return "NaN" def get_input(arguments, parameter): """ Get Read Input Checks wether input reads are from Ion Torrent or Illumina, by checking file extensions of bam and/or fastq. Detects presence of read pair in input and returns type of reads used (single end / paired end reads) Args: None Returns: (obj::`str`): type - PE or SE (Paired Ends / Single Ends) (obj::`collections.OrderedDict`): sample_dict (obj::`collections.OrderedDict`): join_lanes (obj::`dict`): name_dict (obj::`dict`): join_reads """ bam_ext = [x.strip(" ") for x in parameter["Fileextension"]["bam"]] fastq_ext = [x.strip(" ") for x in parameter["Fileextension"]["fastq"]] sample_dict = OrderedDict() all_files = [] name_dict = {} join_reads = {} type = "PE" join_lanes = OrderedDict() if arguments["r1"]: assert os.path.getsize(arguments["r1"]) != 0, ( "File %s is empty." % arguments["r1"]) if any([arguments["r1"].endswith(ext) for ext in bam_ext]): arguments["technology"] = "IonTorrent" else: arguments["technology"] = "Illumina" if arguments["r2"]: assert os.path.getsize(arguments["r2"]) != 0, ( "File %s is empty." 
% arguments["r2"]) sample_dict[get_setname(arguments["r1"])] = ( [os.path.abspath(arguments["r1"]), os.path.abspath(arguments["r2"])]) name_dict[get_basename(arguments["r1"])] = ( get_setname(arguments["r1"]) + "_R1") name_dict[get_basename(arguments["r2"])] = ( get_setname(arguments["r2"]) + "_R2") else: type = "SE" sample_dict[get_setname(arguments["r1"])] = ( [os.path.abspath(arguments["r1"])]) name_dict[get_basename(arguments["r1"])] = ( get_setname(arguments["r1"])) else: if os.path.isdir(arguments["input"][0]): for root, dirs, files in os.walk(arguments["input"][0]): for file in files: if any([file.endswith(ext) for ext in fastq_ext + bam_ext]): if os.path.getsize(os.path.join(root, file)) != 0: all_files.append(os.path.join(root, file)) else: warnings.warn("Skip empty file %s" % file, stacklevel=2) else: all_files = arguments["input"] assert all_files, ( ("Check input again. " "No files found for pattern %s ") % arguments["input"]) if len([x for x in all_files if any([ext in x for ext in bam_ext]) ] ) == len(all_files): arguments["technology"] = "IonTorrent" else: arguments["technology"] = "Illumina" if (len(all_files) == 0): sys.exit(str(arguments["input"]) + " does not contain fastq or bam files.") # find read pairs all_files = sorted(list(all_files)) if all([re.search(base_pattern, x) for x in all_files]): for setname, files in groupby(all_files, key=lambda x: re.search( base_pattern, x ).group("setname")): read_pairs = dict() setname = get_setname(setname) for lane, lane_file in groupby(list(files), key=lambda x: re.search( base_pattern, x ).group("lane")): read_pairs[lane] = [] for readgroup, readfiles in groupby( list(lane_file), key=lambda x: re.search(base_pattern, x ).group("read")): readfiles = list(readfiles) if len(readfiles) != 0: if len(readfiles) > 1: concat_reads = ( "QCResults/tmp/join_reads/" + "_".join( re.search(base_pattern, os.path.basename( readfiles[0]) ).groups()[:-2]) + "_000.fastq.gz") join_reads[concat_reads] = [os.path.abspath(x) for x in readfiles] readfiles = concat_reads else: readfiles = os.path.abspath(readfiles[0]) read_pairs[lane].append(readfiles) # Multiple lanes if len(read_pairs) > 1: join_lanes[setname] = [] for key in sorted(read_pairs.keys()): samplename = setname + "_" + key sample_dict[samplename] = read_pairs[key] if len(read_pairs[key]) == 2: # type == "PE": name_dict[get_basename(read_pairs[key][0])] = ( get_setname(samplename) + "_R1") name_dict[get_basename(read_pairs[key][1])] = ( get_setname(samplename) + "_R2") else: type = "SE" name_dict[get_basename(read_pairs[key][0])] = ( get_setname(samplename, grouping=False)) if len(read_pairs) > 1: join_lanes[setname].append(samplename) else: # treat each file as sample print("Treat files as single end") sample_dict = OrderedDict( [[get_setname(x, grouping=False), [os.path.abspath(x)]] for x in all_files]) name_dict = dict( [get_basename(x), get_setname(x, grouping=False)] for x in all_files) type = "SE" return type, sample_dict, join_lanes, name_dict, join_reads def check_input_validity(arguments): if arguments["reference"]: ref_file = arguments["reference"] seq_record = "" if not os.path.exists(arguments["reference"]): sys.exit("Reference does not exist.") try: if ref_file[-2:] == "gz": with subprocess.Popen(["gzip", "-cd", ref_file], stdout=subprocess.PIPE) as gz_proc: seq_record = gz_proc.stdout.readline().decode() gz_proc.terminate() else: with open(arguments["reference"], "r") as ref_fh: seq_record = ref_fh.readline() assert seq_record.startswith(">"), ("Error: Reference" "file is 
not valid.") except AssertionError as e: sys.exit(e) except FileNotFoundError: sys.exit("Error: Reference file not found") # # Check validity of Kraken DB if not arguments["nokraken"]: if not os.path.exists(arguments["kraken_db"]): sys.exit("ERROR: %s does not exist.i" " Enter a valid database" " for kraken" % arguments["kraken_db"]) else: if "database.kdb" not in os.listdir(arguments["kraken_db"]): sys.exit("ERROR: database " + arguments["kraken_db"] + " does not contain necessary file database.kdb") # # Check input if arguments["input"]: if not all([os.path.exists(x[0]) for x in arguments["input"]]): sys.exit(str(arguments["input"]) + " does not exist.") else: if not arguments["r1"]: sys.exit("Pleaser enter an input file (--input or -1/-2)") if not os.path.isfile(arguments["r1"]): sys.exit(arguments["r1"] + " does not exist. Input file required." + " Use option -input or -1 / -2.") if arguments["r2"]: if not os.path.isfile(arguments["r2"]): sys.exit(arguments["r2"] + " does not exist.") if arguments["trimBetter_threshold"] and not arguments["trimBetter"]: sys.exit("--trimBetter must be set to use --trimbetter_threshold." " Add --trimBetter to your command " "or remove --trimbetter_threshold.") # --------------------------------------< main >------------------------------- if __name__ == "__main__": main() QCumber-2.3.0/Rscripts/000077500000000000000000000000001330104711400146555ustar00rootroot00000000000000QCumber-2.3.0/Rscripts/barplot.R000077500000000000000000000032001330104711400164410ustar00rootroot00000000000000options(warn=-1) require(jsonlite) require(ggplot2) args = commandArgs(trailingOnly=TRUE) convert2filename<- function(string, ext=".png"){ string<- gsub("\\[%\\]", "percentage", string) string<- gsub("\\[#\\]", "number", string) string<- gsub(" ", "_", string) return(paste(string, ext, sep="")) } summary_json<- jsonlite::fromJSON(args[1])$summary tablenames<- names(summary_json)[!names(summary_json) %in% c("images", "Trim Parameter")] summary_json<- summary_json[tablenames] #summary<- as.data.frame(read.csv(args[1])) for( i in tablenames[2:length(tablenames)]){ ggplot(summary_json, aes(x=summary_json[,"setname"], y = summary_json[,i]), environment = environment()) + geom_bar(stat = "identity", fill="#4593C1") + theme(axis.text.x=element_text(angle=90, hjust=1, vjust = 0.5), legend.position = "none") + ggtitle(i) + xlab("Sample")+ ylab(i) ggsave(paste(args[2], convert2filename(i), sep="/")) } temp_json <- data.frame(rbind(cbind( setname = summary_json$setname, type = "Total reads [#]", value= summary_json$`Total reads [#]`), cbind( setname = summary_json$setname, type = "Reads after trimming [#]", value= summary_json$`Reads after trimming [#]`) )) temp_json$value <- as.numeric(as.character(temp_json$value)) ggplot(temp_json, aes(x= )) ggplot(temp_json, aes(x=setname, y = value, by=type, fill=type))+ geom_bar(stat="identity",position = "identity", alpha= 0.9) + theme(axis.text.x=element_text(angle=90, hjust=1, vjust = 0.5)) + ggtitle("Number of Reads") + xlab("Sample")+ ylab("Number of Reads") ggsave(paste(args[2], "number_of_reads.png", sep="/")) QCumber-2.3.0/Rscripts/boxplot.R000077500000000000000000000022201330104711400164660ustar00rootroot00000000000000#!/usr/bin/env Rscript options(warn=-1) library(ggplot2) args = commandArgs(trailingOnly=TRUE) mytable<- read.csv(args[1], header = F) colnames(mytable) <- c("Sample", "Trimmed", "Read", "Value", "Count") mytable<- mytable[which(mytable$Count>0),] if(!any(is.na(mytable$Read))){ gp<- ggplot(mytable, 
aes(fill=Read,group=interaction(Sample, Trimmed, Read),x=Sample, y=Value, weight=Count)) gp<- gp+ geom_boxplot(outlier.size = 0.5) +theme(axis.text.x=element_text(angle=90, hjust=1, vjust = 0.5),legend.position="none") + scale_fill_manual(values=c("#E25845", "#4593C1")) + ggtitle(args[3]) + xlab(args[4])+ ylab(args[5]) }else{ gp<- ggplot(mytable, aes(fill="R1",group=Sample,x=Sample, y=Value, weight=Count)) gp<- gp+ geom_boxplot(outlier.size = 0.5) +theme(axis.text.x=element_text(angle=90, hjust=1, vjust = 0.5),legend.position="none") + scale_fill_manual(values=c("#E25845")) + ggtitle(args[3]) + xlab(args[4])+ ylab(args[5]) } if(length(unique(mytable$Sample))>15){ gp<-gp + facet_wrap(~ Trimmed, ncol=1) height=20 }else{ gp<-gp + facet_wrap(~ Trimmed) height=10 } ggsave(args[2],plot=gp, unit="cm")QCumber-2.3.0/Rscripts/sav.R000077500000000000000000000142521330104711400156000ustar00rootroot00000000000000library(savR) library(reshape2) args = commandArgs(trailingOnly=TRUE) project <- savR(args[1]) ################ ## Indexing ## ################ #total reads total_reads<- clusters(project, 1L) pf_reads<- pfClusters(project, 1L) ################ ## Plots ## ################ ## # Data By Cycle ## extraction<- extractionMetrics((project)) pdf("QCResults/SAV.pdf") # Data By Cycle, FWHM/All Lanes / Both surfaces / All Bases reshaped_extraction <- melt(extraction, measure.vars= c("FWHM_A","FWHM_C", "FWHM_T","FWHM_G")) FWHM<- (aggregate(reshaped_extraction$value, by=list(reshaped_extraction$cycle, reshaped_extraction$variable), FUN=mean)) colnames(FWHM) <- c("Cycles","FWHM", "Value") FWHM$FWHM<- sub("FWHM_","",FWHM$FWHM) ggplot(data=FWHM )+ geom_line( aes(x=Cycles , y =Value, color=FWHM)) + ggtitle("Data by Cycle - FWHM") + xlab("Cycle") + ylab("All bases FWHM") ggsave(paste(args[2], "/data_by_cycle_fwhm.png", sep="")) # Data By Cycle,Intensity /All Lanes / Both surfaces / All Bases reshaped_extraction <- melt(extraction, measure.vars= c("int_A","int_C", "int_T","int_G")) intensity<- (aggregate(reshaped_extraction$value, by=list(reshaped_extraction$cycle, reshaped_extraction$variable), FUN=mean)) colnames(intensity) <- c("Cycles","Intensity", "Value") intensity$Intensity<- sub("int_","", intensity$Intensity) ggplot(data=intensity )+ geom_line( aes(x=Cycles , y =Value, color=Intensity))+ ggtitle("Data By Cycle - Intensity")+ xlab("Cycle")+ylab("All bases intensity") ggsave(paste(args[2], "/data_by_cycle_intensity.png", sep="")) # Data By Cycle, %Base /All Lanes / Both surfaces / All Bases # corr<- correctedIntensities(project) corr[,seq(14,17)]<-round(corr[,seq(14,17)] / apply(corr[,seq(14,17)], 1, sum) *100,2) corr<- melt(corr, measure.vars= c("num_A","num_C", "num_T","num_G")) corr<-(aggregate(corr$value, by=list(corr$cycle, corr$variable), FUN=mean)) colnames(corr)<- c("Cycle", "Base", "Perc_Base") corr$Base<- sub("num_","", corr$Base) ggplot(corr) + geom_line(aes(x=Cycle, y= Perc_Base, color=Base)) + ylab("All Bases % Base") + ggtitle("Data by Cycle - % Base") ggsave(paste(args[2], "/data_by_cycle_base.png" , sep ="")) ## # Data By Lane ## tiles<- tileMetrics(project) # Density, Both Surfaces #pfBoxplot(project) # Generate a boxplot of the numbers of clusters and the number of Illumina pass-filter clusters per tile and lane dens <-(tiles[which(tiles$code==100 | tiles$code==101 ),]) dens[which(dens$code==100),]$code <- "Raw Clusters" dens[which(dens$code==101),]$code<- "PF Clusters" dens$value <- dens$value/1000 ggplot(data = dens , aes(x=lane, y=value, fill=code))+ geom_boxplot() + ggtitle("Data By 
Lane - Cluster Density") + xlab("Lane")+ylab("Cluster Density (K/mm2)") ggsave(paste(args[2], "/data_by_lane_cluster.png", sep="")) # Phasing, Both Surfaces, All Bases phasing_code <- seq(200, (200 + (length(project@reads)-1)*2),2) phasing <-(tiles[which(tiles$code %in% phasing_code) ,]) for(i in phasing_code){ cat(paste("Read ",((i-200)/2)+1)) phasing[which(phasing$code==i),]$code = paste("Read ",((i-200)/2)+1) } ggplot(data = phasing[which(phasing$value>0),] , aes(x=lane, y=value*100, fill=code))+ geom_boxplot() + ggtitle("Data By Lane - Phasing")+ xlab("Lane")+ ylab("% Phasing")+ scale_x_continuous(breaks = unique(phasing$lane)) ggsave(paste(args[2], "/data_by_lane_phasing.png", sep="")) # Pre-Phasing, Both Surfaces, All Bases prephasing_code <- seq(201, (201 + (length(project@reads)-1)*2),2) prephasing <-(tiles[which(tiles$code %in% prephasing_code) ,]) for(i in prephasing_code){ prephasing[which(prephasing$code==i),]$code = paste("Read ",((i-201)/2)+1) } ggplot(data = prephasing[which(prephasing$value>0),] , aes(x=lane, y=value*100, fill=code))+ geom_boxplot() + ggtitle("Data By Lane - Prephasing")+ xlab("Lane")+ ylab("% Prephasing") + scale_x_continuous(breaks = unique(prephasing$lane)) ggsave(paste(args[2], "/data_by_lane_prephasing.png", sep="")) ## # QScore Heatmap ## png(paste(args[2], "/qscore_heatmap.png", sep=""), height=1025, width = 2571, res = 200) qualityHeatmap(project, lane=seq(1,project@layout@lanecount) ,read=c(1,2))+ theme(axis.title.y = element_blank()) dev.off() qualityHeatmap(project, lane=seq(1,project@layout@lanecount) ,read=c(1,2))+ theme(axis.title.y = element_blank()) qualy<- qualityMetrics(project) qualy<- data.frame(apply(qualy, 2, as.numeric)) qualy_all<- melt(qualy, measure.vars= colnames(qualy)[4:ncol(qualy)]) qualy_all<- aggregate(qualy_all$value, by=list(qualy_all$variable), FUN=sum) colnames(qualy_all)<- c("QScore","Total") qualy_all$Total <- qualy_all$Total/1000000 qualy_all$QScore <- as.numeric(qualy_all$QScore) ggplot(qualy_all, aes(x=QScore, y = Total )) + geom_bar(stat="identity", aes(fill=QScore>=30)) + ylab("Total (million)") + geom_vline(aes(xintercept=30), linetype="dashed") + geom_text(aes(x=35, y=max(Total)-max(Total)*0.1 ,label=(paste("QScore >=30 \n", round(sum(qualy_all[which(qualy_all$QScore>=30),]$Total)/1000,2), "G \n", round(sum(qualy_all[which(qualy_all$QScore>=30),]$Total)/ sum(qualy_all$Total)*100,2), "%") ))) + ggtitle("QScore Distribution") + theme(legend.position="none") ggsave(paste(args[2], "/qscore_distr.png", sep="")) over_q30 <- which(colnames(qualy) =="Q30"):ncol(qualy) qualy_q30 <- as.data.frame(cbind(qualy[which(qualy$cycle>=25),"cycle"], apply(qualy[which(qualy$cycle>=25),over_q30],1, sum))) colnames(qualy_q30) <- c("cycle", "sum") sum_per_cycle <- cbind(qualy[which(qualy$cycle>=25),"cycle"], apply(qualy[which(qualy$cycle>=25),],1, sum)) colnames(sum_per_cycle) <- c("cycle", "sum") qualy_q30$sum <-100* qualy_q30$sum/ sum_per_cycle[,"sum"] ggplot(qualy_q30, aes(x=cycle, y = as.numeric(sum) )) + geom_point()+ ylab("% >=Q30") + ggtitle("Data by Cycle - %>=Q30") ggsave(paste(args[2], "/qscore_q30.png", sep="")) dev.off() QCumber-2.3.0/Snakefile000077500000000000000000000517731330104711400147100ustar00rootroot00000000000000__version__ = "2.0.0" include: "modules/init.snakefile" include: "modules/sav.snakefile" include: "modules/fastqc.snakefile" include: "modules/trimming.snakefile" include: "modules/mapping.snakefile" include: "modules/classification.snakefile" #-------------------< Helper functions 
>---------------------------------------------------------# from modules.json_output import write_summary_json, write_summary_json_new, get_fastqc_results, combine_csv, get_plot_type_names from modules.utils import which def trimming_input(wildcards): if not config["notrimming"]: if geninfo_config["Sample information"]["type"] == "PE": return expand("{path}/trimmed/{sample}_{read}_fastqc", path = fastqc_path, read=["R1", "R2"], sample=geninfo_config["Sample information"]["samples"]) else: return expand("{path}/trimmed/{sample}_fastqc", path = fastqc_path, sample=sample_dict.keys()) else: return None def get_input(wildcards, if_not, ext, samplelist =[], path = ""): if path !="": path += "/" if not config[if_not]: if samplelist: return expand("{path}{sample}{ext}" , path = path, ext=ext, sample=samplelist) return expand("{path}{sample}{ext}", path = path, ext=ext, sample=wildcards.sample) else: return "" def get_all_fastqc(wildcards, path = fastqc_path + "/raw"): ''' Generate raw sample names Note: I(Rene) believe that this should also return the _fastqc_data.txt files, because they are required by trimbetter and the summary. ''' return ["%s/%s_fastqc%s" % ( path, geninfo_config["Sample information"]["rename"][get_name(x)], fastqc_stat) for x in unique_samples[wildcards.sample] for fastqc_stat in ["","/fastqc_data.txt"]] def get_trimmomatic_fastqc(wildcards, ext, path = trimming_path): ''' Generate list of filepaths ending with read identifying string and _fastqc Returns: obj::`list` of filenames Example: ["Path/to/QCResults/FastQC/trimmed/Sample1_S1_L001_R1_fastqc", "Path/to/QCResults/FastQC/trimmed/Sample1_S1_L001_R2_fastqc"] ''' if config["notrimming"]: return [] paired = [] if geninfo_config["Sample information"]["type"]=="PE" and ext =="_fastqc": paired =["_R1","_R2"] if wildcards.sample in geninfo_config["Sample information"]["samples"].keys(): if paired: return expand("{path}/{sample}{paired}{ext}", sample= wildcards.sample, ext = ext, path = path, paired = paired) else: return expand("{path}/{sample}{ext}" , sample= wildcards.sample, ext = ext, path = path) else: if paired: return expand("{path}/{sample}{paired}{ext}", sample=(geninfo_config["Sample information"] ["join_lanes"][wildcards.sample]), ext = ext, path = path, paired = paired) else: return expand("{path}/{sample}{ext}", sample=(geninfo_config["Sample information"] ["join_lanes"][wildcards.sample]), ext = ext, path = path) assert False, "Something went wrong" def get_trimmomatic_pseudofile(wildcards): ''' Provides locations for pseudofiles used to force trimmomatic to run This used to be done with log files, which caused those to disappear in case of an error. These files have been used to report to get_trimmomatic_results(), but as they do not contain any data they produced bad values in report. 
''' if wildcards.sample in geninfo_config["Sample information"]["samples"].keys(): return expand("{path}/{sample}.trimmomatic.log" , sample= wildcards.sample, path=log_path) else: return expand("{path}/{sample}.trimmomatic.log", sample=(geninfo_config["Sample information"] ["join_lanes"][wildcards.sample]), path = log_path) def get_trimmomatic_params(wildcards): if wildcards.sample in geninfo_config["Sample information"]["samples"].keys(): return expand("{path}/{sample}.trimmomatic.params", sample = wildcards.sample, path=trimming_path) else: return expand("{path}/{sample}.trimmomatic.params", sample=(geninfo_config["Sample information"] ["join_lanes"][wildcards.sample]), path=trimming_path) def get_batch_files(wildcards): steps = {"summary_json": data_path + "/summary.json"} # if pdflatex is not installed on the system, skip pdf output files if which("pdflatex") is not None: steps["sample_report"] = expand("{path}/{sample}.pdf", sample=unique_samples.keys(), path=main_path) if config["sav"]: steps["sav"] = sav_results if not config["nokraken"]: steps["kraken_html"] = main_path + "/kraken.html" steps["kraken_png"] = classification_path + "/kraken_batch.png" return steps #--------------------------------------------< RULES >-----------------------------------------------------------------# rule run_all: input: main_path + "/batch_report.html", lambda wildcards: (( "%s/%s.sam" % (mapping_path, samp) for samp in unique_samples.keys()) if config["save_mapping"] else []) params: save_mapping = config["save_mapping"] rule write_final_report: input: unpack(get_batch_files) output: main_path + "/batch_report.html" run: #shell("cp {source} {output}", source = join(geninfo_config["QCumber_path"], "batch_report.html")) env = Environment( trim_blocks=True, variable_start_string='{{~', variable_end_string="~}}") env.loader = FileSystemLoader(geninfo_config["QCumber_path"]) template = env.get_template("batch_report.html") summary = json.load(open(str(input.summary_json), "r")) general_information = json.load( open( data_path + "/general_information.json", "r")) if config["sav"]: sav = json.load(open( str(input.sav), "r")) sav_json = json.dumps(sav) else: sav_json = [] #sav = json.load(open(str(input.general_information), "r"), object_pairs_hook=OrderedDict) geninfo_config["Commandline"] = cmd_input html = template.render( general_information= json.dumps(config), summary = json.dumps(summary["Results"]), summary_img = json.dumps(summary["summary_img"]), sav = sav_json ) html_file = open(str(output), "w") html_file.write(html) html_file.close() # Write PDF report for each sample def get_steps_per_sample(wildcards): ''' Get dictionary of steps required to write sample output sets up filenames required by rule "get_sample_json" These vary depending on the arguments provided by the user Affected by: notrimming | reference | nokraken | nomapping Returns: steps (obj::`dict`): dictonary of required steps key is obj::`str` step value is obj::`list`(obj::`str`) filenames ''' steps = {"raw_fastqc" : get_all_fastqc(wildcards)} if not config["notrimming"]: steps["trimming"]= get_trimmomatic_pseudofile(wildcards) steps["trimming_params"] = get_trimmomatic_params(wildcards) steps["trimming_fastqc"] = get_trimmomatic_fastqc( wildcards, "_fastqc", path=fastqc_path + "/trimmed") if config["reference"] or config["index"]: steps["mapping"] = get_input( wildcards, if_not="nomapping", ext=".bowtie2.log", samplelist=[], path=log_path) if not config["nokraken"]: steps["kraken"] = get_input( wildcards,if_not = "nokraken", 
ext=".csv", samplelist=[], path = classification_path ) # "{path}/{wildcards.sample}.kraken.png".format(path = classification_path, wildcards=wildcards) steps["kraken_log"] = get_input( wildcards,if_not = "nokraken", ext=".kraken.log", samplelist=[], path = log_path) return steps ''' raw_fastqc = get_all_fastqc, trimming =get_trimmomatic_log, trimming_params = lambda wildcards: get_trimmomatic_params(wildcards), trimming_fastqc = lambda wildcards: get_trimmomatic_fastqc(wildcards, "_fastqc", path = fastqc_path + "/trimmed"), mapping = lambda wildcards: get_input(wildcards,if_not = "nomapping", ext=".bowtie2.log",samplelist=[], path = log_path), kraken = lambda wildcards: get_input(wildcards,if_not = "nokraken", ext=".csv", samplelist=[], path = classification_path ), kraken_log = lambda wildcards: get_input(wildcards,if_not = "nokraken", ext=".kraken.log", samplelist=[], path = log_path) ''' def get_sample_json_output(): output = { "json": data_path + "/{sample}.json", "newjson" : data_path + "/{sample}_new.json", } for plot_type_name in get_plot_type_names(): output["samplecsv" + plot_type_name] = temp(data_path + "/{sample}_" + plot_type_name + ".csv") if not config["nokraken"]: output["kraken_plot"] = classification_path + "/{sample}.kraken.png" return output ''' ##### Note: Most run time bugs are some how involved with this rule ###### It calls getter functions from submodule snakefiles found in "./modules/" This rule has lots of side effects ''' rule write_sample_json: input: unpack(get_steps_per_sample) output: **get_sample_json_output() params: notrimming=config["notrimming"], nokraken=config["nokraken"], nomapping=config["nomapping"] message: "Write {wildcards.sample}.json" run: summary_dict = OrderedDict() summary_dict["Name"] = wildcards.sample summary_dict["Files"] = unique_samples[wildcards.sample] summary_dict["Date"] = datetime.date.today().isoformat() paired_end = geninfo_config["Sample information"]["type"] == "PE" fastqc_dict, total_seq ,overrepr_count, adapter_content = ( get_fastqc_results( parameter, (x for x in input.raw_fastqc if x[-4:] != ".txt" ), data_path , "raw", to_base64, paired_end=paired_end)) #"QCResults/Report/tmp" summary_dict["Total sequences"] = total_seq summary_dict["%Overrepr sequences"] = overrepr_count summary_dict["%Adapter content"] = adapter_content summary_dict["raw_fastqc_results"] = fastqc_dict if not params.notrimming: summary_dict.update(get_trimmomatic_result( list(input.trimming), list(input.trimming_params))) print(input.trimming) fastqc_dict, total_seq, overrepr_count, adapter_content = ( get_fastqc_results(parameter, input.trimming_fastqc, data_path,"trimmed", to_base64)) if fastqc_dict !=[]: summary_dict["trimmed_fastqc_results"] = fastqc_dict summary_dict["%Overrepr sequences (trimmed)"] = overrepr_count summary_dict["%Adapter content (trimmed)"] = adapter_content # sort dict order new_order = ["Name", "Files", "Date", "Total sequences", "#Remaining Reads","%Remaining Reads", "%Adapter content","%Adapter content (trimmed)", "%Overrepr sequences", "%Overrepr sequences (trimmed)", "raw_fastqc_results","trimmed_fastqc_results"] new_order.extend(list( set(summary_dict.keys()) - set(new_order))) summary_dict = OrderedDict( (key, summary_dict[key]) for key in new_order) if not params.nomapping: summary_dict.update(get_bowtie2_result(str(input.mapping))) summary_dict["Reference"] = config["reference"] if not params.nokraken: kraken_results = get_kraken_result( str(input.kraken), str(output.kraken_plot)) if kraken_results: 
summary_dict.update(kraken_results) kraken_log = "" with open(str(input.kraken_log),"r") as kraken_reader: for line in kraken_reader.readlines(): if "..." not in line: kraken_log +=line summary_dict["kraken_log"] = kraken_log json.dump(summary_dict, open(str(output.json), "w")) fastqc_dict, total_seq ,overrepr_perc, adapter_content = ( get_fastqc_results(parameter, (x for x in input.raw_fastqc if x[-4:] != ".txt" ), data_path , "raw", to_base64)) res = dict() res["Sample"] = dict() res["Sample"]["Name"] = wildcards.sample res["Sample"]["TS"] = total_seq res["Sample"]["PAC"] = adapter_content res["Sample"]["PORS"] = overrepr_perc res["Sample"]["POST"] = "N/A" res["Sample"]["PACT"] = "N/A" res["Sample"]["NRR"] = "N/A" res["Sample"]["PRR"] = "N/A" res["Sample"]["NAR"] = "N/A" res["Sample"]["PAR"] = "N/A" res["Sample"]["NC"] = "N/A" res["Sample"]["PC"] = "N/A" if not config["notrimming"]: fastqc_dict, total_seq, overrepr_perc, adapter_content = ( get_fastqc_results(parameter, input.trimming_fastqc, data_path,"trimmed", to_base64)) trimmomatic_results = get_trimmomatic_result(list(input.trimming), list(input.trimming_params)) res["Sample"]["POST"] = overrepr_perc res["Sample"]["PACT"] = adapter_content res["Sample"]["NRR"] = trimmomatic_results["#Remaining Reads"] res["Sample"]["PRR"] = trimmomatic_results["%Remaining Reads"] if not config["nomapping"]: mapping_result = get_bowtie2_result(str(input.mapping)) res["Sample"]["NAR"] = mapping_result["#AlignedReads"] res["Sample"]["PAR"] = mapping_result["%AlignedReads"] if not config["nokraken"]: kraken_results = get_kraken_result(str(input.kraken), str(output.kraken_plot)) if kraken_results is None: res["Sample"]["NC"] = "N/A" res["Sample"]["PC"] = "N/A" json.dump(res, open(str(output.newjson), "w")) def get_report_info(wildcards): steps = { "sample_json" : "{path}/{sample}.json".format( sample = wildcards.sample, path = data_path), "raw_fastqc" : get_all_fastqc(wildcards)} if not config["notrimming"]: try: trimmed_path = fastqc_path + "/trimmed" # ((fastqc_path + "/trimmed") # not needed and # missing parentheses # if not True # config["trimBetter"] # else (trimbetter_path + "/FastQC")) except KeyError: trimmed_path = fastqc_path + "/trimmed" steps["trimming_fastqc"]= get_trimmomatic_fastqc( wildcards, "_fastqc", path=trimmed_path) #if not config["nokraken"]: # steps["kraken"] = classification_path + "/{sample}.translated".format(sample = wildcards.sample) return steps rule write_sample_report: input: unpack(get_report_info) #sample_json = data_path + "/{sample}.json" output: temp(main_path + "/{sample}.aux"), pdf=main_path + "/{sample}.pdf", tex=temp(main_path + "/{sample}.tex") log: log_path + "/texreport.log" message: "Write {wildcards.sample}.pdf" run: env = Environment(trim_blocks = True, variable_start_string='{{~', variable_end_string = "~}}") env.loader = FileSystemLoader(geninfo_config["QCumber_path"]) template = env.get_template("report.tex") sample = json.load(open(str(input.sample_json),"r"), object_pairs_hook=OrderedDict ) if "Reference" in sample.keys(): sample["Reference"] = basename(sample["Reference"] ) sample["path"] = dirname(sample["Files"][0]) sample["Files"] = [basename(x) for x in sample["Files"]] # import pprint; pprint.pprint(sample) pdf_latex = template.render( #general_information=json.load(open(str(input.general.json),"r")), general_information=geninfo_config, sample=sample) latex = open(str(output.tex), "w") latex.write(pdf_latex) latex.close() #shell( "pdflatex -interaction=nonstopmode -output-directory=$(dirname 
{output.pdf}) {output.tex} -shell-escape 1>&2> {log}" ) with open(log[0], 'a') as f_log: with subprocess.Popen( ["pdflatex", "-interaction=nonstopmode", "-output-directory=%s" % dirname(output.pdf), output.tex], stdout=f_log, stderr=sys.stdout) as pdflatex_proc: pdflatex_proc.wait() # dont knopw how to get rid of this log # shell("mv {log} {mv_log}", log = str(output.pdf).replace(".pdf", # ".log"), # mv_log = str(log).replace("texreport.", # "." + wildcards.sample + ".")) rule write_kraken_report: input: kraken = lambda wildcards: get_input( wildcards, if_not = "nokraken", ext = ".csv", samplelist= unique_samples.keys() , path = classification_path) output: kraken_html = main_path + "/kraken.html" shell: "ktImportText {input.kraken} -o {output.kraken_html}" def get_files_of_all_steps(): steps = {"raw_fastqc": expand( "{path}/raw/{sample}_fastqc", sample=sample_dict.keys(), path=fastqc_path)} if not config["notrimming"]: steps["trimming"] = trimming_input if not config["nomapping"]: steps["mapping"] = lambda wildcards: get_input( wildcards, if_not="nomapping", ext=".sam", samplelist=unique_samples.keys(),path=mapping_path) if not config["nokraken"]: steps["kraken_png"] = classification_path + "/kraken_batch.png", steps["sample_json"] = expand( "{path}/{sample}.json", sample=unique_samples.keys(), path=data_path) return steps def get_batch_output(): ''' Creation of dictonary that stores the output of steps required to finish one batch summary_json: Path/2/_data/summary.json fastqc_plots: GC_content | length distribution | per sequence quality scores ''' steps = {} steps["summary_json"] = data_path + "/summary.json" steps["summary_json_new"] = data_path + "/summary_new.json" steps["fastqc_plots"] = list( expand("{path}/{img}.png", path="QCResults/_data", img=["Per_sequence_GC_content", "Per_sequence_quality_scores", "Sequence_Length_Distribution"]) ) steps["n_read_plot"] = "QCResults/_data/reads_after_trimming.png" if not config["nomapping"]: steps["mapping_plot"] = "QCResults/_data/mapping.png" steps["insertsize_plot"] = "QCResults/_data/insertsize.png" return steps def get_batch_report_input(): steps={} steps["sample_json"] = expand("{path}/{sample}.json", sample=unique_samples.keys(), path=data_path) steps["sample_json_new"] = expand("{path}/{sample}_new.json", sample=unique_samples.keys(), path=data_path) steps["samplecsv"] = expand(data_path + "/{sample}_{plot_type}.csv", sample=unique_samples.keys(), plot_type=get_plot_type_names()) if not config["nokraken"]: steps["kraken_batch"] = classification_path + "/kraken_batch.png" if config["reference"] or config["index"]: steps["insertsize"] = expand("{mapping_path}/{sample}_insertsizes.txt", sample=unique_samples.keys(), mapping_path=mapping_path) return steps # Write html report for all samples rule write_batch_report: input: #sample_json = expand("{path}/{sample}.json", sample=unique_samples.keys(), path=data_path) **get_batch_report_input() output: **get_batch_output() params: nokraken = config["nokraken"] run: combine_csv(input.samplecsv, data_path) fastqc_csv = expand("{path}/{img}.csv", path="QCResults/_data", img = ["Per_sequence_GC_content", "Per_sequence_quality_scores", "Sequence_Length_Distribution"]) write_summary_json(output, config, input, fastqc_csv, geninfo_config, boxplots, shell, get_name, to_base64) write_summary_json_new(output, input.sample_json_new) 
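# Rough sketch (an assumption, based only on how rule write_final_report above
# consumes the file): write_summary_json() from modules/json_output.py is expected
# to emit QCResults/_data/summary.json with at least the two top-level keys used
# there, e.g.
#
#   {
#       "Results":     [{"Name": "Sample1", "Total sequences": 123456, ...}, ...],
#       "summary_img": {"Per_sequence_GC_content": "<base64-encoded PNG>", ...}
#   }
#
# Everything else in that file is an implementation detail of the json_output module.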
QCumber-2.3.0/__init__.py000077500000000000000000000000001330104711400151460ustar00rootroot00000000000000QCumber-2.3.0/batch_report.html000077500000000000000000027510121330104711400164210ustar00rootroot00000000000000 QCumber Batch Report

[batch_report.html: only the template's visible text nodes survive here; the HTML
markup itself is not recoverable from this dump. The template lays out the
sections "Sequencer Information", "Sequencer Plots" and "Summary", followed by a
collapsible per-sample block ("Show details", {{sample.Name}}) with "Raw data"
and "Trimmed data" subsections, all filled through template placeholders such as
{{headline}}, {{table}}, {{attr}}, {{value}}, {{key}} and {{val}}.]
QCumber-2.3.0/build.sh000066400000000000000000000013301330104711400144740ustar00rootroot00000000000000#!/bin/bash # grep adapters from trimmomatic if [ -a "$CONDA_PREFIX/share/trimmomatic/adapters" ]; then cat $CONDA_PREFIX/share/trimmomatic/adapters/* > config/adapters.fa adapter_path=$CONDA_PREFIX/share/trimmomatic/adapters/; else cat $CONDA_PREFIX/../../../share/trimmomatic/adapters/* > config/adapters.fa adapter_path=$CONDA_PREFIX/../../../share/trimmomatic/adapters/; fi mkdir -p $CONDA_PREFIX/opt/qcumber/ cp -r * $CONDA_PREFIX/opt/qcumber/ #sed -i.bak "1c\#!$PYTHON" $CONDA_PREFIX/opt/qcumber/QCumber-2 sed -i.bak "s#ADAPTER_PATH = \"\"#ADAPTER_PATH = \"$adapter_path\"#g" $CONDA_PREFIX/opt/qcumber/QCumber-2 ln -s $CONDA_PREFIX/opt/qcumber/QCumber-2 $CONDA_PREFIX/bin/ chmod u+x $CONDA_PREFIX/bin/QCumber-2 QCumber-2.3.0/config/000077500000000000000000000000001330104711400143115ustar00rootroot00000000000000QCumber-2.3.0/config/__init__.py000077500000000000000000000000001330104711400164130ustar00rootroot00000000000000QCumber-2.3.0/config/parameter.txt000077500000000000000000000023201330104711400170320ustar00rootroot00000000000000FastQC: Per base sequence quality : per_base_quality.png Per sequence quality scores : per_sequence_quality.png Per base sequence content : per_base_sequence_content.png Per base GC content : per_base_gc_content.png Per sequence GC content : per_sequence_gc_content.png Per base N content : per_base_n_content.png Sequence Length Distribution: sequence_length_distribution.png Sequence Duplication Levels : duplication_levels.png Kmer Content : kmer_profiles.png Per tile sequence quality : per_tile_quality.png Adapter Content : adapter_content.png Fileextension: fastq: - .fastq.gz - .fastq - .fq - .fq.gz bam: - .bam Trimmomatic: illuminaClip_seedMismatch : 2 illuminaClip_simpleClip : 10 leading : 3 trailing : 3 trimOption : "SLIDINGWINDOW:4:20" trimBetter_threshold : 0.15 forAssembly.Illumina: trimBetter_threshold : 0.1 trimOption : "SLIDINGWINDOW:4:25" forAssembly.IonTorrent: trimBetter_threshold : 0.2 trimOption : "SLIDINGWINDOW:4:15" forMapping.Illumina: trimBetter_threshold : 0.15 trimOption : "SLIDINGWINDOW:4:15" forMapping.IonTorrent: trimBetter_threshold : 0.25 trimOption : "SLIDINGWINDOW:4:15" QCumber-2.3.0/environment/000077500000000000000000000000001330104711400154105ustar00rootroot00000000000000QCumber-2.3.0/environment/packages.yaml000077500000000000000000000010721330104711400200550ustar00rootroot00000000000000name: qcumber channels: - bioconda - johanneskoester - r - conda-forge - ostrokach dependencies: - R=3.3 - bioconductor-biocinstaller - bioconductor-savr - bitstring - bowtie2=2.3 - dwgsim - docopt - fastqc=0.11 - gcc - gzip - jinja2 - kraken=0.10 - krona - matplotlib=2.0 - numpy - pandas>=0.19 - python=3.6 - pyyaml=3.12 - r-ggplot2=2.2.1 - r-quantreg - r-reshape2 - r-stringi - samtools=1.3 - seaborn - trimmomatic=0.36 - xmltodict - snakemake=4.8 - glob2 # - pip: # - snakemake<=4.8 # - glob2 QCumber-2.3.0/filenames.yaml000066400000000000000000000136321330104711400157000ustar00rootroot00000000000000# Filename conventions for FastQ files ion_torrent_bam: main_exts: - '.bam' formats: - some_ion_torrent_format: main_sep: '_' format: sample_name: regex: '.+' illumina_fastq: main_exts: ['.fasta', '.fastq', '.fas', '.fna', '.fnq', '.fa'] secondary_exts: ['.gz', '.xz', '.bz2', '.lzma', '.lzo', '.lz', '.rz'] formats: - illumina_basel_fastq: main_sep: '_' id_fields: ['openBIS_id', 'sample_name'] format: openBIS_id: subf_num: 3 flowcell: {} lane_short: regex: 
'\d+' sample_name: subf_num: 2 index: subf_num: 2 sample_num: regex: 'S\d+' lane: regex: 'L\d+' read: regex: 'R\d+' running_num: regex: '\d+' mismatches_index: regex: 'MM_\d+' optional: True - illumina_legacy1_fastq: main_sep: '_' id_fields: ['sample_name', 'unknown1', 'date', 'unknown2'] id_sep: '_' format: sample_name: subf_num: 4 unknown1: {} date: subf_num: 3 unknown2: subf_num: 3 barcode: {} lane: regex: 'L\d+' read: regex: 'R\d+' running_num: regex: '\d+' - illumina_legacy2_fastq: main_sep: '_' id_fields: ['sample_name', 'unknown1', 'date', 'unknown2'] id_sep: '_' format: sample_name: subf_num: 4 unknown1: subf_num: 2 date: subf_num: 3 unknown2: subf_num: 3 barcode: {} lane: regex: 'L\d+' read: regex: 'R\d+' running_num: regex: '\d+' - illumina_someinstitution1_fastq: main_sep: '_' format: sample_name: subf_num: 3 sample_num: regex: 'S\d+' lane: regex: 'L\d+' read: regex: 'R\d+' running_num: regex: '\d+' - illumina_someinstitution1.2_fastq: main_sep: '_' format: sample_name: subf_num: 5 subf_sep: '-' sample_num: regex: 'S\d+' read: regex: 'R\d+' running_num: regex: '\d+' - illumina_someinstitution2_fastq: main_sep: '_' format: sample_name: subf_num: 3 subf_sep: '-' sample_num: regex: 'S\d+' read: regex: 'R\d+' running_num: regex: '\d+' - illumina_someinstitution2.1_fastq: main_sep: '_' format: sample_name: subf_num: 2 subf_sep: '-' sample_num: regex: 'S\d+' read: regex: 'R\d+' running_num: regex: '\d+' - illumina_standard_fastq: main_sep: '_' format: sample_name: {} sample_num: regex: 'S\d+' lane: regex: 'L\d+' read: regex: 'R\d+' running_num: regex: '\d+' - illumina_fallback1_fastq: main_sep: '_' format: sample_name: regex: '.+' sample_num: regex: 'S\d+' lane: regex: 'L\d+' read: regex: 'R\d+' running_num: regex: '\d+' - illumina_fallback2_fastq: main_sep: '_' format: sample_name: regex: '.+' lane: regex: 'L\d+' read: regex: 'R\d+' running_num: regex: '\d+' - illumina_fallback3_fastq: main_sep: '_' format: sample_name: regex: '.+' read: regex: 'R\d+' running_num: regex: '\d+' - sra_standard_paired_fastq: main_sep: '_' format: sample_name: regex: 'SRR.+' read: regex: '\d' - sra_standard_single_fastq: main_sep: '_' format: sample_name: regex: 'SRR.+' - dwgsim2_paired_bwa_fastq: main_sep: '\.' format: sample_name: regex: '.+' bwa_flag: regex: 'bwa' read: regex: 'read\d' pacbio: main_exts: ['.bax', '.h5', '.xml'] secondary_exts: ['.h5'] formats: - standard_pacbio: main_sep: '\.' format: sample_name: regex: '.+' file_selector: {} QCumber-2.3.0/gitlab-ci.sh000066400000000000000000000013761330104711400152420ustar00rootroot00000000000000#!/usr/bin/env bash # Inspired by OP of # https://stackoverflow.com/questions/48540257/ # caching-virtual-environment-for-gitlab-ci ENV_NAME="qcumber" if [[ -z "${CI_PROJECT_DIR}" ]]; then CI_PROJECT_DIR="." fi if [ ! -d "$CI_PROJECT_DIR/.conda_cache/$ENV_NAME" ]; then echo "Environment $ENV_NAME does not exist. Creating it now!" 
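    # First CI run for this cache directory: build the environment from
    # scratch; later runs take the else branch below and only update it
    # from packages.yaml.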
conda-env create -f environment/packages.yaml \ -p "$CI_PROJECT_DIR/.conda_cache/$ENV_NAME" else echo "Updating $ENV_NAME with packages.yaml" conda-env update -f environment/packages.yaml \ -p "$CI_PROJECT_DIR/.conda_cache/$ENV_NAME" fi echo "Activating environment: $CI_PROJECT_DIR/.conda_cache/$ENV_NAME" source activate $CI_PROJECT_DIR/.conda_cache/qcumber #"$CI_PROJECT_DIR/.conda_cache/$ENV_NAME" QCumber-2.3.0/input_utils.py000066400000000000000000001232431330104711400160020ustar00rootroot00000000000000# -*- coding: utf-8 -*- '''utils module for snakemake pipeline This module provides utility classes and functions that make it easy to handle nucliotide sequence data ''' from itertools import takewhile from collections import OrderedDict import sys import os.path import re import yaml import pprint ''' FastQ Illumina filenames FastQ with complex naming convention): _____ ____ openBIS ID contains underscores and "mismatches in index" field may or may not be present Example: __<6>__ ____<001>_.fastq.gz ([(openBIS_ID,3,'_'), flowcell_name, lane, (sample_name,['USER,ID'],'_'), sample_number] __<68928>__<6>__ ____<001>_.fastq.gz (?P)_(?P)_(?P)_(?P)_(?P)_(?P_(?P)_(?P)_(?P)_(?P).fastq.gz IlluminaStandard: ____ ''' # Exceptions # illumina_standard_filename = [ # ('sample_name', {}), ('sample_num', {'regex': 'S\d+'}), # ('lane', {'regex': 'L\d+'}), ('read', {'regex': 'R\d+'}), # ('running_num', {'regex': '\d+'})] class SequenceFile: filename = '' path = '' ext = '' ext2 = '' add_exts = [] def __init__(self, filename, path='', ext='', ext2='', add_exts=[]): self.filename = filename self.path = path self.ext = ext self.ext2 = ext2 self.add_exts = [] def __lt__(self, other): return (([self.filename, self.ext, self.ext2] + other) < ([other.filename, other.ext, other.ext2] + other.other)) def __str__(self): return ''.join([os.path.join(self.path, self.filename), self.ext, self.ext2] + self.add_exts) def __repr__(self): return ''.join([os.path.join(self.path, self.filename), self.ext, self.ext2] + self.add_exts) class Field(): name = '' regex = '' subf_num = 1 subf_sep = None optional = False def __init__(self, name, regex='', subf_num=1, subf_sep=None, optional=False): self.name = name self.regex = regex self.subf_num = subf_num self.subf_sep = subf_sep self.optional = optional def to_regex(self, field_sep): if self.subf_num < 1: name = self.name raise MalformedFieldError( 'Field %s has less than 1 entry' % (name)) elif self.subf_num == 1: field_regex = (r'[^%s]*' % field_sep if not self.regex else self.regex) re_pattern = r'(?P<%s>%s)' % (self.name, field_regex) else: sfs = (field_sep + (self.subf_sep if self.subf_sep is not None else '')) subf_sep = field_sep if self.subf_sep is None else self.subf_sep subf_regex = r'[^%s]*' % format(sfs) subf_regex = subf_sep.join([subf_regex]*self.subf_num) re_pattern = ( '(?P<%s>%s)' % (self.name, subf_regex)) return re_pattern class MalformedFieldError(Exception): '''Exception thrown if regex field is malformed ''' pass class AmbigiousPairedReadsError(Exception): ''' Exception thrown if read pairs can't be matched Paired Read data has at most two files that are grouped together If more are found, this esxception is raised. 
Example: S1_R1.fastq, S1_R2.fastq S1_R3.fastq > Exception S1_R1.fastq, S1_R2.fastq S1_R2.fastq > Exception S1_R1.fastq, S1_R2.fastq > OK ''' pass class UnknownExtensionError(Exception): '''Exception thrown if input sequence file has an unknown file extension If the pipeline only excepts nucleotide sequence files, like Fastas and Fastqs, inputting a mapping file or index will cause this exception to be thrown. ''' pass class IncongruentFieldValueError(Exception): '''Exception thrown if ther is a missmatch in a sample file grouping If a grouping of files, that belong to the same SampleId grouping, differ in a non variable field of the filename, this exception is raised ''' pass class SampleIDNotUniqueError(Exception): '''File exists more than once in sample file grouping This might be caused by a sample id that is not unique ''' pass class FormatMismatch(Exception): pass # Classes class MultiFileReads: def __init__(self): pass def get_files(self): pass def add_files(self): pass class IlluminaMFR(MultiFileReads): n_files = 0 uses_common_fields_as_id = False n_branches = 0 format_name = '' field_names = [] id_fields = [] id_string = '' id_sep = '' main_sep = '' var_fields = [] ignore_fields = [] regex_str = '' regex = None sample_dict = {} non_var_dict = {} common_fields_dict = {} default_field_values = {} def __init__(self, filename_rule, main_sep='_', id_sep='', id_fields=['sample_name'], var_fields=['lane', 'running_num'], format_name='new_format', default_field_values={'sample_num': 'S1', 'lane': 'L001', 'read': 'R1', 'running_num': '001'}): super().__init__() # Construct regex string self.regex_str = main_sep.join([ Field(name, **key_opts).to_regex(main_sep) for (name, key_opts) in filename_rule]) self.regex = re.compile(self.regex_str) self.field_names = [name for (name, _) in filename_rule] self.id_fields = id_fields self.id_sep = id_sep self.id_string = '' self.default_field_values = default_field_values self.var_fields = var_fields self.sample_dict = {} self.non_var_dict = {} self.format_name = format_name self.main_sep = main_sep self.n_files = 0 self.n_branches = 0 self.common_fields_dict = {} def __repr__(self): return 'IlluminaMFR<%s>' % ', '.join( ['format_name: %s' % self.format_name, '\nn_files: %i' % self.n_files, 'n_branches: %i' % self.n_branches, 'tree:\n%s' % pprint.pformat(self.sample_dict)]) def add_files(self, file_list, read_target=0): if isinstance(file_list, SequenceFile): file_list = [file_list] file_list = sorted(file_list) for file in file_list: try: field_dict = self.regex.match(file.filename).groupdict() except AttributeError: raise FormatMismatch( 'FormatMismatch: %s does not fit %s' % (file, self.format_name)) sample_dict = self.sample_dict read = None var_dict = {} if self.n_files == 0: self.common_fields_dict = dict((key, val) for (key, val) in field_dict.items() if key not in (self.var_fields + ['read'])) self.update_id_field() for field in self.field_names: f_value = field_dict[field] sample_key = field + ':' + f_value read_field = field == 'read' if read_field: read_id = f_value[-1] if not read_target or read_id == str(read_target): read = file.filename elif field in self.var_fields: if self.n_files == 0 or sample_key not in sample_dict: sample_dict[sample_key] = {} sample_dict = sample_dict[sample_key] var_dict[field] = f_value # print(var_dict) else: if self.n_files == 0: self.non_var_dict[sample_key] = 42 elif (not read_field and sample_key not in self.non_var_dict): raise IncongruentFieldValueError( 'File: %s\n of ID Group %s has' ' group missmatch in 
field: %s\n' ' with value %s' % ( file, self.id_string, field, f_value)) if read is None: read_id = 1 read = file.filename # if read is not None: # if 'read%s' % read_id in sample_dict: raise SampleIDNotUniqueError( 'Error: files \n%s\n%s\n' 'have the same variable arguments. ' % ( str(file), sample_dict['read'+str(read_id)])) var_dict['read%s' % read_id] = file if len(sample_dict) < 1: self.n_branches += 1 sample_dict.update(var_dict.copy()) self.n_files += 1 def update_id_field(self, id_sep=None, id_fields=None): if id_sep is not None: self.id_sep = id_sep if id_fields: self.id_fields = id_fields self.id_string = self.id_sep.join( [self.common_fields_dict[field] for field in self.id_fields]) if id_sep or id_fields: self.use_common_fields_as_id = False def set_format_name(self, new_name): self.format_name = new_name def get_files(self, undetermined=None): output = [] to_do = [self.sample_dict] while to_do: curr_node = to_do.pop() if not any(isinstance(next_node, dict) for next_node in curr_node.values()): output += [curr_node] else: to_do.extend([curr_node[n] for n in curr_node]) return output def use_common_fields_as_id(self): new_id_fields = [f for f in self.field_names if f not in (self.var_fields+['read'])] self.update_id_field(id_fields=new_id_fields) self.uses_common_fields_as_id = True class PacBioMFR(MultiFileReads): n_files = 0 n_branches = 0 format_name = '' field_names = [] id_fields = [] id_string = '' id_sep = '' main_sep = '' var_fields = [] ignore_fields = [] regex_str = '' regex = None sample_dict = {} non_var_dict = {} common_fields_dict = {} def __init__(self, filename_rule, main_sep='_', id_sep='', id_fields=['sample_name'], var_fields=['dummy'], format_name='new_format'): super().__init__() # Construct regex string self.regex_str = main_sep.join([ Field(name, **key_opts).to_regex(main_sep) for (name, key_opts) in filename_rule]) self.regex = re.compile(self.regex_str) self.field_names = [name for (name, _) in filename_rule] self.id_fields = id_fields self.id_sep = id_sep self.id_string = '' self.var_fields = var_fields self.sample_dict = {} self.non_var_dict = {} self.format_name = format_name self.main_sep = main_sep self.n_files = 0 self.n_branches = 0 self.common_fields_dict = {} def __repr__(self): return 'PacBioMFR<%s>' % ', '.join( ['format_name: %s' % self.format_name, '\nn_files: %i' % self.n_files, 'n_branches: %i' % self.n_branches, 'tree:\n%s' % pprint.pformat(self.sample_dict)]) def add_files(self, file_list, file_selector_target=''): if isinstance(file_list, SequenceFile): file_list = [file_list] file_list = sorted(file_list) for file in file_list: try: field_dict = self.regex.match(file.filename).groupdict() except AttributeError: raise FormatMismatch( 'FormatMismatch: %s does not fit %s' % (file, self.format_name)) sample_dict = self.sample_dict file_selector = None var_dict = {} if self.n_files == 0: self.common_fields_dict = dict((key, val) for (key, val) in field_dict.items() if key not in (self.var_fields + ['file_selector'])) self.update_id_field() for field in self.field_names: f_value = field_dict[field] sample_key = field + ':' + f_value file_selector_field = field == 'file_selector' if file_selector_field: file_selector_id = f_value if not file_selector_target or file_selector_id == str(file_selector_target): file_selector = file.filename elif field in self.var_fields: if self.n_files == 0 or sample_key not in sample_dict: sample_dict[sample_key] = {} sample_dict = sample_dict[sample_key] var_dict[field] = f_value else: if self.n_files == 0: 
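                        # First file of this ID group: remember every
                        # non-variable field value. The stored 42 is a dummy;
                        # non_var_dict is only used for membership tests below.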
self.non_var_dict[sample_key] = 42 elif (not file_selector_field and sample_key not in self.non_var_dict): raise IncongruentFieldValueError( 'File: %s\n of ID Group %s has' ' group missmatch in field: %s\n' ' with value %s' % ( file, self.id_string, field, f_value)) if 'file_selector_id' in sample_dict: raise SampleIDNotUniqueError( 'Error: files \n%s\n%s\n' 'have the same variable arguments. ' % ( file_selector, sample_dict[str(file_selector_id)])) if file_selector is not None: # var_dict[file_selector_id] = file if len(sample_dict) < 1: self.n_branches += 1 sample_dict.update(var_dict.copy()) self.n_files += 1 def update_id_field(self, id_sep=None, id_fields=None): if id_sep is not None: self.id_sep = id_sep if id_fields: self.id_fields = id_fields self.id_string = self.id_sep.join( [self.common_fields_dict[field] for field in self.id_fields]) def set_format_name(self, new_name): self.format_name = new_name def get_files(self, undetermined=None): output = [] to_do = [self.sample_dict] while to_do: curr_node = to_do.pop() if not any(isinstance(next_node, dict) for next_node in curr_node.values()): output += [curr_node] else: to_do.extend([curr_node[n] for n in curr_node]) return output class IonTorrentMFR(MultiFileReads): pass def test_fun(filename): '''testfunction used for debugging purposes imports first format of given filenames.yaml, builds and returns corresponding multi file read object ''' formats = get_formats_from_file(filename) basel = list(formats[0].values())[0] mfr = IlluminaMFR(basel['format'].items(), main_sep=basel['main_sep']) return mfr class SampleInfoPaired: ''' Information Container for Paired End Read Data Attributes: ID (str): first section of file base name shared by read pair READ1 (str): Absolute path of first half of read pairs READ2 (str): Absolute path to second half of read pairs exts (:obj:`list` of :obj:`str`): List of extensions used by read pair in correct order zip_exts (:obj:`list` of :obj:`str`): List of compression extensions used by pair (also in correct order) If they are empty strings, no compression extension is used ''' ID = '' READ1 = '' READ2 = '' exts = ['', ''] zip_exts = ['', ''] add_info = {} def __init__(self, r1, r2, id_str, exts=['.fastq', '.fastq'], zip_exts=['', ''], add_info={}): '''Initializer Reads in the attributes in the order read1, read2, ids, exts, zip, exts Kwargs: exts: default ['.fastq', '.fastq'] zip_exts: default ['', ''] ''' self.ID = id_str self.READ1 = r1 self.READ2 = r2 self.exts = exts self.zip_exts = zip_exts self.add_info = add_info def __str__(self): return 'SampleInfoPaired()' def __repr__(self): att = (self.ID, self.READ1, self.READ2, ','.join(self.exts), ','.join(self.zip_exts)) return ('SampleInfoPaired()' if not any(att) else '' % att) class SampleInfoSingle: ''' Information Container for Single End Read Data Attributes: ID (str): first section of read file base name READ1 (str): Absolute path to read file ext ( :obj:`str`): Extension used by read data file zip_ext (:obj:`str`): Compression extension used by read file If the string is empty, no compression extension is used ''' ID = '' READ1 = '' ext = '' zip_ext = '' add_info = {} def __init__(self, ids, r1, ext='.fastq', zip_ext='', add_info={}): '''Initializer Reads in the attributes in the order ids, read1, exts, zip, exts Kwargs: ext: default '.fastq' zip_ext: default '' ''' self.ID = ids self.READ1 = r1 self.ext = ext self.zip_ext = zip_ext self.add_info = add_info def __str__(self): return 'SampleInfoSingle()' def __repr__(self): att = (self.ID, 
self.READ1, self.ext, self.zip_ext) return ('SampleInfoSingle()' if not any(att) else '' % att) class PacBioSampleInfoRS_II: bax1 = '' bax2 = '' bax3 = '' metadata = '' bas = '' add_info = {} def __init__(self, ids, metadata='', bax1='', bax2='', bax3='', bas='', add_info={}): '''Initializer Reads in the attributes in the order ids, read1, exts, zip, exts Kwargs: ext: default '.fastq' zip_ext: default '' ''' self.ID = ids self.metadata = metadata self.bas = bas self.add_info = add_info self.bax1 = bax1 self.bax2 = bax2 self.bax3 = bax3 def __str__(self): return 'SampleInfoSingle()' class PacBioSampleInfoRS: ID = '' metadata = '' bas = '' add_info = {} def __init__(self, ids, metadata='', bas='', add_info={}): '''Initializer Reads in the attributes in the order ids, read1, exts, zip, exts Kwargs: ext: default '.fastq' zip_ext: default '' ''' self.ID = ids self.metadata = metadata self.bas = bas self.add_info = add_info def __str__(self): return 'SampleInfoSingle()' class ReferenceInfo: ''' Information Container for Genomic Reference Data Attributes: ID (str): reference file base name without extension REFERENCE (str): Absolute path to reference file ext ( :obj:`str`): Extension used by reference data file zip_ext (:obj:`str`): Compression extension used by reference file If the string is empty, no compression extension is used ''' ID = '' REFERENCE = '' ext = '' zip_ext = '' def __init__(self, id, reference, ext='.fna', zip_ext='.gz'): self.ID = id self.REFERENCE = reference self.ext = ext self.zip_ext = zip_ext def __str__(self): return 'ReferenceInfo()' def __repr__(self): att = (self.ID, self.REFERENCE, self.ext, self.zip_ext) return ('ReferenceInfo()' if not any(att) else '' % att) def eprint(*args, **kwargs): ''' print function that prints to stderr :return: returns nothing ''' print(*args, file=sys.stderr, **kwargs) def test_extension(filename, extension_list): ''' tests which extension a file uses Args: filename (:obj:`str`): name of file whose extension will get checked extension_list (:obj:`list` of :obj:`str`): list of extensions that the file will be checked against. should contain the dot and extensions that share a prefix should be sorted in ascending order Returns: (:obj:`str`): Extension used or '' if not found in extension_list ''' res = '' for ext in extension_list: if len(filename.split(ext)) == 2: res = ext break return res def parse_sample_info(sample_list, format_dict, use_common_fields_as_id=False, target_formats=['illumina_fastq']): '''Parses filenames and generates SampleInfoObjects Turns list of input files into read data containers. It finds pairs for paired end read data and determines sample ids, which compression extension is used (if any), and which nucleotide sequence file extension is used. It only accepts files that end in (otional:) Args: sample_list(:obj:`list` of :obj:`str`): list of filenames format_dict(:obj:`dict` nested and with various types): file naming conventions loaded from yaml file Kwargs: use_common_fields_as_id(:obj:`bool` default False): Ignores ID Fields specified in formats config and just uses all fields that are not marked as variable as ID target_formats(:obj:`list` of :obj:`str`): List of machine target_formats to check against Choose from: illumina_fastq, ion_torrent_bam, pacbio Returns: :obj:`dict` of :obj:`MFR_Collection` Returns dictionary of MFR_Collections found, with respective target_format as key value. 
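    Usage sketch (illustrative only; the yaml path and fastq names below are
    invented and not taken from this repository):

        formats = get_formats_from_file('filenames.yaml')
        mfrs, discarded = parse_sample_info(
            ['Sample1_S1_L001_R1_001.fastq.gz',
             'Sample1_S1_L001_R2_001.fastq.gz'], formats)
        illumina = mfrs['illumina_fastq']    # Illumina_MFR_Collection
        samples = illumina.flatten_rename()  # roughly {'S1': SampleInfoPaired}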
Example: result["illumina_fastq"] returns a Illumina_MFR_Collection() It also provides a dictionary entry for discarded files result["discarded"] Raises: UnknownExtensionError: If sequence file extension unknown AmbigiousPairedReadsError: If paired data has to many matching files ''' collections = {'ion_torrent_bam': IonTorrent_MFR_Collection, 'illumina_fastq': Illumina_MFR_Collection, 'pacbio': PacBio_MFR_Collection} sample_list = sorted(sample_list) mfr_samples = {} # Collection of detected multifile samples mfrs_found = dict() discarded = list() # Accumulate sample information and build samples dictinary for sample in sample_list: found_format_or_is_leftover = False for target_format_type in target_formats: if found_format_or_is_leftover: break seq_exts = format_dict[target_format_type]['main_exts'] format_list = format_dict[target_format_type]['formats'] # Select which mfr type to try mfr_type = collections[target_format_type].mfr_type try: zip_exts = format_dict[target_format_type]['secondary_exts'] except KeyError: zip_exts = [] used_ext = test_extension(sample, seq_exts) if not used_ext: continue raise UnknownExtensionError( 'Extension not recognized\n%s' % sample) sample_string, zipped = sample.split(used_ext) if zipped and not test_extension(zipped, zip_exts): continue path = os.path.dirname(sample) sample = os.path.basename(sample_string) # Get first section of file name for ID seq_file = SequenceFile(sample, path=path, ext=used_ext, ext2=zipped, add_exts=[]) # Check if known format mfr = find_format(seq_file, format_list, mfr_type) if target_format_type not in mfrs_found: mfrs_found[target_format_type] = ( collections[target_format_type]()) if issubclass(type(mfr), MultiFileReads): if use_common_fields_as_id: mfr.use_common_fields_as_id() mfrs_found[target_format_type].add(mfr, seq_file) else: mfrs_found[target_format_type].leftovers.add(sample, path, zipped, used_ext) found_format_or_is_leftover = True if not found_format_or_is_leftover: discarded.append(sample) # Process samples and build for id_str in mfr_samples: mfr = mfr_samples[id_str] print('MultiFileSample: ', id_str, 'format:', mfr.format_name) print(mfr.get_files()) return mfrs_found, discarded class MFR_Collection: mfrs = {} mfr_type = MultiFileReads def __init__(self): self.mfrs = {} def add(self, mfr, seq_file): pass ### ---- Illumina FastQ Collections ------ ### class Illumina_MFR_Collection(MFR_Collection): mfrs = {} leftovers = None mfr_type = IlluminaMFR def __init__(self): super().__init__() self.leftovers = Leftovers() self.mfrs = {} def add(self, mfr, seq_file): if mfr.id_string not in self.mfrs: self.mfrs[mfr.id_string] = mfr else: self.mfrs[mfr.id_string].add_files(seq_file) def flatten_rename(self, newIDPrefix='S', start_index=1): ''' ''' result = {} index = start_index for mfr_id in sorted(self.mfrs.keys()): mfr = self.mfrs[mfr_id] for samp_dict in mfr.get_files(): read_num = len([x for x in samp_dict if x[:-1] == 'read']) if read_num == 1: # if 'read1' not in samp_dict: # print(samp_dict, '\n', mfr.format_name) file = samp_dict['read1'] sample = SampleInfoSingle(mfr.id_string, os.path.abspath(str(file)), ext=file.ext, zip_ext=file.ext2) elif read_num == 2: file1 = samp_dict['read1'] file2 = samp_dict['read2'] sample = SampleInfoPaired(os.path.abspath(str(file1)), os.path.abspath(str(file2)), mfr.id_string, [file1.ext, file2.ext], [file2.ext2, file2.ext2]) else: raise AmbigiousPairedReadsError( 'To many files map together:\n%s' % repr(mfr).replace('>,', '>,\n')) result['%s%i' % (newIDPrefix, index)] = 
sample index += 1 return result def flatten_naive(self): result = {} for mfr_id in sorted(self.mfrs.keys()): mfr = self.mfrs[mfr_id] for samp_dict in mfr.get_files(): id_values = [] for field in mfr.field_names: if field in mfr.var_fields: id_values.append(samp_dict[field]) if field in mfr.common_fields_dict: id_values.append(mfr.common_fields_dict[field]) id_string = mfr.main_sep.join(id_values) read_num = len([x for x in samp_dict if x[:-1] == 'read']) if read_num == 1: # if 'read1' not in samp_dict: # print(samp_dict, '\n', mfr.format_name) file = samp_dict['read1'] sample = SampleInfoSingle(id_string, os.path.abspath(str(file)), ext=file.ext, zip_ext=file.ext2) elif read_num == 2: file1 = samp_dict['read1'] file2 = samp_dict['read2'] sample = SampleInfoPaired(os.path.abspath(str(file1)), os.path.abspath(str(file2)), id_string, [file1.ext, file2.ext], [file2.ext2, file2.ext2]) else: raise AmbigiousPairedReadsError( 'To many files map together:\n%s' % repr(mfr).replace('>,', '>,\n')) result[id_string] = sample return result def get_samples(self): result = {} for mfr_id in sorted(self.mfrs.keys()): mfr = self.mfrs[mfr_id] container = [] for samp_dict in mfr.get_files(): read_num = len([x for x in samp_dict if x[:-1] == 'read']) add_info = dict((x, y) for x, y in samp_dict.items() if x[:-1] != 'read') add_info.update(mfr.common_fields_dict) add_info['format'] = mfr.format_name if read_num == 1: # if 'read1' not in samp_dict: # print(samp_dict, '\n', mfr.format_name) try: file = samp_dict['read1'] sample = SampleInfoSingle(mfr.id_string, os.path.abspath(str(file)), ext=file.ext, zip_ext=file.ext2, add_info=add_info) except KeyError as err: eprint('get_verbose_samples(): cannot find read1\n ' 'culprit: %s' % repr(mfr).replace('>,', '>,\n')) elif read_num == 2: file1 = samp_dict['read1'] file2 = samp_dict['read2'] sample = SampleInfoPaired(os.path.abspath(str(file1)), os.path.abspath(str(file2)), mfr.id_string, [file1.ext, file2.ext], [file2.ext2, file2.ext2], add_info=add_info) else: raise AmbigiousPairedReadsError( 'To many files map together:\n%s' % repr(mfr).replace('>,', '>,\n')) container.append(sample) result[mfr_id] = container return result class Leftovers: samples = {} delims = [] num_files = 0 def __init__(self, delims=['_', '.', '+']): self.samples = {} self.delims = delims self.num_files = 0 def add(self, sample, path, zipped, used_ext): for delim in self.delims: sample_delim_split = sample.split(delim) sample_ID = sample_delim_split[0] if len(sample_delim_split) > 1: break # Sample not seen before num = len(self.samples) if sample_ID not in self.samples: lcp = 0 self.samples[sample_ID] = ([sample], [path], lcp, num, [zipped], [used_ext]) num += 1 # Sample already seen else: (prev_sams, prev_paths, prev_lcp, old_num, zippeds, exts) = ( self.samples[sample_ID]) # Use longest common prefix of files to determine read type # Note: Not safe if reads are fractured between different flow cell # lanes or tiles... 
this will break lcp = len(longest_common_prefix(prev_sams[0], sample)) self.samples[sample_ID] = ( prev_sams+[sample], prev_paths+[path], lcp, old_num, zippeds+[zipped], exts+[used_ext]) def process_leftovers(self, rename=True, rename_start_index=1): results = dict() for id_str in self.samples: sams, paths, lcp, num, zippeds, exts = self.samples[id_str] num += rename_start_index identical = all(lcp == len(x) for x in sams) if len(sams) == 2 and not identical: #print(sams) pair = [x[lcp] for x in sams] index = pair.index('1') ord_ext = [exts[index]] ord_zippeds = [zippeds[index]] read1 = os.path.join(paths[index], sams[index] + exts[index] + zippeds[index] if zippeds else '') read1 = os.path.abspath(read1) index = pair.index('2') ord_ext += [exts[index]] ord_zippeds += [zippeds[index]] read2 = os.path.join(paths[index], sams[index] + exts[index] + zippeds[index] if zippeds else '') read2 = os.path.abspath(read2) final_id_string = ('S%i' % num) if rename else id_str results[final_id_string] = SampleInfoPaired( read1, read2, id_str, exts=ord_ext, zip_exts=ord_zippeds) elif len(sams) == 1 or identical: read1 = os.path.join(paths[0], sams[0] + exts[0] + zippeds[0] if zippeds else '') final_id_string = ('S%i' % num) if rename else id_str results[final_id_string] = SampleInfoSingle( id_str, read1, ext=exts[0], zip_ext=zippeds[0]) else: # Here goes missing logic to deal with flow cell lanes and co #print(sams) raise AmbigiousPairedReadsError( 'Error: Found %i Files for Sample. Expected 1 or 2\n' 'Files for id: %s\n%s\n Flow cell Logic is currently missing' '' % (len(sams), id_str, '\n'.join(sams))) return results ### ------- Ion-Torrent Bam Collection -------- ### class IonTorrent_MFR_Collection(MFR_Collection): mfrs = {} mfr_type = IonTorrentMFR leftovers = None def __init__(self): super().__init__() leftovers = Leftovers() ### ------- PacBio h5 and meta.xml Collection ------- ### class PacBio_MFR_Collection(MFR_Collection): mfrs = {} mfr_type = PacBioMFR leftovers = None def __init__(self): super().__init__() self.mfrs = {} def add(self, mfr, seq_file): if mfr.id_string not in self.mfrs: self.mfrs[mfr.id_string] = mfr else: self.mfrs[mfr.id_string].add_files(seq_file) def get_samples(self): results = {} for mfr_id in sorted(self.mfrs.keys()): mfr = self.mfrs[mfr_id] container = [] for samp_dict in mfr.get_files(): bax_num = len([x for x in samp_dict if x in '123']) file_dict = dict((x if x not in '123' else 'bax%s' % x, str(y)) for x, y in samp_dict.items() if x in ['bas', 'metadata'] or y.ext == '.bax' and x in '123') add_info = dict((x, str(y)) for x, y in samp_dict.items() if x not in ['1', '2', '3', 'bas', 'metadata']) add_info.update(mfr.common_fields_dict) add_info['format'] = mfr.format_name file_dict['add_info'] = add_info if bax_num == 0 and ('bas' in file_dict): # if 'read1' not in samp_dict: # print(samp_dict, '\n', mfr.format_name) try: file = samp_dict sample = PacBioSampleInfoRS(mfr.id_string, **file_dict) except KeyError as err: eprint('get_verbose_samples(): cannot find read1\n ' 'culprit: %s' % repr(mfr).replace('>,', '>,\n')) results[mfr_id] = sample elif bax_num == 3: sample = PacBioSampleInfoRS_II(mfr.id_string, **file_dict) results[mfr_id] = sample else: #eprint('To many/less files map together:\n%s' % # repr(mfr).replace('>,', '>,\n')) pass return results def parse_reference_info(reference_list): '''Parsing reference file names into reference info objects Turns list of input files into reference data containers. 
It determines reference ids, which compression extension is used (if any), and which nucleotide sequence file extension is used. It only accepts files that end in (otional:) Args: reference_list(:obj:`list` of :obj:`str`): list of filenames Returns: :obj:`list` of :obj:`ReferenceInfo` Raises: UnknownExtensionError: If sequence file extension unknown ''' results = [] zip_exts = ['.gz', '.xz', '.bz2', '.lzma', '.lzo', '.lz', '.rz'] seq_exts = ['.fasta', '.fastq', '.fas', '.fna', '.fnq', '.fa'] for ref, num in zip(reference_list, range(1, len(reference_list)+1)): used_ext = test_extension(ref, seq_exts) if not used_ext: continue # raise UnknownExtensionError( # 'Extension not recognized\n%s' % ref) ref_id, zipped = ref.split(used_ext) if zipped and not test_extension(zipped, zip_exts): continue ref_id = os.path.basename(ref_id) results.append(('G%i' % num, ReferenceInfo(ref_id, os.path.abspath(ref), used_ext, zipped))) return results def find_format(file, formats, mfr_type): '''Checks which naming scheme fits ''' out = file for format_raw in formats: format_name = list(format_raw.keys())[0] settings_dict = list(format_raw.values())[0] mfr = mfr_type(settings_dict['format'].items(), main_sep=settings_dict['main_sep'], format_name=format_name) try: mfr.add_files(out) id_sep = (settings_dict['id_sep'] if 'id_sep' in settings_dict else None) id_fields = (settings_dict['id_fields'] if 'id_fields' in settings_dict else None) if id_sep is not None or id_fields: mfr.update_id_field(id_fields=id_fields, id_sep=id_sep) except FormatMismatch: # eprint('%s\n is not format: %s' % (file, format_name)) # eprint(mfr.regex_str) continue return mfr return out def get_formats_from_file(format_yaml): '''Loads a Yaml with fastq naming standards Warning: Currently depends on the transient feature of insert order preserving dictionaries (only python3.6) ''' with open(format_yaml, 'r') as format_fh: return ordered_load(format_fh) def ordered_load(stream, Loader=yaml.Loader, object_pairs_hook=OrderedDict): '''Stolen function to get ordered dictionary from yaml https://stackoverflow.com/questions/5121931/ in-python-how-can-you-load-yaml-mappings-as-ordereddicts/21048064#21048064 https://stackoverflow.com/users/650222/coldfix ''' class OrderedLoader(Loader): pass def construct_mapping(loader, node): loader.flatten_mapping(node) return object_pairs_hook(loader.construct_pairs(node)) OrderedLoader.add_constructor( yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, construct_mapping) return yaml.load(stream, OrderedLoader) def longest_common_prefix(str1, str2): '''longest common prefix of two strings ''' return [i[0] for i in takewhile(lambda x: (len(set(x)) == 1), zip(str1, str2))] QCumber-2.3.0/modules/000077500000000000000000000000001330104711400145145ustar00rootroot00000000000000QCumber-2.3.0/modules/__init__.py000077500000000000000000000000001330104711400166160ustar00rootroot00000000000000QCumber-2.3.0/modules/classification.snakefile000077500000000000000000000172151330104711400214030ustar00rootroot00000000000000############################# # READ CLASSIFICATION # ############################# cmap = matplotlib.cm.get_cmap('Set3') def get_kraken_result(filename, outputfile): level = "p" kraken_counts = read_csv(str(filename), sep="\t", header=None) if kraken_counts[0][0]==0 and kraken_counts[1][0]=="unclassified": with open(outputfile,"w") as empty_img: pass return None kraken_counts.columns = ["count", "root", "d", "p", "c", "o", "f", "g", "s"] nclassified = int(sum(kraken_counts[1:]["count"])) pclassified = 
round(100* kraken_counts[1:]["count"].sum() / kraken_counts["count"].sum(), 2) kraken_counts["perc"] = (100*kraken_counts["count"] / sum(kraken_counts["count"])).round(4) new_index = kraken_counts[level] for i in new_index[new_index != new_index].index: new_index.loc[i] = [x for x in kraken_counts.ix[i, ["root", "d", "p", "c", "o", "f", "g", "s"]] if x == x][-1] kraken_counts.index = list(new_index) kraken_counts.index.name = "Name" kraken_counts = kraken_counts.reset_index().groupby("Name").sum() kraken_counts = kraken_counts[kraken_counts["perc"]>0 ] kraken_counts["perc"].plot.bar(stacked = True,edgecolor='black', alpha=0.9) legend = plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) plt.savefig(outputfile, bbox_extra_artists=(legend,),bbox_inches="tight") return OrderedDict({ "kraken_img":outputfile,"#Classified": nclassified, "%Classified": pclassified , "kraken_results":kraken_counts[["perc","count"]].to_latex() }) #--------------------------------------------< RULES >-----------------------------------------------------------------# rule kraken_classification: input: db = config["kraken_db"], fastq_files = get_all_reads output: report = temp(classification_path + "/{sample}.krakenreport"), log = log_path + "/{sample}.kraken.log" log: log_path + "/{sample}.kraken.log" params: preload ="--preload", classified = classification_path + "/{sample}.classified_reads.fastq", unclassified = classification_path + "/{sample}.unclassified_reads.fastq" #type = "--fastq-input" threads: max_threads run: #fastq_files = [x for x in list(input.fastq_files) if getsize(x) != 0] fastq_files = [] for fastq in list(input.fastq_files): with gzip.open(fastq, "r") as fastq_file: if len(fastq_file.readlines()) ==0: shell("touch {output.report}") shell("echo 'Found empty input.' 
> {log} ") else: fastq_files.append(fastq) if len(fastq_files)!=0: if (config["kraken_classified_out"]): shell("kraken {params.preload} --threads {threads} --db {input.db} {input.fastq_files} --output {output.report} --classified-out {params.classified} --unclassified-out {params.unclassified} 2> {log}") else: shell("kraken {params.preload} --threads {threads} --db {input.db} {input.fastq_files} --output {output.report} 2> {log}") rule kraken_translate: input: db = config["kraken_db"], raw_report = classification_path + "/{sample}.krakenreport" output: classification_path + "/{sample}.translated" shell: "if [ -s {input.raw_report} ]; then kraken-translate {input.raw_report} --db {input.db} --mpa-format > {output}; else touch {output} ; fi" rule kraken_csv: input: translated = classification_path + "/{sample}.translated", kraken_log = log_path + "/{sample}.kraken.log" output: classification_path + "/{sample}.csv" run: if getsize(str(input.translated)) == 0: with open(str(output), "w") as out: out.write('0\tunclassified\t\t\t\t\t\t\t') else: kraken_log = open(str((input.kraken_log)),"r") for line in kraken_log.readlines(): pattern = re.match("\s+(?P\d+) sequences unclassified \((?P\d+.\d+)%\)", line) if pattern: unclassified = pattern.group("n_unclassified") report = read_csv(str(input.translated), header=None, sep="\t")[1].value_counts() kraken_counts = DataFrame() kraken_counts = kraken_counts.append( Series({"count": unclassified, "root": "unclassified"}, name="unclassified")) for x in report.index: temp = dict([x.split("__") for x in x.split("|") if x != "root"]) temp["count"] = report[x] if x == "root": temp["root"] = "root" for missing_col in (set(["count", "root", "d", "p", "c", "o", "f", "g", "s"]) - set(temp.keys()) ): temp[missing_col]="" kraken_counts = kraken_counts.append(Series(temp, name=x)) kraken_counts[["count", "root", "d", "p", "c", "o", "f", "g", "s"]].to_csv(str(output), sep="\t", index=False, header=False) rule kraken_batch_plot: input: expand(classification_path + "/{sample}.csv", sample= unique_samples.keys()) params: level = "s" output: csv = classification_path + "/kraken_batch_result.csv", png = classification_path + "/kraken_batch.png" run: kraken_summary = DataFrame() for sample in sorted(list(input)): kraken_counts = read_csv(str(sample), sep="\t", header=None) kraken_counts.columns = ["count","root", "d", "p","c","o","f","g","s"] kraken_counts["perc"] = (kraken_counts["count"] / sum(kraken_counts["count"])).round(4) * 100 # sort kraken_counts by perc and take the first 10 most abundant entries (possible to pass number pre config/parameter) kraken_counts.sort_values("perc", ascending=False, inplace=True) try: kraken_filtered = kraken_counts.head(10) except: kraken_filtered = kraken_counts new_index = kraken_filtered[params.level] for i in new_index[new_index !=new_index].index: new_index.loc[i] = [x for x in kraken_filtered.ix[i,["root", "d", "p","c","o","f","g","s"]] if x==x][-1] kraken_filtered = kraken_filtered["perc"] kraken_filtered.index = list(new_index) kraken_filtered.index.name = "Name" kraken_filtered = kraken_filtered.reset_index().groupby("Name").sum() kraken_filtered = kraken_filtered["perc"].append(Series({"other": (100 - sum(kraken_filtered["perc"]))})) kraken_filtered.name = re.search(classification_path + "/(?P.*).csv", str(sample)).group("sample") try: kraken_filtered.drop("unclassified", inplace = True) kraken_filtered.drop("root", inplace = True) except: pass #print("No column unclassified") kraken_summary = concat([kraken_summary, 
kraken_filtered], axis=1) # Sort again (with name=column with perc entries) such that written table for kraken batch is sorted by perc kraken_summary.sort_values(kraken_filtered.name, ascending=False, inplace=True) kraken_summary.to_csv(str(output.csv)) kraken_summary.T.plot.bar(stacked = True,edgecolor='black', title = "Classified reads by Kraken [%]", alpha=0.9) legend = plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) plt.savefig(str(output.png), bbox_extra_artists=(legend,), bbox_inches='tight') QCumber-2.3.0/modules/fastqc.snakefile000077500000000000000000000054201330104711400176640ustar00rootroot00000000000000import sys #--------------------------------------------< RULES >-----------------------------------------------------------------# # Run FastQC on raw data rule fastqc_raw: input: lambda wildcards: sample_dict[wildcards.sample] output: qc_summary=temp(fastqc_path + "/raw/{sample}_fastqc/fastqc_data.txt"), zip= fastqc_path + "/raw/{sample}_fastqc.zip", html = temp(fastqc_path + "/raw/{sample}_fastqc.html"), folder = temp(fastqc_path + "/raw/{sample}_fastqc"), message: "Run FastQC on raw data." threads: max_threads log: log_path + "/logfile.fastqc.txt" run: shell("fastqc {input} -o {path}/raw/ " "--extract -t {threads} >> {log} 2>&1 ", path = fastqc_path ) if ("{path}/raw/{name}_fastqc".format(name=get_name(str(input)), path=fastqc_path) != str(output.folder)): shell("rsync -r --remove-source-files" " {path}/raw/{name}_fastqc/* {output.folder}", name=get_name(str(input)), path=fastqc_path ) #shell("mv {/fastqc_data.txt " # "{output.qc_summary}", # name=get_name(str(input)), path=fastqc_path) shell("mv -f {path}/raw/{name}_fastqc.html {output.html}", name=get_name(str(input)), path=fastqc_path) shell("mv -f {path}/raw/{name}_fastqc.zip {output.zip}", name=get_name(str(input)), path=fastqc_path) rule fastqc_trimmed: input: trimming_path + "/{sample}.fastq.gz" output: fastqc_path + "/trimmed/{sample}_fastqc/fastqc_data.txt", fastqc_path + "/trimmed/{sample}_fastqc.zip", temp(fastqc_path + "/trimmed/{sample}_fastqc.html"), folder = temp(fastqc_path + "/trimmed/{sample}_fastqc"), threads: max_threads log: log_path + "/logfile.trimmed.fastqc.txt" #temp(log_path + "/{sample}.fastqc.trimmed.log") message: "Run FastQC for trimmed data." 
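    # The run block below first checks whether the gzipped input is empty
    # (zcat | head yields nothing); in that case it only touches the expected
    # outputs so downstream rules still find their files, otherwise it runs
    # FastQC with --extract into the trimmed FastQC folder.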
run: # print('sizes:', ' '.join(['%s:%i:' % (x,os.path.getsize(x)) # for x in input]), file=sys.stderr) shell( "if [ `zcat '{input}' | head -n 1 | wc -c ` -eq 0 ];" "then touch {output};" "else fastqc {input} " # "-Djava.awt.headless=true " "-o $(dirname {output.folder})" " --extract -t {threads} >> {log} 2>&1; fi ") rule trimmomatic_stats_2_csv: input: fastqc_path + "/raw/{sample}_{read}_fastqc_data.txt" output: temp(fastqc_path + "/raw/{sample}_{read}_fastqc_stat.csv") run: pass QCumber-2.3.0/modules/init.snakefile000077500000000000000000000176541330104711400173620ustar00rootroot00000000000000from os.path import basename, splitext, join, isfile, dirname, abspath, exists, getsize import glob import re import json import getpass import subprocess import sys import re import datetime from jinja2 import * from collections import OrderedDict import numpy import yaml import gzip import input_utils from pandas import (read_csv, Series, DataFrame, concat, to_numeric, MultiIndex, melt) import matplotlib matplotlib.use('Agg') from matplotlib import pyplot as plt import seaborn as sns import base64 try: plt.style.use('ggplot') except: pass try: from StringIO import StringIO except: from io import StringIO #wildcard_constraints: sample = "[^\/]+" qcumber_path = os.path.abspath(workflow.basedir) #import pdb; pdb.set_trace() main_path = "QCResults" log_path= main_path + "/_logfiles" data_path= main_path + "/_data" sav_path= main_path + "/SAV" fastqc_path= main_path + "/FastQC" trimming_path= main_path + "/Trimmed" trimbetter_path= main_path + "/trimBetter" # temp folder mapping_path= main_path + "/Mapping" classification_path= main_path + "/Classification" try: with open('samples.yaml', 'r') as sample_h: sample_info_new = yaml.load(sample_h) sample_info_new_complex = dict((x, y) for x, y in sample_info.items() if isinstance(y, list)) sample_info_new_simple = dict((x, y) for x, y in sample_info.items() if not isinstance(y, list)) except: pass # max threads for all rules max_threads = 10 with open(os.path.join(data_path,'general_information.json'),'r') as info: geninfo_config = json.load(info) onsuccess: if len(geninfo_config["Sample information"]["join_reads"]) != 0: try: shell("rm -r QCResults/tmp") except: pass if cmd_input["trimBetter"]: try: shell("rm -r {trimbetter_path}".format(trimbetter_path=trimbetter_path)) except: print("Could not remove %s" % trimbetter_path) #---------------------------------------------< Functions >------------------------------------------------------------# cmap = matplotlib.cm.get_cmap('Set3') # convert images to base64 def to_base64(file): with open(file, "rb") as imgfile: imgstring = base64.b64encode(imgfile.read()) return 'data:image/png;base64,' + imgstring.decode("utf-8") def bam_to_fastq(bamfile): return ["{path}/tmp/bam_to_fastq/{sample}.fastq".format(path=main_path, sample=get_name(bamfile))] def get_name(abs_name): new_name = basename(abs_name) while splitext(new_name)[-1] !="": new_name = splitext(new_name)[0] return new_name def get_all_reads(wildcards, raw = False): if config["notrimming"]: if any([x.endswith(".bam") for x in unique_samples[wildcards.sample]]): return bam_to_fastq(unique_samples[wildcards.sample][0]) # array of bam files should always be of length 1 else: # geninfo_config["Sample information"]["samples"][wildcards.sample] return unique_samples[wildcards.sample] elif raw: # get raw reads if any([x.endswith(".bam") for x in geninfo_config["Sample information"]["samples"][wildcards.sample]]): return bam_to_fastq(geninfo_config["Sample 
information"]["samples"][wildcards.sample][0]) else: return geninfo_config["Sample information"]["samples"][wildcards.sample] # get trimmed reads elif wildcards.sample in geninfo_config["Sample information"]["join_lanes"]: if (geninfo_config["Sample information"]["type"] == "PE"): return expand("{path}/{sample}.{read}.fastq.gz", read=["1P", "1U", "2P", "2U"], sample=geninfo_config["Sample information"]["join_lanes"][wildcards.sample], path = trimming_path) else: return expand("{path}/{sample}.fastq.gz", sample=geninfo_config["Sample information"]["join_lanes"][wildcards.sample], path = trimming_path) else: if (geninfo_config["Sample information"]["type"] == "PE"): return expand("{path}/{sample}.{read}.fastq.gz", read=["1P", "1U", "2P", "2U"], sample=wildcards.sample, path = trimming_path) elif (geninfo_config["Sample information"]["type"] == "SE"): return expand("{path}/{sample}.fastq.gz", sample=wildcards.sample, path = trimming_path) def get_total_number(filename): try: with open(filename, "r") as fastqc_data: for line in fastqc_data.readlines(): if line.startswith("Total Sequences"): return int(re.search("Total Sequences\s+(?P\d+)", line).group("reads")) except: return 0 # Plot BOXPLOTS boxplots = {"Per_sequence_quality_scores": { "title": "Per sequence quality scores", "ylab": "Mean Sequence Quality (Phred Score)", "xlab": "Sample"}, "Sequence_Length_Distribution":{ "title": "Sequence Length Distribution", "ylab": "Sequence Length (bp)", "xlab": "Sample"}, "Per_sequence_GC_content":{ "title": "Per sequence GC content", "ylab": "Mean GC content (%)", "xlab": "Sample"} } def plot_summary(csv, outfile): df = read_csv(csv, header=None, sep=",") df.columns = ["Sample", "Type", "Read", "Value", "Count"] # workaround for weighted boxplots new_df = DataFrame() for i in range(len(df.index)): if int(df.ix[i, "Count"]) != 0: new_df = new_df.append(DataFrame([df.ix[i, :-1]] * int(df.ix[i, "Count"])), ignore_index=True) print(new_df.columns) g = sns.FacetGrid(new_df, col="Type", size=4, aspect=.7) (g.map(sns.boxplot, "Sample", "Value", "Read") .despine(left=True) .add_legend(title = "Read")) plt.savefig(outfile) #------------------------------------------< make config files >-------------------------------------------------------# parameter = yaml.load(open(os.path.join(geninfo_config["QCumber_path"], "config", "parameter.txt"), "r")) sample_dict = dict([(geninfo_config["Sample information"]["rename"][get_name(x)], x) for x in sum(geninfo_config["Sample information"]["samples"].values(), [])]) if any([x for x in sum( geninfo_config["Sample information"]["samples"].values(), []) if x.endswith(".bam")]): rule bam_to_fastq: input: lambda wildcards: sample_dict[wildcards.sample] output: temp(main_path + "/tmp/{sample}.fastq") message: "Convert bam to fastq" run: shell("samtools bam2fq {input} > {output}") joined_samples = dict((x, sum( [geninfo_config["Sample information"]["samples"][val] for val in geninfo_config["Sample information"]["join_lanes"][x]], [])) for x in geninfo_config["Sample information"]["join_lanes"].keys()) unique_samples = dict(joined_samples, **dict( (x, geninfo_config["Sample information"]["samples"][x]) for x in geninfo_config["Sample information"]["samples"].keys() if x not in sum(geninfo_config["Sample information"]["join_lanes"].values(), []))) cmd_input = yaml.load(open("config.yaml","r")) if config["reference"] or config["index"]: config["nomapping"] = False else: config["nomapping"] = True rule preprocess_join_readfiles: input: lambda wildcards: geninfo_config["Sample 
information"]["join_reads"]["QCResults/tmp/join_reads/"+wildcards.sample ] output: temp("QCResults/tmp/join_reads/{sample}") shell: "cat {input} > {output}" QCumber-2.3.0/modules/json_output.py000077500000000000000000000372461330104711400174760ustar00rootroot00000000000000import json import re import sys from collections import OrderedDict from io import StringIO from matplotlib import pyplot as plt from os.path import basename, getsize, join from pandas import DataFrame, Series, read_csv, concat from shutil import copyfile def get_plot_type_names(): return [x.replace(" ", "_") for x in get_store_data() if x not in get_skip_csv()] def get_skip_csv(): return ["Adapter Content", "Overrepresented sequences"] def get_skip(): return ["Basic Statistics", "Kmer Content", "Overrepresented sequences"] def get_store_data(): return ['Sequence Length Distribution', 'Per sequence quality scores', 'Per sequence GC content', 'Adapter Content', 'Overrepresented sequences'] # convert result type to color (adapted to color blindness) def get_color(value): if (value == "pass"): return "green" elif (value == "fail"): return "red" else: return "orange" def combine_csv(inputcsvs, data_path): files = dict() for plot_type in get_plot_type_names(): files[plot_type] = open(data_path + "/" + plot_type + ".csv", "w") for csv in inputcsvs: f = open(csv, "r") s = f.read() f.close() for plot_type in get_plot_type_names(): if plot_type in csv: files[plot_type].write(s) break for plot_type in get_plot_type_names(): files[plot_type].close() # creates CSV used for boxplotting def createCSV(name, data, plot_type, path, trimmed, paired_end=True): df = read_csv(StringIO("".join(data)), sep="\t") name_groups = re.search(r"(?P.*)_(?P(R1|R2)).*", name) # if exists(filename): # summary = read_csv (filename, header = None, sep = ",") # else: # summary = DataFrame() summary = DataFrame() if plot_type == "Sequence Length Distribution": try: df["#Length"] = DataFrame(df["#Length"].str.split("-", expand=True), dtype=int).mean(axis=1) except: # Length is already of type int pass samplename = "NA" if name_groups and paired_end: samplename = name_groups.group("samplename") df = concat([ Series([name_groups.group("samplename")]* len(df.index), name = "Sample") , Series([trimmed ]* len(df.index), name="Trimmed"), Series([name_groups.group("read") ]* len(df.index), name="Read"), df ],axis =1) else: samplename = name df = concat([Series([name] * len(df.index), name="Sample"), Series([trimmed] * len(df.index), name="type"), Series([""] * len(df.index), name="read"), df],axis=1) if len(summary) !=0: summary.columns = df.columns filename = path + '/' + samplename + "_" + plot_type.replace(" ", "_") + ".csv" summary = concat([summary, df], axis=0, ignore_index = True) summary_path = open(filename, 'a') summary.to_csv(summary_path, header = False, index = False) summary_path.close() def get_fastqc_results(parameter, fastqc_path, outdir, type, to_base64, paired_end=True): """ Parses trimmomatics Sample(ID)_fastqc_data.txt files While parsing fastqc_data records are transcribed into a csv file Returns: fastq_results (obj::`dict`): Key (obj::`str`): Basename of File without "_fastqc" Value (obj::`dict`): Statistics Key (obj::`str`): Name of stat Value: value (int): number of sequences (double): Percentage of overrepresented sequences """ skip = get_skip() store_data = get_store_data() total_seq = 0 overrepr_count = 0 adapter_content = [] fastqc_results = OrderedDict() data = [] store = False for fastqc in fastqc_path: sample = OrderedDict() name = 
basename(fastqc).replace("_fastqc","") sample["img"]= OrderedDict() if getsize(fastqc) != 0: with open(join(fastqc, "fastqc_data.txt"), "r") as fastqc_data: for line in iter(fastqc_data): if line.startswith("Total Sequences"): sample["Total Sequences"] = int(line.strip().split("\t" )[-1]) total_seq += int(line.strip().split("\t")[-1]) elif line.startswith('>>END_MODULE'): if len(data) > 0: if key[0] in store_data: if key[0] == 'Adapter Content': ac = max(read_csv(StringIO("".join(data)), sep="\t", index_col=0 ).max().round(2)) adapter_content.append(ac) sample["%Adapter content"] = ac elif key[0] == "Overrepresented sequences": ors= read_csv(StringIO("".join(data)), sep="\t")["Count"].sum() overrepr_count += ors sample["%Overrepr sequences"] =int(ors) else: createCSV(name, data, key[0], outdir, type, paired_end=paired_end) data = [] store = False elif line.startswith('>>'): key = line.split("\t") key[0] = key[0].replace(">>", "") if key[0] in store_data: store = True if not key[0] in skip: img = {} img["color"] = get_color(key[1].replace("\n","")) try: img["base64" ] = to_base64( join(fastqc, "Images", parameter["FastQC"][key[0]])) img["path"] = join(fastqc, "Images", parameter["FastQC"][key[0]]) sample["img"][key[0]] = img except Exception as exp: print(exp) pass #img["base64"] = "" if store and not line.startswith(">>"): data.extend([line]) fastqc_results[name] = sample try: adapter = round(sum(adapter_content)/len(adapter_content) ,2) except: adapter = 0 if total_seq == 0: overrepr = 0 else: overrepr = round(100*overrepr_count / total_seq ,2) #print(fastqc_results) return fastqc_results, int(total_seq), overrepr , adapter def write_sample_json(outfilename, samplename, snakeinput, cmd_input): # Fastqc_dict is historical, omitted from output for now, but still present in call to # maintain tuple unpacking order fastqc_dict, total_seq ,overrepr_perc, adapter_content = ( get_fastqc_results( (x for x in snakeinput.raw_fastqc if x[-4:] != ".txt" ), data_path , "raw" )) res = dict() res["Sample"] = dict() res["Sample"]["Name"] = samplename res["Sample"]["TS"] = total_seq res["Sample"]["PAC"] = adapter_content res["Sample"]["PORS"] = overrepr_perc res["Sample"]["POST"] = "N/A" res["Sample"]["PACT"] = "N/A" res["Sample"]["NRR"] = "N/A" res["Sample"]["PRR"] = "N/A" res["Sample"]["NAR"] = "N/A" res["Sample"]["PAR"] = "N/A" res["Sample"]["NC"] = "N/A" res["Sample"]["PC"] = "N/A" if not cmd_input["notrimming"]: fastqc_dict, total_seq, overrepr_perc, adapter_content = ( get_fastqc_results(input.trimming_fastqc, data_path,"trimmed")) trimmomatic_results = get_trimmomatic_result(list(snakeinput.trimming), list(snakeinput.trimming_params)) res["Sample"]["POST"] = overrepr_perc res["Sample"]["PACT"] = adapter_content res["Sample"]["NRR"] = trimmomatic_results["#Remaining Reads"] res["Sample"]["PRR"] = trimmomatic_results["%Remaining Reads"] if not cmd_input["nomapping"]: mapping_results = get_bowtie2_result(str(snakeinput.mapping)) res["Sample"]["NAR"] = mapping_result["#AlignedReads"] res["Sample"]["PAR"] = mapping_result["%AlignedReads"] if not cmd_input["nokraken"]: kraken_results = get_kraken_result(str(snakeinput.kraken), str(snakeoutput.kraken_plot)) res["Sample"]["NC"] = kraken_results["#Classified"] res["Sample"]["PC"] = kraken_results["%Classified"] json.dump(res, open(outfilename, "w")) def write_summary_json_new(output, sample_json): res = dict() res["Headers"] = dict() res["Headers"]["Name"] = "Sample name" res["Headers"]["TS"] = "Total sequences" res["Headers"]["PAC"] = "% Adapter content" 
res["Headers"]["PORS"] = "% Overrepresented sequences" res["Headers"]["POST"] = "% Overrepresented sequences (trimmed)" res["Headers"]["PACT"] = "% Adapter content (trimmed)" res["Headers"]["NRR"] = "# Remaining reads" res["Headers"]["PRR"] = "% Remaining reads" res["Headers"]["NAR"] = "# Aligned reads" res["Headers"]["PAR"] = "% Aligned reads" res["Headers"]["NC"] = "# Classified" res["Headers"]["PC"] = "% Classified" res["Samples"] = [] for sample in list(sample_json): res["Samples"] += [json.load(open(str(sample)))["Sample"]] json.dump(res, open(str(output.summary_json_new), "w")) def write_summary_json(output, cmd_input, ruleinput, fastqc_csv, config, boxplots, shell, get_name, to_base64): summary = OrderedDict() summary["summary_img"] = {} for infile, outfile in zip(fastqc_csv, list(output.fastqc_plots)): shell("Rscript --vanilla {path}/Rscripts/boxplot.R" + " {input} {output} '{title}' '{xlab}' '{ylab}'", path = config["QCumber_path"], input = infile, output = outfile, title = boxplots[get_name(infile)]["title"], xlab = boxplots[get_name(infile)]["xlab"], ylab = boxplots[get_name(infile)]["ylab"]) summary["summary_img"][ boxplots[get_name(infile)]["title"]] = ( to_base64(outfile)) if not cmd_input["nomapping"]: insertsizes = [] samplenames = [] notzero = 0 for infile in ruleinput["insertsize"]: f = open(infile, "r") data = [int(x) for x in f.read().split(",")] notzero += len([x for x in data if x != 0]) f.close() insertsizes += [data] samplenames += [get_name(infile).replace("_insertsizes", "")] if notzero == 0: insertsizes = [[0, 0, 0, 0, 0] for x in ruleinput["insertsize"]] boxplt = plt.boxplot(insertsizes, 0, '', patch_artist=True) for patch in boxplt['boxes']: patch.set_facecolor('#E25845') plt.xticks([x+1 for x in range(len(ruleinput["insertsize"]))], samplenames, rotation="vertical") try: plt.tight_layout() # This breaks when sample names are to large. # It raises a value error: bottom cannot be # larger than top in matplotlib 2.1.x # Might get fixed in 2.2.x except ValueError: print("Warning: (%s:matplotlib) Some labels to " "long for tight_layout plot" % __file__, file=sys.stderr) plt.title("Fragment lengths") plt.savefig(str(output.insertsize_plot), bbox_inches='tight') summary["summary_img"]["Insertsize"] = to_base64(output.insertsize_plot) plt.clf() summary["Results"] = [] batch_plot = DataFrame() for sample in sorted(list(ruleinput.sample_json)): sample_dict = json.load(open(str(sample),"r"), object_pairs_hook=OrderedDict) summary["Results"].append(sample_dict ) df = DataFrame.from_dict( dict((key, val) for key, val in sample_dict.items() if key in ["Total sequences", "#Remaining Reads", "%Classified","%AlignedReads", "%Adapter content", "%Adapter content (trimmed)", "%Overrepr sequences", "%Overrepr sequences (trimmed)"] ), orient="index").T df.index=[sample_dict["Name"]] batch_plot=batch_plot.append(df) if not cmd_input["notrimming"]: batch_plot["Total sequences"] = (batch_plot["Total sequences"] - batch_plot["#Remaining Reads"]) batch_plot[["#Remaining Reads","Total sequences"]].plot.bar( stacked=True, edgecolor='black', title="Number of Reads", alpha=0.9) else: batch_plot["Total sequences"].plot.bar( stacked=True, edgecolor='black', title="Number of Reads", alpha=0.9) legend = plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) 
plt.savefig(str(output.n_read_plot), bbox_extra_artists=(legend,), bbox_inches='tight') plt.close() summary["summary_img"]["n_read"] = to_base64(str(output.n_read_plot)) try: if not cmd_input["notrimming"]: batch_plot[ ["%Adapter content", "%Adapter content (trimmed)"] ].plot.bar(edgecolor='black', title = "Adapter content [%]",alpha=0.9) else: batch_plot["%Adapter content"].plot.bar( edgecolor='black',title="Adapter content [%]", alpha=0.9) legend = plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) plt.savefig("QCResults/_data/adapter.png" , bbox_extra_artists=(legend,),bbox_inches='tight') plt.close() summary["summary_img"]["adapter"] = to_base64(data_path + "/adapter.png") except: pass try: if not cmd_input["notrimming"]: batch_plot[ ["%Overrepr sequences", "%Overrepr sequences (trimmed)"] ].plot.bar(edgecolor='black', title="Overrepresented sequences [%]", alpha=0.9) else: batch_plot["%Overrepr sequences"].plot.bar( edgecolor='black', title="Overrepresented sequences [%]", alpha=0.9) legend = plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) plt.savefig("QCResults/_data/overrepr_seq.png", bbox_extra_artists=(legend,), bbox_inches='tight') plt.close() summary["summary_img"]["overrepr_seq"] = to_base64( data_path + "/overrepr_seq.png") except: pass if not cmd_input["nomapping"]: batch_plot["%AlignedReads"].plot.bar( edgecolor='black', title = "Map to Reference [%]",alpha=0.9) plt.savefig(str(output.mapping_plot), bbox_inches='tight') summary["summary_img"]["mapping"] = to_base64( str(output.mapping_plot)) plt.close() if not cmd_input["nokraken"]: summary["summary_img"]["kraken"] = to_base64(ruleinput.kraken_batch) json.dump(summary, open(str(output.summary_json), "w")) QCumber-2.3.0/modules/mapping.snakefile000077500000000000000000000113721330104711400200410ustar00rootroot00000000000000################# # MAPPING # ################# from utils import calculate_insert_length def get_bowtie2_result(filename): mapping_dict = OrderedDict() mapping_dict["Bowtie2_log"] = "" mapping_dict["#AlignedReads"] = 0 with open(filename, "r") as logfile: for line in logfile.readlines(): pattern1 = re.match( ".* (?P\d+) \((?P\d+.\d+)%\) aligned exactly 1 time", line) pattern2 = re.match( ".* (?P\d+) \((?P\d+.\d+)%\) aligned >1 times", line) pattern3 = re.match( "(?P\d+\.\d+)\% overall alignment rate", line) if pattern1: mapping_dict["#AlignedReads"] += int(pattern1.group("aligned_exact")) #mapping_dict["%AlignedReads"] = float(pattern1.group("percent_aligned_exact").replace(",", ".")) elif pattern2: mapping_dict["#AlignedReads"] += int(pattern2.group("aligned_more_than_one")) #mapping_dict["%AlignedReads"] = float(pattern2.group("percent_aligned_more_than_one")) elif pattern3: mapping_dict["%AlignedReads"] = float(pattern3.group("overall")) mapping_dict["Bowtie2_log"] += line return mapping_dict if not config["nomapping"]: if config["index"]: index_file = config["index"] else: index_file = mapping_path + "/%s.bowtie2" %basename(config["reference"]) rule bowtie_index: input: config["reference"] output: temp(mapping_path +"/{ref,.*bowtie2}.1.bt2"), temp(mapping_path +"/{ref,.*bowtie2}.2.bt2"), temp(mapping_path +"/{ref,.*bowtie2}.3.bt2"), temp(mapping_path +"/{ref,.*bowtie2}.4.bt2"), temp(mapping_path +"/{ref,.*bowtie2}.rev.1.bt2"), temp(mapping_path +"/{ref,.*bowtie2}.rev.2.bt2") log: log_path +"/logfile.mapping.log" message: "Building bt2 index for mapping." 
shell: "bowtie2-build {input} %s/{wildcards.ref} 1>&2>> {log}" % mapping_path if config["save_mapping"]: samfile = mapping_path + "/{sample}.sam" else: samfile = temp(mapping_path + "/{sample}.sam") rule bowtie_mapping: input: trimmed_fastq = get_all_reads, check_index = expand(index_file + ".{index}.bt2" , index = [1,2,3,4,"rev.1","rev.2"]) output: samfile = samfile, logfile = temp(log_path + "/{sample}.bowtie2.log"), statsfile = mapping_path + "/{sample}_fragmentsize.txt", imagefile = mapping_path + "/{sample}_fragmentsize.png", insertsizefile = mapping_path + "/{sample}_insertsizes.txt" log: log_path +"/{sample}.bowtie2.log" params: notrimming = config["notrimming"] threads: max_threads run: # Check for empty files fastq_files = [x for x in list(input.trimmed_fastq) if getsize(x) != 0] if len(fastq_files) == 0: shell("touch {output.samfile}") f = open(output.statsfile, "w") f.write("Average\tMinimum\tMaximum\n0\t0\t0") else: if params.notrimming: paired_1 = [f for f in fastq_files if "_R1_" in f] paired_2 = [f for f in fastq_files if "_R2_" in f] else: paired_1 = [x for x in fastq_files if x.find(".1P.fastq") != -1] paired_2 = [x for x in fastq_files if x.find(".2P.fastq") != -1] unpaired = [x for x in fastq_files if x.find("U.fastq") != -1] paired_1.sort() paired_2.sort() unpaired_command = "" paired_command = "" if len(paired_1) != 0: paired_command = " -1 " + ",".join(paired_1) + " -2 " + ",".join(paired_2) if len(unpaired) != 0: unpaired_command = " -U " + ",".join(unpaired) shell("bowtie2 -x {ref} {unpaired_command} {paired_command} -S {output.samfile} --threads {threads} 2> {log}", ref = index_file, unpaired_command = unpaired_command, paired_command = paired_command) calculate_insert_length(output.samfile, output.statsfile, output.imagefile, output.insertsizefile) QCumber-2.3.0/modules/sav.snakefile000077500000000000000000000151671330104711400172050ustar00rootroot00000000000000def plot_data_by_cycle(metrics, x, y, title, output, alt_ylabel, caption ): df = metrics.df[[x, "%s_A" % y, "%s_C" % y, "%s_T" % y, "%s_G" % y]] df.columns = [x.replace(y + "_", "") for x in df.columns] df = df.groupby(x).mean() y = alt_ylabel df.columns.name = y df.index.name = x df.plot(title=title + " - %s" % df.columns.name, legend=True) plt.savefig(output) plt.close() return {"title" : title + " - %s" % df.columns.name, "caption" : caption, "base64" : to_base64(output) } def plot_data_by_lane(metrics, codes, x, y, group, title, output, caption ): df = metrics.df[metrics.df.code.isin(codes.keys())] df = df[["code", "lane", "value"]] for key in codes.keys(): df.code = df.code.replace(key, codes[key]) df.index = MultiIndex.from_tuples(list(zip(df["code"], df["lane"]))) df = df[df.value > 0].append(df[df.value == 0].drop_duplicates()) df.columns = [group, x, y] ax = sns.boxplot(x=x, y=y, hue=group, data=df).set_title(title) plt.savefig(output) plt.close() return {"title" : title, "caption" : caption, "base64" : to_base64(output) } def get_sav_input(savfolder): steps = {} infiles = ["CompletedJobInfo.xml", "RunInfo.xml"] interopfiles = ["ControlMetricsOut.bin", "CorrectedIntMetricsOut.bin", "ErrorMetricsOut.bin", "ExtractionMetricsOut.bin", "IndexMetricsOut.bin", "QMetricsOut.bin", "TileMetricsOut.bin"] steps["savfolder"] = savfolder steps["infiles"] = expand("{savfolder}/{infiles}", savfolder=savfolder, infiles=infiles) steps["interopfiles"] = expand("{savfolder}/InterOp/{interopfiles}", savfolder=savfolder, interopfiles=interopfiles) return steps if config["sav"]: sav_results = data_path + "/sav.json" 
rule plot_sav: input: **get_sav_input(config["sav"]) output: #unpack(sav_results) pdf = main_path + "/SAV.pdf", plots = expand(data_path + "/{img}.png", img = ["data_by_cycle_base","data_by_cycle_fwhm","data_by_cycle_intensity", "data_by_lane_phasing","data_by_lane_prephasing", "data_by_lane_cluster", "qscore_distr" , "qscore_heatmap"]), json = sav_results log: log_path + "/sav_report.log" run: import xmltodict shell("Rscript --vanilla {path}/Rscripts/sav.R {input} {outfolder}", path = geninfo_config["QCumber_path"], outfolder = data_path, input=input.savfolder) sav_plots =[] sav_plots.append( {"title": "Data by Cycle - FWHM", "caption": "The average full width of clusters at half maximum (in pixels).", "base64": to_base64(data_path + "/data_by_cycle_fwhm.png"), "filename": data_path + "/data_by_cycle_fwhm.png" }) sav_plots.append({"title": "Data by Cycle - intensity", "caption": "This plot shows the intensity by color of the 90% percentile of the data for each cycle.", "base64": to_base64(data_path + "/data_by_cycle_intensity.png"), "filename": data_path + "/data_by_cycle_intensity.png"}) sav_plots.append({"title": "Data by Cycle - %Base", "caption": "The percentage of clusters for which the selected base has been called.", "base64": to_base64(data_path + "/data_by_cycle_base.png"), "filename": data_path + "/data_by_cycle_base.png"}) sav_plots.append({"title": "Data by Cycle - %>=Q30", "caption": 'The percentage of bases with a quality score of 30 or higher, respectively. This chart is generated after the 25th cycle, and the values represent the current cycle.', "base64": to_base64(data_path + "/qscore_q30.png"), "filename": data_path + "/qscore_q30.png"}) sav_plots.append({"title": "Data by Lane - %Phasing", "caption": 'The percentage of molecules in a cluster for which sequencing falls behind (phasing) the current cycle within a read The graph is split out per read.', "base64": to_base64(data_path + "/data_by_lane_phasing.png"), "filename": data_path + "/data_by_lane_phasing.png"}) sav_plots.append({"title": "Data by Lane - %Prephasing", "caption": 'The percentage of molecules in a cluster for which sequencing falls behind (phasing) the current cycle within a read The graph is split out per read.', "base64": to_base64(data_path + "/data_by_lane_prephasing.png"), "filename": data_path + "/data_by_lane_prephasing.png"}) sav_plots.append({"title": "Data by Lane - Cluster density", "caption": 'The density of clusters for each tile (in thousand per mm2)', "base64": to_base64(data_path + "/data_by_lane_cluster.png"), "filename": data_path + "/data_by_lane_cluster.png"}) sav_plots.append({"title": "QScore Heatmap", "caption": 'The Q-score heat map shows the Q-score by cycle for all lanes.', "base64": to_base64(data_path + "/qscore_heatmap.png"), "filename": data_path + "/qscore_heatmap.png"}) sav_plots.append({"title": "QScore Distribution", "caption": 'The Q-score distribution shows the number of reads by quality score. The quality score os cumulative for current cycle and previous cycles, and only reads that pass the quality filter are included. The Q-score is based on the Phred scale. 
', "base64": to_base64(data_path + "/qscore_distr.png"), "filename": data_path + "/qscore_distr.png"}) xml = OrderedDict() xml["tables"] = OrderedDict() runinfo = xmltodict.parse( "".join(open(join(str(input.savfolder), "RunInfo.xml"),"r").readlines() )) runinfo["RunInfo"]["Run"]["Reads"] = runinfo["RunInfo"]["Run"]["Reads"]["Read"] xml["tables"]["RunInfo"] = runinfo["RunInfo"]["Run"] try: runparam = xmltodict.parse( "".join(open(join(str(input.savfolder),"RunParameters.xml"),"r").readlines() )) except: runparam = xmltodict.parse( "".join(open(join(str(input.savfolder),"runParameters.xml"),"r").readlines() )) runparam['RunParameters'].pop("Reads", None) xml["tables"]["RunParameter"] = dict( (key, value) for (key,value) in runparam["RunParameters"].items() if not key.startswith("@xml")) xml["img"] = sav_plots json.dump(xml, open(str(output.json), "w")) QCumber-2.3.0/modules/trimming.snakefile000077500000000000000000000330241330104711400202320ustar00rootroot00000000000000############# # Functions # ############# def get_trimmomatic_result(files, params): n_reads = 0 total_reads = 0 for file in files: with open(file,"r") as logfile: for line in logfile.readlines(): if re.match("Input Read", line): if geninfo_config["Sample information"]["type"]=="PE": pattern = re.match("Input Read Pairs:\s+(?P\d+)\s+" "Both Surviving:\s+(?P\d+) \((?P\d+.\d+)%\)\s+" "Forward Only Surviving:\s+(?P\d+) \(\d+.\d+%\)\s+" "Reverse Only Surviving:\s+(?P\d+) \(\d+.\d+%\).*", line) total_reads += int(pattern.group("total")) * 2 n_reads += int(pattern.group("nSurvived")) * 2 + int(pattern.group("forward")) + int(pattern.group("reverse")) else: pattern = re.match(".*Surviving: (?P\d+) \((?P\d+.\d+)%\)", line) pattern2 = re.match("Input Reads: (?P\d+)", line) total_reads += int(pattern2.group("total")) n_reads+= int(pattern.group("nSurvived")) all_params ={} for param in params: with open(param,"r") as paramfile: all_params[os.path.basename(param).replace(".trimmomatic.params", "")] = paramfile.read() if total_reads != 0: perc_remaining = round( 100*(n_reads/total_reads),2) else: perc_remaining = 0 return OrderedDict([ ("#Remaining Reads", n_reads), ("%Remaining Reads", perc_remaining), ("Trim parameter", all_params)]) def get_defaults(): params = "" if config["technology"] == "Illumina": params +="ILLUMINACLIP:%s:%s " % (geninfo_config["adapter"], config["illuminaclip"]) if not config["only_trim_adapters"]: params+="%s MINLEN:%s " % (config["trimOption"], str(config["minlen"])) return params def perc_slope(a,b,perc): if abs(numpy.diff([a,b]))>(b*float(perc)): return True return False def optimize_trimming(filename, outname, perc=0.1): if not exists(filename): with open(outname, "w") as paramfile: pass else: mytable=None with open(filename, "rb") as fastqcdata: table = [] ifwrite = False #previous_line = '' while True: line = fastqcdata.readline() if not line: break line = line.decode() line = line.replace("\n", "") if line.startswith(">>END_MODULE") and ifwrite: try: dtype = {'names': table[0], 'formats': ['|S15', float, float, float, float]} mytable = numpy.asarray([tuple(x) for x in table[1:]], dtype=dtype) break except: pass elif re.search("Sequence length\s+\d+", line): seq_length = re.search("Sequence length\s+(?P\d+)", line).group("length") #elif re.search("Sequence length\s+\d+(?P\d+)", line): # The culprit ^^^ this killed trimBetter # seq_length = re.search("Sequence length\s+\d+(?P\d+)", line).group("length") elif line.startswith(">>Per base sequence content"): #print(line, seq_length) ifwrite = True temp = 
line.split("\t") if temp[1].lower() == "pass": with open(outname, "w") as paramfile: paramfile.write("%s;%s" % (0, seq_length)) return True elif ifwrite: table.append(line.split("\t")) # previous_line = line headcrop = 0 tailcrop = len(mytable["A"]) column = numpy.ma.array(mytable) print(rep(column)) for i in range(-4, int(round(len(mytable["A"]) / 3, 0)), 1): for nucl in ["A", "C", "G", "T"]: column[nucl].mask[max(i, 0):i + 5] = True if headcrop >0: column[nucl].mask[:headcrop] = True if tailcrop < len(mytable["A"]): column[nucl].mask[tailcrop:] = True # check heacrop if (perc_slope(numpy.mean(mytable[nucl][max(i, 0):i + 5]), numpy.mean(column[nucl]), perc=perc)) & (headcrop < (i + 5)): headcrop = i + 5 trim_bool = True elif headcrop < i: column[nucl].mask[max(i, 0):i + 5] = False # now crop from the end column[nucl].mask[-(i + 5):(min(len(mytable[nucl]), len(mytable[nucl]) - i))] = True if (perc_slope(numpy.mean(mytable[nucl][-(i + 6): (min(len(mytable[nucl]) - 1, len(mytable[nucl]) - 1 - i))]), numpy.mean(column[nucl]), perc=perc)) & (tailcrop > len(mytable[nucl]) - (i + 5)): tailcrop = len(mytable[nucl]) - (i + 5) trim_bool = True else: column[nucl].mask[-(i + 5): (min(len(mytable["A"]) - 1, len(mytable[nucl]) - 1 - i))] = False with open(outname, "w") as paramfile: paramfile.write("%s;%s" % (headcrop, tailcrop-headcrop)) return True def get_best_params(output, r1, r2=None): with open (r1,"r") as r1_file: r1_params =r1_file.read().replace("\n","").split(";") # print('head-/tail-crop params:\nread1:', r1_params) if r2 is not None: with open(r2, "r") as r2_file: r2_params = r2_file.read().replace("\n","").split(";") # print('read2:', r2_params) else: r2_params = [-float("inf"), float("inf")] if not r1_params[0]=="": new_params = " ".join([ get_defaults(), "HEADCROP:" + str(max(int(r1_params[0]),int(r2_params[0]))), "CROP:" + str(min(int(r1_params[1]),int(r2_params[1]))), "MINLEN:" + str(config["minlen"])]) else: new_params = get_defaults() with open(output, "w") as outfile: outfile.write(new_params) return new_params def get_trimmomatic_input(wildcards): input = {} #if geninfo_config["Sample information"]["samples"][wildcards.sample][0].endswith(".bam"): # input["fastq_files"] = bam_to_fastq(geninfo_config["Sample information"]["samples"][wildcards.sample][0]) #else: # input["fastq_files"] = geninfo_config["Sample information"]["samples"][wildcards.sample] input["fastq_files"] = get_all_reads(wildcards, True) if config["trimBetter"]: if geninfo_config["Sample information"]["type"] == "PE": input["params"] = list(expand("{path}/{sample}_{read}.params", read=["R1", "R2"], sample=wildcards.sample, path=trimbetter_path)) else: input["params"] = list(expand("{path}/{sample}.params", sample = wildcards.sample, path = trimbetter_path)) return input def get_trimmomatic_output(path, is_temp = False): output = {} if is_temp: #output["pseudo_trimfile"] = (path + # "/{sample}.trimBetter.trimmomatic.pseudo") #output["params_file"] = temp( path + "/{sample}.trimmomatic.params") output["logfile"] = log_path + "/{sample}.trimBetter.trimmomatic.log" else: output["params_file"] = temp(path + "/{sample}.trimmomatic.params") #output["pseudo_trimfile"] = path + "/{sample}.trimmomatic.pseudo" output["logfile"] = log_path + "/{sample}.trimmomatic.log" if geninfo_config["Sample information"]["type"]=="SE": if is_temp: output["trimmed_files"] = [temp(path + "/{sample}.fastq.gz")] else: output["trimmed_files"] = [path + "/{sample}.fastq.gz"] else: if is_temp: output["trimmed_files"] = [temp(path + 
"/{sample}.1P.fastq.gz"), temp(path + "/{sample}.1U.fastq.gz"), temp(path + "/{sample}.2P.fastq.gz"), temp(path + "/{sample}.2U.fastq.gz")] else: output["trimmed_files"] =[ path + "/{sample}.1P.fastq.gz", path + "/{sample}.1U.fastq.gz", path + "/{sample}.2P.fastq.gz", path + "/{sample}.2U.fastq.gz"] return output #--------------------------------------------< RULES >-----------------------------------------------------------------# if not config["notrimming"]: if config["trimBetter"]: rule join_reads_trimBetter: input: r1 = [trimbetter_path + "/{sample}.1P.fastq.gz", trimbetter_path + "/{sample}.1U.fastq.gz"], r2 = [trimbetter_path + "/{sample}.2P.fastq.gz", trimbetter_path + "/{sample}.2U.fastq.gz"] output: r1_out = temp(trimbetter_path + "/{sample}_R1.fastq.gz"), r2_out = temp(trimbetter_path + "/{sample}_R2.fastq.gz") shell: "cat {input.r1} > {output.r1_out} | " "cat {input.r2} > {output.r2_out} " rule fastqc_trimBetter: input: fastq_files = trimbetter_path + "/{sample}_{read}.fastq.gz" output: temp(trimbetter_path + "/FastQC/{sample}_{read}_fastqc.zip"), temp(trimbetter_path + "/FastQC/{sample}_{read}_fastqc.html"), fastqc = temp(trimbetter_path + "/FastQC/{sample}_{read}_fastqc"), log = temp(trimbetter_path + "/FastQC/{sample}_{read}.fastqc.log") threads: max_threads message: "Run FastQC to obtain better trimming paramters." run: #print('sizes:', ' '.join(['%s:%i|' % (x,os.path.getsize(x)) # for x in input]), file=sys.stderr) shell( "if [ `zcat '{input}' | head -n 1 | wc -c ` -eq 0 ]; " "then touch {output}; " "else fastqc {input} -o $(dirname {output.fastqc})" " --extract --nogroup -t {threads} > {output.log} 2>&1; " "fi; ") rule optimize_trimming_parameter: input: trimbetter_path + "/FastQC/{sample}_{read}_fastqc" output: temp(trimbetter_path + "/{sample}_{read}.params") params: perc_slope = config["trimBetter_threshold"] run: # Apperently os.path.join is loaded somehow res = optimize_trimming(join(str(input),"fastqc_data.txt"), str(output), float(params.perc_slope)) if not res: shell('exit 1') rule trimmomatic_trimBetter: input: fastq_files = lambda x: geninfo_config["Sample information"]["samples"][x.sample] output: **get_trimmomatic_output(trimbetter_path, is_temp = True) threads: max_threads log: get_trimmomatic_output(trimbetter_path, is_temp = True)['logfile'] #log_path + "/{sample}.trimmomatic.trimBetter.log" params: get_defaults() shell: ("trimmomatic %s -threads {threads} {input.fastq_files}" " {output.trimmed_files} {params}" " 2> {log}") % geninfo_config["Sample information"]["type"] #-- end trimbetter rule trimmomatic: input: unpack(get_trimmomatic_input) output: **get_trimmomatic_output(trimming_path) log: log_path + "/{sample}.trimmomatic.log" params: minlen = config["minlen"], trimOption = config["trimOption"] threads: max_threads run: #print('sizes:', ' '.join(['%s:%i|' % (x,os.path.getsize(x)) # for x in input]), file=sys.stderr) try: new_params = get_best_params(str(output.params_file), *list(input.params)) except: new_params = get_defaults() if params.minlen: pass if params.trimOption: pass paramfile = open(str(output.params_file),"w") paramfile.write(new_params) paramfile.close() shell("trimmomatic %s -threads {threads} " # -Xmx512m " "{input.fastq_files} {output.trimmed_files} %s 2> {output.logfile}" % ( geninfo_config["Sample information"]["type"], new_params)) #shell("touch {output.pseudo_trimfile}") rule join_reads: input: r1 = [trimming_path + "/{sample}.1P.fastq.gz",trimming_path + "/{sample}.1U.fastq.gz"], r2 = [trimming_path + 
"/{sample}.2P.fastq.gz",trimming_path + "/{sample}.2U.fastq.gz"] output: r1_out = temp(trimming_path + "/{sample}_R1.fastq.gz"), r2_out = temp(trimming_path + "/{sample}_R2.fastq.gz") run: # print('sizes:', ' '.join(['%s:%i|' % (x,os.path.getsize(x)) # for x in input]), file=sys.stderr) shell( "cat {input.r1} > {output.r1_out} | " "cat {input.r2} > {output.r2_out}") QCumber-2.3.0/modules/utils.py000077500000000000000000000037061330104711400162370ustar00rootroot00000000000000import fileinput import matplotlib.pyplot as plt import numpy def which(program): import os def is_exe(fpath): return os.path.isfile(fpath) and os.access(fpath, os.X_OK) fpath, fname = os.path.split(program) if fpath: if is_exe(program): return program else: for path in os.environ["PATH"].split(os.pathsep): exe_file = os.path.join(path, program) if is_exe(exe_file): return exe_file return None def calculate_insert_length(samfile, statsfile, imagefile, insertsizefile): minlen=1000000 maxlen=0 totlen = 0 lines = 0 alllens = [] for line in fileinput.input(samfile): if line[0] == "@": continue dat = line.split("\t") if dat[8] == "0": continue inslen = abs(int(dat[8])) alllens += [inslen] totlen += inslen lines += 1 if inslen < minlen: minlen = inslen if inslen > maxlen: maxlen = inslen try: avg_insert = float(totlen)/float(lines) f = open(statsfile, "w") f.write("Average\tMinimum\tMaximum\n") f.write(str(avg_insert) + "\t" + str(minlen) + "\t" + str(maxlen)) f.close() topoutlier = numpy.percentile(alllens, 75)+1.5*(numpy.percentile(alllens, 75)-numpy.percentile(alllens, 25)) cutlens = [x for x in alllens if x <= 5*topoutlier] f = open(insertsizefile, "w") f.write(",".join([str(x) for x in alllens])) f.close() plt.hist(cutlens, bins=100) plt.xlabel("Template length") plt.ylabel("Density of reads") plt.savefig(imagefile) except: f = open(statsfile, "w") f.write("Average\tMinimum\tMaximum\n") f.write("0\t0\t0") f.close() f = open(imagefile, "w") f.write(" ") f.close() f = open(insertsizefile, "w") f.write("0,0") f.close() QCumber-2.3.0/readme.md000077500000000000000000000305341330104711400146330ustar00rootroot00000000000000# QCumber Quality control, quality trimming, adapter removal and sequence content check of NGS data. >Version: 2.1.1
>Contact: BI-Support@rki.de
>Documentation updated: 24.07.2017 ## Installation: Install the latest stable version via Bioconda channel. It is assumed that the following channels are activated: * bioconda * r * ostrokach * conda-forge ```sh conda install qcumber ``` and update with ```sh conda update qcumber ``` Further prerequisite tools are pdflatex and texlive-latex-extra for PDF reports. ## Introduction QCumber is a pipeline for quality control, trimming and sequence content check of NGS data. It includes parameter optimization of trimming and visualization of the output as an interactive HTML report. Note that mapping and read classification are only preliminary results and also paired-end data are treated as single-end. The workflow used in the pipeline is visualized in the following chart: ![Workflow](workflow.png "Workflow image") QCumber needs miniconda3 to build the pipeline and pdflatex to write sample reports. The following tools are used: | Tool name | Version | Pubmed ID | |-----------|---------|-----------| | [snakemake](https://bitbucket.org/snakemake/snakemake/wiki/Home) | 3.12.0 || | [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) | 0.11.5 || | [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) | 0.36 | [24695404](https://www.ncbi.nlm.nih.gov/pubmed/24695404)| | [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) | 2.2.9 | [22388286](https://www.ncbi.nlm.nih.gov/pubmed/22388286) | | [Kraken](http://ccb.jhu.edu/software/kraken/) | 0.10.5 |[24580807](https://www.ncbi.nlm.nih.gov/pubmed/24580807) ## Tutorial Calling the pipeline with the option `--help` provides a help message listing all options: ```sh QCumber-2 --help ``` A small example dataset is provided in the data-folder in /pipelines/datasets/benchmarking/qc_map_var/ (not published yet). The following example uses this dataset to demonstrate a basic run of the pipeline. A basic pipeline run is as follows: ```sh QCumber-2 --read1 /pipelines/datasets/benchmarking/qc_map_var/data/Sample1_S1_L001_R1_001.fastq.gz --read2 /pipelines/datasets/benchmarking/qc_map_var/data/Sample1_S1_L001_R2_001.fastq.gz \ --reference /pipelines/datasets/benchmarking/qc_map_var/references/reference.fasta --output qcumber_output ``` Input data can be entered as `-1` or `-2` for single files or `--input` for a project folder. QCumber can automatically detect read pairs if Illumina sample pattern matches `___`. For the preliminary mapping step a reference in fasta-format must be given. Otherwise QCumber skips mapping process. If `--output` is not defined, the results folder **QCResults** is written to the working directory. The following usage is for batch analysis and parameter optimization adjusted to mapping as downstream analysis (`--trimBetter`, see chapter 'Functions' for more details): ``` QCumber-2 --input /pipelines/datasets/benchmarking/qc_map_var/ --adapter NexteraPE-PE --trimBetter mapping --output qcumber_batch_output ``` If you only need a subset of files in your folder, you can also use regular expression in `--input`. This example returns all files starting with Sample1 : ``` QCumber-2 --input /pipelines/datasets/benchmarking/qc_map_var/data/Sample1* ``` QCumber-2 outputs **/config.yaml** for each run, which can be used to rerun the analysis or to define default parameters. ``` QCumber-2 --config config.yaml ``` If you add additional parameters, it overrides the values in the config file. Here is an example how to use **config.yaml** as default parameter setting. The structure of **config.yaml** is very easy. 
All input parameters can be listed in the format `<parameter> : <value>`. For instance: trimBetter: mapping&#13;
threads: 10
save_mapping: true
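Since **config.yaml** is plain YAML with one key per command-line option, it can also be written (or sanity-checked) programmatically. The following sketch is not part of QCumber itself and assumes PyYAML is installed; the keys simply mirror the options described under 'Functions' below.

```python
#!/usr/bin/env python3
# Illustrative sketch: write a minimal QCumber config.yaml with PyYAML.
import yaml

config = {
    "input": "/pipelines/datasets/benchmarking/qc_map_var/",  # project folder
    "adapter": "NexteraPE-PE",   # adapter set passed to Trimmomatic
    "trimBetter": "mapping",     # optimize trimming parameters for mapping
    "threads": 10,
    "save_mapping": True,
    "output": "qcumber_batch_output",
}

with open("config.yaml", "w") as handle:
    yaml.safe_dump(config, handle, default_flow_style=False)

# Reuse it afterwards with:  QCumber-2 --config config.yaml
```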
``` QCumber-2 --config config.yaml --trimBetter assembly --output results_folder ``` In this case trimBetter parameters will be optimized for assembly, i.e trimming is more aggressive than for mapping (see section 'Functions' for further details). ## Functions #### Get information from Illumina Sequence Analysis Viewer > short: `-w `
> long: `--sav ` This option requires that the provided folder contain: * CompletedJobInfo.xml * GenerateFASTQRunStatistics.xml * RunCompletionStatus.xml * RunInfo.xml * RunParameters.xml * InterOp/ControlMetricsOut.bin * InterOp/CorrectedIntMetricsOut.bin * InterOp/ErrorMetricsOut.bin * InterOp/ExtractionMetricsOut.bin * InterOp/IndexMetricsOut.bin * InterOp/QMetricsOut.bin * InterOp/TileMetricsOut.bin It takes the information from these files and converts it into a human readable table. Furthermore, plots were generated equivalent to SAV section "Data by Cycle" for FWHM, intensity and %base as well as for section "Data by Lane" for prephasing, phasing and cluster density. Both tables and plots can be found in **QCResults/batch_report.html** under the section "Sequencer Information". Additionally, a report **QCResults/SAV.pdf** for SAV will be generated. #### Input > Long option: `--input `
> Short option: `-i ` Input sample folder. Illumina filenames should be gzipped fastq files ending with `<sample name>_<lane>_<R1|R2>_<number>`, e.g. Sample_12_345_R1_001.fastq.gz, so that the right paired set can be found. If this pattern does not match, all files are treated as single-end data. This is always the case for IonTorrent data, i.e. the input file is in bam-format. #### Read1 > Long option: `--read1 `&#13;
> Short option: `-1 ` Filename for the forward reads or for one single-end file. This is expected to be a .fastq.gz file. #### Read2 > Long option: `--read2 R2`&#13;
> Short option: `-2 R2` Filename for reverse read file. This option does not check for file pattern. This is expected to be a .fastq.gz file. #### Sequence technology > Long option: `--technology `
> Short option: `-T ` > Options: {Illumina, IonTorrent} If not set, automatically determine technology and search for fastq and bam files. Set technology to IonTorrent if all files are bam-files, else set technology to Illumina. #### Optimize trimming parameter > Long option: `--trimBetter ` > Options : {assembly, mapping, default} Optimize trimming parameter using 'Per sequence base content' from fastqc. This option is not recommended for amplicons. This option will, after quality trimming, remove all positions at the beginning and end of the reads that show an uneven distribution of bases (as is characteristic for Nextera). The trimBetter_threshold in the values given below sets by how much the highest-abundant base in a position can be more abundant than the lowest-abundant base (i.e. if --trimBetter_threshold is set to 0.15, the abundancy of the highest-abundant base in a cycle may be at most 1.15 times that of the lowest-abundant base, otherwise the cycle will be trimmed). The option *assembly* trims more aggressively than *mapping*, i.e. it allows even lower fluctuations in 'Per sequence base content'. The parameters are written in config/parameter.txt and vary with trimBetter type and sequencing platform: * default: `--trimOption 'SLIDINGWINDOW:4:20' --trimBetter_threshold 0.15` * Illumina - Assembly: `--trimOption 'SLIDINGWINDOW:4:25' --trimBetter_threshold 0.1` * Illumina - Mapping: `--trimOption 'SLIDINGWINDOW:4:15' --trimBetter_threshold 0.15` * IonTorrent - Assembly: `--trimOption 'SLIDINGWINDOW:4:15' --trimBetter_threshold 0.2` * IonTorrent - Mapping: `--trimOption 'SLIDINGWINDOW:4:15' --trimBetter_threshold 0.25` #### Trimbetter threshold > Long option: `--trimBetter_threshold `
> Short option: `-b ` Set --trimBetter to use this option. This option overrides the threshold of how much the base content may fluctuate at most. The thresholds implied by *assembly*, *mapping* and *default* are overwritten by this value. #### Minimal read length > Long option: `--minlen `&#13;
> Short option: `-m `
> Default: 50 Minlen parameter for Trimmomatic. Drops reads shorter than minlen. #### Only Trim Adapters > Long option: `--only_trimm_adapters` > Short option: `-A` Only removes adapters and invalidates additional Trimmomatic/trimBetter parameters. #### Additional Trimmomatic parameters > Long option: `--trimOption `&#13;
> Short option: `-O ` #### Illuminaclip > Long option: `--illuminaclip `
> Short option: `-L `
> Default: 2:30:10 Illuminaclip option: `<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>`. #### Adapter removal > Long option: `--adapter `&#13;
> Short option: `-a `
> Options: {TruSeq2-PE, TruSeq2-SE, TruSeq3-PE, TruSeq3-SE, TruSeq3-PE-2, NexteraPE-PE} > Default: all Adapter sequence for Trimmomatic. Suggested adapter sequences are provided for TruSeq2 (as used in GAII machines) and TruSeq3 (as used by HiSeq and MiSeq machines), for both single-end and paired-end mode (check Trimmomatic manual). If not set, all adapters are used for trimming. #### Reference > Long option: `--reference `
> Short option: `-r ` Map reads against reference. Reference needs to be in fasta-format. #### Bowtie2 index > Long option: `--index `
> Short option: `-I ` Bowtie2 index if available. Otherwise, set --reference for mapping. #### Save mapping > Long option: `--save_mapping `
> Short option: `-S ` > Default: False Saves the mapping file in sam-format. By default, only mapping statistics are saved. #### Kraken DB > Long option: `--kraken_db `&#13;
> Short option: `-d ` Path to the Kraken database. The folder has to contain database.kdb. #### Kraken (un)classified read output > Long option: `--kraken_classified_out`&#13;
Kraken (un)classified-out option. If set, both the --classified-out and the --unclassified-out option are passed to Kraken. Default: False. #### Nokraken > Long option: `--nokraken `&#13;
> Short option: `-K` Skip Kraken classification. #### Notrimming > Long option: `--notrimming `&#13;
> Short option: `-Q ` Skip trimming step. #### Config > Long option: `--config `
> Short option: `-c ` > Default: config/config.txt in the installation directory of QCumber-2, if it exists. Config file used to (re-)run the pipeline. Additional parameters on the command line will override arguments in the config file. #### Threads > Long option: `--threads `&#13;
> Short option: `-t `
> Default: 4 Number of threads. #### Output > Long option: `--output `
> Short option: `-o ` #### Rename > Long option: `--rename RENAME`
> Short option: `-R RENAME` Tab-separated file with two columns: ` `. QCumber replaces the old filename with the new one. If it does not find unique replacements, it will skip renaming for this sample. #### Additional snakemake commands All parameters (excluding --cores) from snakemake can be given to QCumber. For example `--notemp` saves all temp files of the analysis or `--forceall` will force the pipeline the rerun all analysis steps, although the output already exists. ## Output By default, the pipeline generates the following files in the output folder: * **QCResults** * < PDF report per sample > * **batch_report.html** *(HTML report for entire project; it integrates kraken.html, so if you move this file, make sure to move kraken.html in the some folder)* * **kraken.html** * **FastQC** * **Raw** * < output folder(s) from FastQC > * **Trimmed** * < output folder(s) from FastQC > * **Trimmed** * < trimmed reads (.fastq.gz) > * **Mapping** * < sam files > * **Classification** * < Kraken plots > * < textfile of classified reads (.translated) > * **kraken_batch_result.csv** (table of classified species [%] ) * config.yaml # License This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License, version 3 as published by the Free Software Foundation. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. You should have received a copy of the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/. QCumber-2.3.0/report.tex000077500000000000000000000124001330104711400151010ustar00rootroot00000000000000\documentclass[a4paper]{article} \usepackage[document]{ragged2e} \usepackage{python} \usepackage[english]{babel} \usepackage{graphicx} \usepackage[utf8]{inputenc} \usepackage{xcolor} \usepackage[T1]{fontenc} \renewcommand*\familydefault{\ttdefault} %% Only if the base font of the document is to be typewriter style \usepackage{subfigure} \usepackage{geometry} \usepackage{tcolorbox} \usepackage{adjustbox} \usepackage{setspace} \usepackage{float} \usepackage{url} \usepackage{grffile} \usepackage{fancyvrb} \usepackage{spverbatim} \geometry{a4paper,left=25mm,right=20mm, top=12mm, bottom=20mm} \setlength\parindent{0pt} \tolerance=1 \emergencystretch=\maxdimen \hyphenpenalty=1000 \hbadness=1000 %\definecolor{green}{RGB}{0,158,115} %\definecolor{red}{RGB}{213,94,0} %\definecolor{orange}{RGB}{230,159,0} \begin{document} {\bf {\LARGE{QCumber} Version {{~ general_information["QCumber"] ~}} } }\\ \line(1,0){\textwidth} \begin{tabular}{p{0.25\textwidth} p{0.75\textwidth}} Executor: & {{~ general_information["User"] ~}} \\ Dataset: & {\bfseries{\verb|{{~ sample["Name"] ~}}|} }\\ Path: & \path{{{~ sample["path"] ~}}} \\ Date: & {{~ sample["Date"] ~}}\\ & \\ Operating System & \\ {% for key, value in general_information["Operating system"].items() %} \verb|{{~ key ~}}| & \verb|{{~ value ~}}| \\ {% endfor %} & \\ Tool versions & \\ {% for key, value in general_information["Tool versions"].items() %} \verb|{{~ key ~}}| & {{~ value | safe ~}}\\ {% endfor %} \end{tabular}\\ %--------------------------------------------- Workflow ---------------------------------------------------------------% \line(1,0){\textwidth} \\ Processed reads: \\ {% for read in sample["Files"] %} \verb|{{~ read ~}}| \\ {% endfor %} 
%------------------------------------------------ Summary -------------------------------------------------------------% \begin{tcolorbox} {\large{Summary} } \\ \begin{tabular}{lrr} & Raw reads & {%- if "%Remaining Reads" in sample.keys() -%} Trimmed reads {% endif %} \\ Number of reads & {{~ sample["Total sequences"] ~}} & {%- if "%Remaining Reads" in sample.keys() -%} {{~ sample["#Remaining Reads"] ~}} ({{~ sample["%Remaining Reads"] ~}}\%) {% endif %} \\ {%- if sample["%Adapter content"] != None -%} Adapter content & {{~ sample["%Adapter content"] ~}}\% & {%- if sample["%Adapter content (trimmed)"] -%} {{~ sample["%Adapter content (trimmed)"] ~}}\% {% endif %} \\ {% endif %} {% if sample["%Overrepr sequences"] != None %} Overrepresented sequences & {{~ sample["%Overrepr sequences"] ~}}\% & {%- if sample["%Overrepr sequences (trimmed)"] -%} {{~ sample["%Overrepr sequences (trimmed)"] ~}}\%{% endif %} \\ {% endif %} \end{tabular} \vspace{5mm} {% if "%AlignedReads" in sample.keys() %}{{~ sample["#AlignedReads"] ~}} ({{~ sample["%AlignedReads"] ~}}\%) of reads aligned to \path{ {{~ sample["Reference"] ~}} }\\ {% endif %} {% if "%Classified" in sample.keys() %}{{~ sample[ "#Classified"] ~}} sequences classified ({{~ sample[ "%Classified"] ~}}\%) {% endif %} \end{tcolorbox} \line(1,0){\textwidth} \\ \vspace{5mm} %------------------------------------------------- FASTQC Results -----------------------------------------------------% {\Large{FastQC (Pre | Post Trimming) } } \\ \vspace{5mm} {%for name, value in sample["raw_fastqc_results"].items() %} Readname: \verb|{{~ name ~}}| {% if "Trim parameter" in sample.keys() %} Trim parameter: \\ {% for trimname, trimvalue in sample["Trim parameter"].items() %} {%- if name.startswith(trimname) -%} \begin{spverbatim}{{~ trimvalue ~}}\end{spverbatim}\\ {% endif %} {% endfor %} {% endif %} {% for key in value["img"].keys() %} \begin{figure}[H] \centering \begin{adjustbox}{minipage=0.4\textwidth-2\fboxrule, cframe = {{~ value["img"][key]["color"] ~}} 2}% \begin{subfigure}% {\includegraphics[width=\textwidth]{{{~ value["img"][key]["path"] ~}}}} \end{subfigure} \end{adjustbox}\qquad {% if "trimmed_fastqc_results" in sample.keys() %} {% if name in sample["trimmed_fastqc_results"].keys() %} \begin{adjustbox}{minipage=0.4\textwidth-2\fboxrule, cframe = {{~ sample["trimmed_fastqc_results"][name]["img"][key]["color"] ~}} 2}% \begin{subfigure}% {\includegraphics[width=\textwidth]{{{~ sample["trimmed_fastqc_results"][name]["img"][key]["path"] ~}}}} \end{subfigure} \end{adjustbox} \caption{ {{~ key ~}}} {% endif %} {% endif %} \end{figure} {% endfor %} {% endfor %}\\ %--------------------------------------------------- Bowtie Results ---------------------------------------------------% {% if "%AlignedReads" in sample.keys() %} \line(1,0){\textwidth} \vspace{5mm} {\Large{Bowtie2} } - Map against \path{{{~ sample["Reference"] ~}} } \\ \begin{verbatim} {{~ sample["Bowtie2_log"] ~}} \end{verbatim} {% endif %} %------------------------------------------------ Kraken Results ------------------------------------------------------% {% if "%Classified" in sample.keys() %} \line(1,0){\textwidth} \vspace{5mm} {\Large{Kraken} } \\ \begin{verbatim} {{~ sample["kraken_log"] ~}} \end{verbatim} {\includegraphics[width=\textwidth]{{{~ sample["kraken_img"] ~}}}} {{~ sample["kraken_results"] ~}} {% endif %} \end{document} 
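# The LaTeX template above is filled in with Jinja2-style placeholders
# ({{~ ... ~}} for variables, {% ... %} for control blocks) and then compiled
# with pdflatex into the per-sample PDF report. The actual rendering call
# lives elsewhere in the pipeline; the sketch below is only an illustration
# of how such a template can be rendered, and the custom variable delimiters
# as well as the context keys (general_information, sample) are assumptions
# taken from the placeholders used in the template.
import subprocess
from jinja2 import Environment, FileSystemLoader


def render_sample_report(general_information, sample,
                         template_dir=".", out_tex="sample_report.tex"):
    """Fill report.tex with one sample's results and compile it with pdflatex."""
    env = Environment(
        loader=FileSystemLoader(template_dir),
        variable_start_string="{{~",   # assumed to match the {{~ ... ~}} markers
        variable_end_string="~}}",
    )
    tex = env.get_template("report.tex").render(
        general_information=general_information, sample=sample)
    with open(out_tex, "w") as handle:
        handle.write(tex)
    # pdflatex has to be on the PATH (see the installation prerequisites).
    subprocess.run(["pdflatex", "-interaction=nonstopmode", out_tex], check=True)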
QCumber-2.3.0/test/000077500000000000000000000000001330104711400140235ustar00rootroot00000000000000QCumber-2.3.0/test/.default_goldstandard.json000066400000000000000000000017451330104711400211550ustar00rootroot00000000000000{"-t4,-K": {"dwgsim,42,e_coli,-1120,-2120,-e0.0005-0.001,-E0.0005-0.001,-N10000": {"testData42dwgsim-e-colibbb40142940a1d5a6fccd555597c402a1cbc187c/QCResults/_logfiles/testData42dwgsim-e-colibbb40142940a1d5a6fccd555597c402a1cbc187c_S1_L001.trimmomatic.log": [2875, "85b6fa3c71dc9e277b6395c0f2b49579d8dda229"], "testData42dwgsim-e-colibbb40142940a1d5a6fccd555597c402a1cbc187c/QCResults/FastQC/trimmed/testData42dwgsim-e-colibbb40142940a1d5a6fccd555597c402a1cbc187c_S1_L001_R1_fastqc.zip": [211081, ""], "testData42dwgsim-e-colibbb40142940a1d5a6fccd555597c402a1cbc187c/QCResults/FastQC/trimmed/testData42dwgsim-e-colibbb40142940a1d5a6fccd555597c402a1cbc187c_S1_L001_R2_fastqc.zip": [209704, ""], "testData42dwgsim-e-colibbb40142940a1d5a6fccd555597c402a1cbc187c/QCResults/_data/summary_new.json": [612, "3abe14a04eb162cab2e1b2faf06f53f6953141d9"], "testData42dwgsim-e-colibbb40142940a1d5a6fccd555597c402a1cbc187c/QCResults/_data/summary.json": [1022131, "cdeec4f333fd39d5647caecba323d9378e07f968"]}}}QCumber-2.3.0/test/.default_low_spec_test_run.yaml000066400000000000000000000251251330104711400222340ustar00rootroot00000000000000# This config file controls the apptests for qcumber # Its toplevel consists of # references, seeds, default_output_files, download_minikraken and test_runs # verbosity # TODO: download_minikraken # short reminder: - denotes the start of a list element # still part of the element # again.. first entry of the list # - second entry of the list # also a list: [1, 2, 3, 4, 5, 6] # references: # Reference sequences used mainly for simulating reads # : # Name used to reference the sequence within the document # descriptive_name: # # As it states (Can be an empty string '') # url: (required and should be valid) # # Webaddress from which to download data if it doesn't exist # # yet. Multi-line strings that are supposed to be one big one # # can be enclosed with quotes "", but each line # # needs to end with a forward slash, to prevent the insertion # # of spaces. # remote_reads: # Remote fastq data # : # Name used to reference data # descriptive_name: # - desriptive_name: # url: # seeds: [] list of seeds required to run the simulators (non optional) # # verbosity: (optional, default: 1) # # verbosity level of qcumber run (does not affect log) # # level 0: no qcumber output # # level 1: only progress and errors # # level 2: all rule, job and progress information # default_output_files: # # List of output files that need monitoring # # OutputFile List Entry: # # Files are hashed and compared to a golden standard or alternatively # # checked by looking at the file size # # - file: (required) # # Filepath referenced to QCResults # # QCResults/_data/summary.json is _data/summary.json # # Unix- path pattern extension works (*[]?) and will # # be resolved as multiple files that are handled with the same # # options active. # # The test script generates pretty unwieldy file name prefixes. # # One does not need to anticipate thoose and can simply elect # # to use {{Prefix}} in the filepath, which will be replaced by the # # Prefix. # # expected_num_files: (optional, default: 1) # # Numbers of files to expect after resolving name and pattern # # extension. 
Causes WARNING, if more files than expected # # and ERROR if less than expected are observed # # regex: 'regex string' (optional, default: '') # # A perl regex that will be used to pre-process the file prior # # to hashing it. Removing timestamps, temporary directory names and # # machine local variables are its main uses. # # 's/"Date": "[-0-9]*",//g' removes a date entry from a .json # # 's/tmp\/[^ ]*\///g' removes tmp/randomized_string/ from a file # # Sadly, Using multiple regexes currently requires stringing them # # together. E.g.: 's/("Date": "[-0-9]*",)|(tmp\/[^ ]*\/)//g' # # use_file_size: (optional, default: False) # # circumvents hashing and triggers comparison by file size # # file_size_cutoff: (optional, default: .95) # # Bound for file being close enough to the size of the standard # # Per default checks if the smaller file is at least 95% of the # # bigger one in size. # # If set to 1, needs an exact matching. # # Caution: File systems are weird and this might not work out well # # test_runs: # list of test runs # # A test run # - threads: (optional, default: 4) # # Threads Qcumber should use # # trimbetter: (optional, default: False) # # Should the run use trimbetter # # special_output_files: (optional, default: []) # # Same as default output file list, but takes precedence # # over said list in case of a conflict (see: default_output_files) # # kraken: (optional, default: False) # # Use kraken database to check for possible contamination # # of read data # # # # If QCumber/config/config.txt is not setup to find a suitable # # kraken database one can be pointed out in the additional opts. # # TODO: # # If none is provided and the optional parameter download_minikraken # # is set, minikraken will be downloaded (4 GB). # # additional_opts: (optional, default: []) # # additional options for the qcumber run # # Note: Options are not allowed to contain white spaces # # ['--kraken_db Path/to/kraken/database.kdb', ...] is not valid # # ['--kraken_db', 'Path/to/kraken/database.kdb', ...] is valid # # ['-dPath/to/kraken/database.kdb', ...] is also valid # # simulators: (optional, if real data present, default: []) # # Simulator entry # - name: (required) # # currently only dwgsim implemented # # reference: (required) # # must match a name assigned in the references section of this # # config # # opt: (required) # # list of options for start of read simulator # # Note: Options are not allowed to have white space # # ['-1 140', ...] is not allowed # # it must be either ['-1140', ...] or ['-1', '140', ...] # type: (optional, default: 'read') # read | variants # # information used to manage multiple simulators, but currently # # unused. # # qcumber_input: (optional, default: True) # # flag signaling that the simulation is used for the input # # parameter of qcumber. Causes name cleanup. # # ## Here be Dragons ## # overwrite_standard: (optional, default: False) # # Overwrites the gold standard entry for the run. # # Usefull if you are sure that the previous entry was wrong. # # A tool for fixing rules and accidental activation should be # # avoided at all cost. This is why overwrites are only triggered # # if the ./test_qcumber is called with the --trigger_overwrite # # parameter. # # # # Use Case: You are working on a new file rule that requires # # cleanup before hashing. # # - You run test_qcumber and it generates new # # hashes for the newly tracked files. # # - The second run yields bad hashes for the new files. 
# # There might be a timestamp, temporary directory, username that # # needs to be removed prior to hashing # # - You adjust the regex replace and set the overwrite_standard # # option for the run to True. # # - run test_qcumber.py --trigger_overwrite # # - remove the flag and test again # --- references: e_coli: descriptive_name: Escherichia coli str. K-12 substr. MG1655 url: "ftp://ftp.ncbi.nlm.nih.gov/genomes/\ all/GCF/000/005/845/GCF_000005845.2_ASM584v2/\ GCF_000005845.2_ASM584v2_genomic.fna.gz" influenza_A: descriptive_name: Influenza A virus (A/New York/392/2004(H3N2)) url: "ftp://ftp.ncbi.nlm.nih.gov/genomes/\ all/GCF/000/865/085/\ GCF_000865085.1_ViralMultiSegProj15622/\ GCF_000865085.1_ViralMultiSegProj15622_genomic.fna.gz" remote_reads: # Remote fastq data SRR390728: # Name used to reference data descriptive_name: > Illumina Genome Analyzer IIx paired end sequencing; RNA-Seq (polyA+) analysis of DLBCL cell line HS0798 files: - descriptive_name: "Readends (1/2) of paired end read data" url: "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR390/SRR390728/\ SRR390728_1.fastq.gz" - descriptive_name: "Readends (2/2) of paired end read data" url: "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR390/SRR390728/\ SRR390728_2.fastq.gz" verbosity: 1 # (0/1/2) seeds: [42, 4242, 1337, 20019812, 799821123] default_output_files: - file: _data/summary_new.json use_file_size: False - file: _data/summary.json regex: 's/("Date": "[-0-9]*",)|(tmp\/[^ ]*\/)|(\/home[^ ]*.fa)//g' use_file_size: False test_runs: # List of runs - kraken: False trimbetter: False threads: 2 additional_opts: [] default_to_simref: False special_output_files: - file: _logfiles/*.trimmomatic.log regex: 's/(tmp\/[^ ]*\/)|(\/home[^ ]*.fa)//g' use_file_size: False - file: FastQC/trimmed/* use_file_size: True expected_num_files: 2 simulators: - name: dwgsim type: read qcumber_input: True reference: e_coli opt: ["-1120", "-2120", -e0.0005-0.001, -E0.0005-0.001, -N1000] - kraken: False trimbetter: True trimbetter_mode: mapping threads: 2 additional_opts: [] default_to_simref: False special_output_files: - file: _logfiles/*.trimmomatic.log regex: 's/(tmp\/[^ ]*\/)|(\/home[^ ]*.fa)//g' use_file_size: False - file: FastQC/trimmed/* use_file_size: True expected_num_files: 2 simulators: - name: dwgsim type: read qcumber_input: True reference: e_coli opt: ["-1120", "-2120", -e0.0005-0.001, -E0.0005-0.001, -N1000] # local_real_data: # - input_files: # - file: ./data/WierdSpec # cutoff: 200 # - project_name: Morpheus # input_files: # - file: the/chosen/path/* # - kraken: False # trimbetter: False # threads: 4 # default_to_simref: False # additional_opts: [] # special_output_files: # - file: _logfiles/{{Prefix}}*.trimmomatic.log # regex: 's/(tmp\/[^ ]*\/)|(\/home[^ ]*.fa)//g' # use_file_size: False # simulators: # - name: dwgsim # type: read # qcumber_input: True # reference: e_coli # opt: ["-1120", "-2110", -e0.0005-0.002, -E0.0005-0.001, -N100000] # overwrite_standard: False - kraken: False trimbetter: False threads: 4 default_to_simref: True additional_opts: ['-S'] special_output_files: - file: _logfiles/{{Prefix}}*.trimmomatic.log regex: 's/(tmp\/[^ ]*\/)|(\/home[^ ]*.fa)//g' use_file_size: False simulators: - name: dwgsim type: read qcumber_input: True reference: e_coli opt: ["-1120", "-2115", -e0.0005-0.002, -E0.0005-0.001, -N100000] overwrite_standard: False 
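# The run file documented above is plain YAML; test_qcumber2.py has its own
# loader, so the snippet below is only an illustrative sketch (assuming PyYAML)
# of how the documented top-level keys can be read and the documented defaults
# applied.
import yaml


def load_run_config(path=".default_low_spec_test_run.yaml"):
    with open(path) as handle:
        cfg = yaml.safe_load(handle)
    cfg.setdefault("verbosity", 1)              # documented default: 1
    cfg.setdefault("default_output_files", [])  # list of monitored files
    cfg.setdefault("test_runs", [])
    return cfg


if __name__ == "__main__":
    cfg = load_run_config()
    for run in cfg["test_runs"]:
        print("threads=%s kraken=%s simulators=%s" % (
            run.get("threads", 4),      # documented default: 4
            run.get("kraken", False),   # documented default: False
            [s["name"] for s in run.get("simulators", [])]))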
QCumber-2.3.0/test/__init__.py000077500000000000000000000000001330104711400161250ustar00rootroot00000000000000QCumber-2.3.0/test/completion/000077500000000000000000000000001330104711400161745ustar00rootroot00000000000000QCumber-2.3.0/test/completion/_test_qcumber.bash000066400000000000000000000047301330104711400216730ustar00rootroot00000000000000# bash completion for test fun _test_qcumber() { COMPREPLY=() local word="${COMP_WORDS[COMP_CWORD]}" local prev_word="${COMP_WORDS[COMP_CWORD-1]}" local word_before_previous="${COMP_WORDS[COMP_CWORD-2]}" local line=${COMP_LINE} if [ "$COMP_CWORD" -eq 1 ]; then top_opts="--help gen_test_yaml --run_file --gold_standard_file \ --trigger_overwrite update_default_yaml util" if [[ ${word} == -* ]] ; then top_opts="-h --help -i --runfile -g --gold_standard_file\ --trigger_overwrite" fi COMPREPLY=( $(compgen -W "${top_opts}" -- "$word") ) else local completions="" prelim_reply=() case "$line" in *update_default_yaml*) if [[ "$COMP_CWORD" -eq 2 ]] ; then completions="-h --help" fi prelim_reply=( $(compgen -W "${completions}" -- "$word") );; *gen_test_yaml*) if [[ "$COMP_CWORD" -eq 2 ]] ; then completions="-h --help" fi prelim_reply=( $(compgen -W "${completions}" -- "$word") );; *util*) if [[ "${prev_word}" = "util" ]] ; then if [[ "$COMP_CWORD" -eq 2 ]] ; then completions="--help --run_info --test_regex" if [[ ${word} == -* ]] ; then completions="-h --help -r --test_regex -i --run_info" fi prelim_reply=( $(compgen -W "${completions}" -- "$word" ) ) fi else if [[ "${prev_word}" = "-r" ]] || [[ "$prev_word" = "--test_regex" ]] ; then completions="\'"'s/(pattern)//g'"\'" prelim_reply=( $(compgen -W "${completions}" -- ) ) elif [[ "${prev_word}" = "-i" ]] || [[ "${prev_word}" = "--run_info" ]] ; then prelim_reply=( $(compgen -d -W "${completions}" -- "$word") ) fi fi;; *) if [[ "${word}" == -* ]] ; then completions="-h --help -i --runfile -g --gold_standard_file\ --trigger_overwrite" prelim_reply=( $(compgen -W "${completions}" -- "$word") ) fi;; esac #local words=("${COMP_WORDS[@]}") #unset words[0] #unset words[$COMP_CWORD] if [ ${#prelim_reply[@]} -eq 0 ] ; then prelim_reply=( $(compgen -f -W "${completions}" -- "$word")) fi COMPREPLY=( ${prelim_reply[@]} ) fi } complete -F _test_qcumber ./test_qcumber2.py QCumber-2.3.0/test/data/000077500000000000000000000000001330104711400147345ustar00rootroot00000000000000QCumber-2.3.0/test/data/__init__.py000077500000000000000000000000001330104711400170360ustar00rootroot00000000000000QCumber-2.3.0/test/readme.md000066400000000000000000000025511330104711400156050ustar00rootroot00000000000000# Test Suite The test suite is made up of two main parts test_qcumber2.py and a config file called test_run.yaml ## Development Build of Qcumber2 - cd to target install location and clone ``` $ cd target/location $ git clone https://gitlab.com/RKIBioinformaticsPipelines/QCumber.git $ cd QCumber ``` - create new conda environment and activate it ``` $ conda create -f environment/packages.yaml -n qcumber_env $ source activate qcumber_env ``` - you can generate adapters for adapter removal done by trimmomatic by executing the build srcipt ``` $ ./buld.sh ``` ## Side Note - If you are using bash you can do the following ``` $ source completion/_test_qcumber.bash ``` - It enables tab completion for the test script parameters ## Basic Testing Default tests and their corresponding gold standard should be present in the cloned repository. 
- From the main QCumber directory change to the test directory and execute the test script ``` $ ./test_qcumber2.py ``` - All tests should only yield OK or a Warnings, if there is an Error you should take a detailed look at the test run. The QCResults folder lies within the created testfolder in the test directory. ## Advanced Testing and Designing new Tests Here tab completion helps to remain sane QCumber-2.3.0/test/test_qcumber2.py000077500000000000000000001620651330104711400171710ustar00rootroot00000000000000#!/usr/bin/env python3 # -*- coding: utf-8 -*- from contextlib import contextmanager, ExitStack import argparse import copy import glob import hashlib import json import readline import shutil import subprocess import sys import time import tempfile import _thread import queue import os.path import yaml def main(): parser = argparse.ArgumentParser( prog='qcumber_test_suite', description=( 'Script that runs a suite of tests on QCumber\n' 'The script relies on a .yaml file that specifies the\n' 'kinds of tests to run and types of data used during the\n' 'process.\n Authors: René Kmiecinski, BI-Support\n' ' Email: r.w.kmiecinski@gmail.com\n'), formatter_class=argparse.RawTextHelpFormatter, add_help=False) parser.add_argument('-h', '--help', action=_HelpAction, help='show this help message and exit') parser.add_argument('-i', '--run_file', help='config file used to run tests.\n' 'defaults to: test_run.yaml') parser.add_argument('-g', '--gold_standard_file', help='gold standard file used. It is a json file\n' 'containing hashes, and other information used\n' 'and generated while testing') parser.add_argument('--trigger_overwrite', action="store_true", help='Trigger the set overwrite flags in the test run' ' yaml.\nThe overwrite flag is only useful when' 'writing new file rules.\nIt needs this option' 'and the set_overwrite yaml line in the\n' 'run config to activate') parser.add_argument('-s', '--spec', default='low', choices=['low', 'mid', 'high', 'cluster'], help='Specify which default set of runs to load\n' 'Low: low tier laptop from 2013 4GB RAM and' ' abysmal IO\nMid: ---\nHigh: ---\nCluster ---') parser.add_argument('-v', '--verbosity', choices=[0, 1, 2], help='Verbosity.\n 0 - quiet\n' ' 1 - only status and errors\n' ' 2 - output everything') # Hidden option parser.add_argument('--active_default_run_file', help=argparse.SUPPRESS) subparsers = parser.add_subparsers( help='See detailed description bellow', title='SUBCOMMANDS', dest='command') gen_test_yaml_parser = subparsers.add_parser( 'gen_test_yaml', description='generates default test_run.yaml') update_default_yaml_parser = subparsers.add_parser( 'update_default_yaml', description='replace default test_run.yaml file\n' 'with the current test.yaml\n' '(Warning: the default file is tracked by git)', formatter_class=argparse.RawTextHelpFormatter) update_default_goldstandard_parser = subparsers.add_parser( 'update_default_goldstandard', description='replace default goldstandard file\n' 'with the current goldstandard\n' '(Warning: the default file is tracked by git)', formatter_class=argparse.RawTextHelpFormatter) # ### Utility Mode ### # util_parser = subparsers.add_parser( 'util', description=('Maintenance of gold standard file.\n' '(e.g.: gold standard for run or set gold standard)\n' 'Access information about test runs\n' 'and the directories and files they created'), formatter_class=argparse.RawTextHelpFormatter) util_parser.add_argument( '-r', '--test_regex', nargs=2, help=' \n' 'Test a regular expression on a file\n' 'It will 
then output the differences') util_parser.add_argument( '-i', '--run_info', help='Info of run that produced the test folder') # argcomplete.autocomplete(parser) default_test_run_selector = { 'low': '.default_low_spec_test_run.yaml', 'mid': '.default_mid_spec_test_run.yaml', 'high': '.default_high_spec_test_run.yaml', 'cluster': '.default_cluster_spec_test_run.yaml', } data_path = 'data' default_gold_standard_file = '.default_gold_standard.json' gold_standard_file = data_path + '/gold_standard.json' test_run_c_file = 'test_run.yaml' args = parser.parse_args() default_test_run_c_file = default_test_run_selector[args.spec] args.active_default_run_file = default_test_run_c_file # default_test_run_c_file = '.default_test_run.yaml' if not args.run_file: args.run_file = test_run_c_file if not args.gold_standard_file: # If local standard not found if not os.path.isfile(gold_standard_file): # And default file found if os.path.isfile(default_gold_standard_file): shutil.copyfile(default_gold_standard_file, gold_standard_file) args.gold_standard_file = gold_standard_file else: gold_standard_file = args.gold_standard_file if args.command: if args.command in ['util', 'gen_test_yaml', 'update_default_yaml', 'update_default_goldstandard']: command = eval(args.command) command(args) else: eprint('\n[ERROR] - Command not recognized!') parser.print_help() util_parser.print_help() gen_test_yaml_parser.print_help() update_default_yaml_parser.print_help() update_default_goldstandard_parser.print_help() else: try: with open(test_run_c_file, 'r') as testrun_conf_file: user_config_dict = yaml.load(testrun_conf_file) except FileNotFoundError: try: with open(default_test_run_c_file, 'r') as testrun_conf_file: user_config_dict = yaml.load(testrun_conf_file) except FileNotFoundError: eprint('No config file found!') except yaml.YAMLError as exc: exit(exc) tests = QCumberTests(**user_config_dict) tests.run_tests(data_path, gold_standard_file) def gen_test_yaml(args): try: shutil.copyfile(args.active_default_run_file, outputFileDialogue('test_run.yaml', False)) except UserAbortError: exit('Goodbye') except FileNotFoundError: exit('default test_run.yaml not found!\n' 'There should be a hidden ".default_test_run.yaml" in the\n' 'current working directory.') def update_default_yaml(args): default_test_run_selector = { 'low': '.default_low_spec_test_run.yaml', 'mid': '.default_mid_spec_test_run.yaml', 'high': '.default_high_spec_test_run.yaml', 'cluster': '.default_cluster_spec_test_run.yaml', } try: shutil.copyfile(args.run_file, default_test_run_selector[args.spec]) except FileNotFoundError: exit('file does not exist') def update_default_goldstandard(args): try: shutil.copyfile(args.gold_standard_file, '.default_goldstandard.json') except FileNotFoundError: exit('file does not exist') def util(args): if args.run_info: if os.path.isdir(args.run_info): try: with open('%s/%s' % (args.run_info, '.testrun_info.yaml')) as run_conf_f: while True: line = run_conf_f.readline() if not line: break else: print(line) except FileNotFoundError: exit('No ".testrun_config.yaml" found in directory!') if args.test_regex: regex, testfile = args.test_regex with open('regex_test_result', 'wb') as regex_r_file: with subprocess.Popen(['perl', '-pe', '%s' % regex, '%s' % testfile], stdout=regex_r_file) as reg_proc: reg_proc.wait() if reg_proc.returncode: raise subprocess.CalledProcessError( reg_proc.returncode, "Error while running regular" " expression on file!\n") with subprocess.Popen(['diff', '-dU100000', 'regex_test_result', '%s' % testfile]) 
as diff_proc: diff_proc.wait() if diff_proc.returncode == 2: raise subprocess.CalledProcessError( diff_proc.returncode, "Error while running diff on file after applying\n" "regex on file") return # Classes ---> # # Exception Classes --> class UserAbortError(Exception): pass # # Exception Classes <-- class _HelpAction(argparse._HelpAction): ''' argparse monolithic help Action HelpAction that can be used to overwrite default help action helpfully provided by stackoverflow user "Adaephon" in conjunction with "grundic" https://stackoverflow.com/questions/20094215/\ argparse-subparser-monolithic-help-output ''' def __call__(self, parser, namespace, values, option_string=None): parser.print_help() # retrieve subparsers from parser subparsers_actions = [ action for action in parser._actions if isinstance(action, argparse._SubParsersAction)] # there will probably only be one subparser_action, # but better save than sorry for subparsers_action in subparsers_actions: # get all subparsers and print help for choice, subparser in subparsers_action.choices.items(): print("----------------------------------------------------\n" "> SUBCOMAND [{}]:\n".format(choice)) print(subparser.format_help()) parser.exit() class RemoteFile(): def __init__(self, url, descriptive_name): self.url = url self.descriptive_name = descriptive_name def download(self, download_location, resume=False): ref_filename = self.url.split('/')[-1] if not os.path.isfile(('%s/' % download_location) + ref_filename): if not os.path.isdir('../test/data'): eprint('Error: Wrong working directory') else: cmd = ['wget', '-P%s' % download_location, '-c' if resume else '', self.url] print(' '.join(cmd)) with subprocess.Popen(cmd) as dl_man: dl_man.wait() if dl_man.returncode not in [0, 1]: raise subprocess.CalledProcessError( dl_man.returncode, 'Error: wget had non-zero returncode') return ref_filename class MultiFileOutput(object): ''' Class that allows for writing of multiple files https://stackoverflow.com/questions/41283595/ how-to-redirect-python-subprocess-stderr-and-stdout-to-multiple-files https://stackoverflow.com/users/4014959/pm-2ring ''' def __init__(self, *file_handles): self.f_handles = file_handles def write(self, data): for f in self.f_handles: f.write(data) def flush(self): for f in self.f_handles: f.flush() def close(self): pass class Thread_Subprocess_State(object): def __init__(self): self.not_finished = True self.returncode = 1 # Yaml-Config Parser ---> class Structure(object): '''Helper struct that unroles a dictonary into a class If a Class inherits from this class it can be instantiated from an dictionary containing attributes listed in the self._fields attribute http://www.seanjohnsen.com/2016/11/23/pydeserialization.html Attributes: _fields (:obj:`list`): List of tuples containing attribute name and type ''' _fields = [] _default_field_values = [] def _init_arg(self, expected_type, value): try: if isinstance(value, expected_type): return value else: return expected_type(**value) except TypeError: return expected_type(value) def __init__(self, **kwargs): default_values = dict(self._default_field_values) field_names, field_types = zip(*self._fields) assert([isinstance(name, str) for name in field_names]) assert([isinstance(type_, type) for type_ in field_types]) for name, field_type in self._fields: try: setattr(self, name, self._init_arg(field_type, kwargs.pop(name))) except KeyError: setattr(self, name, default_values[name]) # Check for any remaining unknown arguments if kwargs: raise TypeError( 'Invalid arguments(s): 
{}'.format(','.join(kwargs))) class QCumberTests(Structure): '''Global Test Managing Object Attributes: references (:obj:`dict`): genomic reference sequences default_output_files (:obj:`list`): List[:obj:`OutputFile`] test_runs (:obj:`list`): List[:obj:`Run`] ''' def __deserialize_references(value): return dict((x, GenomicReference(**y)) for x, y in value.items()) def __deserialize_remote_reads(value): return dict((x, RemoteReadData(**y)) for x, y in value.items()) def __deserialize_output_files(value): return [OutputFile(**x) for x in list(value)] def __deserialize_runs(value): return [Run(**x) for x in list(value)] _default_field_values = [('verbosity', 1)] _fields = [('references', __deserialize_references), ('seeds', list), ('remote_reads', __deserialize_remote_reads), ('default_output_files', __deserialize_output_files), ('test_runs', __deserialize_runs), ('verbosity', int)] def run_tests(self, data_path, gold_standard_file, trigger_overwrite=False): # load gold standard gold_standard = {} try: with open(data_path + '/gold_standard.json', 'r') as gold_standard_file: gold_standard = json.load(gold_standard_file) except FileNotFoundError: pass with tmpdir() as temp_dir: badrun_simulation = {} for run, seed in ((r, s) for r in self.test_runs for s in self.seeds[:1] if r.gen_run_id(s, omit_seed=True) not in badrun_simulation): abort_run = False run_id, run_id_no_seed = (run.gen_run_id(seed), run.gen_run_id(seed, omit_seed=True)) hash_gen = hashlib.sha1() hash_gen.update(run_id.encode()) hash_run_id = hash_gen.hexdigest() q_run_id = ','.join(run.gen_easy_opts()) data_id = run.gen_sim_run_id(seed) map_reference_file = '' if run.map_reference: try: map_reference_file = self.references[ run.map_reference].download(data_path) map_reference_file = '%s/%s' % (data_path, map_reference_file) except KeyError: reason = ('map reference: %s\n was not found in ' 'test_run.yaml' % self.map_reference) eprint(reason) continue eprint("Running simulators...") for sim in run.simulators: try: ref_filename = self.references[ sim.reference].download(data_path) if (not run.map_reference and sim.qcumber_input and run.default_to_simref): map_reference_file = ref_filename map_reference_file = '%s/%s' % (data_path, map_reference_file) except KeyError: reason = ('%s not found in' ' test_run.yaml' % sim.reference) eprint(reason) badrun_simulation[run_id_no_seed] = reason abort_run = True break except subprocess.CalledProcessError as exc: eprint(exc) badrun_simulation[run_id_no_seed] = exc abort_run = True break # Simulate data try: subdir_prefix = sim.simulate_data( hash_run_id, seed, ref_filename, data_path, temp_dir) except subprocess.CalledProcessError as exc: eprint(exc) badrun_simulation[run_id_no_seed] = exc abort_run = True break if abort_run or not run.simulators: continue # Run QCumber analysis if not os.path.isdir(subdir_prefix): try: os.mkdir(subdir_prefix) except FileExistsError as e: if not os.path.isdir(subdir_prefix): eprint("mkdir failed: A file (not a directory)" " with the same name already exists!\n") continue with open('%s/%s' % (subdir_prefix, '.testrun_info.yaml'), 'w') as run_conf_f: yaml.dump(run, run_conf_f, default_flow_style=False) print('\nRun Qcumber: %s' % subdir_prefix) try: run.run_qcumber(subdir_prefix, '%s/%s' % (temp_dir, subdir_prefix), data_path, map_reference_file, verbosity=self.verbosity) except subprocess.CalledProcessError as exc: eprint(exc) continue except (BrokenPipeError, KeyboardInterrupt): eprint('User Aborted Run!!') continue 
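                # QCumber finished without error; now compare the files it
                # wrote under QCResults/ against the stored gold standard
                # entry for this run/data combination.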
run.compare_with_golden_standard(subdir_prefix, seed, gold_standard, q_run_id, data_id, self.default_output_files) # local data for run, data in ((r, d) for r in self.test_runs for d in r.local_real_data): q_run_id = ','.join(run.gen_easy_opts()) data_id = data.get_file_trace() hash_gen = hashlib.sha1() run_id = ('%s_%s' % (q_run_id, data_id)).encode() hash_gen.update(run_id) hash_run_id = hash_gen.hexdigest() subdir_prefix = 'locald_test%s' % hash_run_id if not os.path.isdir(subdir_prefix): try: os.mkdir(subdir_prefix) except FileExistsError as e: if not os.path.isdir(subdir_prefix): eprint("mkdir failed: A file (not a directory)" " with the same name already exists!\n") continue input_folder = data.prepare_data(temp_dir, subdir_prefix) try: run.run_qcumber(subdir_prefix, input_folder, data_path, map_reference_file, verbosity=self.verbosity) except subprocess.CalledProcessError as exc: eprint(exc) continue except (BrokenPipeError, KeyboardInterrupt): eprint('User Aborted Run!!') continue print(data_id) run.compare_with_golden_standard(subdir_prefix, seed, gold_standard, q_run_id, data_id, self.default_output_files) # remote data for run, rdata in ((r, d) for r in self.test_runs for d in r.remote_real_data): q_run_id = ','.join(run.gen_easy_opts()) key_files = dict( (data.remote_id, self.remote_reads[data.remote_id].download_all( data_path)) for data in rdata.remote_data) files = sorted('%s%s' % (f.remote_id, ':%i' % f.cutoff) for f in rdata.remote_data) data_id = ','.join(files) run_id = '%s_%s' % (q_run_id, data_id) hash_gen = hashlib.sha1() hash_gen.update(run_id.encode()) hash_run_id = hash_gen.hexdigest() subdir_prefix = 'real_test%s' % hash_run_id if not os.path.isdir(subdir_prefix): try: os.mkdir(subdir_prefix) except FileExistsError as e: if not os.path.isdir(subdir_prefix): eprint("mkdir failed: A file (not a directory)" " with the same name already exists!\n") continue input_folder = rdata.prepare_data(temp_dir, subdir_prefix, key_files, data_path) try: run.run_qcumber(subdir_prefix, input_folder, data_path, map_reference_file, verbosity=self.verbosity) except subprocess.CalledProcessError as exc: eprint(exc) continue except (BrokenPipeError, KeyboardInterrupt): eprint('User Aborted Run!!') continue run.compare_with_golden_standard(subdir_prefix, seed, gold_standard, q_run_id, data_id, self.default_output_files) with open(data_path + '/gold_standard.json', 'w') as gold_standard_file: json.dump(gold_standard, gold_standard_file) class RemoteReadData(Structure): def __deserialize_remote_data(value): return [RemoteFile(**x) for x in value] _default_field_values = [('descriptive_name', '')] _fields = [('descriptive_name', str), ('files', __deserialize_remote_data)] def download_all(self, data_path): return [f.download(data_path) for f in self.files] class OutputFile(Structure): _default_field_values = [('regex', ''), ('use_file_size', False), ('expected_num_files', 1), ('file_size_cutoff', .95)] _fields = [('file', str), ('regex', str), ('use_file_size', bool), ('expected_num_files', int), ('file_size_cutoff', float)] def resolve_name(self, prefix, substitute_str='{{Prefix}}', path=''): out = [] name = self.file.replace(substitute_str, prefix) if not any((s_char in name for s_char in '*?[]')): new_file = copy.deepcopy(self) new_file.file = '%s/%s' % (path, name) out = [new_file] else: for file_name in glob.iglob('%s/%s' % (path, name), recursive=True): new_file = copy.deepcopy(self) new_file.file = file_name new_file.expected_num_files = 1 out.append(new_file) return out class 
Run(Structure): def __deserialize_output_files(value): return [OutputFile(**x) for x in list(value)] def __deserialize_local_real_data(value): return [LocalRealData(**x) for x in list(value)] def __deserialize_remote_real_data(value): return [RemoteRealData(**x) for x in list(value)] def __deserialize_simulators(value): return [Simulator(**x) for x in list(value)] def __trimbetter_option(value): if value not in ['default', 'mapping', 'assembly']: raise TypeError else: return value _default_field_values = [('trimbetter', False), ('trimbetter_mode', 'default'), ('threads', 8), ('kraken', False), ('additional_opts', []), ('special_output_files', []), ('simulators', []), ('local_real_data', []), ('remote_real_data', []), ('overwrite_standard', False), ('map_reference', ''), ('index', ''), ('default_to_simref', True)] _fields = [('trimbetter', bool), ('trimbetter_mode', __trimbetter_option), ('threads', int), ('kraken', bool), ('additional_opts', list), ('special_output_files', __deserialize_output_files), ('simulators', __deserialize_simulators), ('local_real_data', __deserialize_local_real_data), ('remote_real_data', __deserialize_remote_real_data), ('overwrite_standard', bool), ('map_reference', str), ('index', str), ('default_to_simref', bool)] def __grep_progress(self, input_file, subprocess_state): # with open(input_file, 'rb') as i_file: with subprocess.Popen(['grep', '-iE', 'step|error', input_file]) as grep_proc: grep_proc.wait() subprocess_state.not_finished = False subprocess_state.returncode = grep_proc.returncode def gen_easy_opts(self): opts = ['-t%i' % self.threads] if not self.kraken: opts.append('-K') if self.trimbetter: opts.extend(['--trimBetter', self.trimbetter_mode]) opts.extend(self.additional_opts) return opts def gen_sim_run_id(self, seed): return '|'.join( [','.join([sim.name, '%i' % seed, sim.reference]+sim.opt) for sim in self.simulators]) def gen_run_id(self, seed, omit_seed=False): sims = '|'.join( [','.join([sim.name, '' if omit_seed else ('%i' % seed), sim.reference]+sim.opt) for sim in self.simulators]) qcall = ','.join(self.gen_easy_opts()) return '%s_%s' % (sims, qcall) def run_qcumber(self, prefix, input_path, data_path, map_reference, verbosity=1): ''' run Qcumber on data Args: prefix (:obj:`str`): prefix/name used by dwgsim to create output tmp_dir (:obj:`str`): path to temporary directory data_path (:obj:`str`): path to data_dir where in the files that need renaming and symlinking are located Kwargs: verbosity (int): Level of verbosity of stdout (does not affect log) level 0: no qcumber output level 1: only progress and errors level 2: everything, rule and job info ''' reference_opt = '' if map_reference or self.index: reference_opt = ('-r%s' % (map_reference) if not self.index else '-I%s' % self.index) options = self.gen_easy_opts()+['-o%s' % prefix] if reference_opt: options = options + [reference_opt] # ev_loop = asyncio.get_event_loop() r_code = 1 with ExitStack() as e_stack: qcumber_info_stream = e_stack.enter_context( (fifo() if (verbosity == 1) else pseudo_stderr())) qcumber_log = e_stack.enter_context( open(prefix+'/qcumber_run.log', 'wb')) # Object that allows error detection thread_state = Thread_Subprocess_State() if verbosity == 1: _thread.start_new_thread(self.__grep_progress, (qcumber_info_stream, thread_state)) time.sleep(0.2) qcumber_info_write_handle = e_stack.enter_context( open(qcumber_info_stream, 'wb')) processing_file = (qcumber_info_write_handle if verbosity == 1 else qcumber_info_stream) err_mf = MultiFileOutput(*([qcumber_log] + 
([] if not verbosity else [processing_file]))) print(' '.join(['../QCumber-2', '-i%s' % (input_path)] + options)) with subprocess.Popen(['../QCumber-2', '-i%s' % (input_path)] + options, stderr=subprocess.PIPE) as qc2_proc: while True: line = qc2_proc.stderr.readline() if (not line): break err_mf.write(line) err_mf.flush() qc2_proc.wait() err_mf.flush() r_code = qc2_proc.returncode if r_code: raise subprocess.CalledProcessError( r_code, 'QCumber encountered an error') def compare_with_golden_standard(self, subdir_prefix, seed, gold_standard, run_id, data_id, default_output, trigger_overwrite=False): # Gen Run Id for golden standard print('\nGold standard checks:') force_overwrite = trigger_overwrite & self.overwrite_standard # data_id = self.__gen_run_id(seed) # run_id = ','.join(self.gen_easy_opts()) # output dir result_folder = '%s/QCResults' % subdir_prefix # Prepare gold standard if run is missing if run_id not in gold_standard: gold_standard[run_id] = {} gold_standard[run_id][data_id] = {} elif data_id not in gold_standard[run_id]: gold_standard[run_id][data_id] = {} # shortcut for active run id active_standard = gold_standard[run_id][data_id] # Replacement string used in File.resolve_name() to expand filenames rep_string = '{{Prefix}}' # Join default output files with run specific ones files = dict((f.file, f) for f in self.special_output_files) # Prioritize run specific output in case of conflict for d_file in default_output: if d_file.file not in files: files[d_file.file] = d_file # Iterate through unresolved files for prelim_file in files.values(): resolved_files = prelim_file.resolve_name( subdir_prefix, substitute_str=rep_string, path=result_folder) # Check of number of files resolved matches expected num num_res_files = len(resolved_files) if num_res_files != prelim_file.expected_num_files: less_files = num_res_files < prelim_file.expected_num_files print("%s: File Rule %s\nExpected %i file and found %i " % ( prelim_file.file, 'ERROR!' if less_files else 'WARNING', prelim_file.expected_num_files, num_res_files)) for f in resolved_files: new_hash = '' new_file_size = 0 standard_hash = '' standard_file_size = 0 try: standard_file_size, standard_hash = active_standard[ f.file] except KeyError: pass try: new_file_size = os.path.getsize(f.file) except FileNotFoundError: print("%s: %s" % (f.file, 'ERROR: FILE NOT FOUND')) continue if f.use_file_size: if standard_file_size and not force_overwrite: ratio = (min(float(new_file_size), standard_file_size) / max(new_file_size, float(standard_file_size))) # If user wishes to match file size exactly if 1 == f.file_size_cutoff: smalldiff = new_file_size == standard_file_size else: smalldiff = ratio > f.file_size_cutoff print('%s: %s' % (f.file, 'FILESIZE: OK' if smalldiff else 'FILESIZE: ERROR')) else: print('file:%s Checked Size: %s' % (f.file, new_hash)) active_standard[f.file] = (new_file_size, new_hash) continue # Hash file if f.regex: print('Sanitizing file before generating hash') result = queue.Queue() with fifo() as pipe: _thread.start_new_thread(hash_file, (pipe, result)) time.sleep(0.1) cleanup_file_for_hashing(f.regex, f.file, pipe) new_hash = result.get() # I case where no file hash was found in gold standard, # insert new one. 
else: result = queue.Queue() hash_file(f.file, result) new_hash = result.get() if not standard_hash or force_overwrite: print('file:%s generated hash: %s' % (f.file, new_hash)) active_standard[f.file] = (new_file_size, new_hash) else: print("%s: %s" % (f.file, 'HASH: OK' if new_hash == standard_hash else 'HASH: ERROR')) class Simulator(Structure): _default_field_values = [('type', 'read'), ('qcumber_input', True)] _fields = [('name', str), ('type', str), ('qcumber_input', bool), ('reference', str), ("opt", list)] def simulate_data(self, hash_run_id, seed, ref_filename, data_path, temp_dir): '''simulates read data Currently is only able to use dwgsim. If returns a cleaned up temporary directory, where in sanitized reads reside. Returns: subdir: ''' sim_id = ','.join([self.name, '%i' % seed, self.reference]+self.opt) sim_hash_gen = hashlib.sha1() sim_hash_gen.update(sim_id.encode()) sim_hash_id = sim_hash_gen.hexdigest() subdir_prefix = 'testData%i%s_%s%s' % (seed, self.name, self.reference, sim_hash_id) for suffix in ['.gz', '.fa.gz', '.fa', '.fna.gz', '.fna']: subdir_prefix = subdir_prefix.rstrip(suffix) subdir_prefix, _ = sanitize_name_for_illumina( subdir_prefix) subdir_prefix_path = '%s/%s' % (data_path, subdir_prefix) if self.name == 'dwgsim': self.dwgsim_generate_reads_unzip( ('%s/' % data_path) + ref_filename, subdir_prefix_path, seed, temp_dir) # Set up input folder structure if desired eprint("Cleanup simulated data...") self.dwgsim_out2qcumber_input(subdir_prefix, temp_dir, data_path, sim_hash_id, hash_run_id) return subdir_prefix.replace(sim_hash_id, hash_run_id) def dwgsim_generate_reads_unzip(self, input_ref, out_prefix, seed, tmp_dir, log_file='dwgsim_run', sequencer=0): ''' function that calls dwgsim as a subprocess and unzips reference if necessary Args: tmp_dir (:obj:`str`): temporary directory used for random data input_ref (:obj:`str`): filename of input reference out_prefix (:obj:`str`): prefix for files and folders created seed (int): seed used to generate pseudo random data Kwargs: sequencer (int): type of sequencer used 0 - Illumina (default) 1 - SOLiD 2 - Ion-Torrent log_file (:obj:`str`): logfile per default "dwgsim_run.log" Returns: (bool): True if success and False otherwise. 
In case of error check log ''' zipped = input_ref[-3:] == '.gz' filename = input_ref.split('/')[-1] seed_opt = '-z%i' % (seed) options = [out_prefix, seed_opt]+self.opt unzipped_path = (input_ref if not zipped else tmp_dir + '/' + filename[:-3]) # Checking if dwgsim files already exist dwgsim_status_filename = out_prefix + '.dwgsim_run_state' read_gen_string = ','.join([filename] + options[1:]) # Adding Simulator Object to QCTestManager if self.dwgsim_output_missing(out_prefix, dwgsim_status_filename, read_gen_string): if(zipped and not os.path.isfile(unzipped_path)): _unzip(unzipped_path, input_ref) with open(log_file+'.log', 'w') as dwg_log: _generate_reads(unzipped_path, dwg_log, options, log_file) # write dwgsim status file with open(dwgsim_status_filename, 'w') as genfile: genfile.write(read_gen_string) def dwgsim_output_missing(self, dwgsim_out_prefix, dwgsim_status_filename, read_gen_string): ''' function checking if output of dwgsim already exists Args: dwgsim_out_prefix (:obj:`str`): output prefix used by dwgsim unzipped_path ''' runs_read_gen = True # cmd_option_list = ['dwgsim', unzipped_path] + options if os.path.isfile(dwgsim_status_filename): with open(dwgsim_status_filename, 'r') as genfile: status_line = genfile.readline() runs_read_gen = read_gen_string not in status_line return runs_read_gen def dwgsim_out2qcumber_input(self, prefix, tmp_dir, data_path, sim_hash_id, hash_run_id): ''' creating a directory in the temporary directory and symlinks to within the directory to test data generated by dwgsim Args: prefix (:obj:`str`): prefix/name used by dwgsim to create output tmp_dir (:obj:`str`): path to temporary directory data_path (:obj:`str`): path to data_dir where in the files that need renaming and symlinking are located ''' real_prefix = prefix.replace(sim_hash_id, hash_run_id) tmp_qcumber_input_dir = '%s/%s' % (tmp_dir, real_prefix) if not os.path.isdir(tmp_qcumber_input_dir): try: os.mkdir(tmp_qcumber_input_dir) except FileExistsError as e: if not os.path.isdir(tmp_qcumber_input_dir): eprint("mkdir failed: A file (not a directory)" " with the same name already exists!\n") # ToDo: Need to check if file is already present in directory # and increment sample counter or sheet for i in range(1, 3): os.symlink(os.path.abspath( '%s/%s.bwa.read%i.fastq' % (data_path, prefix, i)), os.path.abspath( '%s/%s_S1_L001_R%i_001.fastq' % ( tmp_qcumber_input_dir, real_prefix, i))) return tmp_qcumber_input_dir class InputFile(Structure): _default_field_values = [('configfile', ''), ('cutoff', 0)] _fields = [('file', str), ('configfile', str), ('cutoff', int)] def resolve_name(self): out = [] name = self.file if not any((s_char in name for s_char in '*?[]')): new_file = copy.deepcopy(self) out = [new_file] else: for file_name in glob.iglob(name, recursive=True): new_file = copy.deepcopy(self) new_file.file = file_name out.append(new_file) return out class RemoteInputFile(Structure): _default_field_values = [('cutoff', 0)] _fields = [('remote_id', str), ('cutoff', int)] class LocalRealData(Structure): def __deserialize_input_files(value): return [InputFile(**x) for x in list(value)] _default_field_values = [('project_name', '')] _fields = [('project_name', str), ('input_files', __deserialize_input_files)] def get_file_trace(self): ''' Gets string that defines the files that make up the real data ''' return ','.join(('%s%s' % (f.file, ':%i' % f.cutoff if f.cutoff else '') for f in self.input_files)) def prepare_data(self, tmp_dir, prefix): dirs = {} files = {} # Deal with folders and Reads 
with same name for file_obj in (f for f_un_res in self.input_files for f in f_un_res.resolve_name()): cur_fd_basname = os.path.basename(file_obj.file) if os.path.isdir(file_obj.file): if file_obj.file not in dirs: dirs[cur_fd_basname] = file_obj else: raise NotImplementedError('Input dirs have same name') else: if file_obj.file not in files: files[cur_fd_basname] = file_obj else: raise NotImplementedError('input files have same name') if not files and len(dirs) == 1 and not list(dirs.values())[0].cutoff: return list(dirs.values())[0].file # tmp_dir = '.' temp_input_folder = '%s/%s' % (tmp_dir, prefix) if not os.path.isdir(temp_input_folder): try: os.mkdir(temp_input_folder) except FileExistsError as e: if not os.path.isdir(temp_input_folder): eprint("mkdir failed: A file (not a directory)" " with the same name already exists!\n") # Add directory contents for dir, file in ((d, f) for d in dirs for f in os.listdir(dirs[d].file)): newfile = copy.deepcopy(dirs[dir]) newfile.file = '%s/%s' % (dirs[dir].file, file) files[file] = newfile for f in files: if not files[f].cutoff or 'fastq' not in f: os.symlink(os.path.abspath(files[f].file), '%s/%s' % (temp_input_folder, f)) else: with ExitStack() as e_stack: if f[-2:] == 'gz': gzip_proc = e_stack.enter_context( subprocess.Popen(['gzip', '-dc', files[f].file], stdout=subprocess.PIPE)) fastq_stream = gzip_proc.stdout else: fastq_stream = e_stack.enter_context( open(files[f].file, 'rb')) write_file = e_stack.enter_context( open('%s/%s' % (temp_input_folder, f.strip('.gz')), 'wb')) cut_proc = e_stack.enter_context( subprocess.Popen(['head', '-n%i' % (4*files[f].cutoff)], stdin=fastq_stream, stdout=write_file)) cut_proc.wait() if cut_proc.returncode: raise subprocess.CalledProcessError( returncode=cut_proc.returncode, output='bash head error: shortening the' ' input file failed!') return temp_input_folder class RemoteRealData(Structure): def __deserialize_remote_data(value): return [RemoteInputFile(**x) for x in list(value)] _default_field_values = [('project_name', '')] _fields = [('project_name', str), ('remote_data', __deserialize_remote_data)] def get_file_trace(self): ''' Gets string that defines the files that make up the real data ''' return ','.join(('%s%s' % (f.remote_id, ':%i' % f.cutoff if f.cutoff else '') for f in self.input_files)) def prepare_data(self, tmp_dir, prefix, file_info, data_path): files = {} temp_input_folder = '%s/%s' % (tmp_dir, prefix) # Deal with remote data for r_data in self.remote_data: for f in file_info[r_data.remote_id]: cur_f_basename = os.path.basename(f) files[cur_f_basename] = InputFile( **{'file': "%s/%s" % (data_path, f), 'cutoff': r_data.cutoff}) if not os.path.isdir(temp_input_folder): try: os.mkdir(temp_input_folder) except FileExistsError as e: if not os.path.isdir(temp_input_folder): eprint("mkdir failed: A file (not a directory)" " with the same name already exists!\n") # Add directory contents for f in files: if not files[f].cutoff or 'fastq' not in f: os.symlink(os.path.abspath(files[f].file), '%s/%s' % (temp_input_folder, f)) else: with ExitStack() as e_stack: if f[-2:] == 'gz': gzip_proc = e_stack.enter_context( subprocess.Popen(['gzip', '-dc', files[f].file], stdout=subprocess.PIPE)) fastq_stream = gzip_proc.stdout else: fastq_stream = e_stack.enter_context( open(files[f].file, 'rb')) write_file = e_stack.enter_context( open('%s/%s' % (temp_input_folder, f.strip('.gz')), 'wb')) cut_proc = e_stack.enter_context( subprocess.Popen(['head', '-n%i' % (4*files[f].cutoff)], stdin=fastq_stream, 
stdout=write_file)) cut_proc.wait() if cut_proc.returncode: raise subprocess.CalledProcessError( returncode=cut_proc.returncode, output='bash head error: shortening the' ' input file failed!') return temp_input_folder class GenomicReference(RemoteFile): pass # Yaml-Config Parser <--- def sanitize_name_for_illumina(sequence_name): ''' Cleanup string for compatability with illumina naming conventions Args: sequence_name (:obj:`str`): name which will be sanitized (problematic chars will be replaced) ''' separators_used = [] bad_separators = ['_', '$', '.', '"', "'", '%'] allowed_separators = ['-', '#', '&', '+', ';', ':', '!', ',', '~'] separators_tried = 0 for i in range(len(sequence_name)): try: bad_index = bad_separators.index(sequence_name[i]) found_replacement = False while not found_replacement: if allowed_separators[separators_tried] not in sequence_name: sequence_name = sequence_name.replace( bad_separators[bad_index], allowed_separators[separators_tried]) found_replacement = True separators_used += [allowed_separators[separators_tried]] else: separators_tried = separators_tried + 1 except ValueError: pass return sequence_name, separators_used def eprint(*args, **kwargs): ''' print function that prints to stderr :return: returns nothing ''' print(*args, file=sys.stderr, **kwargs) def outputFileDialogue(outputFile, overwrite, interactive=True): ''' Utility function for output File: Checking if it exists and prompting user if it exists, asking if the user wants to overwrite the file. Args: outputFile (str): output file name overwrite (bool): force overwrite of file if True interactive (:obj:`Bool`,optional): is user interaction wanted. default is True Returns: str: filename Raises: UserAbortError: if User types exit() into prompt ''' if outputFile is sys.stdout: outputFile = outputFile.fileno() else: if os.path.isfile(outputFile): if interactive: if not overwrite: eprint(outputFile, 'does already exist \n overwrite Y/n') running = True while running: if input().upper() not in ['Y', 'YES']: while running: eprint( 'Enter new filename or exit() to exit: ') filename = input() if filename == 'exit()': raise UserAbortError elif not os.path.isfile(filename): outputFile = filename running = False else: eprint(filename, 'already exists') else: running = False return outputFile @contextmanager def tmpdir(): ''' Generator/contextmanager that creats aTemporary directory and removes the same directory after context exit ''' dirname = tempfile.mkdtemp() try: yield dirname finally: shutil.rmtree(dirname) @contextmanager def fifo(): ''' Generator/contextmanager that manages creation of a fifo ''' dirname = tempfile.mkdtemp() try: path = os.path.join(dirname, 'tmp_fifo') os.mkfifo(path) yield path finally: shutil.rmtree(dirname) @contextmanager def pseudo_stderr(): ''' pseudo stdout context ''' try: yield sys.stderr.buffer finally: pass """ @asyncio.coroutine def copy_to_files(stream, outfile): ''' reads input stream and writes multiple output files asyncio.coroutine and yields from syntax allows the function to be used in the context of the async package Args: stream (:obj:`_io.TextIOWrapper`): input file handle outfile (:obj:`MutiFileOutput`): Multi file object that is used for writing ''' while True: line = yield from stream.readline() if not line: break outfile.write(line) @asyncio.coroutine def run_cmd_and_redirect_output(cmd, out_files, err_files): ''' runs a cmd via bash and writes the output into files Args: cmd (:obj:`str`): command to run out_files (:obj:`MultiFileOutput`): Multi file object, 
for stdout of the program, whoose file handles have already been opened. err_files (:obj:`MultiFileOutput`): Multi file object, for stderr of the program, whoose file handles have already been opened. ''' print('Does something') proc = yield from asyncio.create_subprocess_shell( cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE, executable='/bin/bash') try: yield from asyncio.gather( copy_to_files(proc.stdout, out_files), copy_to_files(proc.stderr, err_files)) except Exception as e: proc.kill() print(e) raise finally: r_code = yield from proc.wait() return r_code """ def _unzip(output_name, zipped_input): ''' unzipping file and redirecting it by running zcat as a subprocess Args: zipped_input (:obj:`str`): zipfile input output_name (:obj:`str`): file/pipe output Raises: subprocess.CalledProcessError, if gzip has non zero returnvalue ''' with open(output_name, 'wb') as redirect_file: with subprocess.Popen(['zcat', zipped_input], stdout=redirect_file) as gz_proc: gz_proc.communicate() if gz_proc.returncode: raise subprocess.CalledProcessError( 'Gzip failed with error code: %i\n' % ( gz_proc.returncode)) def hash_file(input_file, result_queue): hash = hashlib.sha1() with open(input_file, 'rb') as to_hash_file: hashing = True while hashing: data = to_hash_file.read(65536) hash.update(data) if not data: hashing = False result_queue.put(hash.hexdigest()) def cleanup_file_for_hashing(regex, file, outfile_name): with open(outfile_name, 'wb') as outfile: with subprocess.Popen(['perl', '-pe', regex, file], stdout=outfile) as clean_proc: clean_proc.wait() if clean_proc.returncode: raise subprocess.CalledProcessError( clean_proc.returncode, "Perl Regex failed") def _generate_reads(_redirect, _dwg_log, _options, _log_file): ''' function actually calling dwgsim and running it as a subprocess Args: _redirect (:obj:`str`): file/pipe reference genome _dwg_log (:obj:`_io.TextIOWrapper`): log file as file object _options (:obj:`list`): options of cmd as string list _log_file (:obj:`str`): log file as string Raises: subprocess.CalledProcessError, if dwgsim has non zero returnvalue ''' _dwg_log.write('Simulating Reads by calling:\n%s\n' % " ".join(['dwgsim', _redirect]+_options)) _dwg_log.flush() cmd_option_list = ['dwgsim', _redirect] + _options with subprocess.Popen(cmd_option_list, stdout=_dwg_log) as d_proc: d_proc.wait() if d_proc.returncode: raise subprocess.CalledProcessError( 'The read simulation failed with error code: %i\n' 'Please check logfile: "%s" ' % (d_proc.returncode, _log_file)) def dont_fret_because_readline_is_never_used(): version = readline._READLINE_VERSION print('Nothing of the readline module version: %s' 'is ever directly called.\n' 'Including it, allows the user to input things' 'without losing his or hers mind!\n' '(Backspace works after the import)' % version) # read-gen <--- if __name__ == '__main__': main() QCumber-2.3.0/workflow.png000077500000000000000000001540051330104711400154340ustar00rootroot00000000000000PNG  IHDRNKsRGBgAMA a pHYs(JךIDATx^ |\uO+ZŊEPi5" ,j]MFX4+ ,"غ"ں%E\)6FJ U T[ wg9sd|`9 v-r};iiC8a @2/lI/1'/U#@|kG9ar=$:`Tioz éryg HO? 7NP0%o;HX뿘@&'t?}ѵR^ze30^y_ԲD7y xVi竵#h u[!g䠃k:(^:{Ƃ oxC};ߕ[N:yuv)Q_1z?/?+cY;rک/ǿ.CP:oٔuR?A~z@{nXgq&L0'53jda٥񦯙άoҤƒ]~g){͙]'X x'㚯ȑGN] @TB6V:,53GE>/|]Z}nXysw:'3jjLMd4*C#55娣N0vzo_=-kΜBwk#҉Gxv;ԤI(N@zo.  
{qvI޽ri:ءrxB)S!m;[q:~r<9Im@ x'>*gi^_@:&w-,аHKȊ ?zL뿾+{gsۿ!T mOk_5VAK5R) ٟSJ+.oon7}M^޵K.fSdr!ιI̮{'NՋ/GŜkO3^erW7cշ=\3U-?n/6JOc~"wt2}UdR:Cdjiko~e7K/{R|7ЫG-Ћim WOpaoIrdKk8FY0L Uwk@f4]>_ȚV'?n~9h.4?D=^Eի@!%ß.-OZ_k:7u'(Vŗɋ/$Mϔ/^}gl[@ 2w>k2hO݅}ӟʧ/KrMG׿ޓ_@:U?ț|zr؛d{nx|3ʸqY)|h)0td:M ՞5iOo-Z*2u{NyV8`XtԩC]v 'ޓm>4A[B/]ӪMug=vۥg~zD (=?蠃̸zh,[$"|Ԥf4i_B/8(t]w˿^yn=K/{3sw޵k @Lo#\}_.2=};.-/wsKSj@4_ttu-_w]nzIK4ܨ oxi|Gk&wػNzF~M*ַ!W_uEVm69ڥ(cJ|׽q!ԟ ; 7z :ꂳWO}c4 sq;6i ZYG\/ɿ,㨣oPj_@|Fj_@|Fj_@|F(goSD7K$#lrȧhttӱ @yVnN _;.R}M6]#r(%}vHcRf9 B -սSbՑO1ű% Wf3V *`)%@|,^?$s6/?0t)o$_SƬbV1KRd-y{]Fn۸z_囶*ꛝv|m52G.{~|ʔljh;OfM<%y۷nSJkrsyNޕ9։Wnu{;ϳO#Rh0UX!)zEK1><[W.2tgYºE{mY j[JqԕraV͑Zߤم499ƞ8:s5rE~>u8 tKjy؀̊ЉZA!BRPT S' <ɳZkѓ`jIL坲[;|HBd&[$?%X?'Sô ie|rv'jt*qo Ll4yHnX,K\)E;,BTMiO]yy%p5L|_yyfЯW,/Ns,Z]oZn ccrnގt~s;T7OI%F;\K~'?-_:xU.&C5MkxK:mbe7=)v#f IlMi>5ΛbdJ-Ui0/7%+S{J=>&X\Oֲ=6 <iGtΗf<{߱vȰNsf]0 WoI:q󕮓~ʠl~)]7eh4:SvWYy!~Lɜc!кm* k<:'zk&OMOw} u k{~gG>)x;cy[էitpE߫4ӎ\/:ֱjv\أ]2vq섶 ـzbp5>Q2eT4PƆI1JnʹOsP,J)C(:.|z;cPX"hE/qPYJ|?M{zƫV.Cd2IC4e{J{!~hg~ cCu/,(:JǥS@?.*uneJ)Ct5;YKXsY7Y$ AJ@VoCxCwkR;]f+]'q x\Ug-A5W]v=˺QE%zRR2mq7ǹmkܘ߮cMGt¢CNꎽ!u/G!m|b@ĥ;.bd5B$y2!O]O岮W,lHw!~g%:|kc9^|S%֖0{/zƷ׬\|RKSx?h/kqd#޹i_sy<ٮrY90ZiGWDLɎ.I=URt1͞vШދ,AR}et7媵^_JUor]vm#h9Cs::kad3^ZS c:~7%c|llк;+FVdET0%q\zS=GӳuOyiB$WQNcP.ܪlyo6zzJtgJ% H99u 1<|KO= ܇c"WTzL _` Wuܶ;B#r0]9G{- gұZc#"+VuN>+Ye] ;_kU (G_]#J|gplΖs%]cp|ϑ;ϋV=|%]*̹SγUǹN8+z 'Gs,l\$/y_}NK+7ub3oMwM3@.6>y?s^ X]ǟk"G9Ev)QB * *ބ7 *%tQ H *>Cw+K(CN岶 ?/\!Om{ @0/7Kf;;Yuv&=f9~";vY~K{>؄ࡡgZ" #xއ囷CzX+vտJCC !{\%ܼ܄O"%A4v)q_@|[G6o^{.xɑ7K/EmK/ɋ/dztڵ+vB?Ig94q%ַΉLf3#zm|Mviy{ƛ.Ql_Wp:g]f\, dloMop {{y?ZH3+( /c3ِ/yo-I&9:ێ<1+!Z)w8?1طo9!W_ݻwy&OoXjnrbs}R%B=W^8_s|pE_@#`{h:s ˾}7fqoyWo;ry饗-/ka?0e̲tw1~9Q3% |DOjw> >ѐk.8|6 q*ox!s~^f͚)o|B~QyGɾ}5||LKr~U}#L#<ҌIsH'f2zv:!%yaNy̿}y?aRB۟g\腠l𹟉gnE߇J/Cͼ_Pcx/|汵փϏqş\X ˟@wo֖Z^ޞ4ib36h:MEn&[/ _IVz_۞č?ޜZ2|v|Ѯ7oc϶~ϴj{Oe\)oii45~9 NͅۄFe[aؙ3ŠGUbMP&jՎ/p@O !w۔gB|i7ΔX=}6;awSP JVG?3W*=i<؏MUIQS튴m2Ѱes`>az{\6>D,4$#"OO֧6i){AIZАaw4.abW__鱾'4z9ᄹֽPDNܴ1lS$KCdtM5 6Ȓ%Kdx8F{ĄO|Usץ%z 2gvݨtR.Rpœ[SU ~N RW1Z)-hOL ?z1Dɴc5ߝ&3~!Z/iO=%x҄aMK' f^  Uɚi)Fr5>50MvSn܇yƜ{i\ '}=,oeӀu^zU$S/^?z'ZSJTz{.s1Q?_z?i}E0#fGKH}1}%xi=+R]X^ M?s>?Iz߱z߸,#J)|?>7w$b]I}9p>sK'w~myf<މߏ:!SOɪMXZm/&k4Bhm"-V{{]#\ٌϫ ||i kJFvW&OQl&]^|uwk3m^G{f}wΜ|I9Aי=nH0!>I'7aWR8=xo{R vUkjjRNsE48jZE>k<3fWŤ]lg_]1.cqII~B~ 0ÜšN8^>p\3Ygp~sL;^hՋIipi/~ЋB^׏|F^z%SIM}N)SGxs[:K1 wJ&]2v&E/(|j+*w6hҌ53jdaѱ""C,]Dɔy0M^~i؝ @i| @XBJ @i@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_Yٳg* @V&NhBj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|F9s !H}v袋gKܹS>?lָqdɒ%vZ$UVUW]eB|^zIw /|S:Jq93r9sϕXf;imm*ц Ua-[Ȝ9s:Hܧ%R?Z/)zu3^W^yE;z nON./3M2%rYgUZ W]u<Áce]fo|X`A`9B-|s8=3fZvU>ܹsM[נ[MS[C?䓼7400`B2mrEzzzLo|D-7d/6֭Us=Wo./+W{D?[:<ƍp۳g HK͛g/zNoŊrE6mdJy |D_ 艸ê=Mk{GO|^d~#t馛/`1n/wٲe&<|Ftכiʔ)v)//`imll4s4E 5"޽.@!B 5/`8p@on&_ڻwL6L;vKT:/PU%U԰\]1mryAam<14-3jʦ 7y]v9]ҩ'J]:v)V.@!ĉ\:%-=d'],$=-]Zhl HIH&^k``@,Y"]t]rCrѻZ; [DVGK6K9ok_`mm'NjCy6C/kϺiS}m⏗|D=rvC66qû!Ygqb| W)v6=VbwD%pn!u}ҲЯ>^y:i@71}+JO}f)Thܹ2<<,O>]vBX7H/c}05` Ҷ=Þ06ѰOڜesrEg\X+>>ǹmk$&u6>ݬ=^0ʶMoc 8#H0t~Vg6PN3'I~?^eԩv Fv8\:VpGSڈP1+975#*u 6;0z{w;qۤcDOiI&yԋfM"=הyeg؟VΛ7t{nQ X۷I2aPgӮxZ_ }|m}۱5yhj~Kxؾ#y'@6מP0@%2mR`m:*Oӱ85w`V@ 9Xy/Ӟ=GRe1u0oϣfAazWKsGD;X=P/`1+mqFXh9g(MZ tޔP58q+e:˞ :1-놆dfsE_S\mٲtsg%(%S a]-ґZG85 ЛXA5tyhk'Lӡ4d~XZP&rcJ}m d 2t증* ]Uq=Vcckb'"/9'lJ 7#>(ӷ/:v!Yޮ,:~tjNng;c;6sHhuO5uzӿ1Fq|Ej|fK/33]q|11/X(,=joo7SICV& cs/:3TEا]/p,Ps]8v?{t=P<(s4EFU m1:m3CN0ldvKSC}nXtSwOSUDd9aL .n n5K; /փioM+>@r@Ӷn汣ri61LGٳe.@!#j8Sbx1^}:4o粑>tP;+f蕶DZ|cc̝;.@!#NoVJB"=}f$)NY ?# $1SrV46ަC'wRweӍIn_(-}mr}v$iC۳/N)1 X+[xvii[Y3aLYS zF4'W6#z_HKp1I3?M'U>'cƶ:縲mǛ9H& 
8`pqF#"+22%`2sշQi8#~׀;ct$wBB0/X(,a-2SꩽǭN?tr}&dN_?HW[uD;-¥3eS>b厧5-Ջ3ŀkW}Ν; &ȢER${";*q|B/P(CL1MI,ML3-I. {T JIi׋WN:6[5ٝ7/O>Tz,Glm+q$<>YW뼲Uu!u򮗇)Sŋ ( up_ Riq$&LN[:Z mkhfnjgLI[w%v}gW&uj-ղMuhLO_{FfzJ:2Gy;ylX?Tԋ!dp?Vg꩓>uKzZ|/hyZ~}4鬗H6'H}c3R/Nsi>ݿw^nۧH~z"-αyaEdnp@&YܹeH3(P:w~ixx.Y}L8. ?֎;dڴifhK)M*9qM:M.M3kXu&7WeuH.cyKO*wYWv_`zs;q&]=U[Ƿs3YEsQS_ϗZm>R ^aE,hf:p]6QZEΪxV;4mYuf!>P$A4͌9X&h{Ӛag^I-=I8HwՏn8Oڮf;].ږXi^iK*f8x?9^q3֔8'b[ϻ=EamvA|<_C]W4kJ՞R}N-~ؔ}nHNO.!LkAҌ=-<5ug3fimkeo;JwIz¹w;3iKX=c:V<[/O7K%,ZssN6~z.p>w~1m|SrKz3݋PEΏ#Dpɯ Mj6SR{Rm+p_|H}BhZwNO[W#`6*vcI{_Lv昣O>79DO[b#X>mSm|s{>P{;=g|Ow~y p</$Q"tTt5ngu/~%_s?ww~M,& 8ҮS'|BUg IU掳"DٶF@ʐeZMfxzԦK$U#K'`Lm} ×%ЪffHuIouE,jfҎ5atwάƼh?3Xcĉv,{Wwv䦓w])iٶѝvE,`k;bgsu22 ɓeʕfk=^MG:kh(o5/J<4 NKu` onYu%ϱSc y\;4fحY\PDP25KK?|u18IB-_(@mm%pw%O`L3 Z"[CZl,*$λĭ-. 57IgS5: l@(鉯RX H,)Sv-K>+PMW:+S_^KKy{ZKë_g>ijAi֪3dFSmZtwl&O[x.*u 6Ȓ%Kdxx.ʇ i @M񒠑4\ Xm'Md;̗QsT;7oF~Y#3,DlȲ=;V`l"M:oV0HUC/H( 4txzvmgn[} j2U#2* TC0V|2Պ$V:y[FvDKufɽloh϶-N4:1:F-Vvhy{&NCh/7YOu&wwZgZ-u 4v[-nҦ![*vCBGK{|glҖ&o1"Nv YbAw˼c# ٹ]K{Ʀ2m ;T;2Q "&F97Dmm;ea(1/`iVfҎܗ\uNu(=uM"= Q9ֆd'S+#Ͷγ\%ê8(OUh۾%K] t]9?Z {@{nkߙfHX(0"zo|iarٲen|\M)qXh{e%(7_@|F;wCuy@,WI_@|ZCJ&Md|F+f7n] dgĉv XSNa3<={9/ P)Sŋ.@!@jkkeʕr 7%(7_@|F]vɒ%K̤ X:vṲc/cvHJx͐-2qD||KO^uV_uQru%𣟯f2/~aΞ= __}G_ 'xBv!1c{5k @)b~"xLE0k˖-f PL_c(L>.RmذA."3|FΝk|8p@>~hƍ|EUqyHO|uaagϞTt#GS[Qv.N~qc vsN{+є)SdV\m&O,&L4hu`?Aۨ۷۹DLCO6uH600 s'|29 wq}sr'R35m~Ȋ+m>__>([ /^Rtmt l2uk%Zr:mauF>:mI_ ᷍? ڦѮJ;묳"/]l{5gbr}c>0Tu믗{ ^Yn]`)2iuK9s|K16i Ed͚5@a\:=cE9裥.3JAS{p嗛饗^j=>[/L̓/TzjҤI~. 7JP#B 5/ P#B 5/ P#B 5/ P#B 5/ P#B 5/ P#B 5/ P8ޒ/㎓n.M>#m)}Juۏ#OK;[ZciIgss۷Rߋ'|nE[q鶹K.M6%J>gYKO6u𣯛_)zmsM76:WjJW\X =J|fԜ@5kVJɃw *] u ڦ\K^S뷾NcDJ_ ߯rR_}/j 0ĉ{Dkh 0P 0$dcժU-XGo|C͛'_|q`S~K_2'|RJ3WccSkkyoכZZ;P\:#̨R;gSVOSNYfV3Vzbw^AΌxo3M67P<~.~&;7PI "(=aG>bڲ{rI'{x|ghp>ؖ~%.(WTu-\fl޼Ћ@guV/팧]mmm/zQTpiSO=T  jT/(KZrhh|@YBOnoشWu=X@1y‹ (9mCa7@)|%U)Mfm1d:Ę[S4_1lϞ=vHO/bmÁJC'Ν; }g/J| *]t]@~cT[[k@x|ƠF3?."JnܸqrJ3M<.CUy@0a,^L:рP]]-HL@!B ܹsexxX|I|@YNjԩSӆ ̴~‹ 0͛7L:^2Pr۷IFPr{iӦiǎv)rpQȖa&|0Bl23 (9N}hK7nKPrTD>lb:_:3C֯_oZ‹ 0566IK#F! ՞+3B|%LрP]]-HL@!B (C˥JRV鵫OV5!{34 ^g6/}߷KPfZVs.ijX.ˆb.VΗΝkIWʼp}v38p."+:olkʦ oo&LNfdk/XL6L[nK  h6op UҺS2-ۮ<[jF>͂j׳q 8c\VsiFI4%<>ViuSs:,_[ԙyN@"JN{uVUNkYhK46DK;SAwueJzI.;݆h`,.Y˛[:7wSq= w"+vOskX^#+yMFp}#v}~y=vJFs~ {a#ۋ%~zWKWK}s I>.4uds"OlݻWVZe&|eF{gJdhDZdYCZe],8t,h&6 ~v5^lZR͗Ns9oKCRM#vVKVgLZ/3c!-Ycw^qL疕y$S.8҆}]r} |\~Y1\k.YdJC\n'֞6&&4ohrnahϴYiHR?!^w[bYwhOMm9N?鎵NR^Rm}lذA6nh"]byzާv-]I.XNJ}}b$s4B_@]R;!i&oUP><%{ʄd[ v9 ",ܛ gAwg(VΏ uk1>iVLi4 +hU]ONښڢ՜uA.ɞ}zVkCjwm}\(cy B.NV\iӧ%ѶݱF =U`|eofNjd&S-3DXiN~L鈥wO}fs-L| 9^vm2$Ѱap{.5y^bzY6! x!t"/6ӄ <$U2MbUrCv'<:n3UƖV|^v>gG}H` oׯTWW[@8'Ce˖[+PYvmuyc0Y/{s Y=U6ZKK$%N~]6E8ymDo𢡊tǕr[lEOKD;g'\'qou5iqȞSY|沮1M{|뭫:{E0<J^2NzRBNFJjS@iʔ)E%(۷vI>jdi9ԉ "w%άz%/*:WΝ\Gi=|ƠiӦi֭vI毐HtXOUM"=>J^K"}Ǝ)6֞Ȳ[)%>c[dh %U߁lذA,YB vFT -M: اX)-ڱ*3d[ت2=tM2{lillKRi8R7o6qC6ʯ %۱cGtIdfl } 2^Cy󤡡\OC|7q߿߼2rD܄ Ԟ@6SOgrg TTx6عsPi >mg%Zj* z* %Z]JZ2{r :n+%wwa׉syiBpۺq趱{bۤ^:Fi6:Kau1bCtǒpT-=rs^ o_bmƍM0"ڵkcuޥ˓._j]F˻}[lKg7cuc&siZb߾}EyX (C{"mpH۶mЀ;1 u;dI͂f0wu4w4Kt@Y`nhtH_!qeY'Jq(eh豤ɽo,|k6t,U{uFMz-baDtAr["Ͳ?X-[f:Ћ oׯ8?./8'DVኽ?TݻwfKHDz#Npûw:kDZ>9t}#E_褏.[-u>m9C>R)ĉb͑q^ &5k5P4>0oTJ|%Z-^LZr̔m*C-eR>ٶntL1e`jjY*"mkU궽+TwRiٗmz}%Z"li&o[^znmaCꪫ^C=TO.ws]/&O,+W4ΧV315R'I~_"&AjNNh@cs{Mz!'ƚms䬙Kù57Kn:ay@S};ݱε:Jj+v9[Ev>(G}u]244TWۏ;;\ܧ;oy,\;YIᷮ*_~JC..WH8,h'QM4yM}5uz3MIX(]%w\Y/].pi$մ֓}3{(^ֻڤ.~7XK6霬;O<ڮ7ݱ8¦}W;F3ۡɚ5k*4z9umQ.vc2Ȯüܢ'[!]o Ǜ1u*MazȃdSm}ok'|{@yӓ9[i6\:@/!`P (j. 
־zBufjFusy?M:>%Gmu3be[V4U%?\t,mS36\(]GnrY@|(:m:B՜9Oli]|mou&r}[ѹt w'rIufx:˸.F՜7ot(x*_.Ϸ"W>\T3&Ϧ3dt"+]GnɝΥYgO 4tnU:BѹIK&Md;S @o뮻N>`ٽ{_i h S7oٳg2 !۷˴i<[/Œbcxϔm2)w=m p+"+ykzO#/* n:.?n#7n!"[_T2o3/it#Z&5ބk tgʦ\zG2ݔy\RY!+beܸqn b"| @E|QL|C[ VhWOU [Dr{LYpWJ'TPczy1Q{RyʈG}T8 @a:tÇ|c B>7;, Nn{e]ÄβۚP]:qX5Cގ=/Uz&{/<mٲeYu~Bp SC`m'Ճ{WKWu=sYfwj'IoktŞ]Oy_(-}ݲ.=Ix=_vev !B_R:jzz;|RZ"Ŕe]ӞNf*ʦhڜ3 6a9^>1VyHyx{_-}nPOcKӎ;/-"O^˯7\/7ڼy̞=@XQ (;wʼy̤#FjgJ{}Q;f0/uپ}JCݫdtVux.w[dEm8o(Vw +RƷEz5K7&B ʒv&MK` ZxNj@x|Ơ+Wiv Pr{5Y4n8 EV1uZ%rE@ynM|d4a„Xp|@Y2e9nmmKZSnK` 7ovi@x|Ơ-[Ȇ /v%K,1$/tUVIT۷KUUJCPr{s@6n(ӦMO>. ?_@M8ۿZuV4{lill4%@|Ơכ `, JN;6vtB|%7uT68P"ԲeT/4 4_P$[;_:3CeI{ΗC͛g;w%^_1h˖-a/1 (9핷L wh@&7PAC}}qFI&Iss]T/%J.a֬Y>j{EK|Ǐo (ɓ'֭[_ nIoKPLMMM}'?h{UV>* 0"thO@&Z'>]b1/^lN:.Riu7/ZH93T#G%dKvuu%)Pyne„ OD۝|ɦjC=$zŔ]d޽Vϯj?uXZ^X~zʔ))6Z+EoǎVm\'ɏ>_S} },?A~Tm7Gyy_(k|G/~{ ŋG-[foCyW#gώ466F^~eHOGO#-K1{[ =﷾NvD+W]_mo}>?c﷾NA(uJ;~Cm{6ޗ6z Y6k֬1w ̙3Gy>=F~^Q#-*_UU2M6ͮ*hX~T=Q;t[_m9:}~mv*Lo}Q~뤯OFm3oAh vp',︟+w(o3/0r|PZTuj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_FЁ䢋. 6%֮]+(7z"j*sgΝr嗛7nx2o<;wnBߙZ9s϶K@9hllK.DM~r _|^zILb /ӟTN?t[nE?|38CfϞm"`ĉ~V^xԧ>eNPv%Gu~& _kOK}0a+/ߓrAɵ^k@͑˾}IMg=Q <}t&Wlՠ)9ROro#@ִ+_K\T-P,<ۿ|YgQ ȉMo2Z{iQ @ |AW{o>C{ 7`ȖW(PUGczԒ;O{pXd.)LpTL:dzQZ4P/mҲc{+ɓMdыnsg㎓n.M>KKO6F&M_d:G)Ym0kv.GBiE~/F:А]ŋ#˖- WuwwŬYl4<c??f\9Wu֫_Z-HUU'uJʹi|R A̛7ϮJFL{xq}/gK6:IMUIoo}ٰa:}Ax+॥ ,fuV{صh"Z7oW:~R[Ms=ל@%˩+LhٲeU5@IHxbYr(hFY~Hl]uu [q."s,h!Vt% 8>뗬OdܹރJEUgdϙ3ǜe|?OnH۵Ho_UF>ޯYC0*GdƍrI'd~g>9:,-Z<"-Zz~5TTVoǫ\ L ͛7dT/JIO.@3U+|7Q4`?hd0a,^L:"|$] {P~ܞy@\TWWK$1Pi7.\C$YF9D{R8U ۿHO:.ڵKke~?)^L+k/JB3ٳgs5B|*ּ|gp>ر,@T354e3ix.ӧOZw>|"?-B>pnjo[?FPոb/^@!X>ZΝ+2~xFq|"8`.Q {<O ,wב_PZZ2j*SJuVKC?xUG4; >}Ck?B3P< 6k(=tJWHVBN(~Vjjj _B,j痾%y_6\y& VzUھXнeݺu{nټy%Jð6ёE ')zwwCO͛'gy]bГKm_B n&aV]ְ;44 Zypþk1Z#91r|477 u]{gȷ-S-ՠ' -?:(vGeӦM&jJ\Gq|3rJG??0"N$Bx+mҡ5%فo{9{d\kտs,}-w|>ls~. ڣN|/ W *zWfЉiכZk'c'*z'+=1;+ [:I Ň~kU>2ei4U!4$NjZ. UVx_*0]35,w^ҎL*uCh{8m綍UKT؋c۶m&*9ꨣ_5MFEIp uܹv Y"ҵ'.Ֆʦ qJtG }s0"MKWʟ8@Mos"w}C;.ŵ V;Q1HNN-9~R'^ct^ޜo[/ Mob]%WDKI%*I-\Қ|;y[*>^i&}NlojZ:孲|yCl ;}Gsiħv<v u-ޜ_{5PrѝonhCgN6/)B"Hlq6Yd M]Ϊv[ݰ=:}͗R/-$KSB4ŶPn+Ͼ=yGU?ȖiwrɃ 6ȼy/6zO qk^}sov ܌f5>VTsn^S KM U.$6髟O̬me܌9ѓڽΝ;c':dAqPF1|DƏopCC|/|l߾}+*Mpm8X[|رc`N`4Y,Z ]-ґZG8UfIOwF4kOZ58ғ] mc?-bhH̬UV3<.A1 [j4P^:,/yͼ,\s |s\uU-psڮ]LNf̘!MMM&nڴ|iO^7ԩS.3mpz5wqr衇0/7aRkwC Z][{a֒\=F {LO~j*>??v׬Y#^x*4ik1eˤ.-~gt8?uIB~:(<m34̷A8KK{AÉʣWܪw{,Zo6^z)=Y*l1<wɏ|t|}環cslwڅoدo~c=YjU_b#'pB#=hOo{"'|rh#wygY/뙱W^iPԮvAH:Ou&" ؛.-ս5$drG#L_մ~h;­b%ﳞuZ<ͺ}U} -:\>K(/Z2*:ib9牵&;`gs.8u}3 κ1M}sLqg%opiϔ)SdfBqBi J'TO~qBݡեݰf/}K vZڪ%ɓ& uSO5ᅮZXMg]ZU Ti֞pnJwpn^BcdŮZ F:=W^Jꥋ7ʘn>q]+[W|#e(-6kDŽ9W;gLyf^7\?eEՖh{"IS{KPy{s'&$v)zsP" jvHK_اq=pYۼCOD􅔮N}n=" ^ Z%-KP><[JOI'N74o2eݠNjks>oMte)b A|j5jjDyp[/*u2d{ѓ=YѓbzE%l]X\.KY->,g:ߧ#A'3Ìe3\O?m$έwn醅M7On2_s9l\~'c1J|}.}}K\t~–-0?v;a0xn缮K҉hnfc7 ܯٷl;Ƴw|e̺1~'w0qf!9Ucĉ\&j|$a%ҡkcuue.HKWT]`obuFencgԑp#1i1;YM]i{|?WpN )7w]b187l`&(}?h3^qV,Z96vxmU#1N=7#e?~UƢ&$DE۞].Bc'B{ܶm弽gnߒ3Egd;T$z!ryLiKM/g+/ |{˦MCQR ,`(T?˥6,~3qx3G|ϋ +۷ogW_hZ?[NcǞfX9YT13O==.|ut^@%W8 ٶm+LZ(M֌iJGߌ,=%*}'sFTYӂ d„ f|o[eQ D6[υzo;k5g3cL*dos{RP2;z-] [l?, s9#maJ7i ɼeZj.O%ߌouu466;N'+(&}?iua ysbrJ,m} {̼n0,l 5YֿII;Xb7(ׯ7W+JOҒ`*^)ꢞvșQ rT=hg4Hm}hi5/WtK~>~ [ ('lD 6t7n]ZY$?V~Hy ђJ6-+ xulX|衇ڵD{)x:u}ߵ mbM;I. 
IZbвz|XetN^2]vǻ6C>].ýݔWl_f]JSIQ/axP; v{KGv?t8C`o9*ɳ?]7,Q¶}sN,W_}|VG0y_ooGzùg2m|%0 ffĘ*f^:4K}߶h)t_K 92{l0V߷CkdHZh"yŪ\JvSZt$F.X Mc%MTˬ[W`j͐`Z; βAov;kvbZs@?ڙ᷾-y["r( ΗT{WK|(*0F\ߖ8ǪA';/xuhai?U}IՔY^tSba62T1+c]vj16ߙV'= Hڄz'`VD%5wƫ>/pi{- 1:Cl5L8^:rjdiGK }w}]e <s p WJiN2#]Dz|؇X:e]` (*mO5B,Mnka Œq5\J_t]4v{q/q޶s fj__ ٴYtJ5L!E]{l{8zmb@XS=40G@9 ah6:uȶ Ѯm qg(X\{3=`F9W -v KcW2\UsZzW DvH84Ԋx.؎t.p ;KokBaG h- zy[Σx8kI;;3&)02~DeժUvI Ոvnip JP ,Β!luq=W^tECo`RsU@6l0RGrL>_ n`0,vfݞkZ:$wLR3j[8Ɂ;fe1zN@|'*:ceVYGE]$}!0kjoZ-]'w7Nc 9誠2m7P^wgJn!ބK'`C۾6w?!"űk.YdtxƎ6 >K6jR&h;^+%FCvXe>BZ_$u 5t,&S0q;[T_;viӦ 4* пthWQ]iՈfn+/k`u=*l6ЫBpUCapNpMh͞Th;Z3"n3ch/MW׳HS<6߉ުz]ŤIԚ|:阾EkY5~kSmۮXڙhV>i CxT|I#yy! ++3'/4W.ކO/yu@k'-@/6o n(î쫂z%Pڵk\C%)D'zL{^1%f=HKh@!8-Q/zo /*]ߕU=+tF;5=0yO. MtYUŠzNn( NnGbU<Zn:=+EuGG2[Oߝ"j離H3=C) x%ЋF;jll4vtUI;D6닷MVOr$޿CB6&kji0Io[t,Q_ZṳdF'NsC *(EV02_=A;.)v%+lW9YL"ׯ7W`ix&{])';lKM [*:='1?={|exxX.2Tؕ@]d@eK 1ڱU9t*Q6)x5ġKy^Hm#&J 5tҎlM%5K;ɖINo:]NyO_u7h/u*u< z+,|Q7pCPHdsXWu?y7pͮ.s-*rI'ɉ'h үT=&juoo_'O6%eviv||F?%Sn>}˕Y-IKuz)x3vîswt/__˹ͼV[1 4'0:Ӏ`_]APRb[ ;v0ǥ۷o7oLuoMzLZ[KT[K ^U矗~bI&y)@ p[n5ag? z[ gKbz_r]<7T%6֓MX'Dʇ~T]z饡iӒiӦy*7J&|Pk?ޔ޺>Rh]g|"'K_dKK0ɒBE .I_5 {~;{˟Vn [N2_w$/r'*Zo̹5/mC?ZR'rUs(DfS ȗvVxb3_Psccqv ȗԻrJ3d"QnŠis`!$$EOVh#B (9LLtFZj*sa-[L_02}{DGḰ\<ѡZ(b2m4Kʑ1߿cf~֭/M7$ r%(or%Ȓ%KKq~8Y_NT]vɜ9s䪫 'x2f}}}%(:H>h3on]2J1-[_ }WǏo>r)1ٸq<fO7Q:mI_ ᷍? f޼yvDfR{$%%tI\(uϋr&[-כ'P z&=Mztw}vD;v]_m:i '|6_~]#U6Z-$ޗ6z ~ F _Z]>إFkh  'NKvav89|R1nؾoW3I9msg P";tEI 566[q~ .?A(zG&noo]6FZEO6u𣯁Ut:+Z2<Fk3 Aۨ7=:Y#d6h}:ϋ^?/k,6|_A=>k/t]~JQAc(H' D;l?Ǐ6J;G2} ~Lmm-*\6„//y} 0_;V{rgӋ /`,"U_*j_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|FZUaƼydܹr%@ehllsYdTWWK{{]Rboڵr7릛nG}T֬YcoeƍP8 ;v|?_XD‰ Ԟ={dҤI{n8q]0UV?oolÆ jʕv 0 !EӦ Z>}t`'x[€ 7sϕ>[/^l`,Wg@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fj_@|Fyg-[l1S֭[`~)9ey޽V|A{ߞ={v_ $y)RoO?]jkkҸqIgg,X7nذAN%k|dڵvI޽@GQgy"h$ HD KPHpqVQ0;8u'QP`TDPF@QQAAWTTTFaپ7UM%tBB:IwsNt:I{ r~ҤIl^|\|ŧ|ӧKy q'cQnr=H:u69p@Yr?~\K.%iҥ/2diРhBn*|uYo>i֬ _kȑtI׿>ׯbEEEҮ];߷M6oʑ#G.u9|P׺uknݺYЫҨQ#nfy饗gAK o&3gδQt7n,G?,)))vZ%7g}֞OGxGiذB//3?رsՅ J7nlAE7.\ E/@#eоK ''N}n믿{gCʡDG6ol/t'U}9f|rhԾL}q&4>3NOn/j84+yV`O7|3 &1j ,ڗyA 16mre۷o}X}v6lX,/PڏW3OSïmлdԩv9WP@i__~M 5@Y͛,q;x`;@#n+V1G#}dff&=_4p ַw@ȑ#z}ꪫ~iذs Tijܹo嬀!''G5kfIRRs tW5 v{ !2eqjݰashzw)#F`"BP lTꫯ~N ߉ؤwҥ` .p|*ٳ͛[_[2gΜ`ށ1"qƕ]l#6|?A@GFi~wa ֭}ZWs A6m,v<)y !?/-[{饗:WI@ 0Dɳ"kCcǎcD/F:;uT;vXo;":uXkSÑ_I̺ufr |0ѣ;ʝTcD./P ƌSWA{= G7@d#DGGm}/СC!Ɔ _~z✫d_hߋ/A)ynz_|E9-с Tk¯ jpZQ8[RAG-S򜧠l/~pWG=@5-j_'{ℜl92[ gT; ?b@t!5@G : YoM.;jʹhAjk֭^0$3dtNIٳ3OIgzmKR&i=))ε,ɗI>'=AEݺu->_5*X&aJFk[ޣ 1JR푆42 JAKV^빒&)^a9h;(Y М#2ӹ}$[:\Po}@qq SNu |*==E5jhёVwTx}.RϐQi3%#vBr7+Y.n@uNRGȶ%K},Lկ_F}OuV׶e$n+,KMU([oK/?пȑ#ы Ԃ)SXQ OZR%7#_9}$=9G͹!ʙɣY]%Q<_(Q^otdټyޯJƍg~_h4hlڴ–ִ\ϒ9^IsJKvXFIpʠ fmT>K ܯK+Y.u l'ĉwRH/aҳgO4oܹ hwBԊW_}Uٻw̘1CO\4:sly Oʝw) 4c?0 ԢÇg ֒g/E{뭷}:K/P4h?^K _gso۶J"_5mo^TCXu h?˜u}ѢEҶm[ ;wv/@ݻƍSO=,E?~wnc 'O4i)ƍZ|oȹp?+q~w\C@ر:X)%CCg+-߷@IKK j@>T˃(YKo_ iĉoVPP\AUٳ>LOeС,]#@:묳,]qևWRQ5zWڷoo_--|lsjhÙ_,qЫ% _ M0!﫡7ްcTN~~}p?X 9% D8 iÆ T ڧ-~r ZJA"\v,vS)y瞓&MXMIIq V|(0x`ys1ʗ@Cﭷj-_ Jh̴>U ڷ}gzw%_=}1 D Xxꪫ_UC"4p //JvE9Wk@ٳf͚Y*=%J bM_ ʌ7.Xj⤵kZ駟$##g|h#GU ~ׯ+BoÆ U_ ]ptb#<;o<9s\@,#Qj%J7r71@zի\N">q[⋭%z,X -Z{饗:Wo\\CW^y%5;֎Ph0}OnjHnFy|0xuY_ +Wnݺٺuk !o(?E?88/S2` VZeĉrt* t/лek >Kobb[Q[_;B.T>}X_~hp``;zh;* Ān¯j~hi& _3^/#444H6olaRf#ٱc};ңG{ZrGbD||meH#<ԩSB5\\* ĐÇXHg#ъ+\ԩS8_ -"p}d߾}o.]w}2/c6mja.Y Ǐw>^dng׮]+!1w*7nl} /ߋQF1P_ FM<9aSjkG}dモtJ6_ i:tjܿsf9rJ}ILLռys*P5_ uBfJ5H\}@KKK k>ۚ)S1._;qDaJڽ{ܹS 9‰ @>lm˭VïV7 K.mZ荏wC`-6im5V9sK5>܎p#0aB~e˖q}-@u (AG_ f}~7>lwݺureYmڴs?/." 
:upp~Ȋ^Ԧ1lϻvQ`QFb&M9sy] G|Tښ5kr1iݺEEEpBٸq=';ÎBpFtDtwr;gCp4>oa%:=: D/Ы#FpYW_}%_~_a7g}<"k_|Fk_|Fk_|Fk_|Fk_|Fk_|Fk_|Fkq'c׎;&{u!:w,>L<9ԪU+iҤ |3v%;v:8gN-ZHÆ 3-EEESOĉ3q!i޼|ҬY3,?#GÇ|| 5/_#| 5/_#| 5/_#| 5/_#| 5/_#| ڸq̙3G9)I?vZ ?"·z)gϖ6mX=z׽>>'/|w:tH~ҥӽ>z|k#F}ԫWXu ;vr~/ V|{<4n؎5$(}hL4CO?Ԯϛ79{N8ǀoW={XՠЮ];ٹssnV˹+oerWKAA,X@6m*F9_[ K,qO?d/"e𕔔Ϸ[?֭k~XM< |3v*|n۶9 >\{L1/1_ ^Q~C~vssLy'-Kzg֬Y#  7J߾}2uTyꩧ+g޽֗ˎ;@IڟޥKІJb7ip֭-]}ם+@,X vxb1b4g^rӢE 8qzr饗:g|cВ%KdȑҪU+ٺu큊4i̙3Bߚ !X3PH3O#Złncر7=#/se3ft݂D+wL>ݎuYG dKC}]9- #R+/iӦ3 4ib뻺vHi駟e]&?|ᇒ\^+VCZxywGe˖YM>oڨy3I{np}^:v(v풕+WJ~Ku(hsRRup=c=\~D)%ht"!ӥgtO7k߾LĚO>$XUhu]t[S\iH9xϞ=6JTUvvL2Ŗq%|}?DV:1~ym[z{衇t]pMB%W GZnmm Ch"B/NQNYu]^VHRXXhxpWؾH~G;D|;xƌ@EҙmuMg-ْ8?~F@8h СC# 1/h;w|.g,Yb,ժU`~V>#t:f>v+ 1bO:sϫ$--&ڹssT5=P,ep O(|rwlڳ[QI /t O>dp-Si5zuMV]y۶m6w^2@=X־կMJJc/b@2tYgoVy5. <̛7kCE!*{z4ib%+W$:4;ҹsg?~< &|Hvv<3ҠA+]V<}\eŊC9WA"姟~ɓ'NGgiokǏ<9(ݻn=PQ7t9rXB5jo^b 7Jl޼|ׯo@et o8zsU4K_#| 5/_#| 5/£pI)[.,9)':q<*0ʐ'wMO=9P2z\m3D/W}2+9_xsm !Y$T /j_(ɐVf8.dJfCy,=;3X>筟.Qf]D;=[BKy b|Q Unrt/1* V^o~L-ΖT i"gHZYWdee+9iNu ,mu%ڞm,Iru2-޹ b|F!kZڮ2B,Zi3%#uO7+Y5qn SF=@T!"BMJUQBْ_`(]K|7-kH6),"˔.= eMT/Q^zKyUke}rg2n_ԾEp`Y: 2mW.[<%U iν%dINΑEɁk0K0aDE-˓̴Iu_%ˡ?Lҽ!$(RP[##&/jX>≨օmx.WߤKz-|Y۝S%{,)pd Pͥ]<y('CF sE'g /ugfrF,ß=V(FExXIi&zVUuB>?byTv{%ޟNW[!&eU ax&QӊQːS]+Y N0-@e pXMNjpprpR?Q)qfOQ ƅBWeKoq2𪐌Ӝ Re[/`'@^3%=Cd^>20{:^_`S|2]E|wO,-Q.\6y[*3EE ).Q<B'HQϼmRP"܅G|IP ֻP9k0{:_ -GΗ,aa awzfG [$32YzS:YDk#a,@!@j#$roYIԉJ\Jp@vg OOfB:ӻ;§8^jrX*E1{:.q'F o̙2c 1b,^9jc[dfefp6n(IIIv\? ;v]vU7h3>'ȖV\mذAzik_T5ٙ={h˓Lw-%Nɔ̔RqDy眳MzO v}ۦu%nkbovqϞ=sγ @_TUom V?>k׮ ֖Gâ4Ȯjٲep$K_믿c Rz݀}:͚5d W\qE0@_T. ~w-݇r]v ^euEoMGm۶9wj+/\ @l F /ʽ*YzupD+҃/ޓ߶={8WN֭\|Ŷu=x6o|ʦvɕW^i_~sdݰaeO({Hik;PY[63[q:#˗/KXLW^wh~z9dl.e6Kb#F /ʽԩ#s8iHKK~aÆH Zc>ԈuЧnz˥{tOCouΖ)ܲnF{C!o ĘT;wr-z5-X@Ə J'\۷wy.s NVC< \rN2fֿ!COsnǏw!ҦMMNM(FߚV'rínmi vCuVP\{HC|FB Vmܵ^#mtO{Lw-]  N.N"F oRG 4lgYktTG֭Xynݺ`Al߾}cviVxˢX7kh@{H?p ͛7х^xwG꧟~{@7;:-?]tݥcJ*%Sí\ d]|ǝ:u >vmi8sGu{uD4V7 _Y±z]"24#ek%] =C0{ȴ>@|Zܪr߶mնG* 711Q.R֥Kjd8z=x;(vXWIFu۳gOػQG~VJoyʖ%s'iR/uGtʔ GL ՑңxrM7INNPQGߧwDWh|(-[ W_Z15%k8;#/lv^Z2pر|HiuW_}q-;] r_oprCdG z .Y}K=>|5G?pCwԛ XCΐe:AXiɩ.4-] ZN]Pғ;" |/AOG7 `=5pMeigɩtR{[]: Z~~O%(K;r~zg)v[h<#2}g%f.}Qn-۷opI,_Q<-vGuFtY]U7]Gˤ54FpԒe >uۼyPjיuZ}4 v|}J^q˃@a3w4 on9ec/hʝ\{4ر#r5z)Mg>Tw! m L|c^ :ӯH;9Tmp/˂n,T 1N'rGbuꚯJGou2tV7nWGݑe@u#¼%`V˥uc ! :uqXIENDB`