rtax-0.984/0000755000076500000240000000000012236215350011523 5ustar loraxstaffrtax-0.984/changelog0000644000076500000240000000271412236215032013376 0ustar loraxstaffVersion 0.984 (2013-08-07) =========================== * Fix tempdir calls for wider compatibility. * Add NOMATEPAIR classification when --single_ok is not selected, so that every input line gets an output line. Version 0.983 (2012-09-05) =========================== * add options -x and -y to reverse-complement reads 1 and 2, respectively * fixed completely broken fallback to single-ended classification, and miscounting of NOHIT cases * fix idList command-line parsing bug Version 0.982 (2012-07-02) =========================== * getopts bug fix * try to fix issues with TOOMANYHITS cases Version 0.981 (2012-05-08) =========================== * change all references to "/usr/bin/perl" to "/usr/bin/env perl" for wider compatibility Version 0.98 (2012-05-07) ========================== * ensured global alignment with iddef 2, i.e. ignoring terminal gaps in computing percent identity, which was the intention all along * Fix intermittent hanging due to squirrelly interaction of usearch with named pipes Version 0.96 (2012-03-01) ========================== * Make fallback optional. Store FASTA indexes as DBM. Add fastaidx tool * Combined pair + single script with fallback Version 0.94 (2012-01-12) ========================== * add tempdir option * add idList option * add id regex option Version 0.93 (2012-01-09) ========================= * Ditch Bio::Index::Fasta Version 0.92 (2011-12-28) ========================= * Prep Qiime integration rtax-0.984/greengenes/0000755000076500000240000000000011710062223013640 5ustar loraxstaffrtax-0.984/greengenes/greengenes.README0000644000076500000240000000757011710062223016652 0ustar loraxstaffCaution re. using newer versions of GreenGenes ---------------------------------------------- As of September 2011, we do not recommend using a version of GreenGenes newer than March 11, 2011. The reason is that GreenGenes has expanded dramatically in this time--from about half a million to about one million sequences--but the annotation process has not kept up. Thus, newer versions contain much higher proportions of sequences that either carry no taxonomic annotation at all or that list "unclassified" at many or even all ranks. At present, RTAX assumes that a sequence that is either not annotated or "unclassified" at some rank does not belong to any known taxonomic group at that rank. That is, we assume that the lack of an annotation constitutes an assertion that the sequence is novel. This assumption makes our taxonomic assignment process more conservative. Our interpretation is very different from the alternative, that the sequence could perhaps be fully annotated with known names, but that this simply has not been done yet. This question manifests itself in the voting procedure. We require that half of _all_ hits to a query sequence agree on a taxonomic group (at whatever rank) in order to make an assignment. We include hits with no annotation in the denominator, corresponding to the idea that those sequences do not belong to the group being considered in the numerator. We reasoned that the alternative procedure--considering the proportion of concordant hits out of only those carrying any annotation at all (at a given rank)--risks overly biasing taxonomic assignments towards previously named groups.
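To make the voting rule concrete, here is a minimal Perl sketch of a vote at a single rank (illustrative only; the names are ours, and the real implementation in scripts/rtaxVote.pl tracks considerably more bookkeeping per rank). Unannotated hits count toward the denominator but can never win:

    sub vote_at_rank {
        my (@hit_labels) = @_;   # one label per hit; undef or "" for unannotated hits
        my $total = scalar @hit_labels;    # denominator includes ALL hits
        my %tally;
        for my $label (@hit_labels) {
            $tally{$label}++ if defined $label && $label ne "";
        }
        for my $label ( keys %tally ) {
            # at least half of all hits must agree to make an assignment
            return $label if $tally{$label} >= $total / 2;
        }
        return undef;    # no assignment at this rank
    }
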
On the other hand, our inclusion of unannotated sequences in the denominator means that, when those sequences are highly prevalent, the annotated sequences in the numerator rarely reach the 50% threshold required to make an assignment--even if they are in perfect agreement. We can hope that GreenGenes taxonomic assignments will become more comprehensive in the future. Alternatively, we should carefully consider whether it is appropriate to switch to the more permissive approach of ignoring unannotated sequences when voting at each rank, or if some alternative voting procedure better balances the risks of misannotation versus low classification rate. Until this is resolved, it seems prudent to simply use the March 2011 database since it is known to work well. Preparing the Greengenes reference database ------------------------------------------- To regenerate the GreenGenes reference database files for use with RTAX (e.g. with a newer GreenGenes version): Download the complete GreenGenes database to a disk with at least 20GB free space (as of Sep 2011, but GreenGenes is growing rapidly): wget http://greengenes.lbl.gov/Download/Sequence_Data/Greengenes_format/greengenes16SrRNAgenes.txt.gz Then, while in the directory containing the downloaded .txt.gz file, run prepare-greengenes. (If usearch is not in your path, use "prepare-greengenes /path/to/usearch"). This takes a couple of hours and about 4GB of memory. Go out for coffee and whatnot. When you get back, you'll find two files have been produced: gg.nr.fasta and gg.nr.taxonomy. Note that the free (32-bit) version of USEARCH can use at most 2GB memory on some systems and at most 4GB memory on others. 2GB memory is not sufficient to cluster the current version of GreenGenes, and 4GB barely suffices. Thus, GreenGenes will soon be too large to cluster at 99% id using USEARCH without a paid 64-bit license. (Note that USEARCH reports "2.0 GB available" even if in fact 4GB are available, so don't be discouraged until it actually crashes.) When 4GB is also insufficient, and lacking a paid USEARCH license, a possible strategy will be to cluster at lower %id thresholds. This requires less memory and produces a smaller, less fine-grained reference database. We have not explored the impact of doing this on classification performance. rtax-0.984/greengenes/prepare-greengenes0000755000076500000240000000455011710062223017350 0ustar loraxstaff#!/bin/sh cur=`pwd` # we want to call subsidiary scripts in the "scripts" subdir, but we don't know where we're installed or what the working directory is. 
# solution from http://hintsforums.macworld.com/archive/index.php/t-73839.html IFS=$' \t\n' declare -x PATH=/bin:/usr/bin arg=$0; [[ -L $0 ]] && arg=$(stat -f '%Y' "$0") pth=$(2>/dev/null cd "${arg%/*}" >&2; echo "`pwd -P`/${arg##*/}") par=$(dirname "$pth") scripts="$par/../scripts" if [[ ( $# -ne 1 ) && ($# -ne 0) ]] then echo "Usage: `basename $0` (in a directory containing greengenes16SrRNAgenes.txt.gz)" exit 1 fi # Grab the command-line arguments #gg=$1 usearch=`which usearch` if [[ ($# -ne 0) ]] then usearch=$1 fi echo using usearch: $usearch # create a temporary working directory #tempdir=/tmp/rtax.$RANDOM tempdir=./rtax.$RANDOM mkdir $tempdir cd $tempdir # Extract the sequence and taxonomy information: #echo gunzip -c -S .gz $cur/greengenes16SrRNAgenes.txt.gz \| $scripts/greengenesExtract.pl echo Extracting raw fasta and taxonomy files gunzip -c -S .gz $cur/greengenes16SrRNAgenes.txt.gz | $scripts/greengenesExtract.pl # That will write greengenes.fasta and greengenes.taxonomy into the current directory. # In our experiments, we cluster the reference sequences into 99% id OTUs, and find a consensus taxonomy string for each resulting cluster. # sort by length echo sorting fasta file by length $usearch --sort greengenes.fasta --output greengenes.fasta.sorted # cluster echo clustering at 99% identity $usearch --cluster greengenes.fasta.sorted --usersort --iddef 2 --uc gg.nr.uc --seedsout $cur/gg.nr.fasta --id 0.99 --maxrejects 128 --nowordcountreject # parse clusters echo parsing clusters $scripts/parseUclustClusters.pl gg.nr.uc > gg.nr.cl # find consensus taxonomy per cluster echo finding consensus taxonomy per cluster $scripts/rtaxVote.pl greengenes.taxonomy gg.nr.cl > gg.nr.taxonomy.raw echo cleaning up taxonomy output $scripts/classificationQualityFilter.pl 1 0.25 0.1 < gg.nr.taxonomy.raw > gg.nr.taxonomy.filtered $scripts/classificationQualityStripper.pl < gg.nr.taxonomy.filtered > $cur/gg.nr.taxonomy # Note gg.nr.taxonomy resulting from this process includes the pcid-width column and a tab-delimited tax string echo wrote $cur/gg.nr.fasta and $cur/gg.nr.taxonomy # remove the temp directory # (comment out to debug or grab intermediate results) rm -rf $tempdirrtax-0.984/README0000644000076500000240000001335112236215032012403 0ustar loraxstaffRTAX: Rapid and accurate taxonomic classification of short paired-end sequence reads from the 16S ribosomal RNA gene. David A. W. Soergel (1), Rob Knight (2), and Steven E. Brenner (1) 1 Department of Plant and Microbial Biology, University of California, Berkeley 2 Howard Hughes Medical Institute and Department of Chemistry and Biochemistry, University of Colorado at Boulder * Corresponding author (current address): soergel@cs.umass.edu Version 0.984 (August 7, 2013) http://dev.davidsoergel.com/rtax Requirements ------------ * Hardware: memory requirements vary with the size of the reference database; for the version of GreenGenes we provide, ~4GB is needed. * USEARCH (http://www.drive5.com/usearch/). RTAX has been tested with v4.1.93 and v5.0.144. Make sure that the program is in your search path as "usearch", or provide its path explicitly in the "rtax" script. Note that v5.x uses quite a bit more memory, so 4.x may be preferred. * A reference database consisting of sequences for which taxonomy assignments are known. We provide prepared versions of the GreenGenes database on the RTAX web site; see the GreenGenes section below for details. Two files are needed: 1. a FASTA file containing the sequences, and 2. 
a file listing taxonomy strings for each sequence ID found in the FASTA file. The format of this file is sequenceid pcidwidth kingdom; phylum; class; order; ... one entry per line, where sequenceid, pcidwidth, and the taxonomy string are separated by a single tab, and the taxonomy string itself is delimited either by semicolons or by tabs. The pcidwidth column is optional, and (if present) is ignored in the current version of RTAX anyway. (In our usage, we cluster the reference database into 99% id OTUs, representing each by a single "seed" sequence. This column then lists the largest pairwise %id difference between the cluster seed and any cluster member.) Installation ------------ RTAX consists of a set of perl scripts and a shell script ("rtax") to wire them together. No compilation is needed; just extract the package to a convenient location. The perl scripts must remain in the "scripts" directory below the "rtax" shell script, but the latter can be symlinked anywhere for convenience. A common pattern might be to place the whole package at /usr/local/rtax, and then symlink the script into your path (e.g., "ln -s /usr/local/rtax/rtax /usr/local/bin"). Running RTAX ------------ Sequence classification is done as follows: rtax -r gg.nr.fasta -t gg.nr.taxonomy -a queryA.fasta [-b queryB.fasta] -o result.out Substitute a different reference database for the GreenGenes files if desired, of course. If two query files are provided, then they must contain mate-paired reads with exactly matching identifiers (though they need not be in the same order). Any ids present in one file but not the other are silently dropped. Alternate naming schemes for the paired reads (e.g., "SOMEID.a" paired with "SOMEID.b", and so forth) are handled via the -i option. RTAX considers query sequences in the forward sense with respect to the reference database. Thus, in a paired-end experiment, the reverse reads will likely need be reverse-complemented with the -y option. RTAX may be run for single-ended reads as well; simply exclude the queryB parameter in this case. Run "rtax" with no arguments for a help message describing additional options. Various progress messages are provided on stderr, and the predicted classification for each query is provided in the file given by the -o option. The output format is essentially the same as the taxonomy file input format above. The second column indicates the best %id observed between the query sequence and any reference sequence, and the taxonomy string is tab-delimited. Temporary files are created in /tmp/rtax.SOMENUMBER (or under the temp directory given by -m). These can be quite large; please use -m to select a location with sufficient free space, as needed. The temporary directory is deleted upon successful termination, but may need to be manually cleaned up in the event of an error of some sort. Using the GreenGenes reference database --------------------------------------- An RTAX-formatted reference database based on GreenGenes (version of March 11, 2011) is available at http://dev.davidsoergel.com/rtax/greengenes.reference.20110311.tgz. That contains gg.nr.fasta and gg.nr.taxonomy, which are the result of clustering the GreenGenes input file at 99% identity, finding consensus taxonomy strings for each cluster, and formatting the result as needed for use with RTAX. Please see the "greengenes" subdirectory in the RTAX distribution for information on using newer versions of GreenGenes. 
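As a concrete illustration of the taxonomy file format described under Requirements above, the following hypothetical Perl sketch (not part of RTAX; the IDs and lineages are invented) writes entries in the required layout. The optional pcidwidth column is omitted here, since the current version of RTAX ignores it anyway:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # invented example data: sequence IDs mapped to semicolon-delimited lineages
    my %lineage = (
        "seq0001" => "Bacteria; Firmicutes; Bacilli; Lactobacillales",
        "seq0002" => "Bacteria; Proteobacteria; Gammaproteobacteria",
    );

    open( my $tax, ">", "my.taxonomy" ) or die "Can't write my.taxonomy: $!";
    for my $id ( sort keys %lineage ) {
        # one entry per line: sequence ID, then the taxonomy string, tab-separated
        print $tax "$id\t$lineage{$id}\n";
    }
    close $tax;
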
Preparing a non-GreenGenes reference database --------------------------------------------- You can certainly use different databases, or GreenGenes clustered by different means or with different parameters (or not at all, though this approach will have poor performance). If you have any trouble producing the reference fasta and taxonomy files in the required format, examine the prepare-greengenes script for hints, and feel free to contact us for help. Plumbing -------- Taxonomy assignment proceeds in three steps, which we implement separately: 1. For each query sequence (or mate pair), find an appropriate set of matching hits in a reference database. This is implemented in rtaxSearch, which handles both single-ended and paired-end reads. 2. Find consensus taxonomy among each set of reference hits. This is implemented as rtaxVote. 3. Finally, the detailed rtaxVote results are filtered and cleaned up for output. rtax-0.984/rtax0000755000076500000240000002356612236214711012433 0ustar loraxstaff#!/bin/bash -eu # RTAX: Rapid and accurate taxonomic classification of short paired-end # sequence reads from the 16S ribosomal RNA gene. # # David A. W. Soergel 1*, Rob Knight 2, and Steven E. Brenner 1 # # 1 Department of Plant and Microbial Biology, # University of California, Berkeley # 2 Howard Hughes Medical Institute and Department of Chemistry # and Biochemistry, University of Colorado at Boulder # * Corresponding author: soergel@cs.umass.edu # # http://www.davidsoergel.com/rtax # # Version 0.984 (August 7, 2013) # # For usage instructions: just run the script with no arguments # # # Copyright (c) 2011 Regents of the University of California # # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright notice, # this list of conditions and the following disclaimer. # * Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # * Neither the name of the University of California, Berkeley nor the # names of its contributors may be used to endorse or promote products # derived from this software without specific prior written permission. # # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR # A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF # LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING # NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. function echoerr() { echo "$@" 1>&2; } echoerr echoerr 'RTAX: Rapid and accurate taxonomic classification of short paired-end' echoerr ' sequence reads from the 16S ribosomal RNA gene.' echoerr '' echoerr 'David A. W. Soergel (1), Rob Knight (2), and Steven E. 
Brenner (1)' echoerr '' echoerr '1 Department of Plant and Microbial Biology,' echoerr ' University of California, Berkeley ' echoerr '2 Howard Hughes Medical Institute and Department of Chemistry ' echoerr ' and Biochemistry, University of Colorado at Boulder' echoerr '* Corresponding author (current address): soergel@cs.umass.edu' echoerr '' echoerr 'Version 0.984 (August 7, 2013)' echoerr '' echoerr 'http://dev.davidsoergel.com/rtax' echoerr '' echoerr 'Please cite:' echoerr 'Soergel D.A.W., Dey N., Knight R., and Brenner S.E. 2012.' echoerr 'Selection of primers for optimal taxonomic classification of' echoerr 'environmental 16S rRNA gene sequences. ISME J (6), 1440–1444' echoerr '' # we want to call subsidiary scripts in the "scripts" subdir, but we don't know where we're installed or what the working directory is. # solution from http://hintsforums.macworld.com/archive/index.php/t-73839.html IFS=$' \t\n' #declare -x PATH=/bin:/usr/bin:/usr/local/bin arg=$0; [[ -L $0 ]] && arg=$(stat -f '%Y' "$0") pth=$(2>/dev/null cd "${arg%/*}" >&2; echo "`pwd -P`/${arg##*/}") par=$(dirname "$pth") scripts="$par/scripts" # The above solution works with the widespread GNU Coreutils 5.97. Thanks to Bernd Brandt for this alternate solution for Coreutils 8.5. #IFS=$' \t\n' #arg=$0; [[ -L $0 ]] && arg=$(stat -c '%N' "$0") #pth=${arg##*-> \`}; #par=$(dirname "$pth") #scripts="$par/scripts" function abspath { if [[ -d "$1" ]] then pushd "$1" >/dev/null pwd popd >/dev/null #elif [[ -e $1 ]] #then else pushd $(dirname $1) >/dev/null echo $(pwd)/$(basename $1) popd >/dev/null #else # echo $1 does not exist! >&2 # return 127 fi } # Grab the command-line arguments usageHelp="Usage: ${0##*/} -r <refdb> -t <taxonomy> -a <queryA> -b <queryB> -d <delimiter> -i <idRegex> -o <outpath>" refdbHelp="-r reference database in FASTA format" taxonomyHelp="-t taxonomy file with sequence IDs matching the reference database" queryAHelp="-a FASTA file containing query sequences (single-ended or read 1)" queryBHelp="-b FASTA file containing query sequences (read 2, with matching IDs)" revcompAHelp="-x Reverse-complement query A sequences (required if they are provided in the reverse sense)" revcompBHelp="-y Reverse-complement query B sequences (required if they are provided in the reverse sense)" idListHelp="-l text file containing sequence IDs to process, one per line" delimiterHelp="-d delimiter separating the two reads when provided in a single file" idRegexHelp="-i regular expression used to select part of the fasta header to use as the sequence id. Default: \"(\\\\S+)\"" outputHelp="-o output path" tmpdirHelp="-m temporary directory. Will be removed on successful completion, but likely not if there is an error." fallbackHelp="-f for sequences where only one read is available, fall back to single-ended classification. Default: drop these sequences." noGenericFallbackHelp="-g for sequences where one read is overly generic, do not fall back to single-ended classification. Default: classify these sequences based on only the more specific read." 
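# Example invocation (illustrative, matching the README; -y reverse-complements
# read B, as is typical for paired-end data):
#   rtax -r gg.nr.fasta -t gg.nr.taxonomy -a reads1.fasta -b reads2.fasta -y -o result.out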
badOptionHelp="Option not recognised" printHelpAndExit() { echoerr "$usageHelp" echoerr "$refdbHelp" echoerr "$taxonomyHelp" echoerr "$queryAHelp" echoerr "$queryBHelp" echoerr "$revcompAHelp" echoerr "$revcompBHelp" echoerr "$idRegexHelp" echoerr "$idListHelp" echoerr "$delimiterHelp" echoerr "$tmpdirHelp" echoerr "$fallbackHelp" echoerr "$noGenericFallbackHelp" echoerr "$outputHelp" exit $1 } printErrorHelpAndExit() { echoerr "-------------------------------------------------------------" echoerr "ERROR : $@" echoerr "-------------------------------------------------------------" echoerr printHelpAndExit 1 } refdb="" taxonomy="" queryA="" queryB="" revcompA="" revcompB="" delimiter="" idList="" idRegex="" verbose="" fallback="" noGenericFallback="" outpath="" tempdir="" while getopts "hr:t:a:b:i:l:d:m:o:vfgxy" optionName; do case "$optionName" in h) printHelpAndExit 0;; r) refdb=`abspath "$OPTARG"`;; t) taxonomy=`abspath "$OPTARG"`;; a) queryA=`abspath "$OPTARG"`;; b) queryB=`abspath "$OPTARG"`;; x) revcompA="--revcompA";; y) revcompB="--revcompB";; d) delimiter="$OPTARG";; i) idRegex="$OPTARG";; l) idList=`abspath "$OPTARG"`;; o) outpath=`abspath "$OPTARG"`;; m) tempdir=`abspath "$OPTARG"`;; v) verbose="0";; f) fallback="--singleOK";; g) noGenericFallback="--nosingleOKgeneric";; [?]) printErrorHelpAndExit "$badOptionHelp";; esac done if [[ -n "$verbose" ]] then echoerr echoerr Reference database : $refdb echoerr Taxonomy file : $taxonomy echoerr ID List : $idList echoerr Query read A : $queryA echoerr RevComp read A : $revcompA echoerr Query read B : $queryB echoerr RevComp read B : $revcompB # echoerr Header Regex : $idRegex echoerr Read delimiter : $delimiter echoerr Output path : $outpath echoerr Temporary dir : $tempdir echoerr Fallback : $fallback echoerr No Generic Fallback: $noGenericFallback echoerr fi if [[ -z "$refdb" || -z "$taxonomy" || -z "$queryA" || -z "$outpath" ]]; then printErrorHelpAndExit "-r, -t, -a, and -o options are required" fi if [[ -n "$queryB" && -n "$delimiter" ]]; then printErrorHelpAndExit "Paired ends can be input using -b or -d, but not both" fi # find USEARCH, or hardcode if needed (could use an environment variable...) # usearch=/path/to/usearch set +e usearch=`which usearch` if [ $? -eq 0 ] then echoerr using usearch: $usearch echoerr else echoerr "usearch not found in path. Please correct, or just hardcode the path in the rtax script." exit 1 fi set -e # create a temporary working directory if [ ! 
$tempdir ]; then tempdir=/tmp/rtax.$RANDOM #tempdir=./rtax.$RANDOM fi mkdir -p $tempdir cd $tempdir # perform the search if [ $idList ]; then idList="--idList $idList" fi if [ $delimiter ]; then $scripts/splitDelimitedFasta.pl $delimiter $queryA queryAa=`basename $queryA`.a queryAb=`basename $queryA`.b $scripts/rtaxSearch.pl --database $refdb --queryA $queryAa --queryB $queryAb $idList $revcompA $revcompB --usearch $usearch --idRegex "$idRegex" $fallback $noGenericFallback > rawhits else if [ $queryB ]; then $scripts/rtaxSearch.pl --database $refdb --queryA $queryA --queryB $queryB $idList $revcompA $revcompB --usearch $usearch --idRegex "$idRegex" $fallback $noGenericFallback > rawhits else fallback="--singleOK" $scripts/rtaxSearch.pl --database $refdb --queryA $queryA $idList $revcompA --usearch $usearch --idRegex "$idRegex" $fallback > rawhits fi fi # choose the "best" taxonomy assignment by walking down the tree, including info on how many ref sequences agree at each level # and pipe directly into the cleanup scripts instead of using temp files $scripts/rtaxVote.pl $taxonomy rawhits | $scripts/classificationQualityFilter.pl 1 0.0 0.5 | $scripts/classificationQualityStripper.pl > $outpath # The output of this process includes the pcid-width column and a tab-delimited tax string # filter the taxonomy strings to the finest rank where half of the hits agree (out of all the hits, i.e. including in the denominator those hits with no annotation) #$scripts/classificationQualityFilter.pl 1 0.0 0.5 < rawtaxonomy > filteredtaxonomy # clean up the resulting output and provide it to STDOUT #$scripts/classificationQualityStripper.pl < filteredtaxonomy # remove the temp directory # (comment out to debug or grab intermediate results) rm -rf $tempdirrtax-0.984/scripts/0000755000076500000240000000000012200345425013210 5ustar loraxstaffrtax-0.984/scripts/classificationQualityFilter.pl0000755000076500000240000000321712200346774021275 0ustar loraxstaff#!/usr/bin/env perl my $countThreshold = $ARGV[0]; my $localMinProportionThreshold = $ARGV[1]; my $globalMinProportionThreshold = $ARGV[2]; # read one rtaxVote classification record per line from STDIN while (<STDIN>) { chomp; my ( $id, $bestPcid, @taxonomyEntries ) = split /\t/; if ( $taxonomyEntries[0] eq "NOPRIMER" ) { print "$id\t\tNOPRIMER\n"; } elsif ( $taxonomyEntries[0] eq "NOHIT" ) { print "$id\t\tNOHIT\n"; } elsif ( $taxonomyEntries[0] eq "NOMATEPAIR" ) { print "$id\t\tNOMATEPAIR\n"; } elsif ( $taxonomyEntries[0] =~ /TOOMANYHITS/ ) { print "$id\t\tTOOMANYHITS\n"; } else { my @newTaxonomyEntries = (); for my $entry (@taxonomyEntries) { my ( $label, $numChildren, $count, $localMinProportion, $localMaxProportion, $globalMinProportion, $globalMaxProportion, $alternateLocalProportion, $alternateGlobalProportion ) = split /;/, $entry; if ( $count >= $countThreshold && $localMinProportion >= $localMinProportionThreshold && $globalMinProportion >= $globalMinProportionThreshold ) { push @newTaxonomyEntries, $entry; } else { last; } } #if ( @newTaxonomyEntries == 0 ) { # push @newTaxonomyEntries, "Root"; #} # if the trailing entries of the filtered taxonomy string are "SKIP", strip them while(substr(@newTaxonomyEntries[$#newTaxonomyEntries],0,4) eq "SKIP") { pop @newTaxonomyEntries; } print "$id\t$bestPcid\t" . ( join "\t", @newTaxonomyEntries ) . 
"\n"; } } rtax-0.984/scripts/classificationQualityStripper.pl0000755000076500000240000000014112122237445021646 0ustar loraxstaff#!/usr/bin/env perl while () { my $line = $_; $line =~ s/;[^\t\n]*//g; print $line; } rtax-0.984/scripts/fastaidx.pl0000755000076500000240000000065612122237445015367 0ustar loraxstaff#!/usr/bin/env perl use strict; use warnings; use Getopt::Long; use File::Temp; use FindBin; use lib "$FindBin::Bin"; use FastaIndex; my ($fastaFileName, $idRegex) = @ARGV; my $indexA = FastaIndex->new(); # '-filename' => "A.idx", '-write_flag' => 1 ); print STDERR "Indexing $fastaFileName...\n"; $indexA->make_index( $fastaFileName, $idRegex, $fastaFileName ); while() { chomp; print $indexA->fetch($_); }rtax-0.984/scripts/FastaIndex.pm0000644000076500000240000000631312122237445015604 0ustar loraxstaffpackage FastaIndex; use strict; use warnings; use IO::File; BEGIN { for my $field (qw( fastaFileName startpos lines fh )) { my $slot = __PACKAGE__ . "::$field"; no strict "refs"; # So symbolic ref to typeglob works. *$field = sub { my $self = shift; $self->{$slot} = shift if @_; return $self->{$slot}; }; } } sub new { my ($this) = @_; my $class = ref($this) || $this; my $self = {}; bless $self, $class; return $self; } sub make_index { my ($this, $fastaFileName, $idregex, $dbmFileName) = @_; if(!defined $idregex) { $idregex = "(\\S+)"; } $this->fastaFileName($fastaFileName); #open(IN, "$fastaFileName") or die "could not open $fastaFileName"; my $in = IO::File->new("$fastaFileName") or die "could not open $fastaFileName"; $this->fh($in); if($dbmFileName) { my %openpos = (); dbmopen( %openpos, $dbmFileName.".pos", 0666) || die ("Could not open DBM file $dbmFileName.pos"); $this->startpos(\%openpos); my %openlines = (); dbmopen( %openlines, $dbmFileName.".lines", 0666) || die ("Could not open DBM file $dbmFileName.lines"); $this->lines(\%openlines); my $sizepos = scalar keys %openpos; my $sizelines = scalar keys %openlines; if($sizepos != $sizelines) { die "DBM files broken $dbmFileName\n"; } if($sizepos > 0) { print STDERR ("Index file $dbmFileName.pos already exists; assuming valid\n"); return; }; } else { # just store a hash in memory. $this->startpos({}); $this->lines({}); } my $lastpos = 0; my $pos = 0; #my $seq = ""; my $id = ""; my $numlines = 0; while(<$in>) { my $line = $_; if($line =~ /^>$idregex/) { if(defined $id && $id ne "") { # write the previous record #print STDERR "$id -> $lastpos\n"; $this->startpos()->{$id} = $lastpos; $this->lines()->{$id} = $numlines; } # start a new record $id = $1; #$seq = $line; $lastpos = $pos; $numlines = 0; } # else { $seq .= $line; } $pos += length($line); $numlines++; } # write the final record $this->startpos()->{$id} = $lastpos; $this->lines()->{$id} = $numlines; #delete($this->startpos()->{""}); # spurious entries from the start of the loop #delete($this->lines()->{""}); # spurious entries from the start of the loop } sub count_records() { my ($this) = @_; return scalar(keys %{$this->startpos()}); } sub fetch { my ($this, $id) = @_; my $pos = $this->startpos()->{$id}; if(! defined $pos) { die "no sequence with id: $id in " . $this->fastaFileName(); } my $numlines = $this->lines()->{$id}; my $result = ""; my $in = $this->fh(); #print $this->fastaFileName() . ": " . $in . 
"\n"; seek($in,$pos,0); for (1..$numlines) { $result .= <$in> } return $result; } sub close { my ($this) = @_; my $in = $this->fh(); close($in); $this->startpos(undef); $this->lines(undef); } 1;rtax-0.984/scripts/greengenesExtract.pl0000755000076500000240000000230612122237445017233 0ustar loraxstaff#!/usr/bin/env perl use strict; use warnings; open( FASTA, ">greengenes.fasta" ) || die("Can't write to greengenes.fasta"); open( TAX, ">greengenes.taxonomy" ) || die("Can't write to greengenes.taxonomy"); my $fieldname = "gg_norm_tax_string"; while () { # read in by blocks if ( $_ =~ /^BEGIN$/ ) { my $prokMSAid = "NONE"; my $tax = "NONE"; my $seq = "NONE"; until ( $_ =~ /^END$/ ) { if (/^prokMSA_id=(.+)/) { $prokMSAid = $1; } elsif (/^$fieldname=(.+)/) { $tax = $1; } elsif (/aligned_seq=(.+)/) { $seq = $1; } if ( $seq ne "NONE" && $seq ne "unaligned" ) { print FASTA ">$prokMSAid\n$seq\n"; # don't include taxonomy info if there is no sequence anyway # and certainly not if there is no taxonomy data if ( $tax ne "" && $tax ne "NONE" ) { print TAX "$prokMSAid\t$tax\n"; } } $_ = ; } } # else ignore anything outside a BEGIN/END block } close FASTA; close TAX; rtax-0.984/scripts/joinDelimitedFasta.pl0000755000076500000240000000122612122237445017315 0ustar loraxstaff#!/usr/bin/env perl use strict; use warnings; # make sure delimiter is properly escaped, e.g. for "+" we need "\+". my $delimiter = $ARGV[0]; my $infileA = $ARGV[1]; my $infileB = $ARGV[2]; my $rc2 = $ARGV[3]; open(INA, $infileA) or die "Can't read $infileA\n"; open(INB, $infileB) or die "Can't read $infileB\n"; my $fastaHeader = ""; #sub revcompl { # return ((my $rev = reverse $_) =~ tr/ACGTacgt/TGCAtgca/); #} while() { my $a = $_; my $b = ; chomp $a; chomp $b; if($rc2) { # $b = revcompl($b); $b = reverse $b; $b =~ tr/ACGTacgt/TGCAtgca/; } print $a . $delimiter . $b . "\n"; } close INA; close INB;rtax-0.984/scripts/LevelPrediction.pm0000644000076500000240000000325411710062223016637 0ustar loraxstaffpackage LevelPrediction; use strict; use warnings; sub new { my ($this) = @_; my $class = ref($this) || $this; my $self = {}; bless $self, $class; return $self; } for my $field ( qw(label numChildren count localMinProportion localMaxProportion globalMinProportion globalMaxProportion alternateLocalProportion alternateGlobalProportion) ) { my $slot = __PACKAGE__ . "::$field"; no strict "refs"; # So symbolic ref to typeglob works. *$field = sub { my $self = shift; $self->{$slot} = shift if @_; return $self->{$slot}; }; } sub print_tab_delimited { my $self = shift; for my $field ( qw(label numChildren count localMinProportion localMaxProportion globalMinProportion globalMaxProportion alternateLocalProportion alternateGlobalProportion) ) { my $slot = __PACKAGE__ . "::$field"; print "\t" . ( $self->{$slot} ? $self->{$slot} : "undef" ); } } sub toString { my $self = shift; my $result = ""; for my $field ( qw(label) ) { my $slot = __PACKAGE__ . "::$field"; $result .= ( $self->{$slot} ? $self->{$slot} : "" ); } for my $field ( qw(numChildren count) ) { my $slot = __PACKAGE__ . "::$field"; $result .= ";" . ( $self->{$slot} ? $self->{$slot} : "" ); } for my $fieldb ( qw(localMinProportion localMaxProportion globalMinProportion globalMaxProportion alternateLocalProportion alternateGlobalProportion) ) { my $slot = __PACKAGE__ . "::$fieldb"; $result .= $self->{$slot} ? 
(sprintf ";%.2f", $self->{$slot}) : ";" ; } return $result; } 1; rtax-0.984/scripts/parseUclustClusters.pl0000755000076500000240000000243512122237445017620 0ustar loraxstaff#!/usr/bin/env perl # read a uclust result file; for each query sequence, collect the prokMSAids of the hits. use strict; use warnings; my %clusters = (); my %worstPcids = (); parseUclust( $ARGV[0] ); for my $key ( sort ( keys %clusters ) ) { my $worstPcid = $worstPcids{$key}; #print STDERR "$key -> $clusters{$key}\n"; print "$key\t$worstPcid\t" . ( join "\t", @{ $clusters{$key} } ) . "\n"; } sub parseUclust { my ($ucFileName) = @_; open( UC, $ucFileName ) || die("Could not open $ucFileName"); while (<UC>) { if (/^\s*#/) { next; } my ( $type, $cluster, $size, $percentid, $strand, $querystart, $targetstart, $alignment, $querylabel, $targetlabel ) = split /\t/; chomp $querylabel; chomp $targetlabel; if ( $type eq "S" ) { #print STDERR "S $querylabel\n"; $clusters{$querylabel} = [$querylabel]; $worstPcids{$querylabel} = 100.0; } elsif ( $type eq "H" ) { #print STDERR "H $targetlabel $querylabel\n"; push @{ $clusters{$targetlabel} }, $querylabel; if ( $percentid < $worstPcids{$targetlabel} ) { $worstPcids{$targetlabel} = $percentid; } } # ignore other types } close UC; } rtax-0.984/scripts/rtaxSearch.pl0000755000076500000240000011671712236215032015664 0ustar loraxstaff#!/usr/bin/env perl # RTAX: Rapid and accurate taxonomic classification of short paired-end # sequence reads from the 16S ribosomal RNA gene. # # David A. W. Soergel 1*, Rob Knight 2, and Steven E. Brenner 1 # # 1 Department of Plant and Microbial Biology, # University of California, Berkeley # 2 Howard Hughes Medical Institute and Department of Chemistry # and Biochemistry, University of Colorado at Boulder # * Corresponding author: soergel@cs.umass.edu # # http://www.davidsoergel.com/rtax # # Version 0.984 (August 7, 2013) # # For usage instructions: just run the script with no arguments # # # Copyright (c) 2011 Regents of the University of California # # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright notice, # this list of conditions and the following disclaimer. # * Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # * Neither the name of the University of California, Berkeley nor the # names of its contributors may be used to endorse or promote products # derived from this software without specific prior written permission. # # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR # A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF # LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING # NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
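# Orientation notes on the search strategy implemented below (summary comments
# only; see usage() for the authoritative option descriptions):
# 1. The query FASTA files are indexed, and their IDs are partitioned into
#    paired and single-ended sets (partitionReadFiles).
# 2. Searching starts at a 1% pair difference threshold; queries with no hit
#    are retried with the threshold doubled each round, up to
#    maxPercentDifferenceThreshold (processPairs / processSingle).
# 3. Within each round, USEARCH runs with maxaccepts = minMaxAccepts; queries
#    that overflow are retried with maxaccepts doubled, up to maxMaxAccepts
#    (doPairSearch / doSingleSearch).
# 4. For pairs, hits are reconciled by keeping only targets hit by both reads
#    and averaging the two %ids (reconcilePairedHitsAndPrint).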
use strict; use warnings; use Getopt::Long; use File::Temp qw/ tempdir /; use FindBin; use lib "$FindBin::Bin"; use FastaIndex; ### command-line arguments get stored in globals use vars qw ( $usearch $slop $minMaxAccepts $maxMaxAccepts $maxPercentDifferenceThreshold $databaseFile $readAFileAll $readBFileAll $revCompReadA $revCompReadB $idRegex $idList $singleOK $singleOKgeneric ); # just store these as globals for now-- they're basically like command-line args my $indexA; my $indexB; sub init { $singleOKgeneric = 1; # default value; GetOptions may override Getopt::Long::Configure("bundling"); GetOptions( "usearch=s" => \$usearch, "slop=s" => \$slop, "minMaxAccepts=s" => \$minMaxAccepts, "maxMaxAccepts=s" => \$maxMaxAccepts, "maxPercentDifferenceThreshold=s" => \$maxPercentDifferenceThreshold, "databaseFile=s" => \$databaseFile, "idRegex=s" => \$idRegex, "queryA=s" => \$readAFileAll, "queryB=s" => \$readBFileAll, "revcompA" => \$revCompReadA, "revcompB" => \$revCompReadB, "idList=s" => \$idList, "singleOK" => \$singleOK, "singleOKgeneric!" => \$singleOKgeneric ); if ( !defined $databaseFile || !defined $readAFileAll ) { print STDERR "Missing required argument!\n\n"; usage(); } if ( !defined $usearch ) { $usearch = `whereis usearch`; chomp $usearch; if ( !defined $usearch || $usearch eq "" ) { print STDERR "Could not find usearch.\n\n"; usage(); } } if ( !defined $slop ) { $slop = 0.005; } if ( !defined $minMaxAccepts ) { $minMaxAccepts = 1000; } if ( !defined $maxMaxAccepts ) { $maxMaxAccepts = 16000; } if ( !defined $maxPercentDifferenceThreshold ) { $maxPercentDifferenceThreshold = 0.2; } if ( !defined $idRegex || $idRegex eq "" ) { $idRegex = "(\\S+)"; } print STDERR "Header Regex: $idRegex\n"; # these are redundant between multiple runs, oh well # but the DBM indexes should persist, in the same directory as the original files $indexA = FastaIndex->new(); # '-filename' => "A.idx", '-write_flag' => 1 ); if ($revCompReadA) { $readAFileAll = revcompFile($readAFileAll); } $indexA->make_index( $readAFileAll, $idRegex, $readAFileAll ); $indexB = FastaIndex->new(); if ( defined $readBFileAll ) { # '-filename' => "B.idx", '-write_flag' => 1 ); if ($revCompReadB) { $readBFileAll = revcompFile($readBFileAll); } $indexB->make_index( $readBFileAll, $idRegex, $readBFileAll ); } } sub usage { print STDERR << "EOF"; RTAX: Rapid and accurate taxonomic classification of short paired-end sequence reads from the 16S ribosomal RNA gene. David A. W. Soergel (1*), Rob Knight (2), and Steven E. Brenner (1) 1 Department of Plant and Microbial Biology, University of California, Berkeley 2 Howard Hughes Medical Institute and Department of Chemistry and Biochemistry, University of Colorado at Boulder * Corresponding author: soergel\@berkeley.edu http://www.davidsoergel.com/rtax Version 0.984 (August 7, 2013) OPTIONS: --usearch location of usearch program (defaults to result of "whereis usearch") --slop %id difference from maximum to accept (default 0.005, i.e. accept hits to a 200nt query that has 1 more mismatch than the best match, or 2 extra mismatches out of 400nt.) --minMaxAccepts The initial maxaccepts value to pass to USEARCH, controlling the maximum number of hits that will be returned in the first iteration. If this number of hits is reached, the search is repeated with a doubled maxaccepts value. Thus, a smaller value causes the first iteration to run faster, but increases the probability of needing more iterations. (Default 1000) --maxMaxAccepts The highest escalated maxaccepts value to allow. 
The maxaccepts value is doubled on every iteration, so this value is effectively rounded down to minMaxAccepts times a power of two. USEARCH runs with large maxaccepts values are very slow, so this controls when RTAX gives up on a query because there are too many hits (i.e., a query that is too short and/or too highly conserved to be informative). (Default 16000) --maxPercentDifferenceThreshold The largest percent difference to allow between a query (considering the read pair jointly) and a database hit. (Default 0.2, i.e. reference sequences of less than 80% identity with the query sequence are never matched) --databaseFile A FASTA file containing a reference database. --queryA A FASTA file containing one set of query reads. --queryB A FASTA file containing the other set of query reads (if any). Must be provided in the forward sense unless --revcompB is used. --revcompB Reverse-complement read B. Required if the queryB file is provided in the reverse sense, as is typical with paired-end experiments. --singleOK Classify a sequence based on only one read when the other read is missing. Required when queryB is absent. Default: false, so sequences present in only one of the two input files are dropped. When enabled, paired-end sequences are classified using both reads as usual, but any remaining single-ended reads are also classified. --singleOKgeneric Classify a sequence based on only one read when the other read is present but uninformative. This occurs when one read is so generic (i.e., short, or highly conserved) that more than maxMaxAccepts hits are found. Default: true; use --nosingleOKgeneric to disable. --idRegex A regular expression for extracting IDs from fasta headers. The first capture group (aka \$1) will be used as the sequence ID; these should be unique. This is useful when the ID needed to match mate pairs is embedded in the header in some way. "^>" is automatically prepended to the regex provided here. Take care that double escaping may be required. Default: "(\\S+)" (the first field) --idList A file containing a list of IDs to process (one per line). By default all IDs found in the query files are used. Note that the two query files must provide mate-paired reads with exactly matching identifiers (though they need not be in the same order). Any ids present in one file but not the other are classified in single-ended mode. Alternate naming schemes for the paired reads (e.g., "SOMEID.a" paired with "SOMEID.b", and so forth) are handled via the --idRegex option. Various indexes and derived FASTA files will be created in a temporary directory and are cleaned up afterwards. The output is tab-delimited text, one line per query pair, in the form <query ID> <%id> <ref ID> <ref ID> ... where the second column gives the %id between the query and the best match, and the reference IDs provided are all those matches within "slop" of this best value. This output is suitable for piping directly into rtaxVote for taxonomy assignment. 
EOF exit; } sub main { init(); my ( $pairedReadAFile, $pairedReadBFile, $pairedBothCount, $singleReadAFile, $singleReadACount, $singleReadBFile, $singleReadBCount, $allSingleIDs ) = partitionReadFiles(); processPairs( $pairedReadAFile, $pairedReadBFile, $pairedBothCount ); if ($singleOK) { processSingle( $singleReadAFile, $indexA, $singleReadACount ); processSingle( $singleReadBFile, $indexB, $singleReadBCount ); } else { for my $queryLabel (@$allSingleIDs) { print "$queryLabel\t\tNOMATEPAIR\n"; } } } sub partitionReadFiles { my @idList = (); if ($idList) { open( IDLIST, "$idList" ) or die "Could not open $idList"; while (<IDLIST>) { chomp; push @idList, $_; } close IDLIST; } my @idsA = keys %{ $indexA->startpos() }; my @idsB = defined $indexB->startpos() ? keys %{ $indexB->startpos() } : (); # in the single-read case, there should still be an empty index my @bothAandB = (); my @aOnly = (); my @bOnly = (); # encode which of the input files contain which IDs in three bits my %count = (); foreach my $element (@idsA) { if ( ( $element =~ /[\t ]/ ) > 0 ) # $indexA->db() should return parsed IDs with no whitespace { die "Invalid FASTA id: $element\n"; } $count{$element} += 1; } foreach my $element (@idsB) { if ( ( $element =~ /[\t ]/ ) > 0 ) # $indexA->db() should return parsed IDs with no whitespace { die "Invalid FASTA id: $element\n"; } $count{$element} += 2; } foreach my $element (@idList) { if ( ( $element =~ /[\t ]/ ) > 0 ) { die "Invalid FASTA id: $element\n"; } $count{$element} += 4; } if ($idList) { foreach my $element ( keys %count ) { $count{$element} -= 4; } } foreach my $element ( keys %count ) { if ( !( $element =~ /^__/ ) ) { if ( $count{$element} == 1 ) { push @aOnly, $element } elsif ( $count{$element} == 2 ) { push @bOnly, $element } elsif ( $count{$element} == 3 ) { push @bothAandB, $element } else { # no problem; these are sequences not in the idList } } } print STDERR "file a only = " . scalar(@aOnly) . " sequences\n"; print STDERR "file b only = " . scalar(@bOnly) . " sequences\n"; print STDERR "both = " . scalar(@bothAandB) . 
" sequences\n"; my $pairedReadAFile = extractFasta( $indexA, \@bothAandB ); my $pairedReadBFile = extractFasta( $indexB, \@bothAandB ); my $singleReadAFile = extractFasta( $indexA, \@aOnly ); my $singleReadBFile = extractFasta( $indexB, \@bOnly ); my @allSingleIDs = (@aOnly, @bOnly); return ( $pairedReadAFile, $pairedReadBFile, scalar(@bothAandB), $singleReadAFile, scalar(@aOnly), $singleReadBFile, scalar(@bOnly), \@allSingleIDs ); } sub processPairs { my ( $pairedReadAFile, $pairedReadBFile, $pairedBothCount ) = @_; if ( $pairedBothCount == 0 ) { return; } my $nohitQueryIds = []; push @$nohitQueryIds, "ALL"; my $tooManyHitQueryIds = []; my $percentDifferenceThreshold = 0.005; # this gets doubled to 1% before being used the first time while ( @$nohitQueryIds > 0 ) { # double the allowed %diff on every round $percentDifferenceThreshold *= 2; if ( $percentDifferenceThreshold > $maxPercentDifferenceThreshold ) { last; } if ( $nohitQueryIds->[0] ne "ALL" ) { # prepare input files with the remaining sequences $pairedReadAFile = extractFasta( $indexA, $nohitQueryIds ); $pairedReadBFile = extractFasta( $indexB, $nohitQueryIds ); } # within doPairSearch we escalate maxAccepts, so if a queryLabel is still marked tooManyHits at this point, # that means that there are more than maxMaxAccepts hits for this threshold-- # so there's really no point in testing this query again at an even lower threshold my $tooManyHitQueryIdsThisRound; my $numRemaining = ( $nohitQueryIds->[0] eq "ALL" ) ? $pairedBothCount : scalar(@$nohitQueryIds); print STDERR "doPairSearch $pairedReadAFile, $pairedReadBFile: $numRemaining query sequences remaining\n searching with pair %id " . $percentDifferenceThreshold . " and maxAccepts " . $minMaxAccepts . "\n"; ( $nohitQueryIds, $tooManyHitQueryIdsThisRound ) = doPairSearch( $pairedReadAFile, $pairedReadBFile, $percentDifferenceThreshold, $minMaxAccepts ); print STDERR "MAIN: Finished round at threshold $percentDifferenceThreshold; " . scalar(@$nohitQueryIds) . " NOHIT, " . scalar(@$tooManyHitQueryIdsThisRound) . " TOOMANYHIT.\n"; push @$tooManyHitQueryIds, @$tooManyHitQueryIdsThisRound; } print STDERR "MAIN: " . scalar(@$nohitQueryIds) . " query sequences remaining with NOHIT\n"; for my $queryLabel (@$nohitQueryIds) { print "$queryLabel\t\tNOHIT\n"; } print STDERR "MAIN: " . scalar(@$tooManyHitQueryIds) . " query sequences remaining with TOOMANYHITS\n"; for my $queryLabel (@$tooManyHitQueryIds) { print "$queryLabel\t\tTOOMANYHITS\n"; } } sub processSingle { my ( $singleReadFile, $singleIndex, $singleReadCount ) = @_; if ( !defined $singleReadFile || $singleReadFile eq "" || $singleReadCount == 0 ) { return; } my $nohitQueryIds = []; push @$nohitQueryIds, "ALL"; my $tooManyHitQueryIds = []; my $singlePercentDifferenceThreshold = 0.005; # this gets doubled to 1% before being used the first time while ( @$nohitQueryIds > 0 ) { # double the allowed %diff on every round $singlePercentDifferenceThreshold *= 2; if ( $singlePercentDifferenceThreshold > $maxPercentDifferenceThreshold ) { last; } if ( $nohitQueryIds->[0] ne "ALL" ) { # prepare input files with the remaining sequences $singleReadFile = extractFasta( $singleIndex, $nohitQueryIds ); } my $numRemaining = ( $nohitQueryIds->[0] eq "ALL" ) ? 
$singleReadCount : scalar(@$nohitQueryIds); # within doSearch we escalate maxAccepts, so if a queryLabel is still marked tooManyHits at this point, # that means that there are more than maxMaxAccepts hits for this threshold-- # so there's really no point in testing this query again at an even lower threshold my $tooManyHitQueryIdsThisRound; print STDERR "doSingleSearch $singleReadFile: $numRemaining query sequences remaining\n searching with %id " . $singlePercentDifferenceThreshold . " and maxAccepts " . $minMaxAccepts . "\n"; ( $nohitQueryIds, $tooManyHitQueryIdsThisRound ) = doSingleSearch( $singleReadFile, $singleIndex, $singlePercentDifferenceThreshold, $minMaxAccepts ); print STDERR "Finished round at threshold $singlePercentDifferenceThreshold; " . scalar(@$nohitQueryIds) . " NOHIT, " . scalar(@$tooManyHitQueryIdsThisRound) . " TOOMANYHIT.\n"; push @$tooManyHitQueryIds, @$tooManyHitQueryIdsThisRound; } print STDERR scalar(@$nohitQueryIds) . " query sequences remaining with NOHIT\n"; for my $queryLabel (@$nohitQueryIds) { print "$queryLabel\t\tNOHIT\n"; } print STDERR scalar(@$tooManyHitQueryIds) . " query sequences remaining with TOOMANYHITS\n"; for my $queryLabel (@$tooManyHitQueryIds) { print "$queryLabel\t\tTOOMANYHITS\n"; } } sub doSingleSearch { my $singleReadFile = shift; my $singleIndex = shift; my $singlePercentDifferenceThreshold = shift; my $maxAccepts = shift; my $singlePercentIdThreshold = 1. - $singlePercentDifferenceThreshold; my $nohitQueryIds = []; my $tooManyHitQueryIds = []; # open the USEARCH streams # print STDERR "$usearch --quiet --global --iddef 2 --query $singleReadFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject\n"; # open( UCA, # "$usearch --quiet --global --iddef 2 --query $singleReadFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject |" # ) || die "can't fork usearch: $!"; my $dir = tempdir(CLEANUP => 1); if ( system( 'mknod', "$dir/a", 'p' ) && system( 'mkfifo', "$dir/a" ) ) { die "mk{nod,fifo} $dir/a failed"; } if ( !fork() ) { # see paired end case for explanation open( UCAW, ">$dir/a" ) || die("Couldn't write named pipe: $dir/a"); print UCAW "# pipe open!\n"; close UCAW; my $cmd = "$usearch --quiet --global --iddef 2 --query $singleReadFile --db $databaseFile --uc $dir/a --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject"; print STDERR $cmd . 
"\n"; exec $cmd || die "can't fork usearch: $!"; } open( UCA, "$dir/a" ) || die("Couldn't read named pipe from usearch: $dir/a"); #print STDERR "Reading named pipe from usearch: $dir/a\n"; # Load the first non-comment line from each stream my $nextLineA; my $pipeARetryCount = 0; while ( !defined $nextLineA ) { # keep trying to read from the pipe even if the writer disconnects while () { if (/^\s*#/) { next; } if (/^\s*$/) { next; } $nextLineA = $_; last; } #print STDERR "Waiting for named pipe $dir/a\n"; sleep 1; $pipeARetryCount++; if ( $pipeARetryCount > 10 ) { die "Named pipe communication with usearch failed: $dir/a\n"; } } # read synchronized blocks from each stream while (1) { my ( $queryLabelA, $idsA ); #print STDERR "reading next block...\n"; # idsA is a reference to a hash from targetIds to %ids ( $queryLabelA, $idsA, $nextLineA ) = collectIds( *UCA, $nextLineA ); if ( ( scalar keys %$idsA ) >= $maxAccepts ) { push @$tooManyHitQueryIds, $queryLabelA; } elsif ( !reconcileSingleHitsAndPrint( $queryLabelA, $idsA, $singlePercentIdThreshold ) ) { push @$nohitQueryIds, $queryLabelA; } if ( !$nextLineA ) { #print STDERR "End of stream, close\n"; last; } } close(UCA) || die "can't close usearch: $!"; #print STDERR "Closed usearch stream.\n"; # for the TOOMANYHITS cases, escalate maxAccepts and try again # Note this recurses, so no need to iterate here if ( scalar(@$tooManyHitQueryIds) && ( $maxAccepts * 2 <= $maxMaxAccepts ) ) { my $nohitQueryIdsB; print STDERR "Escalating maxAccepts to " . ( $maxAccepts * 2 ) . " for " . scalar(@$tooManyHitQueryIds) . " sequences.\n"; # prepare input files with the remaining sequences $singleReadFile = extractFasta( $singleIndex, $tooManyHitQueryIds ); ( $nohitQueryIdsB, $tooManyHitQueryIds ) = doSingleSearch( $singleReadFile, $singleIndex, $singlePercentDifferenceThreshold, $maxAccepts * 2 ); if (@$nohitQueryIdsB) { die "A TOOMANYHITS case can't turn into a NOHITS case"; } } print STDERR "doSingleSearch $singleReadFile: Finished at pair threshold $singlePercentDifferenceThreshold and maxAccepts $maxAccepts\n"; print STDERR " NOHITS: " . scalar(@$nohitQueryIds) . "\n"; print STDERR " TOOMANYHITS: " . scalar(@$tooManyHitQueryIds) . "\n"; # print STDERR " NOHITS: " . ( join ", ", @$nohitQueryIds ) . "\n"; # print STDERR " TOOMANYHITS: " . ( join ", ", @$tooManyHitQueryIds ) . "\n"; # any tooManyHitQueryIds that remain had more than maxMaxAccepts hits return ( $nohitQueryIds, $tooManyHitQueryIds ); } sub reconcileSingleHitsAndPrint { my ( $queryLabel, $idsA, $singlePercentIdThreshold ) = @_; my $idsByIdentity = {}; for my $targetLabel ( keys %$idsA ) { my $percentIdA = $idsA->{$targetLabel}; if ( !defined $idsByIdentity->{$percentIdA} ) { $idsByIdentity->{$percentIdA} = []; } push @{ $idsByIdentity->{$percentIdA} }, $targetLabel; } if ( !%$idsByIdentity ) { #print STDERR "$queryLabel -> no reconciled hits at $singlePercentIdThreshold%\n"; # this is the NOHIT case, but we'll back off and try again return 0; } else { #print STDERR "$queryLabel -> printing reconciled hits at $singlePercentIdThreshold%\n"; printLine( $queryLabel, $idsByIdentity ); return 1; } } sub doPairSearch { my $readAFile = shift; my $readBFile = shift; my $pairPercentDifferenceThreshold = shift; my $maxAccepts = shift; my $pairPercentIdThreshold = 1. 
- $pairPercentDifferenceThreshold; # because we're going to average the two %ids, we have to search each read with twice the %diff for the pair my $singlePercentDifferenceThreshold = $pairPercentDifferenceThreshold * 2; my $singlePercentIdThreshold = 1. - $singlePercentDifferenceThreshold; my $nohitQueryIds = []; my $tooManyHitQueryIds = []; # open the USEARCH streams # print STDERR #"$usearch --quiet --global --iddef 2 --query $readAFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject\n"; # open( UCA, #"$usearch --quiet --global --iddef 2 --query $readAFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject |" # ) || die "can't fork usearch: $!"; # print STDERR #"$usearch --quiet --global --iddef 2 --query $readBFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject\n"; # open( UCB, #"$usearch --quiet --global --iddef 2 --query $readBFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject |" # ) || die "can't fork usearch: $!"; my $dir = tempdir(CLEANUP => 1); if ( system( 'mknod', "$dir/a", 'p' ) && system( 'mkfifo', "$dir/a" ) ) { die "mk{nod,fifo} $dir/a failed"; } if ( system( 'mknod', "$dir/b", 'p' ) && system( 'mkfifo', "$dir/b" ) ) { die "mk{nod,fifo} $dir/b failed"; } # try to avoid mysterious intermittent condition where opening a named pipe blocks forever #while ( !-p "$dir/a" ) { # print STDERR "Waiting for named pipe $dir/a\n"; # sleep 1; #} #while ( !-p "$dir/b" ) { # print STDERR "Waiting for named pipe $dir/b\n"; # sleep 1; #} if ( !fork() ) { # there is a mysterious intermittent condition where opening a named pipe blocks forever. # I think what is happening may be: # if the reader side of the named pipe is not already connected when usearch tries to write to it, usearch gets confused and never writes anything # thus when the reader side does connect, it blocks. # one hack is just to sleep here for a while in hopes that the reader thread gets all hooked up # sleep 5; # let's try writing to it, so we block here until we know it works, and then continue on to usearch open( UCAW, ">$dir/a" ) || die("Couldn't write named pipe: $dir/a"); print UCAW "# pipe open!\n"; close UCAW; my $cmd = "$usearch --quiet --global --iddef 2 --query $readAFile --db $databaseFile --uc $dir/a --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject"; print STDERR $cmd . "\n"; exec $cmd || die("can't fork usearch: $!"); } if ( !fork() ) { open( UCBW, ">$dir/b" ) || die("Couldn't write named pipe: $dir/b"); print UCBW "# pipe open!\n"; close UCBW; my $cmd = "$usearch --quiet --global --iddef 2 --query $readBFile --db $databaseFile --uc $dir/b --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject"; print STDERR $cmd . 
"\n"; exec $cmd || die("can't fork usearch: $!"); } open( UCA, "$dir/a" ) || die("Couldn't read named pipe from usearch: $dir/a"); #print STDERR "Reading named pipe from usearch: $dir/a\n"; open( UCB, "$dir/b" ) || die("Couldn't read named pipe from usearch: $dir/b"); #print STDERR "Reading named pipe from usearch: $dir/b\n"; # Load the first non-comment line from each stream my $nextLineA; my $pipeARetryCount = 0; while ( !defined $nextLineA ) { # keep trying to read from the pipe even if the writer disconnects while () { if (/^\s*#/) { next; } if (/^\s*$/) { next; } $nextLineA = $_; last; } #print STDERR "Waiting for named pipe $dir/a\n"; sleep 1; $pipeARetryCount++; if ( $pipeARetryCount > 10 ) { die("Named pipe communication with usearch failed: $dir/a\n"); } } my $nextLineB; my $pipeBRetryCount = 0; while ( !defined $nextLineB ) { # keep trying to read from the pipe even if the writer disconnects while () { if (/^\s*#/) { next; } if (/^\s*$/) { next; } $nextLineB = $_; last; } #print STDERR "Waiting for named pipe $dir/b\n"; sleep 1; $pipeBRetryCount++; if ( $pipeBRetryCount > 10 ) { die("Named pipe communication with usearch failed: $dir/b\n"); } } # read synchronized blocks from each stream while (1) { my ( $queryLabelA, $idsA, $queryLabelB, $idsB ); # idsA is a reference to a hash from targetIds to %ids ( $queryLabelA, $idsA, $nextLineA ) = collectIds( *UCA, $nextLineA ); ( $queryLabelB, $idsB, $nextLineB ) = collectIds( *UCB, $nextLineB ); if ( !( $queryLabelA eq $queryLabelB ) ) { die("Usearch results desynchronized: $queryLabelA neq $queryLabelB"); } my $numHitsA = ( scalar keys %$idsA ); my $numHitsB = ( scalar keys %$idsB ); # if either read got NOHITS, then it's definitely NOHITS for the pair. This trumps TOOMANYHITS for the other read. # don't bother trying to reconcile hits in this case if ( ( $numHitsA == 0 ) || ( $numHitsB == 0 ) ) { push @$nohitQueryIds, $queryLabelA; } # if both reads got TOOMANYHITS, then it's definitely TOOMANYHITS for the pair. elsif ( ( $numHitsA >= $maxAccepts ) && ( $numHitsB >= $maxAccepts ) ) { push @$tooManyHitQueryIds, $queryLabelA; } # if neither read got TOOMANYHITS, then we're good to go elsif ( ( $numHitsA < $maxAccepts ) && ( $numHitsB < $maxAccepts ) ) { if ( !reconcilePairedHitsAndPrint( $queryLabelA, $idsA, $idsB, $pairPercentIdThreshold ) ) { push @$nohitQueryIds, $queryLabelA; } } # if only one read got TOOMANYHITS... else { # escalate if possible if ( $maxAccepts < $maxMaxAccepts ) { push @$tooManyHitQueryIds, $queryLabelA; } # if we're already at maxMaxAccepts and we're allowed, fall back to the info provided by the other read. # This is a little tricky: which percent ID threshold do we use? # For consistency, I think we have to assume that the overly generic read is a 100% match. # the cleanest way to say this is to say that the "overly generic" read just matches everything, so # reconcilePairHitsAndPrint( $queryLabelA, $idsA, $allIds, $pairPercentIdThreshold ) # but that would require a hash %allIds mapping every ID to 100, just because reconcilePairHitsAndPrint says $idsB->{$targetLabel} # it's equivalent to just use reconcileSingleHitsAndPrint with singlePercentIdThreshold. 
            elsif ($singleOKgeneric) {
                if ( $numHitsA < $maxAccepts ) {
                    if ( !reconcileSingleHitsAndPrint( $queryLabelA, $idsA, $singlePercentIdThreshold ) ) {
                        push @$nohitQueryIds, $queryLabelA;
                    }
                }
                elsif ( $numHitsB < $maxAccepts ) {
                    if ( !reconcileSingleHitsAndPrint( $queryLabelA, $idsB, $singlePercentIdThreshold ) ) {
                        push @$nohitQueryIds, $queryLabelA;
                    }
                }
                else { die("impossible"); }
            }

            # if we're already at maxMaxAccepts, but not allowed to rely on the other read, just report TOOMANYHITS for the pair
            else {
                push @$tooManyHitQueryIds, $queryLabelA;
            }
        }

        if ( !$nextLineA || !$nextLineB ) {
            if ( $nextLineA || $nextLineB ) {
                die("Usearch results desynchronized at end:\nA: $nextLineA\nB: $nextLineB");
            }
            last;
        }
    }

    close(UCA) || die "can't close usearch: $!";
    close(UCB) || die "can't close usearch: $!";

    # for the TOOMANYHITS cases, escalate maxAccepts and try again.
    # Note this recurses, so there is no need to iterate here
    if ( scalar(@$tooManyHitQueryIds) && ( $maxAccepts * 2 <= $maxMaxAccepts ) ) {
        my $nohitQueryIdsB;
        print STDERR "doPairSearch $readAFile, $readBFile: Escalating maxAccepts to "
            . ( $maxAccepts * 2 ) . " for "
            . scalar(@$tooManyHitQueryIds)
            . " sequences.\n";

        # prepare input files with the remaining sequences
        my $readAFileEsc = extractFasta( $indexA, $tooManyHitQueryIds );
        my $readBFileEsc = extractFasta( $indexB, $tooManyHitQueryIds );

        ( $nohitQueryIdsB, $tooManyHitQueryIds ) =
            doPairSearch( $readAFileEsc, $readBFileEsc, $pairPercentDifferenceThreshold, $maxAccepts * 2 );

        # A TOOMANYHITS case can certainly turn into a NOHITS case:
        # once one read is no longer TOOMANYHITS, it may turn out that nothing can be reconciled with the other read.
        push @$nohitQueryIds, @$nohitQueryIdsB;
    }

    print STDERR "doPairSearch $readAFile, $readBFile: Finished at pair threshold $pairPercentDifferenceThreshold and maxAccepts $maxAccepts\n";
    print STDERR " NOHITS: " . scalar(@$nohitQueryIds) . "\n";
    print STDERR " TOOMANYHITS: " . scalar(@$tooManyHitQueryIds) . "\n";

    # print STDERR " NOHITS: " . ( join ", ", @$nohitQueryIds ) . "\n";
    # print STDERR " TOOMANYHITS: " . ( join ", ", @$tooManyHitQueryIds ) . "\n";

    # any tooManyHitQueryIds that remain had more than maxMaxAccepts hits
    return ( $nohitQueryIds, $tooManyHitQueryIds );
}

sub reconcilePairedHitsAndPrint {
    my ( $queryLabel, $idsA, $idsB, $pairPercentIdThreshold ) = @_;

    my $idsByIdentity = {};
    for my $targetLabel ( keys %$idsA ) {
        my $percentIdA = $idsA->{$targetLabel};
        my $percentIdB = $idsB->{$targetLabel};
        if ($percentIdB) {

            # both reads hit the same target; compute the average percent id
            my $pairPercentId = ( $percentIdA + $percentIdB ) / 2.0;
            if ( $pairPercentId >= $pairPercentIdThreshold ) {
                if ( !defined $idsByIdentity->{$pairPercentId} ) {
                    $idsByIdentity->{$pairPercentId} = [];
                }
                push @{ $idsByIdentity->{$pairPercentId} }, $targetLabel;
            }
        }
    }

    if ( !%$idsByIdentity ) {

        #print STDERR "$queryLabel -> no reconciled hits at $pairPercentIdThreshold%\n";

        # this is the NOHIT case, but we'll back off and try again
        return 0;
    }
    else {

        #print STDERR "$queryLabel -> printing reconciled hits at $pairPercentIdThreshold%\n";
        printLine( $queryLabel, $idsByIdentity );
        return 1;
    }
}

##### Read a block of target IDs for one query from a USEARCH stream
sub collectIds {

    # assume that the records for a given label are all grouped together, so we can collect the hits for each label and process them in turn.
    # need to jump through some hoops to save the first line of the next block (since we can't rewind a pipe with seek()).

    my $fh        = shift;
    my $firstLine = shift;

    my %hits = ();

    # process the first line (cached from the previous call)
    #print STDERR "Processing firstLine: $firstLine\n";
    my ( $typeF, $clusterF, $sizeF, $percentIdF, $strandF, $queryStartF, $targetStartF, $alignmentF, $queryLabelF, $targetLabelF ) =
        split /\t/, $firstLine;
    chomp $targetLabelF;
    $queryLabelF =~ /$idRegex/;
    $queryLabelF = $1;

    #$queryLabelF =~ s/\s.*//;    # primary ID is only the portion before whitespace

    if ( $typeF eq "N" ) {

        #print STDERR "$queryLabelF -> N\n";

        # NOHIT represented as empty hash
        # note we still have to cache the next line, after filtering comments
        while (<$fh>) {
            if (/^\s*#/) { next; }
            if (/^\s*$/) { next; }
            my $line = $_;
            chomp $line;

            #print STDERR "$line\n";
            return ( $queryLabelF, \%hits, $line );
        }
    }
    elsif ( $typeF eq "H" ) {
        $hits{$targetLabelF} = $percentIdF;
    }

    # process the remaining lines
    while (<$fh>) {
        if (/^\s*#/) { next; }
        if (/^\s*$/) { next; }
        my $line = $_;
        chomp $line;

        #print STDERR "$line\n";
        my ( $type, $cluster, $size, $percentId, $strand, $queryStart, $targetStart, $alignment, $queryLabel, $targetLabel ) = split /\t/;
        chomp $targetLabel;
        $queryLabel =~ /$idRegex/;
        $queryLabel = $1;

        #$queryLabel =~ s/\s.*//;    # primary ID is only the portion before whitespace

        if ( $queryLabel ne $queryLabelF ) {

            # advanced past the end of the current block
            #print STDERR "End of block $queryLabelF -> block of " . scalar( keys %hits ) . " hits\n";
            return ( $queryLabelF, \%hits, $line );
        }
        else {
            if    ( $type eq "N" ) { die "impossible: $queryLabel reports no hit after it already had a hit"; }
            elsif ( $type eq "H" ) { $hits{$targetLabel} = $percentId; }
        }
    }

    # end of the stream (the last block)
    #print STDERR "End of stream $queryLabelF -> block of " . scalar( keys %hits ) . " hits\n";
    return ( $queryLabelF, \%hits, "" );
}

my $fastaNum;

sub extractFasta {
    my $index = shift;
    my $ids   = shift;
    my $name  = $fastaNum++;

    # print STDERR "Extracting " . scalar(@$ids) . " to file $name\n";
    open( OUT, ">$name" ) or die "Can't write $name: $!";

    # my $out = Bio::SeqIO->new( '-format' => 'Fasta', '-fh' => \*OUT );
    for my $id (@$ids) {
        my $seqobj = $index->fetch($id);
        if ( !defined $seqobj ) {
            print STDERR "Extracting from " . $index->fastaFileName() . ": Undefined: $id\n";
        }

        # elsif ($seqobj->primary_id() ne $id) {
        #     print STDERR "Extracting from " . $index->filename() . ": ID problem: " . ($seqobj->primary_id()) . " ne $id\n";
        # }
        else {

            #$out->write_seq($seqobj);
            print OUT $seqobj;
        }
    }
    close OUT;
    return $name;
}

sub printLine {
    my ( $label, $idsByIdentity ) = @_;

    my @ids   = ();
    my @pcids = sort { $b <=> $a } keys %$idsByIdentity;
    my $bestPcid   = $pcids[0];
    my $acceptPcid = $bestPcid - $slop;
    for my $pcid (@pcids) {
        if ( $pcid < $acceptPcid ) { last; }
        push @ids, @{ $idsByIdentity->{$pcid} };
    }

    my $idString = ( join "\t", @ids );
    print "$label\t$bestPcid\t" . $idString . "\n";

    # print STDERR "$label\t$bestPcid\t" . $idString . "\n";
}

sub revcompFile {
    my $infile  = shift;
    my $outfile = $infile . ".rc";

    open( IN,  $infile )     or die "Can't read $infile\n";
    open( OUT, ">$outfile" ) or die "Can't write $outfile\n";

    while (<IN>) {
        my $a = $_;
        chomp $a;
        if ( !( $a =~ /^>/ ) ) {

            # reverse-complement sequence lines; header lines pass through unchanged
            $a = reverse $a;
            $a =~ tr/ACGTacgt/TGCAtgca/;
        }
        print OUT $a . "\n";
    }
    close IN;
    close OUT;
    return $outfile;
}

main();
rtax-0.984/scripts/rtaxVote.pl0000755000076500000240000002361712200346776015377 0ustar loraxstaff#!/usr/bin/env perl

# read the prokMSAids of the hits, as produced by exactMatchIds,
# then grab the taxonomy info for those prokMSAids and make some classification info from the set.

use strict;
use warnings;

#use lib '.';
use FindBin;
use lib "$FindBin::Bin";

use LevelPrediction;

my %taxonomies = ();

# my %taxonomyPrior = ();

# the subs below are compiled before these calls run; the hash declarations above
# must precede the calls so that they are not re-initialized after being populated
loadTaxonomy( $ARGV[0] );
processHits( $ARGV[1] );

sub loadTaxonomy {
    my ($taxFileName) = @_;
    print STDERR "loading taxonomy...\n";
    open( TAX, $taxFileName ) or die "Cannot open input taxonomy file $taxFileName: $!";
    my $lines = 0;
    while ( my $line = <TAX> ) {
        chomp $line;

        # try to accept several formats:
        # semicolon- or tab-delimited taxonomy strings,
        # with or without the "pcid-width" column between the sequence id and the taxonomy string.
        # For example (hypothetical id and names), all of these parse to the same result:
        #     173492<TAB>Bacteria;Firmicutes;Clostridia
        #     173492;Bacteria;Firmicutes;Clostridia
        #     173492<TAB>0.97<TAB>Bacteria<TAB>Firmicutes<TAB>Clostridia
        my ( $prokMSAid, @taxonomy ) = split( /[;\t]/, $line );

        #my ( @taxonomy ) = split( /; /, $taxString );
        if ( @taxonomy && ( $taxonomy[0] eq "" || $taxonomy[0] =~ /^(\d+\.?\d*|\.\d+)$/ ) ) {

            # value is numeric or empty, so it must be the pcid width of the target cluster
            my $taxPcid = shift @taxonomy;

            #print STDERR "Ignoring target cluster width: $taxPcid\n";
        }
        $taxonomies{$prokMSAid} = \@taxonomy;

        # my $taxString = "";
        # for my $taxElem (@taxonomy) {
        #     $taxString .= "$taxElem; ";
        #     $taxonomyPrior{$taxString}++;
        # }

        $lines++;
    }
    close TAX;
    print STDERR "...done loading $lines taxonomy lines\n";

    # print STDERR "normalizing prior...\n";
    # for my $taxKey (keys %taxonomyPrior) {
    #     $taxonomyPrior{$taxKey} /= $lines;
    # }
    # $taxonomyPrior{""} = 1;
    # print STDERR "...done normalizing prior\n";
}
sub printLine {
    my ( $label, $bestPcid, @ids ) = @_;
    my $levelPredictions = makeTaxonomyPrediction(@ids);
    print "$label\t$bestPcid";
    for my $levelPrediction (@$levelPredictions) {
        print "\t" . $levelPrediction->toString();
    }

    #if(scalar(@$levelPredictions) == 0) { $unclassified++ }
    #if(scalar(@$levelPredictions) == 0) { print "\tUNCLASSIFIED"; }
    print "\n";
    return ( scalar(@$levelPredictions) == 0 );
}

sub processHits {
    my ($hitsFileName) = @_;
    open( HITS, $hitsFileName ) || die("Could not open $hitsFileName");

    my $hit          = 0;
    my $noprimer     = 0;
    my $nohit        = 0;
    my $toomanyhits  = 0;
    my $nomatepair   = 0;
    my $unclassified = 0;

    while (<HITS>) {
        chomp;
        my ( $label, $bestPcid, @ids ) = split /\t/;
        if ( ( !@ids ) || ( $ids[0] eq "" ) ) {    # an empty id list
            print "$label\t\tNOHIT\n";
            $nohit++;
        }
        elsif ( $ids[0] eq "NOHIT" ) {    # historically this was represented only as an empty id list, but the explicit token is now used as well
            print "$label\t\tNOHIT\n";
            $nohit++;
        }
        elsif ( $ids[0] eq "NOPRIMER" ) {
            print "$label\t\tNOPRIMER\n";
            $noprimer++;
        }
        elsif ( $ids[0] eq "TOOMANYHITS" ) {
            print "$label\t\tTOOMANYHITS\n";
            $toomanyhits++;
        }
        elsif ( $ids[0] eq "NOMATEPAIR" ) {
            print "$label\t\tNOMATEPAIR\n";
            $nomatepair++;
        }
        else {
            $unclassified += printLine( $label, $bestPcid, @ids );
            $hit++;
        }
    }

    my $samples = $hit + $nohit + $toomanyhits + $nomatepair + $noprimer;
    print STDERR "$samples items\n";
    if ( $samples != 0 ) {
        print STDERR "$noprimer (" . sprintf( "%.1f", ( 100 * $noprimer / $samples ) ) . "%) had no primer match\n";
        print STDERR "$nohit (" . sprintf( "%.1f", ( 100 * $nohit / $samples ) ) . "%) had no hits\n";
        print STDERR "$toomanyhits (" . sprintf( "%.1f", ( 100 * $toomanyhits / $samples ) ) . "%) had too many hits\n";
        print STDERR "$nomatepair (" . sprintf( "%.1f", ( 100 * $nomatepair / $samples ) ) . "%) had no mate pair\n";
        print STDERR "$unclassified (" . sprintf( "%.1f", ( 100 * $unclassified / $samples ) ) . "%) had hits but no classification\n";
        print STDERR "" . ( $hit - $unclassified ) . " (" . sprintf( "%.1f", ( 100 * ( $hit - $unclassified ) / $samples ) ) . "%) were classified\n";
    }

    # classifications per level? That depends on the stringency filter, which is downstream
    close HITS;
}
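# Example input consumed by processHits above (tab-separated; ids and values are
# hypothetical). The second column is the best percent id of the hit set, and is
# empty when the third column carries one of the special markers:
#
#     read0001<TAB>99.2<TAB>173492<TAB>51012
#     read0002<TAB><TAB>NOHIT
#     read0003<TAB><TAB>TOOMANYHITS
#     read0004<TAB><TAB>NOMATEPAIR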
#sub normalizeByPrior {
#    my ($taxString, $taxonCounts) = @_;
#
#    print STDERR "Normalizing: \n";
#    while( my ($k, $v) = each %$taxonCounts) { print STDERR "$k = $v ; "; }
#    print STDERR "\n";
#
#    my $normalizer = $taxonomyPrior{$taxString};
#    if(!defined $normalizer) {
#        print STDERR ("No prior: $taxString\n");
#    }
#
#    for my $taxon (keys %$taxonCounts) {
#        $taxonCounts->{$taxon} = ($taxonCounts->{$taxon} / $taxonomyPrior{$taxString . $taxon . "; "}) * $normalizer;
#    }
#    print STDERR " Done: \n";
#    while( my ($k, $v) = each %$taxonCounts) { print STDERR "$k = $v ; "; }
#    print STDERR "\n";
#}

sub makeTaxonomyPrediction {
    my (@ids) = @_;
    my @levelPredictions = ();

    # TOOMANYHITS is now added in ucFilterBest.pl, so it should be caught at line 99 above
    # if(scalar(@ids) == 1000) {
    #     # when we did the uclust at exactMatchOneReadSet:50, we used " --allhits --maxaccepts 1000 ".
    #     # after that we filtered for the best pcid set (using ucFilterBest.pl).
    #     # if 1000 hits remain, then the real set of best-pcid hits is larger, and we missed some.
    #     # In that case we should bail, because the set of 1000 hits we do have may not be representative.
    #     # I think this is the reason why matching the E517F primer only (17nt reads) produced predictions, and at different levels to boot.
    #     # That also depends on the classification thresholds.
    #     my $levelPrediction = LevelPrediction->new();
    #     $levelPrediction->label("TOOMANYHITS");
    #     push @levelPredictions, $levelPrediction;
    #     return \@levelPredictions;
    # }

    my @taxonomyVectors = map { $taxonomies{$_} } @ids;

    # my @taxonomyClusterSizes = map { $taxonomyWorstPcids{$_} } @ids;

    my $globalTotal    = @taxonomyVectors;
    my $predictedTaxa  = "";
    my $globalUnknowns = 0;    # at all levels, incrementing as we go

    for my $level ( 0 .. 10 ) {
        my $levelPrediction = LevelPrediction->new();

        # assert the remaining taxonomyVectors are equal at higher levels
        my %taxonCounts   = ();
        my $localUnknowns = 0;
        my $localTotal    = @taxonomyVectors;

        # count up how often each label is seen descending from this node
        for my $vec (@taxonomyVectors) {
            my $taxon = $vec->[$level];

            # "Unknown" should occur only towards the leaves; an unspecified intermediate level followed by a populated level is a "skip".
            # Here, "skip" is counted as just another taxon.
            if ( !defined $taxon || $taxon eq "" || $taxon =~ /unknown/i ) {
                $localUnknowns++;
            }
            else {
                $taxonCounts{$taxon}++;
            }
        }
        if ( $localUnknowns == $localTotal ) { last; }

        # this normalization makes no sense, because we don't want a uniform prior either
        # e.g., one Archaeal hit among dozens of Bacterial hits would win, because there are so few Archaea in GreenGenes to begin with
        # normalizeByPrior( $predictedTaxa, \%taxonCounts );

        # get the best label and check for ties
        $levelPrediction->numChildren( scalar( keys %taxonCounts ) );
        my @taxaInOrder = sort { $taxonCounts{$b} <=> $taxonCounts{$a} } keys %taxonCounts;
        my $bestTaxon   = $taxaInOrder[0];

        # print STDERR "$bestTaxon\n";
        $levelPrediction->label($bestTaxon);
        $predictedTaxa .= "$bestTaxon; ";
        my $bestCount = $taxonCounts{$bestTaxon};
        $levelPrediction->count($bestCount);

        my $secondBestTaxon = $taxaInOrder[1];
        if ( defined $secondBestTaxon ) {
            my $secondBestCount = $taxonCounts{$secondBestTaxon};
            if ( $levelPrediction->count() < 2 * $secondBestCount ) {

                # Declare a tie if the winning taxon doesn't have at least twice as many votes as the runner-up;
                # just consider this level unknown and bail
                last;
            }
        }

        # compute various ratios of the prediction vs. alternatives
        $levelPrediction->localMinProportion( $bestCount / $localTotal );
        $levelPrediction->localMaxProportion( ( $bestCount + $localUnknowns ) / $localTotal );

        $globalUnknowns += $localUnknowns;    # accumulate unknowns across levels
        $levelPrediction->globalMinProportion( $bestCount / $globalTotal );
        $levelPrediction->globalMaxProportion( ( $bestCount + $globalUnknowns ) / $globalTotal );

        # what if all of the "unknown" matches should have been the same? Then an "alternate" classification might have won
        $levelPrediction->alternateLocalProportion( $localUnknowns / $localTotal );
        $levelPrediction->alternateGlobalProportion( $globalUnknowns / $globalTotal );    # note we already added the local unknowns to the global unknowns

        # it's possible that a completely different path has a higher global probability than this one,
        # but we'd never know because we pick the max _per level_ and never explore the other paths.
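        # Illustrative numbers (hypothetical, not from any real run): suppose 10 hits
        # remain at this level, of which 6 vote Firmicutes, 2 vote Bacteroidetes, and
        # 2 are unknown. Firmicutes wins and is not a tie (6 >= 2 * 2), with
        # localMinProportion = 6/10 and localMaxProportion = (6+2)/10; the global
        # proportions use the same counts but divide by the full original hit count,
        # with unknowns accumulated over all levels so far.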
        # decide whether to bother continuing
        # if ( $bestLocalMinProportion < 0.5 ) { last; }

        # for now, print all the best labels until everything is unknown or there is a tie; sort out the confidence later
        push @levelPredictions, $levelPrediction;

        # remove any non-matching paths from consideration at the next level
        my @newTaxonomyVectors = ();
        for my $vec (@taxonomyVectors) {
            my $taxon = $vec->[$level];
            if ( defined $taxon && $taxon eq $bestTaxon ) {
                push @newTaxonomyVectors, $vec;
            }
        }
        @taxonomyVectors = @newTaxonomyVectors;
    }
    return \@levelPredictions;
}
rtax-0.984/scripts/splitDelimitedFasta.pl0000755000076500000240000000144612122237445017515 0ustar loraxstaff#!/usr/bin/env perl

use strict;
use warnings;

# make sure the delimiter is properly escaped, e.g. for "+" we need "\+".
my $delimiter  = $ARGV[0];
my $infile     = $ARGV[1];
my $infilename = `basename $infile`;
chomp $infilename;

open( IN, $infile )         or die "Can't read $infile\n";
open( A, ">$infilename.a" ) or die "Can't write $infilename.a\n";
open( B, ">$infilename.b" ) or die "Can't write $infilename.b\n";

my $fastaHeader = "";
my $outfile     = *A;

while (<IN>) {
    chomp;
    if (/^>/) {

        # start of a new record: reset to stream A and copy the header to both outputs
        $outfile     = *A;
        $fastaHeader = $_;
        print A "$fastaHeader\n";
        print B "$fastaHeader\n";
    }
    else {
        if (/(.*)$delimiter(.*)/) {

            # the delimiter line itself is split between the two outputs
            print A "$1\n";
            print B "$2\n";
            $outfile = *B;
        }
        else {
            print $outfile "$_\n";
        }
    }
}
close IN;
close A;
close B;
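# Usage sketch (hypothetical file names; the delimiter argument is used unescaped
# in a regex, so a literal "+" must be passed as "\+"):
#
#     splitDelimitedFasta.pl '\+' joined.fasta
#
# This writes joined.fasta.a and joined.fasta.b in the current directory: each
# record's header line goes to both files, sequence lines before the delimiter
# line go to the .a file, the delimiter line itself is split between the two,
# and the remaining lines go to the .b file.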