Bio-DB-NCBIHelper-1.7.8000755000765000024 014536752166 14462 5ustar00cjfieldsstaff000000000000Changes100644000765000024 233714536752166 16043 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8Summary of important user-visible changes for Bio-DB-NCBIHelper --------------------------------------------------------------- 1.7.8 2023-12-14 22:22:13-06:00 America/Chicago * Fix a series of failing tests caused by changes to the data on the NCBI databases. 1.7.7 2021-06-30 08:52:57-05:00 America/Chicago * Merge pull request to address taxonomic name change, which led to failing network tests (thx @MrCurtis!). 1.7.6 2019-12-07 16:11:22-06:00 America/Chicago * Requires Bio::DB::WebDBSeqI v1.7.7 1.7.5 2019-11-29 16:55:33-06:00 America/Chicago * Explicitly add Bio::SeqIO::entrezgene and LWP::Protocol::https as dependencies * Actually get '-email' to work which requires overriding get_seq_stream * Fix remaining delay settings in tests which are causing issues during peak NCBI times 1.7.4 2019-03-10 21:29:52-05:00 America/Chicago * Switch away from Bio::Root::Test * Add '-email' parameter, which allows lower delay interval at NCBI (currently requires setting the '-delay' parameter separately). 1.7.3 2019-03-02 23:05:20-06:00 America/Chicago * First release after split from bioperl-live * Remove Bio::DB::RefSeq dependency, which isn't tested outside of Bio::DB::RefSeq. LICENSE100644000765000024 4723514536752166 15603 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8This software is copyright (c) 2023 by Aaron Mackey , Brian Osborne , Jason Stajich , Lincoln Stein . This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself. Terms of the Perl programming language system itself a) the GNU General Public License as published by the Free Software Foundation; either version 1, or (at your option) any later version, or b) the "Artistic License" --- The GNU General Public License, Version 1, February 1989 --- This software is Copyright (c) 2023 by Aaron Mackey , Brian Osborne , Jason Stajich , Lincoln Stein . This is free software, licensed under: The GNU General Public License, Version 1, February 1989 GNU GENERAL PUBLIC LICENSE Version 1, February 1989 Copyright (C) 1989 Free Software Foundation, Inc. 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble The license agreements of most software companies try to keep users at the mercy of those companies. By contrast, our General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. The General Public License applies to the Free Software Foundation's software and to any other program whose authors commit to using it. You can use it for your programs, too. When we speak of free software, we are referring to freedom, not price. Specifically, the General Public License is designed to make sure that you have the freedom to give away or sell copies of free software, that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. For example, if you distribute copies of a such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must tell them their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. The precise terms and conditions for copying, distribution and modification follow. GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License Agreement applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any work containing the Program or a portion of it, either verbatim or with modifications. Each licensee is addressed as "you". 1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this General Public License and to the absence of any warranty; and give any other recipients of the Program a copy of this General Public License along with the Program. You may charge a fee for the physical act of transferring a copy. 2. You may modify your copy or copies of the Program or any portion of it, and copy and distribute such modifications under the terms of Paragraph 1 above, provided that you also do the following: a) cause the modified files to carry prominent notices stating that you changed the files and the date of any change; and b) cause the whole of any work that you distribute or publish, that in whole or in part contains the Program or any part thereof, either with or without modifications, to be licensed at no charge to all third parties under the terms of this General Public License (except that you may choose to grant warranty protection to some or all third parties, at your option). c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the simplest and most usual way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this General Public License. d) You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. Mere aggregation of another independent work with the Program (or its derivative) on a volume of a storage or distribution medium does not bring the other work under the scope of these terms. 3. You may copy and distribute the Program (or a portion or derivative of it, under Paragraph 2) in object code or executable form under the terms of Paragraphs 1 and 2 above provided that you also do one of the following: a) accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Paragraphs 1 and 2 above; or, b) accompany it with a written offer, valid for at least three years, to give any third party free (except for a nominal charge for the cost of distribution) a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Paragraphs 1 and 2 above; or, c) accompany it with the information you received as to where the corresponding source code may be obtained. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form alone.) Source code for a work means the preferred form of the work for making modifications to it. For an executable file, complete source code means all the source code for all modules it contains; but, as a special exception, it need not include source code for modules which are standard libraries that accompany the operating system on which the executable file runs, or for standard header files or definitions files that accompany that operating system. 4. You may not copy, modify, sublicense, distribute or transfer the Program except as expressly provided under this General Public License. Any attempt otherwise to copy, modify, sublicense, distribute or transfer the Program is void, and will automatically terminate your rights to use the Program under this License. However, parties who have received copies, or rights to use copies, from you under this General Public License will not have their licenses terminated so long as such parties remain in full compliance. 5. By copying, distributing or modifying the Program (or any work based on the Program) you indicate your acceptance of this license to do so, and all its terms and conditions. 6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. 7. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of the license which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the license, you may choose any version ever published by the Free Software Foundation. 8. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 9. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 10. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. END OF TERMS AND CONDITIONS Appendix: How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to humanity, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. Copyright (C) 19yy This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 1, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston MA 02110-1301 USA Also add information on how to contact you by electronic and paper mail. If the program is interactive, make it output a short notice like this when it starts in an interactive mode: Gnomovision version 69, Copyright (C) 19xx name of author Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program. You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the program, if necessary. Here a sample; alter the names: Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision' (a program to direct compilers to make passes at assemblers) written by James Hacker. , 1 April 1989 Ty Coon, President of Vice That's all there is to it! --- The Perl Artistic License 1.0 --- This software is Copyright (c) 2023 by Aaron Mackey , Brian Osborne , Jason Stajich , Lincoln Stein . This is free software, licensed under: The Perl Artistic License 1.0 The "Artistic License" Preamble The intent of this document is to state the conditions under which a Package may be copied, such that the Copyright Holder maintains some semblance of artistic control over the development of the package, while giving the users of the package the right to use and distribute the Package in a more-or-less customary fashion, plus the right to make reasonable modifications. Definitions: "Package" refers to the collection of files distributed by the Copyright Holder, and derivatives of that collection of files created through textual modification. "Standard Version" refers to such a Package if it has not been modified, or has been modified in accordance with the wishes of the Copyright Holder as specified below. "Copyright Holder" is whoever is named in the copyright or copyrights for the package. "You" is you, if you're thinking about copying or distributing this Package. "Reasonable copying fee" is whatever you can justify on the basis of media cost, duplication charges, time of people involved, and so on. (You will not be required to justify it to the Copyright Holder, but only to the computing community at large as a market that must bear the fee.) "Freely Available" means that no fee is charged for the item itself, though there may be fees involved in handling the item. It also means that recipients of the item may redistribute it under the same conditions they received it. 1. You may make and give away verbatim copies of the source form of the Standard Version of this Package without restriction, provided that you duplicate all of the original copyright notices and associated disclaimers. 2. You may apply bug fixes, portability fixes and other modifications derived from the Public Domain or from the Copyright Holder. A Package modified in such a way shall still be considered the Standard Version. 3. You may otherwise modify your copy of this Package in any way, provided that you insert a prominent notice in each changed file stating how and when you changed that file, and provided that you do at least ONE of the following: a) place your modifications in the Public Domain or otherwise make them Freely Available, such as by posting said modifications to Usenet or an equivalent medium, or placing the modifications on a major archive site such as uunet.uu.net, or by allowing the Copyright Holder to include your modifications in the Standard Version of the Package. b) use the modified Package only within your corporation or organization. c) rename any non-standard executables so the names do not conflict with standard executables, which must also be provided, and provide a separate manual page for each non-standard executable that clearly documents how it differs from the Standard Version. d) make other distribution arrangements with the Copyright Holder. 4. You may distribute the programs of this Package in object code or executable form, provided that you do at least ONE of the following: a) distribute a Standard Version of the executables and library files, together with instructions (in the manual page or equivalent) on where to get the Standard Version. b) accompany the distribution with the machine-readable source of the Package with your modifications. c) give non-standard executables non-standard names, and clearly document the differences in manual pages (or equivalent), together with instructions on where to get the Standard Version. d) make other distribution arrangements with the Copyright Holder. 5. You may charge a reasonable copying fee for any distribution of this Package. You may charge any fee you choose for support of this Package. You may not charge a fee for this Package itself. However, you may distribute this Package in aggregate with other (possibly commercial) programs as part of a larger (possibly commercial) software distribution provided that you do not advertise this Package as a product of your own. You may embed this Package's interpreter within an executable of yours (by linking); this shall be construed as a mere form of aggregation, provided that the complete Standard Version of the interpreter is so embedded. 6. The scripts and library files supplied as input to or produced as output from the programs of this Package do not automatically fall under the copyright of this Package, but belong to whoever generated them, and may be sold commercially, and may be aggregated with this Package. If such scripts or library files are aggregated with this Package via the so-called "undump" or "unexec" methods of producing a binary executable image, then distribution of such an image shall neither be construed as a distribution of this Package nor shall it fall under the restrictions of Paragraphs 3 and 4, provided that you do not represent such an executable image as a Standard Version of this Package. 7. C subroutines (or comparably compiled subroutines in other languages) supplied by you and linked into this Package in order to emulate subroutines and variables of the language defined by this Package shall not be considered part of this Package, but are the equivalent of input as in Paragraph 6, provided these subroutines do not change the language in any way that would cause it to fail the regression tests for the language. 8. Aggregation of this Package with a commercial distribution is always permitted provided that the use of this Package is embedded; that is, when no overt attempt is made to make this Package's interfaces visible to the end user of the commercial distribution. Such use shall not be construed as a distribution of this Package. 9. The name of the Copyright Holder may not be used to endorse or promote products derived from this software without specific prior written permission. 10. THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. The End dist.ini100644000765000024 122514536752166 16207 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8name = Bio-DB-NCBIHelper version = 1.7.8 author = Aaron Mackey author = Brian Osborne author = Jason Stajich author = Lincoln Stein copyright_holder = Aaron Mackey , Brian Osborne , Jason Stajich , Lincoln Stein license = Perl_5 ;; Modules should be fixed so that these don't have to be removed. [@BioPerl] -remove = PodCoverageTests -remove = PodWeaver -remove = Test::EOL -remove = Test::NoTabs [Prereqs] Bio::DB::Query::WebQuery = 1.7.7 Bio::SeqIO::entrezgene = 0 LWP::Protocol::https = 0 META.yml100644000765000024 1734114536752166 16042 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8--- abstract: 'A collection of routines useful for queries to NCBI databases.' author: - 'Aaron Mackey ' - 'Brian Osborne ' - 'Jason Stajich ' - 'Lincoln Stein ' build_requires: File::Spec: '0' IO::Handle: '0' IPC::Open3: '0' Test::Exception: '0' Test::More: '0' Test::Most: '0' Test::RequiresInternet: '0' perl: '5.006' configure_requires: ExtUtils::MakeMaker: '0' dynamic_config: 0 generated_by: 'Dist::Zilla version 6.031, CPAN::Meta::Converter version 2.150010' license: perl meta-spec: url: http://module-build.sourceforge.net/META-spec-v1.4.html version: '1.4' name: Bio-DB-NCBIHelper requires: Bio::DB::Query::WebQuery: v1.7.7 Bio::DB::Taxonomy: '0' Bio::DB::WebDBSeqI: '0' Bio::Root::IO: '0' Bio::Root::Root: '0' Bio::SeqIO: '0' Bio::SeqIO::entrezgene: '0' Bio::Taxon: '0' Bio::WebAgent: '0' CGI: '0' Cache::FileCache: '0' Getopt::Long: '0' HTTP::Request::Common: '0' LWP::Protocol::https: '0' LWP::UserAgent: '0' URI: '0' URI::Escape: '0' XML::Twig: '0' base: '0' constant: '0' strict: '0' vars: '0' warnings: '0' resources: bugtracker: https://github.com/bioperl/bio-db-ncbihelper/issues homepage: https://metacpan.org/release/Bio-DB-NCBIHelper repository: git://github.com/bioperl/bio-db-ncbihelper.git version: 1.7.8 x_Dist_Zilla: perl: version: '5.034000' plugins: - class: Dist::Zilla::Plugin::GatherDir config: Dist::Zilla::Plugin::GatherDir: exclude_filename: [] exclude_match: [] follow_symlinks: 0 include_dotfiles: 0 prefix: '' prune_directory: [] root: . name: '@BioPerl/@Filter/GatherDir' version: '6.031' - class: Dist::Zilla::Plugin::PruneCruft name: '@BioPerl/@Filter/PruneCruft' version: '6.031' - class: Dist::Zilla::Plugin::ManifestSkip name: '@BioPerl/@Filter/ManifestSkip' version: '6.031' - class: Dist::Zilla::Plugin::MetaYAML name: '@BioPerl/@Filter/MetaYAML' version: '6.031' - class: Dist::Zilla::Plugin::License name: '@BioPerl/@Filter/License' version: '6.031' - class: Dist::Zilla::Plugin::ExtraTests name: '@BioPerl/@Filter/ExtraTests' version: '6.031' - class: Dist::Zilla::Plugin::ExecDir name: '@BioPerl/@Filter/ExecDir' version: '6.031' - class: Dist::Zilla::Plugin::ShareDir name: '@BioPerl/@Filter/ShareDir' version: '6.031' - class: Dist::Zilla::Plugin::MakeMaker config: Dist::Zilla::Role::TestRunner: default_jobs: 1 name: '@BioPerl/@Filter/MakeMaker' version: '6.031' - class: Dist::Zilla::Plugin::Manifest name: '@BioPerl/@Filter/Manifest' version: '6.031' - class: Dist::Zilla::Plugin::TestRelease name: '@BioPerl/@Filter/TestRelease' version: '6.031' - class: Dist::Zilla::Plugin::ConfirmRelease name: '@BioPerl/@Filter/ConfirmRelease' version: '6.031' - class: Dist::Zilla::Plugin::UploadToCPAN name: '@BioPerl/@Filter/UploadToCPAN' version: '6.031' - class: Dist::Zilla::Plugin::MetaConfig name: '@BioPerl/MetaConfig' version: '6.031' - class: Dist::Zilla::Plugin::MetaJSON name: '@BioPerl/MetaJSON' version: '6.031' - class: Dist::Zilla::Plugin::PkgVersion name: '@BioPerl/PkgVersion' version: '6.031' - class: Dist::Zilla::Plugin::PodSyntaxTests name: '@BioPerl/PodSyntaxTests' version: '6.031' - class: Dist::Zilla::Plugin::Test::Compile config: Dist::Zilla::Plugin::Test::Compile: bail_out_on_fail: '0' fail_on_warning: author fake_home: 0 filename: t/00-compile.t module_finder: - ':InstallModules' needs_display: 0 phase: test script_finder: - ':PerlExecFiles' skips: [] switch: [] name: '@BioPerl/Test::Compile' version: '2.058' - class: Dist::Zilla::Plugin::MojibakeTests name: '@BioPerl/MojibakeTests' version: '0.8' - class: Dist::Zilla::Plugin::AutoPrereqs name: '@BioPerl/AutoPrereqs' version: '6.031' - class: Dist::Zilla::Plugin::AutoMetaResources name: '@BioPerl/AutoMetaResources' version: '1.21' - class: Dist::Zilla::Plugin::MetaResources name: '@BioPerl/MetaResources' version: '6.031' - class: Dist::Zilla::Plugin::Encoding name: '@BioPerl/Encoding' version: '6.031' - class: Dist::Zilla::Plugin::NextRelease name: '@BioPerl/NextRelease' version: '6.031' - class: Dist::Zilla::Plugin::Git::Check config: Dist::Zilla::Plugin::Git::Check: untracked_files: die Dist::Zilla::Role::Git::DirtyFiles: allow_dirty: - Changes - dist.ini allow_dirty_match: [] changelog: Changes Dist::Zilla::Role::Git::Repo: git_version: '2.39.2 (Apple Git-143)' repo_root: . name: '@BioPerl/Git::Check' version: '2.048' - class: Dist::Zilla::Plugin::Git::Commit config: Dist::Zilla::Plugin::Git::Commit: add_files_in: [] commit_msg: v%V%n%n%c signoff: '0' Dist::Zilla::Role::Git::DirtyFiles: allow_dirty: - Changes - dist.ini allow_dirty_match: [] changelog: Changes Dist::Zilla::Role::Git::Repo: git_version: '2.39.2 (Apple Git-143)' repo_root: . Dist::Zilla::Role::Git::StringFormatter: time_zone: local name: '@BioPerl/Git::Commit' version: '2.048' - class: Dist::Zilla::Plugin::Git::Tag config: Dist::Zilla::Plugin::Git::Tag: branch: ~ changelog: Changes signed: 0 tag: Bio-DB-NCBIHelper-v1.7.8 tag_format: '%N-v%v' tag_message: '%N-v%v' Dist::Zilla::Role::Git::Repo: git_version: '2.39.2 (Apple Git-143)' repo_root: . Dist::Zilla::Role::Git::StringFormatter: time_zone: local name: '@BioPerl/Git::Tag' version: '2.048' - class: Dist::Zilla::Plugin::Prereqs config: Dist::Zilla::Plugin::Prereqs: phase: runtime type: requires name: Prereqs version: '6.031' - class: Dist::Zilla::Plugin::FinderCode name: ':InstallModules' version: '6.031' - class: Dist::Zilla::Plugin::FinderCode name: ':IncModules' version: '6.031' - class: Dist::Zilla::Plugin::FinderCode name: ':TestFiles' version: '6.031' - class: Dist::Zilla::Plugin::FinderCode name: ':ExtraTestFiles' version: '6.031' - class: Dist::Zilla::Plugin::FinderCode name: ':ExecFiles' version: '6.031' - class: Dist::Zilla::Plugin::FinderCode name: ':PerlExecFiles' version: '6.031' - class: Dist::Zilla::Plugin::FinderCode name: ':ShareFiles' version: '6.031' - class: Dist::Zilla::Plugin::FinderCode name: ':MainModule' version: '6.031' - class: Dist::Zilla::Plugin::FinderCode name: ':AllFiles' version: '6.031' - class: Dist::Zilla::Plugin::FinderCode name: ':NoFiles' version: '6.031' zilla: class: Dist::Zilla::Dist::Builder config: is_trial: '0' version: '6.031' x_generated_by_perl: v5.34.0 x_serialization_backend: 'YAML::Tiny version 1.74' x_spdx_expression: 'Artistic-1.0-Perl OR GPL-1.0-or-later' MANIFEST100644000765000024 107714536752166 15701 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8# This file was automatically generated by Dist::Zilla::Plugin::Manifest v6.031. Changes LICENSE MANIFEST META.json META.yml Makefile.PL bin/bp_biofetch_genbank_proxy bin/bp_download_query_genbank bin/bp_query_entrez_taxa dist.ini eg/gb2features.pl eg/getGenBank.pl eg/subsequence.cgi lib/Bio/DB/EntrezGene.pm lib/Bio/DB/GenBank.pm lib/Bio/DB/GenPept.pm lib/Bio/DB/NCBIHelper.pm lib/Bio/DB/Query/GenBank.pm lib/Bio/DB/Taxonomy/entrez.pm t/00-compile.t t/EntrezGene.t t/GenBank.t t/GenPept.t t/Query-Genbank.t t/RefSeq.t t/Taxonomy.t t/author-mojibake.t t/author-pod-syntax.t META.json100644000765000024 2777214536752166 16223 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8{ "abstract" : "A collection of routines useful for queries to NCBI databases.", "author" : [ "Aaron Mackey ", "Brian Osborne ", "Jason Stajich ", "Lincoln Stein " ], "dynamic_config" : 0, "generated_by" : "Dist::Zilla version 6.031, CPAN::Meta::Converter version 2.150010", "license" : [ "perl_5" ], "meta-spec" : { "url" : "http://search.cpan.org/perldoc?CPAN::Meta::Spec", "version" : 2 }, "name" : "Bio-DB-NCBIHelper", "prereqs" : { "configure" : { "requires" : { "ExtUtils::MakeMaker" : "0" } }, "develop" : { "requires" : { "Test::Mojibake" : "0", "Test::Pod" : "1.41" } }, "runtime" : { "requires" : { "Bio::DB::Query::WebQuery" : "v1.7.7", "Bio::DB::Taxonomy" : "0", "Bio::DB::WebDBSeqI" : "0", "Bio::Root::IO" : "0", "Bio::Root::Root" : "0", "Bio::SeqIO" : "0", "Bio::SeqIO::entrezgene" : "0", "Bio::Taxon" : "0", "Bio::WebAgent" : "0", "CGI" : "0", "Cache::FileCache" : "0", "Getopt::Long" : "0", "HTTP::Request::Common" : "0", "LWP::Protocol::https" : "0", "LWP::UserAgent" : "0", "URI" : "0", "URI::Escape" : "0", "XML::Twig" : "0", "base" : "0", "constant" : "0", "strict" : "0", "vars" : "0", "warnings" : "0" } }, "test" : { "requires" : { "File::Spec" : "0", "IO::Handle" : "0", "IPC::Open3" : "0", "Test::Exception" : "0", "Test::More" : "0", "Test::Most" : "0", "Test::RequiresInternet" : "0", "perl" : "5.006" } } }, "release_status" : "stable", "resources" : { "bugtracker" : { "mailto" : "bioperl-l@bioperl.org", "web" : "https://github.com/bioperl/bio-db-ncbihelper/issues" }, "homepage" : "https://metacpan.org/release/Bio-DB-NCBIHelper", "repository" : { "type" : "git", "url" : "git://github.com/bioperl/bio-db-ncbihelper.git", "web" : "https://github.com/bioperl/bio-db-ncbihelper" } }, "version" : "1.7.8", "x_Dist_Zilla" : { "perl" : { "version" : "5.034000" }, "plugins" : [ { "class" : "Dist::Zilla::Plugin::GatherDir", "config" : { "Dist::Zilla::Plugin::GatherDir" : { "exclude_filename" : [], "exclude_match" : [], "follow_symlinks" : 0, "include_dotfiles" : 0, "prefix" : "", "prune_directory" : [], "root" : "." } }, "name" : "@BioPerl/@Filter/GatherDir", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::PruneCruft", "name" : "@BioPerl/@Filter/PruneCruft", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::ManifestSkip", "name" : "@BioPerl/@Filter/ManifestSkip", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::MetaYAML", "name" : "@BioPerl/@Filter/MetaYAML", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::License", "name" : "@BioPerl/@Filter/License", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::ExtraTests", "name" : "@BioPerl/@Filter/ExtraTests", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::ExecDir", "name" : "@BioPerl/@Filter/ExecDir", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::ShareDir", "name" : "@BioPerl/@Filter/ShareDir", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::MakeMaker", "config" : { "Dist::Zilla::Role::TestRunner" : { "default_jobs" : 1 } }, "name" : "@BioPerl/@Filter/MakeMaker", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::Manifest", "name" : "@BioPerl/@Filter/Manifest", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::TestRelease", "name" : "@BioPerl/@Filter/TestRelease", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::ConfirmRelease", "name" : "@BioPerl/@Filter/ConfirmRelease", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::UploadToCPAN", "name" : "@BioPerl/@Filter/UploadToCPAN", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::MetaConfig", "name" : "@BioPerl/MetaConfig", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::MetaJSON", "name" : "@BioPerl/MetaJSON", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::PkgVersion", "name" : "@BioPerl/PkgVersion", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::PodSyntaxTests", "name" : "@BioPerl/PodSyntaxTests", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::Test::Compile", "config" : { "Dist::Zilla::Plugin::Test::Compile" : { "bail_out_on_fail" : 0, "fail_on_warning" : "author", "fake_home" : 0, "filename" : "t/00-compile.t", "module_finder" : [ ":InstallModules" ], "needs_display" : 0, "phase" : "test", "script_finder" : [ ":PerlExecFiles" ], "skips" : [], "switch" : [] } }, "name" : "@BioPerl/Test::Compile", "version" : "2.058" }, { "class" : "Dist::Zilla::Plugin::MojibakeTests", "name" : "@BioPerl/MojibakeTests", "version" : "0.8" }, { "class" : "Dist::Zilla::Plugin::AutoPrereqs", "name" : "@BioPerl/AutoPrereqs", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::AutoMetaResources", "name" : "@BioPerl/AutoMetaResources", "version" : "1.21" }, { "class" : "Dist::Zilla::Plugin::MetaResources", "name" : "@BioPerl/MetaResources", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::Encoding", "name" : "@BioPerl/Encoding", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::NextRelease", "name" : "@BioPerl/NextRelease", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::Git::Check", "config" : { "Dist::Zilla::Plugin::Git::Check" : { "untracked_files" : "die" }, "Dist::Zilla::Role::Git::DirtyFiles" : { "allow_dirty" : [ "Changes", "dist.ini" ], "allow_dirty_match" : [], "changelog" : "Changes" }, "Dist::Zilla::Role::Git::Repo" : { "git_version" : "2.39.2 (Apple Git-143)", "repo_root" : "." } }, "name" : "@BioPerl/Git::Check", "version" : "2.048" }, { "class" : "Dist::Zilla::Plugin::Git::Commit", "config" : { "Dist::Zilla::Plugin::Git::Commit" : { "add_files_in" : [], "commit_msg" : "v%V%n%n%c", "signoff" : 0 }, "Dist::Zilla::Role::Git::DirtyFiles" : { "allow_dirty" : [ "Changes", "dist.ini" ], "allow_dirty_match" : [], "changelog" : "Changes" }, "Dist::Zilla::Role::Git::Repo" : { "git_version" : "2.39.2 (Apple Git-143)", "repo_root" : "." }, "Dist::Zilla::Role::Git::StringFormatter" : { "time_zone" : "local" } }, "name" : "@BioPerl/Git::Commit", "version" : "2.048" }, { "class" : "Dist::Zilla::Plugin::Git::Tag", "config" : { "Dist::Zilla::Plugin::Git::Tag" : { "branch" : null, "changelog" : "Changes", "signed" : 0, "tag" : "Bio-DB-NCBIHelper-v1.7.8", "tag_format" : "%N-v%v", "tag_message" : "%N-v%v" }, "Dist::Zilla::Role::Git::Repo" : { "git_version" : "2.39.2 (Apple Git-143)", "repo_root" : "." }, "Dist::Zilla::Role::Git::StringFormatter" : { "time_zone" : "local" } }, "name" : "@BioPerl/Git::Tag", "version" : "2.048" }, { "class" : "Dist::Zilla::Plugin::Prereqs", "config" : { "Dist::Zilla::Plugin::Prereqs" : { "phase" : "runtime", "type" : "requires" } }, "name" : "Prereqs", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":InstallModules", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":IncModules", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":TestFiles", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":ExtraTestFiles", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":ExecFiles", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":PerlExecFiles", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":ShareFiles", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":MainModule", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":AllFiles", "version" : "6.031" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":NoFiles", "version" : "6.031" } ], "zilla" : { "class" : "Dist::Zilla::Dist::Builder", "config" : { "is_trial" : 0 }, "version" : "6.031" } }, "x_generated_by_perl" : "v5.34.0", "x_serialization_backend" : "Cpanel::JSON::XS version 4.37", "x_spdx_expression" : "Artistic-1.0-Perl OR GPL-1.0-or-later" } t000755000765000024 014536752166 14646 5ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8RefSeq.t100644000765000024 103014536752166 16352 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/t# -*-Perl-*- Test Harness script for Bioperl # $Id$ use strict; BEGIN { use Test::Most tests => 5; use Test::RequiresInternet; use_ok('Bio::DB::GenBank'); } my $verbose = $ENV{'BIOPERLDEBUG'} || -1; my ($db,$seq); # get a single seq $seq = undef; #test redirection from GenBank ok $db = Bio::DB::GenBank->new('-verbose'=> $verbose); throws_ok {$seq = $db->get_Seq_by_acc('NT_006732')} qr/NT_ contigs are whole chromosome files/; SKIP: { ok($seq = $db->get_Seq_by_acc('NM_006732')); is($seq->length, 3775); } GenBank.t100644000765000024 1650114536752166 16523 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/t# -*-Perl-*- Test Harness script for Bioperl # $Id$ use strict; BEGIN { use Test::Most tests => 44; use Test::RequiresInternet; use_ok('Bio::DB::GenBank'); } my %expected_lengths = ( 'MUSIGHBA1' => 408, 'J00522.1' => 408, 'AF303112' => 1611, 'AF303112.1' => 1611, 'AF041456' => 1156, 'CELRABGDI' => 1743, 'JH374761' => 38055, ); my ($gb, $seq, $seqio, $seqin); my %params; if (defined $ENV{BIOPERLEMAIL}) { $params{'-email'} = $ENV{BIOPERLEMAIL}; $params{'-delay'} = 2; } $params{'-verbose'} = $ENV{BIOPERLDEBUG}; # # Bio::DB::GenBank # ok $gb = Bio::DB::GenBank->new(%params), 'Bio::DB::GenBank'; # get a single seq SKIP: { eval {$seq = $gb->get_Seq_by_id('J00522');1}; skip "Couldn't connect to Genbank with Bio::DB::GenBank.pm. Do you have network access? Skipping GenBank tests: $@", 4 if $@; is $seq->length, $expected_lengths{$seq->display_id}, $seq->display_id; eval {$seq = $gb->get_Seq_by_acc('AF303112');}; skip "Couldn't connect to Genbank with Bio::DB::GenBank.pm. Transient network problems? Skipping GenBank tests: $@", 3 if $@; is $seq->length, $expected_lengths{$seq->display_id}, $seq->display_id; eval {$seq = $gb->get_Seq_by_version('AF303112.1');}; skip "Couldn't connect to Genbank with Bio::DB::GenBank.pm. Transient network problems? Skipping GenBank tests: $@", 2 if $@; is $seq->length, $expected_lengths{$seq->display_id}, $seq->display_id; eval {$seq = $gb->get_Seq_by_gi('405830');}; skip "Couldn't connect to Genbank with Bio::DB::GenBank.pm. Transient network problems? Skipping GenBank tests: $@", 1 if $@; is $seq->length, $expected_lengths{$seq->display_id}, $seq->display_id; } $seq = $seqio = undef; # batch mode SKIP: { eval {$seqio = $gb->get_Stream_by_id([qw(J00522 AF303112 2981014)]);}; skip "Batch access test failed for Genbank. Skipping those tests: $@", 4 if $@; my $done = 0; while (my $s = $seqio->next_seq) { is $s->length, $expected_lengths{$s->display_id}, $s->display_id; $done++; } skip('No seqs returned', 4) if !$done; is $done, 3; } $seq = $seqio = undef; # test the temporary file creation and fasta ok $gb = Bio::DB::GenBank->new('-format' => 'fasta', '-retrievaltype' => 'tempfile', %params), "Tempfile tests"; SKIP: { eval {$seq = $gb->get_Seq_by_id('J00522');}; skip "Couldn't connect to complete GenBank tests with a tempfile with Bio::DB::GenBank.pm. Skipping those tests: $@", 6 if $@; # last part of id holds the key is $seq->length, $expected_lengths{(split(/\|/,$seq->display_id))[-1]}, "Check tmpfile: get_Seq_by_id:".$seq->display_id; eval {$seq = $gb->get_Seq_by_acc('AF303112');}; skip "Couldn't connect to complete GenBank tests with a tempfile with Bio::DB::GenBank.pm. Skipping those tests: $@", 5 if $@; # last part of id holds the key is $seq->length, $expected_lengths{(split(/\|/,$seq->display_id))[-1]}, "Check tmpfile: get_Seq_by_acc:".$seq->display_id; # batch mode requires genbank format $gb->request_format("gb"); eval {$seqio = $gb->get_Stream_by_id([qw(J00522 AF303112 2981014)]);}; skip "Couldn't connect to complete GenBank batch tests with a tempfile with Bio::DB::GenBank.pm. Skipping those tests: $@", 4 if $@; my $done = 0; while (my $s = $seqio->next_seq) { is $s->length, $expected_lengths{$s->display_id}, "Check tmpfile: get_Stream_by_id:".$s->display_id; undef $gb; # test the case where the db is gone, # but a temp file should remain until seqio goes away. $done++; } skip('No seqs returned', 4) if !$done; is $done, 3; } $seq = $seqio = undef; # test pipeline creation ok $gb = Bio::DB::GenBank->new('-retrievaltype' => 'pipeline', %params), "Pipeline tests"; SKIP: { eval {$seq = $gb->get_Seq_by_id('J00522');}; skip "Couldn't connect to complete GenBank tests with a pipeline with Bio::DB::GenBank.pm. Skipping those tests: $@", 6 if $@; is $seq->length, $expected_lengths{$seq->display_id}, "Check pipeline: get_Seq_by_id:".$seq->display_id; eval {$seq = $gb->get_Seq_by_acc('AF303112');}; skip "Couldn't connect to complete GenBank tests with a pipeline with Bio::DB::GenBank.pm. Skipping those tests: $@", 5 if $@; is $seq->length, $expected_lengths{$seq->display_id}, "Check pipeline: get_Seq_by_acc:".$seq->display_id; eval {$seqio = $gb->get_Stream_by_id([qw(J00522 AF303112 2981014)]);}; skip "Couldn't connect to complete GenBank tests with a pipeline with Bio::DB::GenBank.pm. Skipping those tests: $@", 4 if $@; my $done = 0; while (my $s = $seqio->next_seq) { is $s->length, $expected_lengths{$s->display_id}, "Check pipeline: get_Stream_by_id:".$s->display_id; undef $gb; # test the case where the db is gone, # but the pipeline should remain until seqio goes away $done++; } skip('No seqs returned', 4) if !$done; is $done, 3; } $seq = $seqio = undef; # test contig retrieval ok $gb = Bio::DB::GenBank->new('-format' => 'gbwithparts', %params); SKIP: { eval {$seq = $gb->get_Seq_by_id('JH374761');}; skip "Couldn't connect to GenBank with Bio::DB::GenBank.pm. Skipping those tests: $@", 3 if $@; is $seq->length, $expected_lengths{$seq->display_id}, "Check contig: get_Seq_by_id:".$seq->display_id; # now to check that postprocess_data in NCBIHelper catches CONTIG... ok $gb = Bio::DB::GenBank->new('-format' => 'gb',%params); eval {$seq = $gb->get_Seq_by_id('JH374761');}; skip "Couldn't connect to GenBank with Bio::DB::GenBank.pm. Skipping those tests: $@", 1 if $@; is $seq->length, $expected_lengths{$seq->display_id}, "Check contig: get_Seq_by_acc".$seq->display_id; } $seq = $seqio = undef; # bug 1405 my @result; ok $gb = Bio::DB::GenBank->new(-format => 'Fasta', -seq_start => 2, -seq_stop => 7, %params); SKIP: { eval {$seq = $gb->get_Seq_by_acc("A11111");}; skip "Couldn't connect to complete GenBank tests. Skipping those tests: $@", 15 if $@; is $seq->length, 6; # complexity tests ok $gb = Bio::DB::GenBank->new(-format => 'fasta', -complexity => 0, %params); eval {$seqin = $gb->get_Stream_by_acc("21614549");}; skip "Couldn't connect to complete GenBank tests. Skipping those tests: $@", 13 if $@; my @result = (4366, 'dna', 620, 'protein'); # Test number is labile (dependent on remote results) while ($seq = $seqin->next_seq) { is $seq->length, shift(@result); is $seq->alphabet, shift(@result); } is(@result, 0, @result ? "Missing results:".join(",", @result) : "All results checked"); # Real batch retrieval using epost/efetch # these tests may change if integrated further into Bio::DB::Gen* # Currently only useful for retrieving GI's via get_seq_stream $gb = Bio::DB::GenBank->new(%params); eval {$seqin = $gb->get_seq_stream(-uids => [4887706 ,431229, 147460], -mode => 'batch');}; skip "Couldn't connect to complete GenBank batchmode epost/efetch tests. Skipping those tests: $@", 8 if $@; my %result = ('M59757' => 12611 ,'X76083'=> 3140, 'J01670'=> 1593); my $ct = 0; # Test number is labile (dependent on remote results) while ($seq = $seqin->next_seq) { $ct++; my $acc = $seq->accession; ok exists $result{ $acc }; is $seq->length, $result{ $acc }; delete $result{$acc}; } skip('No seqs returned', 8) if !$ct; is $ct, 3; is %result, 0; } GenPept.t100644000765000024 505614536752166 16543 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/t# -*-Perl-*- Test Harness script for Bioperl # $Id$ use strict; BEGIN { use Test::Most tests => 21; use Test::RequiresInternet; use_ok('Bio::DB::GenPept'); } my %expected_lengths = ( 'AAC06201' => 353, 'AAD15290' => 136, 'P31383' => 635, '2AAA_YEAST' => 635 ); my %params; if (defined $ENV{BIOPERLEMAIL}) { $params{'-email'} = $ENV{BIOPERLEMAIL}; $params{'-delay'} = 2; } $params{'-verbose'} = $ENV{BIOPERLDEBUG}; my ($gb, $seq, $seqio); # # Bio::DB::GenPept # ok $gb = Bio::DB::GenPept->new(%params); SKIP: { eval {$seqio = $gb->get_seq_stream(-uids => [2981015, 1621261, 195055], -mode => 'batch');}; skip "Couldn't connect to complete GenPept tests. Skipping those tests", 8 if $@; my %result = ('AAC06201' => 353, 'CAB02640' => 193, 'AAD15290' => 136); my $ct = 0; while ($seq = $seqio->next_seq) { $ct++; my $acc = $seq->accession; ok exists $result{ $acc }; is $seq->length, $result{ $acc }; delete $result{$acc}; } skip('No seqs returned', 8) if !$ct; is $ct, 3; is %result, 0; } $seq = $seqio = undef; ok $gb = Bio::DB::GenPept->new(%params); SKIP: { eval {$seq = $gb->get_Seq_by_id('195055');}; skip "Couldn't connect to Genbank with Bio::DB::GenPept.pm. Skipping those tests: $@", 10 if $@; is $seq->length, $expected_lengths{$seq->display_id}, $seq->display_id; eval {$seq = $gb->get_Seq_by_acc('AAC06201');}; skip "Couldn't connect to Genbank with Bio::DB::GenPept.pm. Skipping those tests $@", 9 if $@; is $seq->length, $expected_lengths{$seq->display_id}, $seq->display_id; eval {$seqio = $gb->get_Stream_by_id([qw(AAC06201 195055)]);}; skip "Couldn't connect to Genbank with Bio::DB::GenPept.pm. Skipping those tests: $@", 8 if $@; my $done = 0; while( my $s = $seqio->next_seq ) { is $s->length, $expected_lengths{$s->display_id}, $s->display_id; $done++; } skip('No seqs returned', 8) if !$done; is $done, 2; # swissprot genpept parsing eval {$seq = $gb->get_Seq_by_acc('P31383');}; skip "Couldn't connect to Genbank with Bio::DB::GenPept.pm. Skipping those tests: $@", 5 if $@; is $seq->length, $expected_lengths{$seq->display_id}, $seq->display_id; # test dbsource stuff # small chance this might change but hopefully not my @annot = $seq->annotation->get_Annotations('dblink'); cmp_ok(scalar(@annot), '>', 31); is $annot[0]->database, 'UniProtKB'; is $annot[0]->primary_id, '2AAA_YEAST'; is (($seq->annotation->get_Annotations('swissprot_dates'))[0]->value, 'Jul 1, 1993'); } Makefile.PL100644000765000024 521114536752166 16514 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8# This file was automatically generated by Dist::Zilla::Plugin::MakeMaker v6.031. use strict; use warnings; use 5.006; use ExtUtils::MakeMaker; my %WriteMakefileArgs = ( "ABSTRACT" => "A collection of routines useful for queries to NCBI databases.", "AUTHOR" => "Aaron Mackey , Brian Osborne , Jason Stajich , Lincoln Stein ", "CONFIGURE_REQUIRES" => { "ExtUtils::MakeMaker" => 0 }, "DISTNAME" => "Bio-DB-NCBIHelper", "EXE_FILES" => [ "bin/bp_biofetch_genbank_proxy", "bin/bp_download_query_genbank", "bin/bp_query_entrez_taxa" ], "LICENSE" => "perl", "MIN_PERL_VERSION" => "5.006", "NAME" => "Bio::DB::NCBIHelper", "PREREQ_PM" => { "Bio::DB::Query::WebQuery" => "1.7.7", "Bio::DB::Taxonomy" => 0, "Bio::DB::WebDBSeqI" => 0, "Bio::Root::IO" => 0, "Bio::Root::Root" => 0, "Bio::SeqIO" => 0, "Bio::SeqIO::entrezgene" => 0, "Bio::Taxon" => 0, "Bio::WebAgent" => 0, "CGI" => 0, "Cache::FileCache" => 0, "Getopt::Long" => 0, "HTTP::Request::Common" => 0, "LWP::Protocol::https" => 0, "LWP::UserAgent" => 0, "URI" => 0, "URI::Escape" => 0, "XML::Twig" => 0, "base" => 0, "constant" => 0, "strict" => 0, "vars" => 0, "warnings" => 0 }, "TEST_REQUIRES" => { "File::Spec" => 0, "IO::Handle" => 0, "IPC::Open3" => 0, "Test::Exception" => 0, "Test::More" => 0, "Test::Most" => 0, "Test::RequiresInternet" => 0 }, "VERSION" => "1.7.8", "test" => { "TESTS" => "t/*.t" } ); my %FallbackPrereqs = ( "Bio::DB::Query::WebQuery" => "1.7.7", "Bio::DB::Taxonomy" => 0, "Bio::DB::WebDBSeqI" => 0, "Bio::Root::IO" => 0, "Bio::Root::Root" => 0, "Bio::SeqIO" => 0, "Bio::SeqIO::entrezgene" => 0, "Bio::Taxon" => 0, "Bio::WebAgent" => 0, "CGI" => 0, "Cache::FileCache" => 0, "File::Spec" => 0, "Getopt::Long" => 0, "HTTP::Request::Common" => 0, "IO::Handle" => 0, "IPC::Open3" => 0, "LWP::Protocol::https" => 0, "LWP::UserAgent" => 0, "Test::Exception" => 0, "Test::More" => 0, "Test::Most" => 0, "Test::RequiresInternet" => 0, "URI" => 0, "URI::Escape" => 0, "XML::Twig" => 0, "base" => 0, "constant" => 0, "strict" => 0, "vars" => 0, "warnings" => 0 ); unless ( eval { ExtUtils::MakeMaker->VERSION(6.63_03) } ) { delete $WriteMakefileArgs{TEST_REQUIRES}; delete $WriteMakefileArgs{BUILD_REQUIRES}; $WriteMakefileArgs{PREREQ_PM} = \%FallbackPrereqs; } delete $WriteMakefileArgs{CONFIGURE_REQUIRES} unless eval { ExtUtils::MakeMaker->VERSION(6.52) }; WriteMakefile(%WriteMakefileArgs); Taxonomy.t100644000765000024 1334514536752166 17037 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/tuse strict; use Test::Exception; use Test::More; use Test::RequiresInternet; use_ok('Bio::DB::Taxonomy'); my %params; if (defined $ENV{BIOPERLEMAIL}) { $params{'-email'} = $ENV{BIOPERLEMAIL}; $params{'-delay'} = 1; } { ok my $db = Bio::DB::Taxonomy->new(-source => 'entrez', %params); isa_ok $db, 'Bio::DB::Taxonomy::entrez'; isa_ok $db, 'Bio::DB::Taxonomy'; } { my $db = Bio::DB::Taxonomy->new(-source => 'entrez', %params); my $id; my $n; cmp_ok $db->get_num_taxa, '>', 880_000; # 886,907 as of 08-May-2012 $id = $db->get_taxonid('Homo sapiens'); is $id, 9606; # easy test on human, try out the main Taxon methods ok $n = $db->get_taxon(9606); is $n->id, 9606; is $n->object_id, $n->id; is $n->ncbi_taxid, $n->id; is $n->parent_id, 9605; is $n->rank, 'species'; is $n->node_name, 'Homo sapiens'; is $n->scientific_name, $n->node_name; is ${$n->name('scientific')}[0], $n->node_name; my %common_names = map { $_ => 1 } $n->common_names; cmp_ok keys %common_names, '>=', 3, ref($db).": common names"; ok exists $common_names{human}; is $n->division, 'Primates'; is $n->genetic_code, 1; is $n->mitochondrial_genetic_code, 2; ok defined $n->pub_date; ok defined $n->create_date; ok defined $n->update_date; # briefly test some Bio::Tree::NodeI methods ok my $ancestor = $n->ancestor; is $ancestor->scientific_name, 'Homo'; # unless set explicitly, Bio::Taxon doesn't return anything for # each_Descendent; must ask the database directly ok my @children = $ancestor->db_handle->each_Descendent($ancestor); cmp_ok @children, '>', 0; sleep(3); # do some trickier things... ok my $n2 = $db->get_Taxonomy_Node('89593'); is $n2->scientific_name, 'Craniata'; # briefly check we can use some Tree methods my $tree = Bio::Tree::Tree->new(); is $tree->get_lca($n, $n2)->scientific_name, 'Craniata'; # get lineage_nodes my @nodes = $tree->get_nodes; is scalar(@nodes), 0; my @lineage_nodes; @lineage_nodes = $tree->get_lineage_nodes($n->id); # read ID, only works if nodes have been added to tree is scalar @lineage_nodes, 0; @lineage_nodes = $tree->get_lineage_nodes($n); # node object always works cmp_ok(scalar @lineage_nodes, '>', 20); # get lineage string like($tree->get_lineage_string($n), qr/cellular organisms;Eukaryota/); like($tree->get_lineage_string($n,'-'), qr/cellular organisms-Eukaryota/); like($tree->get_lineage_string($n2), qr/cellular organisms;Eukaryota/); # can we actually form a Tree and use other Tree methods? ok $tree = Bio::Tree::Tree->new(-node => $n); cmp_ok($tree->number_nodes, '>', 20); cmp_ok(scalar($tree->get_nodes), '>', 20); is $tree->find_node(-rank => 'genus')->scientific_name, 'Homo'; # check that getting the ancestor still works now we have explitly set the # ancestor by making a Tree is $n->ancestor->scientific_name, 'Homo'; sleep(3); ok $n = $db->get_Taxonomy_Node('1760'); is $n->scientific_name, 'Actinomycetes'; sleep(3); # entrez isn't as good at searching as flatfile, so we have to special-case my @ids = sort $db->get_taxonids('Chloroflexi'); is scalar @ids, 2; is_deeply \@ids, [200795, 32061]; $id = $db->get_taxonids('Chloroflexi (class)'); is($id, 'No hit'); @ids = $db->get_taxonids('Rhodotorula'); cmp_ok @ids, '>=' , 1; # From NCBI: Taxid 592558 was merged into taxid 5533 on June 16, 2017 is( (grep { $_ == 592558 } @ids), 0, 'Value no longer found'); ok grep { $_ == 5533 } @ids; } # we can recursively fetch all descendents of a taxon { my $db = Bio::DB::Taxonomy->new(-source=>"entrez", %params); $db->get_taxon(10090); my $lca = $db->get_taxon(314146); my @descs = $db->get_all_Descendents($lca); cmp_ok @descs, '>=', 17; } # tests for #182 { my $db = Bio::DB::Taxonomy->new(-source=>"entrez", %params); my @taxa = qw(viruses Deltavirus unclassified plasmid); for my $taxon (@taxa) { test_taxid($db, $taxon); } sub test_taxid { my ($db, $taxa) = @_; my @taxonids = $db->get_taxonids($taxa); cmp_ok(scalar(@taxonids), '>', 0, "Got IDs returned for $taxa:".join(',', @taxonids)); my $taxon; lives_ok { $taxon = $db->get_taxon(-taxonid => pop @taxonids) } "IDs generates a Bio::Taxonomy::Node"; if (defined $taxon) { like( $taxon->scientific_name, qr/$taxa/i, "Name returned matches $taxa"); } else { ok(0, "No taxon object returned for $taxa"); } } } # tests for #212 { my $db = Bio::DB::Taxonomy->new( -source => "entrez", %params); # String | What I expect | What I get # ---------------------- | ------------- | ---------- # 'Lissotriton vulgaris' | 8324 | 8324 # 'Chlorella vulgaris' | 3077 | 3077 # 'Phygadeuon solidus' | 1763951 | 1763951 # 'Ovatus' | 666060 | 666060 # 'Phygadeuon ovatus' | 2890685 | 2890685 # 'Zaphod Beeblebrox' | "No hit" | "No hit" my @ids; @ids = $db->get_taxonids('Lissotriton vulgaris'); is $ids[0], 8324, 'Correct: Lissotriton vulgaris'; @ids = $db->get_taxonids('Chlorella vulgaris'); is $ids[0], 3077, 'Correct: Chlorella vulgaris'; @ids = $db->get_taxonids('Phygadeuon solidus'); is $ids[0], 1763951, 'Correct: Phygadeuon solidus'; @ids = $db->get_taxonids('Ovatus'); is $ids[0], 666060, 'Correct: Ovatus'; @ids = $db->get_taxonids('Phygadeuon ovatus'); is $ids[0], '2890685', 'Correct: 2890685'; @ids = $db->get_taxonids('Zaphod Beeblebrox'); is $ids[0], 'No hit', 'Correct: No hit'; } done_testing(); EntrezGene.t100644000765000024 203514536752166 17241 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/t# -*-Perl-*- Test Harness script for Bioperl # $Id$ use strict; BEGIN { use Test::Most tests => 9; use Test::RequiresInternet; use_ok('Bio::SeqIO::entrezgene'); use_ok('Bio::DB::EntrezGene'); } my %params; if (defined $ENV{BIOPERLEMAIL}) { $params{'-email'} = $ENV{BIOPERLEMAIL}; $params{'-delay'} = 2; } $params{'-verbose'} = $ENV{BIOPERLDEBUG}; my ($gb, $seq, $seqio); ok $gb = Bio::DB::EntrezGene->new(-retrievaltype => 'tempfile', %params); # # Bio::DB::EntrezGene # SKIP: { eval {$seqio = $gb->get_Stream_by_id([2,3064]);}; skip "Couldn't connect to Entrez with Bio::DB::EntrezGene. Skipping those tests", 6 if $@; $seq = $seqio->next_seq; is $seq->display_id, "A2M"; is $seq->accession_number, 2; $seq = $seqio->next_seq; is $seq->display_id, "HTT"; is $seq->accession_number, 3064; eval {$seq = $gb->get_Seq_by_id(6099);}; skip "Couldn't connect to Entrez with Bio::DB::EntrezGene. Skipping those tests", 2 if $@; is $seq->display_id, "RP"; is $seq->accession_number, 6099; } 00-compile.t100644000765000024 565614536752166 17054 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/tuse 5.006; use strict; use warnings; # this test was generated with Dist::Zilla::Plugin::Test::Compile 2.058 use Test::More; plan tests => 9 + ($ENV{AUTHOR_TESTING} ? 1 : 0); my @module_files = ( 'Bio/DB/EntrezGene.pm', 'Bio/DB/GenBank.pm', 'Bio/DB/GenPept.pm', 'Bio/DB/NCBIHelper.pm', 'Bio/DB/Query/GenBank.pm', 'Bio/DB/Taxonomy/entrez.pm' ); my @scripts = ( 'bin/bp_biofetch_genbank_proxy', 'bin/bp_download_query_genbank', 'bin/bp_query_entrez_taxa' ); # no fake home requested my @switches = ( -d 'blib' ? '-Mblib' : '-Ilib', ); use File::Spec; use IPC::Open3; use IO::Handle; open my $stdin, '<', File::Spec->devnull or die "can't open devnull: $!"; my @warnings; for my $lib (@module_files) { # see L my $stderr = IO::Handle->new; diag('Running: ', join(', ', map { my $str = $_; $str =~ s/'/\\'/g; q{'} . $str . q{'} } $^X, @switches, '-e', "require q[$lib]")) if $ENV{PERL_COMPILE_TEST_DEBUG}; my $pid = open3($stdin, '>&STDERR', $stderr, $^X, @switches, '-e', "require q[$lib]"); binmode $stderr, ':crlf' if $^O eq 'MSWin32'; my @_warnings = <$stderr>; waitpid($pid, 0); is($?, 0, "$lib loaded ok"); shift @_warnings if @_warnings and $_warnings[0] =~ /^Using .*\bblib/ and not eval { +require blib; blib->VERSION('1.01') }; if (@_warnings) { warn @_warnings; push @warnings, @_warnings; } } foreach my $file (@scripts) { SKIP: { open my $fh, '<', $file or warn("Unable to open $file: $!"), next; my $line = <$fh>; close $fh and skip("$file isn't perl", 1) unless $line =~ /^#!\s*(?:\S*perl\S*)((?:\s+-\w*)*)(?:\s*#.*)?$/; @switches = (@switches, split(' ', $1)) if $1; close $fh and skip("$file uses -T; not testable with PERL5LIB", 1) if grep { $_ eq '-T' } @switches and $ENV{PERL5LIB}; my $stderr = IO::Handle->new; diag('Running: ', join(', ', map { my $str = $_; $str =~ s/'/\\'/g; q{'} . $str . q{'} } $^X, @switches, '-c', $file)) if $ENV{PERL_COMPILE_TEST_DEBUG}; my $pid = open3($stdin, '>&STDERR', $stderr, $^X, @switches, '-c', $file); binmode $stderr, ':crlf' if $^O eq 'MSWin32'; my @_warnings = <$stderr>; waitpid($pid, 0); is($?, 0, "$file compiled ok"); shift @_warnings if @_warnings and $_warnings[0] =~ /^Using .*\bblib/ and not eval { +require blib; blib->VERSION('1.01') }; # in older perls, -c output is simply the file portion of the path being tested if (@_warnings = grep { !/\bsyntax OK$/ } grep { chomp; $_ ne (File::Spec->splitpath($file))[2] } @_warnings) { warn @_warnings; push @warnings, @_warnings; } } } is(scalar(@warnings), 0, 'no warnings found') or diag 'got warnings: ', ( Test::More->can('explain') ? Test::More::explain(\@warnings) : join("\n", '', @warnings) ) if $ENV{AUTHOR_TESTING}; eg000755000765000024 014536752166 14776 5ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8getGenBank.pl100644000765000024 142114536752166 17476 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/eg#!/usr/bin/perl # # How to retrieve GenBank entries over the Web # # by Jason Stajich # use Bio::DB::GenBank; use Bio::SeqIO; my $gb = new Bio::DB::GenBank; # the output stream for your seqs, this can be a file # instead or STDOUT, see the Bio::SeqIO module for info my $seqout = new Bio::SeqIO(-fh => \*STDOUT, -format => 'fasta'); # if you want a single seq my $seq = $gb->get_Seq_by_id('J00522'); $seqout->write_seq($seq); # or by accession $seq = $gb->get_Seq_by_acc('AF303112'); $seqout->write_seq($seq); # feel free to pull multiple sequences... # if you want to get a bunch of sequences use the get_Stream_by_id/acc methods my $seqio = $gb->get_Stream_by_id([ qw(J00522 AF303112 2981014)]); while( defined ($seq = $seqio->next_seq )) { $seqout->write_seq($seq); } Query-Genbank.t100644000765000024 560414536752166 17650 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/t# -*-Perl-*- Test Harness script for Bioperl # $Id$ use strict; BEGIN { use Test::Most tests => 18; use Test::RequiresInternet; use_ok('Bio::DB::Query::GenBank'); use_ok('Bio::DB::GenBank'); } my %expected_lengths = ( 'MUSIGHBA1' => 408, 'AF303112' => 1611, 'AF041456' => 1156, 'AY080910' => 798, 'AY080909' => 1042, 'AF155220' => 1172, 'AF442768' => 2547, ); my ($gb, $seq, $seqio, $seqin, $query); my %params; if (defined $ENV{BIOPERLEMAIL}) { $params{'-email'} = $ENV{BIOPERLEMAIL}; $params{'-delay'} = 2; } $params{'-verbose'} = $ENV{BIOPERLDEBUG}; # test query facility ok $query = Bio::DB::Query::GenBank->new('-db' => 'nucleotide', '-query' => 'Onchocerca volvulus[Organism]', '-mindate' => '2002/1/1', '-maxdate' => '2002/12/31', %params), 'Bio::DB::Query::GenBank'; SKIP: { cmp_ok $query->count, '>', 0; my @ids = $query->ids; cmp_ok @ids, '>', 0; is @ids, $query->count; ok $gb = Bio::DB::GenBank->new(%params); eval {$seqio = $gb->get_Stream_by_query($query);}; skip "Couldn't connect to complete GenBank query tests. Skipping those tests", 5 if $@; my $done = 0; while (my $s = $seqio->next_seq) { is $s->length, $expected_lengths{$s->display_id}, $s->display_id; undef $gb; # test the case where the db is gone, # but the pipeline should remain until seqio goes away $done++; } skip('No seqs returned', 5) if !$done; is $done, 1; } $seq = $seqio = undef; # test query facility (again) ok $query = Bio::DB::Query::GenBank->new('-db' => 'nucleotide', '-ids' => [qw(J00522 AF303112 2981014)], %params); SKIP: { cmp_ok $query->count, '>', 0; my @ids = $query->ids; cmp_ok @ids, '>', 0; is @ids, $query->count; $gb = Bio::DB::GenBank->new(%params); eval {$seqio = $gb->get_Stream_by_query($query);}; skip "Couldn't connect to complete GenBank query tests. Skipping those tests: $@", 4 if $@; my $done = 0; while (my $s = $seqio->next_seq) { is $s->length, $expected_lengths{$s->display_id}, $s->display_id; $done++; } skip('No seqs returned', 4) if !$done; is $done, 3; $seqio->close(); # the key to preventing errors during make test, no idea why } $seq = $seqio = undef; # and yet again, for bug 2133 $query = Bio::DB::Query::GenBank->new('-query' => 'AF303112', '-ids' => [qw(J00522 AF303112 2981014)], %params); is $query->query, 'J00522[PACC]|AF303112[PACC]|2981014[UID]'; gb2features.pl100644000765000024 1231614536752166 17727 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/eg#!/usr/bin/perl # Author: Damien Mattei C.N.R.S / U.N.S.A - UMR 6549 # example: ./idfetch.pl AP001266 use Bio::DB::GenBank; $gb = new Bio::DB::GenBank(); # this returns a Seq object : $seq1 = $gb->get_Seq_by_acc($ARGV[0]); print $seq1->display_id() . "\n" ; foreach $feat ($seq1->all_SeqFeatures()) { #print $feat->primary_tag . " " . $feat->source_tag() . "\n" ; print "Feature from ", $feat->start, " to ", $feat->end, " Primary tag ", $feat->primary_tag, ", produced by ", $feat->source_tag(), "\n"; if( $feat->strand == 0 ) { print "Feature applicable to either strand\n"; } else { print "Feature on strand ", $feat->strand,"\n"; # -1,1 } foreach $tag ( $feat->all_tags() ) { print "Feature has tag ", $tag, " with values, ", join(' ',$feat->each_tag_value($tag)), "\n"; } print "new feature\n" if $feat->has_tag('new'); } exit; __END__ It will display something like that: [dmattei@pclgmch2 gmap]$ ./idfetch.pl AP001266 AP001266 Feature from 1 to 168978 Primary tag source, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag chromosome with values, 11 Feature has tag map with values, 11q13 Feature has tag clone with values, RP11-770G2 Feature has tag organism with values, Homo sapiens Feature has tag db_xref with values, taxon:9606 Feature from 1 to 31550 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 31651 to 48510 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 48611 to 64044 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 64145 to 78208 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 78309 to 89008 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 89109 to 99704 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 99805 to 107965 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 108066 to 116032 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 116133 to 124010 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 124111 to 130494 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 130595 to 136072 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 136173 to 139649 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 139750 to 144590 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 144691 to 148482 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 148583 to 152279 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 152380 to 153632 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment clone_end:T7 vector_side:left Feature from 153733 to 155746 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 155847 to 156405 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment clone_end:SP6 vector_side:right Feature from 156506 to 158398 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 158499 to 161333 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 161434 to 163304 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 163405 to 164604 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 164705 to 166693 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment Feature from 166794 to 168978 Primary tag misc_feature, produced by EMBL/GenBank/SwissProt Feature on strand 1 Feature has tag note with values, assembly_fragment subsequence.cgi100644000765000024 1002614536752166 20163 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/eg#!/usr/bin/perl # see http://zfish.nichd.nih.gov/tools/subsequence.cgi # uncomment and modify the next two lines # if your perl is in a nonstandard directory #use lib '/disk3/local/lib/perl5/site_perl'; #use lib '/disk3/local/lib/perl5/'; use CGI qw/:standard :html3/; use Bio::DB::GenBank; use File::Temp; use FileHandle; print header, start_html(-title => 'find subsequence of large GenBank entries',-author => 'Jonathan_Epstein\@nih.gov'); print_form() unless param; print_results() if param; sub print_results { $gb = new Bio::DB::GenBank; $accession = param('accession'); eval { $seq = $gb->get_Seq_by_acc($accession); # Accession Number }; if ($@) { print "***ERROR: accession $accession not found***\n"; return; } $segment_start = param('start'); $segment_end = param('length_or_end_value'); $segment_end = $segment_start+$segment_end-1 if param('length_or_end_choice') eq 'Length'; if ($segment_end<$segment_start || $segment_start<0) { print "***ERROR: invalid segment start and end values:$segment_start,$segment_end***\n"; return; } $len = $seq->length(); if ($segment_end>$len) { print "***ERROR: maximum length $len exceeded***\n"; return; } $subseq = $seq->subseq ($segment_start,$segment_end); $name = "subsequence of $accession"; $strand = "+"; $strand = "-" if (param('reverse')); # For some reason, there seems to be a problem if you use the file # handle provided by File::Temp. Similarly, there's a problem if you # pass a filename to BioPerl below rather than a file handle. However, # constructing our own file handle and then passing it to BioPerl works # fine. (undef, $filename) = File::Temp::tempfile(); $fh = new FileHandle "> $filename"; $seqoutlong = Bio::SeqIO->new( '-format' => 'Fasta',-fh => $fh); $seqobj = Bio::PrimarySeq->new ( -seq => $subseq, -id => $name . "[length:$len]:" . $segment_start . "-" . $segment_end . "(" . $strand . "strand)", -moltype => 'dna' ); $seqobj = $seqobj->revcom if ($strand ne "+"); $seqoutlong->write_seq($seqobj); $fh->close; undef $fh; # Now we parse the FASTA file which was just generated, and perform # some simple conversions to HTML. open my $TEMPORARY, '<', $filename or die "Could not read temporary file '$filename': $!\n"; print "\n"; while (<$TEMPORARY>) { print $_; print "
\n"; } close $TEMPORARY; print "
\n"; unlink $filename; } sub print_form { print p("This web page permits you to extract a short subsequence of DNA from a large GenBank entry. This is especially useful in an era of huge \"contigs\" of genomic DNA, where you only want to extract a few hundred base pairs for subsequent analysis.\n"); print p,"This program also illustrates the power of ",a({-href => 'http://www.BioPerl.org/'}, "BioPerl"), ", a powerful set of tools for molecular biology analysis. The ", a({-href => 'subsequence.pl.txt'}, "source code"), " for this program is less than 90 lines long.\n"; print p,"You must specify the GenBank accession number along with a start position. You may specify either the length of the subsequence you wish to extract or, equivalently, the endpoint.\n"; print "The sequence may be reverse-complemented if you wish, e.g., the reverse complement of ATCGC is GCGAT.\n"; print p,"To test this web page, try accession NT_004002, start 50000, length 400.\n"; print start_form,table( Tr(td("Enter your GenBank accession"),td(textfield(-name => 'accession',-size => 20))), Tr(td("Start position"),td(textfield(-name => 'start',-size => 10))), Tr(td("Specify length or end position"), td(radio_group (-name => 'length_or_end_choice',-values => [Length, End], default => Length))), Tr(td("Length or end position"), td(textfield (-name => length_or_end_value,-size => 20))), Tr(td("Reverse complement?"), td(checkbox (-name => 'reverse')))), submit ("Find my subsequence"); print hr(),"Credits: Jonathan Epstein (Jonathan_Epstein\@nih.gov)"; } author-mojibake.t100644000765000024 35314536752166 20235 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/t#!perl BEGIN { unless ($ENV{AUTHOR_TESTING}) { print qq{1..0 # SKIP these tests are for testing by the author\n}; exit } } use strict; use warnings qw(all); use Test::More; use Test::Mojibake; all_files_encoding_ok(); DB000755000765000024 014536752166 16147 5ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/lib/BioGenBank.pm100644000765000024 2761014536752166 20200 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/lib/Bio/DB# # BioPerl module for Bio::DB::GenBank # # Please direct questions and support issues to # # Cared for by Aaron Mackey # # Copyright Aaron Mackey # # You may distribute this module under the same terms as perl itself # # POD documentation - main docs before the code # # Added LWP support - Jason Stajich 2000-11-6 # completely reworked by Jason Stajich 2000-12-8 # to use WebDBSeqI # Added batch entrez back when determined that new entrez cgi will # essentially work (there is a limit to the number of characters in a # GET request so I am not sure how we can get around this). The NCBI # Batch Entrez form has changed some and it does not support retrieval # of text only data. Still should investigate POST-ing (tried and # failed) a message to the entrez cgi to get around the GET # limitations. =head1 NAME Bio::DB::GenBank - Database object interface to GenBank =head1 SYNOPSIS use Bio::DB::GenBank; $gb = Bio::DB::GenBank->new(); $seq = $gb->get_Seq_by_id('J00522'); # Unique ID, *not always the LOCUS ID* # or ... $seq = $gb->get_Seq_by_acc('J00522'); # Accession Number $seq = $gb->get_Seq_by_version('J00522.1'); # Accession.version $seq = $gb->get_Seq_by_gi('405830'); # GI Number # get a stream via a query string my $query = Bio::DB::Query::GenBank->new (-query =>'Oryza sativa[Organism] AND EST', -reldate => '30', -db => 'nucleotide'); my $seqio = $gb->get_Stream_by_query($query); while( my $seq = $seqio->next_seq ) { print "seq length is ", $seq->length,"\n"; } # or ... best when downloading very large files, prevents # keeping all of the file in memory # also don't want features, just sequence so let's save bandwidth # and request Fasta sequence $gb = Bio::DB::GenBank->new(-retrievaltype => 'tempfile' , -format => 'Fasta'); my $seqio = $gb->get_Stream_by_acc(['AC013798', 'AC021953'] ); while( my $clone = $seqio->next_seq ) { print "cloneid is ", $clone->display_id, " ", $clone->accession_number, "\n"; } # note that get_Stream_by_version is not implemented # don't want the entire sequence or more options my $gb = Bio::DB::GenBank->new(-format => 'Fasta', -seq_start => 100, -seq_stop => 200, -strand => 1, -complexity => 4); my $seqi = $gb->get_Stream_by_query($query); =head1 DESCRIPTION Allows the dynamic retrieval of L sequence objects from the GenBank database at NCBI, via an Entrez query. WARNING: Please do B spam the Entrez web server with multiple requests. NCBI offers Batch Entrez for this purpose. Note that when querying for GenBank accessions starting with 'NT_' you will need to call $gb-Erequest_format('fasta') beforehand, because in GenBank format (the default) the sequence part will be left out (the reason is that NT contigs are rather annotation with references to clones). Some work has been done to automatically detect and retrieve whole NT_ clones when the data is in that format (NCBI RefSeq clones). The former behavior prior to bioperl 1.6 was to retrieve these from EBI, but now these are retrieved directly from NCBI. The older behavior can be regained by setting the 'redirect_refseq' flag to a value evaluating to TRUE. =head2 Running Alternate methods are described at L NOTE: strand should be 1 for plus or 2 for minus. Complexity: gi is often a part of a biological blob, containing other gis complexity regulates the display: 0 - get the whole blob 1 - get the bioseq for gi of interest (default in Entrez) 2 - get the minimal bioseq-set containing the gi of interest 3 - get the minimal nuc-prot containing the gi of interest 4 - get the minimal pub-set containing the gi of interest 'seq_start' and 'seq_stop' will not work when setting complexity to any value other than 1. 'strand' works for any setting other than a complexity of 0 (whole glob); when you try this with a GenBank return format nothing happens, whereas using FASTA works but causes display problems with the other sequences in the glob. As Tao Tao says from NCBI, "Better left it out or set it to 1." =head1 FEEDBACK =head2 Mailing Lists User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated. bioperl-l@bioperl.org - General discussion http://bioperl.org/wiki/Mailing_lists - About the mailing lists =head2 Support Please direct usage questions or support issues to the mailing list: I rather than to the module maintainer directly. Many experienced and reponsive experts will be able look at the problem and quickly address it. Please include a thorough description of the problem with code and data examples if at all possible. =head2 Reporting Bugs Report bugs to the Bioperl bug tracking system to help us keep track the bugs and their resolution. Bug reports can be submitted via the web: https://github.com/bioperl/bioperl-live/issues =head1 AUTHOR - Aaron Mackey, Jason Stajich Email amackey@virginia.edu Email jason@bioperl.org =head1 APPENDIX The rest of the documentation details each of the object methods. Internal methods are usually preceded with a _ =cut # Let the code begin... package Bio::DB::GenBank; $Bio::DB::GenBank::VERSION = '1.7.8'; use strict; use vars qw(%PARAMSTRING $DEFAULTFORMAT $DEFAULTMODE); use base qw(Bio::DB::NCBIHelper); BEGIN { $DEFAULTMODE = 'single'; $DEFAULTFORMAT = 'gbwithparts'; %PARAMSTRING = ( 'batch' => { 'db' => 'nucleotide', 'usehistory' => 'n', 'tool' => 'bioperl'}, 'query' => { 'usehistory' => 'y', 'tool' => 'bioperl', 'retmode' => 'text'}, 'gi' => { 'db' => 'nucleotide', 'usehistory' => 'n', 'tool' => 'bioperl', 'retmode' => 'text'}, 'version' => { 'db' => 'nucleotide', 'usehistory' => 'n', 'tool' => 'bioperl', 'retmode' => 'text'}, 'single' => { 'db' => 'nucleotide', 'usehistory' => 'n', 'tool' => 'bioperl', 'retmode' => 'text'}, 'webenv' => { 'query_key' => 'querykey', 'WebEnv' => 'cookie', 'db' => 'nucleotide', 'usehistory' => 'n', 'tool' => 'bioperl', 'retmode' => 'text'}, ); } # new is in NCBIHelper # helper method to get db specific options =head2 new Title : new Usage : $gb = Bio::DB::GenBank->new(@options) Function: Creates a new genbank handle Returns : a new Bio::DB::Genbank object Args : -delay number of seconds to delay between fetches (3s) NOTE: There are other options that are used internally. By NCBI policy, this module introduces a 3s delay between fetches. If you are fetching multiple genbank ids, it is a good idea to use get =cut =head2 get_params Title : get_params Usage : my %params = $self->get_params($mode) Function: Returns key,value pairs to be passed to NCBI database for either 'batch' or 'single' sequence retrieval method Returns : a key,value pair hash Args : 'single' or 'batch' mode for retrieval =cut sub get_params { my ($self, $mode) = @_; return defined $PARAMSTRING{$mode} ? %{$PARAMSTRING{$mode}} : %{$PARAMSTRING{$DEFAULTMODE}}; } # from Bio::DB::WebDBSeqI from Bio::DB::RandomAccessI =head1 Routines Bio::DB::WebDBSeqI from Bio::DB::RandomAccessI =head2 get_Seq_by_id Title : get_Seq_by_id Usage : $seq = $db->get_Seq_by_id('ROA1_HUMAN') Function: Gets a Bio::Seq object by its name Returns : a Bio::Seq object Args : the id (as a string) of a sequence Throws : "id does not exist" exception =head2 get_Seq_by_acc Title : get_Seq_by_acc Usage : $seq = $db->get_Seq_by_acc($acc); Function: Gets a Seq object by accession numbers Returns : a Bio::Seq object Args : the accession number as a string Note : For GenBank, this just calls the same code for get_Seq_by_id(). Caveat: this normally works, but in rare cases simply passing the accession can lead to odd results, possibly due to unsynchronized NCBI ID servers. Using get_Seq_by_version() is slightly better, but using the unique identifier (GI) and get_Seq_by_id is the most consistent Throws : "id does not exist" exception =head2 get_Seq_by_gi Title : get_Seq_by_gi Usage : $seq = $db->get_Seq_by_gi('405830'); Function: Gets a Bio::Seq object by gi number Returns : A Bio::Seq object Args : gi number (as a string) Throws : "gi does not exist" exception =head2 get_Seq_by_version Title : get_Seq_by_version Usage : $seq = $db->get_Seq_by_version('X77802.1'); Function: Gets a Bio::Seq object by sequence version Returns : A Bio::Seq object Args : accession.version (as a string) Note : Caveat: this normally works, but using the unique identifier (GI) and get_Seq_by_id is the most consistent Throws : "acc.version does not exist" exception =head1 Routines implemented by Bio::DB::NCBIHelper =head2 get_Stream_by_query Title : get_Stream_by_query Usage : $seq = $db->get_Stream_by_query($query); Function: Retrieves Seq objects from Entrez 'en masse', rather than one at a time. For large numbers of sequences, this is far superior than get_Stream_by_[id/acc](). Example : Returns : a Bio::SeqIO stream object Args : $query : An Entrez query string or a Bio::DB::Query::GenBank object. It is suggested that you create a Bio::DB::Query::GenBank object and get the entry count before you fetch a potentially large stream. =cut =head2 get_Stream_by_id Title : get_Stream_by_id Usage : $stream = $db->get_Stream_by_id( [$uid1, $uid2] ); Function: Gets a series of Seq objects by unique identifiers Returns : a Bio::SeqIO stream object Args : $ref : a reference to an array of unique identifiers for the desired sequence entries =head2 get_Stream_by_acc Title : get_Stream_by_acc Usage : $seq = $db->get_Stream_by_acc([$acc1, $acc2]); Function: Gets a series of Seq objects by accession numbers Returns : a Bio::SeqIO stream object Args : $ref : a reference to an array of accession numbers for the desired sequence entries Note : For GenBank, this just calls the same code for get_Stream_by_id() =cut =head2 get_Stream_by_gi Title : get_Stream_by_gi Usage : $seq = $db->get_Seq_by_gi([$gi1, $gi2]); Function: Gets a series of Seq objects by gi numbers Returns : a Bio::SeqIO stream object Args : $ref : a reference to an array of gi numbers for the desired sequence entries Note : For GenBank, this just calls the same code for get_Stream_by_id() =head2 get_Stream_by_batch Title : get_Stream_by_batch Usage : $seq = $db->get_Stream_by_batch($ref); Function: Retrieves Seq objects from Entrez 'en masse', rather than one at a time. Example : Returns : a Bio::SeqIO stream object Args : $ref : either an array reference, a filename, or a filehandle from which to get the list of unique ids/accession numbers. NOTE: This method is redundant and deprecated. Use get_Stream_by_id() instead. =head2 get_request Title : get_request Usage : my $url = $self->get_request Function: HTTP::Request Returns : Args : %qualifiers = a hash of qualifiers (ids, format, etc) =cut =head2 default_format Title : default_format Usage : my $format = $self->default_format Function: Returns default sequence format for this module Returns : string Args : none =cut sub default_format { return $DEFAULTFORMAT; } 1; __END__ GenPept.pm100644000765000024 1514414536752166 20234 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/lib/Bio/DB# # BioPerl module for Bio::DB::GenPept # # Please direct questions and support issues to # # Cared for by Aaron Mackey # # Copyright Aaron Mackey # # You may distribute this module under the same terms as perl itself # POD documentation - main docs before the code # completely reworked by Jason Stajich to use Bio::DB::WebDBSeqI 2000-12-12 =head1 NAME Bio::DB::GenPept - Database object interface to GenPept =head1 SYNOPSIS $gb = Bio::DB::GenPept->new(); $seq = $gb->get_Seq_by_id('195055'); # Unique ID # or ... $seq = $gb->get_Seq_by_acc('P09651.5'); # Accession Number my $seqio = $gb->get_Stream_by_id(['195055', 'P09651.5']); while( my $seq = $seqio->next_seq ) { print "seq is is ", $seq->display_id, "\n"; } =head1 DESCRIPTION Allows the dynamic retrieval of Sequence objects (Bio::Seq) from the GenPept database at NCBI, via an Entrez query. WARNING: Please do NOT spam the Entrez web server with multiple requests. NCBI offers Batch Entrez for this purpose. Batch Entrez support will likely be supported in a future version of DB::GenPept. Currently the only return format supported by NCBI Entrez for GenPept database is GenPept format, so any format specification passed to GenPept will be ignored still be forced to GenPept format (which is just GenBank format). =head1 FEEDBACK =head2 Mailing Lists User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated. bioperl-l@bioperl.org - General discussion http://bioperl.org/wiki/Mailing_lists - About the mailing lists =head2 Support Please direct usage questions or support issues to the mailing list: I rather than to the module maintainer directly. Many experienced and reponsive experts will be able look at the problem and quickly address it. Please include a thorough description of the problem with code and data examples if at all possible. =head2 Reporting Bugs Report bugs to the Bioperl bug tracking system to help us keep track the bugs and their resolution. Bug reports can be submitted via the web: https://github.com/bioperl/bioperl-live/issues =head1 AUTHOR - Aaron Mackey, Jason Stajich Email amackey@virginia.edu Email jason@bioperl.org =head1 APPENDIX The rest of the documentation details each of the object methods. Internal methods are usually preceded with a _ =cut # Let the code begin... package Bio::DB::GenPept; $Bio::DB::GenPept::VERSION = '1.7.8'; use strict; use vars qw($DEFAULTFORMAT $DEFAULTMODE %PARAMSTRING); use base qw(Bio::DB::NCBIHelper); BEGIN { $DEFAULTMODE = 'single'; $DEFAULTFORMAT = 'gp'; %PARAMSTRING = ( 'batch' => { 'db' => 'protein', 'usehistory' => 'n', 'tool' => 'bioperl'}, # no query? 'gi' => { 'db' => 'protein', 'usehistory' => 'n', 'tool' => 'bioperl', 'retmode' => 'text'}, 'version' => { 'db' => 'protein', 'usehistory' => 'n', 'tool' => 'bioperl', 'retmode' => 'text'}, 'single' => { 'db' => 'protein', 'usehistory' => 'n', 'tool' => 'bioperl', 'retmode' => 'text'}, 'webenv' => { 'query_key' => 'querykey', 'WebEnv' => 'cookie', 'db' => 'protein', 'usehistory' => 'n', 'tool' => 'bioperl', 'retmode' => 'text'}, ); } # the new way to make modules a little more lightweight sub new { my($class, @args) = @_; my $self = $class->SUPER::new(@args); $self->request_format($self->default_format); return $self; } =head2 get_params Title : get_params Usage : my %params = $self->get_params($mode) Function: Returns key,value pairs to be passed to NCBI database for either 'batch' or 'single' sequence retrieval method Returns : a key,value pair hash Args : 'single' or 'batch' mode for retrieval =cut sub get_params { my ($self, $mode) = @_; return defined $PARAMSTRING{$mode} ? %{$PARAMSTRING{$mode}} : %{$PARAMSTRING{$DEFAULTMODE}}; } =head2 default_format Title : default_format Usage : my $format = $self->default_format Function: Returns default sequence format for this module Returns : string Args : none =cut sub default_format { return $DEFAULTFORMAT; } # from Bio::DB::WebDBSeqI from Bio::DB::RandomAccessI =head1 Routines from Bio::DB::WebDBSeqI and Bio::DB::RandomAccessI =head2 get_Seq_by_id Title : get_Seq_by_id Usage : $seq = $db->get_Seq_by_id('P09651.5') Function: Gets a Bio::Seq object by its name Returns : a Bio::Seq object Args : the id (as a string) of a sequence Throws : "id does not exist" exception =head2 get_Seq_by_acc Title : get_Seq_by_acc Usage : $seq = $db->get_Seq_by_acc('AAC73346'); Function: Gets a Seq objects by accession number Returns : Bio::Seq object Args : accession number to retrieve by =head1 Routines implemented by Bio::DB::NCBIHelper =head2 get_request Title : get_request Usage : my $url = $self->get_request Function: HTTP::Request Returns : Args : %qualifiers = a hash of qualifiers (ids, format, etc) =head2 get_Stream_by_id Title : get_Stream_by_id Usage : $stream = $db->get_Stream_by_id( [$uid1, $uid2] ); Function: Gets a series of Seq objects by unique identifiers Returns : a Bio::SeqIO stream object Args : $ref : a reference to an array of unique identifiers for the desired sequence entries =head2 get_Stream_by_acc (2) Title : get_Stream_by_acc Usage : $seq = $db->get_Stream_by_acc($acc); Function: Gets a series of Seq objects by accession numbers Returns : a Bio::SeqIO stream object Args : $ref : a reference to an array of accession numbers for the desired sequence entries Note : For GenBank, this just calls the same code for get_Stream_by_id() =head2 get_Stream_by_query Title : get_Stream_by_query Usage : $seq = $db->get_Stream_by_query('ROA1_HUMAN'); Function: Gets a series of Seq objects by unique display_id Returns : a Bio::SeqIO stream object Args : $ref : display_id to query by =head2 request_format Title : request_format Usage : my $format = $self->request_format; $self->request_format($format); Function: Get/Set sequence format retrieval Returns : string representing format Args : $format = sequence format =cut # override to force format to be GenPept regardless sub request_format { my ($self) = @_; return $self->SUPER::request_format($self->default_format()); } 1; __END__ author-pod-syntax.t100644000765000024 45414536752166 20564 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/t#!perl BEGIN { unless ($ENV{AUTHOR_TESTING}) { print qq{1..0 # SKIP these tests are for testing by the author\n}; exit } } # This file was automatically generated by Dist::Zilla::Plugin::PodSyntaxTests. use strict; use warnings; use Test::More; use Test::Pod 1.41; all_pod_files_ok(); bin000755000765000024 014536752166 15153 5ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8bp_query_entrez_taxa100644000765000024 521514536752166 21473 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/bin#!/usr/bin/perl # This is a -*-Perl-* file (make my emacs happy) =head1 NAME bp_query_entrez_taxa - query Entrez taxonomy database and print out information =head1 USAGE bp_query_entrez_taxa "Homo sapiens" "Saccharomyces cerevisiae" Rhizopus Metazoa bp_query_entrez_taxa -gi 28800981 -gi 54301680 -db nucleotide bp_query_entrez_taxa -gi 71836523 -db protein Provide the genus and species name in quotes, you can also query for a non-species node like Family or Order Command-line options: -v or --verbose : print verbose debugging info -gi : one or many GI numbers to lookup taxon id for -db : the sequence db (nucleotide or protein) the GI is for other arguments are assumed to be species names to lookup in taxonomy db =head1 AUTHOR Jason Stajich jason-at-bioperl-dot-org =cut use strict; use warnings; use Bio::DB::Taxonomy; use Getopt::Long; my $verbose = 0; my (@gi, $dbname); GetOptions('v|verbose' => \$verbose, 'gi:i' => \@gi, 'db:s' => \$dbname); my $db = new Bio::DB::Taxonomy(-source => 'entrez', -verbose => $verbose); if( @gi ) { my @nodes= $db->get_Taxonomy_Node(-gi => \@gi, -db => $dbname); for my $node ( @nodes ) { my $gi = shift @gi; print " for gi $gi:\n"; print " taxonid is ",$node->ncbi_taxid,"\n"; print " node is ", join(", ",$node->classification), "\n"; print " species is ", $node->species,"\n"; print " parent is ", $node->parent_id, "\n"; print " rank is ", $node->rank, "\n"; print " genetic_code ", $node->genetic_code, "\n"; print " mito_genetic_code ", $node->mitochondrial_genetic_code, "\n"; print " scientfic name is ", $node->binomial, "\n"; } } print "\n\n"; for my $name ( @ARGV ) { my $taxonid = $db->get_taxonid($name); my $node = $db->get_Taxonomy_Node(-taxonid => $taxonid); print "taxonid is $taxonid\n"; print " node is ", join(", ",$node->classification), "\n"; print " species is ", $node->species,"\n"; print " parent is ", $node->parent_id, "\n"; print " rank is ", $node->rank, "\n"; print " genetic_code ", $node->genetic_code, "\n"; print " mito_genetic_code ", $node->mitochondrial_genetic_code, "\n"; print " scientfic name is ", $node->binomial, "\n"; print " common name is ", $node->common_name, "\n"; print " create date is ", $node->create_date, "\n"; print " update date is ", $node->update_date, "\n"; print " pub date is ", ($node->pub_date || ''), "\n"; print " variant is ", $node->variant, "\n"; print " sub_species is ", $node->sub_species, "\n"; print " organelle is ", $node->organelle, "\n"; print " division is ", $node->division, "\n"; } EntrezGene.pm100644000765000024 1271614536752166 20742 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/lib/Bio/DB# # BioPerl module for Bio::DB::EntrezGene # # Please direct questions and support issues to # # Cared for by Brian Osborne bosborne at alum.mit.edu # # Copyright Brian Osborne # # You may distribute this module under the same terms as perl itself # POD documentation - main docs before the code =head1 NAME Bio::DB::EntrezGene - Database object interface to Entrez Gene =head1 SYNOPSIS use Bio::DB::EntrezGene; my $db = Bio::DB::EntrezGene->new; my $seq = $db->get_Seq_by_id(2); # Gene id # or ... my $seqio = $db->get_Stream_by_id([2, 4693, 3064]); # Gene ids while ( my $seq = $seqio->next_seq ) { print "id is ", $seq->display_id, "\n"; } =head1 DESCRIPTION Allows the dynamic retrieval of Sequence objects from the Entrez Gene database at NCBI, via an Entrez query using Gene ids. This module requires the CPAN Bio::ASN1 module. WARNING: Please do NOT spam the Entrez web server with multiple requests. NCBI offers Batch Entrez for this purpose. =head1 NOTES The Entrez eutils API does not allow Entrez Gene queries by name as of this writing, therefore there are only get_Seq_by_id and get_Stream_by_id methods in this module, and these expect Gene ids. There are no get_Seq_by_acc or get_Stream_by_acc methods. =head1 FEEDBACK =head2 Mailing Lists User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated. bioperl-l@bioperl.org - General discussion http://bioperl.org/wiki/Mailing_lists - About the mailing lists =head2 Support Please direct usage questions or support issues to the mailing list: I rather than to the module maintainer directly. Many experienced and reponsive experts will be able look at the problem and quickly address it. Please include a thorough description of the problem with code and data examples if at all possible. =head2 Reporting Bugs Report bugs to the Bioperl bug tracking system to help us keep track the bugs and their resolution. Bug reports can be submitted via the web: https://github.com/bioperl/bioperl-live/issues =head1 AUTHOR - Brian Osborne Email bosborne at alum.mit.edu =head1 APPENDIX The rest of the documentation details each of the object methods. Internal methods are usually preceded with a _ =cut # Let the code begin... package Bio::DB::EntrezGene; $Bio::DB::EntrezGene::VERSION = '1.7.8'; use strict; use vars qw($DEFAULTFORMAT $DEFAULTMODE %PARAMSTRING); use base qw(Bio::DB::NCBIHelper); BEGIN { $DEFAULTMODE = 'single'; $DEFAULTFORMAT = 'asn.1'; %PARAMSTRING = ('batch' => {'db' => 'gene', 'usehistory' => 'y', 'tool' => 'bioperl', 'retmode' => 'asn.1'}, 'gi' => {'db' => 'gene', 'usehistory' => 'y', 'tool' => 'bioperl', 'retmode' => 'asn.1'}, 'version' => {'db' => 'gene', 'usehistory' => 'y', 'tool' => 'bioperl', 'retmode' => 'asn.1'}, 'single' => {'db' => 'gene', 'usehistory' => 'y', 'tool' => 'bioperl', 'retmode' => 'asn.1'} ); } # the new way to make modules a little more lightweight sub new { my($class, @args) = @_; my $self = $class->SUPER::new(@args); # Seems that Bio::SeqIO::entrezgene requires this: $self->{_retrieval_type} = "tempfile"; $self->request_format($self->default_format); return $self; } =head2 get_params Title : get_params Usage : my %params = $self->get_params($mode) Function: Returns key,value pairs to be passed to NCBI database for either 'batch' or 'single' sequence retrieval method Returns : A key,value pair hash Args : 'single' or 'batch' mode for retrieval =cut sub get_params { my ($self, $mode) = @_; return defined $PARAMSTRING{$mode} ? %{$PARAMSTRING{$mode}} : %{$PARAMSTRING{$DEFAULTMODE}}; } =head2 default_format Title : default_format Usage : my $format = $self->default_format Function: Returns default sequence format for this module Returns : string Args : none =cut sub default_format { return $DEFAULTFORMAT; } # from Bio::DB::WebDBSeqI from Bio::DB::RandomAccessI =head1 Routines from Bio::DB::WebDBSeqI and Bio::DB::RandomAccessI =head2 get_Seq_by_id Title : get_Seq_by_id Usage : $seq = $db->get_Seq_by_id(2) Function: Gets a Bio::Seq object by its name Returns : A Bio::Seq object Args : Gene id Throws : "id does not exist" exception =head1 Routines implemented by Bio::DB::NCBIHelper =head2 get_request Title : get_request Usage : my $url = $self->get_request Function: HTTP::Request Returns : Args : %qualifiers = a hash of qualifiers (ids, format, etc) =head2 get_Stream_by_id Title : get_Stream_by_id Usage : $stream = $db->get_Stream_by_id( [$gid1, $gid2] ); Function: Gets a series of Seq objects using Gene ids Returns : A Bio::SeqIO stream object Args : A reference to an array of Gene ids =head2 request_format Title : request_format Usage : my $format = $self->request_format; $self->request_format($format); Function: Get or set sequence format retrieval Returns : String representing format Args : $format = sequence format =cut # override to force format sub request_format { my ($self) = @_; return $self->SUPER::request_format($self->default_format()); } 1; __END__ NCBIHelper.pm100644000765000024 5410514536752166 20545 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/lib/Bio/DB# # BioPerl module for Bio::DB::NCBIHelper # # Please direct questions and support issues to # # Cared for by Jason Stajich # # Copyright Jason Stajich # # You may distribute this module under the same terms as perl itself # # POD documentation - main docs before the code # # Interfaces with new WebDBSeqI interface =head1 NAME Bio::DB::NCBIHelper - A collection of routines useful for queries to NCBI databases. =head1 SYNOPSIS # Do not use this module directly. # get a Bio::DB::NCBIHelper object somehow my $seqio = $db->get_Stream_by_acc(['J00522']); foreach my $seq ( $seqio->next_seq ) { # process seq } =head1 DESCRIPTION Provides a single place to setup some common methods for querying NCBI web databases. This module just centralizes the methods for constructing a URL for querying NCBI GenBank and NCBI GenPept and the common HTML stripping done in L(). The base NCBI query URL used is: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi =head1 FEEDBACK =head2 Mailing Lists User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated. bioperl-l@bioperl.org - General discussion http://bioperl.org/wiki/Mailing_lists - About the mailing lists =head2 Support Please direct usage questions or support issues to the mailing list: I rather than to the module maintainer directly. Many experienced and reponsive experts will be able look at the problem and quickly address it. Please include a thorough description of the problem with code and data examples if at all possible. =head2 Reporting Bugs Report bugs to the Bioperl bug tracking system to help us keep track the bugs and their resolution. Bug reports can be submitted via the web. https://github.com/bioperl/bioperl-live/issues =head1 AUTHOR - Jason Stajich Email jason@bioperl.org =head1 APPENDIX The rest of the documentation details each of the object methods. Internal methods are usually preceded with a _ =cut # Let the code begin... package Bio::DB::NCBIHelper; $Bio::DB::NCBIHelper::VERSION = '1.7.8'; use strict; use Bio::DB::Query::GenBank; use HTTP::Request::Common; use URI; use Bio::Root::IO; # use Bio::DB::RefSeq; use URI::Escape qw(uri_unescape); use base qw(Bio::DB::WebDBSeqI Bio::Root::Root); our $HOSTBASE = 'https://eutils.ncbi.nlm.nih.gov'; our $MAX_ENTRIES = 19000; our $REQUEST_DELAY = 4; our %CGILOCATION = ( 'batch' => [ 'post' => '/entrez/eutils/epost.fcgi' ], 'query' => [ 'get' => '/entrez/eutils/efetch.fcgi' ], 'single' => [ 'get' => '/entrez/eutils/efetch.fcgi' ], 'version' => [ 'get' => '/entrez/eutils/efetch.fcgi' ], 'gi' => [ 'get' => '/entrez/eutils/efetch.fcgi' ], 'webenv' => [ 'get' => '/entrez/eutils/efetch.fcgi' ] ); our %FORMATMAP = ( 'gb' => 'genbank', 'gp' => 'genbank', 'fasta' => 'fasta', 'asn.1' => 'entrezgene', 'gbwithparts' => 'genbank', ); our $DEFAULTFORMAT = 'gb'; =head2 new Title : new Usage : Function: the new way to make modules a little more lightweight Returns : Args : =cut sub new { my ( $class, @args ) = @_; my $self = $class->SUPER::new(@args); my ($seq_start, $seq_stop, $no_redirect, $redirect, $complexity, $strand, $email ) = $self->_rearrange( [ qw(SEQ_START SEQ_STOP NO_REDIRECT REDIRECT_REFSEQ COMPLEXITY STRAND EMAIL) ], @args ); $seq_start && $self->seq_start($seq_start); $seq_stop && $self->seq_stop($seq_stop); $no_redirect && $self->no_redirect($no_redirect); $redirect && $self->redirect_refseq($redirect); $strand && $self->strand($strand); $email && $self->email($email); # adjust statement to accept zero value if (defined $complexity && $complexity >= 0 && $complexity <= 4 ) { $self->complexity($complexity) } return $self; } =head2 get_params Title : get_params Usage : my %params = $self->get_params($mode) Function: returns key,value pairs to be passed to NCBI database for either 'batch' or 'single' sequence retrieval method Returns : a key,value pair hash Args : 'single' or 'batch' mode for retrieval =cut sub get_params { my ($self, $mode) = @_; $self->throw("subclass did not implement get_params"); } =head2 default_format Title : default_format Usage : my $format = $self->default_format Function: returns default sequence format for this module Returns : string Args : none =cut sub default_format { return $DEFAULTFORMAT; } =head2 get_request Title : get_request Usage : my $url = $self->get_request Function: HTTP::Request Returns : Args : %qualifiers = a hash of qualifiers (ids, format, etc) =cut sub get_request { my ( $self, @qualifiers ) = @_; my ( $mode, $uids, $format, $query, $seq_start, $seq_stop, $strand, $complexity, $email) = $self->_rearrange( [qw(MODE UIDS FORMAT QUERY SEQ_START SEQ_STOP STRAND COMPLEXITY EMAIL)], @qualifiers ); $mode = lc $mode; ($format) = $self->request_format() unless ( defined $format ); if ( !defined $mode || $mode eq '' ) { $mode = 'single'; } my %params = $self->get_params($mode); if ( !%params ) { $self->throw( "must specify a valid retrieval mode 'single' or 'batch' not '$mode'" ); } my $url = URI->new( $HOSTBASE . $CGILOCATION{$mode}[1] ); unless ( $mode eq 'webenv' || defined $uids || defined $query ) { $self->throw("Must specify a query or list of uids to fetch"); } if ( $query && $query->can('cookie') ) { @params{ 'WebEnv', 'query_key' } = $query->cookie; $params{'db'} = $query->db; } elsif ($query) { $params{'id'} = join ',', $query->ids; } # for batch retrieval, non-query style elsif ( $mode eq 'webenv' && $self->can('cookie') ) { @params{ 'WebEnv', 'query_key' } = $self->cookie; } elsif ($uids) { if ( ref($uids) =~ /array/i ) { $uids = join( ",", @$uids ); } $params{'id'} = $uids; } $seq_start && ( $params{'seq_start'} = $seq_start ); $seq_stop && ( $params{'seq_stop'} = $seq_stop ); $strand && ( $params{'strand'} = $strand ); $email && ( $params{'email'} = $email ); if ( defined $complexity && ( $seq_start || $seq_stop || $strand ) ) { $self->warn( "Complexity set to $complexity; seq_start and seq_stop may not work!" ) if ( $complexity != 1 && ( $seq_start || $seq_stop ) ); $self->warn( "Complexity set to 0; expect strange results with strand set to 2" ) if ( $complexity == 0 && $strand == 2 && $format eq 'fasta' ); } defined $complexity && ( $params{'complexity'} = $complexity ); $params{'rettype'} = $format unless $mode eq 'batch'; # for now, 'post' is batch retrieval if ( $CGILOCATION{$mode}[0] eq 'post' ) { my $response = $self->ua->request( POST $url, [%params] ); $response->proxy_authorization_basic( $self->authentication ) if ( $self->authentication ); $self->_parse_response( $response->content ); my ( $cookie, $querykey ) = $self->cookie; my %qualifiers = ( '-mode' => 'webenv', '-seq_start' => $seq_start, '-seq_stop' => $seq_stop, '-strand' => $strand, '-complexity' => $complexity, '-format' => $format, '-email' => $email ); $self->_sleep(); return $self->get_request(%qualifiers); } else { $url->query_form(%params); return GET $url; } } =head2 get_seq_stream Title : get_seq_stream Usage : my $seqio = $self->get_seq_stream(%qualifiers) Function: builds a url and queries a web db Returns : a Bio::SeqIO stream capable of producing sequence Args : %qualifiers = a hash qualifiers that the implementing class will process to make a url suitable for web querying =cut sub get_seq_stream { my ($self, %qualifiers) = @_; my ($rformat, $ioformat) = $self->request_format(); my $seen = 0; foreach my $key ( keys %qualifiers ) { if( $key =~ /format/i ) { $rformat = $qualifiers{$key}; $seen = 1; } } $qualifiers{'-format'} = $rformat if( !$seen); ($rformat, $ioformat) = $self->request_format($rformat); # These parameters are implemented for Bio::DB::GenBank objects only if($self->isa('Bio::DB::GenBank')) { $self->seq_start() && ($qualifiers{'-seq_start'} = $self->seq_start()); $self->seq_stop() && ($qualifiers{'-seq_stop'} = $self->seq_stop()); $self->strand() && ($qualifiers{'-strand'} = $self->strand()); $self->email() && ($qualifiers{'-email'} = $self->email()); defined $self->complexity() && ($qualifiers{'-complexity'} = $self->complexity()); } my $request = $self->get_request(%qualifiers); $request->proxy_authorization_basic($self->authentication) if ( $self->authentication); $self->debug("request is ". $request->as_string(). "\n"); # workaround for MSWin systems $self->retrieval_type('io_string') if $self->retrieval_type =~ /pipeline/ && $^O =~ /^MSWin/; if ($self->retrieval_type =~ /pipeline/) { # Try to create a stream using POSIX fork-and-pipe facility. # this is a *big* win when fetching thousands of sequences from # a web database because we can return the first entry while # transmission is still in progress. # Also, no need to keep sequence in memory or in a temporary file. # If this fails (Windows, MacOS 9), we fall back to non-pipelined access. # fork and pipe: _stream_request()=> my ($result,$stream) = $self->_open_pipe(); if (defined $result) { $DB::fork_TTY = File::Spec->devnull; # prevents complaints from debugger if (!$result) { # in child process $self->_stream_request($request,$stream); POSIX::_exit(0); #prevent END blocks from executing in this forked child } else { return Bio::SeqIO->new('-verbose' => $self->verbose, '-format' => $ioformat, '-fh' => $stream); } } else { $self->retrieval_type('io_string'); } } if ($self->retrieval_type =~ /temp/i) { my $dir = $self->io->tempdir( CLEANUP => 1); my ( $fh, $tmpfile) = $self->io()->tempfile( DIR => $dir ); close $fh; my $resp = $self->_request($request, $tmpfile); if( ! -e $tmpfile || -z $tmpfile || ! $resp->is_success() ) { $self->throw("WebDBSeqI Error - check query sequences!\n"); } $self->postprocess_data('type' => 'file', 'location' => $tmpfile); # this may get reset when requesting batch mode ($rformat,$ioformat) = $self->request_format(); if( $self->verbose > 0 ) { open my $ERR, '<', $tmpfile or $self->throw("Could not read file '$tmpfile': $!"); while(<$ERR>) { $self->debug($_);} close $ERR; } return Bio::SeqIO->new('-verbose' => $self->verbose, '-format' => $ioformat, '-file' => $tmpfile); } if ($self->retrieval_type =~ /io_string/i ) { my $resp = $self->_request($request); my $content = $resp->content_ref; $self->debug( "content is $$content\n"); if (!$resp->is_success() || length($$content) == 0) { $self->throw("WebDBSeqI Error - check query sequences!\n"); } ($rformat,$ioformat) = $self->request_format(); $self->postprocess_data('type'=> 'string', 'location' => $content); $self->debug( "str is $$content\n"); return Bio::SeqIO->new('-verbose' => $self->verbose, '-format' => $ioformat, '-fh' => new IO::String($$content)); } # if we got here, we don't know how to handle the retrieval type $self->throw("retrieval type " . $self->retrieval_type . " unsupported\n"); } =head2 get_Stream_by_batch Title : get_Stream_by_batch Usage : $seq = $db->get_Stream_by_batch($ref); Function: Retrieves Seq objects from Entrez 'en masse', rather than one at a time. For large numbers of sequences, this is far superior than get_Stream_by_id or get_Stream_by_acc. Example : Returns : a Bio::SeqIO stream object Args : $ref : either an array reference, a filename, or a filehandle from which to get the list of unique ids/accession numbers. NOTE: deprecated API. Use get_Stream_by_id() instead. =cut *get_Stream_by_batch = sub { my $self = shift; $self->deprecated('get_Stream_by_batch() is deprecated; use get_Stream_by_id() instead'); $self->get_Stream_by_id(@_) }; =head2 get_Stream_by_query Title : get_Stream_by_query Usage : $seq = $db->get_Stream_by_query($query); Function: Retrieves Seq objects from Entrez 'en masse', rather than one at a time. For large numbers of sequences, this is far superior to get_Stream_by_id and get_Stream_by_acc. Example : Returns : a Bio::SeqIO stream object Args : An Entrez query string or a Bio::DB::Query::GenBank object. It is suggested that you create a Bio::DB::Query::GenBank object and get the entry count before you fetch a potentially large stream. =cut sub get_Stream_by_query { my ($self, $query) = @_; unless (ref $query && $query->can('query')) { $query = Bio::DB::Query::GenBank->new($query); } return $self->get_seq_stream('-query' => $query, '-mode'=>'query'); } =head2 postprocess_data Title : postprocess_data Usage : $self->postprocess_data ( 'type' => 'string', 'location' => \$datastr ); Function: Process downloaded data before loading into a Bio::SeqIO. This works for Genbank and Genpept, other classes should override it with their own method. Returns : void Args : hash with two keys: 'type' can be 'string' or 'file' 'location' either file location or string reference containing data =cut sub postprocess_data { # retain this in case postprocessing is needed at a future date } =head2 request_format Title : request_format Usage : my ($req_format, $ioformat) = $self->request_format; $self->request_format("genbank"); $self->request_format("fasta"); Function: Get/Set sequence format retrieval. The get-form will normally not be used outside of this and derived modules. Returns : Array of two strings, the first representing the format for retrieval, and the second specifying the corresponding SeqIO format. Args : $format = sequence format =cut sub request_format { my ( $self, $value ) = @_; if ( defined $value ) { $value = lc $value; if ( defined $FORMATMAP{$value} ) { $self->{'_format'} = [ $value, $FORMATMAP{$value} ]; } else { # Try to fall back to a default. Alternatively, we could throw # an exception $self->{'_format'} = [ $value, $value ]; } } return @{ $self->{'_format'} }; } =head2 redirect_refseq Title : redirect_refseq Usage : $db->redirect_refseq(1) Function: simple getter/setter which redirects RefSeqs to use Bio::DB::RefSeq Returns : Boolean value Args : Boolean value (optional) Throws : 'unparseable output exception' Note : This replaces 'no_redirect' as a more straightforward flag to redirect possible RefSeqs to use Bio::DB::RefSeq (EBI interface) instead of retrieving the NCBI records =cut sub redirect_refseq { shift->throw( "Use of redirect_refseq() is deprecated. Bio::DB::GenBank default is to\n". "always retrieve from NCBI, including RefSeq sequences. Please use\n". "Bio::DB::EMBL to retrieve records from EBI"); } =head2 complexity Title : complexity Usage : $db->complexity(3) Function: get/set complexity value Returns : value from 0-4 indicating level of complexity Args : value from 0-4 (optional); if unset server assumes 1 Throws : if arg is not an integer or falls outside of noted range above Note : From efetch docs, the complexity regulates the display: 0 - get the whole blob 1 - get the bioseq for gi of interest (default in Entrez) 2 - get the minimal bioseq-set containing the gi of interest 3 - get the minimal nuc-prot containing the gi of interest 4 - get the minimal pub-set containing the gi of interest =cut sub complexity { my ( $self, $comp ) = @_; if ( defined $comp ) { $self->throw("Complexity value must be integer between 0 and 4") if $comp !~ /^\d+$/ || $comp < 0 || $comp > 4; $self->{'_complexity'} = $comp; } return $self->{'_complexity'}; } =head2 strand Title : strand Usage : $db->strand(1) Function: get/set strand value Returns : strand value if set Args : value of 1 (plus) or 2 (minus); if unset server assumes 1 Throws : if arg is not an integer or is not 1 or 2 Note : This differs from BioPerl's use of strand: 1 = plus, -1 = minus 0 = not relevant. We should probably add in some functionality to convert over in the future. =cut sub strand { my ($self, $str) = @_; if ($str) { $self->throw("strand() must be integer value of 1 (plus strand) or 2 (minus strand) if set") if $str !~ /^\d+$/ || $str < 1 || $str > 2; $self->{'_strand'} = $str; } return $self->{'_strand'}; } =head2 seq_start Title : seq_start Usage : $db->seq_start(123) Function: get/set sequence start location Returns : sequence start value if set Args : integer; if unset server assumes 1 Throws : if arg is not an integer =cut sub seq_start { my ($self, $start) = @_; if ($start) { $self->throw("seq_start() must be integer value if set") if $start !~ /^\d+$/; $self->{'_seq_start'} = $start; } return $self->{'_seq_start'}; } =head2 seq_stop Title : seq_stop Usage : $db->seq_stop(456) Function: get/set sequence stop (end) location Returns : sequence stop (end) value if set Args : integer; if unset server assumes 1 Throws : if arg is not an integer =cut sub seq_stop { my ($self, $stop) = @_; if ($stop) { $self->throw("seq_stop() must be integer if set") if $stop !~ /^\d+$/; $self->{'_seq_stop'} = $stop; } return $self->{'_seq_stop'}; } =head2 email Title : email Usage : $db->email('foo@bar.edu') Function: get/set email value Returns : email (string) or undef Args : string with a valid email address; note we do not vallidate this currently! Throws : if arg is not an integer or falls outside of noted range above Note : This is required if you wish to speed up mulltiple requests faster than 4s per request. =cut sub email { my ( $self, $email ) = @_; if ( defined $email ) { # TODO: validate email? $self->{'_email'} = $email; } return $self->{'_email'}; } =head2 Bio::DB::WebDBSeqI methods Overriding WebDBSeqI method to help newbies to retrieve sequences =head2 get_Stream_by_acc Title : get_Stream_by_acc Usage : $seq = $db->get_Stream_by_acc([$acc1, $acc2]); Function: gets a series of Seq objects by accession numbers Returns : a Bio::SeqIO stream object Args : $ref : a reference to an array of accession numbers for the desired sequence entries Note : For GenBank, this just calls the same code for get_Stream_by_id() =cut sub get_Stream_by_acc { my ( $self, $ids ) = @_; $self->throw("NT_ contigs are whole chromosome files which are not part of regular" . "database distributions. Go to ftp://ftp.ncbi.nih.gov/genomes/.") if $ids =~ /NT_/; return $self->get_seq_stream( '-uids' => $ids, '-mode' => 'single' ); } =head2 delay_policy Title : delay_policy Usage : $secs = $self->delay_policy Function: NCBI requests a delay of 4 seconds between requests unless email is provided. This method implements a 4 second delay; use 'delay()' to override, though understand if no email is provided we are not responsible for users being IP-blocked by NCBI Returns : number of seconds to delay Args : none =cut sub delay_policy { my $self = shift; return $REQUEST_DELAY; } =head2 cookie Title : cookie Usage : ($cookie,$querynum) = $db->cookie Function: return the NCBI query cookie, this information is used by Bio::DB::GenBank in conjunction with efetch, ripped from Bio::DB::Query::GenBank Returns : list of (cookie,querynum) Args : none =cut sub cookie { my $self = shift; if (@_) { $self->{'_cookie'} = shift; $self->{'_querynum'} = shift; } else { return @{$self}{qw(_cookie _querynum)}; } } =head2 _parse_response Title : _parse_response Usage : $db->_parse_response($content) Function: parse out response for cookie, this is a trimmed-down version of _parse_response from Bio::DB::Query::GenBank Returns : empty Args : none Throws : 'unparseable output exception' =cut sub _parse_response { my $self = shift; my $content = shift; if ( my ($warning) = $content =~ m!(.+)!s ) { $self->warn("Warning(s) from GenBank: $warning\n"); } if ( my ($error) = $content =~ /([^<]+)/ ) { $self->throw("Error from Genbank: $error"); } my ($cookie) = $content =~ m!(\S+)!; my ($querykey) = $content =~ m!(\d+)!; $self->cookie( uri_unescape($cookie), $querykey ); } =head2 no_redirect Title : no_redirect Usage : $db->no_redirect($content) Function: DEPRECATED - Used to indicate that Bio::DB::GenBank instance retrieves possible RefSeqs from EBI instead; default behavior is now to retrieve directly from NCBI Returns : None Args : None Throws : Method is deprecated in favor of positive flag method 'redirect_refseq' =cut sub no_redirect { shift->throw( "Use of no_redirect() is deprecated. Bio::DB::GenBank new default is to always\n". "retrieve from NCBI. Please use Bio::DB::EMBL to retrieve records\n"). "from EBI"; } 1; __END__ Query000755000765000024 014536752166 17254 5ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/lib/Bio/DBGenBank.pm100644000765000024 2564614536752166 21314 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/lib/Bio/DB/Query# # BioPerl module for Bio::DB::Query::GenBank.pm # # Please direct questions and support issues to # # Cared for by Lincoln Stein # # Copyright Lincoln Stein # # You may distribute this module under the same terms as perl itself # # POD documentation - main docs before the code # =head1 NAME Bio::DB::Query::GenBank - Build a GenBank Entrez Query =head1 SYNOPSIS use Bio::DB::Query::GenBank; use Bio::DB::GenBank; my $query_string = 'Oryza[Organism] AND EST[Keyword]'; my $query = Bio::DB::Query::GenBank->new(-db => 'nucleotide', -query => $query_string, -mindate => '2001', -maxdate => '2002'); print $query->count,"\n"; # get a Genbank database handle my $gb = Bio::DB::GenBank->new(); my $stream = $gb->get_Stream_by_query($query); while (my $seq = $stream->next_seq) { # do something with the sequence object } # initialize the list yourself my $query = Bio::DB::Query::GenBank->new(-ids=>[195052,2981014,11127914]); =head1 DESCRIPTION This class encapsulates NCBI Entrez queries. It can be used to store a list of GI numbers, to translate an Entrez query expression into a list of GI numbers, or to count the number of terms that would be returned by a query. Once created, the query object can be passed to a Bio::DB::GenBank object in order to retrieve the entries corresponding to the query. =head1 FEEDBACK =head2 Mailing Lists User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated. bioperl-l@bioperl.org - General discussion http://bioperl.org/wiki/Mailing_lists - About the mailing lists =head2 Support Please direct usage questions or support issues to the mailing list: I rather than to the module maintainer directly. Many experienced and reponsive experts will be able look at the problem and quickly address it. Please include a thorough description of the problem with code and data examples if at all possible. =head2 Reporting Bugs Report bugs to the Bioperl bug tracking system to help us keep track the bugs and their resolution. Bug reports can be submitted via the web: https://github.com/bioperl/bioperl-live/issues =head1 AUTHOR - Lincoln Stein Email lstein@cshl.org =head1 APPENDIX The rest of the documentation details each of the object methods. Internal methods are usually preceded with a _ =cut # Let the code begin... package Bio::DB::Query::GenBank; $Bio::DB::Query::GenBank::VERSION = '1.7.8'; use strict; use URI::Escape 'uri_unescape'; use Bio::DB::NCBIHelper; #use constant EPOST => $Bio::DB::NCBIHelper::HOSTBASE . '/entrez/eutils/epost.fcgi'; #use constant ESEARCH => $Bio::DB::NCBIHelper::HOSTBASE . '/entrez/eutils/esearch.fcgi'; # the reference to the our variable of the $Bio::DB::NCBIHelper::HOSTBASE doesn't seem to work in # the constant definition in perl 5.10.1 or 5.16.3 use constant EPOST => '/entrez/eutils/epost.fcgi'; use constant ESEARCH => '/entrez/eutils/esearch.fcgi'; use constant DEFAULT_DB => 'protein'; use constant MAXENTRY => 100; use vars qw(@ATTRIBUTES); our $REQUEST_DELAY = 4; use base qw(Bio::DB::Query::WebQuery); BEGIN { @ATTRIBUTES = qw(db reldate mindate maxdate datetype maxids email); for my $method (@ATTRIBUTES) { eval <{'_$method'}; \$self->{'_$method'} = shift if \@_; \$d; } END } } =head2 new Title : new Usage : $db = Bio::DB::Query::GenBank->new(@args) Function: create new query object Returns : new query object Args : -db database (see below for allowable values) -query query string -mindate minimum date to retrieve from (YYYY/MM/DD) -maxdate maximum date to retrieve from (YYYY/MM/DD) -reldate relative date to retrieve from (days) -datetype date field to use ('edat' or 'mdat') -ids array ref of gids (overrides query) -maxids the maximum number of IDs you wish to collect (defaults to 100) -email Email address; required if you want to decrease delay time between queries -delay Delay time (in seconds). Note NCBI policy requires 4 seconds between requests unless an email is provided This method creates a new query object. Typically you will specify a -db and a -query argument, possibly modified by -mindate, -maxdate, or -reldate. -mindate and -maxdate specify minimum and maximum dates for entries you are interested in retrieving, expressed in the form YYYY/MM/DD. -reldate is used to fetch entries that are more recent than the indicated number of days. If you provide an array reference of IDs in -ids, the query will be ignored and the list of IDs will be used when the query is passed to a Bio::DB::GenBank object's get_Stream_by_query() method. A variety of IDs are automatically recognized, including GI numbers, Accession numbers, Accession.version numbers and locus names. By default, the query will collect only the first 100 IDs and will generate an exception if you call the ids() method and the query returned more than that number. To increase this maximum, set -maxids to a number larger than the number of IDs you expect to obtain. This only affects the list of IDs you obtain when you call the ids() method, and does not affect in any way the number of entries you receive when you generate a SeqIO stream from the query. -db option values: The most commonly used databases are: protein nucleotide nuccore nucgss nucest unigene An up to date list of database names supported by NCBI eUtils is always available at: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi? However, note that not all of these databases return datatypes that are parsable by Bio::DB::GenBank =cut sub new { my $class = shift; my $self = $class->SUPER::new(@_); my ($query,$db,$reldate,$mindate,$maxdate,$datetype,$ids,$maxids,$email,$delay) = $self->_rearrange([qw(QUERY DB RELDATE MINDATE MAXDATE DATETYPE IDS MAXIDS EMAIL DELAY)],@_); $self->db($db || DEFAULT_DB); $reldate && $self->reldate($reldate); $mindate && $self->mindate($mindate); $maxdate && $self->maxdate($maxdate); $maxids && $self->maxids($maxids); $email && $self->email($email); $datetype ||= 'mdat'; $datetype && $self->datetype($datetype); $self; } =head2 cookie Title : cookie Usage : ($cookie,$querynum) = $db->cookie Function: return the NCBI query cookie Returns : list of (cookie,querynum) Args : none NOTE: this information is used by Bio::DB::GenBank in conjunction with efetch. =cut sub cookie { my $self = shift; if (@_) { $self->{'_cookie'} = shift; $self->{'_querynum'} = shift; } else { $self->_sleep(); $self->_run_query; @{$self}{qw(_cookie _querynum)}; } } =head2 _request_parameters Title : _request_parameters Usage : ($method,$base,@params = $db->_request_parameters Function: return information needed to construct the request Returns : list of method, url base and key=>value pairs Args : none =cut sub _request_parameters { my $self = shift; my ($method,$base); my @params = map {eval("\$self->$_") ? ($_ => eval("\$self->$_")) : () } @ATTRIBUTES; push @params,('usehistory'=>'y','tool'=>'bioperl'); $method = 'get'; $base = $Bio::DB::NCBIHelper::HOSTBASE.ESEARCH; # this seems to need to be dynamic push @params,('term' => $self->query); # Providing 'retmax' limits queries to 500 sequences ?? I don't think so LS push @params,('retmax' => $self->maxids || MAXENTRY); if ($self->email) { push @params,('email' => $self->email); } # And actually, it seems that we need 'retstart' equal to 0 ?? I don't think so LS # push @params, ('retstart' => 0); ($method,$base,@params); } =head2 count Title : count Usage : $count = $db->count; Function: return count of number of entries retrieved by query Returns : integer Args : none Returns the number of entries that are matched by the query. =cut sub count { my $self = shift; if (@_) { my $d = $self->{'_count'}; $self->{'_count'} = shift; return $d; } else { $self->_sleep(); $self->_run_query; return $self->{'_count'}; } } =head2 ids Title : ids Usage : @ids = $db->ids([@ids]) Function: get/set matching ids Returns : array of sequence ids Args : (optional) array ref with new set of ids =cut =head2 query Title : query Usage : $query = $db->query([$query]) Function: get/set query string Returns : string Args : (optional) new query string =cut =head2 _parse_response Title : _parse_response Usage : $db->_parse_response($content) Function: parse out response Returns : empty Args : none Throws : 'unparseable output exception' =cut sub _parse_response { my $self = shift; my $content = shift; if (my ($warning) = $content =~ m!(.+)!s) { $self->warn("Warning(s) from GenBank: $warning\n"); } if (my ($error) = $content =~ /([^<]+)/) { $self->throw("Error from Genbank: $error"); } my ($count) = $content =~ /(\d+)/; my ($max) = $content =~ /(\d+)/; my $truncated = $count > $max; $self->count($count); if (!$truncated) { my @ids = $content =~ /(\d+)/g; $self->ids(\@ids); } else { $self->debug("ids truncated at $max\n"); } $self->_truncated($truncated); my ($cookie) = $content =~ m!(\S+)!; my ($querykey) = $content =~ m!(\d+)!; $self->cookie(uri_unescape($cookie),$querykey); } =head2 _generate_id_string Title : _generate_id_string Usage : $string = $db->_generate_id_string Function: joins IDs together in string (possibly implementation-dependent) Returns : string of concatenated IDs Args : array ref of ids (normally passed into the constructor) =cut sub _generate_id_string { my ($self, $ids) = @_; # this attempts to separate out accs (alphanumeric) from UIDs (numeric only) # recent changes to esearch has wrought this upon us.. cjf 4/19/07 return sprintf('%s',join('|',map { ($_ =~ m{^\d+$}) ? $_.'[UID]' : $_.'[PACC]' } @$ids)); } =head2 delay_policy Title : delay_policy Usage : $secs = $self->delay_policy Function: NCBI requests a delay of 4 seconds between requests unless email is provided. This method implements a 4 second delay; use 'delay()' to override, though understand if no email is provided we are not responsible for users being IP-blocked by NCBI Returns : number of seconds to delay Args : none =cut sub delay_policy { my $self = shift; return $REQUEST_DELAY; } 1; bp_download_query_genbank100644000765000024 636414536752166 22451 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/bin#!/usr/bin/perl =head1 NAME bp_download_query_genbank - script to query Genbank and retrieve records =head1 USAGE bp_download_query_genbank --query "Neurospora[ORGN]" --db nucest -o Ncrassa_ESTs.fa --format fasta bp_download_query_genbank --queryfile 'filewithquery' --db nucest -o Ncrassa_ESTs.fa --format fasta =head2 Other options Provide ONE of: -q --query query string OR --queryfile profile file with query OR --gi --gis --gifile file with list of GIs to download Database type: -d --db database (nucleotide [default], nucest, protein, ) -o --out --outfile output file (results are displayed on screen otherwise) -f --format sequence file output format (fasta by default) -v --verbose debugging output =head2 Query options --maxids maximum number of IDs to retrieve in a set (100 at a time by default) --reldate --maxdate maxdate for a record --mindate minimum date for record --datetype edat or mdat (entered or modified) =head1 AUTHOR Jason Stajich Jason Stajich, jason-AT-bioperl.org =cut use strict; use warnings; use Bio::DB::GenBank; use Bio::DB::GenPept; use Bio::DB::Query::GenBank; use Bio::SeqIO; use Getopt::Long; my ($queryfile,$outfile,$format,$debug,%options); $format = 'fasta'; $options{'-maxids'} = '100'; $options{'-db'} = 'nucleotide'; # can be nucleotide, nucest, protein my $gifile; GetOptions( 'h|help' => sub { exec('perldoc', $0); exit(0); }, 'v|verbose' => \$debug, 'f|format:s' => \$format, 'queryfile:s' => \$queryfile, 'o|out|outfile:s' => \$outfile, 'gi|gifile|gis:s' => \$gifile, # DB::Query options 'd|db:s' => \$options{'-db'}, 'mindate:s' => \$options{'-mindate'}, 'maxdate:s' => \$options{'-maxdate'}, 'reldate:s' => \$options{'-reldate'}, 'datetype:s' => \$options{'-datetype'}, # edat or mdat 'maxids:i' => \$options{'-maxids'}, 'q|query:s' => \$options{'-query'}, ); my $out; if( $outfile ) { $out = Bio::SeqIO->new(-format => $format, -file => ">$outfile"); } else { $out = Bio::SeqIO->new(-format => $format); # write to STDOUT } my $dbh; if( $options{'-db'} eq 'protein' ) { $dbh = Bio::DB::GenPept->new(-verbose => $debug); } else { $dbh = Bio::DB::GenBank->new(-verbose => $debug); } my $query; if( $gifile ) { my @ids; open my $fh, '<', $gifile or die "Could not read file '$gifile': $!\n"; while(<$fh>) { push @ids, split; } close $fh; while( @ids ) { my @mini_ids = splice(@ids, 0, $options{'-maxids'}); $query = Bio::DB::Query::GenBank->new(%options, -verbose =>$debug, -ids => \@mini_ids, ); my $stream = $dbh->get_Stream_by_query($query); while( my $seq = $stream->next_seq ) { $out->write_seq($seq); } } exit; } elsif( $options{'-query'}) { $query = Bio::DB::Query::GenBank->new(%options,-verbose => $debug); } elsif( $queryfile ) { open my $fh, '<', $queryfile or die "Could not read file '$queryfile': $!\n"; while(<$fh>) { chomp; $options{'-query'} .= $_; } $query = Bio::DB::Query::GenBank->new(%options,-verbose => $debug); close $fh; } else { die("no query string or gifile\n"); } my $stream = $dbh->get_Stream_by_query($query); while( my $seq = $stream->next_seq ) { $out->write_seq($seq); } bp_biofetch_genbank_proxy100644000765000024 2006614536752166 22454 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/bin#!/usr/bin/perl # dbfetch style caching proxy for GenBank use strict; use warnings; use CGI qw(:standard); use HTTP::Request::Common; use LWP::UserAgent; use Cache::FileCache; use vars qw(%GOT $BUFFER %MAPPING $CACHE); use constant CACHE_LOCATION => '/usr/tmp/dbfetch_cache'; use constant MAX_SIZE => 100_000_000; # 100 megs, roughly use constant CACHE_DEPTH => 4; use constant EXPIRATION => "1 week"; use constant PURGE => "1 hour"; %MAPPING = (genbank => {db=>'nucleotide', rettype => 'gb'}, genpep => {db=>'protein', rettype => 'gp'}); # we're doing everything in callbacks, so initialize globals. $BUFFER = ''; %GOT = (); print header('text/plain'); param() or print_usage(); my $db = param('db'); my $style = param('style'); my $format = param('format'); my $id = param('id'); my @ids = split /\s+/,$id; $format = 'genbank' if $format eq 'default'; #h'mmmph $MAPPING{$db} or error(1=>"Unknown database [$db]"); $style eq 'raw' or error(2=>"Unknown style [$style]"); $format eq 'genbank' or error(3=>"Format [$format] not known for database [$db]"); $CACHE = Cache::FileCache->new({cache_root => CACHE_LOCATION, default_expires_in => EXPIRATION, cache_DEPTH => CACHE_DEPTH, namespace => 'dbfetch', auto_purge_interval => PURGE}); # handle cached entries foreach (@ids) { if (my $obj = $CACHE->get($_)) { $GOT{$_}++; print $obj,"//\n"; } } # handle the remainder @ids = grep {!$GOT{$_}} @ids; if (@ids) { my $request = POST('https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi', [rettype => $MAPPING{$db}{rettype}, db => $MAPPING{$db}{db}, tool => 'bioperl', retmode => 'text', usehistory => 'n', id => join(',',@ids), ] ); my $ua = LWP::UserAgent->new; my $response = $ua->request($request,\&callback); if ($response->is_error) { my $status = $response->status_line; error(6 => "HTTP error from GenBank [$status]"); } } my @missing_ids = grep {!$GOT{$_}} @ids; foreach (@missing_ids) { error(4=>"ID [$_] not found in database [$db]",1); } # my $response = $response->content; sub process_record { my $record = shift; print "$record//\n"; my ($locus) = $record =~ /^LOCUS\s+(\S+)/m; my ($accession) = $record =~ /^ACCESSION\s+(\S+)/m; my ($version,$gi) = $record =~ /^VERSION\s+(\S+)\s+GI:(\d+)/m; foreach ($locus,$accession,$version,$gi) { $GOT{$_}++; $CACHE->set($_,$record); } } sub callback { my $data = shift; $BUFFER .= $data; my $index = 0; while (($index = index($BUFFER,"//\n\n",$index))>=0) { my $record = substr($BUFFER,0,$index); $index += length("//\n\n"); substr($BUFFER,0,$index) = ''; process_record($record); } } sub print_usage { print <<'END'; This script is intended to be used non-interactively. Brief summary of arguments: URL This interface does not specify what happens when biofetch is called in interactive context. The implementations can return the entries decorated with HTML tags and hypertext links. A URL for biofetch consists of four sections: e.g. 1. protocol http:// 2. host www.ebi.ac.uk 3. path to program /Tools/dbfetch/dbfetch 4. query string ?style=raw;format=embl;db=embl;id=J00231 QUERY STRING The query string options are separated from the base URL (protocol + host + path) by a question mark (?) and from each other by a semicolon ';' (or by ampersand '&'). See CGI GET documents at http://www.w3.org/CGI/). The order of options is not critical. It is recommended to leave the ID to be the last item. Input for options should be case insensitive. option: db Option : db Descr : database name Type : required Usage : db=genpep | db=genbank Arg : string Currently this server accepts "genbank" and "genpep" option: style Option : style Descr : +/- HTML tags Type : required Usage : style=raw | db=html Arg : enum (raw|html) In non-interactive context, always give "style=raw". This uses "Content-Type: text/plain". If other content types are needed (XML), this part of the spesifications can be extended to accommodate them. This server only accepts "raw". option: format Option : format Descr : format of the database entries returned Type : optional Usage : format=genbank Arg : enum Format defaults to the distribution format of the database (embl for EMBL database). If some other supported format is needed this option is needed (E.g. formats for EMBL: fasta, bsml, agave). This server only accepts "genbank" format. option: id Option : id Descr : unique database identifier(s) Type : required Usage : db=J00231 | id=J00231+HSFOS Arg : string The ID option should be able to process all UIDS in a database. It should not be necessary to know if the UID is an ID, accession number or accession.version. The number of entry UIDs allowed is implementation specific. If the limit is exceeded, the the program reports an error. The UIDs should be separated by spaces (use '+' in a GET method string). ERROR MESSAGES The following standardized one line messages should be printed out in case of an error. ERROR 1 Unknown database [$db]. ERROR 2 Unknown style [$style]. ERROR 3 Format [$format] not known for database [$db]. ERROR 4 ID [$id] not found in database [$db]. ERROR 5 Too many IDs [$count]. Max [$MAXIDS] allowed. END ; exit 0; } sub error { my ($code,$message,$noexit) = @_; print "ERROR $code $message\n"; exit 0 unless $noexit; } __END__ =head1 NAME bp_biofetch_genbank_proxy.pl - Caching BioFetch-compatible web proxy for GenBank =head1 SYNOPSIS Install in cgi-bin directory of a Web server. Stand back. =head1 DESCRIPTION This CGI script acts as the server side of the BioFetch protocol as described in http://obda.open-bio.org/Specs/. It provides two database access services, one for data source "genbank" (nucleotide entries) and the other for data source "genpep" (protein entries). This script works by forwarding its requests to NCBI's eutils script, which lives at https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi. It then reformats the output according to the BioFetch format so the sequences can be processed and returned by the Bio::DB::BioFetch module. Returned entries are temporarily cached on the Web server's file system, allowing frequently-accessed entries to be retrieved without another round trip to NCBI. =head2 INSTALLATION You must have the following installed in order to run this script: 1) perl 2) the perl modules LWP and Cache::FileCache 3) a web server (Apache recommended) To install this script, copy it into the web server's cgi-bin directory. You might want to shorten its name; "dbfetch" is recommended. There are several constants located at the top of the script that you may want to adjust. These are: CACHE_LOCATION This is the location on the filesystem where the cached files will be located. The default is /usr/tmp/dbfetch_cache. MAX_SIZE This is the maximum size that the cache can grow to. When the cache exceeds this size older entries will be deleted automatically. The default setting is 100,000,000 bytes (100 MB). EXPIRATION Entries that haven't been accessed in this length of time will be removed from the cache. The default is 1 week. PURGE This constant specifies how often the cache will be purged for older entries. The default is 1 hour. =head1 TESTING To see if this script is performing as expected, you may test it with this script: use Bio::DB::BioFetch; my $db = Bio::DB::BioFetch->new(-baseaddress=>'http://localhost/cgi-bin/dbfetch', -format =>'genbank', -db =>'genbank'); my $seq = $db->get_Seq_by_id('DDU63596'); print $seq->seq,"\n"; This should print out a DNA sequence. =head1 SEE ALSO L, L =head1 AUTHOR Lincoln Stein, Elstein-at-cshl.orgE Copyright (c) 2003 Cold Spring Harbor Laboratory This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See DISCLAIMER.txt for disclaimers of warranty. =cut Taxonomy000755000765000024 014536752166 17765 5ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/lib/Bio/DBentrez.pm100644000765000024 5275014536752166 22023 0ustar00cjfieldsstaff000000000000Bio-DB-NCBIHelper-1.7.8/lib/Bio/DB/Taxonomy# # BioPerl module for Bio::DB::Taxonomy::entrez # # Please direct questions and support issues to # # Cared for by Jason Stajich # # Copyright Jason Stajich # # You may distribute this module under the same terms as perl itself # POD documentation - main docs before the code =head1 NAME Bio::DB::Taxonomy::entrez - Taxonomy Entrez driver =head1 SYNOPSIS # Do not use this object directly, rather through the Bio::DB::Taxonomy # interface use Bio::DB::Taxonomy; my $db = Bio::DB::Taxonomy->new(-source => 'entrez'); my $taxonid = $db->get_taxonid('Homo sapiens'); my $node = $db->get_Taxonomy_Node(-taxonid => $taxonid); my $gi = 71836523; my $node = $db->get_Taxonomy_Node(-gi => $gi, -db => 'protein'); print $node->binomial, "\n"; my ($species,$genus,$family) = $node->classification; print "family is $family\n"; # Can also go up 4 levels my $p = $node; for ( 1..4 ) { $p = $db->get_Taxonomy_Node(-taxonid => $p->parent_id); } print $p->rank, " ", ($p->classification)[0], "\n"; # could then classify a set of BLAST hits based on their GI numbers # into taxonomic categories. It is not currently possibly to query a node for its children so we cannot completely replace the advantage of the flatfile Bio::DB::Taxonomy::flatfile module. =head1 DESCRIPTION A driver for querying NCBI Entrez Taxonomy database. =head1 FEEDBACK =head2 Mailing Lists User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to the Bioperl mailing list. Your participation is much appreciated. bioperl-l@bioperl.org - General discussion http://bioperl.org/wiki/Mailing_lists - About the mailing lists =head2 Support Please direct usage questions or support issues to the mailing list: I rather than to the module maintainer directly. Many experienced and reponsive experts will be able look at the problem and quickly address it. Please include a thorough description of the problem with code and data examples if at all possible. =head2 Reporting Bugs Report bugs to the Bioperl bug tracking system to help us keep track of the bugs and their resolution. Bug reports can be submitted via the web: https://github.com/bioperl/bioperl-live/issues =head1 AUTHOR - Jason Stajich Email jason-at-bioperl.org =head1 CONTRIBUTORS Sendu Bala: bix@sendu.me.uk =head1 APPENDIX The rest of the documentation details each of the object methods. Internal methods are usually preceded with a _ =cut # Let the code begin... package Bio::DB::Taxonomy::entrez; $Bio::DB::Taxonomy::entrez::VERSION = '1.7.8'; use vars qw($EntrezLocation $UrlParamSeparatorValue %EntrezParams $EntrezGet $EntrezSummary $EntrezFetch %SequenceParams $XMLTWIG $DATA_CACHE $RELATIONS); use strict; use Bio::Taxon; use Bio::DB::NCBIHelper; eval { require XML::Twig; $XMLTWIG = 1; }; if( $@ ) { $XMLTWIG = 0; } use base qw(Bio::WebAgent Bio::DB::Taxonomy); $EntrezLocation = $Bio::DB::NCBIHelper::HOSTBASE . '/entrez/eutils/'; $EntrezGet = 'esearch.fcgi'; $EntrezFetch = 'efetch.fcgi'; $EntrezSummary = 'esummary.fcgi'; $DATA_CACHE = {}; $RELATIONS = {}; %EntrezParams = ( 'db' => 'taxonomy', 'report' => 'xml', 'retmode'=> 'xml', 'tool' => 'Bioperl'); %SequenceParams = ( 'db' => 'nucleotide', # or protein 'retmode' => 'xml', 'tool' => 'Bioperl'); $UrlParamSeparatorValue = '&'; =head2 new Title : new Usage : my $obj = Bio::DB::Taxonomy::entrez->new(); Function: Builds a new Bio::DB::Taxonomy::entrez object Returns : an instance of Bio::DB::Taxonomy::entrez Args : -location => URL to Entrez (if you want to override the default) -params => Hashref of URL params if you want to override the default =cut sub new { my ($class, @args) = @_; # need to initialise Bio::WebAgent... my ($self) = $class->SUPER::new(@args); # ... as well as our normal Bio::DB::Taxonomy selves: $self->_initialize(@args); return $self; } sub _initialize { my($self) = shift; $self->SUPER::_initialize(@_); my ($location,$params,$email) = $self->_rearrange([qw(LOCATION PARAMS EMAIL)],@_); if( $params ) { if( ref($params) !~ /HASH/i ) { $self->warn("Must have provided a valid HASHref for -params"); $params = \%EntrezParams; } } else { $params = \%EntrezParams; } if ($email) { $params->{email} = $email; } $self->entrez_params($params); $self->entrez_url($location || $EntrezLocation ); } =head2 get_num_taxa Title : get_num_taxa Usage : my $num = $db->get_num_taxa(); Function: Get the number of taxa stored in the database. Returns : A number Args : None =cut sub get_num_taxa { my ($self) = @_; # Use this URL query to get the ID of all the taxa in the NCBI Taxonomy database: # https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=taxonomy&term=all[Filter] # Only the first 20 taxa IDs are returned (good because the list is long), # and the total number is reported as well (which is what we are interested # in). my %p = $self->entrez_params; $p{'term'} = 'all[Filter]'; my $twig = $self->_run_query($self->_build_url($EntrezGet, \%p)); my $count = $twig->root->first_child('Count')->first_child->text; return $count; } =head2 get_taxon Title : get_taxon Usage : my $taxon = $db->get_taxon(-taxonid => $taxonid) Function: Get a Bio::Taxon object from the database. Returns : Bio::Taxon object Args : just a single value which is the database id, OR named args: -taxonid => taxonomy id (to query by taxonid) OR -name => string (to query by a taxonomy name: common name, scientific name, etc) OR To retrieve a taxonomy node for a GI number provide the -gi option with the gi number and -db with either 'nucleotide' or 'protein' to define the db. AND optionally, -full => 1 (to force retrieval of full information - sometimes minimal information about your taxon may have been cached, which is normally used to save database accesses) =cut sub get_taxon { my $self = shift; if (! $XMLTWIG) { eval { require XML::Twig }; $self->throw("Could not load XML::Twig for get_taxon(): $@") if $@; } my %p = $self->entrez_params; # convert input request to one or more ids my (@taxonids, $taxonid, $want_full); if (@_ > 1) { my %params = @_; if ($params{'-taxonid'}) { $taxonid = $params{'-taxonid'}; } elsif ($params{'-gi'}) { my $db = $params{'-db'}; # we're going to do all the work here and then redirect # the call based on the TaxId my %p = %SequenceParams; my %items; if( ref($params{'-gi'}) =~ /ARRAY/i ) { $p{'id'} = join(',', @{$params{'-gi'}}); } else { $p{'id'} = $params{'-gi'}; } $p{'db'} = $db if defined $db; my $url = $self->_build_url($EntrezSummary, \%p); my @ids; if (exists $DATA_CACHE->{gi_to_ids}->{$url}) { @ids = @{$DATA_CACHE->{gi_to_ids}->{$url}}; } else { my $twig = $self->_run_query($url); my $root = $twig->root; for my $topnode ( $root->children('DocSum') ) { for my $child ( $topnode->children('Item') ) { if( uc($child->{att}->{'Name'}) eq 'TAXID' ) { push @ids, $child->text; } } } $DATA_CACHE->{gi_to_ids}->{$url} = \@ids; } return $self->get_taxon(-taxonid => \@ids); } elsif ($params{'-name'}) { @taxonids = $self->get_taxonid($params{'-name'}); } else { $self->warn("Need to have provided either a -taxonid or -name value to get_taxon"); } if ($params{'-full'}) { $want_full = 1; } } else { $taxonid = shift; } if (ref($taxonid) =~ /ARRAY/i ) { @taxonids = @{$taxonid}; } else { push(@taxonids, $taxonid) if $taxonid; } # return answer(s) from the cache if possible my @results; my @uncached; foreach my $taxonid (@taxonids) { $taxonid || $self->throw("In taxonids list one was undef! '@taxonids'\n"); if (defined $DATA_CACHE->{full_info}->{$taxonid}) { push(@results, $self->_make_taxon($DATA_CACHE->{full_info}->{$taxonid})); } elsif (! $want_full && defined $DATA_CACHE->{minimal_info}->{$taxonid}) { push(@results, $self->_make_taxon($DATA_CACHE->{minimal_info}->{$taxonid})); } else { push(@uncached, $taxonid); } } if (@uncached > 0) { $taxonid = join(',', @uncached); $p{'id'} = $taxonid; my $twig = $self->_run_query($self->_build_url($EntrezFetch, \%p)); my $root = $twig->root; for my $taxon ( $root->children('Taxon') ) { my $taxid = $taxon->first_child_text('TaxId'); $self->throw("Got a result with no TaxId!") unless $taxid; my $data = {}; if (exists $DATA_CACHE->{minimal_info}->{$taxid}) { $data = $DATA_CACHE->{minimal_info}->{$taxid}; } $data->{id} = $taxid; $data->{rank} = $taxon->first_child_text('Rank'); my $other_names = $taxon->first_child('OtherNames'); my @other_names = $other_names->children_text() if $other_names; my $sci_name = $taxon->first_child_text('ScientificName'); my $orig_sci_name = $sci_name; $sci_name =~ s/ \(class\)$//; push(@other_names, $orig_sci_name) if $orig_sci_name ne $sci_name; $data->{scientific_name} = $sci_name; $data->{common_names} = \@other_names; $data->{division} = $taxon->first_child_text('Division'); $data->{genetic_code} = $taxon->first_child('GeneticCode')->first_child_text('GCId'); $data->{mitochondrial_genetic_code} = $taxon->first_child('MitoGeneticCode')->first_child_text('MGCId'); $data->{create_date} = $taxon->first_child_text('CreateDate'); $data->{update_date} = $taxon->first_child_text('UpdateDate'); $data->{pub_date} = $taxon->first_child_text('PubDate'); # since we have some information about all the ancestors of our # requested node, we may as well cache data for the ancestors to # reduce the number of accesses to website in future my $lineage_ex = $taxon->first_child('LineageEx'); if (defined $lineage_ex) { my ($ancestor, $lineage_data, @taxa); foreach my $lineage_taxon ($lineage_ex->children) { my $lineage_taxid = $lineage_taxon->first_child_text('TaxId'); if (exists $DATA_CACHE->{minimal_info}->{$lineage_taxid} || exists $DATA_CACHE->{full_info}->{$lineage_taxid}) { $lineage_data = $DATA_CACHE->{minimal_info}->{$lineage_taxid} || $DATA_CACHE->{full_info}->{$lineage_taxid}; next; } else { $lineage_data = {}; } $lineage_data->{id} = $lineage_taxid; $lineage_data->{scientific_name} = $lineage_taxon->first_child_text('ScientificName'); $lineage_data->{rank} = $lineage_taxon->first_child_text('Rank'); $RELATIONS->{ancestors}->{$lineage_taxid} = $ancestor->{id} if $ancestor; $DATA_CACHE->{minimal_info}->{$lineage_taxid} = $lineage_data; } continue { $ancestor = $lineage_data; unshift(@taxa, $lineage_data); } $RELATIONS->{ancestors}->{$taxid} = $ancestor->{id} if $ancestor; # go through the lineage in reverse so we can remember the children my $child = $data; foreach my $lineage_data (@taxa) { $RELATIONS->{children}->{$lineage_data->{id}}->{$child->{id}} = 1; } continue { $child = $lineage_data; } } delete $DATA_CACHE->{minimal_info}->{$taxid}; $DATA_CACHE->{full_info}->{$taxid} = $data; push(@results, $self->_make_taxon($data)); } } wantarray() ? @results : shift @results; } *get_Taxonomy_Node = \&get_taxon; =head2 get_taxonids Title : get_taxonids Usage : my $taxonid = $db->get_taxonids('Homo sapiens'); Function: Searches for a taxonid (typically ncbi_taxon_id) based on a query string. Note that multiple taxonids can match to the same supplied name. Returns : array of integer ids in list context, one of these in scalar context Args : string representing taxon's name =cut sub get_taxonids { my ($self,$query) = @_; my %p = $self->entrez_params; # queries don't work correctly with special characters, so get rid of them. if ($query =~ /<.+>/) { # queries with will fail, so workaround by removing, doing # the query, getting multiple taxonids, then picking the one id that # has a parent node with a scientific_name() or common_names() # case-insensitive matching to the word(s) within <> $query =~ s/ <(.+?)>//; my $desired_parent_name = lc($1); ID: for my $start_id ($self->get_taxonids($query)) { my $node = $self->get_taxon($start_id) || next ID; # walk up the parents until we hit a node with a named rank while (1) { my $parent_node = $self->ancestor($node) || next ID; my $parent_sci_name = $parent_node->scientific_name || next ID; my @parent_common_names = $parent_node->common_names; unless (@parent_common_names) { # ensure we're not using a minimal-info cached version $parent_node = $self->get_taxon(-taxonid => $parent_node->id, -full => 1); @parent_common_names = $parent_node->common_names; } for my $name ($parent_sci_name, @parent_common_names) { if (lc($name) eq $desired_parent_name) { return wantarray() ? ($start_id) : $start_id; } } my $parent_rank = $parent_node->rank || 'no rank'; if ($parent_rank ne 'no rank') { last; } else { $node = $parent_node; } } } return; } $query =~ s/[\"\(\)]//g; # not an exhaustive list; these are just the ones we know cause problems $query =~ s/\s/+/g; my @data; if (defined $DATA_CACHE->{name_to_id}->{$query}) { @data = @{$DATA_CACHE->{name_to_id}->{$query}}; } else { $p{'term'} = $query; my $twig = $self->_run_query($self->_build_url($EntrezGet, \%p)); my $root = $twig->root; # Check that all words in the query are found, because we do not want to # match just 1 word if there are multiple words, e.g. if we query with # "Homo sapiens" both "homo" and "sapiens" must be found my $errorlist = $root->first_child('ErrorList'); if ( $errorlist ) { my @notfound = map { $_->text } $errorlist->children('PhraseNotFound'); if ( @notfound ) { for my $term ( @notfound ) { return "No hit" if $query =~ /$term/; } } } my $list = $root->first_child('IdList'); @data = map { $_->text } $list->children('Id'); $DATA_CACHE->{name_to_id}->{$query} = [@data]; } return wantarray() ? @data : shift @data; } *get_taxonid = \&get_taxonids; =head2 ancestor Title : ancestor Usage : my $ancestor_taxon = $db->ancestor($taxon) Function: Retrieve the ancestor taxon of a supplied Taxon from the database. Note that unless the ancestor has previously been directly requested with get_taxon(), the returned Taxon object will only have a minimal amount of information. Returns : Bio::Taxon Args : Bio::Taxon (that was retrieved from this database) =cut sub ancestor { my ($self, $taxon) = @_; $self->throw("Must supply a Bio::Taxon") unless ref($taxon) && $taxon->isa('Bio::Taxon'); $self->throw("The supplied Taxon must belong to this database") unless $taxon->db_handle && $taxon->db_handle eq $self; my $id = $taxon->id || $self->throw("The supplied Taxon is missing its id!"); my $ancestor_id = $RELATIONS->{ancestors}->{$id} || return; return $self->_make_taxon($DATA_CACHE->{full_info}->{$ancestor_id} || $DATA_CACHE->{minimal_info}->{$ancestor_id}); } =head2 each_Descendent Title : each_Descendent Usage : my @taxa = $db->each_Descendent($taxon); Function: Get all the descendents of the supplied Taxon (but not their descendents, ie. not a recursive fetchall). Note that this implementation is unable to return a taxon that hasn't previously been directly fetched with get_taxon(), or wasn't an ancestor of such a fetch. Returns : Array of Bio::Taxon objects Args : Bio::Taxon (that was retrieved from this database) =cut sub each_Descendent { my ($self, $taxon) = @_; $self->throw("Must supply a Bio::Taxon") unless ref($taxon) && $taxon->isa('Bio::Taxon'); $self->throw("The supplied Taxon must belong to this database") unless $taxon->db_handle && $taxon->db_handle eq $self; my $id = $taxon->id || $self->throw("The supplied Taxon is missing its id!"); my @children_ids = keys %{$RELATIONS->{children}->{$id} || {}}; my @children; foreach my $child_id (@children_ids) { push(@children, $self->_make_taxon($DATA_CACHE->{full_info}->{$child_id} || $DATA_CACHE->{minimal_info}->{$child_id})); } return @children; } =head2 Some Get/Setter methods =head2 entrez_url Title : entrez_url Usage : $obj->entrez_url($newval) Function: Get/set entrez URL Returns : value of entrez url (a scalar) Args : on set, new value (a scalar or undef, optional) =cut sub entrez_url{ my $self = shift; return $self->{'_entrez_url'} = shift if @_; return $self->{'_entrez_url'}; } =head2 entrez_params Title : entrez_params Usage : $obj->entrez_params($newval) Function: Get/set entrez params Returns : value of entrez_params (a hashref) Args : on set, new value Hashref =cut sub entrez_params{ my $self = shift; my $f; if( @_ ) { $f = $self->{'_entrez_params'} = shift; } else { $f = $self->{'_entrez_params'}; } return %$f; } =head2 Bio::DB::WebBase methods =head2 proxy_string Title : proxy_string Usage : my $proxy_string = $self->proxy_string($protocol) Function: Get the proxy string (plus user/pass ) Returns : string Args : protocol ('http' or 'ftp'), default 'http' =head2 proxy Title : proxy Usage : $httpproxy = $db->proxy('http') or $db->proxy(['http','ftp'], 'http://myproxy' ) Function: Get/Set a proxy for use of proxy Returns : a string indicating the proxy Args : $protocol : an array ref of the protocol(s) to set/get $proxyurl : url of the proxy to use for the specified protocol $username : username (if proxy requires authentication) $password : password (if proxy requires authentication) =head2 authentication Title : authentication Usage : $db->authentication($user,$pass) Function: Get/Set authentication credentials Returns : Array of user/pass Args : Array or user/pass =cut # make a Taxon object from data hash ref sub _make_taxon { my ($self, $data) = @_; my $taxon = Bio::Taxon->new(); my $taxid; while (my ($method, $value) = each %{$data}) { if ($method eq 'id') { $method = 'ncbi_taxid'; # since this is a real ncbi taxid, explicitly set it as one $taxid = $value; } $taxon->$method(ref($value) eq 'ARRAY' ? @{$value} : $value); } # we can't use -dbh or the db_handle() method ourselves or we'll go # infinite on the merge attempt $taxon->{'db_handle'} = $self; $self->_handle_internal_id($taxon); return $taxon; } sub _build_url { # Given a eutility (esearch.fcgi, efetch.fcgi or esummary.fcgi) and a # hashref or parameters, build a url suitable for eutil query my ($self, $eutility, $p) = @_; my $params = join($UrlParamSeparatorValue, map { $_.'='.$p->{$_} } keys %$p); my $url = $self->entrez_url.$eutility.'?'.$params; $self->debug("url is $url\n"); return $url; } sub _run_query { # Given an eutil url, run the eutil query and parse the response into an # XML Twig object my ($self, $url) = @_; $self->sleep(); my $response = $self->get($url); if ($response->is_success) { $response = $response->content; }else { $self->throw("Can't query website: ".$response->status_line); } $self->debug("response is $response\n"); my $twig = XML::Twig->new; $twig->parse($response); return $twig; } 1;