Bio-ASN1-EntrezGene-1.70/

Bio-ASN1-EntrezGene-1.70/Changes

Revision history for Bio-ASN1-Entrezgene

1.70      2013-09-14 14:39:54 America/Chicago
  * Bio::ASN1::EntrezGene is now able to parse EntrezGene-set in which case
    next_seq() will return the next set of sequences with each sequence as an
    element in the array ref instead of an array ref with a single element.

version 1.10:
    Important update if you see segmentation fault when running the parser - so far I only saw it happen on Perl 5.8 (Perl 5.10 is fine) due to an exceedingly long (and invalid) URL in one Arabidopsis entry. It's due to Perl regex engine core dumps when matching the long string exhausted the stack. I changed the particular regex in EntrezGene.pm and Sequence.pm to solve the issue. The overall parsing runs 2-3% faster after the change.

version 1.09:
    Added parser/indexer for NCBI's ASN.1-formatted sequence files (like Genbank records).
    Updated test, example scripts and documentation
    Minor fix on parse_entrez_gene_example.pl
    Added code to deal with CCDS xref and Hugo symbol (under gene properties! unlike before) in parse_entrez_gene_example.pl
    Updated parser & indexer file handle code to work with perl version 5.005_03 (previous code since 1.07 only works with 5.6 or higher).
    Commented out count_records call in testindex.t to allow successful test on 5.005_03-compatible bioperl versions.

version 1.08:
    Split test script into two for better testing
    Minor change in documentation and test scripts
    NO change in parser/indexer code!
version 1.07:
    Added indexing capability through a new module
    Added testing script for make test
    Added example script for indexing, reorganized examples scripts
    Fixed a bug in next_seq
    Reset line number after input_file() or fh() calls
    Added rawdata(), fh() functions and -file, -fh, fh to new()
    Updated documentation to reflect all changes

version 1.06:
    integrated code from Util.pm into EntrezGene.pm.
    changed packaging to Perl standard
    changed next_seq() default option to 2, so now the call $parser->next_seq() is equivalent to the call $parser->next_seq(2) in version 1.05
    updated documentation to reflect all changes

version 1.05:
    added support to parse the NCBI 4/5/2005 download, which inexplicably added a useless space before ',' on all lines, broke some lines into two yet condensed others (brackets) to one line. This unfortunately slows down my parser because I have to use lookahead regexes to fix the parser for this weird new format.
    I also fixed a minor mistake in error reporting function

version 1.04:
    added attempt at opening large file (2 GB) on Perl that does not support it; added 'file' option to new(); added file name in error reporting message; updated documentation

version 1.03:
    added validating capability such that anything that does not conform to the current NCBI Entrez Gene ASN.1 format would raise error and stops program. Position of the offending data item would be reported.

version 1.02:
    added input_file function that accepts filename input, and next_seq function that returns the next record

version 1.01:
    unescaped double quote escapes in double quoted strings

version 1.0:
    released

Bio-ASN1-EntrezGene-1.70/LICENSE

This software is copyright (c) 2013 by Mingyi Liu, GPC Biotech AG and Altana Research Institute.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
Terms of the Perl programming language system itself a) the GNU General Public License as published by the Free Software Foundation; either version 1, or (at your option) any later version, or b) the "Artistic License" --- The GNU General Public License, Version 1, February 1989 --- This software is Copyright (c) 2013 by Mingyi Liu, GPC Biotech AG and Altana Research Institute. This is free software, licensed under: The GNU General Public License, Version 1, February 1989 GNU GENERAL PUBLIC LICENSE Version 1, February 1989 Copyright (C) 1989 Free Software Foundation, Inc. 51 Franklin St, Suite 500, Boston, MA 02110-1335 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble The license agreements of most software companies try to keep users at the mercy of those companies. By contrast, our General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. The General Public License applies to the Free Software Foundation's software and to any other program whose authors commit to using it. You can use it for your programs, too. When we speak of free software, we are referring to freedom, not price. Specifically, the General Public License is designed to make sure that you have the freedom to give away or sell copies of free software, that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. 
For example, if you distribute copies of a such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must tell them their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. The precise terms and conditions for copying, distribution and modification follow. GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License Agreement applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any work containing the Program or a portion of it, either verbatim or with modifications. Each licensee is addressed as "you". 1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this General Public License and to the absence of any warranty; and give any other recipients of the Program a copy of this General Public License along with the Program. You may charge a fee for the physical act of transferring a copy. 2. 
You may modify your copy or copies of the Program or any portion of it, and copy and distribute such modifications under the terms of Paragraph 1 above, provided that you also do the following: a) cause the modified files to carry prominent notices stating that you changed the files and the date of any change; and b) cause the whole of any work that you distribute or publish, that in whole or in part contains the Program or any part thereof, either with or without modifications, to be licensed at no charge to all third parties under the terms of this General Public License (except that you may choose to grant warranty protection to some or all third parties, at your option). c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the simplest and most usual way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this General Public License. d) You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. Mere aggregation of another independent work with the Program (or its derivative) on a volume of a storage or distribution medium does not bring the other work under the scope of these terms. 3. 
You may copy and distribute the Program (or a portion or derivative of it, under Paragraph 2) in object code or executable form under the terms of Paragraphs 1 and 2 above provided that you also do one of the following: a) accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Paragraphs 1 and 2 above; or, b) accompany it with a written offer, valid for at least three years, to give any third party free (except for a nominal charge for the cost of distribution) a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Paragraphs 1 and 2 above; or, c) accompany it with the information you received as to where the corresponding source code may be obtained. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form alone.) Source code for a work means the preferred form of the work for making modifications to it. For an executable file, complete source code means all the source code for all modules it contains; but, as a special exception, it need not include source code for modules which are standard libraries that accompany the operating system on which the executable file runs, or for standard header files or definitions files that accompany that operating system. 4. You may not copy, modify, sublicense, distribute or transfer the Program except as expressly provided under this General Public License. Any attempt otherwise to copy, modify, sublicense, distribute or transfer the Program is void, and will automatically terminate your rights to use the Program under this License. However, parties who have received copies, or rights to use copies, from you under this General Public License will not have their licenses terminated so long as such parties remain in full compliance. 5. 
By copying, distributing or modifying the Program (or any work based on the Program) you indicate your acceptance of this license to do so, and all its terms and conditions. 6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. 7. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of the license which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the license, you may choose any version ever published by the Free Software Foundation. 8. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 9. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 
EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 10. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. END OF TERMS AND CONDITIONS Appendix: How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to humanity, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. Copyright (C) 19yy This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 1, or (at your option) any later version. 
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston MA 02110-1301 USA Also add information on how to contact you by electronic and paper mail. If the program is interactive, make it output a short notice like this when it starts in an interactive mode: Gnomovision version 69, Copyright (C) 19xx name of author Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program. You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the program, if necessary. Here a sample; alter the names: Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision' (a program to direct compilers to make passes at assemblers) written by James Hacker. , 1 April 1989 Ty Coon, President of Vice That's all there is to it! --- The Artistic License 1.0 --- This software is Copyright (c) 2013 by Mingyi Liu, GPC Biotech AG and Altana Research Institute. 
This is free software, licensed under: The Artistic License 1.0 The Artistic License Preamble The intent of this document is to state the conditions under which a Package may be copied, such that the Copyright Holder maintains some semblance of artistic control over the development of the package, while giving the users of the package the right to use and distribute the Package in a more-or-less customary fashion, plus the right to make reasonable modifications. Definitions: - "Package" refers to the collection of files distributed by the Copyright Holder, and derivatives of that collection of files created through textual modification. - "Standard Version" refers to such a Package if it has not been modified, or has been modified in accordance with the wishes of the Copyright Holder. - "Copyright Holder" is whoever is named in the copyright or copyrights for the package. - "You" is you, if you're thinking about copying or distributing this Package. - "Reasonable copying fee" is whatever you can justify on the basis of media cost, duplication charges, time of people involved, and so on. (You will not be required to justify it to the Copyright Holder, but only to the computing community at large as a market that must bear the fee.) - "Freely Available" means that no fee is charged for the item itself, though there may be fees involved in handling the item. It also means that recipients of the item may redistribute it under the same conditions they received it. 1. You may make and give away verbatim copies of the source form of the Standard Version of this Package without restriction, provided that you duplicate all of the original copyright notices and associated disclaimers. 2. You may apply bug fixes, portability fixes and other modifications derived from the Public Domain or from the Copyright Holder. A Package modified in such a way shall still be considered the Standard Version. 3. 
You may otherwise modify your copy of this Package in any way, provided that you insert a prominent notice in each changed file stating how and when you changed that file, and provided that you do at least ONE of the following: a) place your modifications in the Public Domain or otherwise make them Freely Available, such as by posting said modifications to Usenet or an equivalent medium, or placing the modifications on a major archive site such as ftp.uu.net, or by allowing the Copyright Holder to include your modifications in the Standard Version of the Package. b) use the modified Package only within your corporation or organization. c) rename any non-standard executables so the names do not conflict with standard executables, which must also be provided, and provide a separate manual page for each non-standard executable that clearly documents how it differs from the Standard Version. d) make other distribution arrangements with the Copyright Holder. 4. You may distribute the programs of this Package in object code or executable form, provided that you do at least ONE of the following: a) distribute a Standard Version of the executables and library files, together with instructions (in the manual page or equivalent) on where to get the Standard Version. b) accompany the distribution with the machine-readable source of the Package with your modifications. c) accompany any non-standard executables with their corresponding Standard Version executables, giving the non-standard executables non-standard names, and clearly documenting the differences in manual pages (or equivalent), together with instructions on where to get the Standard Version. d) make other distribution arrangements with the Copyright Holder. 5. You may charge a reasonable copying fee for any distribution of this Package. You may charge any fee you choose for support of this Package. You may not charge a fee for this Package itself. 
However, you may distribute this Package in aggregate with other (possibly commercial) programs as part of a larger (possibly commercial) software distribution provided that you do not advertise this Package as a product of your own.

6. The scripts and library files supplied as input to or produced as output from the programs of this Package do not automatically fall under the copyright of this Package, but belong to whomever generated them, and may be sold commercially, and may be aggregated with this Package.

7. C or perl subroutines supplied by you and linked into this Package shall not be considered part of this Package.

8. The name of the Copyright Holder may not be used to endorse or promote products derived from this software without specific prior written permission.

9. THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.

The End

Bio-ASN1-EntrezGene-1.70/dist.ini

name = Bio-ASN1-EntrezGene
version = 1.70
author = Mingyi Liu
license = Perl_5
copyright_holder = Mingyi Liu, GPC Biotech AG and Altana Research Institute
copyright_year = 2013

[@BioPerl]

Bio-ASN1-EntrezGene-1.70/META.yml

--- abstract: 'Regular expression-based Perl Parser for NCBI Entrez Gene.'
author: - 'Mingyi Liu ' build_requires: File::Spec: 0 IO::Handle: 0 IPC::Open3: 0 Test::More: 0 configure_requires: ExtUtils::MakeMaker: 6.30 dynamic_config: 0 generated_by: 'Dist::Zilla version 4.300037, CPAN::Meta::Converter version 2.120921' license: perl meta-spec: url: http://module-build.sourceforge.net/META-spec-v1.4.html version: 1.4 name: Bio-ASN1-EntrezGene requires: Bio::Index::AbstractSeq: 0 Carp: 0 parent: 0 strict: 0 utf8: 0 warnings: 0 resources: bugtracker: https://redmine.open-bio.org/projects/bioperl/ homepage: http://search.cpan.org/dist/Bio-ASN1-EntrezGene repository: git://github.com/bioperl/bio-asn1-entrezgene.git version: 1.70 x_Dist_Zilla: perl: version: 5.018001 plugins: - class: Dist::Zilla::Plugin::GatherDir name: '@BioPerl/@Filter/GatherDir' version: 4.300037 - class: Dist::Zilla::Plugin::PruneCruft name: '@BioPerl/@Filter/PruneCruft' version: 4.300037 - class: Dist::Zilla::Plugin::ManifestSkip name: '@BioPerl/@Filter/ManifestSkip' version: 4.300037 - class: Dist::Zilla::Plugin::MetaYAML name: '@BioPerl/@Filter/MetaYAML' version: 4.300037 - class: Dist::Zilla::Plugin::License name: '@BioPerl/@Filter/License' version: 4.300037 - class: Dist::Zilla::Plugin::ExtraTests name: '@BioPerl/@Filter/ExtraTests' version: 4.300037 - class: Dist::Zilla::Plugin::ExecDir name: '@BioPerl/@Filter/ExecDir' version: 4.300037 - class: Dist::Zilla::Plugin::ShareDir name: '@BioPerl/@Filter/ShareDir' version: 4.300037 - class: Dist::Zilla::Plugin::MakeMaker name: '@BioPerl/@Filter/MakeMaker' version: 4.300037 - class: Dist::Zilla::Plugin::Manifest name: '@BioPerl/@Filter/Manifest' version: 4.300037 - class: Dist::Zilla::Plugin::TestRelease name: '@BioPerl/@Filter/TestRelease' version: 4.300037 - class: Dist::Zilla::Plugin::ConfirmRelease name: '@BioPerl/@Filter/ConfirmRelease' version: 4.300037 - class: Dist::Zilla::Plugin::UploadToCPAN name: '@BioPerl/@Filter/UploadToCPAN' version: 4.300037 - class: Dist::Zilla::Plugin::MetaConfig name: '@BioPerl/MetaConfig' 
version: 4.300037 - class: Dist::Zilla::Plugin::MetaJSON name: '@BioPerl/MetaJSON' version: 4.300037 - class: Dist::Zilla::Plugin::PkgVersion name: '@BioPerl/PkgVersion' version: 4.300037 - class: Dist::Zilla::Plugin::PodSyntaxTests name: '@BioPerl/PodSyntaxTests' version: 4.300037 - class: Dist::Zilla::Plugin::NoTabsTests name: '@BioPerl/NoTabsTests' version: 0.01 - class: Dist::Zilla::Plugin::NextRelease name: '@BioPerl/NextRelease' version: 4.300037 - class: Dist::Zilla::Plugin::Test::Compile config: Dist::Zilla::Plugin::Test::Compile: module_finder: - ':InstallModules' script_finder: - ':ExecFiles' name: '@BioPerl/Test::Compile' version: 2.027 - class: Dist::Zilla::Plugin::PodCoverageTests name: '@BioPerl/PodCoverageTests' version: 4.300037 - class: Dist::Zilla::Plugin::MojibakeTests name: '@BioPerl/MojibakeTests' version: 0.5 - class: Dist::Zilla::Plugin::AutoPrereqs name: '@BioPerl/AutoPrereqs' version: 4.300037 - class: Dist::Zilla::Plugin::AutoMetaResources name: '@BioPerl/AutoMetaResources' version: 1.20 - class: Dist::Zilla::Plugin::MetaResources name: '@BioPerl/MetaResources' version: 4.300037 - class: Dist::Zilla::Plugin::Authority name: '@BioPerl/Authority' version: 1.006 - class: Dist::Zilla::Plugin::EOLTests name: '@BioPerl/EOLTests' version: 0.02 - class: Dist::Zilla::Plugin::PodWeaver name: '@BioPerl/PodWeaver' version: 3.101642 - class: Dist::Zilla::Plugin::Git::Check name: '@BioPerl/Git::Check' version: 2.014 - class: Dist::Zilla::Plugin::Git::Commit name: '@BioPerl/Git::Commit' version: 2.014 - class: Dist::Zilla::Plugin::Git::Tag name: '@BioPerl/Git::Tag' version: 2.014 - class: Dist::Zilla::Plugin::FinderCode name: ':InstallModules' version: 4.300037 - class: Dist::Zilla::Plugin::FinderCode name: ':IncModules' version: 4.300037 - class: Dist::Zilla::Plugin::FinderCode name: ':TestFiles' version: 4.300037 - class: Dist::Zilla::Plugin::FinderCode name: ':ExecFiles' version: 4.300037 - class: Dist::Zilla::Plugin::FinderCode name: ':ShareFiles' 
version: 4.300037 - class: Dist::Zilla::Plugin::FinderCode name: ':MainModule' version: 4.300037 zilla: class: Dist::Zilla::Dist::Builder config: is_trial: 0 version: 4.300037 x_authority: cpan:BIOPERLML

Bio-ASN1-EntrezGene-1.70/MANIFEST

Changes
LICENSE
MANIFEST
META.json
META.yml
Makefile.PL
README.md
dist.ini
examples/indexer_test.pl
examples/parse_entrez_gene_example.pl
examples/parse_sequence_example.pl
examples/regex_parser_test.pl
lib/Bio/ASN1/EntrezGene.pm
lib/Bio/ASN1/EntrezGene/Indexer.pm
lib/Bio/ASN1/Sequence.pm
lib/Bio/ASN1/Sequence/Indexer.pm
t/00-compile.t
t/input.asn
t/input1.asn
t/release-eol.t
t/release-mojibake.t
t/release-no-tabs.t
t/release-pod-coverage.t
t/release-pod-syntax.t
t/seq.asn
t/testindexer.t
t/testparser.t

Bio-ASN1-EntrezGene-1.70/README.md

Bio-ASN1-Entrezgene
===================

This distribution includes:

1. XML parser-like parser for the ASN.1-formatted NCBI Entrez Gene files.
2. Indexer for Entrez Gene files.
3. XML parser-like parser for the ASN.1-formatted NCBI Sequence files.
4. Indexer for Sequence files.

These modules have quite high performance and error reporting capabilities. Additionally, one could dump the data structure generated from extracted NCBI object records into XML extremely easily using XML::Simple's XMLout().

Written by Dr. Mingyi Liu.

Copyright (c) 2005 Mingyi Liu, GPC Biotech, Altana Research Institute.

This program is free software - you can redistribute it and/or modify it under the same terms as Perl itself.

INSTALLATION
------------

Bio::ASN1::EntrezGene package can be installed & tested as follows:

    perl Makefile.PL
    make
    make test
    make install

DOCUMENTATION
-------------

For documentation, among many other things, please refer to the POD (plain old documentation) inside the module.
It is highly recommended that you check the example scripts out (under the examples directory)!

- - -

This distribution is part of the [BioPerl](http://www.bioperl.org/) project.

Bio-ASN1-EntrezGene-1.70/t/

Bio-ASN1-EntrezGene-1.70/t/seq.asn

Seq-entry ::= set { class nuc-prot , descr { title "Leishmania major polyadenylate-binding protein 1 (PAB1) gene, complete cds." , source { org { taxname "Leishmania major" , db { { db "taxon" , tag id 5664 } } , orgname { name binomial { genus "Leishmania" , species "major" } , mod { { subtype strain , subname "Friedlin" } } , lineage "Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Leishmania" , gcode 1 , mgcode 4 , div "INV" } } , subtype { { subtype chromosome , name "35" } , { subtype other , name "MHOM/IL/81" } } } , create-date std { year 1998 , month 10 , day 17 } , pub { pub { sub { authors { names std { { name name { last "Bates" , first "E" , initials "E.J." } } , { name name { last "Smith" , first "D" , initials "D.F." } } } , affil std { affil "Imperial College of Science, Technology and Medicine" , div "Biochemistry" , city "London" , country "UK" , street "Exhibition Road" , postal-code "SW7 2AZ" } } , medium email , date std { year 1998 , month 9 , day 18 } } } } , pub { pub { gen { cit "Unpublished" , authors { names std { { name name { last "Bates" , first "E" , initials "E.J." } } , { name name { last "Smith" , first "D" , initials "D.F." } } } } , title "Cloning and characterisation of a Leishmania major poly A-binding protein" } } } , pub { pub { sub { authors { names std { { name name { last "Bates" , first "E" , initials "E.J." } } , { name name { last "Smith" , first "D" , initials "D.F."
} } } , affil std { affil "Imperial College of Science, Technology and Medicine" , div "Biochemistry" , city "London" , country "UK" , street "Exhibition Road" , postal-code "SW7 2AZ" } } , medium email , date std { year 1999 , month 10 , day 12 } , descr "Sequence update by submitter" } } } , update-date std { year 2000 , month 2 , day 29 } , pub { pub { article { title { name "Poly(A)-binding protein I of Leishmania: functional analysis and localisation in trypanosomatid parasites." } , authors { names std { { name name { last "Bates" , initials "E.J." } } , { name name { last "Knuepfer" , initials "E." } } , { name name { last "Smith" , initials "D.F." } } } , affil str "Wellcome Laboratories for Molecular Parasitology, Department of Biochemistry, Imperial College of Science, Technology and Medicine, London SW7 2AZ, UK." } , from journal { title { iso-jta "Nucleic Acids Res." , ml-jta "Nucleic Acids Res" , issn "1362-4962" , name "Nucleic acids research." } , imp { date std { year 2000 , month 3 , day 1 } , volume "28" , issue "5" , pages "1211-1220" , language "eng" } } , ids { pubmed 10666465 , medline 20133040 } } , muid 20133040 , pmid 10666465 } } } , seq-set { seq { id { genbank { name "AF093062" , accession "AF093062" , version 2 } , gi 6019463 } , descr { molinfo { biomol genomic } } , inst { repr raw , mol dna , length 2795 , seq-data ncbi4na '41422482142424121421884422282221248242824282188828888188828 418882124248211212422188288828182124111141111211211211211212822221222128212242 212121142121212124212142888242242821212121124844111411112411142142428482F42882 484882888288882248888218882188282888188821282111844284284284822144114284284222 244844242182142242141844121142241824141824288221828124884424122884124221221821 124142242142844844142828821142248824421221822841124842484888422424121821821242 142428242844428184428124821128824121122124121424244141144241824148221841128821 142484824484121118424812422881848442142142424122224222822428121424421124421124 
888824841141122844141144184844121421141422842124121828821221148824428241882828 228421144811842144124144144421141422424428124428824882128821114124141248224221 144124281884841141841124424284224182124228214144121141144242828124844221128821 882422421184212422884224242824844221128821241124848181821142144842842241224841 121144124821824141148828824211148884424421821248224284214228421144121141424422 424848824228828421128824111142124124124224821144244884114241842124122182121824 124421821244212284484141142848124812142424222142242421424142422821884222882412 141148121842142122114222884481121122848124842421128824122284148821244424244122 842824142828821144148184424144821141428422484841844841424141484444841422424428 824488824841488888241124224124144221124224282842424141841184482421842841124421 142222821824841121884222112422424182142428121211824842422842148822142142422842 141841841842482141842122142241842248884824421422144422482241842484422424484482 422142142844484482424242144422182241842241842218212212142142242144484224242142 212144488824221242248244214824488824842114241242241142128222284424184822224141 242242242842242241821212212144112824111421848242222144142142424224228844484122 482828822811144848124141882242214124844242241141821244481848822824141841112221 144144248124148842821124122141142412824144142424841214144242818424842821144242 182141244248114241411422142144242248848482114228414214144111142884242828242184 224242282122821421488282141281844122228828288288882242218288182228282124148284 882428182144224841414141444428144844424428884848482848281848184248414448842488 484282484884884184284182144141444142441144228212441841281441884218888888884884 282121211882841148428481442212288224282288844888282821841848844844222222812212 128222122122122212418221488224888124418411441842484188114821418114144242481148 218442428484288228282228282218288882288882488824842221821888882288282424482184 828482822142222284888824821144842228822288221214228422888848144848421844428484 
844841414214214841842121411411111144144114211188248481144484184414442414414142 488442444812418112414184282281111284244284184142124844141142241144112121141114 211111111144114821224111121114141111188881211112411411448848411144144811142442 1414810'H , hist { replaces { date std { year 1999 , month 10 , day 12 } , ids { gi 3764100 } } } } , annot { { data ftable { { data gene { locus "PAB1" } , location int { from 262 , to 1944 , strand plus , id gi 6019463 } } } } } } , seq { id { genbank { accession "AAC64372" , version 2 } , gi 6019464 } , descr { molinfo { biomol peptide , tech concept-trans } , title "polyadenylate-binding protein 1 [Leishmania major]" } , inst { repr raw , mol aa , length 560 , topology not-set , seq-data ncbieaa "MAAAVQEAAAPVAHQPQMDKPIEIASIYVGDLDATINEPQLVELFKPFGTILNVRVCRD IITQRSLGYGYVNFDNHDSAEKAIESMNFKRVGDKCVRLMWQQRDPALRYSGNGNVFVKNLEKDVDSKSLHDIFTKFG SILSCKVMQDEEGKSRGYGFVHFKDETSAKDAIVKMNGAADHASEDKKALYVANFIRRNARLAALVANFTNVYIKQVL PTVNKDVIEKFFAKFGGITSAAACKDKSGRVFAFCNFEKHDDAVKAVEAMHDHHIDGITAPGEKLYVQRAQPRSERLI ALRQKYMQHQALGNNLYVRNFDPEFTGADLLELFKEYGEVKSCRVMVSESGVSRGFGFVSFSNADEANAALREMNGRM LNGKPLIVNIAQRRDQRYTIVRLQFQQRLQMMMRQMHQPMPFVGSQGRPMRGRGGRQQLGGRAQGHPMPMPSPQQPQG AAQPQGFATPSAVGFVQATPKHSPGDVPETPPLPPITPQELESMSPQEQRAALGDRLFLKVYEIPPDVAPKITGMFLE MKPKEAYELLNDQKRLEERVTEALCVLKAHQTA" , hist { replaces { date std { year 1999 , month 10 , day 12 } , ids { gi 3764101 } } } } , annot { { data ftable { { data prot { name { "polyadenylate-binding protein 1" } , desc "polyA-binding protein" } , location int { from 0 , to 559 , strand plus , id gi 6019464 } } } } } } } , annot { { data ftable { { data cdregion { frame one , code { id 1 } } , product whole gi 6019464 , location int { from 262 , to 1944 , strand plus , id gi 6019463 } } } } } } META.json100644000765000024 2007112215135616 16511 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70{ "abstract" : "Regular expression-based Perl Parser for NCBI Entrez Gene.", "author" : [ "Mingyi Liu " ], 
"dynamic_config" : 0, "generated_by" : "Dist::Zilla version 4.300037, CPAN::Meta::Converter version 2.120921", "license" : [ "perl_5" ], "meta-spec" : { "url" : "http://search.cpan.org/perldoc?CPAN::Meta::Spec", "version" : "2" }, "name" : "Bio-ASN1-EntrezGene", "prereqs" : { "configure" : { "requires" : { "ExtUtils::MakeMaker" : "6.30" } }, "develop" : { "requires" : { "Pod::Coverage::TrustPod" : "0", "Test::Pod" : "1.41", "Test::Pod::Coverage" : "1.08" } }, "runtime" : { "requires" : { "Bio::Index::AbstractSeq" : "0", "Carp" : "0", "parent" : "0", "strict" : "0", "utf8" : "0", "warnings" : "0" } }, "test" : { "requires" : { "File::Spec" : "0", "IO::Handle" : "0", "IPC::Open3" : "0", "Test::More" : "0" } } }, "release_status" : "stable", "resources" : { "bugtracker" : { "mailto" : "bioperl-l@bioperl.org", "web" : "https://redmine.open-bio.org/projects/bioperl/" }, "homepage" : "http://search.cpan.org/dist/Bio-ASN1-EntrezGene", "repository" : { "type" : "git", "url" : "git://github.com/bioperl/bio-asn1-entrezgene.git", "web" : "https://github.com/bioperl/bio-asn1-entrezgene" } }, "version" : "1.70", "x_Dist_Zilla" : { "perl" : { "version" : "5.018001" }, "plugins" : [ { "class" : "Dist::Zilla::Plugin::GatherDir", "name" : "@BioPerl/@Filter/GatherDir", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::PruneCruft", "name" : "@BioPerl/@Filter/PruneCruft", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::ManifestSkip", "name" : "@BioPerl/@Filter/ManifestSkip", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::MetaYAML", "name" : "@BioPerl/@Filter/MetaYAML", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::License", "name" : "@BioPerl/@Filter/License", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::ExtraTests", "name" : "@BioPerl/@Filter/ExtraTests", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::ExecDir", "name" : "@BioPerl/@Filter/ExecDir", "version" : "4.300037" }, { "class" : 
"Dist::Zilla::Plugin::ShareDir", "name" : "@BioPerl/@Filter/ShareDir", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::MakeMaker", "name" : "@BioPerl/@Filter/MakeMaker", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::Manifest", "name" : "@BioPerl/@Filter/Manifest", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::TestRelease", "name" : "@BioPerl/@Filter/TestRelease", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::ConfirmRelease", "name" : "@BioPerl/@Filter/ConfirmRelease", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::UploadToCPAN", "name" : "@BioPerl/@Filter/UploadToCPAN", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::MetaConfig", "name" : "@BioPerl/MetaConfig", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::MetaJSON", "name" : "@BioPerl/MetaJSON", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::PkgVersion", "name" : "@BioPerl/PkgVersion", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::PodSyntaxTests", "name" : "@BioPerl/PodSyntaxTests", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::NoTabsTests", "name" : "@BioPerl/NoTabsTests", "version" : "0.01" }, { "class" : "Dist::Zilla::Plugin::NextRelease", "name" : "@BioPerl/NextRelease", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::Test::Compile", "config" : { "Dist::Zilla::Plugin::Test::Compile" : { "module_finder" : [ ":InstallModules" ], "script_finder" : [ ":ExecFiles" ] } }, "name" : "@BioPerl/Test::Compile", "version" : "2.027" }, { "class" : "Dist::Zilla::Plugin::PodCoverageTests", "name" : "@BioPerl/PodCoverageTests", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::MojibakeTests", "name" : "@BioPerl/MojibakeTests", "version" : "0.5" }, { "class" : "Dist::Zilla::Plugin::AutoPrereqs", "name" : "@BioPerl/AutoPrereqs", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::AutoMetaResources", "name" : "@BioPerl/AutoMetaResources", "version" : "1.20" }, { 
"class" : "Dist::Zilla::Plugin::MetaResources", "name" : "@BioPerl/MetaResources", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::Authority", "name" : "@BioPerl/Authority", "version" : "1.006" }, { "class" : "Dist::Zilla::Plugin::EOLTests", "name" : "@BioPerl/EOLTests", "version" : "0.02" }, { "class" : "Dist::Zilla::Plugin::PodWeaver", "name" : "@BioPerl/PodWeaver", "version" : "3.101642" }, { "class" : "Dist::Zilla::Plugin::Git::Check", "name" : "@BioPerl/Git::Check", "version" : "2.014" }, { "class" : "Dist::Zilla::Plugin::Git::Commit", "name" : "@BioPerl/Git::Commit", "version" : "2.014" }, { "class" : "Dist::Zilla::Plugin::Git::Tag", "name" : "@BioPerl/Git::Tag", "version" : "2.014" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":InstallModules", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":IncModules", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":TestFiles", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":ExecFiles", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":ShareFiles", "version" : "4.300037" }, { "class" : "Dist::Zilla::Plugin::FinderCode", "name" : ":MainModule", "version" : "4.300037" } ], "zilla" : { "class" : "Dist::Zilla::Dist::Builder", "config" : { "is_trial" : "0" }, "version" : "4.300037" } }, "x_authority" : "cpan:BIOPERLML" } input.asn100755000765000024 32333312215135616 17227 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/tEntrezgene ::= { track-info { geneid 1 , status live , create-date std { year 2003 , month 8 , day 28 , hour 20 , minute 30 , second 0 } , update-date std { year 2005 , month 3 , day 19 , hour 10 , minute 30 , second 0 } } , type protein-coding , source { genome genomic , origin natural , org { taxname "Homo sapiens" , common "human" , db { { db "taxon" , tag id 9606 } } , syn { "man" } , orgname { name binomial { genus "Homo" , species 
"sapiens" } , lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae; Homo" , gcode 1 , mgcode 2 , div "PRI" } } , subtype { { subtype chromosome , name "19" } } } , gene { locus "A1BG" , desc "alpha-1-B glycoprotein" , maploc "19q13.4" , db { { db "MIM" , tag id 138670 } } , syn { "A1B" , "ABG" , "GAB" , "HYST2477" , "DKFZp686F0970" } , locus-tag "HGNC:5" } , prot { name { "alpha 1B-glycoprotein" , "alpha-1B-glycoprotein" } } , summary "The protein encoded by this gene is a plasma glycoprotein of unknown function. The protein shows sequence similarity to the variable regions of some immunoglobulin supergene family member proteins." , location { { display-str "19q13.4" , method map-type cyto } } , gene-source { src "LocusLink" , src-int 1 , src-str2 "1" , gene-display FALSE , locus-display FALSE , extra-terms FALSE } , locus { { type genomic , heading "Reference" , accession "NC_000019" , version 8 , seqs { int { from 63548355 , to 63556668 , strand minus , id gi 42406306 } } , products { { type mRNA , heading "Reference" , accession "NM_130786" , version 2 , genomic-coords { mix { int { from 63548355 , to 63550206 , strand minus , id gi 42406306 } , int { from 63550530 , to 63550817 , strand minus , id gi 42406306 } , int { from 63553547 , to 63553828 , strand minus , id gi 42406306 } , int { from 63554568 , to 63554864 , strand minus , id gi 42406306 } , int { from 63555460 , to 63555732 , strand minus , id gi 42406306 } , int { from 63556105 , to 63556374 , strand minus , id gi 42406306 } , int { from 63556469 , to 63556504 , strand minus , id gi 42406306 } , int { from 63556581 , to 63556668 , strand minus , id gi 42406306 } } } , seqs { whole gi 21071029 } , products { { type peptide , heading "Reference" , accession "NP_570602" , version 2 , genomic-coords { packed-int { { from 63550199 , to 63550206 , strand minus , id gi 42406306 } , { from 63550530 , to 63550817 , 
strand minus , id gi 42406306 } , { from 63553547 , to 63553828 , strand minus , id gi 42406306 } , { from 63554568 , to 63554864 , strand minus , id gi 42406306 } , { from 63555460 , to 63555732 , strand minus , id gi 42406306 } , { from 63556105 , to 63556374 , strand minus , id gi 42406306 } , { from 63556469 , to 63556504 , strand minus , id gi 42406306 } , { from 63556581 , to 63556614 , strand minus , id gi 42406306 } } } , seqs { whole gi 21071030 } } } } } } , { type genomic , heading "Reference" , accession "NT_011109" , version 15 , seqs { int { from 31124733 , to 31133046 , strand minus , id gi 29800594 } } , products { { type mRNA , heading "Reference" , accession "NM_130786" , version 2 , genomic-coords { mix { int { from 31124733 , to 31126584 , strand minus , id gi 29800594 } , int { from 31126908 , to 31127195 , strand minus , id gi 29800594 } , int { from 31129925 , to 31130206 , strand minus , id gi 29800594 } , int { from 31130946 , to 31131242 , strand minus , id gi 29800594 } , int { from 31131838 , to 31132110 , strand minus , id gi 29800594 } , int { from 31132483 , to 31132752 , strand minus , id gi 29800594 } , int { from 31132847 , to 31132882 , strand minus , id gi 29800594 } , int { from 31132959 , to 31133046 , strand minus , id gi 29800594 } } } , seqs { whole gi 21071029 } , products { { type peptide , heading "Reference" , accession "NP_570602" , version 2 , genomic-coords { packed-int { { from 31126577 , to 31126584 , strand minus , id gi 29800594 } , { from 31126908 , to 31127195 , strand minus , id gi 29800594 } , { from 31129925 , to 31130206 , strand minus , id gi 29800594 } , { from 31130946 , to 31131242 , strand minus , id gi 29800594 } , { from 31131838 , to 31132110 , strand minus , id gi 29800594 } , { from 31132483 , to 31132752 , strand minus , id gi 29800594 } , { from 31132847 , to 31132882 , strand minus , id gi 29800594 } , { from 31132959 , to 31132992 , strand minus , id gi 29800594 } } } , seqs { whole gi 21071030 
} } } } } } , { type genomic , heading "Reference" , accession "NT_086907" , version 1 , seqs { int { from 8163589 , to 8172398 , strand minus , id gi 51475048 } } , products { { type mRNA , heading "Reference" , accession "NM_130786" , version 2 , genomic-coords { mix { int { from 8163589 , to 8165440 , strand minus , id gi 51475048 } , int { from 8165763 , to 8166050 , strand minus , id gi 51475048 } , int { from 8169274 , to 8169555 , strand minus , id gi 51475048 } , int { from 8170297 , to 8170593 , strand minus , id gi 51475048 } , int { from 8171190 , to 8171462 , strand minus , id gi 51475048 } , int { from 8171835 , to 8172104 , strand minus , id gi 51475048 } , int { from 8172199 , to 8172234 , strand minus , id gi 51475048 } , int { from 8172311 , to 8172398 , strand minus , id gi 51475048 } } } , seqs { whole gi 21071029 } , products { { type peptide , heading "Reference" , accession "NP_570602" , version 2 , genomic-coords { packed-int { { from 8165433 , to 8165440 , strand minus , id gi 51475048 } , { from 8165763 , to 8166050 , strand minus , id gi 51475048 } , { from 8169274 , to 8169555 , strand minus , id gi 51475048 } , { from 8170297 , to 8170593 , strand minus , id gi 51475048 } , { from 8171190 , to 8171462 , strand minus , id gi 51475048 } , { from 8171835 , to 8172104 , strand minus , id gi 51475048 } , { from 8172199 , to 8172234 , strand minus , id gi 51475048 } , { from 8172311 , to 8172344 , strand minus , id gi 51475048 } } } , seqs { whole gi 21071030 } } } } } } } , properties { { type comment , label "Nomenclature" , version 0 , source { { anchor "HUGO Gene Nomenclature Committee" } } , properties { { type property , label "Official Symbol" , text "A1BG" , version 0 } , { type property , label "Official Full Name" , text "alpha-1-B glycoprotein" , version 0 } } } , { type comment , heading "GeneOntology" , version 0 , source { { pre-text "Provided by" , anchor "GOA" , url "http://www.ebi.ac.uk/GOA/" } } , comment { { type comment , 
label "Function" , version 0 , comment { { type comment , version 0 , refs { pmid 3458201 } , source { { src { db "GO" , tag id 5554 } , anchor "molecular_function unknown" , post-text "evidence: ND" } } } } } , { type comment , label "Process" , version 0 , comment { { type comment , version 0 , source { { src { db "GO" , tag id 4 } , anchor "biological_process unknown" , post-text "evidence: ND" } } } } } , { type comment , label "Component" , version 0 , comment { { type comment , version 0 , refs { pmid 3458201 } , source { { src { db "GO" , tag id 5576 } , anchor "extracellular region" , post-text "evidence: IDA" } } } } } } } } , homology { { type comment , heading "Mouse, Rat" , version 0 , source { { src { db "HomoloGene" , tag id 11167 } , anchor "Map Viewer" , url "http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9606&chr=19& MAPS=genes-r-org/rat-chr/human%3A19,genes-r-org/mouse-chr/human%3A19,genes-r-o rg/human-chr19&query=e%3A1[id]+AND+gene[obj_type]&QSTR=a1bg&cmd=focus&fill=10" } } } } , comments { { type comment , heading "LocusTagLink" , version 0 , source { { src { db "HGNC" , tag id 5 } } } } , { type comment , heading "RefSeq Status" , label "REVIEWED" , version 0 } , { type comment , version 0 , refs { pmid 15461460 , pmid 15221005 , pmid 14702039 , pmid 12477932 , pmid 8889549 , pmid 3458201 , pmid 2591067 } } , { type comment , heading "Markers (Sequence Tagged Sites/STS)" , version 0 , comment { { type comment , version 0 , source { { src { db "UniSTS" , tag id 89991 } , anchor "SHGC-67307" , post-text "(e-PCR)" } } , comment { { type other , label "Alternate name" , text "RH80032" , version 0 } , { type other , label "Alternate name" , text "RH86145" , version 0 } } } , { type comment , version 0 , source { { src { db "UniSTS" , tag id 152074 } , anchor "D11S2921" , post-text "(e-PCR)" } } , comment { { type other , label "Alternate name" , text "GDB:461809" , version 0 } } } , { type comment , version 0 , source { { src { db "UniSTS" , tag 
id 155756 } , anchor "D10S16" , post-text "(e-PCR)" } } , comment { { type other , label "Alternate name" , text "D10S23" , version 0 } , { type other , label "Alternate name" , text "GDB:193809" , version 0 } } } } } , { type comment , heading "NCBI Reference Sequences (RefSeq)" , version 0 , products { { type mRNA , heading "mRNA Sequence" , accession "NM_130786" , version 2 , source { { src { db "Nucleotide" , tag id 21071029 } , anchor "NM_130786" } } , seqs { whole gi 21071029 } , products { { type peptide , heading "Product" , accession "NP_570602" , version 2 , source { { src { db "Protein" , tag id 21071030 } , anchor "NP_570602" , post-text "alpha 1B-glycoprotein" } } , seqs { whole gi 21071030 } , comment { { type other , heading "Consensus CDS (CCDS)" , version 0 , source { { src { db "CCDS" , tag str "CCDS12976.1" } , anchor "CCDS12976.1" } } } , { type other , heading "Conserved Domains" , version 0 , source { { src { db "PROT_CDD" , tag id 21071030 } , pre-text "(1)" , anchor "summary" } } , comment { { type other , version 0 , source { { src { db "CDD" , tag id 365 } , anchor "smart00408: IGc2; Immunoglobulin C-2 Type" } } , comment { { type other , text "Location: 223 - 282 Blast Score: 103" , version 0 } } } } } } } } , comment { { type other , heading "Source Sequence" , version 0 , source { { src { db "Nucleotide" , tag str "AF414429,AK055885,AK056201" } , anchor "AF414429,AK055885,AK056201" } } , comment { { type other , version 0 } } } } } } } , { type comment , heading "Related Sequences" , version 0 , products { { type genomic , heading "Genomic" , accession "AC010642" , version 5 , source { { src { db "Nucleotide" , tag id 9929687 } , anchor "AC010642" } } , seqs { int { from 41119 , to 43581 , strand plus , id gi 9929687 } } , products { { type peptide , text "None" , version 0 } } } , { type mRNA , heading "mRNA" , accession "AB073611" , version 1 , source { { src { db "Nucleotide" , tag id 51555784 } , anchor "AB073611" } } , seqs { whole 
gi 51555784 } , products { { type peptide , accession "BAD38648" , version 1 , source { { src { db "Protein" , tag id 51555785 } , anchor "BAD38648" } } , seqs { whole gi 51555785 } } } } , { type mRNA , heading "mRNA" , accession "AF414429" , version 1 , source { { src { db "Nucleotide" , tag id 15778555 } , anchor "AF414429" } } , seqs { whole gi 15778555 } , products { { type peptide , accession "AAL07469" , version 1 , source { { src { db "Protein" , tag id 15778556 } , anchor "AAL07469" } } , seqs { whole gi 15778556 } } } } , { type mRNA , heading "mRNA" , accession "AK055885" , version 1 , source { { src { db "Nucleotide" , tag id 16550723 } , anchor "AK055885" } } , seqs { whole gi 16550723 } , products { { type peptide , text "None" , version 0 } } } , { type mRNA , heading "mRNA" , accession "AK056201" , version 1 , source { { src { db "Nucleotide" , tag id 16551539 } , anchor "AK056201" } } , seqs { whole gi 16551539 } , products { { type peptide , text "None" , version 0 } } } , { type mRNA , heading "mRNA" , accession "BC035719" , version 1 , source { { src { db "Nucleotide" , tag id 23273475 } , anchor "BC035719" } } , seqs { whole gi 23273475 } , products { { type peptide , accession "AAH35719" , version 1 , source { { src { db "Protein" , tag id 23273476 } , anchor "AAH35719" } } , seqs { whole gi 23273476 } } } } , { type mRNA , heading "mRNA" , accession "BX537419" , version 1 , source { { src { db "Nucleotide" , tag id 31873339 } , anchor "BX537419" } } , seqs { whole gi 31873339 } , products { { type peptide , accession "CAD97661" , version 1 , source { { src { db "Protein" , tag id 31873340 } , anchor "CAD97661" } } , seqs { whole gi 31873340 } } } } , { type other , text "None" , version 0 , products { { type peptide , accession "P04217" , version 0 , source { { src { db "Protein" , tag id 46577680 } , anchor "P04217" } } , seqs { whole gi 46577680 } } } } } } , { type comment , heading "Additional Links" , version 0 , comment { { type comment 
, version 0 , source { { src { db "Evidence Viewer" , tag str "1" } , anchor "Evidence Viewer" , url "http://www.ncbi.nlm.nih.gov/sutils/evv.cgi?taxid=9606&conti g=NT_011109.15&gene=A1BG&lid=1&from=31124734&to=31133047" } } } , { type comment , version 0 , source { { src { db "ModelMaker" , tag str "1" } , anchor "ModelMaker" , url "http://www.ncbi.nlm.nih.gov/mapview/modelmaker.cgi?taxid=96 06&contig=NT_011109.15&gene=A1BG&lid=1" } } } , { type comment , text "UniGene" , version 0 , xtra-properties { { tag "UNIGENE" , value "Hs.529161" } } , source { { src { db "UniGene" , tag str "Hs.529161" } , anchor "Hs.529161" , url "http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?ORG=Hs&CID=52 9161" } } } , { type comment , text "MIM" , version 0 , source { { src { db "MIM" , tag str "138670" } , anchor "138670" } } } , { type comment , version 0 , source { { src { db "HomoloGene" , tag str "1" } , anchor "HomoloGene" , url "http://www.ncbi.nlm.nih.gov/HomoloGene/homolquery.cgi?TEXT= 1[loc]&TAXID=9606" } } } , { type comment , version 0 , source { { src { db "AceView" , tag id 1 } , anchor "AceView" , url "http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?c=l ocusid&org=9606&l=1" } } } , { type comment , version 0 , source { { src { db "GDB" , tag str "GDB:119638" } } } } , { type comment , version 0 , source { { src { db "Ensembl" , tag str "" } , url "http://www.ensembl.org/Homo_sapiens/contigview?geneid=AK055 885" } } } , { type comment , version 0 , source { { src { db "UCSC" , tag str "" } , url "http://genome.ucsc.edu/cgi-bin/hgTracks?org=human&position= AK055885" } } } , { type comment , version 0 , source { { src { db "MGC" , tag str "BC035719" } , anchor "MGC" , url "http://mgc.nci.nih.gov/Genes/CloneList?ORG=Hs&LIST=BC035719" } } } } } , { type generif , text "A1BG-cysteine-rich secretory protein 3 complex displays a similar function in protecting the circulation from a potentially harmful effect of free CRISP-3" , version 0 , refs { pmid 15461460 } , 
create-date str "Nov 6 2004 10:01AM" , update-date str "Nov 6 2004 3:27PM" } } , unique-keys { { db "LocusID" , tag id 1 } , { db "MIM" , tag id 138670 } } , xtra-index-terms { "LOC1" } } Entrezgene ::= { track-info { geneid 2 , status live , create-date std { year 2003 , month 8 , day 28 , hour 20 , minute 30 , second 0 } , update-date std { year 2005 , month 4 , day 3 , hour 13 , minute 27 , second 0 } } , type protein-coding , source { genome genomic , origin natural , org { taxname "Homo sapiens" , common "human" , db { { db "taxon" , tag id 9606 } } , syn { "man" } , orgname { name binomial { genus "Homo" , species "sapiens" } , lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae; Homo" , gcode 1 , mgcode 2 , div "PRI" } } , subtype { { subtype chromosome , name "12" } } } , gene { locus "A2M" , desc "alpha-2-macroglobulin" , maploc "12p13.3-p12.3" , db { { db "MIM" , tag id 103950 } } , syn { "FWP007" , "S863-7" , "DKFZp779B086" } , locus-tag "HGNC:7" } , prot { name { "alpha-2-macroglobulin" } } , summary "Alpha-2-macroglobulin is a protease inhibitor and cytokine transporter. It inhibits many proteases, including trypsin, thrombin and collagenase. A2M is implicated in Alzheimer disease (AD) due to its ability to mediate the clearance and degradation of A-beta, the major component of beta-amyloid deposits." 
, location { { display-str "12p13.3-p12.3" , method map-type cyto } } , gene-source { src "LocusLink" , src-int 2 , src-str2 "2" , gene-display FALSE , locus-display FALSE , extra-terms FALSE } , locus { { type genomic , heading "Reference" , accession "NC_000012" , version 9 , seqs { int { from 9111576 , to 9159755 , strand minus , id gi 51511728 } } , products { { type mRNA , heading "Reference" , accession "NM_000014" , version 3 , genomic-coords { mix { int { from 9111576 , to 9111701 , strand minus , id gi 51511728 } , int { from 9112045 , to 9112086 , strand minus , id gi 51511728 } , int { from 9112602 , to 9112704 , strand minus , id gi 51511728 } , int { from 9113607 , to 9113675 , strand minus , id gi 51511728 } , int { from 9114350 , to 9114440 , strand minus , id gi 51511728 } , int { from 9116221 , to 9116348 , strand minus , id gi 51511728 } , int { from 9116515 , to 9116733 , strand minus , id gi 51511728 } , int { from 9118422 , to 9118645 , strand minus , id gi 51511728 } , int { from 9120618 , to 9120798 , strand minus , id gi 51511728 } , int { from 9121208 , to 9121282 , strand minus , id gi 51511728 } , int { from 9121563 , to 9121719 , strand minus , id gi 51511728 } , int { from 9123106 , to 9123193 , strand minus , id gi 51511728 } , int { from 9123501 , to 9123677 , strand minus , id gi 51511728 } , int { from 9123956 , to 9124039 , strand minus , id gi 51511728 } , int { from 9133062 , to 9133113 , strand minus , id gi 51511728 } , int { from 9133764 , to 9133885 , strand minus , id gi 51511728 } , int { from 9134218 , to 9134344 , strand minus , id gi 51511728 } , int { from 9135063 , to 9135291 , strand minus , id gi 51511728 } , int { from 9137327 , to 9137441 , strand minus , id gi 51511728 } , int { from 9138835 , to 9138946 , strand minus , id gi 51511728 } , int { from 9139401 , to 9139562 , strand minus , id gi 51511728 } , int { from 9142469 , to 9142618 , strand minus , id gi 51511728 } , int { from 9143243 , to 9143385 , strand 
minus , id gi 51511728 } , int { from 9145006 , to 9145069 , strand minus , id gi 51511728 } , int { from 9145309 , to 9145536 , strand minus , id gi 51511728 } , int { from 9148101 , to 9148262 , strand minus , id gi 51511728 } , int { from 9150098 , to 9150207 , strand minus , id gi 51511728 } , int { from 9150353 , to 9150467 , strand minus , id gi 51511728 } , int { from 9151386 , to 9151506 , strand minus , id gi 51511728 } , int { from 9153183 , to 9153267 , strand minus , id gi 51511728 } , int { from 9153729 , to 9153897 , strand minus , id gi 51511728 } , int { from 9154176 , to 9154196 , strand minus , id gi 51511728 } , int { from 9156021 , to 9156073 , strand minus , id gi 51511728 } , int { from 9156239 , to 9156398 , strand minus , id gi 51511728 } , int { from 9157222 , to 9157405 , strand minus , id gi 51511728 } , int { from 9159626 , to 9159755 , strand minus , id gi 51511728 } } } , seqs { whole gi 6226959 } , products { { type peptide , heading "Reference" , label "precursor" , accession "NP_000005" , version 1 , genomic-coords { packed-int { { from 9111685 , to 9111701 , strand minus , id gi 51511728 } , { from 9112045 , to 9112086 , strand minus , id gi 51511728 } , { from 9112602 , to 9112704 , strand minus , id gi 51511728 } , { from 9113607 , to 9113675 , strand minus , id gi 51511728 } , { from 9114350 , to 9114440 , strand minus , id gi 51511728 } , { from 9116221 , to 9116348 , strand minus , id gi 51511728 } , { from 9116515 , to 9116733 , strand minus , id gi 51511728 } , { from 9118422 , to 9118645 , strand minus , id gi 51511728 } , { from 9120618 , to 9120798 , strand minus , id gi 51511728 } , { from 9121208 , to 9121282 , strand minus , id gi 51511728 } , { from 9121563 , to 9121719 , strand minus , id gi 51511728 } , { from 9123106 , to 9123193 , strand minus , id gi 51511728 } , { from 9123501 , to 9123677 , strand minus , id gi 51511728 } , { from 9123956 , to 9124039 , strand minus , id gi 51511728 } , { from 9133062 , to 
9133113 , strand minus , id gi 51511728 } , { from 9133764 , to 9133885 , strand minus , id gi 51511728 } , { from 9134218 , to 9134344 , strand minus , id gi 51511728 } , { from 9135063 , to 9135291 , strand minus , id gi 51511728 } , { from 9137327 , to 9137441 , strand minus , id gi 51511728 } , { from 9138835 , to 9138946 , strand minus , id gi 51511728 } , { from 9139401 , to 9139562 , strand minus , id gi 51511728 } , { from 9142469 , to 9142618 , strand minus , id gi 51511728 } , { from 9143243 , to 9143385 , strand minus , id gi 51511728 } , { from 9145006 , to 9145069 , strand minus , id gi 51511728 } , { from 9145309 , to 9145536 , strand minus , id gi 51511728 } , { from 9148101 , to 9148262 , strand minus , id gi 51511728 } , { from 9150098 , to 9150207 , strand minus , id gi 51511728 } , { from 9150353 , to 9150467 , strand minus , id gi 51511728 } , { from 9151386 , to 9151506 , strand minus , id gi 51511728 } , { from 9153183 , to 9153267 , strand minus , id gi 51511728 } , { from 9153729 , to 9153897 , strand minus , id gi 51511728 } , { from 9154176 , to 9154196 , strand minus , id gi 51511728 } , { from 9156021 , to 9156073 , strand minus , id gi 51511728 } , { from 9156239 , to 9156398 , strand minus , id gi 51511728 } , { from 9157222 , to 9157405 , strand minus , id gi 51511728 } , { from 9159626 , to 9159711 , strand minus , id gi 51511728 } } } , seqs { whole gi 4557225 } } } } } } , { type genomic , heading "Reference" , accession "NT_009714" , version 16 , seqs { int { from 1979283 , to 2027462 , strand minus , id gi 37543832 } } , products { { type mRNA , heading "Reference" , accession "NM_000014" , version 3 , genomic-coords { mix { int { from 1979283 , to 1979408 , strand minus , id gi 37543832 } , int { from 1979752 , to 1979793 , strand minus , id gi 37543832 } , int { from 1980309 , to 1980411 , strand minus , id gi 37543832 } , int { from 1981314 , to 1981382 , strand minus , id gi 37543832 } , int { from 1982057 , to 1982147 , 
strand minus , id gi 37543832 } , int { from 1983928 , to 1984055 , strand minus , id gi 37543832 } , int { from 1984222 , to 1984440 , strand minus , id gi 37543832 } , int { from 1986129 , to 1986352 , strand minus , id gi 37543832 } , int { from 1988325 , to 1988505 , strand minus , id gi 37543832 } , int { from 1988915 , to 1988989 , strand minus , id gi 37543832 } , int { from 1989270 , to 1989426 , strand minus , id gi 37543832 } , int { from 1990813 , to 1990900 , strand minus , id gi 37543832 } , int { from 1991208 , to 1991384 , strand minus , id gi 37543832 } , int { from 1991663 , to 1991746 , strand minus , id gi 37543832 } , int { from 2000769 , to 2000820 , strand minus , id gi 37543832 } , int { from 2001471 , to 2001592 , strand minus , id gi 37543832 } , int { from 2001925 , to 2002051 , strand minus , id gi 37543832 } , int { from 2002770 , to 2002998 , strand minus , id gi 37543832 } , int { from 2005034 , to 2005148 , strand minus , id gi 37543832 } , int { from 2006542 , to 2006653 , strand minus , id gi 37543832 } , int { from 2007108 , to 2007269 , strand minus , id gi 37543832 } , int { from 2010176 , to 2010325 , strand minus , id gi 37543832 } , int { from 2010950 , to 2011092 , strand minus , id gi 37543832 } , int { from 2012713 , to 2012776 , strand minus , id gi 37543832 } , int { from 2013016 , to 2013243 , strand minus , id gi 37543832 } , int { from 2015808 , to 2015969 , strand minus , id gi 37543832 } , int { from 2017805 , to 2017914 , strand minus , id gi 37543832 } , int { from 2018060 , to 2018174 , strand minus , id gi 37543832 } , int { from 2019093 , to 2019213 , strand minus , id gi 37543832 } , int { from 2020890 , to 2020974 , strand minus , id gi 37543832 } , int { from 2021436 , to 2021604 , strand minus , id gi 37543832 } , int { from 2021883 , to 2021903 , strand minus , id gi 37543832 } , int { from 2023728 , to 2023780 , strand minus , id gi 37543832 } , int { from 2023946 , to 2024105 , strand minus , id gi 
37543832 } , int { from 2024929 , to 2025112 , strand minus , id gi 37543832 } , int { from 2027333 , to 2027462 , strand minus , id gi 37543832 } } } , seqs { whole gi 6226959 } , products { { type peptide , heading "Reference" , label "precursor" , accession "NP_000005" , version 1 , genomic-coords { packed-int { { from 1979392 , to 1979408 , strand minus , id gi 37543832 } , { from 1979752 , to 1979793 , strand minus , id gi 37543832 } , { from 1980309 , to 1980411 , strand minus , id gi 37543832 } , { from 1981314 , to 1981382 , strand minus , id gi 37543832 } , { from 1982057 , to 1982147 , strand minus , id gi 37543832 } , { from 1983928 , to 1984055 , strand minus , id gi 37543832 } , { from 1984222 , to 1984440 , strand minus , id gi 37543832 } , { from 1986129 , to 1986352 , strand minus , id gi 37543832 } , { from 1988325 , to 1988505 , strand minus , id gi 37543832 } , { from 1988915 , to 1988989 , strand minus , id gi 37543832 } , { from 1989270 , to 1989426 , strand minus , id gi 37543832 } , { from 1990813 , to 1990900 , strand minus , id gi 37543832 } , { from 1991208 , to 1991384 , strand minus , id gi 37543832 } , { from 1991663 , to 1991746 , strand minus , id gi 37543832 } , { from 2000769 , to 2000820 , strand minus , id gi 37543832 } , { from 2001471 , to 2001592 , strand minus , id gi 37543832 } , { from 2001925 , to 2002051 , strand minus , id gi 37543832 } , { from 2002770 , to 2002998 , strand minus , id gi 37543832 } , { from 2005034 , to 2005148 , strand minus , id gi 37543832 } , { from 2006542 , to 2006653 , strand minus , id gi 37543832 } , { from 2007108 , to 2007269 , strand minus , id gi 37543832 } , { from 2010176 , to 2010325 , strand minus , id gi 37543832 } , { from 2010950 , to 2011092 , strand minus , id gi 37543832 } , { from 2012713 , to 2012776 , strand minus , id gi 37543832 } , { from 2013016 , to 2013243 , strand minus , id gi 37543832 } , { from 2015808 , to 2015969 , strand minus , id gi 37543832 } , { from 2017805 , 
to 2017914 , strand minus , id gi 37543832 } , { from 2018060 , to 2018174 , strand minus , id gi 37543832 } , { from 2019093 , to 2019213 , strand minus , id gi 37543832 } , { from 2020890 , to 2020974 , strand minus , id gi 37543832 } , { from 2021436 , to 2021604 , strand minus , id gi 37543832 } , { from 2021883 , to 2021903 , strand minus , id gi 37543832 } , { from 2023728 , to 2023780 , strand minus , id gi 37543832 } , { from 2023946 , to 2024105 , strand minus , id gi 37543832 } , { from 2024929 , to 2025112 , strand minus , id gi 37543832 } , { from 2027333 , to 2027418 , strand minus , id gi 37543832 } } } , seqs { whole gi 4557225 } } } } } } , { type genomic , heading "Reference" , accession "NT_086792" , version 1 , seqs { int { from 4173171 , to 4221277 , strand minus , id gi 51471135 } } , products { { type mRNA , heading "Reference" , accession "NM_000014" , version 3 , genomic-coords { mix { int { from 4173171 , to 4173296 , strand minus , id gi 51471135 } , int { from 4173640 , to 4173681 , strand minus , id gi 51471135 } , int { from 4174197 , to 4174299 , strand minus , id gi 51471135 } , int { from 4175201 , to 4175269 , strand minus , id gi 51471135 } , int { from 4175944 , to 4176034 , strand minus , id gi 51471135 } , int { from 4177816 , to 4177943 , strand minus , id gi 51471135 } , int { from 4178110 , to 4178328 , strand minus , id gi 51471135 } , int { from 4180017 , to 4180240 , strand minus , id gi 51471135 } , int { from 4182213 , to 4182393 , strand minus , id gi 51471135 } , int { from 4182803 , to 4182877 , strand minus , id gi 51471135 } , int { from 4183158 , to 4183314 , strand minus , id gi 51471135 } , int { from 4184702 , to 4184789 , strand minus , id gi 51471135 } , int { from 4185097 , to 4185273 , strand minus , id gi 51471135 } , int { from 4185552 , to 4185635 , strand minus , id gi 51471135 } , int { from 4194661 , to 4194712 , strand minus , id gi 51471135 } , int { from 4195363 , to 4195484 , strand minus , id gi 
51471135 } , int { from 4195817 , to 4195943 , strand minus , id gi 51471135 } , int { from 4196662 , to 4196890 , strand minus , id gi 51471135 } , int { from 4198926 , to 4199040 , strand minus , id gi 51471135 } , int { from 4200364 , to 4200475 , strand minus , id gi 51471135 } , int { from 4200930 , to 4201091 , strand minus , id gi 51471135 } , int { from 4203997 , to 4204146 , strand minus , id gi 51471135 } , int { from 4204771 , to 4204913 , strand minus , id gi 51471135 } , int { from 4206534 , to 4206597 , strand minus , id gi 51471135 } , int { from 4206837 , to 4207064 , strand minus , id gi 51471135 } , int { from 4209629 , to 4209790 , strand minus , id gi 51471135 } , int { from 4211627 , to 4211736 , strand minus , id gi 51471135 } , int { from 4211882 , to 4211996 , strand minus , id gi 51471135 } , int { from 4212916 , to 4213036 , strand minus , id gi 51471135 } , int { from 4214705 , to 4214789 , strand minus , id gi 51471135 } , int { from 4215251 , to 4215419 , strand minus , id gi 51471135 } , int { from 4215698 , to 4215718 , strand minus , id gi 51471135 } , int { from 4217543 , to 4217595 , strand minus , id gi 51471135 } , int { from 4217761 , to 4217920 , strand minus , id gi 51471135 } , int { from 4218744 , to 4218927 , strand minus , id gi 51471135 } , int { from 4221148 , to 4221277 , strand minus , id gi 51471135 } } } , seqs { whole gi 6226959 } , products { { type peptide , heading "Reference" , label "precursor" , accession "NP_000005" , version 1 , genomic-coords { packed-int { { from 4173280 , to 4173296 , strand minus , id gi 51471135 } , { from 4173640 , to 4173681 , strand minus , id gi 51471135 } , { from 4174197 , to 4174299 , strand minus , id gi 51471135 } , { from 4175201 , to 4175269 , strand minus , id gi 51471135 } , { from 4175944 , to 4176034 , strand minus , id gi 51471135 } , { from 4177816 , to 4177943 , strand minus , id gi 51471135 } , { from 4178110 , to 4178328 , strand minus , id gi 51471135 } , { from 
4180017 , to 4180240 , strand minus , id gi 51471135 } , { from 4182213 , to 4182393 , strand minus , id gi 51471135 } , { from 4182803 , to 4182877 , strand minus , id gi 51471135 } , { from 4183158 , to 4183314 , strand minus , id gi 51471135 } , { from 4184702 , to 4184789 , strand minus , id gi 51471135 } , { from 4185097 , to 4185273 , strand minus , id gi 51471135 } , { from 4185552 , to 4185635 , strand minus , id gi 51471135 } , { from 4194661 , to 4194712 , strand minus , id gi 51471135 } , { from 4195363 , to 4195484 , strand minus , id gi 51471135 } , { from 4195817 , to 4195943 , strand minus , id gi 51471135 } , { from 4196662 , to 4196890 , strand minus , id gi 51471135 } , { from 4198926 , to 4199040 , strand minus , id gi 51471135 } , { from 4200364 , to 4200475 , strand minus , id gi 51471135 } , { from 4200930 , to 4201091 , strand minus , id gi 51471135 } , { from 4203997 , to 4204146 , strand minus , id gi 51471135 } , { from 4204771 , to 4204913 , strand minus , id gi 51471135 } , { from 4206534 , to 4206597 , strand minus , id gi 51471135 } , { from 4206837 , to 4207064 , strand minus , id gi 51471135 } , { from 4209629 , to 4209790 , strand minus , id gi 51471135 } , { from 4211627 , to 4211736 , strand minus , id gi 51471135 } , { from 4211882 , to 4211996 , strand minus , id gi 51471135 } , { from 4212916 , to 4213036 , strand minus , id gi 51471135 } , { from 4214705 , to 4214789 , strand minus , id gi 51471135 } , { from 4215251 , to 4215419 , strand minus , id gi 51471135 } , { from 4215698 , to 4215718 , strand minus , id gi 51471135 } , { from 4217543 , to 4217595 , strand minus , id gi 51471135 } , { from 4217761 , to 4217920 , strand minus , id gi 51471135 } , { from 4218744 , to 4218927 , strand minus , id gi 51471135 } , { from 4221148 , to 4221233 , strand minus , id gi 51471135 } } } , seqs { whole gi 4557225 } } } } } } } , properties { { type comment , label "Nomenclature" , version 0 , source { { anchor "HUGO Gene Nomenclature 
Committee" } } , properties { { type property , label "Official Symbol" , text "A2M" , version 0 } , { type property , label "Official Full Name" , text "alpha-2-macroglobulin" , version 0 } } } , { type comment , heading "GeneOntology" , version 0 , source { { pre-text "Provided by" , anchor "GOA" , url "http://www.ebi.ac.uk/GOA/" } } , comment { { type comment , label "Function" , version 0 , comment { { type comment , version 0 , refs { pmid 11435418 } , source { { src { db "GO" , tag id 19899 } , anchor "enzyme binding" , post-text "evidence: IPI" } } } , { type comment , version 0 , refs { pmid 9714181 } , source { { src { db "GO" , tag id 19966 } , anchor "interleukin-1 binding" , post-text "evidence: IDA" } } } , { type comment , version 0 , refs { pmid 10880251 } , source { { src { db "GO" , tag id 19959 } , anchor "interleukin-8 binding" , post-text "evidence: IPI" } } } , { type comment , version 0 , source { { src { db "GO" , tag id 8320 } , anchor "protein carrier activity" , post-text "evidence: NR" } } } , { type comment , version 0 , source { { src { db "GO" , tag id 4867 } , anchor "serine-type endopeptidase inhibitor activity" , post-text "evidence: IEA" } } } , { type comment , version 0 , refs { pmid 9714181 } , source { { src { db "GO" , tag id 43120 } , anchor "tumor necrosis factor binding" , post-text "evidence: IDA" } } } , { type comment , version 0 , source { { src { db "GO" , tag id 17114 } , anchor "wide-spectrum protease inhibitor activity" , post-text "evidence: IEA" } } } } } , { type comment , label "Process" , version 0 , comment { { type comment , version 0 , source { { src { db "GO" , tag id 6886 } , anchor "intracellular protein transport" , post-text "evidence: NR" } } } , { type comment , version 0 , source { { src { db "GO" , tag id 51260 } , anchor "protein homooligomerization" , post-text "evidence: NAS" } } } } } , { type comment , label "Component" , version 0 , comment { { type comment , version 0 , refs { pmid 14718574 } 
, source { { src { db "GO" , tag id 5576 } , anchor "extracellular region" , post-text "evidence: NAS" } } } } } } } } , homology { { type comment , heading "Mouse, Rat" , version 0 , source { { src { db "HomoloGene" , tag id 37248 } , anchor "Map Viewer" , url "http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9606&chr=12& MAPS=genes-r-org/rat-chr/human%3A12,genes-r-org/mouse-chr/human%3A12,genes-r-o rg/human-chr12&query=e%3A2[id]+AND+gene[obj_type]&QSTR=a2m&cmd=focus&fill=10" } } } } , comments { { type comment , heading "LocusTagLink" , version 0 , source { { src { db "HGNC" , tag id 7 } } } } , { type comment , heading "RefSeq Status" , label "REVIEWED" , version 0 } , { type comment , version 0 , refs { pmid 15511627 , pmid 15023809 , pmid 14760718 , pmid 14718574 , pmid 14715656 , pmid 14678766 , pmid 14675603 , pmid 14637088 , pmid 14506912 , pmid 12966032 , pmid 12755687 , pmid 12755614 , pmid 12631277 , pmid 12477932 , pmid 12221172 , pmid 12175343 , pmid 12062545 , pmid 12042276 , pmid 12015318 , pmid 11916201 , pmid 11910179 , pmid 11901360 , pmid 11823454 , pmid 11811950 , pmid 11435418 , pmid 11100124 , pmid 10880251 , pmid 9714181 , pmid 2581245 , pmid 2408344 , pmid 1707161 , pmid 1370808 , pmid 1281457 } } , { type comment , heading "Phenotypes" , version 0 , comment { { type phenotype , text "Alzheimer disease, susceptibility to" , version 0 , source { { src { db "MIM" , tag id 103950 } , anchor "MIM: 103950" } } } , { type phenotype , text "Emphysema due to alpha-2-macroglobulin deficiency" , version 0 , source { { src { db "MIM" , tag id 103950 } , anchor "MIM: 103950" } } } } } , { type comment , heading "Markers (Sequence Tagged Sites/STS)" , version 0 , comment { { type comment , version 0 , source { { src { db "UniSTS" , tag id 25036 } , anchor "RH1601" , post-text "(e-PCR)" } } , comment { { type other , label "Alternate name" , text "RH44109" , version 0 } , { type other , label "Alternate name" , text "stSG1293" , version 0 } } } , { type 
comment , version 0 , source { { src { db "UniSTS" , tag id 40245 } , anchor "SGC31674" , post-text "(e-PCR)" } } , comment { { type other , label "Alternate name" , text "EST130345" , version 0 } , { type other , label "Alternate name" , text "RH52474" , version 0 } , { type other , label "Alternate name" , text "WI-219" , version 0 } } } , { type comment , version 0 , source { { src { db "UniSTS" , tag id 46849 } , anchor "RH11157" , post-text "(e-PCR)" } } , comment { { type other , label "Alternate name" , text "RH44108" , version 0 } , { type other , label "Alternate name" , text "stSG1290R" , version 0 } } } , { type comment , version 0 , source { { src { db "UniSTS" , tag id 95143 } , anchor "G44356" , post-text "(e-PCR)" } } , comment { { type other , label "Alternate name" , text "WIAF-1430-STS" , version 0 } } } , { type comment , version 0 , source { { src { db "UniSTS" , tag id 147540 } , anchor "NoName" , post-text "(e-PCR)" } } } , { type comment , version 0 , source { { src { db "UniSTS" , tag id 151442 } , anchor "NoName" , post-text "(e-PCR)" } } } } } , { type comment , heading "NCBI Reference Sequences (RefSeq)" , version 0 , products { { type mRNA , heading "mRNA Sequence" , accession "NM_000014" , version 3 , source { { src { db "Nucleotide" , tag id 6226959 } , anchor "NM_000014" } } , seqs { whole gi 6226959 } , products { { type peptide , heading "Product" , accession "NP_000005" , version 1 , source { { src { db "Protein" , tag id 4557225 } , anchor "NP_000005" , post-text "alpha-2-macroglobulin precursor" } } , seqs { whole gi 4557225 } , comment { { type other , heading "Conserved Domains" , version 0 , source { { src { db "PROT_CDD" , tag id 4557225 } , pre-text "(2)" , anchor "summary" } } , comment { { type other , version 0 , source { { src { db "CDD" , tag id 5952 } , anchor "pfam00207: A2M; Alpha-2-macroglobulin family" } } , comment { { type other , text "Location: 725 - 1463 Blast Score: 2329" , version 0 } } } , { type other , 
version 0 , source { { src { db "CDD" , tag id 25832 } , anchor "pfam01835: A2M_N; Alpha-2-macroglobulin family N-terminal region" } } , comment { { type other , text "Location: 21 - 628 Blast Score: 1960" , version 0 } } } } } } } } , comment { { type other , heading "Source Sequence" , version 0 , source { { src { db "Nucleotide" , tag str "M11313" } , anchor "M11313" } } , comment { { type other , version 0 } } } } } } } , { type comment , heading "Related Sequences" , version 0 , products { { type genomic , heading "Genomic" , accession "AF349032" , version 1 , source { { src { db "Nucleotide" , tag id 13661813 } , anchor "AF349032" } } , seqs { whole gi 13661813 } , products { { type peptide , accession "AAK38109" , version 1 , source { { src { db "Protein" , tag id 13661814 } , anchor "AAK38109" } } , seqs { whole gi 13661814 } } } } , { type genomic , heading "Genomic" , accession "AF349033" , version 1 , source { { src { db "Nucleotide" , tag id 13661815 } , anchor "AF349033" } } , seqs { whole gi 13661815 } , products { { type peptide , accession "AAK38110" , version 1 , source { { src { db "Protein" , tag id 13661816 } , anchor "AAK38110" } } , seqs { whole gi 13661816 } } } } , { type genomic , heading "Genomic" , accession "X68728" , version 1 , source { { src { db "Nucleotide" , tag id 450521 } , anchor "X68728" } } , seqs { whole gi 450521 } , products { { type peptide , accession "CAA48670" , version 1 , source { { src { db "Protein" , tag id 825615 } , anchor "CAA48670" } } , seqs { whole gi 825615 } } } } , { type genomic , heading "Genomic" , accession "Z11711" , version 1 , source { { src { db "Nucleotide" , tag id 24760 } , anchor "Z11711" } } , seqs { whole gi 24760 } , products { { type peptide , accession "CAA77774" , version 1 , source { { src { db "Protein" , tag id 24761 } , anchor "CAA77774" } } , seqs { whole gi 24761 } } } } , { type mRNA , heading "mRNA" , accession "AB209614" , version 1 , source { { src { db "Nucleotide" , tag id 
62088807 } , anchor "AB209614" } } , seqs { whole gi 62088807 } , products { { type peptide , accession "BAD92851" , version 1 , source { { src { db "Protein" , tag id 62088808 } , anchor "BAD92851" } } , seqs { whole gi 62088808 } } } } , { type mRNA , heading "mRNA" , accession "AF109189" , version 1 , source { { src { db "Nucleotide" , tag id 33337723 } , anchor "AF109189" } } , seqs { whole gi 33337723 } , products { { type peptide , accession "AAQ13498" , version 1 , source { { src { db "Protein" , tag id 33337724 } , anchor "AAQ13498" } } , seqs { whole gi 33337724 } } } } , { type mRNA , heading "mRNA" , accession "AY591530" , version 1 , source { { src { db "Nucleotide" , tag id 46812314 } , anchor "AY591530" } } , seqs { whole gi 46812314 } , products { { type peptide , accession "AAT02228" , version 1 , source { { src { db "Protein" , tag id 46812315 } , anchor "AAT02228" } } , seqs { whole gi 46812315 } } } } , { type mRNA , heading "mRNA" , accession "BC026246" , version 1 , source { { src { db "Nucleotide" , tag id 45708660 } , anchor "BC026246" } } , seqs { whole gi 45708660 } , products { { type peptide , accession "AAH26246" , version 1 , source { { src { db "Protein" , tag id 45708661 } , anchor "AAH26246" } } , seqs { whole gi 45708661 } } } } , { type mRNA , heading "mRNA" , accession "BC040071" , version 1 , source { { src { db "Nucleotide" , tag id 25303945 } , anchor "BC040071" } } , seqs { whole gi 25303945 } , products { { type peptide , accession "AAH40071" , version 1 , source { { src { db "Protein" , tag id 25303946 } , anchor "AAH40071" } } , seqs { whole gi 25303946 } } } } , { type mRNA , heading "mRNA" , accession "BX647329" , version 1 , source { { src { db "Nucleotide" , tag id 34366357 } , anchor "BX647329" } } , seqs { whole gi 34366357 } , products { { type peptide , text "None" , version 0 } } } , { type mRNA , heading "mRNA" , accession "CR749334" , version 1 , source { { src { db "Nucleotide" , tag id 51476395 } , anchor 
"CR749334" } } , seqs { whole gi 51476395 } , products { { type peptide , accession "CAH18188" , version 1 , source { { src { db "Protein" , tag id 51476396 } , anchor "CAH18188" } } , seqs { whole gi 51476396 } } } } , { type mRNA , heading "mRNA" , accession "M11313" , version 1 , source { { src { db "Nucleotide" , tag id 177869 } , anchor "M11313" } } , seqs { whole gi 177869 } , products { { type peptide , accession "AAA51551" , version 1 , source { { src { db "Protein" , tag id 177870 } , anchor "AAA51551" } } , seqs { whole gi 177870 } } } } , { type mRNA , heading "mRNA" , accession "M36501" , version 1 , source { { src { db "Nucleotide" , tag id 177871 } , anchor "M36501" } } , seqs { whole gi 177871 } , products { { type peptide , accession "AAA51552" , version 1 , source { { src { db "Protein" , tag id 177872 } , anchor "AAA51552" } } , seqs { whole gi 177872 } } } } , { type other , text "None" , version 0 , products { { type peptide , accession "P01023" , version 0 , source { { src { db "Protein" , tag id 112911 } , anchor "P01023" } } , seqs { whole gi 112911 } } } } } } , { type comment , heading "Additional Links" , version 0 , comment { { type comment , version 0 , source { { src { db "Evidence Viewer" , tag str "2" } , anchor "Evidence Viewer" , url "http://www.ncbi.nlm.nih.gov/sutils/evv.cgi?taxid=9606&conti g=NT_009714.16&gene=A2M&lid=2&from=1979284&to=2027463" } } } , { type comment , version 0 , source { { src { db "ModelMaker" , tag str "2" } , anchor "ModelMaker" , url "http://www.ncbi.nlm.nih.gov/mapview/modelmaker.cgi?taxid=96 06&contig=NT_009714.16&gene=A2M&lid=2" } } } , { type comment , text "UniGene" , version 0 , xtra-properties { { tag "UNIGENE" , value "Hs.212838" } } , source { { src { db "UniGene" , tag str "Hs.212838" } , anchor "Hs.212838" , url "http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?ORG=Hs&CID=21 2838" } } } , { type comment , text "MIM" , version 0 , source { { src { db "MIM" , tag str "103950" } , anchor "103950" } } } 
, { type comment , version 0 , source { { src { db "HomoloGene" , tag str "2" } , anchor "HomoloGene" , url "http://www.ncbi.nlm.nih.gov/HomoloGene/homolquery.cgi?TEXT= 2[loc]&TAXID=9606" } } } , { type comment , version 0 , source { { src { db "AceView" , tag id 2 } , anchor "AceView" , url "http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?c=l ocusid&org=9606&l=2" } } } , { type comment , version 0 , source { { src { db "GDB" , tag str "GDB:119639" } } } } , { type comment , version 0 , source { { src { db "HGMD" , tag str "" } , url "http://www.uwcm.ac.uk/uwcm/mg/search/119639.html" } } } , { type comment , version 0 , source { { src { db "Ensembl" , tag str "" } , url "http://www.ensembl.org/Homo_sapiens/contigview?geneid=CR749 334" } } } , { type comment , version 0 , source { { src { db "UCSC" , tag str "" } , url "http://genome.ucsc.edu/cgi-bin/hgTracks?org=human&position= CR749334" } } } , { type comment , version 0 , source { { src { db "KEGG" , tag str "" } , url "http://www.genome.ad.jp/dbget-bin/www_bget?hsa:2" } } } , { type comment , text "PharmGKB" , version 0 , source { { src { db "PharmGKB" , tag str "PA24357" } , anchor "PA24357" } } } , { type comment , version 0 , source { { src { db "MGC" , tag str "BC040071" } , anchor "MGC" , url "http://mgc.nci.nih.gov/Genes/CloneList?ORG=Hs&LIST=BC040071" } } } } } , { type comment , heading "Pathways" , version 0 , comment { { type comment , text "KEGG pathway: Alzheimer's disease" , version 0 , source { { src { db "05010" , tag str "05010" } , anchor "05010" , url "http://www.genome.jp/dbget-bin/show_pathway?hsa05010+2" } } } , { type comment , text "KEGG pathway: Coagulation cascade" , version 0 , source { { src { db "04610" , tag str "04610" } , anchor "04610" , url "http://www.genome.jp/dbget-bin/show_pathway?hsa04610+2" } } } , { type comment , text "KEGG pathway: Complement and coagulation cascades" , version 0 , source { { src { db "04610" , tag str "04610" } , anchor "04610" , url 
"http://www.genome.jp/dbget-bin/show_pathway?hsa04610+2" } } } , { type comment , text "Reactome: Hemostasis" , version 0 , source { { anchor "49292" , url "http://www.reactome.org/cgi-bin/link?SOURCE=UniProt&ID=P010 23" } } } } } , { type generif , text "an important involvement of alpha2M in regulation of increased proteolytic activity occurring in multiple sclerosis disease" , version 0 , refs { pmid 15511627 } , create-date str "Dec 18 2004 10:01AM" , update-date str "Dec 18 2004 1:19PM" } , { type generif , text "alpha2-macroglobulin inhibits human pepsin and gastricsin" , version 0 , refs { pmid 14506912 } , create-date str "Oct 7 2003 12:00AM" , update-date str "Jul 7 2004 1:36PM" } , { type generif , text "There is a significant genetic association of the 5 bp deletion and two novel polymorphisms in alpha-2-macroglobulin alpha-2-macroglobulin precursor with AD" , version 0 , refs { pmid 12966032 } , create-date str "Jun 27 2004 5:41PM" , update-date str "Jun 27 2004 6:18PM" } , { type generif , text "Alpha2-macroglobulin is a substrate and an endogenous inhibitor for ADAMTS-4 and ADAMTS-5" , version 0 , refs { pmid 14715656 } , create-date str "Jun 14 2004 4:41PM" , update-date str "Jun 14 2004 5:44PM" } , { type generif , text "A2M-D allele played a weak Alzheimer disease protective role, and APOE-E4 and A2M-G alleles might act synergistically in Alzheimer disease risk for mainland Han Chinese." , version 0 , refs { pmid 14675603 } , create-date str "Feb 17 2004 12:00AM" , update-date str "Apr 13 2004 1:35PM" } , { type generif , text "Plasma from patients homozygous for the intronic deletion (DD) showed normal alpha(2)M subunit size, conformation, and proteinase inhibitory activity. Plasma alpha(2)M from two DD patients showed markedly increased TGF-beta1 binding." 
, version 0 , refs { pmid 14678766 } , create-date str "Jan 20 2004 12:00AM" , update-date str "Mar 7 2004 7:01AM" } , { type generif , text "The presence of MPO-G/G and A2M-Val/Val genotypes synergistically increased the risk of AD (OR, 25.5; 95% CI, 4.65-139.75)." , version 0 , refs { pmid 15023809 } , create-date str "Mar 25 2004 12:00AM" , update-date str "Apr 18 2004 7:02AM" } , { type generif , text "alpha2-M deletion polymorphism is probably not associated with functional deficiencies important in Alzheimer's disease pathology" , version 0 , refs { pmid 14637088 } , create-date str "Dec 31 2003 12:00AM" , update-date str "Jan 11 2004 7:02AM" } , { type generif , text "FGF-2 and this protein interact at specific binding sites, involving different FGF-2 sequences." , version 0 , refs { pmid 12755687 } , create-date str "Sep 7 2003 12:00AM" , update-date str "Sep 28 2003 7:01AM" } , { type generif , text "alpha(2)M-derived peptides target the receptor-binding sequence in TGF-beta" , version 0 , refs { pmid 12755614 } , create-date str "Jun 25 2003 12:00AM" , update-date str "Jul 6 2003 7:01AM" } , { type generif , text "These results suggest the possible involvement of cathepsin E in disruption of the structural and functional integrity of alpha 2-macroglobulin in the endolysosome system." , version 0 , refs { pmid 12631277 } , create-date str "May 8 2003 12:00AM" , update-date str "May 11 2003 7:01AM" } , { type generif , text "Alpha 2-macroglobulin enhances prothrombin activation and thrombin potential by inhibiting the anticoagulant protein C/protein S system in cord and adult plasma." 
, version 0 , refs { pmid 12062545 } , create-date str "Feb 6 2003 12:00AM" , update-date str "Aug 17 2003 7:01AM" } , { type generif , text "relationship between serum VEGF levels, alpha(2)M levels and the development of OHSS in hyperstimulated subjects undergoing IVF" , version 0 , refs { pmid 12042276 } , create-date str "Dec 16 2002 12:00AM" , update-date str "Feb 2 2003 7:00AM" } , { type generif , text "Genetic association of alpha2-macroglobulin polymorphisms with Alzheimer's disease" , version 0 , refs { pmid 12221172 } , create-date str "Sep 27 2002 12:00AM" , update-date str "Oct 7 2002 8:10AM" } , { type generif , text "Genetic association of argyrophilic grain disease with polymorphisms in alpha-2 macroglobulin." , version 0 , refs { pmid 12175343 } , create-date str "Sep 24 2002 12:00AM" , update-date str "Oct 7 2002 8:10AM" } , { type generif , text "The three-dimensional structure of the dimer reveals its structural organization in the tetrameric native and chymotrypsin alpha 2-macroglobulin complexes." 
, version 0 , refs { pmid 12015318 } , create-date str "Sep 5 2002 12:00AM" , update-date str "Sep 23 2002 6:27AM" } , { type generif , text "has an important role in the AD-specific neurodegenerative process but its exon 24 Val-1000-Ile polymorphism is not likely to be associated with late-onset sporadic AD in the Hungarian population" , version 0 , refs { pmid 11901360 } , create-date str "Aug 6 2002 12:00AM" , update-date str "Aug 28 2002 6:15PM" } , { type generif , text "REVIEW: binds and neutralizes alfimeprase, which has direct proteolytic activity against the fibrinogen Aalpha chain" , version 0 , refs { pmid 11910179 } , create-date str "Jun 12 2002 12:00AM" , update-date str "Sep 7 2003 7:01AM" } , { type generif , text "distinct binding sites mediate interaction with beta-amyloid peptide and growth factors" , version 0 , refs { pmid 11823454 } , create-date str "May 4 2002 12:00AM" , update-date str "May 18 2002 6:08AM" } , { type generif , text "Differential binding to ldl receptor related protein" , version 0 , refs { pmid 11811950 } , create-date str "Feb 14 2002 12:00AM" , update-date str "Mar 4 2002 7:46AM" } , { type generif , heading "HIV-1 protein interactions" , version 0 , comment { { type generif , text "Binding of HIV-1 Tat to LRP inhibits neuronal binding, uptake and degradation of physiological ligands for LRP, including alpha2-macroglobulin, apolipoprotein E4, amyloid precursor and amyloid beta-protein" , version 0 , refs { pmid 11100124 } , comment { { type comment , label "Tat" , accession "NP_057853" , version 1 , source { { src { db "GeneID" , tag id 155871 } } } } , { type comment , accession "NP_000005" , version 1 } } , create-date str "May 11 2004 12:48PM" , update-date str "May 11 2004 1:10PM" } } } } , unique-keys { { db "LocusID" , tag id 2 } , { db "MIM" , tag id 103950 } } , xtra-index-terms { "LOC2" } , xtra-properties { { tag "PROP" , value "phenotype" } } } Makefile.PL100644000765000024 307712215135616 17031 
0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70 use strict; use warnings; use ExtUtils::MakeMaker 6.30; my %WriteMakefileArgs = ( "ABSTRACT" => "Regular expression-based Perl Parser for NCBI Entrez Gene.", "AUTHOR" => "Mingyi Liu ", "BUILD_REQUIRES" => {}, "CONFIGURE_REQUIRES" => { "ExtUtils::MakeMaker" => "6.30" }, "DISTNAME" => "Bio-ASN1-EntrezGene", "EXE_FILES" => [], "LICENSE" => "perl", "NAME" => "Bio::ASN1::EntrezGene", "PREREQ_PM" => { "Bio::Index::AbstractSeq" => 0, "Carp" => 0, "parent" => 0, "strict" => 0, "utf8" => 0, "warnings" => 0 }, "TEST_REQUIRES" => { "File::Spec" => 0, "IO::Handle" => 0, "IPC::Open3" => 0, "Test::More" => 0 }, "VERSION" => "1.70", "test" => { "TESTS" => "t/*.t" } ); unless ( eval { ExtUtils::MakeMaker->VERSION(6.63_03) } ) { my $tr = delete $WriteMakefileArgs{TEST_REQUIRES}; my $br = $WriteMakefileArgs{BUILD_REQUIRES}; for my $mod ( keys %$tr ) { if ( exists $br->{$mod} ) { $br->{$mod} = $tr->{$mod} if $tr->{$mod} > $br->{$mod}; } else { $br->{$mod} = $tr->{$mod}; } } } unless ( eval { ExtUtils::MakeMaker->VERSION(6.56) } ) { my $br = delete $WriteMakefileArgs{BUILD_REQUIRES}; my $pp = $WriteMakefileArgs{PREREQ_PM}; for my $mod ( keys %$br ) { if ( exists $pp->{$mod} ) { $pp->{$mod} = $br->{$mod} if $br->{$mod} > $pp->{$mod}; } else { $pp->{$mod} = $br->{$mod}; } } } delete $WriteMakefileArgs{CONFIGURE_REQUIRES} unless eval { ExtUtils::MakeMaker->VERSION(6.52) }; WriteMakefile(%WriteMakefileArgs); input1.asn100755000765000024 2034412215135616 17264 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/tEntrezgene ::= { track-info { geneid 3 , status live , create-date std { year 2003 , month 8 , day 28 , hour 20 , minute 30 , second 0 } , update-date std { year 2005 , month 2 , day 17 , hour 12 , minute 9 , second 0 } } , type pseudo , source { genome genomic , origin natural , org { taxname "Homo sapiens" , common "human" , db { { db "taxon" , tag id 9606 } } , syn { "man" } , orgname { name binomial { genus "Homo" 
, species "sapiens" } , lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae; Homo" , gcode 1 , mgcode 2 , div "PRI" } } , subtype { { subtype chromosome , name "12" } } } , gene { locus "A2MP" , desc "alpha-2-macroglobulin pseudogene" , maploc "12p13.3-p12.3" , locus-tag "HGNC:8" } , location { { display-str "12p13.3-p12.3" , method map-type cyto } } , gene-source { src "LocusLink" , src-int 3 , src-str2 "3" , gene-display FALSE , locus-display FALSE , extra-terms FALSE } , locus { { type genomic , heading "Reference" , accession "NC_000012" , version 9 , seqs { int { from 9275468 , to 9278174 , strand minus , id gi 51511728 , fuzz-from lim lt , fuzz-to lim gt } } } , { type genomic , heading "Reference" , accession "NT_009714" , version 16 , seqs { int { from 2143175 , to 2145881 , strand minus , id gi 37543832 , fuzz-from lim lt , fuzz-to lim gt } } } , { type genomic , heading "Reference" , accession "NT_086792" , version 1 , seqs { int { from 4337316 , to 4340022 , strand minus , id gi 51471135 , fuzz-from lim lt , fuzz-to lim gt } } } , { type genomic , heading "Reference" , accession "NG_001067" , version 1 , seqs { int { from 176 , to 2880 , strand plus , id gi 20270626 , fuzz-from lim lt , fuzz-to lim gt } } } } , properties { { type comment , label "Nomenclature" , version 0 , source { { anchor "HUGO Gene Nomenclature Committee" } } , properties { { type property , label "Official Symbol" , text "A2MP" , version 0 } , { type property , label "Official Full Name" , text "alpha-2-macroglobulin pseudogene" , version 0 } } } } , comments { { type comment , heading "LocusTagLink" , version 0 , source { { src { db "HGNC" , tag id 8 } } } } , { type comment , heading "RefSeq Status" , label "PROVISIONAL" , version 0 } , { type comment , version 0 , refs { pmid 2478422 } } , { type comment , heading "NCBI Reference Sequences (RefSeq)" , version 0 , comment { { type genomic 
, heading "Reference" , accession "NG_001067" , version 1 , source { { src { db "Nucleotide" , tag id 20270626 } , anchor "NG_001067" } } , seqs { int { from 1 , to 3003 , strand plus , id gi 20270626 } } } } } , { type comment , heading "Related Sequences" , version 0 , products { { type genomic , heading "Genomic" , accession "M24415" , version 1 , source { { src { db "Nucleotide" , tag id 187575 } , anchor "M24415" } } , seqs { int { from 177 , to 2881 , strand plus , id gi 187575 } } , products { { type peptide , text "None" , version 0 } } } } } , { type comment , heading "Additional Links" , version 0 , comment { { type comment , version 0 , source { { src { db "Evidence Viewer" , tag str "3" } , anchor "Evidence Viewer" , url "http://www.ncbi.nlm.nih.gov/sutils/evv.cgi?taxid=9606&conti g=NT_009714.16&gene=A2MP&lid=3&from=2143176&to=2145882" } } } , { type comment , version 0 , source { { src { db "ModelMaker" , tag str "3" } , anchor "ModelMaker" , url "http://www.ncbi.nlm.nih.gov/mapview/modelmaker.cgi?taxid=96 06&contig=NT_009714.16&gene=A2MP&lid=3" } } } , { type comment , version 0 , source { { src { db "AceView" , tag id 3 } , anchor "AceView" , url "http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?c=l ocusid&org=9606&l=3" } } } , { type comment , version 0 , source { { src { db "GDB" , tag str "GDB:128103" } } } } } } } , unique-keys { { db "LocusID" , tag id 3 } } , xtra-index-terms { "LOC3" } } Entrezgene ::= { track-info { geneid 4 , status discontinued , create-date std { year 2003 , month 8 , day 28 , hour 20 , minute 30 , second 0 } , update-date std { year 2005 , month 2 , day 17 , hour 12 , minute 9 , second 0 } } , type protein-coding , source { genome genomic , origin natural , org { taxname "Homo sapiens" , common "human" , db { { db "taxon" , tag id 9606 } } , syn { "man" } , orgname { name binomial { genus "Homo" , species "sapiens" } , lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; 
Euarchontoglires; Primates; Catarrhini; Hominidae; Homo" , gcode 1 , mgcode 2 , div "PRI" } } , subtype { { subtype chromosome , name "1" } } } , gene { desc "adenovirus-12 chromosome modification site 1C" , maploc "1q42-q43" , syn { "A12M1" } } , prot { name { "adenovirus-12 chromosome modification site 1C" , "Adenovirus-12 chromosome modification site-1q1" } } , location { { display-str "1q42-q43" , method map-type cyto } } , gene-source { src "LocusLink" , src-int 4 , src-str2 "4" , gene-display FALSE , locus-display FALSE , extra-terms FALSE } , locus { { type genomic , version 0 } } , comments { { type comment , heading "RefSeq Status" , label "WITHDRAWN" , version 0 } , { type comment , heading "Additional Links" , version 0 , comment { { type comment , version 0 , source { { src { db "GDB" , tag str "GDB:118950" } } } } } } } , unique-keys { { db "LocusID" , tag id 4 } } , xtra-index-terms { "LOC4" } } testparser.t100755000765000024 274512215135616 17707 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/t#!/usr/bin/env perl -w use strict; use warnings; use File::Spec; use Test::More tests => 10; my ($nogene, $noseq); BEGIN { diag("\n\nTest parsers (Bio::ASN1::EntrezGene, Bio::ASN1::Sequence),\nparsing and method call:\n"); use_ok('Bio::ASN1::EntrezGene') || $nogene++; use_ok('Bio::ASN1::Sequence') || $noseq++; } diag("\n\nFirst testing gene parser:\n"); if(!$nogene) { my $parser = Bio::ASN1::EntrezGene->new(file => File::Spec->catfile('t','input.asn')); isa_ok($parser, 'Bio::ASN1::EntrezGene'); my $value = $parser->next_seq; isa_ok($value, 'ARRAY'); like($value->[0]{'track-info'}[0]{geneid}, qr/^\d+$/, 'correct geneid format'); my $raw = $parser->rawdata; like($raw, qr/^Entrezgene ::=/, 'rawdata() call'); } else { diag("\nThere's some problem with the installation of Bio::ASN1::EntrezGene!\nTry install again using:\n\tperl Makefile.PL\n\tmake\nQuitting now"); } diag("\n\nNow testing sequence parser:\n"); if(!$noseq) { my $parser = 
Bio::ASN1::Sequence->new(file => File::Spec->catfile('t','seq.asn')); isa_ok($parser, 'Bio::ASN1::Sequence'); my $value = $parser->next_seq; isa_ok($value, 'ARRAY'); like($value->[0]{'seq-set'}[0]{seq}[0]{id}[0]{genbank}[0]{accession}, qr/^[A-Za-z0-9.]+$/, 'genbank id format test'); my $raw = $parser->rawdata; like($raw, qr/^Seq-entry ::= set/, 'rawdata() call'); } else { diag("\nThere's some problem with the installation of Bio::ASN1::Sequence!\nTry install again using:\n\tperl Makefile.PL\n\tmake\nQuitting now"); } 00-compile.t100644000765000024 171512215135616 17351 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/tuse strict; use warnings; # this test was generated with Dist::Zilla::Plugin::Test::Compile 2.027 use Test::More tests => 4 + ($ENV{AUTHOR_TESTING} ? 1 : 0); my @module_files = ( 'Bio/ASN1/EntrezGene.pm', 'Bio/ASN1/EntrezGene/Indexer.pm', 'Bio/ASN1/Sequence.pm', 'Bio/ASN1/Sequence/Indexer.pm' ); # no fake home requested use IPC::Open3; use IO::Handle; my @warnings; for my $lib (@module_files) { # see L my $stdin = ''; # converted to a gensym by open3 my $stderr = IO::Handle->new; binmode $stderr, ':crlf' if $^O eq 'MSWin32'; my $pid = open3($stdin, '>&STDERR', $stderr, qq{$^X -Mblib -e"require q[$lib]"}); waitpid($pid, 0); is($? 
>> 8, 0, "$lib loaded ok"); if (my @_warnings = <$stderr>) { warn @_warnings; push @warnings, @_warnings; } } is(scalar(@warnings), 0, 'no warnings found') if $ENV{AUTHOR_TESTING}; testindexer.t100755000765000024 665412215135616 20054 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/t#!/usr/bin/env perl -w use strict; use File::Spec; use Test::More tests => 6; sub check_dependency { my $class = shift; eval "require $class; 1"; if ($@) { return; } 1; } my ( $noindex, $noabseq, $nogene, $noseq, $noseqindex ); BEGIN { diag( "\n\nTest indexers (Bio::ASN1::EntrezGene::Indexer, Bio::ASN1::Sequence::Indexer)\nIndexing and retrieval:\n" ); check_dependency('Bio::ASN1::EntrezGene') || $nogene++; check_dependency('Bio::Index::AbstractSeq') || $noabseq++; check_dependency('Bio::ASN1::EntrezGene::Indexer') || $noindex++; check_dependency('Bio::ASN1::Sequence') || $noseq++; check_dependency('Bio::ASN1::Sequence::Indexer') || $noseqindex++; } diag("\n\nFirst testing gene indexer:\n"); SKIP: { if ( !$nogene ) { skip( "BioPerl not installed, skipping", 3 ) if $noabseq; # test indexer if ( !$noabseq ) { if ( !$noindex ) { my $inx = Bio::ASN1::EntrezGene::Indexer->new( -filename => File::Spec->catfile('t','testgene.idx'), -write_flag => 'WRITE' ); isa_ok( $inx, 'Bio::ASN1::EntrezGene::Indexer' ); $inx->make_index( File::Spec->catfile('t','input.asn'), File::Spec->catfile('t','input1.asn' )); # cmp_ok($inx->count_records, '==', 4, 'total number of indexed gene records'); my $value = $inx->fetch_hash(3); isa_ok( $value, 'ARRAY' ); cmp_ok( $value->[0]{'track-info'}[0]{geneid}, '==', 3, 'correct gene record retrieved' ); } else { diag( "\nThere's some problem with the installation of Bio::ASN1::EntrezGene::Indexer!\nTry install again using:\n\tperl Makefile.PL\n\tmake\nQuitting now" ); } } } else { diag( "\nThere's some problem with the installation of Bio::ASN1::EntrezGene!\nTry install again using:\n\tperl Makefile.PL\n\tmake\nQuitting now" ); } diag("\n\nNow testing sequence 
indexer:\n"); } SKIP: { if ( !$noseq ) { skip( "BioPerl not installed, skipping", 3 ) if $noabseq; # test indexer if ( !$noabseq ) { if ( !$noseqindex ) { my $inx = Bio::ASN1::Sequence::Indexer->new( -filename => File::Spec->catfile('t','testseq.idx'), -write_flag => 'WRITE' ); isa_ok( $inx, 'Bio::ASN1::Sequence::Indexer' ); $inx->make_index(File::Spec->catfile('t','seq.asn')); # cmp_ok($inx->count_records, '==', 2, 'total number of sequence ids in index'); my $value = $inx->fetch_hash('AF093062'); isa_ok( $value, 'ARRAY' ); cmp_ok( $value->[0]{'seq-set'}[0]{seq}[0]{id}[0]{genbank}[0] {accession}, 'eq', 'AF093062', 'correct sequence record retrieved' ); } else { diag( "\nThere's some problem with the installation of Bio::ASN1::Sequence::Indexer!\nTry install again using:\n\tperl Makefile.PL\n\tmake\nQuitting now" ); } } } else { diag( "\nThere's some problem with the installation of Bio::ASN1::Sequence!\nTry install again using:\n\tperl Makefile.PL\n\tmake\nQuitting now" ); } } release-eol.t100644000765000024 47612215135616 17664 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/t BEGIN { unless ($ENV{RELEASE_TESTING}) { require Test::More; Test::More::plan(skip_all => 'these tests are for release candidate testing'); } } use strict; use warnings; use Test::More; eval 'use Test::EOL'; plan skip_all => 'Test::EOL required' if $@; all_perl_files_ok({ trailing_whitespace => 1 }); release-no-tabs.t100644000765000024 45012215135616 20440 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/t BEGIN { unless ($ENV{RELEASE_TESTING}) { require Test::More; Test::More::plan(skip_all => 'these tests are for release candidate testing'); } } use strict; use warnings; use Test::More; eval 'use Test::NoTabs'; plan skip_all => 'Test::NoTabs required' if $@; all_perl_files_ok(); release-mojibake.t100644000765000024 64412215135616 20663 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/t#!perl BEGIN { unless ($ENV{RELEASE_TESTING}) { require Test::More; 
Test::More::plan(skip_all => 'these tests are for release candidate testing'); } } use strict; use warnings qw(all); use Test::More; ## no critic (ProhibitStringyEval, RequireCheckingReturnValueOfEval) eval q(use Test::Mojibake); plan skip_all => q(Test::Mojibake required for source encoding testing) if $@; all_files_encoding_ok(); release-pod-syntax.t100644000765000024 45012215135616 21203 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/t#!perl BEGIN { unless ($ENV{RELEASE_TESTING}) { require Test::More; Test::More::plan(skip_all => 'these tests are for release candidate testing'); } } use Test::More; eval "use Test::Pod 1.41"; plan skip_all => "Test::Pod 1.41 required for testing POD" if $@; all_pod_files_ok(); examples000755000765000024 012215135616 16526 5ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70indexer_test.pl100755000765000024 300712215135616 21723 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/examples#!/usr/bin/perl # launch it like "perl indexer_test.pl Homo_sapiens" (Homo_sapiens can be downloaded # and decompressed from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN/Mammalia/Homo_sapiens.gz) # or use the included test file "perl indexer_test.pl ../t/input.asn ../t/input1.asn" use strict; use lib 'lib'; use lib '/home/liu/cvs/bioperl-1.5.0'; use lib '/home/liu/important/scripts/ari_geneindex'; use Bio::ASN1::EntrezGene::Indexer; use Dumpvalue; use Benchmark; # creation of index: my $file = 'entrezgene.idx'; my $inx = Bio::ASN1::EntrezGene::Indexer->new( -filename => $file, -write_flag => 'WRITE'); my $t0 = new Benchmark; $inx->make_index(@ARGV); my $t1 = new Benchmark; print "Indexing @ARGV took:",timestr(timediff($t1, $t0)),"\n"; # using the index: my $geneid = 3; # below is not needed in this script but it's the preferred calling way if # one's just using an existing index file # my $inx = Bio::ASN1::EntrezGene::Indexer->new(-filename => 'entrezgene.idx'); print "there are a total of " . $inx->count_records . 
" records\n"; my $t0 = new Benchmark; # uncomment below to test retrieving Bio::Seq obj # my $seq = $inx->fetch($geneid); # Bio::Seq obj returned by Bio::SeqIO::entrezgene.pm my $seq1 = $inx->fetch_hash($geneid); # a hash produced by Bio::ASN1::EntrezGene # that contains all data in the Entrez Gene record my $t1 = new Benchmark; print "Retrieving Entrez Gene #$geneid took:",timestr(timediff($t1, $t0)),"\n"; # Dumpvalue->new->dumpValue($seq); Dumpvalue->new->dumpValue($seq1); ASN1000755000765000024 012215135616 16671 5ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/lib/BioSequence.pm100755000765000024 4714612215135616 21176 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/lib/Bio/ASN1package Bio::ASN1::Sequence; BEGIN { $Bio::ASN1::Sequence::AUTHORITY = 'cpan:BIOPERLML'; } { $Bio::ASN1::Sequence::VERSION = '1.70'; } use utf8; use strict; use warnings; use Carp qw(carp croak); # ABSTRACT: Regular expression-based Perl Parser for ASN.1-formatted NCBI Sequences. # AUTHOR: Dr. Mingyi Liu # OWNER: 2005 Mingyi Liu # OWNER: 2005 GPC Biotech AG # OWNER: 2005 Altana Research Institute # LICENSE: Perl_5 sub new { my $class = shift; $class = ref($class) if(ref($class)); my $self = { maxerrstr => 20, @_ }; bless $self, $class; map { $self->input_file($self->{$_}) if($self->{$_}) } qw(file -file); map { $self->fh($self->{$_}) if($self->{$_}) } qw(fh -fh); return $self; } sub maxerrstr { my ($self, $value) = @_; $self->{maxerrstr} = $value if $value > 0; return $self->{maxerrstr}; } sub parse { my ($self, $input, $compact, $noreset) = @_; $input || croak "must have input!\n"; $self->{input} = $input; $self->{filename} = "input" unless $self->{filename}; $self->{linenumber} = 1 unless $self->{linenumber} && $noreset; $self->{depth} = 0; my $result; eval { $result = $self->_parse(); # no need to reset $self->{depth} or linenumber }; if($@) { if($@ !~ /^Data Error:/) { croak "non-conforming data broke parser on line $self->{linenumber} in $self->{filename}\n". 
"possible cause includes randomly inserted brackets in input file before line $self->{linenumber}\n". "first $self->{maxerrstr} (or till end of input) characters including the non-conforming data:\n" . substr($self->{input}, pos($self->{input}), $self->{maxerrstr}) . "\nRaw error mesg: $@\n"; } else { die $@ } } trimdata($result, $compact); return $result; } sub input_file { my ($self, $filename) = @_; # in case user's Perl system can't handle large file. Assuming Unix, otherwise raise error local *IN; # older styled code to enable module to work with perl 5.005_03 open(*IN, $filename) || ($! =~ /too large/i && open(*IN, "cat $filename |")) || croak "can't open $filename! -- $!\n"; $self->{fh} = *IN; $self->{filename} = $filename; $self->{linenumber} = 0; # reset line number } sub next_seq { my ($self, $compact) = @_; $self->{fh} || croak "you must pass in a file name or handle through new() or input_file() first before calling next_seq!\n"; local $/ = "Seq-entry ::= set {"; # set record separator while($_ = readline($self->{fh})) { chomp; next unless /\S/; my $tmp = (/^\s*Seq-entry ::= set ({.*)/si)? $1 : "{" . $_; # get rid of the 'Seq-entry ::= set ' at the beginning of Sequence record return $self->parse($tmp, $compact, 1); # 1 species no resetting line number } } sub _parse { my ($self, $flag) = @_; my $data; while(1) { # changing orders of regex if/elsif statements made little difference. 
current order is close to optimal if($self->{input} =~ /\G[ \t]*,?[ \t]*\n/cg) # cleanup leftover { $self->{linenumber}++; next; } if($self->{input} =~ /\G[ \t]*}/cg) { if(!($self->{depth}--) && $self->{input} =~ /\S/) { croak "Data Error: extra (mismatched) '}' found on line $self->{linenumber} in $self->{filename}!\n"; } return $data } elsif($self->{input} =~ /\G[ \t]*{/cg) { $self->{depth}++; push(@$data, $self->_parse()) } elsif($self->{input} =~ /\G[ \t]*([\w-]+)(\s*)/cg) { my ($id, $lines) = ($1, $2); # we're prepared for NCBI to make the format even worse: # note: to count line numbers right for text files on different OS, I'm sacrificing much speed (maybe I shouldn't worry so much) $self->{linenumber} += $lines =~ s/\n//g || $lines =~ s/\r//g; # count by *NIX/Win or Mac my ($tmp, $tmp1); # we put \s* in lookahead for linenumber counting purpose (which slows things down) if(($self->{input} =~ /\G"((?:[^"]+|"")*)"(?=\s*[,}])/cg && ++$tmp) || ($self->{input} =~ /\G'([^']+)'\s*H/icg && ++$tmp1) || # this is the only difference b/w sequence and entrez gene formats so far $self->{input} =~ /\G([\w-]+)(?=\s*[,}])/cg) { my $value = $1; if($tmp) # slight speed optimization, not really necessary since regex is fast enough { $value =~ s/""/"/g; $self->{linenumber} += $value =~ s/\n//g || $value =~ s/\r//g; # count by *NIX/Win or Mac $value =~ s/[\r\n]+//g; # in case it's Win format } elsif($tmp1) # slight speed optimization, not really necessary since regex is fast enough { $value =~ tr/fF8421/NNTGCA/; # good for NCBI4na. 
But if NCBI8na was used, then more needs to be transliterated $self->{linenumber} += $value =~ s/\n//g || $value =~ s/\r//g; # count by *NIX/Win or Mac $value =~ s/[\r\n0]+//g; # in case it's Win format (get rid of '0' at end of seq too) } if(ref($data->{$id})) { push(@{$data->{$id}}, $value) } # hash value is not a terminal (or have multiple values), create array to avoid multiple same-keyed hash overwrite each other elsif($data->{$id}) { $data->{$id} = [$data->{$id}, $value] } # hash value has a second terminal value now! else { $data->{$id} = $value } # the first terminal value } elsif($self->{input} =~ /\G{/cg) { $self->{depth}++; push(@{$data->{$id}}, $self->_parse()); } elsif($self->{input} =~ /\G(?=[,}])/cg) { push(@$data, $id) } else # must be "id value value" format { $self->{depth}++; push(@{$data->{$id}}, $self->_parse(1)) } if($flag) { if(!($self->{depth}--) && $self->{input} =~ /\S/) { croak "Data Error: extra (mismatched) '}' found on line $self->{linenumber} in $self->{filename}!\n"; } return $data; } } elsif($self->{input} =~ /\G[ \t]*"((?:[^"]+|"")*)"(?=\s*[,}])/cg) { my $value = $1; $value =~ s/""/"/g; $self->{linenumber} += $value =~ s/\n//g || $value =~ s/\r//g; # count by *NIX/Win or Mac $value =~ s/[\r\n]+//g; # in case it's Win format push(@$data, $value) } else # end of input { my ($pos, $len) = (pos($self->{input}), length($self->{input})); if($pos != $len && $self->{input} =~ /\G\s*\S/cg) # problem with parsing, must be non-conforming data { croak "Data Error: none conforming data found on line $self->{linenumber} in $self->{filename}!\n" . "first $self->{maxerrstr} (or till end of input) characters including the non-conforming data:\n" . substr($self->{input}, $pos, $self->{maxerrstr}) . 
"\n"; } elsif($self->{depth} > 0) { croak "Data Error: missing '}' found at end of input in $self->{filename}!"; } elsif($self->{depth} < 0) { croak "Data Error: extra (mismatched) '}' found at end of input in $self->{filename}!"; } return $data; } } } # following copied directly from my Pipeline::Util::util just to make this module # more self-sufficient. Changes should be made over in that module though. # trims arrayrefs that points to one-element array to slims the # data structure down (calls Pipeline::Util::util::trimdata) # something like # 'comments' => ARRAY(0x898be94) # 0 ARRAY(0x883fc54) # 0 ARRAY(0x886aef4) # 0 HASH(0x884d554) # 'heading' => 'LocusTagLink' # 'source' => ARRAY(0x8810714) # 0 ARRAY(0x8a7df18) # 0 ARRAY(0x889f940) # 0 HASH(0x886ada4) # 'src' => ARRAY(0x88454fc) # 0 ARRAY(0x8845598) # 0 HASH(0x898c0ec) # 'db' => 'HGNC' # 'tag' => ARRAY(0x898bfb4) # 0 HASH(0x898c164) # 'id' => 5 # becomes this if $flag == 1: # 'comments' => ARRAY(0x8840014) # 0 HASH(0x884d8a4) # 'heading' => 'LocusTagLink' # 'source' => HASH(0x8a9869c) # 'src' => HASH(0x884534c) # 'db' => 'HGNC' # 'tag' => HASH(0x88453c4) # 'id' => 5 # so now $hash->{comments}->[0]->[0]->[0]->{source}->[0]->[0]->[0]->{src}->[0]->[0]->{tag}->[0]->{id} # becomes $hash->{comments}->[0]->{source}->{src}->{tag}->{id} # this may create problem as array might suddenly change to hash depending on whether it # has multiple elements or not. 
So set $flag to 2 or 0/undef would disallow trimming that # would lead to data type change, thus resulting in data structure like: # 'comments' => ARRAY(0x88617e8) # 0 HASH(0x889d578) # 'heading' => 'LocusTagLink' # 'source' => ARRAY(0x8912244) # 0 HASH(0x8a5d648) # 'src' => ARRAY(0x8a2203c) # 0 HASH(0x8a1af10) # 'db' => 'HGNC' # 'tag' => ARRAY(0x8a1add8) # 0 HASH(0x8a1af88) # 'id' => 5 # still not the safest, but saves some hassle writing code sub trimdata { my ($ref, $flag) = @_; $flag = 2 unless $flag; return if $flag == 3 || !ref($ref); if(ref($ref) ne 'ARRAY') # allows for object refs { my @keys; eval { @keys = keys %$ref }; # let's be careful and check if it can work as a hash return if $@; foreach my $key (@keys) { my $tmp = $ref->{$key}; while(ref($tmp) eq 'ARRAY' && @$tmp == 1) { last if($flag == 2 && ref($tmp->[0]) ne 'ARRAY'); $tmp = $tmp->[0] } $ref->{$key} = $tmp; trimdata($ref->{$key}, $flag) if(ref($ref->{$key})) } } else { # since the only situations where we would get an array of array is # when ASN file has a bracket of brackets (otherwise we'd get at least # a hash), it makes sense to reduce the arrayrefs to one level foreach my $item (@$ref) { my $tmp = $item; while(ref($tmp) eq 'ARRAY' && @$tmp == 1) { $tmp = $tmp->[0]; } $item = $tmp; trimdata($item, $flag) if(ref($item)) } } } sub fh { my ($self, $filehandle) = @_; if($filehandle) { $self->{fh} = $filehandle; $self->{linenumber} = 0; # reset line number } return $self->{fh}; } sub rawdata { my $self = shift; return "Seq-entry ::= set $self->{input}"; } 1; __END__ =pod =encoding utf-8 =head1 NAME Bio::ASN1::Sequence - Regular expression-based Perl Parser for ASN.1-formatted NCBI Sequences. 
=head1 VERSION

version 1.70

=head1 SYNOPSIS

  use Bio::ASN1::Sequence;

  my $parser = Bio::ASN1::Sequence->new('file' => "downloaded.asn1");
  while(my $result = $parser->next_seq)
  {
    # extract data from $result, or Dumpvalue->new->dumpValue($result);
  }

  # a new way to get the $result data hash for a particular sequence id:
  use Bio::ASN1::Sequence::Indexer;
  my $inx = Bio::ASN1::Sequence::Indexer->new(-filename => 'seq.idx');
  my $seq = $inx->fetch_hash('AF093062');

  # for creation of .idx index files please refer to
  # Bio::ASN1::Sequence::Indexer perldoc

=head1 DESCRIPTION

Bio::ASN1::Sequence is a regular expression-based Perl Parser for
ASN.1-formatted NCBI sequences. It parses an ASN.1-formatted sequence record
and returns a data structure that contains all data items from the sequence
record.

The parser reports an error, together with the line number, if the input data
does not conform to the NCBI Sequence annotation file format.

The sequence parser is basically a modified version of the high-performance
Bio::ASN1::EntrezGene parser. However, I created a standalone module for
sequences since it is more efficient to keep Sequence-specific code out of
EntrezGene.pm.

In fact it is possible to provide reading of all NCBI's ASN.1-formatted
files through simple variations of the Entrez Gene parser (I need more
investigation to be sure, but at least the sequence parser works well).

Since demand for parsing NCBI ASN.1-formatted sequences is much lower than
for Entrez Gene, this module is more like a beta version that works on the
examples I checked, but I did not check all available records or data
definitions. The error-reporting function of this module should prove useful
at times. :)

=head1 ATTRIBUTES

=head2 maxerrstr

  Parameters: $maxerrstr (optional) - maximum number of characters after
                offending element, used by error reporting, default is 20
  Example:    $parser->maxerrstr(20);
  Function:   get/set maxerrstr.
  Returns:    maxerrstr.
  Notes:

=head2 input_file

  Parameters: $filename for file that contains Sequence record(s)
  Example:    $parser->input_file($filename);
  Function:   Takes in the name of a file containing Sequence records.
              Opens the file and stores the file handle.
  Returns:    none.
  Notes:      Attempts to open files larger than 2 GB even on a Perl that
                does not support 2 GB files (accomplished by calling "cat"
                and piping its output. On an OS that does not have "cat" an
                error message will be displayed)

=head1 METHODS

=head2 new

  Parameters: maxerrstr => 20 (optional) - maximum number of characters after
                offending element, used by error reporting, default is 20
              file or -file => $filename (optional) - name of the file to be
                parsed. call next_seq to parse!
              fh or -fh => $filehandle (optional) - handle of the file to be
                parsed.
  Example:    my $parser = Bio::ASN1::Sequence->new();
  Function:   Instantiate a parser object
  Returns:    Object reference
  Notes:      Setting file or fh will reset the line numbers etc. that are
                used for error reporting purposes, and seeking on the file
                handle would mess up the line numbers!

=head2 parse

  Parameters: $string that contains Sequence record,
              $trimopt (optional) that specifies how the data structure
                returned should be trimmed. 2 is recommended and is the
                default
              $noreset (optional) that specifies that the line number should
                not be reset
              DEPRECATED as external function!!! Do not call this function
                directly! Call next_seq() instead
  Example:    my $value = $parser->parse($text); # DEPRECATED as
                # external function!!! Do not call this function
                # directly! Call next_seq() instead
  Function:   Takes in a string representing a Sequence record, parses
                the record and returns a data structure.
  Returns:    A data structure containing all data items from the sequence
                record.
  Notes:      DEPRECATED as external function!!! Do not call this function
                directly! Call next_seq() instead
              $string should not contain 'Seq-entry ::= set' at beginning!

=head2 next_seq

  Parameters: $trimopt (optional) that specifies how the data structure
                returned should be trimmed.
                option 2 is recommended and is the default
  Example:    my $value = $parser->next_seq();
  Function:   Uses the file handle generated by input_file (or set through
                fh), parses the next record and returns a data structure.
  Returns:    A data structure containing all data items from the sequence
                record.
  Notes:      Must pass in a filename or file handle through new(),
                input_file() or fh() first!
              For details on how to use the $trimopt data trimming option
                please see the comments for the trimdata method. An option
                of 2 is recommended and is the default.
              The acceptable values for $trimopt include:
                1 - trim as much as possible
                2 (or 0, undef) - trim to an easy-to-use structure
                3 - no trimming (as of version 1.06; prior to version
                    1.06, 0 or undef meant no trimming)

=head2 trimdata

  Parameters: $hashref or $arrayref
              $trimflag (optional, see Notes)
  Example:    trimdata($datahash); # using the default flag
  Function:   recursively processes all attributes of a hash/array hybrid
                and gets rid of any arrayref that points to a one-element
                array (trims the data structure), depending on the optional
                flag.
  Returns:    none - trimming happens in place
  Notes:      This function is useful to compact a data structure produced by
                Bio::ASN1::Sequence::parse.
              The acceptable values for $trimflag include:
                1 - trim as much as possible
                2 (or 0, undef) - trim to an easy-to-use structure
                3 - no trimming (as of version 1.06; prior to version
                    1.06, 0 or undef meant no trimming)
              This function duplicates the one in EntrezGene.pm; the code
                should be consolidated in the future (using util module &
                subclass).

=head2 fh

  Parameters: $filehandle (optional)
  Example:    $parser->fh($filehandle);
  Function:   getter/setter for file handle
  Returns:    file handle for current file being parsed.
  Notes:      Use with care!
              The line numbers in error reports will not correspond to the
                file's line numbers if a seek operation is performed on the
                file handle!

=head2 rawdata

  Parameters: none
  Example:    my $data = $parser->rawdata();
  Function:   Get the raw text of the sequence record that was just parsed
  Returns:    a string containing the ASN1-formatted sequence record
  Notes:      Must first parse a record then call this function!
              Could be useful in interpreting the line number value in an
                error report (if the user did a seek on the file handle
                right before the parsing call)

=head1 INTERNAL METHODS

=head2 _parse

NCBI's Apr 05, 2005 format change forced heavy use of lookahead, which
certainly slows the parser down, but the new format cannot be handled
efficiently without it.

=head1 PREREQUISITE

None.

=head1 INSTALLATION

Bio::ASN1::Sequence is part of the Bio::ASN1::EntrezGene package.

The Bio::ASN1::EntrezGene package can be installed & tested as follows:

  perl Makefile.PL
  make
  make test
  make install

=head1 SEE ALSO

The parse_sequence_example.pl script included in this package (please
see the Bio-ASN1-EntrezGene-x.xx/examples directory) shows the usage.

Please check out perldoc for Bio::ASN1::EntrezGene for more info.

=head1 CITATION

Liu, Mingyi, and Andrei Grigoriev. "Fast parsers for Entrez Gene."
Bioinformatics 21, no. 14 (2005): 3189-3190.

=head1 OPERATING SYSTEMS SUPPORTED

Any OS that Perl runs on.

=head1 FEEDBACK

=head2 Mailing lists

User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to
the Bioperl mailing list. Your participation is much appreciated.

  bioperl-l@bioperl.org                  - General discussion
  http://bioperl.org/wiki/Mailing_lists  - About the mailing lists

=head2 Support

Please direct usage questions or support issues to the mailing list:
I<bioperl-l@bioperl.org>
rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.

=head2 Reporting bugs

Report bugs to the Bioperl bug tracking system to help us keep track
of the bugs and their resolution.
Bug reports can be submitted via the web: https://redmine.open-bio.org/projects/bioperl/ =head1 AUTHOR Dr. Mingyi Liu =head1 COPYRIGHT This software is copyright (c) 2005 by Mingyi Liu, 2005 by GPC Biotech AG, and 2005 by Altana Research Institute. This software is available under the same terms as the perl 5 programming language system itself. =cut release-pod-coverage.t100644000765000024 76512215135616 21461 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/t#!perl BEGIN { unless ($ENV{RELEASE_TESTING}) { require Test::More; Test::More::plan(skip_all => 'these tests are for release candidate testing'); } } use Test::More; eval "use Test::Pod::Coverage 1.08"; plan skip_all => "Test::Pod::Coverage 1.08 required for testing POD coverage" if $@; eval "use Pod::Coverage::TrustPod"; plan skip_all => "Pod::Coverage::TrustPod required for testing POD coverage" if $@; all_pod_coverage_ok({ coverage_class => 'Pod::Coverage::TrustPod' }); EntrezGene.pm100755000765000024 4733612215135616 21475 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/lib/Bio/ASN1package Bio::ASN1::EntrezGene; BEGIN { $Bio::ASN1::EntrezGene::AUTHORITY = 'cpan:BIOPERLML'; } { $Bio::ASN1::EntrezGene::VERSION = '1.70'; } use utf8; use strict; use warnings; use Carp qw(carp croak); # ABSTRACT: Regular expression-based Perl Parser for NCBI Entrez Gene. # AUTHOR: Dr. 
Mingyi Liu # OWNER: 2005 Mingyi Liu # OWNER: 2005 GPC Biotech AG # OWNER: 2005 Altana Research Institute # LICENSE: Perl_5 sub new { my $class = shift; $class = ref($class) if(ref($class)); my $self = { maxerrstr => 20, @_ }; bless $self, $class; map { $self->input_file($self->{$_}) if($self->{$_}) } qw(file -file); map { $self->fh($self->{$_}) if($self->{$_}) } qw(fh -fh); return $self; } sub maxerrstr { my ($self, $value) = @_; $self->{maxerrstr} = $value if $value > 0; return $self->{maxerrstr}; } sub parse { my ($self, $input, $compact, $noreset) = @_; $input || croak "must have input!\n"; $self->{input} = $input; $self->{filename} = "input" unless $self->{filename}; $self->{linenumber} = 1 unless $self->{linenumber} && $noreset; $self->{depth} = 0; my $result; eval { $result = $self->_parse(); # no need to reset $self->{depth} or linenumber }; if($@) { if($@ !~ /^Data Error:/) { croak "non-conforming data broke parser on line $self->{linenumber} in $self->{filename}\n". "possible cause includes randomly inserted brackets in input file before line $self->{linenumber}\n". "first $self->{maxerrstr} (or till end of input) characters including the non-conforming data:\n" . substr($self->{input}, pos($self->{input}), $self->{maxerrstr}) . "\nRaw error mesg: $@\n"; } else { die $@ } } trimdata($result, $compact); return $result; } sub input_file { my ($self, $filename) = @_; # in case user's Perl system can't handle large file. Assuming Unix, otherwise raise error local *IN; # older styled code to enable module to work with perl 5.005_03 open(*IN, $filename) || ($! =~ /too large/i && open(*IN, "cat $filename |")) || croak "can't open $filename! 
-- $!\n"; $self->{fh} = *IN; $self->{filename} = $filename; $self->{linenumber} = 0; # reset line number } sub next_seq { my ($self, $compact) = @_; $self->{fh} || croak "you must pass in a file name or handle through new() or input_file() first before calling next_seq!\n"; local $/ = "Entrezgene ::= {"; # set record separator while($_ = readline($self->{fh})) { chomp; next unless /\S/; my $tmp = (/^\s*Entrezgene(-Set)? ::= ({.*)/si)? $2 : "{" . $_; # get rid of the 'Entrezgene ::= ' at the beginning of Entrez Gene record return $self->parse($tmp, $compact, 1); # 1 species no resetting line number } } sub _parse { my ($self, $flag) = @_; my $data; while(1) { # changing orders of regex if/elsif statements made little difference. current order is close to optimal if($self->{input} =~ /\G[ \t]*,?[ \t]*\n/cg) # cleanup leftover { $self->{linenumber}++; next; } if($self->{input} =~ /\G[ \t]*}/cg) { if(!($self->{depth}--) && $self->{input} =~ /\S/) { croak "Data Error: extra (mismatched) '}' found on line $self->{linenumber} in $self->{filename}!\n"; } return $data } elsif($self->{input} =~ /\G[ \t]*{/cg) { $self->{depth}++; push(@$data, $self->_parse()) } elsif($self->{input} =~ /\G[ \t]*([\w-]+)(\s*)/cg) { my ($id, $lines) = ($1, $2); # we're prepared for NCBI to make the format even worse: # note: to count line numbers right for text files on different OS, I'm sacrificing much speed (maybe I shouldn't worry so much) $self->{linenumber} += $lines =~ s/\n//g || $lines =~ s/\r//g; # count by *NIX/Win or Mac my $tmp; # we put \s* in lookahead for linenumber counting purpose (which slows things down) if(($self->{input} =~ /\G"((?:[^"]+|"")*)"(?=\s*[,}])/cg && ++$tmp) || $self->{input} =~ /\G([\w-]+)(?=\s*[,}])/cg) { my $value = $1; if($tmp) # slight speed optimization, not really necessary since regex is fast enough { $value =~ s/""/"/g; $self->{linenumber} += $value =~ s/\n//g || $value =~ s/\r//g; # count by *NIX/Win or Mac $value =~ s/[\r\n]+//g; # in case it's Win 
format } if(ref($data->{$id})) { push(@{$data->{$id}}, $value) } # hash value is not a terminal (or have multiple values), create array to avoid multiple same-keyed hash overwrite each other elsif($data->{$id}) { $data->{$id} = [$data->{$id}, $value] } # hash value has a second terminal value now! else { $data->{$id} = $value } # the first terminal value } elsif($self->{input} =~ /\G{/cg) { $self->{depth}++; push(@{$data->{$id}}, $self->_parse()); } elsif($self->{input} =~ /\G(?=[,}])/cg) { push(@$data, $id) } else # must be "id value value" format { $self->{depth}++; push(@{$data->{$id}}, $self->_parse(1)) } if($flag) { if(!($self->{depth}--) && $self->{input} =~ /\S/) { croak "Data Error: extra (mismatched) '}' found on line $self->{linenumber} in $self->{filename}!\n"; } return $data; } } elsif($self->{input} =~ /\G[ \t]*"((?:[^"]+|"")*)"(?=\s*[,}])/cg) { my $value = $1; $value =~ s/""/"/g; $self->{linenumber} += $value =~ s/\n//g || $value =~ s/\r//g; # count by *NIX/Win or Mac $value =~ s/[\r\n]+//g; # in case it's Win format push(@$data, $value) } else # end of input { my ($pos, $len) = (pos($self->{input}), length($self->{input})); if($pos != $len && $self->{input} =~ /\G\s*\S/cg) # problem with parsing, must be non-conforming data { croak "Data Error: none conforming data found on line $self->{linenumber} in $self->{filename}!\n" . "first $self->{maxerrstr} (or till end of input) characters including the non-conforming data:\n" . substr($self->{input}, $pos, $self->{maxerrstr}) . "\n"; } elsif($self->{depth} > 0) { croak "Data Error: missing '}' found at end of input in $self->{filename}!"; } elsif($self->{depth} < 0) { croak "Data Error: extra (mismatched) '}' found at end of input in $self->{filename}!"; } return $data; } } } # following copied directly from my Pipeline::Util::util just to make this module # more self-sufficient. Changes should be made over in that module though. 
# trims arrayrefs that points to one-element array to slims the # data structure down (calls Pipeline::Util::util::trimdata) # something like # 'comments' => ARRAY(0x898be94) # 0 ARRAY(0x883fc54) # 0 ARRAY(0x886aef4) # 0 HASH(0x884d554) # 'heading' => 'LocusTagLink' # 'source' => ARRAY(0x8810714) # 0 ARRAY(0x8a7df18) # 0 ARRAY(0x889f940) # 0 HASH(0x886ada4) # 'src' => ARRAY(0x88454fc) # 0 ARRAY(0x8845598) # 0 HASH(0x898c0ec) # 'db' => 'HGNC' # 'tag' => ARRAY(0x898bfb4) # 0 HASH(0x898c164) # 'id' => 5 # becomes this if $flag == 1: # 'comments' => ARRAY(0x8840014) # 0 HASH(0x884d8a4) # 'heading' => 'LocusTagLink' # 'source' => HASH(0x8a9869c) # 'src' => HASH(0x884534c) # 'db' => 'HGNC' # 'tag' => HASH(0x88453c4) # 'id' => 5 # so now $hash->{comments}->[0]->[0]->[0]->{source}->[0]->[0]->[0]->{src}->[0]->[0]->{tag}->[0]->{id} # becomes $hash->{comments}->[0]->{source}->{src}->{tag}->{id} # this may create problem as array might suddenly change to hash depending on whether it # has multiple elements or not. 
So set $flag to 2 or 0/undef would disallow trimming that # would lead to data type change, thus resulting in data structure like: # 'comments' => ARRAY(0x88617e8) # 0 HASH(0x889d578) # 'heading' => 'LocusTagLink' # 'source' => ARRAY(0x8912244) # 0 HASH(0x8a5d648) # 'src' => ARRAY(0x8a2203c) # 0 HASH(0x8a1af10) # 'db' => 'HGNC' # 'tag' => ARRAY(0x8a1add8) # 0 HASH(0x8a1af88) # 'id' => 5 # still not the safest, but saves some hassle writing code sub trimdata { my ($ref, $flag) = @_; $flag = 2 unless $flag; return if $flag == 3 || !ref($ref); if(ref($ref) ne 'ARRAY') # allows for object refs { my @keys; eval { @keys = keys %$ref }; # let's be careful and check if it can work as a hash return if $@; foreach my $key (@keys) { my $tmp = $ref->{$key}; while(ref($tmp) eq 'ARRAY' && @$tmp == 1) { last if($flag == 2 && ref($tmp->[0]) ne 'ARRAY'); $tmp = $tmp->[0] } $ref->{$key} = $tmp; trimdata($ref->{$key}, $flag) if(ref($ref->{$key})) } } else { # since the only situations where we would get an array of array is # when ASN file has a bracket of brackets (otherwise we'd get at least # a hash), it makes sense to reduce the arrayrefs to one level foreach my $item (@$ref) { my $tmp = $item; while(ref($tmp) eq 'ARRAY' && @$tmp == 1) { $tmp = $tmp->[0]; } $item = $tmp; trimdata($item, $flag) if(ref($item)) } } } sub fh { my ($self, $filehandle) = @_; if($filehandle) { $self->{fh} = $filehandle; $self->{linenumber} = 0; # reset line number } return $self->{fh}; } sub rawdata { my $self = shift; return "Entrezgene ::= $self->{input}"; } 1; __END__ =pod =encoding utf-8 =head1 NAME Bio::ASN1::EntrezGene - Regular expression-based Perl Parser for NCBI Entrez Gene. 
=head1 VERSION

version 1.70

=head1 SYNOPSIS

  use Bio::ASN1::EntrezGene;

  my $parser = Bio::ASN1::EntrezGene->new('file' => "Homo_sapiens");
  while(my $result = $parser->next_seq)
  {
    # extract data from $result, or Dumpvalue->new->dumpValue($result);
  }

  # a new way to get the $result data hash for a particular gene id:
  use Bio::ASN1::EntrezGene::Indexer;
  my $inx = Bio::ASN1::EntrezGene::Indexer->new(-filename => 'entrezgene.idx');
  my $seq = $inx->fetch_hash(10); # returns $result for Entrez Gene record
                                  # with geneid 10
  # note that the index file 'entrezgene.idx' can be created as follows
  my $inx = Bio::ASN1::EntrezGene::Indexer->new(
    -filename => 'entrezgene.idx',
    -write_flag => 'WRITE');
  $inx->make_index('Homo_sapiens', 'Mus_musculus'); # files come from NCBI download

  # for more detail please refer to Bio::ASN1::EntrezGene::Indexer perldoc

=head1 DESCRIPTION

Bio::ASN1::EntrezGene is a regular expression-based Perl Parser for NCBI Entrez
Gene genome databases (L). It parses an ASN.1-formatted Entrez Gene record and
returns a data structure that contains all data items from the gene record.

The parser reports an error and the line number if the input data does not
conform to the NCBI Entrez Gene genome annotation file format.

Note that it is possible to provide reading of all NCBI's ASN.1-formatted
files through simple variations of the Entrez Gene parser (I need more
investigation to be sure, but at least the sequence parser is a very simple
variation on the Entrez Gene parser and works well).

Version 1.0 of the parser took 11 minutes to parse the human genome Entrez
Gene file on one 2.4 GHz Intel Xeon processor. The addition of validation and
error reporting in 1.03 and the handling of the new Entrez Gene format slowed
the parser down by about 40%.

Since V1.07, this package has also included an indexer that runs pretty fast
(it takes 21 seconds for the indexer to index the human genome on the same
processor).
Therefore, the combination of the modules allows users to retrieve and parse
arbitrary records.

=head1 ATTRIBUTES

=head2 maxerrstr

  Parameters: $maxerrstr (optional) - maximum number of characters after
              offending element, used by error reporting, default is 20
  Example:    $parser->maxerrstr(20);
  Function:   get/set maxerrstr.
  Returns:    maxerrstr.
  Notes:

=head2 input_file

  Parameters: $filename for file that contains Entrez Gene record(s)
  Example:    $parser->input_file($filename);
  Function:   Takes in the name of a file containing Entrez Gene records.
              Opens the file and stores the file handle
  Returns:    none.
  Notes:      Attempts to open files larger than 2 GB even on a Perl that
              does not support 2 GB files (accomplished by calling "cat" and
              piping its output. On an OS that does not have "cat", an error
              message will be displayed)

=head1 METHODS

=head2 new

  Parameters: maxerrstr => 20 (optional) - maximum number of characters after
              offending element, used by error reporting, default is 20
              file or -file => $filename (optional) - name of the file to be
              parsed. call next_seq to parse!
              fh or -fh => $filehandle (optional) - handle of the file to be
              parsed.
  Example:    my $parser = Bio::ASN1::EntrezGene->new();
  Function:   Instantiate a parser object
  Returns:    Object reference
  Notes:      Setting file or fh will reset line numbers etc. that are used
              for error reporting purposes, and seeking on the file handle
              would mess up line numbers!

=head2 parse

  Parameters: $string that contains Entrez Gene record,
              $trimopt (optional) that specifies how the data structure
              returned should be trimmed. 2 is recommended and default
              $noreset (optional) that specifies that the line number should
              not be reset
              DEPRECATED as external function!!! Do not call this function
              directly!  Call next_seq() instead
  Example:    my $value = $parser->parse($text); # DEPRECATED as
              # external function!!! Do not call this function
              # directly!  Call next_seq() instead
  Function:   Takes in a string representing Entrez Gene record, parses
              the record and returns a data structure.
  Returns:    A data structure containing all data items from the Entrez
              Gene record.
  Notes:      DEPRECATED as external function!!! Do not call this function
              directly!  Call next_seq() instead
              $string should not contain 'Entrezgene ::=' at beginning!

=head2 next_seq

  Parameters: $trimopt (optional) that specifies how the data structure
              returned should be trimmed. option 2 is recommended and
              default
  Example:    my $value = $parser->next_seq();
  Function:   Uses the file handle generated by input_file, parses the next
              record and returns a data structure.
  Returns:    A data structure containing all data items from the Entrez
              Gene record.
  Notes:      Must pass in a filename through new() or input_file() first!
              For details on how to use the $trimopt data trimming option
              please see comment for the trimdata method. An option
              of 2 is recommended and default
              The acceptable values for $trimopt include:
                1 - trim as much as possible
                2 (or 0, undef) - trim to an easy-to-use structure
                3 - no trimming (as of version 1.06; prior to version
                    1.06, 0 or undef meant no trimming)

=head2 trimdata

  Parameters: $hashref or $arrayref
              $trimflag (optional, see Notes)
  Example:    trimdata($datahash); # using the default flag
  Function:   recursively process all attributes of a hash/array
              hybrid and get rid of any arrayref that points to
              one-element arrays (trims data structure) depending on
              the optional flag.
  Returns:    none - trimming happens in-place
  Notes:      This function is useful to compact a data structure produced by
              Bio::ASN1::EntrezGene::parse.
              The acceptable values for $trimflag include:
                1 - trim as much as possible
                2 (or 0, undef) - trim to an easy-to-use structure
                3 - no trimming (as of version 1.06; prior to version
                    1.06, 0 or undef meant no trimming)

=head2 fh

  Parameters: $filehandle (optional)
  Example:    my $fh = $parser->fh();   # get the current file handle
              $parser->fh($filehandle); # set a new file handle
  Function:   getter/setter for file handle
  Returns:    file handle for current file being parsed.
  Notes:      Use with care!
              The reported line number will not correspond to the file's
              line number if a seek operation is performed on the file
              handle!

=head2 rawdata

  Parameters: none
  Example:    my $data = $parser->rawdata();
  Function:   Get the raw Entrez Gene record that was just parsed
  Returns:    a string containing the ASN1-formatted Entrez Gene record
  Notes:      Must first parse a record then call this function!
              Could be useful in interpreting line number value in error
              report (if user did a seek on file handle right before parsing
              call)

=head1 INTERNAL METHODS

=head2 _parse

NCBI's Apr 05, 2005 format change forced heavy use of lookahead, which
certainly slows the parser down, but the new format cannot be handled
efficiently without it.

=head1 PREREQUISITE

None.

=head1 INSTALLATION

Bio::ASN1::EntrezGene package can be installed & tested as follows:

  perl Makefile.PL
  make
  make test
  make install

=head1 SEE ALSO

The parse_entrez_gene_example.pl script included in this package (please
see the Bio-ASN1-EntrezGene-x.xx/examples directory) is a very
important and near-complete demo on using this module to extract all data
items from Entrez Gene records. Do check it out because, in fact, this script
took me about 3-4 times longer to write for my project than the parser V1.0
itself. Note that the example script was edited to leave out stuff specific to
my internal project.

For details on the various parsers I generated for Entrez Gene and the example
scripts that use/benchmark the modules, please see L. Those other parsers etc.
are included in the V1.05 download.

=head1 CITATION

Liu, Mingyi, and Andrei Grigoriev. "Fast parsers for Entrez Gene."
Bioinformatics 21, no. 14 (2005): 3189-3190.

=head1 OPERATION SYSTEMS SUPPORTED

Any OS that Perl runs on.

=head1 FEEDBACK

=head2 Mailing lists

User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to
the Bioperl mailing list.  Your participation is much appreciated.

  bioperl-l@bioperl.org                  - General discussion
  http://bioperl.org/wiki/Mailing_lists  - About the mailing lists

=head2 Support

Please direct usage questions or support issues to the mailing list:
I<bioperl-l@bioperl.org>
rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.

=head2 Reporting bugs

Report bugs to the Bioperl bug tracking system to help us keep track
of the bugs and their resolution. Bug reports can be submitted via the
web:

  https://redmine.open-bio.org/projects/bioperl/

=head1 AUTHOR

Dr. Mingyi Liu

=head1 COPYRIGHT

This software is copyright (c) 2005 by Mingyi Liu, 2005 by GPC Biotech AG, and 2005 by Altana Research Institute.

This software is available under the same terms as the perl 5 programming language system itself.

=cut
regex_parser_test.pl100755000765000024 150212215135616 22751 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/examples
#!/usr/bin/perl
# launch it like "perl regex_parser_test.pl Homo_sapiens" (Homo_sapiens can be downloaded
# and decompressed from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN/Mammalia/Homo_sapiens.gz)
# or use the included test file "perl regex_parser_test.pl ../t/input.asn"
use strict;
use Dumpvalue;
use Bio::ASN1::EntrezGene;
use Benchmark;

my $parser = Bio::ASN1::EntrezGene->new(file => $ARGV[0]); # instantiate a parser object
my ($t0, $end, $i) = (Benchmark->new, 10, 0); # process the first 10 records in the input file
while(my $value = $parser->next_seq)
{
#  Dumpvalue->new->dumpValue($value); # uncomment to dump the data structure out
  last if ++$i >= $end; # stop after the first $end (10) records
}
my $t1 = Benchmark->new;
print "The first $i records in $ARGV[0] took EntrezGene parser:",timestr(timediff($t1, $t0)),"\n";
Sequence000755000765000024 012215135616 20441
5ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/lib/Bio/ASN1
Indexer.pm100755000765000024 1446112215135616 22566 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/lib/Bio/ASN1/Sequence
package Bio::ASN1::Sequence::Indexer;
BEGIN {
  $Bio::ASN1::Sequence::Indexer::AUTHORITY = 'cpan:BIOPERLML';
}
{
  $Bio::ASN1::Sequence::Indexer::VERSION = '1.70';
}
use utf8;
use strict;
use warnings;
use Carp qw(carp croak);
use Bio::ASN1::Sequence;
use Bio::Index::AbstractSeq;
use parent qw(Bio::Index::AbstractSeq);

# ABSTRACT: Indexes NCBI Sequence files.
# AUTHOR:   Dr. Mingyi Liu
# OWNER:    2005 Mingyi Liu
# OWNER:    2005 GPC Biotech AG
# OWNER:    2005 Altana Research Institute
# LICENSE:  Perl_5

# TODO: Should this be deprecated?

sub _version
{
  return $Bio::Index::AbstractSeq::VERSION;
}

sub _type_stamp
{
  return '__Sequence_ASN1__';
}

sub _index_file {
  my($self, $file, $idx) = @_;
  my $position;
  open(IN, $file) || $self->throw("Can't open $file - $!");
  local $/ = "Seq-entry ::= set {"; # read one record per iteration
  while(<IN>)
  {
    chomp;
    while(/[,{}]\s+accession\s*"([^"]+)"\s+[,{}]/ig) # add both dna and protein
    {
      $self->add_record($1, $idx, $position);
    }
    $position = tell(IN) - 19; # $/'s length
  }
  close(IN);
  return 1;
}

sub _file_format
{
  return 'sequence';
}

sub fetch_hash
{
  my ($self, $seqid) = @_;
  if (my $seq = $self->db->{$seqid})
  {
    my ($fileno, $position) = $self->unpack_record($seq);
    my $parser = Bio::ASN1::Sequence->new('fh' => $self->_file_handle($fileno));
    seek($parser->fh, $position, 0);
    return $parser->next_seq;
  }
}

sub _file_handle
{
  my( $self, $i ) = @_;
  unless ($self->{'_filehandle'}[$i])
  {
    my @rec = $self->unpack_record($self->db->{"__FILE_$i"})
      or $self->throw("Can't get filename for index : $i");
    my $file = $rec[0];
    local *FH;
    open *FH, $file or $self->throw("Can't read file '$file' : $!");
    $self->{'_filehandle'}[$i] = *FH; # Cache filehandle
  }
  return $self->{'_filehandle'}[$i];
}

1;

__END__

=pod

=encoding utf-8

=head1 NAME

Bio::ASN1::Sequence::Indexer - Indexes NCBI Sequence files.

=head1 VERSION

version 1.70

=head1 SYNOPSIS

  use Bio::ASN1::Sequence::Indexer;

  # creating & using the index is just a few lines
  my $inx = Bio::ASN1::Sequence::Indexer->new(
    -filename => 'seq.idx',
    -write_flag => 'WRITE'); # needed for make_index call, but if opening
                             # existing index file, don't set write flag!
  $inx->make_index('seq1.asn', 'seq2.asn');
  my $seq = $inx->fetch('AF093062'); # Bio::Seq obj for Sequence (doesn't work yet)
  # alternatively, if one prefers just a data structure instead of objects
  $seq = $inx->fetch_hash('AF093062'); # a hash produced by Bio::ASN1::Sequence
                                       # that contains all data in the Sequence record

=head1 DESCRIPTION

Bio::ASN1::Sequence::Indexer is a Perl Indexer for NCBI Sequence genome
databases. It processes an ASN.1-formatted Sequence record and stores the
file position for each record in a way compliant with the Bioperl standard (in
fact it's a subclass of Bioperl's index objects).

Note that this module does not parse records, because it needs to run fast and
grab only the accessions. For parsing records, use Bio::ASN1::Sequence.

As with Bio::ASN1::Sequence, this module is best thought of as a beta version -
it works, but is not fully tested.

=head1 METHODS

=head2 fetch

  Parameters: $seqid - id for the Sequence record to be retrieved
  Example:    my $seq = $indexer->fetch('AF093062');
  Function:   fetch the data for the given Sequence id.
  Returns:    A Bio::Seq object produced by Bio::SeqIO::sequence
  Notes:      Bio::SeqIO::sequence does not exist and probably won't exist
              for a while! So call fetch_hash instead

=head2 fetch_hash

  Parameters: $seqid - id for the Sequence record to be retrieved
  Example:    my $hash = $indexer->fetch_hash('AF093062');
  Function:   fetch a hash produced by Bio::ASN1::Sequence for given id
  Returns:    A data structure containing all data items from the Sequence
              record.
  Notes:      Alternative to fetch()

=head1 INTERNAL METHODS

=head2 _version

=head2 _type_stamp

=head2 _index_file

=head2 _file_format

=head2 _file_handle

  Title   : _file_handle
  Usage   : $fh = $index->_file_handle( INT )
  Function: Returns an open filehandle for the file
            index INT.  On opening a new filehandle it
            caches it in the @{$index->_filehandle} array.
            If the requested filehandle is already open,
            it simply returns it from the array.
  Example : $first_file_indexed = $index->_file_handle( 0 );
  Returns : ref to a filehandle
  Args    : INT
  Notes   : This function is copied from Bio::Index::Abstract. Once that module
            changes its file handle code the way this module does to fit
            perl 5.005_03, this sub will be removed from this module

=head1 PREREQUISITE

Bio::ASN1::Sequence, Bioperl and all dependencies therein.

=head1 INSTALLATION

Same as Bio::ASN1::EntrezGene

=head1 SEE ALSO

Please check out perldoc for Bio::ASN1::EntrezGene for more info.

=head1 CITATION

Liu, Mingyi, and Andrei Grigoriev. "Fast parsers for Entrez Gene."
Bioinformatics 21, no. 14 (2005): 3189-3190.

=head1 OPERATION SYSTEMS SUPPORTED

Any OS that Perl & Bioperl run on.

=head1 FEEDBACK

=head2 Mailing lists

User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to
the Bioperl mailing list.  Your participation is much appreciated.

  bioperl-l@bioperl.org                  - General discussion
  http://bioperl.org/wiki/Mailing_lists  - About the mailing lists

=head2 Support

Please direct usage questions or support issues to the mailing list:
I<bioperl-l@bioperl.org>
rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.

=head2 Reporting bugs

Report bugs to the Bioperl bug tracking system to help us keep track
of the bugs and their resolution.
Bug reports can be submitted via the web:

  https://redmine.open-bio.org/projects/bioperl/

=head1 AUTHOR

Dr. Mingyi Liu

=head1 COPYRIGHT

This software is copyright (c) 2005 by Mingyi Liu, 2005 by GPC Biotech AG, and 2005 by Altana Research Institute.

This software is available under the same terms as the perl 5 programming language system itself.

=cut
parse_sequence_example.pl100755000765000024 131012215135616 23736 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/examples
#!/usr/bin/perl
# launch it like "perl parse_sequence_example.pl seq.asn1"
# one can use the included test file "perl parse_sequence_example.pl ../t/seq.asn"
use strict;
use Dumpvalue;
use Bio::ASN1::Sequence;
use Benchmark;

my $parser = Bio::ASN1::Sequence->new(file => $ARGV[0]); # instantiate a parser object
my ($t0, $end, $i) = (Benchmark->new, 10, 0); # process the first 10 records in the input file
while(my $value = $parser->next_seq)
{
  Dumpvalue->new->dumpValue($value); # comment out to skip dumping the data structure
  last if ++$i >= $end; # stop after the first $end (10) records
}
my $t1 = Benchmark->new;
print "The first $i records in $ARGV[0] took Sequence parser:",timestr(timediff($t1, $t0)),"\n";
EntrezGene000755000765000024 012215135616 20737 5ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/lib/Bio/ASN1
Indexer.pm100755000765000024 1541212215135616 23061 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/lib/Bio/ASN1/EntrezGene
package Bio::ASN1::EntrezGene::Indexer;
BEGIN {
  $Bio::ASN1::EntrezGene::Indexer::AUTHORITY = 'cpan:BIOPERLML';
}
{
  $Bio::ASN1::EntrezGene::Indexer::VERSION = '1.70';
}
use utf8;
use strict;
use warnings;
use Carp qw(carp croak);
use Bio::ASN1::EntrezGene;
use Bio::Index::AbstractSeq;
use parent qw(Bio::Index::AbstractSeq);

# ABSTRACT: Indexes NCBI Entrez Gene files.
# AUTHOR:   Dr. Mingyi Liu
# OWNER:    2005 Mingyi Liu
# OWNER:    2005 GPC Biotech AG
# OWNER:    2005 Altana Research Institute
# LICENSE:  Perl_5

# TODO: Should this be deprecated?
sub _version
{
  return $Bio::Index::AbstractSeq::VERSION;
}

sub _type_stamp
{
  return '__EntrezGene_ASN1__';
}

sub _index_file {
  my($self, $file, $idx) = @_;
  my $position;
  open(IN, $file) || $self->throw("Can't open $file - $!");
  local $/ = "Entrezgene ::= {"; # read one record per iteration
  while(<IN>)
  {
    chomp;
    $self->add_record($1, $idx, $position) if (/[,{}]\s+geneid\s*(\d+)\s+[,{}]/i);
    $position = tell(IN) - 16; # $/'s length
  }
  close(IN);
  return 1;
}

sub _file_format
{
  return 'entrezgene';
}

sub fetch_hash
{
  my ($self, $geneid) = @_;
  if (my $gene = $self->db->{$geneid})
  {
    my ($fileno, $position) = $self->unpack_record($gene);
    my $parser = Bio::ASN1::EntrezGene->new('fh' => $self->_file_handle($fileno));
    seek($parser->fh, $position, 0);
    return $parser->next_seq;
  }
}

sub _file_handle
{
  my( $self, $i ) = @_;
  unless ($self->{'_filehandle'}[$i])
  {
    my @rec = $self->unpack_record($self->db->{"__FILE_$i"})
      or $self->throw("Can't get filename for index : $i");
    my $file = $rec[0];
    local *FH;
    open *FH, $file or $self->throw("Can't read file '$file' : $!");
    $self->{'_filehandle'}[$i] = *FH; # Cache filehandle
  }
  return $self->{'_filehandle'}[$i];
}

1;

__END__

=pod

=encoding utf-8

=head1 NAME

Bio::ASN1::EntrezGene::Indexer - Indexes NCBI Entrez Gene files.

=head1 VERSION

version 1.70

=head1 SYNOPSIS

  use Bio::ASN1::EntrezGene::Indexer;

  # creating & using the index is just a few lines
  my $inx = Bio::ASN1::EntrezGene::Indexer->new(
    -filename => 'entrezgene.idx',
    -write_flag => 'WRITE'); # needed for make_index call, but if opening
                             # existing index file, don't set write flag!
  $inx->make_index('Homo_sapiens', 'Mus_musculus', 'Rattus_norvegicus');
  my $seq = $inx->fetch(10); # Bio::Seq obj for Entrez Gene #10
  # alternatively, if one prefers just a data structure instead of objects
  $seq = $inx->fetch_hash(10); # a hash produced by Bio::ASN1::EntrezGene
                               # that contains all data in the Entrez Gene record

  # note that in case you wonder, you can get the files 'Homo_sapiens'
  # from NCBI Entrez Gene ftp download, DATA/ASN/Mammalia directory

=head1 DESCRIPTION

Bio::ASN1::EntrezGene::Indexer is a Perl Indexer for NCBI Entrez Gene genome
databases. It processes an ASN.1-formatted Entrez Gene record and stores the
file position for each record in a way compliant with the Bioperl standard (in
fact it's a subclass of Bioperl's index objects).

Note that this module does not parse records, because it needs to run fast and
grab only the gene ids. For parsing records, use Bio::ASN1::EntrezGene, or
better yet, use Bio::SeqIO, format 'entrezgene'.

It takes this module (version 1.07) 21 seconds to index the human genome Entrez
Gene file (Apr. 5/2005 download) on one 2.4 GHz Intel Xeon processor.

=head1 METHODS

=head2 fetch

  Parameters: $geneid - id for the Entrez Gene record to be retrieved
  Example:    my $seq = $indexer->fetch(10); # get Entrez Gene #10
  Function:   fetch the data for the given Entrez Gene id.
  Returns:    A Bio::Seq object produced by Bio::SeqIO::entrezgene
  Notes:      One needs to have Bio::SeqIO::entrezgene installed before
              calling this function!

=head2 fetch_hash

  Parameters: $geneid - id for the Entrez Gene record to be retrieved
  Example:    my $hash = $indexer->fetch_hash(10); # get Entrez Gene #10
  Function:   fetch a hash produced by Bio::ASN1::EntrezGene for given Entrez
              Gene id.
  Returns:    A data structure containing all data items from the Entrez
              Gene record.
  Notes:      Alternative to fetch()

=head1 INTERNAL METHODS

=head2 _version

=head2 _type_stamp

=head2 _index_file

=head2 _file_format

=head2 _file_handle

  Title   : _file_handle
  Usage   : $fh = $index->_file_handle( INT )
  Function: Returns an open filehandle for the file
            index INT.  On opening a new filehandle it
            caches it in the @{$index->_filehandle} array.
            If the requested filehandle is already open,
            it simply returns it from the array.
  Example : $first_file_indexed = $index->_file_handle( 0 );
  Returns : ref to a filehandle
  Args    : INT
  Notes   : This function is copied from Bio::Index::Abstract. Once that module
            changes its file handle code the way this module does to fit
            perl 5.005_03, this sub will be removed from this module

=head1 PREREQUISITE

Bio::ASN1::EntrezGene, Bioperl version that contains Stefan Kirov's
entrezgene.pm and all dependencies therein.

=head1 INSTALLATION

Same as Bio::ASN1::EntrezGene

=head1 SEE ALSO

For details on the various parsers I generated for Entrez Gene and the example
scripts that use/benchmark the modules, please see L. Those other parsers etc.
are included in the V1.05 download.

=head1 CITATION

Liu, Mingyi, and Andrei Grigoriev. "Fast parsers for Entrez Gene."
Bioinformatics 21, no. 14 (2005): 3189-3190.

=head1 OPERATION SYSTEMS SUPPORTED

Any OS that Perl & Bioperl run on.

=head1 FEEDBACK

=head2 Mailing lists

User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to
the Bioperl mailing list.  Your participation is much appreciated.

  bioperl-l@bioperl.org                  - General discussion
  http://bioperl.org/wiki/Mailing_lists  - About the mailing lists

=head2 Support

Please direct usage questions or support issues to the mailing list:
I<bioperl-l@bioperl.org>
rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.

=head2 Reporting bugs Report bugs to the Bioperl bug tracking system to help us keep track of the bugs and their resolution. Bug reports can be submitted via the web: https://redmine.open-bio.org/projects/bioperl/ =head1 AUTHOR Dr. Mingyi Liu =head1 COPYRIGHT This software is copyright (c) 2005 by Mingyi Liu, 2005 by GPC Biotech AG, and 2005 by Altana Research Institute. This software is available under the same terms as the perl 5 programming language system itself. =cut parse_entrez_gene_example.pl100755000765000024 6167612215135616 24500 0ustar00cjfieldsstaff000000000000Bio-ASN1-EntrezGene-1.70/examples#!/usr/bin/perl # launch it like "perl parse_entrez_gene_example.pl Homo_sapiens" (Homo_sapiens can be downloaded # and decompressed from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN/Mammalia/Homo_sapiens.gz) # or use the included test file "perl parse_entrez_gene_example.pl ../t/input.asn" ################################################################################ # parse_entrez_gene_example # Purpose: Demonstrates how to use Mingyi's Entrez Gene parser and retrieve # each data item from Entrez Gene. This data extraction demo is # very important as I spent 3-4 times more time on this script # than on writing, debugging, profiling, optimizing my parser! # It's a tedious task and I hope this script helps you (I'm sure # it will, if you use my parser). # NOTE!!! This example script shows where each data item from Entrez Gene # is in the data structure, but it does not store much data at all! # Therefore you will find little data in the dumpValue($gene) gene call # That's because data storage is project specific. Please store # the extracted data items in your script according to your own plan. # # Although the author tries to show how to get all data out of Entrez Gene, # there is no guarantee that all data from all versions of Entrez Gene will # be extracted using this script. # Copyright: (c) 2005, Mingyi Liu, GPC Biotech, Altana Research Institute. 
# License: this code is licensed under Perl itself or GPL. # Citation: Liu, M and Grigoriev, A (2005) "Fast Parsers for Entrez Gene" # Bioinformatics. In press ################################################################################ use strict; use Dumpvalue; use Benchmark; use Bio::ASN1::EntrezGene; my $parser = Bio::ASN1::EntrezGene->new('file' => $ARGV[0]); my $i = 0; while(my $result = $parser->next_seq) { unless(defined $result) # this never happens, but doesn't hurt { print STDERR "bad text for round #".++$i."!\n"; next; } $result = $result->[0] if(ref($result) eq 'ARRAY'); # this should always be true Dumpvalue->new->dumpValue($result); # $result contains all Entrez Gene data my $gene = makegene($result); Dumpvalue->new->dumpValue($gene); # although data are extracted, very few are stored (and thus dumped) after I edited out project-specific stuff, user should decide about the storage themselves last; } ################################################################################# # NOTE!!! this is just example that shows where each data item from Entrez Gene # is in the data structure, therefore I did not store all data items! But # for your own script, you'd probably want to store them. # sorry I don't have time to add more comments! I hope they're easy to understand. 
sub makegene { my $seq = shift; # Dumpvalue->new->dumpValue($seq);exit; my $geneid = safeval($seq, '{track-info}->[0]->{geneid}'); die "no geneid found!\n" unless $geneid; # this never happens, but doesn't hurt my (%protaccs, %protgis); my $llgene = {}; ################################################################################### # it is difficult to process Entrez Gene in event-triggered functions, so # we'll just process items one by one to pick & choose the ones we want safeassign($llgene, 'description', $seq, '{gene}->[0]->{desc}'); safeassign($llgene, 'type', $seq, '{type}'); safeassign($llgene, 'symbol', $seq, '{gene}->[0]->{locus}'); # may be overwritten map { push(@{$llgene->{genenames}}, $_) } @{$seq->{gene}->[0]->{syn}} if(safeval($seq, '{gene}->[0]->{syn}')); $llgene->{summary} = $seq->{summary} if($seq->{summary}); $llgene->{chromosome} = $seq->{source}->[0]->{subtype}->[0]->{name} if(safeval($seq, '{source}->[0]->{subtype}->[0]->{subtype}') eq 'chromosome'); safeassign($llgene, 'chrmap', $seq, '{gene}->[0]->{maploc}'); addxrefs($llgene, 'HGNC', $1) if(safeval($seq, '{gene}->[0]->{locus-tag}') =~ /HGNC:(\S+)/i); ########################################## # HomoloGene if($seq->{homology}) { # Entrez Gene documentation seems to have errors on this one (no label, text, anchor, # otherwise we could provide some description) # NOTE!!! 
    # species could be multiple species separated by ','
    my $id = safeval($seq, '{homology}->[0]->{source}->[0]->{src}->[0]->{tag}->[0]->{id}');
    my $species = safeval($seq, '{homology}->[0]->{heading}');
  }

  ##########################################
  # OK, let's process the comments
  my $allseqs = {}; # refseq seqs
  foreach my $comment (@{$seq->{comments}})
  {
    # pubmed ids
    addxrefs($llgene, 'PUBMED', $comment->{refs}->[0]->{pmid}) if(ref($comment->{refs}) eq 'ARRAY');
    # a ternary is used here because "my $x = ... if ..." has undefined behavior in Perl
    my $status = ($comment->{heading} eq 'RefSeq Status') ? $comment->{label} : undef;

    #####################
    # STS stuff
    if($comment->{heading} =~ /Markers.*STS\)$/)
    {
      foreach my $c (@{$comment->{comment}})
      {
        addxref($llgene, 'UNISTS',
                id => safeval($c, '{source}->[0]->{anchor}'),
                acc => safeval($c, '{source}->[0]->{src}->[0]->{tag}->[0]->{id}'));
      }
    }

    ##############################################################
    # refseq stuff, DNA=>protein, domain info, we store it first
    # and assemble info for future processing
    # of variants, transcripts and proteins, (dealing with these objects
    # is by far the most time-consuming task in extracting Entrez Gene data)
    if($comment->{heading} =~ /\(RefSeq\)$/ && $comment->{products})
    {
      foreach my $product (@{$comment->{products}})
      {
        my ($acc, $gi, $trans, %ids);
        # we are probably too careful below since refseq should have accession AND gi
        $acc = $product->{accession};
        $gi = safeval($product, '{source}->[0]->{src}->[0]->{tag}->[0]->{id}');
        if($acc)
        {
          $ids{acc} = $acc;
          $trans->{acc} = $ids{acc};
          $trans->{gi} = $gi if $gi;
          $allseqs->{$acc} = $trans;
          $trans->{type} = $product->{heading} if($product->{type} ne 'mRNA');
        }
        $ids{gi} = $gi if($gi);
        if($product->{comment})
        {
          foreach my $c (@{$product->{comment}})
          {
            # check assembly info
            if($c->{heading} eq 'Source Sequence')
            {
              $trans->{assembly} = safeval($c, '{source}->[0]->{anchor}');
            }
            # check variant info
            if($c->{heading} eq 'Transcriptional Variant')
            {
              $trans->{variantcomment} = safeval($c, '{comment}->[0]->{text}');
            }
          }
        }
        #############################
        # now deal with protein
        if(ref($product->{products}) eq 'ARRAY') # protein
        {
          for(my $j = 0; $j < @{$product->{products}}; $j++) {
            my $prot = {}; # fresh hash per protein so that entries are not shared
            my $p = $product->{products}->[$j];
            $acc = $p->{accession};
            $gi = safeval($p, '{source}->[0]->{src}->[0]->{tag}->[0]->{id}');
            if($gi) {
              $ids{'protgi' . ($j+1)} = $gi;
              $protgis{$gi} = 'protgi' . ($j+1);
            }
            if($acc) {
              $ids{'protacc' . ($j+1)} = $acc;
              $protaccs{$acc} = 'protacc' . ($j+1); # accessions go into %protaccs, not %protgis
              $prot->{acc} = $acc;
              $prot->{gi} = $gi if $gi;
              $trans->{protein}->{$acc} = $prot;
              $prot->{type} = $product->{heading} if($p->{type} ne 'peptide');
            }
            safeassign($prot, 'name', $p, '{source}->[0]->{post-text}');

            #############################
            # check domain info
            if($p->{comment}) {
              foreach my $c1 (@{$p->{comment}}) {
                if($c1->{heading} =~ /Domains$/) {
                  my @domains;
                  foreach my $c2 (@{$c1->{comment}}) {
                    my $dom;
                    my $tmp = safeval($c2, '{comment}->[0]->{text}');
                    # list context is needed to capture the score itself rather than the match flag
                    ($dom->{score}) = $tmp =~ /Blast Score: ([0-9.-]+)/;
                    ($dom->{start}, $dom->{end}) = $tmp =~ /Location: (\d+) - (\d+)/;
                    $dom->{desc} = safeval($c2, '{source}->[0]->{anchor}');
                    $dom->{xref} = {
                      db => safeval($c2, '{source}->[0]->{src}->[0]->{db}'),
                      ids => [{type=>'id',id=>safeval($c2, '{source}->[0]->{src}->[0]->{tag}->[0]->{id}')}]
                    };
                    push(@domains, $dom);
                  }
                  $prot->{dom} = \@domains;
                }
                elsif($c1->{heading} =~ /\(CCDS\)/) # CCDS database xref
                {
                  $prot->{ccds} = safeval($c1, '{source}->[0]->{src}->[0]->{tag}->[0]->{str}');
                }
              }
            }
            ################
            # end domain
          }
        }
        ######################
        # end protein
        addxref($llgene, 'REFSEQ', %ids) if (keys %ids > 0);
      }
    }

    #########################################################
    # related sequences, goes into xref only
    if($comment->{heading} eq 'Related Sequences' && $comment->{products}) {
      foreach my $product (@{$comment->{products}}) {
        my ($acc, $gi, %ids);
        $acc = $product->{accession};
        $gi = safeval($product, '{source}->[0]->{src}->[0]->{tag}->[0]->{id}');
        $ids{gi} = $gi if $gi;
        $ids{acc} = $acc if $acc;
        if(ref($product->{products}) eq
           'ARRAY')
        {
          # just in case two genbank proteins are assigned to one mRNA
          for(my $j = 0; $j < @{$product->{products}}; $j++) {
            $acc = $product->{products}->[$j]->{accession};
            $gi = safeval($product->{products}->[$j], '{source}->[0]->{src}->[0]->{tag}->[0]->{id}');
            $ids{'protgi' . ($j+1)} = $gi if $gi;
            $ids{'protacc' . ($j+1)} = $acc if $acc;
          }
        }
        addxref($llgene, 'GENBANK', %ids) if (keys %ids > 0);
      }
    }

    ######################
    # various dblinks
    if($comment->{heading} eq 'Additional Links' && $comment->{comment}) {
      foreach my $c (@{$comment->{comment}}) {
        my $id = safeval($c, '{source}->[0]->{src}->[0]->{tag}->[0]->{str}');
        if($id) # add xref to llgene
        {
          my $db = uc(safeval($c, '{source}->[0]->{src}->[0]->{db}'));
          # homologene id here is not real id, and some of dblinks' DB names are not
          # real DBs, we temporarily use space to identify those
          next if $db =~ /\s+/ || $db =~ /HomoloGene/i;
          my @ids = trim(split /,/, $id); # MGC sometimes has multiple ids concatenated by ','
          map { s/^$db://i; addxref($llgene, $db, 'id' => $_) } @ids; # $db: truncation useful for GDB ids
        }
        my $url = safeval($c, '{source}->[0]->{url}');
        if($url) {
          $url =~ s/\s//g; # some urls have linebreaks in them
          my $desc = safeval($c, '{source}->[0]->{anchor}');
        }
      }
    }

    ######################
    # Pathways
    if($comment->{heading} eq 'Pathways' && $comment->{comment}) {
      foreach my $c (@{$comment->{comment}}) {
        my $id = safeval($c, '{source}->[0]->{src}->[0]->{tag}->[0]->{str}');
        my $url = safeval($c, '{source}->[0]->{url}');
        if($url) {
          $url =~ s/\s//g; # some urls have linebreaks in them
          my $desc = ($c->{text})? $c->{text} : 'KEGG Pathway';
        }
        if($id) # add xref to llgene
        {
          my @ids = trim(split /,/, $id); # ids may be concatenated by ','
          map { addxref($llgene, 'KEGG', 'id' => $_) } @ids;
        }
      }
    }

    ######################
    # generif
    if($comment->{type} eq 'generif') {
      my @cs = ($comment->{heading} && ref($comment->{comment}) eq 'ARRAY')?
               @{$comment->{comment}} : $comment;
      foreach my $c (@cs) {
        my $generif = $c->{text};
        my $pmid = safeval($c, "{refs}->[0]->{pmid}"); # was '=>', which never assigned $pmid
        if($comment->{heading}) {
          my $type = $comment->{heading};
          my ($db, $id) = makexref($c);
          my $anchor = safeval($c, '{source}->[0]->{anchor}');
          foreach my $c1 (@{$c->{comment}}) {
            my ($db, $id) = makexref($c1);
            my ($label, $acc) = ((($c1->{label})? "$c1->{label}:" : ''), $c1->{accession});
          }
        }
        addxref($llgene, 'PUBMED', 'id' => safeval($c, "{refs}->[0]->{pmid}"));
      }
    }

    ######################
    # phenotype
    if($comment->{heading} eq 'Phenotypes') {
      my $detail = safeval($comment, '{comment}->[0]->{text}');
      if(safeval($comment, '{comment}->[0]->{source}')) {
        my $db = safeval($comment, '{comment}->[0]->{source}->[0]->{src}->[0]->{db}');
        my $id = safeval($comment, '{comment}->[0]->{source}->[0]->{src}->[0]->{tag}->[0]->{id}');
      }
    }

    ######################
    # relationships
    if($comment->{heading} eq 'Relationships') {
      foreach my $c (@{$comment->{comment}}) {
        my $type = ($c->{text} =~ /related (.*)/)?
                   $1 : $c->{text};
        my $anchor = safeval($c, '{source}->[0]->{anchor}');
        my ($db, $id) = makexref($c);
      }
    }

    ######################
    # tRNA
    if($comment->{heading} eq 'tRNA-ext') {
      my $trnatext = $comment->{text};
    }

    #########################################################################
    # ECNUM (documentation for Entrez Gene says it's here, so I put this in,
    # but it is actually in locus (see further below))
    if($comment->{type} =~ /property/i && $comment->{label} eq 'EC') {
      addxref($llgene, 'EC', 'id' => $comment->{text});
    }
  }

  my %map = ('FUNCTION' => 'molecular function',
             'PROCESS' => 'biological process',
             'COMPONENT' => 'cellular component');

  ##################################
  # OK, let's process the GO info
  if($seq->{properties}) {
    foreach my $p (@{$seq->{properties}}) {
      if($p->{heading} eq 'GeneOntology') {
        foreach my $c (@{$p->{comment}}) {
          foreach my $c1 (@{$c->{comment}}) {
            my ($db, $id) = (safeval($c1, '{source}->[0]->{src}->[0]->{db}'),
                             safeval($c1, '{source}->[0]->{src}->[0]->{tag}->[0]->{id}'));
            addxref($llgene, $db, id => $id);
            my $category = $map{uc($c->{label})};
            my $content = safeval($c1, '{source}->[0]->{anchor}');
          }
        }
      }
      elsif($p->{label} eq 'Nomenclature') {
        foreach my $p1 (@{$p->{properties}}) {
          if($p1->{label} eq 'Official Symbol') {
            my $hugosymbol = $p1->{text};
          }
          elsif($p1->{label} eq 'Official Full Name') {
            my $hugoname = $p1->{text};
          }
        }
      }
    }
  }

  ##################################
  # protein aliases
  if($seq->{prot}) {
    foreach my $p (@{$seq->{prot}}) {
      map { my $protalias = $_ } @{$p->{name}} if(ref($p->{name}) eq 'ARRAY');
    }
  }

  #####################################################################
  # now locus, again assemble info into $allseqs for future processing
  # of variants, transcripts and proteins, (dealing with these objects
  # is by far the most time-consuming task in extracting Entrez Gene data)
  if($seq->{locus}) {
    # judgement call:
    # we should take NC_ whenever possible and disregard NT_
    # but in absence of NC_, we use NT_ to figure out exons
    foreach my $l (@{$seq->{locus}}) {
      if($l->{products}) {
        foreach my $p (@{$l->{products}}) {
          ########################################
          # let's first get the accession numbers
          my %ids;
          $ids{acc} = $p->{accession};
          my $gi = safeval($p, '{seqs}->[0]->{whole}->[0]->{gi}');
          $ids{gi} = $gi if $gi;
          my $t = $allseqs->{$p->{accession}};
          # sometimes NCBI forgot to put refseq IDs into comments about Refseq sequences, e.g. Gene 616.
          $allseqs->{$p->{accession}}->{acc} = $ids{acc} unless $allseqs->{$p->{accession}}->{acc};
          $allseqs->{$p->{accession}}->{gi} = $ids{gi} unless $allseqs->{$p->{accession}}->{gi};
          if($p->{products}) {
            for(my $j = 0; $j < @{$p->{products}}; $j++) {
              my $p1 = $p->{products}->[$j];
              $gi = safeval($p1, '{seqs}->[0]->{whole}->[0]->{gi}');
              my ($gino, $accno) = ((($protgis{$gi})? $protgis{$gi} : 'protgi' . ($j+1)),
                                    (($protaccs{$p1->{accession}})? $protaccs{$p1->{accession}} : 'protacc' . ($j+1)));
              if($gi) {
                $ids{$gino} = $gi;
                $protgis{$gi} = $gino;
              }
              $ids{$accno} = $p1->{accession};
              $protaccs{$p1->{accession}} = $accno;
              # sometimes NCBI forgot to put refseq IDs into comments about Refseq sequences, e.g. Gene 616.
              $allseqs->{$p->{accession}}->{protein}->{$p1->{accession}}->{acc} = $ids{$accno} unless $allseqs->{$p->{accession}}->{protein}->{$p1->{accession}}->{acc};
              $allseqs->{$p->{accession}}->{protein}->{$p1->{accession}}->{gi} = $ids{$gino} unless $allseqs->{$p->{accession}}->{protein}->{$p1->{accession}}->{gi};
              if($p1->{comment} && safeval($p1, '{comment}->[0]->{type}') eq 'property' && safeval($p1, '{comment}->[0]->{label}') eq 'EC') {
                my $ec = safeval($p1, '{comment}->[0]->{text}');
                # change dealing with EC number to add to xref
                addxref($llgene, 'EC', 'id' => $ec);
              }
            }
          }
          addxref($llgene, 'REFSEQ', %ids); # trans and prots xrefs

          #############################################################
          # now get exon coordinates - only do the work when necessary
          unless($t && $t->{genomic} && ($t->{genomic} =~ /^NC_/ || $l->{accession} =~ /^NT_/)) {
            $allseqs->{$p->{accession}}->{genomic} = $l->{accession};
            $allseqs->{$p->{accession}}->{type} = $p->{type};
            my $tmp = safeval($p, '{genomic-coords}->[0]->{mix}->[0]->{int}') || safeval($p, '{genomic-coords}->[0]->{int}');
            $allseqs->{$p->{accession}}->{exons} = $tmp if($tmp);
            if($p->{products}) {
              foreach my $p1 (@{$p->{products}}) {
                #############################################################
                # now get protein location - only do the work when necessary
                $t = $allseqs->{$p->{accession}}->{protein}->{$p1->{accession}};
                unless($t && $t->{from}) {
                  my $tmp = safeval($p1, '{genomic-coords}->[0]->{packed-int}') || safeval($p1, '{genomic-coords}->[0]->{int}');
                  if($tmp) {
                    $allseqs->{$p->{accession}}->{protein}->{$p1->{accession}}->{from} = $tmp->[0]->{from};
                    $allseqs->{$p->{accession}}->{protein}->{$p1->{accession}}->{to} = $tmp->[$#$tmp]->{to};
                  }
                }
              }
            }
          }
        }
        addxref($llgene, 'REFSEQ', 'acc' => $l->{accession}, 'gi' => safeval($l, '{seqs}->[0]->{int}->[0]->{id}->[0]->{gi}')); # trans and prots xrefs
      }
    }
  }

  ##########################################################################
  # now that we got all info for transcripts and proteins
  # we should assemble info into variants, transcripts and proteins,
  # (dealing with these objects is by far the most time-consuming task
  # in extracting Entrez Gene data)
  # note again I edited out project-specific stuff, so the only purpose is
  # to show you where data items are, not to store everything
  my (@variants,
      @trans, $genestart, $geneend, $genomeacc);
  foreach my $dnaacc (keys %$allseqs) {
    my ($variant, $trans, $xref);
    my $t = $allseqs->{$dnaacc};
    if($t->{variantcomment}) {
      $variant->{comment} = $t->{variantcomment};
    }

    ############################################
    # work on transcript
    $trans->{assembly} = $t->{assembly} if $t->{assembly};
    $trans->{type} = $t->{type} if $t->{type};
    $trans->{comment} = "Data from $t->{genomic}" if($t->{genomic});
    $genomeacc = $t->{genomic} unless $genomeacc;
    my $dealwithtransxref if($t->{acc} || $t->{gi}); # db should be refseq

    # now exon
    if($t->{exons}) {
      if($t->{genomic} eq $genomeacc) # only process when the trans is on same contig as first one
      {
        $genestart = $t->{exons}->[0]->{from} if(!$genestart || $t->{exons}->[0]->{from} < $genestart);
        $geneend = $t->{exons}->[$#{$t->{exons}}]->{to} if($t->{exons}->[$#{$t->{exons}}]->{to} > $geneend);
      }
      foreach my $exon (@{$t->{exons}}) {
        $trans->{strand} = $exon->{strand} unless $trans->{strand};
        my $start = $exon->{from}; # follow Entrez Gene style, start is always smaller than end
        my $end = $exon->{to};
        my $coordSys = "$t->{genomic}";
      }

      # finally protein
      if($t->{protein}) {
        foreach my $pacc (keys %{$t->{protein}}) {
          my $p = $t->{protein}->{$pacc};
          my $deal_with_prot_xref if($p->{acc}); # db should be refseq
          my $add_ccds_xref if($p->{ccds}); # db should be CCDS
          my $protname = $p->{name} if($p->{name});
          if($p->{from} || $p->{to}) # sometimes Entrez Gene forgets to annotate CDS start/end like for gene 574, NP_001178
          {
            my $protcoordSys = "$t->{genomic}";
            my $protstart = $p->{from} if $p->{from};
            my $protend = $p->{to} if $p->{to};
          }
          # domains
          if($p->{dom}) {
            foreach my $dom (@{$p->{dom}}) {
              my $desc = $dom->{desc};
              my $score = $dom->{score} if(defined $dom->{score});
              my $loc = { start => $dom->{start}, end=> $dom->{end}, coordSys => "$p->{acc}" };
              my $xref = $dom->{xref};
            }
          }
        }
      }
    }

    # put transcript into variant if annotated, otherwise into transcript
    if($variant) {
      # user can decide how to store
    }
    else {
      # user can decide how to store
    }
  }
  if($genestart && $geneend) {
    # user can decide how to store
  }

  # tracking info about how this gene has changed
  if(safeval($seq, '{track-info}->[0]->{current-id}')) {
    my (@ids, $newegid, $newllid);
    foreach my $id (@{$seq->{'track-info'}->[0]->{'current-id'}}) {
      my $tmpid = safeval($id, '{tag}->[0]->{id}');
      push(@ids, "$id->{db}:$tmpid");
      $newegid = $tmpid if($id->{db} =~ /^GeneID$/i);
      $newllid = $tmpid if($id->{db} =~ /^LocusID$/i);
    }
    my $comment = "Gene moved: current IDs are: " . join(' ; ', @ids);
  }
  return $llgene;
}

# safely assign a value to $data->{$key} ($data must be hash)
sub safeassign {
  my ($data, $key, $ds, $str) = @_;
  my $tmp = safeval($ds, $str);
  $data->{$key} = $tmp if $tmp;
  return (defined $tmp)? 1 : 0;
}

# safely extracts a value, another choice is to simply use
# eval in-line, if it fails, it fails. Probably faster, but can't
# give feedback in-line (always has to add a couple lines dealing with
# $@ for error reporting), might still be worth it though because
# of the speed. User can make his/her own choice here.
sub safeval {
  my ($ds, $str) = @_; # data structure and string (we need $ds passed in because we use strict)
  my @items = split('->', $str);
  foreach (@items) {
    my $tmp;
    if(($tmp) = /\[(\d+)\]/) {
      return undef unless(ref($ds) eq 'ARRAY' && @$ds > $tmp);
      $ds = $ds->[$tmp];
    }
    elsif(($tmp) = /^\{(.*?)\}$/) {
      # this is not ideal (since one might want to return '' instead of undef when
      # this hash value is defined as ''), but correct for our situations
      return undef unless(ref($ds) eq 'HASH' && $ds->{$tmp});
      $ds = $ds->{$tmp};
    }
    else {
      die "wrong syntax for string:$str\n";
    }
  }
  return $ds;
}

sub addxrefs {
  # left for user implementation since it's project-specific
}

sub addxref {
  # left for user implementation since it's project-specific
}

# used for making xrefs listed under comments
sub makexref {
  my $c = shift;
  my $db = safeval($c, '{source}->[0]->{src}->[0]->{db}');
  my $id = safeval($c, '{source}->[0]->{src}->[0]->{tag}->[0]->{id}') || safeval($c, '{source}->[0]->{src}->[0]->{tag}->[0]->{str}');
  # change the following as suited for your project
  $id = $1 if($id =~ /^"(.*)"$/); # strip surrounding double quotes (was '$id =~ $1', which changed nothing)
  if($db =~ /GeneID/) {
    $db = 'ENTREZGENE';
  }
  elsif($db =~ /Nucleotide/) {
    $db = 'GENBANK';
  }
  return ($db, $id);
}

sub trim {
  my @data = @_;
  map { s/^\s+//; s/\s+$//; } (@data);
  return wantarray ? @data : $data[0];
}
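
# Illustrative sketch only, not part of the original example script: it shows
# how the path strings understood by safeval() map onto a nested structure.
# The record below is made up, and safeval_demo is a hypothetical helper that
# nothing in this script calls.
sub safeval_demo {
  my $rec = { gene => [ { locus => 'BRCA1' } ] };       # made-up data
  my $symbol = safeval($rec, '{gene}->[0]->{locus}');    # 'BRCA1'
  my $missing = safeval($rec, '{gene}->[1]->{locus}');   # undef: index 1 is out of range
  return ($symbol, $missing);
}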