pax_global_header00006660000000000000000000000064132477532020014517gustar00rootroot0000000000000052 comment=9f22e341f119296de8d4d427ad74978617c0a244 ruby-classifier-reborn-2.2.0/000077500000000000000000000000001324775320200161105ustar00rootroot00000000000000ruby-classifier-reborn-2.2.0/LICENSE000066400000000000000000000635441324775320200171310ustar00rootroot00000000000000 GNU LESSER GENERAL PUBLIC LICENSE Version 2.1, February 1999 Copyright (C) 1991, 1999 Free Software Foundation, Inc. 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. [This is the first released version of the Lesser GPL. It also counts as the successor of the GNU Library Public License, version 2, hence the version number 2.1.] Preamble The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public Licenses are intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This license, the Lesser General Public License, applies to some specially designated software packages--typically libraries--of the Free Software Foundation and other authors who decide to use it. You can use it too, but we suggest you first think carefully about whether this license or the ordinary General Public License is the better strategy to use in any particular case, based on the explanations below. When we speak of free software, we are referring to freedom of use, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish); that you receive source code or can get it if you want it; that you can change the software and use pieces of it in new free programs; and that you are informed that you can do these things. 
To protect your rights, we need to make restrictions that forbid distributors to deny you these rights or to ask you to surrender these rights. These restrictions translate to certain responsibilities for you if you distribute copies of the library or if you modify it. For example, if you distribute copies of the library, whether gratis or for a fee, you must give the recipients all the rights that we gave you. You must make sure that they, too, receive or can get the source code. If you link other code with the library, you must provide complete object files to the recipients, so that they can relink them with the library after making changes to the library and recompiling it. And you must show them these terms so they know their rights. We protect your rights with a two-step method: (1) we copyright the library, and (2) we offer you this license, which gives you legal permission to copy, distribute and/or modify the library. To protect each distributor, we want to make it very clear that there is no warranty for the free library. Also, if the library is modified by someone else and passed on, the recipients should know that what they have is not the original version, so that the original author's reputation will not be affected by problems that might be introduced by others. Finally, software patents pose a constant threat to the existence of any free program. We wish to make sure that a company cannot effectively restrict the users of a free program by obtaining a restrictive license from a patent holder. Therefore, we insist that any patent license obtained for a version of the library must be consistent with the full freedom of use specified in this license. Most GNU software, including some libraries, is covered by the ordinary GNU General Public License. This license, the GNU Lesser General Public License, applies to certain designated libraries, and is quite different from the ordinary General Public License. 
We use this license for certain libraries in order to permit linking those libraries into non-free programs. When a program is linked with a library, whether statically or using a shared library, the combination of the two is legally speaking a combined work, a derivative of the original library. The ordinary General Public License therefore permits such linking only if the entire combination fits its criteria of freedom. The Lesser General Public License permits more lax criteria for linking other code with the library. We call this license the "Lesser" General Public License because it does Less to protect the user's freedom than the ordinary General Public License. It also provides other free software developers Less of an advantage over competing non-free programs. These disadvantages are the reason we use the ordinary General Public License for many libraries. However, the Lesser license provides advantages in certain special circumstances. For example, on rare occasions, there may be a special need to encourage the widest possible use of a certain library, so that it becomes a de-facto standard. To achieve this, non-free programs must be allowed to use the library. A more frequent case is that a free library does the same job as widely used non-free libraries. In this case, there is little to gain by limiting the free library to free software only, so we use the Lesser General Public License. In other cases, permission to use a particular library in non-free programs enables a greater number of people to use a large body of free software. For example, permission to use the GNU C Library in non-free programs enables many more people to use the whole GNU operating system, as well as its variant, the GNU/Linux operating system. 
Although the Lesser General Public License is Less protective of the users' freedom, it does ensure that the user of a program that is linked with the Library has the freedom and the wherewithal to run that program using a modified version of the Library. The precise terms and conditions for copying, distribution and modification follow. Pay close attention to the difference between a "work based on the library" and a "work that uses the library". The former contains code derived from the library, whereas the latter must be combined with the library in order to run. GNU LESSER GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License Agreement applies to any software library or other program which contains a notice placed by the copyright holder or other authorized party saying it may be distributed under the terms of this Lesser General Public License (also called "this License"). Each licensee is addressed as "you". A "library" means a collection of software functions and/or data prepared so as to be conveniently linked with application programs (which use some of those functions and data) to form executables. The "Library", below, refers to any such software library or work which has been distributed under these terms. A "work based on the Library" means either the Library or any derivative work under copyright law: that is to say, a work containing the Library or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language. (Hereinafter, translation is included without limitation in the term "modification".) "Source code" for a work means the preferred form of the work for making modifications to it. For a library, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the library. 
Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running a program using the Library is not restricted, and output from such a program is covered only if its contents constitute a work based on the Library (independent of the use of the Library in a tool for writing it). Whether that is true depends on what the Library does and what the program that uses the Library does. 1. You may copy and distribute verbatim copies of the Library's complete source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and distribute a copy of this License along with the Library. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Library or any portion of it, thus forming a work based on the Library, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) The modified work must itself be a software library. b) You must cause the files modified to carry prominent notices stating that you changed the files and the date of any change. c) You must cause the whole of the work to be licensed at no charge to all third parties under the terms of this License. d) If a facility in the modified Library refers to a function or a table of data to be supplied by an application program that uses the facility, other than as an argument passed when the facility is invoked, then you must make a good faith effort to ensure that, in the event an application does not supply such function or table, the facility still operates, and performs whatever part of its purpose remains meaningful. 
(For example, a function in a library to compute square roots has a purpose that is entirely well-defined independent of the application. Therefore, Subsection 2d requires that any application-supplied function or table used by this function must be optional: if the application does not supply it, the square root function must still compute square roots.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Library, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Library, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Library. In addition, mere aggregation of another work not based on the Library with the Library (or with a work based on the Library) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. You may opt to apply the terms of the ordinary GNU General Public License instead of this License to a given copy of the Library. To do this, you must alter all the notices that refer to this License, so that they refer to the ordinary GNU General Public License, version 2, instead of to this License. (If a newer version than version 2 of the ordinary GNU General Public License has appeared, then you can specify that version instead if you wish.) Do not make any other change in these notices. 
Once this change is made in a given copy, it is irreversible for that copy, so the ordinary GNU General Public License applies to all subsequent copies and derivative works made from that copy. This option is useful when you wish to copy part of the code of the Library into a program that is not a library. 4. You may copy and distribute the Library (or a portion or derivative of it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange. If distribution of object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place satisfies the requirement to distribute the source code, even though third parties are not compelled to copy the source along with the object code. 5. A program that contains no derivative of any portion of the Library, but is designed to work with the Library by being compiled or linked with it, is called a "work that uses the Library". Such a work, in isolation, is not a derivative work of the Library, and therefore falls outside the scope of this License. However, linking a "work that uses the Library" with the Library creates an executable that is a derivative of the Library (because it contains portions of the Library), rather than a "work that uses the library". The executable is therefore covered by this License. Section 6 states terms for distribution of such executables. When a "work that uses the Library" uses material from a header file that is part of the Library, the object code for the work may be a derivative work of the Library even though the source code is not. Whether this is true is especially significant if the work can be linked without the Library, or if the work is itself a library. 
The threshold for this to be true is not precisely defined by law. If such an object file uses only numerical parameters, data structure layouts and accessors, and small macros and small inline functions (ten lines or less in length), then the use of the object file is unrestricted, regardless of whether it is legally a derivative work. (Executables containing this object code plus portions of the Library will still fall under Section 6.) Otherwise, if the work is a derivative of the Library, you may distribute the object code for the work under the terms of Section 6. Any executables containing that work also fall under Section 6, whether or not they are linked directly with the Library itself. 6. As an exception to the Sections above, you may also combine or link a "work that uses the Library" with the Library to produce a work containing portions of the Library, and distribute that work under terms of your choice, provided that the terms permit modification of the work for the customer's own use and reverse engineering for debugging such modifications. You must give prominent notice with each copy of the work that the Library is used in it and that the Library and its use are covered by this License. You must supply a copy of this License. If the work during execution displays copyright notices, you must include the copyright notice for the Library among them, as well as a reference directing the user to the copy of this License. Also, you must do one of these things: a) Accompany the work with the complete corresponding machine-readable source code for the Library including whatever changes were used in the work (which must be distributed under Sections 1 and 2 above); and, if the work is an executable linked with the Library, with the complete machine-readable "work that uses the Library", as object code and/or source code, so that the user can modify the Library and then relink to produce a modified executable containing the modified Library. 
(It is understood that the user who changes the contents of definitions files in the Library will not necessarily be able to recompile the application to use the modified definitions.) b) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (1) uses at run time a copy of the library already present on the user's computer system, rather than copying library functions into the executable, and (2) will operate properly with a modified version of the library, if the user installs one, as long as the modified version is interface-compatible with the version that the work was made with. c) Accompany the work with a written offer, valid for at least three years, to give the same user the materials specified in Subsection 6a, above, for a charge no more than the cost of performing this distribution. d) If distribution of the work is made by offering access to copy from a designated place, offer equivalent access to copy the above specified materials from the same place. e) Verify that the user has already received a copy of these materials or that you have already sent this user a copy. For an executable, the required form of the "work that uses the Library" must include any data and utility programs needed for reproducing the executable from it. However, as a special exception, the materials to be distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. It may happen that this requirement contradicts the license restrictions of other proprietary libraries that do not normally accompany the operating system. Such a contradiction means you cannot use both them and the Library together in an executable that you distribute. 7. 
You may place library facilities that are a work based on the Library side-by-side in a single library together with other library facilities not covered by this License, and distribute such a combined library, provided that the separate distribution of the work based on the Library and of the other library facilities is otherwise permitted, and provided that you do these two things: a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities. This must be distributed under the terms of the Sections above. b) Give prominent notice with the combined library of the fact that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. 8. You may not copy, modify, sublicense, link with, or distribute the Library except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense, link with, or distribute the Library is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 9. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Library or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Library (or any work based on the Library), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Library or works based on it. 10. Each time you redistribute the Library (or any work based on the Library), the recipient automatically receives a license from the original licensor to copy, distribute, link with or modify the Library subject to these terms and conditions. 
You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties with this License. 11. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Library at all. For example, if a patent license would not permit royalty-free redistribution of the Library by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Library. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply, and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 12. 
If the distribution and/or use of the Library is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Library under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 13. The Free Software Foundation may publish revised and/or new versions of the Lesser General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Library specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Library does not specify a license version number, you may choose any version ever published by the Free Software Foundation. 14. If you wish to incorporate parts of the Library into other free programs whose distribution conditions are incompatible with these, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 
EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Libraries If you develop a new library, and you want it to be of the greatest possible use to the public, we recommend making it free software that everyone can redistribute and change. You can do so by permitting redistribution under these terms (or, alternatively, under the terms of the ordinary General Public License). To apply these terms, attach the following notices to the library. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. 
Copyright (C) This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Also add information on how to contact you by electronic and paper mail. You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the library, if necessary. Here is a sample; alter the names: Yoyodyne, Inc., hereby disclaims all copyright interest in the library `Frob' (a library for tweaking knobs) written by James Random Hacker. , 1 April 1990 Ty Coon, President of Vice That's all there is to it! ruby-classifier-reborn-2.2.0/README.markdown000066400000000000000000000107051324775320200206140ustar00rootroot00000000000000# Classifier Reborn [![Gem Version](https://badge.fury.io/rb/classifier-reborn.svg)](https://rubygems.org/gems/classifier-reborn) [![Build Status](https://img.shields.io/travis/jekyll/classifier-reborn/master.svg)](https://travis-ci.org/jekyll/classifier-reborn) [![Dependency Status](https://img.shields.io/gemnasium/jekyll/classifier-reborn.svg)](https://gemnasium.com/jekyll/classifier-reborn) --- ## [Read the Docs](http://www.classifier-reborn.com/) ## Getting Started Classifier Reborn is a general classifier module to allow Bayesian and other types of classifications. It is a fork of [cardmagic/classifier](https://github.com/cardmagic/classifier) under more active development. 
Currently, it has [Bayesian Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) and [Latent Semantic Indexer (LSI)](https://en.wikipedia.org/wiki/Latent_semantic_analysis) implemented.

Here is a quick illustration of the Bayesian classifier.

```bash
$ gem install classifier-reborn
$ irb
irb(main):001:0> require 'classifier-reborn'
irb(main):002:0> classifier = ClassifierReborn::Bayes.new 'Ham', 'Spam'
irb(main):003:0> classifier.train "Ham", "Sunday is a holiday. Say no to work on Sunday!"
irb(main):004:0> classifier.train "Spam", "You are the lucky winner! Claim your holiday prize."
irb(main):005:0> classifier.classify "What's the plan for Sunday?"
#=> "Ham"
```

Now, let's build an LSI, classify some text, and find a cluster of related documents.

```bash
irb(main):006:0> lsi = ClassifierReborn::LSI.new
irb(main):007:0> lsi.add_item "This text deals with dogs. Dogs.", :dog
irb(main):008:0> lsi.add_item "This text involves dogs too. Dogs!", :dog
irb(main):009:0> lsi.add_item "This text revolves around cats. Cats.", :cat
irb(main):010:0> lsi.add_item "This text also involves cats. Cats!", :cat
irb(main):011:0> lsi.add_item "This text involves birds. Birds.", :bird
irb(main):012:0> lsi.classify "This text is about dogs!"
#=> :dog
irb(main):013:0> lsi.find_related("This text is around cats!", 2)
#=> ["This text revolves around cats. Cats.", "This text also involves cats. Cats!"]
```

There is much more that can be done using Bayes and LSI beyond these quick examples.
For more information read the following documentation topics.
* [Installation and Dependencies](http://www.classifier-reborn.com/)
* [Bayesian Classifier](http://www.classifier-reborn.com/bayes)
* [Latent Semantic Indexer (LSI)](http://www.classifier-reborn.com/lsi)
* [Classifier Validation](http://www.classifier-reborn.com/validation)
* [Development and Contributions](http://www.classifier-reborn.com/development) (*Optional Docker instructions included*)

### Notes on JRuby support

```ruby
gem 'classifier-reborn-jruby', platforms: :java
```

While experimental, this gem should work on JRuby without any kind of additional changes. Unfortunately, you will **not** be able to use C bindings to GNU/GSL or similar performance-enhancing native code. Additionally, we do not use `fast_stemmer`, but rather [an implementation](https://tartarus.org/martin/PorterStemmer/java.txt) of the [Porter Stemming](https://tartarus.org/martin/PorterStemmer/) algorithm. Stemming will differ between MRI and JRuby; however, you may choose to [disable stemming](https://tartarus.org/martin/PorterStemmer/) and do your own manual preprocessing (or use some other [popular Java library](https://opennlp.apache.org/)).

If you encounter a problem, please submit your issue with `[JRuby]` in the title.

## Code of Conduct

In order to have a more open and welcoming community, `Classifier Reborn` adheres to the `Jekyll` [code of conduct](https://github.com/jekyll/jekyll/blob/master/CODE_OF_CONDUCT.markdown) adapted from the `Ruby on Rails` code of conduct.

Please adhere to this code of conduct in any interactions you have in the `Classifier` community. If you encounter someone violating these terms, please let [Chase Gilliam](https://github.com/Ch4s3) know and we will address it as soon as possible.
## Authors and Contributors

* [Lucas Carlson](mailto:lucas@rufy.com)
* [David Fayram II](mailto:dfayram@gmail.com)
* [Cameron McBride](mailto:cameron.mcbride@gmail.com)
* [Ivan Acosta-Rubio](mailto:ivan@softwarecriollo.com)
* [Parker Moore](mailto:email@byparker.com)
* [Chase Gilliam](mailto:chase.gilliam@gmail.com)
* and [many more](https://github.com/jekyll/classifier-reborn/graphs/contributors)...

The Classifier Reborn library is released under the terms of the [GNU LGPL-2.1](https://github.com/jekyll/classifier-reborn/blob/master/LICENSE).

ruby-classifier-reborn-2.2.0/classifier-reborn.gemspec000066400000000000000000000070231324775320200230700ustar00rootroot00000000000000
#########################################################
# This file has been automatically generated by gem2tgz #
#########################################################
# -*- encoding: utf-8 -*-

Gem::Specification.new do |s|
  s.name = "classifier-reborn"
  s.version = "2.2.0"

  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to?
:required_rubygems_version=
  s.authors = ["Lucas Carlson", "Parker Moore", "Chase Gilliam"]
  s.date = "2017-12-15"
  s.email = ["lucas@rufy.com", "parkrmoore@gmail.com", "chase.gilliam@gmail.com"]
  s.extra_rdoc_files = ["LICENSE", "README.markdown"]
  s.files = ["LICENSE", "README.markdown", "data/stopwords/ar", "data/stopwords/bn", "data/stopwords/ca", "data/stopwords/cs", "data/stopwords/da", "data/stopwords/de", "data/stopwords/en", "data/stopwords/es", "data/stopwords/fi", "data/stopwords/fr", "data/stopwords/hi", "data/stopwords/hu", "data/stopwords/it", "data/stopwords/ja", "data/stopwords/nl", "data/stopwords/no", "data/stopwords/pl", "data/stopwords/pt", "data/stopwords/ru", "data/stopwords/se", "data/stopwords/tr", "data/stopwords/vi", "data/stopwords/zh", "lib/classifier-reborn.rb", "lib/classifier-reborn/backends/bayes_memory_backend.rb", "lib/classifier-reborn/backends/bayes_redis_backend.rb", "lib/classifier-reborn/backends/no_redis_error.rb", "lib/classifier-reborn/bayes.rb", "lib/classifier-reborn/category_namer.rb", "lib/classifier-reborn/extensions/hasher.rb", "lib/classifier-reborn/extensions/vector.rb", "lib/classifier-reborn/extensions/vector_serialize.rb", "lib/classifier-reborn/lsi.rb", "lib/classifier-reborn/lsi/cached_content_node.rb", "lib/classifier-reborn/lsi/content_node.rb", "lib/classifier-reborn/lsi/summarizer.rb", "lib/classifier-reborn/lsi/word_list.rb", "lib/classifier-reborn/validators/classifier_validator.rb", "lib/classifier-reborn/version.rb"]
  s.homepage = "https://github.com/jekyll/classifier-reborn"
  s.licenses = ["LGPL"]
  s.rdoc_options = ["--charset=UTF-8"]
  s.require_paths = ["lib"]
  s.required_ruby_version = Gem::Requirement.new(">= 1.9.3")
  s.rubygems_version = "1.8.23"
  s.summary = "A general classifier module to allow Bayesian and other types of classifications."

  if s.respond_to?
:specification_version then
    s.specification_version = 2

    if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then
      s.add_runtime_dependency(%q, ["~> 1.0"])
      s.add_development_dependency(%q, [">= 0"])
      s.add_development_dependency(%q, [">= 0"])
      s.add_development_dependency(%q, [">= 0"])
      s.add_development_dependency(%q, [">= 0"])
      s.add_development_dependency(%q, [">= 0"])
      s.add_development_dependency(%q, [">= 0"])
      s.add_development_dependency(%q, [">= 0"])
    else
      s.add_dependency(%q, ["~> 1.0"])
      s.add_dependency(%q, [">= 0"])
      s.add_dependency(%q, [">= 0"])
      s.add_dependency(%q, [">= 0"])
      s.add_dependency(%q, [">= 0"])
      s.add_dependency(%q, [">= 0"])
      s.add_dependency(%q, [">= 0"])
      s.add_dependency(%q, [">= 0"])
    end
  else
    s.add_dependency(%q, ["~> 1.0"])
    s.add_dependency(%q, [">= 0"])
    s.add_dependency(%q, [">= 0"])
    s.add_dependency(%q, [">= 0"])
    s.add_dependency(%q, [">= 0"])
    s.add_dependency(%q, [">= 0"])
    s.add_dependency(%q, [">= 0"])
    s.add_dependency(%q, [">= 0"])
  end
end
ruby-classifier-reborn-2.2.0/data/000077500000000000000000000000001324775320200170215ustar00rootroot00000000000000ruby-classifier-reborn-2.2.0/data/stopwords/000077500000000000000000000000001324775320200210655ustar00rootroot00000000000000ruby-classifier-reborn-2.2.0/data/stopwords/ar000066400000000000000000000013441324775320200214140ustar00rootroot00000000000000 فى في كل لم لن له من هو هي قوة كما لها منذ وقد ولا لقاء مقابل هناك وقال وكان وقالت وكانت فيه لكن وفي ولم ومن وهو وهي يوم فيها منها يكون يمكن حيث االا اما االتى التي اكثر ايضا الذى الذي الان الذين ابين ذلك دون حول حين الى انه اول انها ف و و6 قد لا ما مع هذا واحد واضاف واضافت فان قبل قال كان لدى نحو هذه وان واكد كانت واوضح ب ا أ ، عن عند عندما على عليه عليها تم ضد بعد بعض حتى اذا احد بان اجل غير بن به ثم اف ان او اي بها ruby-classifier-reborn-2.2.0/data/stopwords/bn000066400000000000000000000113621324775320200214120ustar00rootroot00000000000000 অবশ্য অনেক অনেকে অনেকেই অন্তত অথবা অথচ অর্থাত অন্য আজ আছে আপনার আপনি আবার আমরা আমাকে আমাদের আমার
আমি আরও আর আগে আগেই আই অতএব আগামী অবধি অনুযায়ী আদ্যভাগে এই একই একে একটি এখন এখনও এখানে এখানেই এটি এটা এটাই এতটাই এবং একবার এবার এদের এঁদের এমন এমনকী এল এর এরা এঁরা এস এত এতে এসে একে এ ঐ ই ইহা ইত্যাদি উনি উপর উপরে উচিত ও ওই ওর ওরা ওঁর ওঁরা ওকে ওদের ওঁদের ওখানে কত কবে করতে কয়েক কয়েকটি করবে করলেন করার কারও করা করি করিয়ে করার করাই করলে করলেন করিতে করিয়া করেছিলেন করছে করছেন করেছেন করেছে করেন করবেন করায় করে করেই কাছ কাছে কাজে কারণ কিছু কিছুই কিন্তু কিংবা কি কী কেউ কেউই কাউকে কেন কে কোনও কোনো কোন কখনও ক্ষেত্রে খুব গুলি গিয়ে গিয়েছে গেছে গেল গেলে গোটা চলে ছাড়া ছাড়াও ছিলেন ছিল জন্য জানা ঠিক তিনি তিনঐ তিনিও তখন তবে তবু তাঁদের তাঁাহারা তাঁরা তাঁর তাঁকে তাই তেমন তাকে তাহা তাহাতে তাহার তাদের তারপর তারা তারৈ তার তাহলে তিনি তা তাও তাতে তো তত তুমি তোমার তথা থাকে থাকা থাকায় থেকে থেকেও থাকবে থাকেন থাকবেন থেকেই দিকে দিতে দিয়ে দিয়েছে দিয়েছেন দিলেন দু দুটি দুটো দেয় দেওয়া দেওয়ার দেখা দেখে দেখতে দ্বারা ধরে ধরা নয় নানা না নাকি নাগাদ নিতে নিজে নিজেই নিজের নিজেদের নিয়ে নেওয়া নেওয়ার নেই নাই পক্ষে পর্যন্ত পাওয়া পারেন পারি পারে পরে পরেই পরেও পর পেয়ে প্রতি প্রভৃতি প্রায় ফের ফলে ফিরে ব্যবহার বলতে বললেন বলেছেন বলল বলা বলেন বলে বহু বসে বার বা বিনা বরং বদলে বাদে বার বিশেষ বিভিন্ন বিষয়টি ব্যবহার ব্যাপারে ভাবে ভাবেই মধ্যে মধ্যেই মধ্যেও মধ্যভাগে মাধ্যমে মাত্র মতো মতোই মোটেই যখন যদি যদিও যাবে যায় যাকে যাওয়া যাওয়ার যত যতটা যা যার যারা যাঁর যাঁরা যাদের যান যাচ্ছে যেতে যাতে যেন যেমন যেখানে যিনি যে রেখে রাখা রয়েছে রকম শুধু সঙ্গে সঙ্গেও সমস্ত সব সবার সহ সুতরাং সহিত সেই সেটা সেটি সেটাই সেটাও সম্প্রতি সেখান সেখানে সে স্পষ্ট স্বয়ং হইতে হইবে হৈলে হইয়া হচ্ছে হত হতে হতেই হবে হবেন হয়েছিল হয়েছে হয়েছেন হয়ে হয়নি হয় হয়েই হয়তো হল হলে হলেই হলেও হলো হিসাবে হওয়া হওয়ার হওয়ায় হন হোক জন জনকে জনের জানতে জানায় জানিয়ে জানানো জানিয়েছে জন্য জন্যওজে জে বেশ দেন তুলে ছিলেন চান চায় চেয়ে মোট যথেষ্ট টি ruby-classifier-reborn-2.2.0/data/stopwords/ca000066400000000000000000000012661324775320200214000ustar00rootroot00000000000000de es i a o un una unes uns un tot també altre algun alguna 
alguns algunes ser és soc ets som estic està estem esteu estan com en per perquè per que estat estava ans abans éssent ambdós però per poder potser puc podem podeu poden vaig va van fer faig fa fem feu fan cada fi inclòs primer des de conseguir consegueixo consigueix consigueixes conseguim consigueixen anar haver tenir tinc te tenim teniu tene el la les els seu aquí meu teu ells elles ens nosaltres vosaltres si dins sols solament saber saps sap sabem sabeu saben últim llarg bastant fas molts aquells aquelles seus llavors sota dalt ús molt era eres erem eren mode bé quant quan on mentre qui amb entre sense jo aquellruby-classifier-reborn-2.2.0/data/stopwords/cs000066400000000000000000000012111324775320200214100ustar00rootroot00000000000000dnes cz timto budes budem byli jses muj svym ta tomto tohle tuto tyto jej zda proc mate tato kam tohoto kdo kteri mi nam tom tomuto mit nic proto kterou byla toho protoze asi ho nasi napiste re coz tim takze svych jeji svymi jste aj tu tedy teto bylo kde ke prave ji nad nejsou ci pod tema mezi pres ty pak vam ani kdyz vsak ne jsem tento clanku clanky aby jsme pred pta jejich byl jeste az bez take pouze prvni vase ktera nas novy tipy pokud muze design strana jeho sve jine zpravy nove neni vas jen podle zde clanek uz email byt vice bude jiz nez ktery by ktere co nebo ten tak ma pri od po jsou jak dalsi ale si ve to jako za zpet ze do pro je naruby-classifier-reborn-2.2.0/data/stopwords/da000066400000000000000000000007431324775320200214000ustar00rootroot00000000000000af alle andet andre at begge da de den denne der deres det dette dig din dog du ej eller en end ene eneste enhver et fem fire flere fleste for fordi forrige fra få før god han hans har hendes her hun hvad hvem hver hvilken hvis hvor hvordan hvorfor hvornår i ikke ind ingen intet jeg jeres kan kom kommer lav lidt lille man mand mange med meget men mens mere mig ned ni nogen noget ny nyt nær næste næsten og op otte over på se seks ses som stor store syv ti til to tre ud 
varruby-classifier-reborn-2.2.0/data/stopwords/de000066400000000000000000000073301324775320200214030ustar00rootroot00000000000000a ab aber aber ach acht achte achten achter achtes ag alle allein allem allen aller allerdings alles allgemeinen als als also am an andere anderen andern anders au auch auch auf aus ausser außer ausserdem außerdem b bald bei beide beiden beim beispiel bekannt bereits besonders besser besten bin bis bisher bist c d da dabei dadurch dafür dagegen daher dahin dahinter damals damit danach daneben dank dann daran darauf daraus darf darfst darin darüber darum darunter das das dasein daselbst dass daß dasselbe davon davor dazu dazwischen de dein deine deinem deiner dem dementsprechend demgegenüber demgemäss demgemäß demselben demzufolge den denen denn denn denselben der deren derjenige derjenigen dermassen dermaßen derselbe derselben des deshalb desselben dessen deswegen d.h dich die diejenige diejenigen dies diese dieselbe dieselben diesem diesen dieser dieses dir doch dort drei drin dritte dritten dritter drittes du durch durchaus dürfen dürft durfte durften e eben ebenso ehrlich ei ei, ei, eigen eigene eigenen eigener eigenes ein einander eine einem einen einer eines einige einigen einiger einiges einmal einmal eins elf en ende endlich entweder entweder er Ernst erst erste ersten erster erstes es etwa etwas euch f früher fünf fünfte fünften fünfter fünftes für g gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt gesagt geschweige gewesen gewollt geworden gibt ging gleich gott gross groß grosse große grossen großen grosser großer grosses großes gut gute guter gutes h habe haben habt hast hat hatte hätte hatten hätten heisst her heute hier hin hinter hoch i ich ihm ihn ihnen ihr ihre ihrem ihren ihrer ihres im im immer in in indem infolgedessen ins irgend ist j ja ja jahr jahre jahren je jede jedem jeden jeder jedermann jedermanns jedoch jemand 
jemandem jemanden jene jenem jenen jener jenes jetzt k kam kann kannst kaum kein keine keinem keinen keiner kleine kleinen kleiner kleines kommen kommt können könnt konnte könnte konnten kurz l lang lange lange leicht leide lieber los m machen macht machte mag magst mahn man manche manchem manchen mancher manches mann mehr mein meine meinem meinen meiner meines mensch menschen mich mir mit mittel mochte möchte mochten mögen möglich mögt morgen muss muß müssen musst müsst musste mussten n na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter neuntes nicht nicht nichts nie niemand niemandem niemanden noch nun nun nur o ob ob oben oder oder offen oft oft ohne Ordnung p q r recht rechte rechten rechter rechtes richtig rund s sa sache sagt sagte sah satt schlecht Schluss schon sechs sechste sechsten sechster sechstes sehr sei sei seid seien sein seine seinem seinen seiner seines seit seitdem selbst selbst sich sie sieben siebente siebenten siebenter siebentes sind so solang solche solchem solchen solcher solches soll sollen sollte sollten sondern sonst sowie später statt t tag tage tagen tat teil tel tritt trotzdem tun u über überhaupt übrigens uhr um und und? uns unser unsere unserer unter v vergangenen viel viele vielem vielen vielleicht vier vierte vierten vierter viertes vom von vor w wahr? 
während währenddem währenddessen wann war wäre waren wart warum was wegen weil weit weiter weitere weiteren weiteres welche welchem welchen welcher welches wem wen wenig wenig wenige weniger weniges wenigstens wenn wenn wer werde werden werdet wessen wie wie wieder will willst wir wird wirklich wirst wo wohl wollen wollt wollte wollten worden wurde würde wurden würden x y z z.b zehn zehnte zehnten zehnter zehntes zeit zu zuerst zugleich zum zum zunächst zur zurück zusammen zwanzig zwar zwar zwei zweite zweiten zweiter zweites zwischen zwölfruby-classifier-reborn-2.2.0/data/stopwords/en000066400000000000000000000005521324775320200214140ustar00rootroot00000000000000a again all along are also an and as at but by came can cant couldnt did didn didnt do doesnt dont ever first from have her here him how i if in into is isnt it itll just last least like most my new no not now of on or should sinc so some th than this that the their then those to told too true try until url us were when whether while with within yes you youll ruby-classifier-reborn-2.2.0/data/stopwords/es000066400000000000000000000042501324775320200214200ustar00rootroot00000000000000él ésta éstas éste éstos última últimas último últimos a añadió aún actualmente adelante además afirmó agregó ahí ahora al algún algo alguna algunas alguno algunos alrededor ambos ante anterior antes apenas aproximadamente aquí así aseguró aunque ayer bajo bien buen buena buenas bueno buenos cómo cada casi cerca cierto cinco comentó como con conocer consideró considera contra cosas creo cual cuales cualquier cuando cuanto cuatro cuenta da dado dan dar de debe deben debido decir dejó del demás dentro desde después dice dicen dicho dieron diferente diferentes dijeron dijo dio donde dos durante e ejemplo el ella ellas ello ellos embargo en encuentra entonces entre era eran es esa esas ese eso esos está están esta estaba estaban estamos estar estará estas este esto estos estoy estuvo ex existe existen explicó expresó fin fue fuera 
fueron gran grandes ha había habían haber habrá hace hacen hacer hacerlo hacia haciendo han hasta hay haya he hecho hemos hicieron hizo hoy hubo igual incluso indicó informó junto la lado las le les llegó lleva llevar lo los luego lugar más manera manifestó mayor me mediante mejor mencionó menos mi mientras misma mismas mismo mismos momento mucha muchas mucho muchos muy nada nadie ni ningún ninguna ningunas ninguno ningunos no nos nosotras nosotros nuestra nuestras nuestro nuestros nueva nuevas nuevo nuevos nunca o ocho otra otras otro otros para parece parte partir pasada pasado pero pesar poca pocas poco pocos podemos podrá podrán podría podrían poner por porque posible próximo próximos primer primera primero primeros principalmente propia propias propio propios pudo pueda puede pueden pues qué que quedó queremos quién quien quienes quiere realizó realizado realizar respecto sí sólo se señaló sea sean según segunda segundo seis ser será serán sería si sido siempre siendo siete sigue siguiente sin sino sobre sola solamente solas solo solos son su sus tal también tampoco tan tanto tenía tendrá tendrán tenemos tener tenga tengo tenido tercera tiene tienen toda todas todavía todo todos total tras trata través tres tuvo un una unas uno unos usted va vamos van varias varios veces ver vez y ya yoruby-classifier-reborn-2.2.0/data/stopwords/fi000066400000000000000000000131231324775320200214060ustar00rootroot00000000000000aiemmin aika aikaa aikaan aikaisemmin aikaisin aikajen aikana aikoina aikoo aikovat aina ainakaan ainakin ainoa ainoat aiomme aion aiotte aist aivan ajan älä alas alemmas älköön alkuisin alkuun alla alle aloitamme aloitan aloitat aloitatte aloitattivat aloitettava aloitettevaksi aloitettu aloitimme aloitin aloitit aloititte aloittaa aloittamatta aloitti aloittivat alta aluksi alussa alusta annettavaksi annetteva annettu antaa antamatta antoi aoua apu asia asiaa asian asiasta asiat asioiden asioihin asioita asti avuksi avulla avun avutta edellä edelle 
edelleen edeltä edemmäs edes edessä edestä ehkä ei eikä eilen eivät eli ellei elleivät ellemme ellen ellet ellette emme en enää enemmän eniten ennen ensi ensimmäinen ensimmäiseksi ensimmäisen ensimmäisenä ensimmäiset ensimmäisiä ensimmäisiksi ensimmäisinä ensimmäistä ensin entinen entisen entisiä entistä entisten eräät eräiden eräs eri erittäin erityisesti esi esiin esillä esimerkiksi et eteen etenkin että ette ettei halua haluaa haluamatta haluamme haluan haluat haluatte haluavat halunnut halusi halusimme halusin halusit halusitte halusivat halutessa haluton hän häneen hänellä hänelle häneltä hänen hänessä hänestä hänet he hei heidän heihin heille heiltä heissä heistä heitä helposti heti hetkellä hieman huolimatta huomenna hyvä hyvää hyvät hyviä hyvien hyviin hyviksi hyville hyviltä hyvin hyvinä hyvissä hyvistä ihan ilman ilmeisesti itse itseään itsensä ja jää jälkeen jälleen jo johon joiden joihin joiksi joilla joille joilta joissa joista joita joka jokainen jokin joko joku jolla jolle jolloin jolta jompikumpi jonka jonkin jonne joo jopa jos joskus jossa josta jota jotain joten jotenkin jotenkuten jotka jotta jouduimme jouduin jouduit jouduitte joudumme joudun joudutte joukkoon joukossa joukosta joutua joutui joutuivat joutumaan joutuu joutuvat juuri kahdeksan kahdeksannen kahdella kahdelle kahdelta kahden kahdessa kahdesta kahta kahteen kai kaiken kaikille kaikilta kaikkea kaikki kaikkia kaikkiaan kaikkialla kaikkialle kaikkialta kaikkien kaikkin kaksi kannalta kannattaa kanssa kanssaan kanssamme kanssani kanssanne kanssasi kauan kauemmas kautta kehen keiden keihin keiksi keillä keille keiltä keinä keissä keistä keitä keittä keitten keneen keneksi kenellä kenelle keneltä kenen kenenä kenessä kenestä kenet kenettä kennessästä kerran kerta kertaa kesken keskimäärin ketä ketkä kiitos kohti koko kokonaan kolmas kolme kolmen kolmesti koska koskaan kovin kuin kuinka kuitenkaan kuitenkin kuka kukaan kukin kumpainen kumpainenkaan kumpi kumpikaan kumpikin kun kuten 
kuuden kuusi kuutta kyllä kymmenen kyse lähekkäin lähellä lähelle läheltä lähemmäs lähes lähinnä lähtien läpi liian liki lisää lisäksi luo mahdollisimman mahdollista me meidän meillä meille melkein melko menee meneet menemme menen menet menette menevät meni menimme menin menit menivät mennessä mennyt menossa mihin mikä mikään mikäli mikin miksi milloin minä minne minun minut missä mistä mitä mitään miten moi molemmat mones monesti monet moni moniaalla moniaalle moniaalta monta muassa muiden muita muka mukaan mukaansa mukana mutta muu muualla muualle muualta muuanne muulloin muun muut muuta muutama muutaman muuten myöhemmin myös myöskään myöskin myötä näiden näin näissä näissähin näissälle näissältä näissästä näitä nämä ne neljä neljää neljän niiden niin niistä niitä noin nopeammin nopeasti nopeiten nro nuo nyt ohi oikein ole olemme olen olet olette oleva olevan olevat oli olimme olin olisi olisimme olisin olisit olisitte olisivat olit olitte olivat olla olleet olli ollut oma omaa omaan omaksi omalle omalta oman omassa omat omia omien omiin omiksi omille omilta omissa omista on onkin onko ovat päälle paikoittain paitsi pakosti paljon paremmin parempi parhaillaan parhaiten peräti perusteella pian pieneen pieneksi pienellä pienelle pieneltä pienempi pienestä pieni pienin puolesta puolestaan runsaasti saakka sadam sama samaa samaan samalla samallalta samallassa samallasta saman samat samoin sata sataa satojen se seitsemän sekä sen seuraavat siellä sieltä siihen siinä siis siitä sijaan siksi sillä silloin silti sinä sinne sinua sinulle sinulta sinun sinussa sinusta sinut sisäkkäin sisällä sitä siten sitten suoraan suuntaan suuren suuret suuri suuria suurin suurten taa täällä täältä taas taemmas tähän tahansa tai takaa takaisin takana takia tällä tällöin tämä tämän tänä tänään tänne tapauksessa tässä tästä tätä täten tavalla tavoitteena täysin täytyvät täytyy te tietysti todella toinen toisaalla toisaalle toisaalta toiseen toiseksi toisella toiselle toiselta toisemme 
toisen toisensa toisessa toisesta toista toistaiseksi toki tosin tuhannen tuhat tule tulee tulemme tulen tulet tulette tulevat tulimme tulin tulisi tulisimme tulisin tulisit tulisitte tulisivat tulit tulitte tulivat tulla tulleet tullut tuntuu tuo tuolla tuolloin tuolta tuonne tuskin tykö usea useasti useimmiten usein useita uudeksi uudelleen uuden uudet uusi uusia uusien uusinta uuteen uutta vaan vähän vähemmän vähintään vähiten vai vaiheessa vaikea vaikean vaikeat vaikeilla vaikeille vaikeilta vaikeissa vaikeista vaikka vain välillä varmasti varsin varsinkin varten vasta vastaan vastakkain verran vielä vierekkäin vieri viiden viime viimeinen viimeisen viimeksi viisi voi voidaan voimme voin voisi voit voitte voivat vuoden vuoksi vuosi vuosien vuosina vuotta yhä yhdeksän yhden yhdessä yhtä yhtäällä yhtäälle yhtäältä yhtään yhteen yhteensä yhteydessä yhteyteen yksi yksin yksittäin yleensä ylemmäs yli ylös ympäriruby-classifier-reborn-2.2.0/data/stopwords/fr000066400000000000000000000054461324775320200214300ustar00rootroot00000000000000a à â abord afin ah ai aie ainsi allaient allo allô allons après assez attendu au aucun aucune aujourd aujourd'hui auquel aura auront aussi autre autres aux auxquelles auxquels avaient avais avait avant avec avoir ayant b bah beaucoup bien bigre boum bravo brrr c ça car ce ceci cela celle celle-ci celle-là celles celles-ci celles-là celui celui-ci celui-là cent cependant certain certaine certaines certains certes ces cet cette ceux ceux-ci ceux-là chacun chaque cher chère chères chers chez chiche chut ci cinq cinquantaine cinquante cinquantième cinquième clac clic combien comme comment compris concernant contre couic crac d da dans de debout dedans dehors delà depuis derrière des dès désormais desquelles desquels dessous dessus deux deuxième deuxièmement devant devers devra différent différente différentes différents dire divers diverse diverses dix dix-huit dixième dix-neuf dix-sept doit doivent donc dont douze douzième dring du 
duquel durant e effet eh elle elle-même elles elles-mêmes en encore entre envers environ es ès est et etant étaient étais était étant etc été etre être eu euh eux eux-mêmes excepté f façon fais faisaient faisant fait feront fi flac floc font g gens h ha hé hein hélas hem hep hi ho holà hop hormis hors hou houp hue hui huit huitième hum hurrah i il ils importe j je jusqu jusque k l la là laquelle las le lequel les lès lesquelles lesquels leur leurs longtemps lorsque lui lui-même m ma maint mais malgré me même mêmes merci mes mien mienne miennes miens mille mince moi moi-même moins mon moyennant n na ne néanmoins neuf neuvième ni nombreuses nombreux non nos notre nôtre nôtres nous nous-mêmes nul o o| ô oh ohé olé ollé on ont onze onzième ore ou où ouf ouias oust ouste outre p paf pan par parmi partant particulier particulière particulièrement pas passé pendant personne peu peut peuvent peux pff pfft pfut pif plein plouf plus plusieurs plutôt pouah pour pourquoi premier première premièrement près proche psitt puisque q qu quand quant quanta quant-à-soi quarante quatorze quatre quatre-vingt quatrième quatrièmement que quel quelconque quelle quelles quelque quelques quelqu'un quels qui quiconque quinze quoi quoique r revoici revoilà rien s sa sacrebleu sans sapristi sauf se seize selon sept septième sera seront ses si sien sienne siennes siens sinon six sixième soi soi-même soit soixante son sont sous stop suis suivant sur surtout t ta tac tant te té tel telle tellement telles tels tenant tes tic tien tienne tiennes tiens toc toi toi-même ton touchant toujours tous tout toute toutes treize trente très trois troisième troisièmement trop tsoin tsouin tu u un une unes uns v va vais vas vé vers via vif vifs vingt vivat vive vives vlan voici voilà vont vos votre vôtre vôtres vous vous-mêmes vu w x y z zutruby-classifier-reborn-2.2.0/data/stopwords/hi000066400000000000000000000017041324775320200214120ustar00rootroot00000000000000के का एक में की है यह और से हैं को पर इस होता 
कि जो कर मे गया करने किया लिये अपने ने बनी नहीं तो ही या एवं दिया हो इसका था द्वारा हुआ तक साथ करना वाले बाद लिए आप कुछ सकते किसी ये इसके सबसे इसमें थे दो होने वह वे करते बहुत कहा वर्ग कई करें होती अपनी उनके थी यदि हुई जा ना इसे कहते जब होते कोई हुए व न अभी जैसे सभी करता उनकी तरह उस आदि कुल एस रहा इसकी सकता रहे उनका इसी रखें अपना पे उसके ruby-classifier-reborn-2.2.0/data/stopwords/hu000066400000000000000000000002171324775320200214240ustar00rootroot00000000000000a az egy be ki le fel meg el át rá ide oda szét össze vissza de hát és vagy hogy van lesz volt csak nem igen mint én te õ mi ti õk önruby-classifier-reborn-2.2.0/data/stopwords/it000066400000000000000000000050531324775320200214270ustar00rootroot00000000000000a abbastanza abbiomo accidenti ad adesso affinche affinché agli ahime ahimè ai al alcuna alcuni alcuno all alla alle allo altri altrimenti altro altrui anche ancora anni anno ansa assai attesa avanti avendo avente aver avere avete aveva avuta avute avuti avuto basta bene benissimo berlusconi brava bravo c casa caso cento certa certe certi certo che chi chicchessia chinque chiunque ci ciascuna ciascuno cima cio ciò cioe cioè circa citta città codesta codeste codesti codesto cogli coi col colei coll coloro colui come con concernente consiglio contro cortesia cos cosa cosi cosí così cui d da dagli dai dal dall dalla dalle dallo davanti degli dei del dell della delle dello dentro detto deve di dice dietro dila dire dirimpetto dopo dove dovra dovrà due dunque durante e è ecco ed egli ella eppure era erano esse essendo esser essere essi ex fa fare fatto favore fin finalmente finche finché fine fino forse fra frattanto fuori gia già giacche giacché giorni giorno gli gliela gliele glieli glielo gliene governo grande grazie gruppo ha hai hanno ho i ieri il improvviso in infatti insieme intanto intorno invece invere io l la là lavoro le lei li lo lontano loro lui lungo ma macche macché magari mai male malgrado malissimo me medesimo mediante meglio meno mentre mesi 
mezzo mi mia mie miei mieri mila miliardi milioni ministro mio moltissimo molto mondo nazionale ne né negli nei nel nell nella nelle nello nemmeno neppure nessuna nessuno niente no noi non nondimeno nondimento nostra nostre nostri nostro nulla nuovo o od oggi ogni ognuna ognuno oltre oppure ora ore osi osì ossia paese parecchi parecchie parecchio parte partendo peccato peggio per perche perché perchè percio perciò perfino pero però perque perqué persone piedi pieno piglia piu piú più po pochissimo poco poi poiche poiché press prima primo proprio puo può pure purtroppo qualche qualcuna qualcuno quale quali qualsiani qualunque quando quanta quante quanti quanto quantunque quasi quattro quel quella quelli quello quest questa queste questi questo qui quindi riecco rieccò saltro salvo sara sarà sarebbe scopo scorso se sé secondo seguente sei sempre senza si sí sia siamo siete solito solo sono soppra sopra sotto sta staranno stata state stati stato stesso stresso su sua successivo sue sugli sui sul sull sulla sulle sullo suo suoi tale talvolta tanto te tempo ti torino tra tranne trannefino tre troppo tu tua tue tuo tuoi tutta tuttavia tutte tutti tutto uguali un una uno uomo uori va vale varia varie vario verso vi via vicino vise visé visto vita voi volta vostra vostre vostri vostro ruby-classifier-reborn-2.2.0/data/stopwords/ja000066400000000000000000000004651324775320200214070ustar00rootroot00000000000000 これ それ あれ この その あの ここ そこ あそこ こちら どこ だれ なに なん 何 私 貴方 貴方方 我々 私達 あの人 あのかた 彼女 彼 です あります おります います は が の に を で え から まで より も どの と し それで しかし ruby-classifier-reborn-2.2.0/data/stopwords/nl000066400000000000000000000002601324775320200214170ustar00rootroot00000000000000aan af al als bij dan dat die dit een en er had heb hem het hij hoe hun ik in is je kan me men met mij nog nu of ons ook te tot uit van was wat we wel wij zal ze zei zij zo zouruby-classifier-reborn-2.2.0/data/stopwords/no000066400000000000000000000011201324775320200214160ustar00rootroot00000000000000alle andre 
arbeid av begge bort bra bruke da denne der deres det din disse du eller en ene eneste enhver enn er et folk for fordi forsÛke fra fÅ fÛr fÛrst gjorde gjÛre god gÅ ha hadde han hans hennes her hva hvem hver hvilken hvis hvor hvordan hvorfor i ikke inn innen kan kunne lage lang lik like makt mange med meg meget men mens mer mest min mye mÅ mÅte navn nei ny nÅ nÅr og ogsÅ om opp oss over part punkt pÅ rett riktig samme sant si siden sist skulle slik slutt som start stille sÅ tid til tilbake tilstand under ut uten var ved verdi vi vil ville vite vÅr vÖre vÖrt Åruby-classifier-reborn-2.2.0/data/stopwords/pl000066400000000000000000000007361324775320200214310ustar00rootroot00000000000000a aby ale bardziej bardzo bez bo bowiem bêdzie co czy czyli dla dlatego do gdy gdzie go i ich im innych jak jako jednak jego jej jest jeszcze kiedy kilka która które którego której który których którym którzy lub ma mi miêdzy mnie na nad nam nas naszego naszych nawet nich nie nim o od oraz po pod poza przed przede przez przy siê sobie swoje ta tak takie tam te tego tej ten to tu tych tylko tym u w we wiele wielu wiêc wszystkich wszystkim wszystko z za zawsze ze ruby-classifier-reborn-2.2.0/data/stopwords/pt000066400000000000000000000040461324775320200214370ustar00rootroot00000000000000a à adeus agora aí ainda além algo algumas alguns ali ano anos antes ao aos apenas apoio após aquela aquelas aquele aqueles aqui aquilo área as às assim até atrás através baixo bastante bem bom breve cá cada catorze cedo cento certamente certeza cima cinco coisa com como conselho contra custa da dá dão daquela daquele dar das de debaixo demais dentro depois desde dessa desse desta deste deve deverá dez dezanove dezasseis dezassete dezoito dia diante diz dizem dizer do dois dos doze duas dúvida e é ela elas ele eles em embora entre era és essa essas esse esses esta está estar estas estás estava este estes esteve estive estivemos estiveram estiveste estivestes estou eu exemplo faço falta favor faz fazeis fazem 
fazemos fazer fazes fez fim final foi fomos for foram forma foste fostes fui geral grande grandes grupo há hoje horas isso isto já lá lado local logo longe lugar maior maioria mais mal mas máximo me meio menor menos mês meses meu meus mil minha minhas momento muito muitos na nada não naquela naquele nas nem nenhuma nessa nesse nesta neste nível no noite nome nos nós nossa nossas nosso nossos nova nove novo novos num numa número nunca o obra obrigada obrigado oitava oitavo oito onde ontem onze os ou outra outras outro outros para parece parte partir pela pelas pelo pelos perto pode pôde podem poder põe põem ponto pontos por porque porquê posição possível possivelmente posso pouca pouco primeira primeiro próprio próximo puderam qual quando quanto quarta quarto quatro que quê quem quer quero questão quinta quinto quinze relação sabe são se segunda segundo sei seis sem sempre ser seria sete sétima sétimo seu seus sexta sexto sim sistema sob sobre sois somos sou sua suas tal talvez também tanto tão tarde te tem têm temos tendes tenho tens ter terceira terceiro teu teus teve tive tivemos tiveram tiveste tivestes toda todas todo todos trabalho três treze tu tua tuas tudo um uma umas uns vai vais vão vários vem vêm vens ver vez vezes viagem vindo vinte você vocês vos vós vossa vossas vosso vossos zeroruby-classifier-reborn-2.2.0/data/stopwords/ru000066400000000000000000000106721324775320200214440ustar00rootroot00000000000000 а е и ж м о на не ни об но он мне мои мож она они оно мной много многочисленное многочисленная многочисленные многочисленный мною мой мог могут можно может можхо мор моя моё мочь над нее оба нам нем нами ними мимо немного одной одного менее однажды однако меня нему меньше ней наверху него ниже мало надо один одиннадцать одиннадцатый назад наиболее недавно миллионов недалеко между низко меля нельзя нибудь непрерывно наконец никогда никуда нас наш нет нею неё них мира наша наше наши ничего начала нередко несколько обычно опять около мы ну нх от отовсюду 
особенно нужно очень отсюда в во вон вниз внизу вокруг вот восемнадцать восемнадцатый восемь восьмой вверх вам вами важное важная важные важный вдали везде ведь вас ваш ваша ваше ваши впрочем весь вдруг вы все второй всем всеми времени время всему всего всегда всех всею всю вся всё всюду г год говорил говорит года году где да ее за из ли же им до по ими под иногда довольно именно долго позже более должно пожалуйста значит иметь больше пока ему имя пор пора потом потому после почему почти посреди ей два две двенадцать двенадцатый двадцать двадцатый двух его дел или без день занят занята занято заняты действительно давно девятнадцать девятнадцатый девять девятый даже алло жизнь далеко близко здесь дальше для лет зато даром первый перед затем зачем лишь десять десятый ею её их бы еще при был про процентов против просто бывает бывь если люди была были было будем будет будете будешь прекрасно буду будь будто будут ещё пятнадцать пятнадцатый друго другое другой другие другая других есть пять быть лучше пятый к ком конечно кому кого когда которой которого которая которые который которых кем каждое каждая каждые каждый кажется как какой какая кто кроме куда кругом с т у я та те уж со то том снова тому совсем того тогда тоже собой тобой собою тобою сначала только уметь тот тою хорошо хотеть хочешь хоть хотя свое свои твой своей своего своих свою твоя твоё раз уже сам там тем чем сама сами теми само рано самом самому самой самого семнадцать семнадцатый самим самими самих саму семь чему раньше сейчас чего сегодня себе тебе сеаой человек разве теперь себя тебя седьмой спасибо слишком так такое такой такие также такая сих тех чаще четвертый через часто шестой шестнадцать шестнадцатый шесть четыре четырнадцать четырнадцатый сколько сказал сказала сказать ту ты три эта эти что это чтоб этом этому этой этого чтобы этот стал туда этим этими рядом тринадцать тринадцатый этих третий тут эту суть чуть тысяч 
ruby-classifier-reborn-2.2.0/data/stopwords/se000066400000000000000000000046671324775320200214340ustar00rootroot00000000000000aderton adertonde adjö aldrig alla allas allt alltid alltså än andra andras annan annat ännu artonde artonn åtminstone att åtta åttio åttionde åttonde av även båda bådas bakom bara bäst bättre behöva behövas behövde behövt beslut beslutat beslutit bland blev bli blir blivit bort borta bra då dag dagar dagarna dagen där därför de del delen dem den deras dess det detta dig din dina dit ditt dock du efter eftersom elfte eller elva en enkel enkelt enkla enligt er era ert ett ettusen få fanns får fått fem femte femtio femtionde femton femtonde fick fin finnas finns fjärde fjorton fjortonde fler flera flesta följande för före förlåt förra första fram framför från fyra fyrtio fyrtionde gå gälla gäller gällt går gärna gått genast genom gick gjorde gjort god goda godare godast gör göra gott ha hade haft han hans har här heller hellre helst helt henne hennes hit hög höger högre högst hon honom hundra hundraen hundraett hur i ibland idag igår igen imorgon in inför inga ingen ingenting inget innan inne inom inte inuti ja jag jämfört kan kanske knappast kom komma kommer kommit kr kunde kunna kunnat kvar länge längre långsam långsammare långsammast långsamt längst långt lätt lättare lättast legat ligga ligger lika likställd likställda lilla lite liten litet man många måste med mellan men mer mera mest mig min mina mindre minst mitt mittemot möjlig möjligen möjligt möjligtvis mot mycket någon någonting något några när nästa ned nederst nedersta nedre nej ner ni nio nionde nittio nittionde nitton nittonde nödvändig nödvändiga nödvändigt nödvändigtvis nog noll nr nu nummer och också ofta oftast olika olikt om oss över övermorgon överst övre på rakt rätt redan så sade säga säger sagt samma sämre sämst sedan senare senast sent sex sextio sextionde sexton sextonde sig sin sina sist sista siste sitt sjätte sju sjunde sjuttio sjuttionde sjutton sjuttonde ska 
skall skulle slutligen små smått snart som stor stora större störst stort tack tidig tidigare tidigast tidigt till tills tillsammans tio tionde tjugo tjugoen tjugoett tjugonde tjugotre tjugotvå tjungo tolfte tolv tre tredje trettio trettionde tretton trettonde två tvåhundra under upp ur ursäkt ut utan utanför ute vad vänster vänstra var vår vara våra varför varifrån varit varken värre varsågod vart vårt vem vems verkligen vi vid vidare viktig viktigare viktigast viktigt vilka vilken vilket villruby-classifier-reborn-2.2.0/data/stopwords/tr000066400000000000000000000033641324775320200214430ustar00rootroot00000000000000a acaba altı altmış ama ancak arada artık asla aslında ayrıca az bana bazen bazı bazıları belki ben benden beni benim beri beş bile bilhassa bin bir biraz birçoğu birçok biri birisi birkaç birşey biz bizden bize bizi bizim böyle böylece bu buna bunda bundan bunlar bunları bunların bunu bunun burada bütün çoğu çoğunu çok çünkü da daha dahi dan de defa değil diğer diğeri diğerleri diye doksan dokuz dolayı dolayısıyla dört e edecek eden ederek edilecek ediliyor edilmesi ediyor eğer elbette elli en etmesi etti ettiği ettiğini fakat falan filan gene gereği gerek gibi göre hala halde halen hangi hangisi hani hatta hem henüz hep hepsi her herhangi herkes herkese herkesi herkesin hiç hiçbir hiçbiri i ı için içinde iki ile ilgili ise işte itibaren itibariyle kaç kadar karşın kendi kendilerine kendine kendini kendisi kendisine kendisini kez ki kim kime kimi kimin kimisi kimse kırk madem mi mı milyar milyon mu mü nasıl ne neden nedenle nerde nerede nereye neyse niçin nin nın niye nun nün o öbür olan olarak oldu olduğu olduğunu olduklarını olmadı olmadığı olmak olması olmayan olmaz olsa olsun olup olur olur olursa oluyor on ön ona önce ondan onlar onlara onlardan onları onların onu onun orada öte ötürü otuz öyle oysa pek rağmen sana sanki sanki şayet şekilde sekiz seksen sen senden seni senin şey şeyden şeye şeyi şeyler şimdi siz siz sizden sizden size sizi sizi 
sizin sizin sonra şöyle şu şuna şunları şunu ta tabii tam tamam tamamen tarafından trilyon tüm tümü u ü üç un ün üzere var vardı ve veya ya yani yapacak yapılan yapılması yapıyor yapmak yaptı yaptığı yaptığını yaptıkları ye yedi yerine yetmiş yi yı yine yirmi yoksa yu yüz zaten zira ruby-classifier-reborn-2.2.0/data/stopwords/vi000066400000000000000000000134641324775320200214360ustar00rootroot00000000000000/ á à ạ á à a ha à ơi ạ ơi ai ái ai ai ái chà ái dà ai nấy alô a-lô amen áng anh ào ắt ắt hẳn ắt là âu là ầu ơ ấy bà bác bài bản bạn bằng bằng ấy bằng không bằng nấy bao giờ bao lâu bao nả bao nhiêu bập bà bập bõm bập bõm bất kỳ bất chợt bất cứ bắt đầu từ bất đồ bất giác bất kể bất kì bất luận bất nhược bất quá bất thình lình bất tử bấy bây bẩy bay biến bấy chầy bây chừ bấy chừ bây giờ bấy giờ bấy lâu bấy lâu nay bấy nay bây nhiêu bấy nhiêu bèn bển béng bệt bị biết biết bao biết bao nhiêu biết chừng nào biết đâu biết đâu chừng biết đâu đấy biết mấy bớ bộ bỏ mẹ bởi bởi chưng bởi nhưng bội phần bởi thế bởi vậy bởi vì bông bỗng bỗng chốc bỗng đâu bỗng dưng bỗng không bỗng nhiên bức cả cả thảy cả thể các cảm ơn căn căn cắt càng cật lực cật sức cậu cây cha cha chả chắc chậc chắc hẳn chầm chập chăn chắn chăng chẳng lẽ chẳng những chẳng nữa chẳng phải chành chạnh chao ôi chết nỗi chết thật chết tiệt chí chết chiếc chỉn chính chính là chính thị cho chớ chớ chi cho đến cho đến khi cho nên cho tới cho tới khi choa chốc chốc chợt chú chứ chu cha chứ lị chú mày chú mình chui cha chủn chùn chùn chùn chũn chung cục chúng mình chung qui chung quy chung quy lại chúng ta chúng tôi có cô cơ có thể có chăng là cơ chừng có dễ cơ hồ cổ lai cơ mà cô mình có vẻ cóc khô coi bộ coi mòi con còn cơn công nhiên cứ cu cậu cứ việc của cực lực cùng cũng cùng cực cùng nhau cũng như cũng vậy cũng vậy thôi cùng với cuộc cuốn dạ đại để đại loại đại nhân đại phàm dần dà dần dần đang đáng lẽ đáng lí đáng lý đành đạch đánh đùng dào đáo để dẫu dầu sao dẫu sao đây để dễ sợ dễ thường đến dì đi điều do 
đó dở chừng do đó do vậy do vì dữ dù cho dù rằng dưới duy em gì giữa hầu hết họ hỏi khác khi là lại làm lần lên luôn mà mình mợ một muốn năm nào này nấy nên nền nên chi nếu nếu như ngăn ngắt ngay ngày ngay cả ngày càng ngay khi ngay lập tức ngay lúc ngày ngày ngay từ ngay tức khắc ngày xưa ngày xửa nghe chừng nghe đâu nghen nghiễm nhiên nghỉm ngõ hầu ngộ nhỡ ngoài ngoải ngôi ngọn ngọt ngươi người nhà nhận nhân dịp nhân tiện nhất nhất đán nhất định nhất loạt nhất luật nhất mực nhất nhất nhất quyết nhất sinh nhất tâm nhất tề nhất thiết nhau nhé nhỉ nhiên hậu nhiệt liệt nhỡ ra nhón nhén như như chơi như không như quả như thể như tuồng như vậy nhưng những nhưng mà những ai nhung nhăng những như nhược bằng nó nọ nớ nóc nói nữa nức nở ồ ơ ớ ờ ở ở trên ô hay ơ hay ô hô ô kê ô kìa ơ kìa oái oai oái ơi ôi chao ối dào ối giời ối giời ơi ôi thôi ông ổng phải phải chăng phải chi phăn phắt phắt phè phỉ phui pho phóc phốc phỏng phỏng như phót phương chi phụt phứt qua quả quá chừng quá độ quá đỗi quả đúng quá lắm quả là quả tang qua quít qua quýt quá sá quả thật quá thể quả tình quá trời quá ư quả vậy quá xá quý hồ quyển quyết quyết nhiên ra ra phết ra trò răng rằng rằng là ráo ráo trọi rày rén ren rén rích riêng riệt riu ríu rồi rón rén rốt cục rốt cuộc rứa rút cục sa sả sạch sao sắp sất sau sau chót sau cùng sau cuối sau đó sẽ sì số sở dĩ số là song le sốt sột sự suýt tà tà tại tại vì tấm tăm tắp tấn tanh tắp tắp lự tất cả tất tần tật tất tật tất thảy tênh thà tha hồ thà là thà rằng thái quá thậm thậm chí than ôi tháng thanh thành ra thành thử thảo hèn thảo nào thật lực thật ra thật vậy thấy thẩy thế thế à thế là thế mà thế nào thế nên thế ra thế thì thếch theo thì thi thoảng thím thình lình thỉnh thoảng thoắt thoạt thoạt nhiên thốc thộc thốc tháo thôi thời gian thỏm thốt thốt nhiên thửa thuần thực sự thục mạng thực ra thực vậy thúng thắng thương ôi tiện thể tiếp đó tiếp theo tít mù tớ tỏ ra tò te tỏ vẻ toà tốc tả toé khói toẹt tôi tới tối ư tông tốc tọt tột trên trển trệt trếu 
tráo trệu trạo trời đất ơi trong trỏng trừ phi trước trước đây trước đó trước kia trước nay trước tiên từ tù tì tự vì tuần tự tức thì tức tốc từng tuốt luốt tuốt tuồn tuột tuốt tuột tựu trung tuy tuy nhiên tuy rằng tuy thế tuy vậy tuyệt nhiên ư ừ ử ứ hự ứ ừ ủa úi úi chà úi dào và vả chăng vả lại vẫn vạn nhất vâng văng tê vào vậy vậy là vậy thì về veo vèo veo veo vì ví bằng vì chưng ví dù ví phỏng vị tất vì thế ví thử vì vậy việc vở vô hình trung vô kể vô luận vô vàn với với lại vốn dĩ vừa mới vung tán tàn vung tàn tán vung thiên địa vụt xa xả xăm xăm xăm xắm xăm xúi xềnh xệch xệp xiết bao xoẳn xoành xoạch xoét xoẹt xon xón xuất kì bất ý xuất kỳ bất ý xuể xuống ý ý chừng ý da cái cần chỉ chưa chuyện của đã đến nỗi đều được không lúc mỗi một cách nhiều nơi rất so vừa cao quá hay lớn mới hơn thường hoặc nh ngoài ra hoàn toàn thì thôi ra sao ruby-classifier-reborn-2.2.0/data/stopwords/zh000066400000000000000000000007651324775320200214410ustar00rootroot00000000000000 的 一 不 在 人 有 是 为 以 于 上 他 而 后 之 来 及 了 因 下 可 到 由 这 与 也 此 但 并 个 其 已 无 小 我 们 起 最 再 今 去 好 只 又 或 很 亦 某 把 那 你 乃 它 吧 被 比 别 趁 当 从 到 得 打 凡 儿 尔 该 各 给 跟 和 何 还 即 几 既 看 据 距 靠 啦 了 另 么 每 们 嘛 拿 哪 那 您 凭 且 却 让 仍 啥 如 若 使 谁 虽 随 同 所 她 哇 嗡 往 哪 些 向 沿 哟 用 于 咱 则 怎 曾 至 致 着 诸 自 ruby-classifier-reborn-2.2.0/lib/000077500000000000000000000000001324775320200166565ustar00rootroot00000000000000ruby-classifier-reborn-2.2.0/lib/classifier-reborn.rb000066400000000000000000000030021324775320200226070ustar00rootroot00000000000000#-- # Copyright (c) 2005 Lucas Carlson # # Permission is hereby granted, free of charge, to any person obtaining # a copy of this software and associated documentation files (the # "Software"), to deal in the Software without restriction, including # without limitation the rights to use, copy, modify, merge, publish, # distribute, sublicense, and/or sell copies of the Software, and to # permit persons to whom the Software is furnished to do so, subject to # the following conditions: # # The above copyright notice and 
this permission notice shall be # included in all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE # LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION # WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. #++ # Author:: Lucas Carlson (mailto:lucas@rufy.com) # Copyright:: Copyright (c) 2005 Lucas Carlson # License:: LGPL require 'rubygems' case RUBY_PLATFORM when 'java' require 'jruby-stemmer' else require 'fast-stemmer' end require_relative 'classifier-reborn/category_namer' require_relative 'classifier-reborn/bayes' require_relative 'classifier-reborn/lsi' require_relative 'classifier-reborn/validators/classifier_validator'ruby-classifier-reborn-2.2.0/lib/classifier-reborn/000077500000000000000000000000001324775320200222675ustar00rootroot00000000000000ruby-classifier-reborn-2.2.0/lib/classifier-reborn/backends/000077500000000000000000000000001324775320200240415ustar00rootroot00000000000000ruby-classifier-reborn-2.2.0/lib/classifier-reborn/backends/bayes_memory_backend.rb000066400000000000000000000032411324775320200305300ustar00rootroot00000000000000module ClassifierReborn class BayesMemoryBackend attr_reader :total_words, :total_trainings # This class provides Memory as the storage backend for the classifier data structures def initialize @total_words = 0 @total_trainings = 0 @category_counts = {} @categories = {} end def update_total_words(diff) @total_words += diff end def update_total_trainings(diff) @total_trainings += diff end def category_training_count(category) category_counts(category)[:training] end def update_category_training_count(category, diff) category_counts(category)[:training] += 
diff end def category_has_trainings?(category) @category_counts.key?(category) && category_training_count(category) > 0 end def category_word_count(category) category_counts(category)[:word] end def update_category_word_count(category, diff) category_counts(category)[:word] += diff end def add_category(category) @categories[category] ||= Hash.new(0) end def category_keys @categories.keys end def category_word_frequency(category, word) @categories[category][word] end def update_category_word_frequency(category, word, diff) @categories[category][word] += diff end def delete_category_word(category, word) @categories[category].delete(word) end def word_in_category?(category, word) @categories[category].key?(word) end def reset initialize end private def category_counts(category) @category_counts[category] ||= {training: 0, word: 0} end end end ruby-classifier-reborn-2.2.0/lib/classifier-reborn/backends/bayes_redis_backend.rb000066400000000000000000000057261324775320200303400ustar00rootroot00000000000000require_relative 'no_redis_error' # require redis when we run #intialize. 
This way only people using this backend # will need to install and load the backend without having to # require 'classifier-reborn/backends/bayes_redis_backend' module ClassifierReborn # This class provides Redis as the storage backend for the classifier data structures class BayesRedisBackend # The class can be created with the same arguments that the redis gem accepts # E.g., # b = ClassifierReborn::BayesRedisBackend.new # b = ClassifierReborn::BayesRedisBackend.new host: "10.0.1.1", port: 6380, db: 2 # b = ClassifierReborn::BayesRedisBackend.new url: "redis://:secret@10.0.1.1:6380/2" # # Options available are: # url: lambda { ENV["REDIS_URL"] } # scheme: "redis" # host: "127.0.0.1" # port: 6379 # path: nil # timeout: 5.0 # password: nil # db: 0 # driver: nil # id: nil # tcp_keepalive: 0 # reconnect_attempts: 1 # inherit_socket: false def initialize(options = {}) begin # because some people don't have redis installed require 'redis' rescue LoadError raise NoRedisError end @redis = Redis.new(options) @redis.setnx(:total_words, 0) @redis.setnx(:total_trainings, 0) end def total_words @redis.get(:total_words).to_i end def update_total_words(diff) @redis.incrby(:total_words, diff) end def total_trainings @redis.get(:total_trainings).to_i end def update_total_trainings(diff) @redis.incrby(:total_trainings, diff) end def category_training_count(category) @redis.hget(:category_training_count, category).to_i end def update_category_training_count(category, diff) @redis.hincrby(:category_training_count, category, diff) end def category_has_trainings?(category) category_training_count(category) > 0 end def category_word_count(category) @redis.hget(:category_word_count, category).to_i end def update_category_word_count(category, diff) @redis.hincrby(:category_word_count, category, diff) end def add_category(category) @redis.sadd(:category_keys, category) end def category_keys @redis.smembers(:category_keys).map(&:intern) end def category_word_frequency(category, word) 
@redis.hget(category, word).to_i end def update_category_word_frequency(category, word, diff) @redis.hincrby(category, word, diff) end def delete_category_word(category, word) @redis.hdel(category, word) end def word_in_category?(category, word) @redis.hexists(category, word) end def reset @redis.flushdb @redis.set(:total_words, 0) @redis.set(:total_trainings, 0) end end end ruby-classifier-reborn-2.2.0/lib/classifier-reborn/backends/no_redis_error.rb000066400000000000000000000007531324775320200274060ustar00rootroot00000000000000class NoRedisError < LoadError def initialize msg = %q{The Redis Backend can only be used if Redis is installed. This error is raised from 'lib/classifier-reborn/backends/bayes_redis_backend.rb'. If you have encountered this error and would like to use the Redis Backend, please run 'gem install redis' or include 'gem "redis"' in your gemfile. For more info see https://github.com/jekyll/classifier-reborn#usage. } super(msg) end end ruby-classifier-reborn-2.2.0/lib/classifier-reborn/bayes.rb000066400000000000000000000230311324775320200237160ustar00rootroot00000000000000# Author:: Lucas Carlson (mailto:lucas@rufy.com) # Copyright:: Copyright (c) 2005 Lucas Carlson # License:: LGPL require 'set' require_relative 'category_namer' require_relative 'backends/bayes_memory_backend' require_relative 'backends/bayes_redis_backend' module ClassifierReborn class Bayes CategoryNotFoundError = Class.new(StandardError) # The class can be created with one or more categories, each of which will be # initialized and given a training method. 
E.g., # b = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting', 'Spam' # # Options available are: # language: 'en' Used to select language specific stop words # auto_categorize: false When true, enables ability to dynamically declare a category; the default is true if no initial categories are provided # enable_threshold: false When true, enables a threshold requirement for classification # threshold: 0.0 Default threshold, only used when enabled # enable_stemmer: true When false, disables word stemming # stopwords: nil Accepts path to a text file or an array of words, when supplied, overwrites the default stopwords; assign empty string or array to disable stopwords # backend: BayesMemoryBackend.new Alternatively, BayesRedisBackend.new for persistent storage def initialize(*args) @initial_categories = [] options = { language: 'en', enable_threshold: false, threshold: 0.0, enable_stemmer: true, backend: BayesMemoryBackend.new } args.flatten.each do |arg| if arg.is_a?(Hash) options.merge!(arg) else @initial_categories.push(arg) end end unless options.key?(:auto_categorize) options[:auto_categorize] = @initial_categories.empty? ? true : false end @language = options[:language] @auto_categorize = options[:auto_categorize] @enable_threshold = options[:enable_threshold] @threshold = options[:threshold] @enable_stemmer = options[:enable_stemmer] @backend = options[:backend] populate_initial_categories if options.key?(:stopwords) custom_stopwords options[:stopwords] end end # Provides a general training method for all categories specified in Bayes#new # For example: # b = ClassifierReborn::Bayes.new 'This', 'That', 'the_other' # b.train :this, "This text" # b.train "that", "That text" # b.train "The other", "The other text" def train(category, text) word_hash = Hasher.word_hash(text, @language, @enable_stemmer) return if word_hash.empty? 
category = CategoryNamer.prepare_name(category) # Add the category dynamically or raise an error unless category_keys.include?(category) if @auto_categorize add_category(category) else raise CategoryNotFoundError, "Cannot train; category #{category} does not exist" end end word_hash.each do |word, count| @backend.update_category_word_frequency(category, word, count) @backend.update_category_word_count(category, count) @backend.update_total_words(count) end @backend.update_total_trainings(1) @backend.update_category_training_count(category, 1) end # Provides a untraining method for all categories specified in Bayes#new # Be very careful with this method. # # For example: # b = ClassifierReborn::Bayes.new 'This', 'That', 'the_other' # b.train :this, "This text" # b.untrain :this, "This text" def untrain(category, text) word_hash = Hasher.word_hash(text, @language, @enable_stemmer) return if word_hash.empty? category = CategoryNamer.prepare_name(category) word_hash.each do |word, count| next if @backend.total_words < 0 orig = @backend.category_word_frequency(category, word) || 0 @backend.update_category_word_frequency(category, word, -count) if @backend.category_word_frequency(category, word) <= 0 @backend.delete_category_word(category, word) count = orig end @backend.update_category_word_count(category, -count) if @backend.category_word_count(category) >= count @backend.update_total_words(-count) end @backend.update_total_trainings(-1) @backend.update_category_training_count(category, -1) end # Returns the scores in each category the provided +text+. E.g., # b.classifications "I hate bad words and you" # => {"Uninteresting"=>-12.6997928013932, "Interesting"=>-18.4206807439524} # The largest of these scores (the one closest to 0) is the one picked out by #classify def classifications(text) score = {} word_hash = Hasher.word_hash(text, @language, @enable_stemmer) if word_hash.empty? 
category_keys.each do |category| score[category.to_s] = Float::INFINITY end return score end category_keys.each do |category| score[category.to_s] = 0 total = (@backend.category_word_count(category) || 1).to_f word_hash.each do |word, _count| s = @backend.word_in_category?(category, word) ? @backend.category_word_frequency(category, word) : 0.1 score[category.to_s] += Math.log(s / total) end # now add prior probability for the category s = @backend.category_has_trainings?(category) ? @backend.category_training_count(category) : 0.1 score[category.to_s] += Math.log(s / @backend.total_trainings.to_f) end score end # Returns the classification of the provided +text+, which is one of the # categories given in the initializer along with the score. E.g., # b.classify "I hate bad words and you" # => ['Uninteresting', -4.852030263919617] def classify_with_score(text) (classifications(text).sort_by { |a| -a[1] })[0] end # Return the classification without the score def classify(text) result, score = classify_with_score(text) result = nil if score < @threshold || score == Float::INFINITY if threshold_enabled? result end # Retrieve the current threshold value attr_reader :threshold # Dynamically set the threshold value attr_writer :threshold # Dynamically enable threshold for classify results def enable_threshold @enable_threshold = true end # Dynamically disable threshold for classify results def disable_threshold @enable_threshold = false end # Is threshold processing enabled? def threshold_enabled? @enable_threshold end # is threshold processing disabled? def threshold_disabled? !@enable_threshold end # Is word stemming enabled? def stemmer_enabled? @enable_stemmer end # Is word stemming disabled? def stemmer_disabled? 
!@enable_stemmer end # Provides training and untraining methods for the categories specified in Bayes#new # For example: # b = ClassifierReborn::Bayes.new 'This', 'That', 'the_other' # b.train_this "This text" # b.train_that "That text" # b.untrain_that "That text" # b.train_the_other "The other text" def method_missing(name, *args) cleaned_name = name.to_s.gsub(/(un)?train_([\w]+)/, '\2') category = CategoryNamer.prepare_name(cleaned_name) if category_keys.include?(category) args.each { |text| eval("#{Regexp.last_match(1)}train(category, text)") } elsif name.to_s =~ /(un)?train_([\w]+)/ raise StandardError, "No such category: #{category}" else super # raise StandardError, "No such method: #{name}" end end # Provides a list of category names # For example: # b.categories # => ["This", "That", "The other"] def categories category_keys.collect(&:to_s) end # Provides a list of category keys as symbols # For example: # b.categories # => [:This, :That, :"The other"] def category_keys @backend.category_keys end # Allows you to add categories to the classifier. # For example: # b.add_category "Not spam" # # WARNING: Adding categories to a trained classifier will # result in an undertrained category that will tend to match # more criteria than the trained selective categories. In short, # try to initialize your categories at initialization. def add_category(category) category = CategoryNamer.prepare_name(category) @backend.add_category(category) end alias_method :append_category, :add_category def reset @backend.reset populate_initial_categories end private def populate_initial_categories @initial_categories.each do |c| add_category(c) end end # Overwrites the default stopwords for current language with supplied list of stopwords or file def custom_stopwords(stopwords) unless stopwords.is_a?(Enumerable) if stopwords.strip.empty? 
stopwords = [] elsif File.exist?(stopwords) stopwords = File.read(stopwords).force_encoding("utf-8").split else return # Do not overwrite the default end end Hasher::STOPWORDS[@language] = Set.new stopwords end end end ruby-classifier-reborn-2.2.0/lib/classifier-reborn/category_namer.rb000066400000000000000000000005601324775320200256140ustar00rootroot00000000000000# Author:: Lucas Carlson (mailto:lucas@rufy.com) # Copyright:: Copyright (c) 2005 Lucas Carlson # License:: LGPL require 'classifier-reborn/extensions/hasher' module ClassifierReborn module CategoryNamer module_function def prepare_name(name) return name if name.is_a?(Symbol) name.to_s.tr('_', ' ').capitalize.intern end end end ruby-classifier-reborn-2.2.0/lib/classifier-reborn/extensions/000077500000000000000000000000001324775320200244665ustar00rootroot00000000000000ruby-classifier-reborn-2.2.0/lib/classifier-reborn/extensions/hasher.rb000066400000000000000000000037041324775320200262710ustar00rootroot00000000000000# encoding: utf-8 # Author:: Lucas Carlson (mailto:lucas@rufy.com) # Copyright:: Copyright (c) 2005 Lucas Carlson # License:: LGPL require 'set' module ClassifierReborn module Hasher STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../data/stopwords')] module_function # Return a Hash of strings => ints. Each word in the string is stemmed, # interned, and indexes to its frequency in the document. 
def word_hash(str, language = 'en', enable_stemmer = true) cleaned_word_hash = clean_word_hash(str, language, enable_stemmer) symbol_hash = word_hash_for_symbols(str.scan(/[^\s\p{WORD}]/)) cleaned_word_hash.merge(symbol_hash) end # Return a word hash without extra punctuation or short symbols, just stemmed words def clean_word_hash(str, language = 'en', enable_stemmer = true) word_hash_for_words(str.gsub(/[^\p{WORD}\s]/, '').downcase.split, language, enable_stemmer) end def word_hash_for_words(words, language = 'en', enable_stemmer = true) d = Hash.new(0) words.each do |word| next unless word.length > 2 && !STOPWORDS[language].include?(word) if enable_stemmer d[word.stem.intern] += 1 else d[word.intern] += 1 end end d end # Add custom path to a new stopword file created by user def add_custom_stopword_path(path) STOPWORDS_PATH.unshift(path) end def word_hash_for_symbols(words) d = Hash.new(0) words.each do |word| d[word.intern] += 1 end d end # Create a lazily-loaded hash of stopword data STOPWORDS = Hash.new do |hash, language| hash[language] = [] STOPWORDS_PATH.each do |path| if File.exist?(File.join(path, language)) hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding("utf-8").split break end end hash[language] end end end ruby-classifier-reborn-2.2.0/lib/classifier-reborn/extensions/vector.rb000066400000000000000000000035251324775320200263220ustar00rootroot00000000000000# Author:: Ernest Ellingson # Copyright:: Copyright (c) 2005 # These are extensions to the std-lib 'matrix' to allow an all ruby SVD require 'matrix' class Matrix def self.diag(s) Matrix.diagonal(*s) end alias_method :trans, :transpose def SV_decomp(maxSweeps = 20) if row_size >= column_size q = trans * self else q = self * trans end qrot = q.dup v = Matrix.identity(q.row_size) mzrot = nil cnt = 0 s_old = nil loop do cnt += 1 (0...qrot.row_size - 1).each do |row| (1..qrot.row_size - 1).each do |col| next if row == col h = Math.atan((2 * qrot[row, col]) / (qrot[row, 
row] - qrot[col, col])) / 2.0 hcos = Math.cos(h) hsin = Math.sin(h) mzrot = Matrix.identity(qrot.row_size) mzrot[row, row] = hcos mzrot[row, col] = -hsin mzrot[col, row] = hsin mzrot[col, col] = hcos qrot = mzrot.trans * qrot * mzrot v *= mzrot end end s_old = qrot.dup if cnt == 1 sum_qrot = 0.0 if cnt > 1 qrot.row_size.times do |r| sum_qrot += (qrot[r, r] - s_old[r, r]).abs if (qrot[r, r] - s_old[r, r]).abs > 0.001 end s_old = qrot.dup end break if (sum_qrot <= 0.001 && cnt > 1) || cnt >= maxSweeps end # of do while true s = [] qrot.row_size.times do |r| s << Math.sqrt(qrot[r, r]) end # puts "cnt = #{cnt}" if row_size >= column_size mu = self * v * Matrix.diagonal(*s).inverse return [mu, v, s] else puts v.row_size puts v.column_size puts row_size puts column_size puts s.size mu = (trans * v * Matrix.diagonal(*s).inverse) return [mu, v, s] end end def []=(i, j, val) @rows[i][j] = val end end ruby-classifier-reborn-2.2.0/lib/classifier-reborn/extensions/vector_serialize.rb000066400000000000000000000004111324775320200303600ustar00rootroot00000000000000module GSL class Vector def _dump(_v) Marshal.dump(to_a) end def self._load(arr) arry = Marshal.load(arr) GSL::Vector.alloc(arry) end end class Matrix class < false # If you want to use ContentNodes with cached vector transpositions, use # lsi = ClassifierReborn::LSI.new :cache_node_vectors => true # def initialize(options = {}) @auto_rebuild = options[:auto_rebuild] != false @word_list = WordList.new @items = {} @version = 0 @built_at_version = -1 @language = options[:language] || 'en' extend CachedContentNode::InstanceMethods if @cache_node_vectors = options[:cache_node_vectors] end # Returns true if the index needs to be rebuilt. The index needs # to be built after all informaton is added, but before you start # using it for search, classification and cluster detection. def needs_rebuild? (@items.size > 1) && (@version != @built_at_version) end # Adds an item to the index. 
item is assumed to be a string, but # any item may be indexed so long as it responds to #to_s or if # you provide an optional block explaining how the indexer can # fetch fresh string data. This optional block is passed the item, # so the item may only be a reference to a URL or file name. # # For example: # lsi = ClassifierReborn::LSI.new # lsi.add_item "This is just plain text" # lsi.add_item("/home/me/filename.txt") { |x| File.read x } # ar = ActiveRecordObject.find( :all ) # lsi.add_item(ar, *ar.categories) { |x| ar.content } # def add_item(item, *categories, &block) clean_word_hash = Hasher.clean_word_hash((block ? block.call(item) : item.to_s), @language) if clean_word_hash.empty? puts "Input: '#{item}' is entirely stopwords or words with 2 or fewer characters. Classifier-Reborn cannot handle this document properly." else @items[item] = if @cache_node_vectors CachedContentNode.new(clean_word_hash, *categories) else ContentNode.new(clean_word_hash, *categories) end @version += 1 build_index if @auto_rebuild end end # A less flexible shorthand for add_item that assumes # you are passing in a string with no categories. item # will be duck typed via to_s . # def <<(item) add_item(item) end # Returns categories for a given indexed item. You are free to add and remove # items from this as you see fit. It does not invalidate an index to change its categories. def categories_for(item) return [] unless @items[item] @items[item].categories end # Removes an item from the database, if it is indexed. # def remove_item(item) return unless @items.key? item @items.delete item @version += 1 end # Returns an array of items that are indexed. def items @items.keys end # This function rebuilds the index if needs_rebuild? returns true. # For very large document spaces, this indexing operation may take some # time to complete, so it may be wise to place the operation in another # thread. 
# # As a rule, indexing will be fairly swift on modern machines until # you have well over 500 documents indexed, or have an incredibly diverse # vocabulary for your documents. # # The optional parameter "cutoff" is a tuning parameter. When the index is # built, a certain number of s-values are discarded from the system. The # cutoff parameter tells the indexer how many of these values to keep. # A value of 1 for cutoff means that no semantic analysis will take place, # turning the LSI class into a simple vector search engine. def build_index(cutoff = 0.75) return unless needs_rebuild? make_word_list doc_list = @items.values tda = doc_list.collect { |node| node.raw_vector_with(@word_list) } if $GSL tdm = GSL::Matrix.alloc(*tda).trans ntdm = build_reduced_matrix(tdm, cutoff) ntdm.size[1].times do |col| vec = GSL::Vector.alloc(ntdm.column(col)).row doc_list[col].lsi_vector = vec doc_list[col].lsi_norm = vec.normalize end else tdm = Matrix.rows(tda).trans ntdm = build_reduced_matrix(tdm, cutoff) ntdm.row_size.times do |col| doc_list[col].lsi_vector = ntdm.column(col) if doc_list[col] doc_list[col].lsi_norm = ntdm.column(col).normalize if doc_list[col] end end @built_at_version = @version end # This method returns max_chunks entries, ordered by their average semantic rating. # Essentially, the average distance of each entry from all other entries is calculated, # the highest are returned. # # This can be used to build a summary service, or to provide more information about # your dataset's general content. For example, if you were to use categorize on the # results of this data, you could gather information on what your dataset is generally # about. def highest_relative_content(max_chunks = 10) return [] if needs_rebuild? 
avg_density = {} @items.each_key { |item| avg_density[item] = proximity_array_for_content(item).inject(0.0) { |x, y| x + y[1] } } avg_density.keys.sort_by { |x| avg_density[x] }.reverse[0..max_chunks - 1].map end # This function is the primitive that find_related and classify # build upon. It returns an array of 2-element arrays. The first element # of this array is a document, and the second is its "score", defining # how "close" it is to other indexed items. # # These values are somewhat arbitrary, having to do with the vector space # created by your content, so the magnitude is interpretable but not always # meaningful between indexes. # # The parameter doc is the content to compare. If that content is not # indexed, you can pass an optional block to define how to create the # text data. See add_item for examples of how this works. def proximity_array_for_content(doc, &block) return [] if needs_rebuild? content_node = node_for_content(doc, &block) result = @items.keys.collect do |item| if $GSL val = content_node.search_vector * @items[item].transposed_search_vector else val = (Matrix[content_node.search_vector] * @items[item].search_vector)[0] end [item, val] end result.sort_by { |x| x[1] }.reverse end # Similar to proximity_array_for_content, this function takes similar # arguments and returns a similar array. However, it uses the normalized # calculated vectors instead of their full versions. This is useful when # you're trying to perform operations on content that is much smaller than # the text you're working with. search uses this primitive. def proximity_norms_for_content(doc, &block) return [] if needs_rebuild? content_node = node_for_content(doc, &block) if $GSL && content_node.raw_norm.isnan?.all? 
        puts "There are no documents that are similar to #{doc}"
      else
        content_node_norms(content_node)
      end
    end

    def content_node_norms(content_node)
      result = @items.keys.collect do |item|
        if $GSL
          val = content_node.search_norm * @items[item].search_norm.col
        else
          val = (Matrix[content_node.search_norm] * @items[item].search_norm)[0]
        end
        [item, val]
      end
      result.sort_by { |x| x[1] }.reverse
    end

    # This function allows for text-based search of your index. Unlike other functions
    # like find_related and classify, search only takes short strings. It will also ignore
    # factors like repeated words. It is best for short, google-like search terms.
    # A search will first prioritize lexical relationships, then semantic ones.
    #
    # While this may seem backwards compared to the other functions that LSI supports,
    # it is actually the same algorithm, just applied on a smaller document.
    def search(string, max_nearest = 3)
      return [] if needs_rebuild?
      carry = proximity_norms_for_content(string)
      unless carry.nil?
        result = carry.collect { |x| x[0] }
        result[0..max_nearest - 1]
      end
    end

    # This function takes content and finds other documents
    # that are semantically "close", returning an array of documents sorted
    # from most to least relevant.
    # max_nearest specifies the number of documents to return. A value of
    # 0 means that it returns all the indexed documents, sorted by relevance.
    #
    # This is particularly useful for identifying clusters in your document space.
    # For example you may want to identify several "What's Related" items for weblog
    # articles, or find paragraphs that relate to each other in an essay.
    def find_related(doc, max_nearest = 3, &block)
      carry = proximity_array_for_content(doc, &block).reject { |pair| pair[0].eql?
        doc }
      result = carry.collect { |x| x[0] }
      result[0..max_nearest - 1]
    end

    # Return the most obvious category with the score
    def classify_with_score(doc, cutoff = 0.30, &block)
      scored_categories(doc, cutoff, &block).last
    end

    # Return the most obvious category without the score
    def classify(doc, cutoff = 0.30, &block)
      scored_categories(doc, cutoff, &block).last.first
    end

    # This function uses a voting system to categorize documents, based on
    # the categories of other documents. It uses the same logic as the
    # find_related function to find related documents, then returns the
    # list of sorted categories.
    #
    # cutoff signifies the fraction of indexed documents to consider when
    # classifying text. A cutoff of 1 means that every document in the index
    # votes on what category the document is in. This may not always make sense.
    #
    def scored_categories(doc, cutoff = 0.30, &block)
      icutoff = (@items.size * cutoff).round
      carry = proximity_array_for_content(doc, &block)
      carry = carry[0..icutoff - 1]
      votes = Hash.new(0.0)
      carry.each do |pair|
        @items[pair[0]].categories.each do |category|
          votes[category] += pair[1]
        end
      end

      votes.sort_by { |_, score| score }
    end

    # Prototype, only works on indexed documents.
    # I have no clue if this is going to work, but in theory
    # it's supposed to.
    def highest_ranked_stems(doc, count = 3)
      raise 'Requested stem ranking on non-indexed content!' unless @items[doc]
      content_vector_array = node_for_content(doc).lsi_vector.to_a
      top_n = content_vector_array.sort.reverse[0..count - 1]
      top_n.collect { |x| @word_list.word_for_index(content_vector_array.index(x)) }
    end

    def reset
      initialize(auto_rebuild: @auto_rebuild, cache_node_vectors: @cache_node_vectors)
    end

    private

    def build_reduced_matrix(matrix, cutoff = 0.75)
      # TODO: Check that M>=N on these dimensions! Transpose helps assure this
      u, v, s = matrix.SV_decomp
      # TODO: Better than 75% term, please.
      # :\
      s_cutoff = s.sort.reverse[(s.size * cutoff).round - 1]
      s.size.times do |ord|
        s[ord] = 0.0 if s[ord] < s_cutoff
      end
      # Reconstruct the term document matrix, only with reduced rank
      u * ($GSL ? GSL::Matrix : ::Matrix).diag(s) * v.trans
    end

    def node_for_content(item, &block)
      if @items[item]
        return @items[item]
      else
        clean_word_hash = Hasher.clean_word_hash((block ? block.call(item) : item.to_s), @language)

        content_node = ContentNode.new(clean_word_hash, &block) # make the node and extract the data

        unless needs_rebuild?
          content_node.raw_vector_with(@word_list) # make the lsi raw and norm vectors
        end
      end

      content_node
    end

    def make_word_list
      @word_list = WordList.new
      @items.each_value do |node|
        node.word_hash.each_key { |key| @word_list.add_word(key) }
      end
    end
  end
end

ruby-classifier-reborn-2.2.0/lib/classifier-reborn/lsi/cached_content_node.rb

# Author:: Kelley Reynolds (mailto:kelley@insidesystems.net)
# Copyright:: Copyright (c) 2015 Kelley Reynolds
# License:: LGPL

module ClassifierReborn
  # Subclass of ContentNode which caches the search_vector transpositions.
  # It's great because it's much faster for large indexes, but at the cost of more RAM. Additionally,
  # if you Marshal your classifier and want to keep the size down, you'll need to manually
  # clear the cache before you dump.
  class CachedContentNode < ContentNode
    module InstanceMethods
      # Go through each item in this index and clear the cache
      def clear_cache!
        @items.each_value(&:clear_cache!)
      end
    end

    def initialize(word_hash, *categories)
      clear_cache!
      super
    end

    def clear_cache!
      @transposed_search_vector = nil
    end

    # Cache the transposed vector, it gets used a lot
    def transposed_search_vector
      @transposed_search_vector ||= super
    end

    # Clear the cache before we continue on
    def raw_vector_with(word_list)
      clear_cache!
      super
    end

    # We don't want the cached_data here
    def marshal_dump
      [@lsi_vector, @lsi_norm, @raw_vector, @raw_norm, @categories, @word_hash]
    end

    def marshal_load(array)
      @lsi_vector, @lsi_norm, @raw_vector, @raw_norm, @categories, @word_hash = array
    end
  end
end

ruby-classifier-reborn-2.2.0/lib/classifier-reborn/lsi/content_node.rb

# Author:: David Fayram (mailto:dfayram@lensmen.net)
# Copyright:: Copyright (c) 2005 David Fayram II
# License:: LGPL

module ClassifierReborn
  # This is an internal data structure class for the LSI node. Save for
  # raw_vector_with, it should be fairly straightforward to understand.
  # You should never have to use it directly.
  class ContentNode
    attr_accessor :raw_vector, :raw_norm, :lsi_vector, :lsi_norm, :categories

    attr_reader :word_hash

    # If text_proc is not specified, the source will be duck-typed
    # via source.to_s
    def initialize(word_hash, *categories)
      @categories = categories || []
      @word_hash = word_hash
      @lsi_norm, @lsi_vector = nil
    end

    # Use this to fetch the appropriate search vector.
    def search_vector
      @lsi_vector || @raw_vector
    end

    # Method to access the transposed search vector
    def transposed_search_vector
      search_vector.col
    end

    # Use this to fetch the appropriate search vector in normalized form.
    def search_norm
      @lsi_norm || @raw_norm
    end

    # Creates the raw vector out of word_hash using word_list as the
    # key for mapping the vector space.
    def raw_vector_with(word_list)
      if $GSL
        vec = GSL::Vector.alloc(word_list.size)
      else
        vec = Array.new(word_list.size, 0)
      end

      @word_hash.each_key do |word|
        vec[word_list[word]] = @word_hash[word] if word_list[word]
      end

      # Perform the scaling transform and force floating point arithmetic
      if $GSL
        sum = 0.0
        vec.each { |v| sum += v }
        total_words = sum
      else
        total_words = vec.reduce(0, :+).to_f
      end

      total_unique_words = 0

      if $GSL
        vec.each { |word| total_unique_words += 1 if word != 0.0 }
      else
        total_unique_words = vec.count { |word| word != 0 }
      end

      # Perform first-order association transform if this vector has more
      # than one word in it.
      if total_words > 1.0 && total_unique_words > 1
        weighted_total = 0.0
        # Cache calculations, this takes too long on large indexes
        cached_calcs = Hash.new do |hash, term|
          hash[term] = ((term / total_words) * Math.log(term / total_words))
        end

        vec.each do |term|
          weighted_total += cached_calcs[term] if term > 0.0
        end

        # Cache calculations, this takes too long on large indexes
        cached_calcs = Hash.new do |hash, val|
          hash[val] = Math.log(val + 1) / -weighted_total
        end

        vec.collect! do |val|
          cached_calcs[val]
        end
      end

      if $GSL
        @raw_norm = vec.normalize
        @raw_vector = vec
      else
        @raw_norm = Vector[*vec].normalize
        @raw_vector = Vector[*vec]
      end
    end
  end
end

ruby-classifier-reborn-2.2.0/lib/classifier-reborn/lsi/summarizer.rb

# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
  module Summarizer
    module_function

    def summary(str, count = 10, separator = ' [...] ')
      perform_lsi split_sentences(str), count, separator
    end

    def paragraph_summary(str, count = 1, separator = ' [...]
')
      perform_lsi split_paragraphs(str), count, separator
    end

    def split_sentences(str)
      str.split(/(\.|\!|\?)/) # TODO: make this less primitive
    end

    def split_paragraphs(str)
      str.split(/(\n\n|\r\r|\r\n\r\n)/) # TODO: make this less primitive
    end

    def perform_lsi(chunks, count, separator)
      lsi = ClassifierReborn::LSI.new auto_rebuild: false
      chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
      lsi.build_index
      summaries = lsi.highest_relative_content count
      summaries.reject { |chunk| !summaries.include? chunk }.map(&:strip).join(separator)
    end
  end
end

ruby-classifier-reborn-2.2.0/lib/classifier-reborn/lsi/word_list.rb

# Author:: David Fayram (mailto:dfayram@lensmen.net)
# Copyright:: Copyright (c) 2005 David Fayram II
# License:: LGPL

module ClassifierReborn
  # This class keeps a word => index mapping. It is used to map stemmed words
  # to dimensions of a vector.
  class WordList
    def initialize
      @location_table = {}
    end

    # Adds a word (if it is new) and assigns it a unique dimension.
    def add_word(word)
      @location_table[word] = @location_table.size unless @location_table[word]
    end

    # Returns the dimension of the word or nil if the word is not in the space.
    def [](lookup)
      @location_table[lookup]
    end

    def word_for_index(ind)
      @location_table.invert[ind]
    end

    # Returns the number of words mapped.
    def size
      @location_table.size
    end
  end
end

ruby-classifier-reborn-2.2.0/lib/classifier-reborn/validators/classifier_validator.rb

module ClassifierReborn
  module ClassifierValidator
    module_function

    def cross_validate(classifier, sample_data, fold=10, *options)
      classifier = ClassifierReborn::const_get(classifier).new(options) if classifier.is_a?(String)
      sample_data.shuffle!
      partition_size = sample_data.length / fold
      partitioned_data = sample_data.each_slice(partition_size)
      conf_mats = []
      fold.times do |i|
        training_data = partitioned_data.take(fold)
        test_data = training_data.slice!(i)
        conf_mats << validate(classifier, training_data.flatten!(1), test_data)
      end
      classifier.reset()
      generate_report(conf_mats)
    end

    def validate(classifier, training_data, test_data, *options)
      classifier = ClassifierReborn::const_get(classifier).new(options) if classifier.is_a?(String)
      classifier.reset()
      training_data.each do |rec|
        classifier.train(rec.first, rec.last)
      end
      evaluate(classifier, test_data)
    end

    def evaluate(classifier, test_data)
      conf_mat = empty_conf_mat(classifier.categories.sort)
      test_data.each do |rec|
        actual = rec.first.tr('_', ' ').capitalize
        predicted = classifier.classify(rec.last)
        conf_mat[actual][predicted] += 1 unless predicted.nil?
      end
      conf_mat
    end

    def generate_report(*conf_mats)
      conf_mats.flatten!
      accumulated_conf_mat = conf_mats.length == 1 ?
        conf_mats.first : empty_conf_mat(conf_mats.first.keys.sort)
      header = "Run Total Correct Incorrect Accuracy"
      puts
      puts " Run Report ".center(header.length, "-")
      puts header
      puts "-" * header.length
      if conf_mats.length > 1
        conf_mats.each_with_index do |conf_mat, i|
          run_report = build_run_report(conf_mat)
          print_run_report(run_report, i+1)
          conf_mat.each do |actual, cols|
            cols.each do |predicted, v|
              accumulated_conf_mat[actual][predicted] += v
            end
          end
        end
        puts "-" * header.length
      end
      run_report = build_run_report(accumulated_conf_mat)
      print_run_report(run_report, "All")
      puts
      print_conf_mat(accumulated_conf_mat)
      puts
      conf_tab = conf_mat_to_tab(accumulated_conf_mat)
      print_conf_tab(conf_tab)
    end

    def build_run_report(conf_mat)
      correct = incorrect = 0
      conf_mat.each do |actual, cols|
        cols.each do |predicted, v|
          if actual == predicted
            correct += v
          else
            incorrect += v
          end
        end
      end
      total = correct + incorrect
      {total: total, correct: correct, incorrect: incorrect, accuracy: divide(correct, total)}
    end

    def conf_mat_to_tab(conf_mat)
      conf_tab = Hash.new {|h, k| h[k] = {p: {t: 0, f: 0}, n: {t: 0, f: 0}}}
      conf_mat.each_key do |positive|
        conf_mat.each do |actual, cols|
          cols.each do |predicted, v|
            conf_tab[positive][positive == predicted ? :p : :n][actual == predicted ?
              :t : :f] += v
          end
        end
      end
      conf_tab
    end

    def print_run_report(stats, prefix="", print_header=false)
      # prefix may be an Integer run number, so convert it before measuring its length
      puts "#{"Run".rjust([3, prefix.to_s.length].max)} Total Correct Incorrect Accuracy" if print_header
      puts "#{prefix.to_s.rjust(3)} #{stats[:total].to_s.rjust(9)} #{stats[:correct].to_s.rjust(9)} #{stats[:incorrect].to_s.rjust(9)} #{stats[:accuracy].round(5).to_s.ljust(7, '0').rjust(9)}"
    end

    def print_conf_mat(conf_mat)
      header = ["Predicted ->"] + conf_mat.keys + ["Total", "Recall"]
      cell_size = header.map(&:length).max
      header = header.map{|h| h.rjust(cell_size)}.join(" ")
      puts " Confusion Matrix ".center(header.length, "-")
      puts header
      puts "-" * header.length
      predicted_totals = conf_mat.keys.map{|predicted| [predicted, 0]}.to_h
      correct = 0
      conf_mat.each do |k, rec|
        actual_total = rec.values.reduce(:+)
        puts ([k.ljust(cell_size)] + rec.values.map{|v| v.to_s.rjust(cell_size)} + [actual_total.to_s.rjust(cell_size), divide(rec[k], actual_total).round(5).to_s.rjust(cell_size)]).join(" ")
        rec.each do |cat, val|
          predicted_totals[cat] += val
          correct += val if cat == k
        end
      end
      total = predicted_totals.values.reduce(:+)
      puts "-" * header.length
      puts (["Total".ljust(cell_size)] + predicted_totals.values.map{|v| v.to_s.rjust(cell_size)} + [total.to_s.rjust(cell_size), "".rjust(cell_size)]).join(" ")
      puts (["Precision".ljust(cell_size)] + predicted_totals.keys.map{|k| divide(conf_mat[k][k], predicted_totals[k]).round(5).to_s.rjust(cell_size)} + ["Accuracy ->".rjust(cell_size), divide(correct, total).round(5).to_s.rjust(cell_size)]).join(" ")
    end

    def print_conf_tab(conf_tab)
      conf_tab.each do |positive, tab|
        puts "# Positive class: #{positive}"
        derivations = conf_tab_derivations(tab)
        print_derivations(derivations)
        puts
      end
    end

    def conf_tab_derivations(tab)
      positives = tab[:p][:t] + tab[:n][:f]
      negatives = tab[:n][:t] + tab[:p][:f]
      total = positives + negatives
      {
        total_population: positives + negatives,
        condition_positive: positives,
        condition_negative: negatives,
        true_positive: tab[:p][:t],
        true_negative: tab[:n][:t],
        false_positive: tab[:p][:f],
        false_negative: tab[:n][:f],
        prevalence: divide(positives, total),
        specificity: divide(tab[:n][:t], negatives),
        recall: divide(tab[:p][:t], positives),
        precision: divide(tab[:p][:t], tab[:p][:t] + tab[:p][:f]),
        accuracy: divide(tab[:p][:t] + tab[:n][:t], total),
        f1_score: divide(2 * tab[:p][:t], 2 * tab[:p][:t] + tab[:p][:f] + tab[:n][:f])
      }
    end

    def print_derivations(derivations)
      max_len = derivations.keys.map(&:length).max
      derivations.each do |k, v|
        puts k.to_s.tr('_', ' ').capitalize.ljust(max_len) + " : " + v.to_s
      end
    end

    def empty_conf_mat(categories)
      categories.map{|actual| [actual, categories.map{|predicted| [predicted, 0]}.to_h]}.to_h
    end

    def divide(dividend, divisor)
      divisor.zero? ? 0.0 : dividend / divisor.to_f
    end
  end
end

ruby-classifier-reborn-2.2.0/lib/classifier-reborn/version.rb

module ClassifierReborn
  VERSION = '2.2.0'
end
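
# Illustrative sketch, not part of the gem: a worked example of the kind of
# metrics that conf_tab_derivations above derives from a 2x2 confusion table.
# The module name MetricsSketch and the sample counts are hypothetical; the
# formulas are meant to mirror the ones in ClassifierValidator, assuming
# tab[:p][:t] is a true positive, tab[:p][:f] a false positive,
# tab[:n][:t] a true negative and tab[:n][:f] a false negative.
module MetricsSketch
  module_function

  # Same zero-safe division idea as ClassifierValidator.divide.
  def divide(dividend, divisor)
    divisor.zero? ? 0.0 : dividend / divisor.to_f
  end

  # Derive a few common metrics from a table laid out like the
  # per-class entries built by conf_mat_to_tab.
  def derive(tab)
    tp = tab[:p][:t]
    fp = tab[:p][:f]
    tn = tab[:n][:t]
    fn = tab[:n][:f]
    {
      precision: divide(tp, tp + fp),
      recall: divide(tp, tp + fn),
      specificity: divide(tn, tn + fp),
      accuracy: divide(tp + tn, tp + fp + tn + fn),
      f1_score: divide(2 * tp, 2 * tp + fp + fn)
    }
  end
end

# Hypothetical counts: 6 true positives, 2 false positives,
# 10 true negatives, 2 false negatives.
sample_metrics = MetricsSketch.derive({ p: { t: 6, f: 2 }, n: { t: 10, f: 2 } })
# precision = 6 / 8, recall = 6 / 8, accuracy = 16 / 20, f1_score = 12 / 16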