pax_global_header00006660000000000000000000000064126152272220014513gustar00rootroot0000000000000052 comment=d7ecc55aa63c3f5281b88798b696a139a4c2733d classifier-reborn-2.0.4/000077500000000000000000000000001261522722200151275ustar00rootroot00000000000000classifier-reborn-2.0.4/.gitignore000066400000000000000000000000211261522722200171100ustar00rootroot00000000000000Gemfile.lock pkg classifier-reborn-2.0.4/.travis.yml000066400000000000000000000004401261522722200172360ustar00rootroot00000000000000language: ruby rvm: - 1.9.3 - 2.0 - 2.1 - 2.2 notifications: irc: on_success: change on_failure: change channels: - irc.freenode.org#jekyll template: - '%{repository}#%{build_number} %{message} %{build_url}' email: on_success: never on_failure: change classifier-reborn-2.0.4/Gemfile000066400000000000000000000000461261522722200164220ustar00rootroot00000000000000source 'https://rubygems.org' gemspec classifier-reborn-2.0.4/History.markdown000066400000000000000000000054501261522722200203400ustar00rootroot00000000000000## 2.0.4 / 2015-10-31 ### Major Enhancements * Classification thresholds can be enabled or disabled. The default is disabled. 
The threshold value can be set at initialization time or dynamically during processing (#47) * Made auto-categorization optional, defaulting to false (#45) * Added the ability to handle an array of classifications to the constructor (#44) * Classification with a threshold has been added to the api (#39) ### Minor Enhancements * Documentation around threshold usage (#54) * Fixed UTF-8 encoding for `hasher.rb` (#50) * Removed some unnecessary methods (#43) * Add optional `CachedContentNode` (GSL only) (#43) * Caches the transposed `search_vector` (#43) * Added custom marshal_ methods to not save the cache when dumping/loading (#43) * Optimized some numeric comparisons and iterators (#43) * Added cached calculation table when computing raw_vectors (#43) * If a category name is already a symbol, just return it (#45) * Various Hash improvements (#45) * Eliminated several Ruby :warning:s when run with RUBYOPT="-w" (#38) * Simple performance improvements for the Hasher process (#41) * Fixes for broken regex splitting for non-ascii characters and removal of the unused punctuation filter (#41) * Add multiple language stopwords with customizable stop word paths (#40) ### Bug Fixes * Fixed the bug where adding the same category a second time would clobber the category that was already there (#45) * Fixed deprecation warning for `<=>` in ls.rb (#33) * Remove references to Madeline in the README and replace it with Marshal or Redis (#32) ### Development Fixes * Added development dependency on `mini_test` and added 2.2 to travis.yml (#36) ## 2.0.3 / 2014-12-23 ### Bug Fixes * Handle `GSL::Vector`'s which don't have `#reduce` in `ContentNode#raw_vector_with` (#28) * Remove unnecessary `Vector` monkey-patch (#29) ## 2.0.2 / 2014-11-08 ### Minor Enhancements * Remove `Array#sum` monkey patch in favour of `#reduce(0, :+)` (#20) * Cache total word counts per category for speed (#4) ### Development Fixes * Add a test for `Bayes#untrain_*`. 
(#21) * Fix link to rb-gsl gem (#24) * Add helper scripts per Jekyll convention (#25) ## 2.0.1 / 2014-08-14 ### Bug Fixes * Replace `Object` monkey patch with `CategoryNamer` method (#18) * Count total unique words using methods supported by `Vector` and `GSL::Vector` (#11) ### Development Fixes * Remove `stats` rake task (#17) * Add some tests for `ClassifierReborn::WordList` (#15) ## 2.0.0 / 2014-08-13 ### Bug Fixes * Remove mathn dependency (#8) * Only perform first order transform if total UNIQUE words is greater than 1 (#3) * Update `LSI#remove_item` such that they will work with the `@items` hash. (#2) ### Development Fixes * Exclude Gemfile.lock in .gitignore (#7) classifier-reborn-2.0.4/LICENSE000066400000000000000000000544521261522722200161460ustar00rootroot00000000000000 GNU LESSER GENERAL PUBLIC LICENSE Version 2.1, February 1999 Copyright (C) 1991, 1999 Free Software Foundation, Inc. 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. [This is the first released version of the Lesser GPL. It also counts as the successor of the GNU Library Public License, version 2, hence the version number 2.1.] Preamble The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public Licenses are intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This license, the Lesser General Public License, applies to some specially designated software packages--typically libraries--of the Free Software Foundation and other authors who decide to use it. You can use it too, but we suggest you first think carefully about whether this license or the ordinary General Public License is the better strategy to use in any particular case, based on the explanations below. 
When we speak of free software, we are referring to freedom of use, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish); that you receive source code or can get it if you want it; that you can change the software and use pieces of it in new free programs; and that you are informed that you can do these things. To protect your rights, we need to make restrictions that forbid distributors to deny you these rights or to ask you to surrender these rights. These restrictions translate to certain responsibilities for you if you distribute copies of the library or if you modify it. For example, if you distribute copies of the library, whether gratis or for a fee, you must give the recipients all the rights that we gave you. You must make sure that they, too, receive or can get the source code. If you link other code with the library, you must provide complete object files to the recipients, so that they can relink them with the library after making changes to the library and recompiling it. And you must show them these terms so they know their rights. We protect your rights with a two-step method: (1) we copyright the library, and (2) we offer you this license, which gives you legal permission to copy, distribute and/or modify the library. To protect each distributor, we want to make it very clear that there is no warranty for the free library. Also, if the library is modified by someone else and passed on, the recipients should know that what they have is not the original version, so that the original author's reputation will not be affected by problems that might be introduced by others. Finally, software patents pose a constant threat to the existence of any free program. We wish to make sure that a company cannot effectively restrict the users of a free program by obtaining a restrictive license from a patent holder. 
Therefore, we insist that any patent license obtained for a version of the library must be consistent with the full freedom of use specified in this license. Most GNU software, including some libraries, is covered by the ordinary GNU General Public License. This license, the GNU Lesser General Public License, applies to certain designated libraries, and is quite different from the ordinary General Public License. We use this license for certain libraries in order to permit linking those libraries into non-free programs. When a program is linked with a library, whether statically or using a shared library, the combination of the two is legally speaking a combined work, a derivative of the original library. The ordinary General Public License therefore permits such linking only if the entire combination fits its criteria of freedom. The Lesser General Public License permits more lax criteria for linking other code with the library. We call this license the "Lesser" General Public License because it does Less to protect the user's freedom than the ordinary General Public License. It also provides other free software developers Less of an advantage over competing non-free programs. These disadvantages are the reason we use the ordinary General Public License for many libraries. However, the Lesser license provides advantages in certain special circumstances. For example, on rare occasions, there may be a special need to encourage the widest possible use of a certain library, so that it becomes a de-facto standard. To achieve this, non-free programs must be allowed to use the library. A more frequent case is that a free library does the same job as widely used non-free libraries. In this case, there is little to gain by limiting the free library to free software only, so we use the Lesser General Public License. In other cases, permission to use a particular library in non-free programs enables a greater number of people to use a large body of free software. 
For example, permission to use the GNU C Library in non-free programs enables many more people to use the whole GNU operating system, as well as its variant, the GNU/Linux operating system. Although the Lesser General Public License is Less protective of the users' freedom, it does ensure that the user of a program that is linked with the Library has the freedom and the wherewithal to run that program using a modified version of the Library. The precise terms and conditions for copying, distribution and modification follow. Pay close attention to the difference between a "work based on the library" and a "work that uses the library". The former contains code derived from the library, whereas the latter must be combined with the library in order to run. GNU LESSER GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License Agreement applies to any software library or other program which contains a notice placed by the copyright holder or other authorized party saying it may be distributed under the terms of this Lesser General Public License (also called "this License"). Each licensee is addressed as "you". A "library" means a collection of software functions and/or data prepared so as to be conveniently linked with application programs (which use some of those functions and data) to form executables. The "Library", below, refers to any such software library or work which has been distributed under these terms. A "work based on the Library" means either the Library or any derivative work under copyright law: that is to say, a work containing the Library or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language. (Hereinafter, translation is included without limitation in the term "modification".) "Source code" for a work means the preferred form of the work for making modifications to it. 
For a library, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the library. Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running a program using the Library is not restricted, and output from such a program is covered only if its contents constitute a work based on the Library (independent of the use of the Library in a tool for writing it). Whether that is true depends on what the Library does and what the program that uses the Library does. 1. You may copy and distribute verbatim copies of the Library's complete source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and distribute a copy of this License along with the Library. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Library or any portion of it, thus forming a work based on the Library, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) The modified work must itself be a software library. b) You must cause the files modified to carry prominent notices stating that you changed the files and the date of any change. c) You must cause the whole of the work to be licensed at no charge to all third parties under the terms of this License. 
d) If a facility in the modified Library refers to a function or a table of data to be supplied by an application program that uses the facility, other than as an argument passed when the facility is invoked, then you must make a good faith effort to ensure that, in the event an application does not supply such function or table, the facility still operates, and performs whatever part of its purpose remains meaningful. (For example, a function in a library to compute square roots has a purpose that is entirely well-defined independent of the application. Therefore, Subsection 2d requires that any application-supplied function or table used by this function must be optional: if the application does not supply it, the square root function must still compute square roots.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Library, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Library, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Library. In addition, mere aggregation of another work not based on the Library with the Library (or with a work based on the Library) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. 
You may opt to apply the terms of the ordinary GNU General Public License instead of this License to a given copy of the Library. To do this, you must alter all the notices that refer to this License, so that they refer to the ordinary GNU General Public License, version 2, instead of to this License. (If a newer version than version 2 of the ordinary GNU General Public License has appeared, then you can specify that version instead if you wish.) Do not make any other change in these notices. Once this change is made in a given copy, it is irreversible for that copy, so the ordinary GNU General Public License applies to all subsequent copies and derivative works made from that copy. This option is useful when you wish to copy part of the code of the Library into a program that is not a library. 4. You may copy and distribute the Library (or a portion or derivative of it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange. If distribution of object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place satisfies the requirement to distribute the source code, even though third parties are not compelled to copy the source along with the object code. 5. A program that contains no derivative of any portion of the Library, but is designed to work with the Library by being compiled or linked with it, is called a "work that uses the Library". Such a work, in isolation, is not a derivative work of the Library, and therefore falls outside the scope of this License. 
However, linking a "work that uses the Library" with the Library creates an executable that is a derivative of the Library (because it contains portions of the Library), rather than a "work that uses the library". The executable is therefore covered by this License. Section 6 states terms for distribution of such executables. When a "work that uses the Library" uses material from a header file that is part of the Library, the object code for the work may be a derivative work of the Library even though the source code is not. Whether this is true is especially significant if the work can be linked without the Library, or if the work is itself a library. The threshold for this to be true is not precisely defined by law. If such an object file uses only numerical parameters, data structure layouts and accessors, and small macros and small inline functions (ten lines or less in length), then the use of the object file is unrestricted, regardless of whether it is legally a derivative work. (Executables containing this object code plus portions of the Library will still fall under Section 6.) Otherwise, if the work is a derivative of the Library, you may distribute the object code for the work under the terms of Section 6. Any executables containing that work also fall under Section 6, whether or not they are linked directly with the Library itself. 6. As an exception to the Sections above, you may also combine or link a "work that uses the Library" with the Library to produce a work containing portions of the Library, and distribute that work under terms of your choice, provided that the terms permit modification of the work for the customer's own use and reverse engineering for debugging such modifications. You must give prominent notice with each copy of the work that the Library is used in it and that the Library and its use are covered by this License. You must supply a copy of this License. 
If the work during execution displays copyright notices, you must include the copyright notice for the Library among them, as well as a reference directing the user to the copy of this License. Also, you must do one of these things: a) Accompany the work with the complete corresponding machine-readable source code for the Library including whatever changes were used in the work (which must be distributed under Sections 1 and 2 above); and, if the work is an executable linked with the Library, with the complete machine-readable "work that uses the Library", as object code and/or source code, so that the user can modify the Library and then relink to produce a modified executable containing the modified Library. (It is understood that the user who changes the contents of definitions files in the Library will not necessarily be able to recompile the application to use the modified definitions.) b) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (1) uses at run time a copy of the library already present on the user's computer system, rather than copying library functions into the executable, and (2) will operate properly with a modified version of the library, if the user installs one, as long as the modified version is interface-compatible with the version that the work was made with. c) Accompany the work with a written offer, valid for at least three years, to give the same user the materials specified in Subsection 6a, above, for a charge no more than the cost of performing this distribution. d) If distribution of the work is made by offering access to copy from a designated place, offer equivalent access to copy the above specified materials from the same place. e) Verify that the user has already received a copy of these materials or that you have already sent this user a copy. 
For an executable, the required form of the "work that uses the Library" must include any data and utility programs needed for reproducing the executable from it. However, as a special exception, the materials to be distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. It may happen that this requirement contradicts the license restrictions of other proprietary libraries that do not normally accompany the operating system. Such a contradiction means you cannot use both them and the Library together in an executable that you distribute. 7. You may place library facilities that are a work based on the Library side-by-side in a single library together with other library facilities not covered by this License, and distribute such a combined library, provided that the separate distribution of the work based on the Library and of the other library facilities is otherwise permitted, and provided that you do these two things: a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities. This must be distributed under the terms of the Sections above. b) Give prominent notice with the combined library of the fact that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. 8. You may not copy, modify, sublicense, link with, or distribute the Library except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense, link with, or distribute the Library is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 9. 
You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Library or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Library (or any work based on the Library), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Library or works based on it. 10. Each time you redistribute the Library (or any work based on the Library), the recipient automatically receives a license from the original licensor to copy, distribute, link with or modify the Library subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties with this License. 11. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Library at all. For example, if a patent license would not permit royalty-free redistribution of the Library by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Library. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply, and the section as a whole is intended to apply in other circumstances. 
It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 12. If the distribution and/or use of the Library is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Library under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 13. The Free Software Foundation may publish revised and/or new versions of the Lesser General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Library specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Library does not specify a license version number, you may choose any version ever published by the Free Software Foundation. 14. 
If you wish to incorporate parts of the Library into other free programs whose distribution conditions are incompatible with these, write to the author to ask for permission. For software which is copyrighted byclassifier-reborn-2.0.4/README.markdown000066400000000000000000000221141261522722200176300ustar00rootroot00000000000000## Welcome to Classifier Reborn Classifier is a general module to allow Bayesian and other types of classifications. Classifier Reborn is a fork of cardmagic/classifier under more active development. ## Download Add this line to your application's Gemfile: gem 'classifier-reborn' And then execute: $ bundle Or install it yourself as: $ gem install classifier-reborn ## Dependencies The only runtime dependency you'll need to install is Roman Shterenzon's fast-stemmer gem: gem install fast-stemmer This should install automatically with RubyGems. If you would like to speed up LSI classification by at least 10x, please install the following libraries: * [GNU GSL](http://www.gnu.org/software/gsl) * [rb-gsl](https://rubygems.org/gems/rb-gsl) Notice that LSI will work without these libraries, but as soon as they are installed, Classifier will make use of them. No configuration changes are needed, we like to keep things ridiculously easy for you. ## Bayes A Bayesian classifier by Lucas Carlson. Bayesian Classifiers are accurate, fast, and have modest memory requirements. ### Usage ```ruby require 'classifier-reborn' classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting' classifier.train_interesting "here are some good words. 
I hope you love them"
classifier.train_uninteresting "here are some bad words, I hate you"
classifier.classify "I hate bad words and you" # returns 'Uninteresting'

classifier_snapshot = Marshal.dump classifier
# This is a string of bytes, you can persist it anywhere you like

# Marshal output is binary, so write and read the file in binary mode
File.open("classifier.dat", "wb") {|f| f.write(classifier_snapshot) }
# Or Redis.current.set "classifier", classifier_snapshot

# This is now saved to a file, and you can safely restart the application
data = File.binread("classifier.dat")
# Or data = Redis.current.get "classifier"
trained_classifier = Marshal.load data
trained_classifier.classify "I love" # returns 'Interesting'
```

Beyond the basic example, the constructor and trainer can be used in a more flexible way to accommodate non-trivial applications. Consider the following program:

```ruby
#!/usr/bin/env ruby
# classifier_reborn_demo.rb

require 'classifier-reborn'

training_set = DATA.read.split("\n")
categories   = training_set.shift.split(',').map{|c| c.strip}

# auto_categorize defaults to false; enabling it lets `train` add the
# 'dog' and 'cat' categories that are not listed on the first DATA line
classifier = ClassifierReborn::Bayes.new categories, auto_categorize: true

training_set.each do |a_line|
  next if a_line.empty? || '#' == a_line.strip[0]
  parts = a_line.strip.split(':')
  classifier.train(parts.first, parts.last)
end

puts classifier.classify "I hate bad words and you" #=> 'Uninteresting'
puts classifier.classify "I hate javascript" #=> 'Uninteresting'
puts classifier.classify "javascript is bad" #=> 'Uninteresting'

puts classifier.classify "all you need is ruby" #=> 'Interesting'
puts classifier.classify "i love ruby" #=> 'Interesting'

puts classifier.classify "which is better dogs or cats" #=> 'dog'
puts classifier.classify "what do I need to kill rats and mice" #=> 'cat'

__END__
Interesting, Uninteresting
interesting: here are some good words. I hope you love them
interesting: all you need is love
interesting: the love boat, soon we will be taking another ride
interesting: ruby don't take your love to town

uninteresting: here are some bad words, I hate you
uninteresting: bad bad leroy brown badest man in the darn town
uninteresting: the good the bad and the ugly
uninteresting: java, javascript, css front-end html
#
# train categories that were not pre-described
#
dog: dog days of summer
dog: a man's best friend is his dog
dog: a good hunting dog is a fine thing
dog: man my dogs are tired
dog: dogs are better than cats in soooo many ways

cat: the fuzz ball spilt the milk
cat: got rats or mice get a cat to kill them
cat: cats never come when you call them
cat: That dang cat keeps scratching the furniture
```

#### Knowing the Score

When you ask a Bayesian classifier to classify text against a set of trained categories, it does so by generating a score (as a Float) for each possible category. The higher the score, the closer the fit your text has with that category. The category with the highest score is returned as the best matching category.

In *ClassifierReborn* the methods *classifications* and *classify_with_score* give you access to the calculated scores. The method *classify* only returns the best matching category.

Knowing the score allows you to do some interesting things. For example, if your application is to generate tags for a blog post, you could use the *classifications* method to get a hash of the categories and their scores. You would sort on score and take only the top 3 or 4 categories as your tags for the blog post.

You could, within your application, establish the smallest acceptable score and only use those categories whose score is greater than or equal to your smallest acceptable score as your tags for the blog post.

But what if you only use the *classify* method? It does not show you the score of the best category. How do you know that the best category is really any good?
You can use the threshold. #### Using the Threshold Some applications can have only one category. The application wants to know if the text being classified is of that category or not. For example consider a list of normal free text responses to some question or maybe a URL string coming to your web application. You know what a normal response looks like; but, you have no idea how people might mis-use the response. So what you want to do is create a bayesian classifier that just has one category, for example 'Good' and you want to know wither your text is classified as Good or Not Good. Or suppose you just want the ability to have multiple categories and a 'None of the Above' as a possibility. ##### Threshold When you initialize the *ClassifierReborn::Bayes* classifier there are several options which can be set that control threshold processing. ```ruby b = ClassifierRebor::Bayes.new( 'good', # one or more categories enable_threshold: true, # default: false threshold: -10.0 # default: 0.0 ) b.train_good 'good stuff from Dobie Gillis' # ... text = 'bad junk from Maynard G. Krebs' result = b.classify text if result.nil? STDERR.puts "ALERT: This is not good: #{text}" let_loose_the_dogs_of_war! # method definition left to the reader end ``` In the *classify* method when the best category for the text has a score that is either less than the established threshold or is Float::INIFINITY, a nil category is returned. When you see a nil value returned from the *classify* method it means that none of the trained categories (regardless or how many categories were trained) has a score that is above or equal to the established threshold. #### Other Threshold-related Convience Methods ```ruby b.threshold # get the current threshold b.threshold = -10.0 # set the threshold b.threshold_enabled? # Boolean: is the threshold enabled? b.threshold_disabled? # Boolean: is the threshold disabled? 
b.enable_threshold     # enables threshold processing
b.disable_threshold    # disables threshold processing
```

Using these convenience methods, your applications can dynamically adjust threshold processing as required.

### Bayesian Classification

* https://en.wikipedia.org/wiki/Naive_Bayes_classifier
* http://www.process.com/precisemail/bayesian_filtering.htm
* http://en.wikipedia.org/wiki/Bayesian_filtering
* http://www.paulgraham.com/spam.html

## LSI

A Latent Semantic Indexer by David Fayram.

Latent Semantic Indexing engines are not as fast or as small as Bayesian classifiers, but are more flexible, providing fast search and clustering detection as well as semantic analysis of the text that theoretically simulates human learning.

### Usage

```ruby
require 'classifier-reborn'
lsi = ClassifierReborn::LSI.new
strings = [ ["This text deals with dogs. Dogs.", :dog],
            ["This text involves dogs too. Dogs! ", :dog],
            ["This text revolves around cats. Cats.", :cat],
            ["This text also involves cats. Cats!", :cat],
            ["This text involves birds. Birds.",:bird ]]
strings.each {|x| lsi.add_item x.first, x.last}

lsi.search("dog", 3)
# returns => ["This text deals with dogs. Dogs.", "This text involves dogs too. Dogs! ",
#             "This text also involves cats. Cats!"]

lsi.find_related(strings[2], 2)
# returns => ["This text revolves around cats. Cats.", "This text also involves cats. Cats!"]

lsi.classify "This text is also about dogs!"
# returns => :dog
```

Please see the ClassifierReborn::LSI documentation for more information. It is possible to index, search and classify with more than just simple strings.
### Latent Semantic Indexing

* http://www.c2.com/cgi/wiki?LatentSemanticIndexing
* http://www.chadfowler.com/index.cgi/Computing/LatentSemanticIndexing.rdoc
* http://en.wikipedia.org/wiki/Latent_semantic_analysis

## Authors

* Lucas Carlson (lucas@rufy.com)
* David Fayram II (dfayram@gmail.com)
* Cameron McBride (cameron.mcbride@gmail.com)
* Ivan Acosta-Rubio (ivan@softwarecriollo.com)
* Parker Moore (email@byparker.com)
* Chase Gilliam (chase.gilliam@gmail.com)

This library is released under the terms of the GNU LGPL. See LICENSE for more details.
classifier-reborn-2.0.4/Rakefile000066400000000000000000000013661261522722200166020ustar00rootroot00000000000000
require 'rubygems'
require 'rake'
require 'rake/testtask'
require 'rdoc/task'
require 'bundler/gem_tasks'

desc "Default Task"
task :default => [ :test ]

# Run the unit tests
desc "Run all unit tests"
Rake::TestTask.new("test") { |t|
  t.libs << "lib"
  t.pattern = 'test/*/*_test.rb'
  t.verbose = true
}

# Make a console, useful when working on tests
desc "Generate a test console"
task :console do
  verbose( false ) { sh "irb -I lib/ -r 'classifier-reborn'" }
end

# Generate the RDoc documentation
desc "Create documentation"
Rake::RDocTask.new("doc") { |rdoc|
  rdoc.title = "Ruby Classifier - Bayesian and LSI classification library"
  rdoc.rdoc_dir = 'html'
  rdoc.rdoc_files.include('README.markdown')
  rdoc.rdoc_files.include('lib/**/*.rb')
}
classifier-reborn-2.0.4/bin/000077500000000000000000000000001261522722200156775ustar00rootroot00000000000000
classifier-reborn-2.0.4/bin/bayes.rb000077500000000000000000000014701261522722200173340ustar00rootroot00000000000000
#!/usr/bin/env ruby

begin
  require 'rubygems'
  require 'classifier'
rescue
  require 'classifier'
end

require 'madeleine'

m = SnapshotMadeleine.new(File.expand_path("~/.bayes_data")) {
  ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
}

case ARGV[0]
when "add"
  case ARGV[1].downcase
  when "interesting"
    m.system.train_interesting File.open(ARGV[2]).read
    puts
"#{ARGV[2]} has been classified as interesting" when "uninteresting" m.system.train_uninteresting File.open(ARGV[2]).read puts "#{ARGV[2]} has been classified as uninteresting" else puts "Invalid category: choose between interesting and uninteresting" exit(1) end when "classify" puts m.system.classify(File.open(ARGV[1]).read) else puts "Invalid option: choose add [category] [file] or clasify [file]" exit(-1) end m.take_snapshot classifier-reborn-2.0.4/bin/summarize.rb000077500000000000000000000004261261522722200202450ustar00rootroot00000000000000#!/usr/bin/env ruby begin require 'rubygems' require 'classifier' rescue require 'classifier' end require 'open-uri' num = ARGV[1].to_i num = num < 1 ? 10 : num text = open(ARGV.first).read puts text.gsub(/<[^>]+>/,"").gsub(/[\s]+/," ").summary(num) classifier-reborn-2.0.4/classifier-reborn.gemspec000066400000000000000000000026131261522722200221070ustar00rootroot00000000000000# coding: utf-8 lib = File.expand_path('../lib', __FILE__) $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib) require 'classifier-reborn/version' Gem::Specification.new do |s| s.specification_version = 2 if s.respond_to? :specification_version= s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version= s.rubygems_version = '2.2.2' s.required_ruby_version = '>= 1.9.3' s.name = 'classifier-reborn' s.version = ClassifierReborn::VERSION s.license = 'LGPL' s.summary = "A general classifier module to allow Bayesian and other types of classifications." 
  s.authors = ["Lucas Carlson", "Parker Moore", "Chase Gilliam"]
  s.email = ["lucas@rufy.com", "parkrmoore@gmail.com", "chase.gilliam@gmail.com"]
  s.homepage = 'https://github.com/jekyll/classifier-reborn'

  all_files = `git ls-files -z`.split("\x0")
  s.files = all_files.grep(%r{^(bin|lib|data)/})
  s.executables = all_files.grep(%r{^bin/}) { |f| File.basename(f) }
  s.require_paths = ["lib"]

  s.has_rdoc = true
  s.rdoc_options = ["--charset=UTF-8"]
  s.extra_rdoc_files = %w[README.markdown LICENSE]

  s.add_runtime_dependency('fast-stemmer', '~> 1.0')
  s.add_development_dependency('rake')
  s.add_development_dependency('rdoc')
  s.add_development_dependency('test-unit')
end
classifier-reborn-2.0.4/data/000077500000000000000000000000001261522722200160405ustar00rootroot00000000000000
classifier-reborn-2.0.4/data/stopwords/000077500000000000000000000000001261522722200201045ustar00rootroot00000000000000
classifier-reborn-2.0.4/data/stopwords/ca000066400000000000000000000012661261522722200204170ustar00rootroot00000000000000
de es i a o un una unes uns un tot també altre algun alguna alguns algunes ser és soc ets som estic està estem esteu estan com en per perquè per que estat estava ans abans éssent ambdós però per poder potser puc podem podeu poden vaig va van fer faig fa fem feu fan cada fi inclòs primer des de conseguir consegueixo consigueix consigueixes conseguim consigueixen anar haver tenir tinc te tenim teniu tene el la les els seu aquí meu teu ells elles ens nosaltres vosaltres si dins sols solament saber saps sap sabem sabeu saben últim llarg bastant fas molts aquells aquelles seus llavors sota dalt ús molt era eres erem eren mode bé quant quan on mentre qui amb entre sense jo aquell
classifier-reborn-2.0.4/data/stopwords/cs000066400000000000000000000012111261522722200204270ustar00rootroot00000000000000
dnes cz timto budes budem byli jses muj svym ta tomto tohle tuto tyto jej zda proc mate tato kam tohoto kdo kteri mi nam tom tomuto mit nic proto kterou byla toho protoze asi ho nasi
napiste re coz tim takze svych jeji svymi jste aj tu tedy teto bylo kde ke prave ji nad nejsou ci pod tema mezi pres ty pak vam ani kdyz vsak ne jsem tento clanku clanky aby jsme pred pta jejich byl jeste az bez take pouze prvni vase ktera nas novy tipy pokud muze design strana jeho sve jine zpravy nove neni vas jen podle zde clanek uz email byt vice bude jiz nez ktery by ktere co nebo ten tak ma pri od po jsou jak dalsi ale si ve to jako za zpet ze do pro je naclassifier-reborn-2.0.4/data/stopwords/da000066400000000000000000000007431261522722200204170ustar00rootroot00000000000000af alle andet andre at begge da de den denne der deres det dette dig din dog du ej eller en end ene eneste enhver et fem fire flere fleste for fordi forrige fra få før god han hans har hendes her hun hvad hvem hver hvilken hvis hvor hvordan hvorfor hvornår i ikke ind ingen intet jeg jeres kan kom kommer lav lidt lille man mand mange med meget men mens mere mig ned ni nogen noget ny nyt nær næste næsten og op otte over på se seks ses som stor store syv ti til to tre ud varclassifier-reborn-2.0.4/data/stopwords/de000066400000000000000000000073301261522722200204220ustar00rootroot00000000000000a ab aber aber ach acht achte achten achter achtes ag alle allein allem allen aller allerdings alles allgemeinen als als also am an andere anderen andern anders au auch auch auf aus ausser außer ausserdem außerdem b bald bei beide beiden beim beispiel bekannt bereits besonders besser besten bin bis bisher bist c d da dabei dadurch dafür dagegen daher dahin dahinter damals damit danach daneben dank dann daran darauf daraus darf darfst darin darüber darum darunter das das dasein daselbst dass daß dasselbe davon davor dazu dazwischen de dein deine deinem deiner dem dementsprechend demgegenüber demgemäss demgemäß demselben demzufolge den denen denn denn denselben der deren derjenige derjenigen dermassen dermaßen derselbe derselben des deshalb desselben dessen deswegen d.h dich die diejenige diejenigen dies 
diese dieselbe dieselben diesem diesen dieser dieses dir doch dort drei drin dritte dritten dritter drittes du durch durchaus dürfen dürft durfte durften e eben ebenso ehrlich ei ei, ei, eigen eigene eigenen eigener eigenes ein einander eine einem einen einer eines einige einigen einiger einiges einmal einmal eins elf en ende endlich entweder entweder er Ernst erst erste ersten erster erstes es etwa etwas euch f früher fünf fünfte fünften fünfter fünftes für g gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt gesagt geschweige gewesen gewollt geworden gibt ging gleich gott gross groß grosse große grossen großen grosser großer grosses großes gut gute guter gutes h habe haben habt hast hat hatte hätte hatten hätten heisst her heute hier hin hinter hoch i ich ihm ihn ihnen ihr ihre ihrem ihren ihrer ihres im im immer in in indem infolgedessen ins irgend ist j ja ja jahr jahre jahren je jede jedem jeden jeder jedermann jedermanns jedoch jemand jemandem jemanden jene jenem jenen jener jenes jetzt k kam kann kannst kaum kein keine keinem keinen keiner kleine kleinen kleiner kleines kommen kommt können könnt konnte könnte konnten kurz l lang lange lange leicht leide lieber los m machen macht machte mag magst mahn man manche manchem manchen mancher manches mann mehr mein meine meinem meinen meiner meines mensch menschen mich mir mit mittel mochte möchte mochten mögen möglich mögt morgen muss muß müssen musst müsst musste mussten n na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter neuntes nicht nicht nichts nie niemand niemandem niemanden noch nun nun nur o ob ob oben oder oder offen oft oft ohne Ordnung p q r recht rechte rechten rechter rechtes richtig rund s sa sache sagt sagte sah satt schlecht Schluss schon sechs sechste sechsten sechster sechstes sehr sei sei seid seien sein seine seinem seinen seiner seines seit seitdem selbst selbst sich 
sie sieben siebente siebenten siebenter siebentes sind so solang solche solchem solchen solcher solches soll sollen sollte sollten sondern sonst sowie später statt t tag tage tagen tat teil tel tritt trotzdem tun u über überhaupt übrigens uhr um und und? uns unser unsere unserer unter v vergangenen viel viele vielem vielen vielleicht vier vierte vierten vierter viertes vom von vor w wahr? während währenddem währenddessen wann war wäre waren wart warum was wegen weil weit weiter weitere weiteren weiteres welche welchem welchen welcher welches wem wen wenig wenig wenige weniger weniges wenigstens wenn wenn wer werde werden werdet wessen wie wie wieder will willst wir wird wirklich wirst wo wohl wollen wollt wollte wollten worden wurde würde wurden würden x y z z.b zehn zehnte zehnten zehnter zehntes zeit zu zuerst zugleich zum zum zunächst zur zurück zusammen zwanzig zwar zwar zwei zweite zweiten zweiter zweites zwischen zwölfclassifier-reborn-2.0.4/data/stopwords/en000066400000000000000000000005521261522722200204330ustar00rootroot00000000000000a again all along are also an and as at but by came can cant couldnt did didn didnt do doesnt dont ever first from have her here him how i if in into is isnt it itll just last least like most my new no not now of on or should sinc so some th than this that the their then those to told too true try until url us were when whether while with within yes you youll classifier-reborn-2.0.4/data/stopwords/es000066400000000000000000000042501261522722200204370ustar00rootroot00000000000000él ésta éstas éste éstos última últimas último últimos a añadió aún actualmente adelante además afirmó agregó ahí ahora al algún algo alguna algunas alguno algunos alrededor ambos ante anterior antes apenas aproximadamente aquí así aseguró aunque ayer bajo bien buen buena buenas bueno buenos cómo cada casi cerca cierto cinco comentó como con conocer consideró considera contra cosas creo cual cuales cualquier cuando cuanto cuatro cuenta da dado dan dar 
de debe deben debido decir dejó del demás dentro desde después dice dicen dicho dieron diferente diferentes dijeron dijo dio donde dos durante e ejemplo el ella ellas ello ellos embargo en encuentra entonces entre era eran es esa esas ese eso esos está están esta estaba estaban estamos estar estará estas este esto estos estoy estuvo ex existe existen explicó expresó fin fue fuera fueron gran grandes ha había habían haber habrá hace hacen hacer hacerlo hacia haciendo han hasta hay haya he hecho hemos hicieron hizo hoy hubo igual incluso indicó informó junto la lado las le les llegó lleva llevar lo los luego lugar más manera manifestó mayor me mediante mejor mencionó menos mi mientras misma mismas mismo mismos momento mucha muchas mucho muchos muy nada nadie ni ningún ninguna ningunas ninguno ningunos no nos nosotras nosotros nuestra nuestras nuestro nuestros nueva nuevas nuevo nuevos nunca o ocho otra otras otro otros para parece parte partir pasada pasado pero pesar poca pocas poco pocos podemos podrá podrán podría podrían poner por porque posible próximo próximos primer primera primero primeros principalmente propia propias propio propios pudo pueda puede pueden pues qué que quedó queremos quién quien quienes quiere realizó realizado realizar respecto sí sólo se señaló sea sean según segunda segundo seis ser será serán sería si sido siempre siendo siete sigue siguiente sin sino sobre sola solamente solas solo solos son su sus tal también tampoco tan tanto tenía tendrá tendrán tenemos tener tenga tengo tenido tercera tiene tienen toda todas todavía todo todos total tras trata través tres tuvo un una unas uno unos usted va vamos van varias varios veces ver vez y ya yoclassifier-reborn-2.0.4/data/stopwords/fi000066400000000000000000000131231261522722200204250ustar00rootroot00000000000000aiemmin aika aikaa aikaan aikaisemmin aikaisin aikajen aikana aikoina aikoo aikovat aina ainakaan ainakin ainoa ainoat aiomme aion aiotte aist aivan ajan älä alas alemmas älköön 
alkuisin alkuun alla alle aloitamme aloitan aloitat aloitatte aloitattivat aloitettava aloitettevaksi aloitettu aloitimme aloitin aloitit aloititte aloittaa aloittamatta aloitti aloittivat alta aluksi alussa alusta annettavaksi annetteva annettu antaa antamatta antoi aoua apu asia asiaa asian asiasta asiat asioiden asioihin asioita asti avuksi avulla avun avutta edellä edelle edelleen edeltä edemmäs edes edessä edestä ehkä ei eikä eilen eivät eli ellei elleivät ellemme ellen ellet ellette emme en enää enemmän eniten ennen ensi ensimmäinen ensimmäiseksi ensimmäisen ensimmäisenä ensimmäiset ensimmäisiä ensimmäisiksi ensimmäisinä ensimmäistä ensin entinen entisen entisiä entistä entisten eräät eräiden eräs eri erittäin erityisesti esi esiin esillä esimerkiksi et eteen etenkin että ette ettei halua haluaa haluamatta haluamme haluan haluat haluatte haluavat halunnut halusi halusimme halusin halusit halusitte halusivat halutessa haluton hän häneen hänellä hänelle häneltä hänen hänessä hänestä hänet he hei heidän heihin heille heiltä heissä heistä heitä helposti heti hetkellä hieman huolimatta huomenna hyvä hyvää hyvät hyviä hyvien hyviin hyviksi hyville hyviltä hyvin hyvinä hyvissä hyvistä ihan ilman ilmeisesti itse itseään itsensä ja jää jälkeen jälleen jo johon joiden joihin joiksi joilla joille joilta joissa joista joita joka jokainen jokin joko joku jolla jolle jolloin jolta jompikumpi jonka jonkin jonne joo jopa jos joskus jossa josta jota jotain joten jotenkin jotenkuten jotka jotta jouduimme jouduin jouduit jouduitte joudumme joudun joudutte joukkoon joukossa joukosta joutua joutui joutuivat joutumaan joutuu joutuvat juuri kahdeksan kahdeksannen kahdella kahdelle kahdelta kahden kahdessa kahdesta kahta kahteen kai kaiken kaikille kaikilta kaikkea kaikki kaikkia kaikkiaan kaikkialla kaikkialle kaikkialta kaikkien kaikkin kaksi kannalta kannattaa kanssa kanssaan kanssamme kanssani kanssanne kanssasi kauan kauemmas kautta kehen keiden keihin keiksi keillä keille 
keiltä keinä keissä keistä keitä keittä keitten keneen keneksi kenellä kenelle keneltä kenen kenenä kenessä kenestä kenet kenettä kennessästä kerran kerta kertaa kesken keskimäärin ketä ketkä kiitos kohti koko kokonaan kolmas kolme kolmen kolmesti koska koskaan kovin kuin kuinka kuitenkaan kuitenkin kuka kukaan kukin kumpainen kumpainenkaan kumpi kumpikaan kumpikin kun kuten kuuden kuusi kuutta kyllä kymmenen kyse lähekkäin lähellä lähelle läheltä lähemmäs lähes lähinnä lähtien läpi liian liki lisää lisäksi luo mahdollisimman mahdollista me meidän meillä meille melkein melko menee meneet menemme menen menet menette menevät meni menimme menin menit menivät mennessä mennyt menossa mihin mikä mikään mikäli mikin miksi milloin minä minne minun minut missä mistä mitä mitään miten moi molemmat mones monesti monet moni moniaalla moniaalle moniaalta monta muassa muiden muita muka mukaan mukaansa mukana mutta muu muualla muualle muualta muuanne muulloin muun muut muuta muutama muutaman muuten myöhemmin myös myöskään myöskin myötä näiden näin näissä näissähin näissälle näissältä näissästä näitä nämä ne neljä neljää neljän niiden niin niistä niitä noin nopeammin nopeasti nopeiten nro nuo nyt ohi oikein ole olemme olen olet olette oleva olevan olevat oli olimme olin olisi olisimme olisin olisit olisitte olisivat olit olitte olivat olla olleet olli ollut oma omaa omaan omaksi omalle omalta oman omassa omat omia omien omiin omiksi omille omilta omissa omista on onkin onko ovat päälle paikoittain paitsi pakosti paljon paremmin parempi parhaillaan parhaiten peräti perusteella pian pieneen pieneksi pienellä pienelle pieneltä pienempi pienestä pieni pienin puolesta puolestaan runsaasti saakka sadam sama samaa samaan samalla samallalta samallassa samallasta saman samat samoin sata sataa satojen se seitsemän sekä sen seuraavat siellä sieltä siihen siinä siis siitä sijaan siksi sillä silloin silti sinä sinne sinua sinulle sinulta sinun sinussa sinusta sinut sisäkkäin sisällä sitä siten 
sitten suoraan suuntaan suuren suuret suuri suuria suurin suurten taa täällä täältä taas taemmas tähän tahansa tai takaa takaisin takana takia tällä tällöin tämä tämän tänä tänään tänne tapauksessa tässä tästä tätä täten tavalla tavoitteena täysin täytyvät täytyy te tietysti todella toinen toisaalla toisaalle toisaalta toiseen toiseksi toisella toiselle toiselta toisemme toisen toisensa toisessa toisesta toista toistaiseksi toki tosin tuhannen tuhat tule tulee tulemme tulen tulet tulette tulevat tulimme tulin tulisi tulisimme tulisin tulisit tulisitte tulisivat tulit tulitte tulivat tulla tulleet tullut tuntuu tuo tuolla tuolloin tuolta tuonne tuskin tykö usea useasti useimmiten usein useita uudeksi uudelleen uuden uudet uusi uusia uusien uusinta uuteen uutta vaan vähän vähemmän vähintään vähiten vai vaiheessa vaikea vaikean vaikeat vaikeilla vaikeille vaikeilta vaikeissa vaikeista vaikka vain välillä varmasti varsin varsinkin varten vasta vastaan vastakkain verran vielä vierekkäin vieri viiden viime viimeinen viimeisen viimeksi viisi voi voidaan voimme voin voisi voit voitte voivat vuoden vuoksi vuosi vuosien vuosina vuotta yhä yhdeksän yhden yhdessä yhtä yhtäällä yhtäälle yhtäältä yhtään yhteen yhteensä yhteydessä yhteyteen yksi yksin yksittäin yleensä ylemmäs yli ylös ympäriclassifier-reborn-2.0.4/data/stopwords/fr000066400000000000000000000054461261522722200204470ustar00rootroot00000000000000a à â abord afin ah ai aie ainsi allaient allo allô allons après assez attendu au aucun aucune aujourd aujourd'hui auquel aura auront aussi autre autres aux auxquelles auxquels avaient avais avait avant avec avoir ayant b bah beaucoup bien bigre boum bravo brrr c ça car ce ceci cela celle celle-ci celle-là celles celles-ci celles-là celui celui-ci celui-là cent cependant certain certaine certaines certains certes ces cet cette ceux ceux-ci ceux-là chacun chaque cher chère chères chers chez chiche chut ci cinq cinquantaine cinquante cinquantième cinquième clac clic combien 
comme comment compris concernant contre couic crac d da dans de debout dedans dehors delà depuis derrière des dès désormais desquelles desquels dessous dessus deux deuxième deuxièmement devant devers devra différent différente différentes différents dire divers diverse diverses dix dix-huit dixième dix-neuf dix-sept doit doivent donc dont douze douzième dring du duquel durant e effet eh elle elle-même elles elles-mêmes en encore entre envers environ es ès est et etant étaient étais était étant etc été etre être eu euh eux eux-mêmes excepté f façon fais faisaient faisant fait feront fi flac floc font g gens h ha hé hein hélas hem hep hi ho holà hop hormis hors hou houp hue hui huit huitième hum hurrah i il ils importe j je jusqu jusque k l la là laquelle las le lequel les lès lesquelles lesquels leur leurs longtemps lorsque lui lui-même m ma maint mais malgré me même mêmes merci mes mien mienne miennes miens mille mince moi moi-même moins mon moyennant n na ne néanmoins neuf neuvième ni nombreuses nombreux non nos notre nôtre nôtres nous nous-mêmes nul o o| ô oh ohé olé ollé on ont onze onzième ore ou où ouf ouias oust ouste outre p paf pan par parmi partant particulier particulière particulièrement pas passé pendant personne peu peut peuvent peux pff pfft pfut pif plein plouf plus plusieurs plutôt pouah pour pourquoi premier première premièrement près proche psitt puisque q qu quand quant quanta quant-à-soi quarante quatorze quatre quatre-vingt quatrième quatrièmement que quel quelconque quelle quelles quelque quelques quelqu'un quels qui quiconque quinze quoi quoique r revoici revoilà rien s sa sacrebleu sans sapristi sauf se seize selon sept septième sera seront ses si sien sienne siennes siens sinon six sixième soi soi-même soit soixante son sont sous stop suis suivant sur surtout t ta tac tant te té tel telle tellement telles tels tenant tes tic tien tienne tiennes tiens toc toi toi-même ton touchant toujours tous tout toute toutes treize trente très trois 
troisième troisièmement trop tsoin tsouin tu u un une unes uns v va vais vas vé vers via vif vifs vingt vivat vive vives vlan voici voilà vont vos votre vôtre vôtres vous vous-mêmes vu w x y z zutclassifier-reborn-2.0.4/data/stopwords/hu000066400000000000000000000002171261522722200204430ustar00rootroot00000000000000a az egy be ki le fel meg el át rá ide oda szét össze vissza de hát és vagy hogy van lesz volt csak nem igen mint én te õ mi ti õk önclassifier-reborn-2.0.4/data/stopwords/it000066400000000000000000000050531261522722200204460ustar00rootroot00000000000000a abbastanza abbiomo accidenti ad adesso affinche affinché agli ahime ahimè ai al alcuna alcuni alcuno all alla alle allo altri altrimenti altro altrui anche ancora anni anno ansa assai attesa avanti avendo avente aver avere avete aveva avuta avute avuti avuto basta bene benissimo berlusconi brava bravo c casa caso cento certa certe certi certo che chi chicchessia chinque chiunque ci ciascuna ciascuno cima cio ciò cioe cioè circa citta città codesta codeste codesti codesto cogli coi col colei coll coloro colui come con concernente consiglio contro cortesia cos cosa cosi cosí così cui d da dagli dai dal dall dalla dalle dallo davanti degli dei del dell della delle dello dentro detto deve di dice dietro dila dire dirimpetto dopo dove dovra dovrà due dunque durante e è ecco ed egli ella eppure era erano esse essendo esser essere essi ex fa fare fatto favore fin finalmente finche finché fine fino forse fra frattanto fuori gia già giacche giacché giorni giorno gli gliela gliele glieli glielo gliene governo grande grazie gruppo ha hai hanno ho i ieri il improvviso in infatti insieme intanto intorno invece invere io l la là lavoro le lei li lo lontano loro lui lungo ma macche macché magari mai male malgrado malissimo me medesimo mediante meglio meno mentre mesi mezzo mi mia mie miei mieri mila miliardi milioni ministro mio moltissimo molto mondo nazionale ne né negli nei nel nell nella nelle nello nemmeno 
neppure nessuna nessuno niente no noi non nondimeno nondimento nostra nostre nostri nostro nulla nuovo o od oggi ogni ognuna ognuno oltre oppure ora ore osi osì ossia paese parecchi parecchie parecchio parte partendo peccato peggio per perche perché perchè percio perciò perfino pero però perque perqué persone piedi pieno piglia piu piú più po pochissimo poco poi poiche poiché press prima primo proprio puo può pure purtroppo qualche qualcuna qualcuno quale quali qualsiani qualunque quando quanta quante quanti quanto quantunque quasi quattro quel quella quelli quello quest questa queste questi questo qui quindi riecco rieccò saltro salvo sara sarà sarebbe scopo scorso se sé secondo seguente sei sempre senza si sí sia siamo siete solito solo sono soppra sopra sotto sta staranno stata state stati stato stesso stresso su sua successivo sue sugli sui sul sull sulla sulle sullo suo suoi tale talvolta tanto te tempo ti torino tra tranne trannefino tre troppo tu tua tue tuo tuoi tutta tuttavia tutte tutti tutto uguali un una uno uomo uori va vale varia varie vario verso vi via vicino vise visé visto vita voi volta vostra vostre vostri vostro classifier-reborn-2.0.4/data/stopwords/nl000066400000000000000000000002601261522722200204360ustar00rootroot00000000000000aan af al als bij dan dat die dit een en er had heb hem het hij hoe hun ik in is je kan me men met mij nog nu of ons ook te tot uit van was wat we wel wij zal ze zei zij zo zouclassifier-reborn-2.0.4/data/stopwords/no000066400000000000000000000011201261522722200204350ustar00rootroot00000000000000alle andre arbeid av begge bort bra bruke da denne der deres det din disse du eller en ene eneste enhver enn er et folk for fordi forsÛke fra fÅ fÛr fÛrst gjorde gjÛre god gÅ ha hadde han hans hennes her hva hvem hver hvilken hvis hvor hvordan hvorfor i ikke inn innen kan kunne lage lang lik like makt mange med meg meget men mens mer mest min mye mÅ mÅte navn nei ny nÅ nÅr og ogsÅ om opp oss over part punkt pÅ rett riktig 
samme sant si siden sist skulle slik slutt som start stille sÅ tid til tilbake tilstand under ut uten var ved verdi vi vil ville vite vÅr vÖre vÖrt Åclassifier-reborn-2.0.4/data/stopwords/pl000066400000000000000000000007361261522722200204500ustar00rootroot00000000000000a aby ale bardziej bardzo bez bo bowiem bêdzie co czy czyli dla dlatego do gdy gdzie go i ich im innych jak jako jednak jego jej jest jeszcze kiedy kilka która które którego której który których którym którzy lub ma mi miêdzy mnie na nad nam nas naszego naszych nawet nich nie nim o od oraz po pod poza przed przede przez przy siê sobie swoje ta tak takie tam te tego tej ten to tu tych tylko tym u w we wiele wielu wiêc wszystkich wszystkim wszystko z za zawsze ze classifier-reborn-2.0.4/data/stopwords/pt000066400000000000000000000040461261522722200204560ustar00rootroot00000000000000a à adeus agora aí ainda além algo algumas alguns ali ano anos antes ao aos apenas apoio após aquela aquelas aquele aqueles aqui aquilo área as às assim até atrás através baixo bastante bem bom breve cá cada catorze cedo cento certamente certeza cima cinco coisa com como conselho contra custa da dá dão daquela daquele dar das de debaixo demais dentro depois desde dessa desse desta deste deve deverá dez dezanove dezasseis dezassete dezoito dia diante diz dizem dizer do dois dos doze duas dúvida e é ela elas ele eles em embora entre era és essa essas esse esses esta está estar estas estás estava este estes esteve estive estivemos estiveram estiveste estivestes estou eu exemplo faço falta favor faz fazeis fazem fazemos fazer fazes fez fim final foi fomos for foram forma foste fostes fui geral grande grandes grupo há hoje horas isso isto já lá lado local logo longe lugar maior maioria mais mal mas máximo me meio menor menos mês meses meu meus mil minha minhas momento muito muitos na nada não naquela naquele nas nem nenhuma nessa nesse nesta neste nível no noite nome nos nós nossa nossas nosso nossos nova nove novo novos num numa 
número nunca o obra obrigada obrigado oitava oitavo oito onde ontem onze os ou outra outras outro outros para parece parte partir pela pelas pelo pelos perto pode pôde podem poder põe põem ponto pontos por porque porquê posição possível possivelmente posso pouca pouco primeira primeiro próprio próximo puderam qual quando quanto quarta quarto quatro que quê quem quer quero questão quinta quinto quinze relação sabe são se segunda segundo sei seis sem sempre ser seria sete sétima sétimo seu seus sexta sexto sim sistema sob sobre sois somos sou sua suas tal talvez também tanto tão tarde te tem têm temos tendes tenho tens ter terceira terceiro teu teus teve tive tivemos tiveram tiveste tivestes toda todas todo todos trabalho três treze tu tua tuas tudo um uma umas uns vai vais vão vários vem vêm vens ver vez vezes viagem vindo vinte você vocês vos vós vossa vossas vosso vossos zeroclassifier-reborn-2.0.4/data/stopwords/se000066400000000000000000000046671261522722200204530ustar00rootroot00000000000000aderton adertonde adjö aldrig alla allas allt alltid alltså än andra andras annan annat ännu artonde artonn åtminstone att åtta åttio åttionde åttonde av även båda bådas bakom bara bäst bättre behöva behövas behövde behövt beslut beslutat beslutit bland blev bli blir blivit bort borta bra då dag dagar dagarna dagen där därför de del delen dem den deras dess det detta dig din dina dit ditt dock du efter eftersom elfte eller elva en enkel enkelt enkla enligt er era ert ett ettusen få fanns får fått fem femte femtio femtionde femton femtonde fick fin finnas finns fjärde fjorton fjortonde fler flera flesta följande för före förlåt förra första fram framför från fyra fyrtio fyrtionde gå gälla gäller gällt går gärna gått genast genom gick gjorde gjort god goda godare godast gör göra gott ha hade haft han hans har här heller hellre helst helt henne hennes hit hög höger högre högst hon honom hundra hundraen hundraett hur i ibland idag igår igen imorgon in inför inga ingen ingenting 
inget innan inne inom inte inuti ja jag jämfört kan kanske knappast kom komma kommer kommit kr kunde kunna kunnat kvar länge längre långsam långsammare långsammast långsamt längst långt lätt lättare lättast legat ligga ligger lika likställd likställda lilla lite liten litet man många måste med mellan men mer mera mest mig min mina mindre minst mitt mittemot möjlig möjligen möjligt möjligtvis mot mycket någon någonting något några när nästa ned nederst nedersta nedre nej ner ni nio nionde nittio nittionde nitton nittonde nödvändig nödvändiga nödvändigt nödvändigtvis nog noll nr nu nummer och också ofta oftast olika olikt om oss över övermorgon överst övre på rakt rätt redan så sade säga säger sagt samma sämre sämst sedan senare senast sent sex sextio sextionde sexton sextonde sig sin sina sist sista siste sitt sjätte sju sjunde sjuttio sjuttionde sjutton sjuttonde ska skall skulle slutligen små smått snart som stor stora större störst stort tack tidig tidigare tidigast tidigt till tills tillsammans tio tionde tjugo tjugoen tjugoett tjugonde tjugotre tjugotvå tjungo tolfte tolv tre tredje trettio trettionde tretton trettonde två tvåhundra under upp ur ursäkt ut utan utanför ute vad vänster vänstra var vår vara våra varför varifrån varit varken värre varsågod vart vårt vem vems verkligen vi vid vidare viktig viktigare viktigast viktigt vilka vilken vilket villclassifier-reborn-2.0.4/data/stopwords/tr000066400000000000000000000011751261522722200204600ustar00rootroot00000000000000acaba altmýþ altý ama bana bazý belki ben benden beni benim beþ bin bir biri birkaç birkez birþey birþeyi biz bizden bizi bizim bu buna bunda bundan bunu bunun da daha dahi de defa diye doksan dokuz dört elli en gibi hem hep hepsi her hiç iki ile INSERmi ise için katrilyon kez ki kim kimden kime kimi kýrk milyar milyon mu mü mý nasýl ne neden nerde nerede nereye niye niçin on ona ondan onlar onlardan onlari onlarýn onu otuz sanki sekiz seksen sen senden seni senin siz sizden sizi sizin trilyon 
tüm ve veya ya yani yedi yetmiş yirmi yüz çok çünkü üç şey şeyden şeyi şeyler şu şuna şunda şundan şunuclassifier-reborn-2.0.4/lib/000077500000000000000000000000001261522722200156755ustar00rootroot00000000000000classifier-reborn-2.0.4/lib/classifier-reborn.rb000066400000000000000000000025401261522722200216340ustar00rootroot00000000000000#-- # Copyright (c) 2005 Lucas Carlson # # Permission is hereby granted, free of charge, to any person obtaining # a copy of this software and associated documentation files (the # "Software"), to deal in the Software without restriction, including # without limitation the rights to use, copy, modify, merge, publish, # distribute, sublicense, and/or sell copies of the Software, and to # permit persons to whom the Software is furnished to do so, subject to # the following conditions: # # The above copyright notice and this permission notice shall be # included in all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE # LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION # WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
#++ # Author:: Lucas Carlson (mailto:lucas@rufy.com) # Copyright:: Copyright (c) 2005 Lucas Carlson # License:: LGPL require 'rubygems' require_relative 'classifier-reborn/category_namer' require_relative 'classifier-reborn/bayes' require_relative 'classifier-reborn/lsi'classifier-reborn-2.0.4/lib/classifier-reborn/000077500000000000000000000000001261522722200213065ustar00rootroot00000000000000classifier-reborn-2.0.4/lib/classifier-reborn/bayes.rb000066400000000000000000000160461261522722200227450ustar00rootroot00000000000000# Author:: Lucas Carlson (mailto:lucas@rufy.com) # Copyright:: Copyright (c) 2005 Lucas Carlson # License:: LGPL require_relative 'category_namer' module ClassifierReborn class Bayes CategoryNotFoundError = Class.new(StandardError) # The class can be created with one or more categories, each of which will be # initialized and given a training method. E.g., # b = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting', 'Spam' # # Options available are: # language: 'en' Used to select language-specific stop words # auto_categorize: false When true, enables ability to dynamically declare a category # enable_threshold: false When true, enables a threshold requirement for classification # threshold: 0.0 Default threshold, only used when enabled def initialize(*args) @categories = Hash.new options = { language: 'en', auto_categorize: false, enable_threshold: false, threshold: 0.0 } args.flatten.each { |arg| if arg.kind_of?(Hash) options.merge!(arg) else add_category(arg) end } @total_words = 0 @category_counts = Hash.new(0) @category_word_count = Hash.new(0) @language = options[:language] @auto_categorize = options[:auto_categorize] @enable_threshold = options[:enable_threshold] @threshold = options[:threshold] end # Provides a general training method for all categories specified in Bayes#new # For example: # b = ClassifierReborn::Bayes.new 'This', 'That', 'the_other' # b.train :This, "This text" # b.train "that", "That text" # b.train "The other", 
"The other text" def train(category, text) category = CategoryNamer.prepare_name(category) # Add the category dynamically or raise an error if !@categories.has_key?(category) if @auto_categorize add_category(category) else raise CategoryNotFoundError.new("Cannot train; category #{category} does not exist") end end @category_counts[category] += 1 Hasher.word_hash(text, @language).each do |word, count| @categories[category][word] += count @category_word_count[category] += count @total_words += count end end # Provides a untraining method for all categories specified in Bayes#new # Be very careful with this method. # # For example: # b = ClassifierReborn::Bayes.new 'This', 'That', 'the_other' # b.train :this, "This text" # b.untrain :this, "This text" def untrain(category, text) category = CategoryNamer.prepare_name(category) @category_counts[category] -= 1 Hasher.word_hash(text, @language).each do |word, count| if @total_words >= 0 orig = @categories[category][word] || 0 @categories[category][word] -= count if @categories[category][word] <= 0 @categories[category].delete(word) count = orig end if @category_word_count[category] >= count @category_word_count[category] -= count end @total_words -= count end end end # Returns the scores in each category the provided +text+. E.g., # b.classifications "I hate bad words and you" # => {"Uninteresting"=>-12.6997928013932, "Interesting"=>-18.4206807439524} # The largest of these scores (the one closest to 0) is the one picked out by #classify def classifications(text) score = Hash.new word_hash = Hasher.word_hash(text, @language) training_count = @category_counts.values.reduce(:+).to_f @categories.each do |category, category_words| score[category.to_s] = 0 total = (@category_word_count[category] || 1).to_f word_hash.each do |word, count| s = category_words.has_key?(word) ? 
category_words[word] : 0.1 score[category.to_s] += Math.log(s/total) end # now add prior probability for the category s = @category_counts.has_key?(category) ? @category_counts[category] : 0.1 score[category.to_s] += Math.log(s / training_count) end return score end # Returns the classification of the provided +text+, which is one of the # categories given in the initializer along with the score. E.g., # b.classify "I hate bad words and you" # => ['Uninteresting', -4.852030263919617] def classify_with_score(text) (classifications(text).sort_by { |a| -a[1] })[0] end # Return the classification without the score def classify(text) result, score = classify_with_score(text) if threshold_enabled? result = nil if score < @threshold || score == Float::INFINITY end return result end # Retrieve the current threshold value def threshold @threshold end # Dynamically set the threshold value def threshold=(a_float) @threshold = a_float end # Dynamically enable threshold for classify results def enable_threshold @enable_threshold = true end # Dynamically disable threshold for classify results def disable_threshold @enable_threshold = false end # Is threshold processing enabled? def threshold_enabled? @enable_threshold end # is threshold processing disabled? def threshold_disabled? !@enable_threshold end # Provides training and untraining methods for the categories specified in Bayes#new # For example: # b = ClassifierReborn::Bayes.new 'This', 'That', 'the_other' # b.train_this "This text" # b.train_that "That text" # b.untrain_that "That text" # b.train_the_other "The other text" def method_missing(name, *args) cleaned_name = name.to_s.gsub(/(un)?train_([\w]+)/, '\2') category = CategoryNamer.prepare_name(cleaned_name) if @categories.has_key? 
category args.each { |text| eval("#{$1}train(category, text)") } elsif name.to_s =~ /(un)?train_([\w]+)/ raise StandardError, "No such category: #{category}" else super #raise StandardError, "No such method: #{name}" end end # Provides a list of category names # For example: # b.categories # => ['This', 'That', 'the_other'] def categories # :nodoc: @categories.keys.collect {|c| c.to_s} end # Allows you to add categories to the classifier. # For example: # b.add_category "Not spam" # # WARNING: Adding categories to a trained classifier will # result in an undertrained category that will tend to match # more criteria than the trained selective categories. In short, # try to initialize your categories at initialization. def add_category(category) @categories[CategoryNamer.prepare_name(category)] ||= Hash.new(0) end alias append_category add_category end end classifier-reborn-2.0.4/lib/classifier-reborn/category_namer.rb000066400000000000000000000006041261522722200246320ustar00rootroot00000000000000# Author:: Lucas Carlson (mailto:lucas@rufy.com) # Copyright:: Copyright (c) 2005 Lucas Carlson # License:: LGPL require 'fast_stemmer' require 'classifier-reborn/extensions/hasher' module ClassifierReborn module CategoryNamer extend self def prepare_name(name) return name if name.is_a?(Symbol) name.to_s.gsub("_"," ").capitalize.intern end end end classifier-reborn-2.0.4/lib/classifier-reborn/extensions/000077500000000000000000000000001261522722200235055ustar00rootroot00000000000000classifier-reborn-2.0.4/lib/classifier-reborn/extensions/hasher.rb000066400000000000000000000031701261522722200253050ustar00rootroot00000000000000# encoding: utf-8 # Author:: Lucas Carlson (mailto:lucas@rufy.com) # Copyright:: Copyright (c) 2005 Lucas Carlson # License:: LGPL require 'set' module ClassifierReborn module Hasher STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../data/stopwords')] extend self # Return a Hash of strings => ints. 
Each word in the string is stemmed, # interned, and indexes to its frequency in the document. def word_hash(str, language = 'en') cleaned_word_hash = clean_word_hash(str, language) symbol_hash = word_hash_for_symbols(str.scan(/[^\s\p{WORD}]/)) return cleaned_word_hash.merge(symbol_hash) end # Return a word hash without extra punctuation or short symbols, just stemmed words def clean_word_hash(str, language = 'en') word_hash_for_words str.gsub(/[^\p{WORD}\s]/,'').downcase.split, language end def word_hash_for_words(words, language = 'en') d = Hash.new(0) words.each do |word| if word.length > 2 && !STOPWORDS[language].include?(word) d[word.stem.intern] += 1 end end return d end def word_hash_for_symbols(words) d = Hash.new(0) words.each do |word| d[word.intern] += 1 end return d end # Create a lazily-loaded hash of stopword data STOPWORDS = Hash.new do |hash, language| hash[language] = [] STOPWORDS_PATH.each do |path| if File.exist?(File.join(path, language)) hash[language] = Set.new File.read(File.join(path, language.to_s)).split break end end hash[language] end end end classifier-reborn-2.0.4/lib/classifier-reborn/extensions/vector.rb000066400000000000000000000035741261522722200253450ustar00rootroot00000000000000# Author:: Ernest Ellingson # Copyright:: Copyright (c) 2005 # These are extensions to the std-lib 'matrix' to allow an all ruby SVD require 'matrix' class Matrix def Matrix.diag(s) Matrix.diagonal(*s) end alias :trans :transpose def SV_decomp(maxSweeps = 20) if self.row_size >= self.column_size q = self.trans * self else q = self * self.trans end qrot = q.dup v = Matrix.identity(q.row_size) mzrot = nil cnt = 0 s_old = nil mu = nil while true do cnt += 1 for row in (0...qrot.row_size-1) do for col in (1..qrot.row_size-1) do next if row == col h = Math.atan((2 * qrot[row,col])/(qrot[row,row]-qrot[col,col]))/2.0 hcos = Math.cos(h) hsin = Math.sin(h) mzrot = Matrix.identity(qrot.row_size) mzrot[row,row] = hcos mzrot[row,col] = -hsin mzrot[col,row] = hsin 
mzrot[col,col] = hcos qrot = mzrot.trans * qrot * mzrot v = v * mzrot end end s_old = qrot.dup if cnt == 1 sum_qrot = 0.0 if cnt > 1 qrot.row_size.times do |r| sum_qrot += (qrot[r,r]-s_old[r,r]).abs if (qrot[r,r]-s_old[r,r]).abs > 0.001 end s_old = qrot.dup end break if (sum_qrot <= 0.001 and cnt > 1) or cnt >= maxSweeps end # of do while true s = [] qrot.row_size.times do |r| s << Math.sqrt(qrot[r,r]) end #puts "cnt = #{cnt}" if self.row_size >= self.column_size mu = self * v * Matrix.diagonal(*s).inverse return [mu, v, s] else mu = (self.trans * v * Matrix.diagonal(*s).inverse) return [mu, v, s] end end def []=(i,j,val) @rows[i][j] = val end end classifier-reborn-2.0.4/lib/classifier-reborn/extensions/vector_serialize.rb000066400000000000000000000004371261522722200274070ustar00rootroot00000000000000module GSL class Vector def _dump(v) Marshal.dump( self.to_a ) end def self._load(arr) arry = Marshal.load(arr) return GSL::Vector.alloc(arry) end end class Matrix class < false # If you want to use ContentNodes with cached vector transpositions, use # lsi = ClassifierReborn::LSI.new :cache_node_vectors => true # def initialize(options = {}) @auto_rebuild = options[:auto_rebuild] != false @word_list, @items = WordList.new, {} @version, @built_at_version = 0, -1 @language = options[:language] || 'en' if @cache_node_vectors = options[:cache_node_vectors] extend CachedContentNode::InstanceMethods end end # Returns true if the index needs to be rebuilt. The index needs # to be built after all information is added, but before you start # using it for search, classification and cluster detection. def needs_rebuild? (@items.size > 1) && (@version != @built_at_version) end # Adds an item to the index. item is assumed to be a string, but # any item may be indexed so long as it responds to #to_s or if # you provide an optional block explaining how the indexer can # fetch fresh string data. 
This optional block is passed the item, # so the item may only be a reference to a URL or file name. # # For example: # lsi = ClassifierReborn::LSI.new # lsi.add_item "This is just plain text" # lsi.add_item("/home/me/filename.txt") { |x| File.read x } # ar = ActiveRecordObject.find( :all ) # lsi.add_item(ar, *ar.categories) { |x| ar.content } # def add_item( item, *categories, &block ) clean_word_hash = Hasher.clean_word_hash((block ? block.call(item) : item.to_s), @language) @items[item] = if @cache_node_vectors CachedContentNode.new(clean_word_hash, *categories) else ContentNode.new(clean_word_hash, *categories) end @version += 1 build_index if @auto_rebuild end # A less flexible shorthand for add_item that assumes # you are passing in a string with no categories. item # will be duck typed via to_s. # def <<( item ) add_item item end # Returns the categories for a given indexed item. You are free to add and remove # items from this as you see fit. It does not invalidate an index to change its categories. def categories_for(item) return [] unless @items[item] return @items[item].categories end # Removes an item from the database, if it is indexed. # def remove_item( item ) if @items.key? item @items.delete item @version += 1 end end # Returns an array of items that are indexed. def items @items.keys end # This function rebuilds the index if needs_rebuild? returns true. # For very large document spaces, this indexing operation may take some # time to complete, so it may be wise to place the operation in another # thread. # # As a rule, indexing will be fairly swift on modern machines until # you have well over 500 documents indexed, or have an incredibly diverse # vocabulary for your documents. # # The optional parameter "cutoff" is a tuning parameter. When the index is # built, a certain number of s-values are discarded from the system. The # cutoff parameter tells the indexer how many of these values to keep. 
# A value of 1 for cutoff means that no semantic analysis will take place, # turning the LSI class into a simple vector search engine. def build_index( cutoff=0.75 ) return unless needs_rebuild? make_word_list doc_list = @items.values tda = doc_list.collect { |node| node.raw_vector_with( @word_list ) } if $GSL tdm = GSL::Matrix.alloc(*tda).trans ntdm = build_reduced_matrix(tdm, cutoff) ntdm.size[1].times do |col| vec = GSL::Vector.alloc( ntdm.column(col) ).row doc_list[col].lsi_vector = vec doc_list[col].lsi_norm = vec.normalize end else tdm = Matrix.rows(tda).trans ntdm = build_reduced_matrix(tdm, cutoff) ntdm.row_size.times do |col| doc_list[col].lsi_vector = ntdm.column(col) if doc_list[col] doc_list[col].lsi_norm = ntdm.column(col).normalize if doc_list[col] end end @built_at_version = @version end # This method returns max_chunks entries, ordered by their average semantic rating. # Essentially, the average distance of each entry from all other entries is calculated, and # the highest are returned. # # This can be used to build a summary service, or to provide more information about # your dataset's general content. For example, if you were to use categorize on the # results of this data, you could gather information on what your dataset is generally # about. def highest_relative_content( max_chunks=10 ) return [] if needs_rebuild? avg_density = Hash.new @items.each_key { |item| avg_density[item] = proximity_array_for_content(item).inject(0.0) { |x,y| x + y[1]} } avg_density.keys.sort_by { |x| avg_density[x] }.reverse[0..max_chunks-1] end # This function is the primitive that find_related and classify # build upon. It returns an array of 2-element arrays. The first element # of each pair is a document, and the second is its "score", defining # how "close" it is to other indexed items. 
# # These values are somewhat arbitrary, having to do with the vector space # created by your content, so the magnitude is interpretable but not always # meaningful between indexes. # # The parameter doc is the content to compare. If that content is not # indexed, you can pass an optional block to define how to create the # text data. See add_item for examples of how this works. def proximity_array_for_content( doc, &block ) return [] if needs_rebuild? content_node = node_for_content( doc, &block ) result = @items.keys.collect do |item| if $GSL val = content_node.search_vector * @items[item].transposed_search_vector else val = (Matrix[content_node.search_vector] * @items[item].search_vector)[0] end [item, val] end result.sort_by { |x| x[1] }.reverse end # Similar to proximity_array_for_content, this function takes similar # arguments and returns a similar array. However, it uses the normalized # calculated vectors instead of their full versions. This is useful when # you're trying to perform operations on content that is much smaller than # the text you're working with. search uses this primitive. def proximity_norms_for_content( doc, &block ) return [] if needs_rebuild? content_node = node_for_content( doc, &block ) result = @items.keys.collect do |item| if $GSL val = content_node.search_norm * @items[item].search_norm.col else val = (Matrix[content_node.search_norm] * @items[item].search_norm)[0] end [item, val] end result.sort_by { |x| x[1] }.reverse end # This function allows for text-based search of your index. Unlike other functions # like find_related and classify, search only takes short strings. It will also ignore # factors like repeated words. It is best for short, Google-like search terms. # A search will first prioritize lexical relationships, then semantic ones. # # While this may seem backwards compared to the other functions that LSI supports, # it is actually the same algorithm, just applied on a smaller document. 
def search( string, max_nearest=3 ) return [] if needs_rebuild? carry = proximity_norms_for_content( string ) result = carry.collect { |x| x[0] } return result[0..max_nearest-1] end # This function takes content and finds other documents # that are semantically "close", returning an array of documents sorted # from most to least relevant. # max_nearest specifies the number of documents to return. A value of # 0 means that it returns all the indexed documents, sorted by relevance. # # This is particularly useful for identifying clusters in your document space. # For example you may want to identify several "What's Related" items for weblog # articles, or find paragraphs that relate to each other in an essay. def find_related( doc, max_nearest=3, &block ) carry = proximity_array_for_content( doc, &block ).reject { |pair| pair[0].eql? doc } result = carry.collect { |x| x[0] } return result[0..max_nearest-1] end # Return the most obvious category with the score def classify_with_score( doc, cutoff=0.30, &block) return scored_categories(doc, cutoff, &block).last end # Return the most obvious category without the score def classify( doc, cutoff=0.30, &block ) return scored_categories(doc, cutoff, &block).last.first end # This function uses a voting system to categorize documents, based on # the categories of other documents. It uses the same logic as the # find_related function to find related documents, then returns the # list of categories sorted by ascending score (best match last). # # cutoff signifies the fraction of indexed documents to consider when classifying # text. A cutoff of 1 means that every document in the index votes on # what category the document is in. This may not always make sense. 
# def scored_categories( doc, cutoff=0.30, &block ) icutoff = (@items.size * cutoff).round carry = proximity_array_for_content( doc, &block ) carry = carry[0..icutoff-1] votes = Hash.new(0.0) carry.each do |pair| @items[pair[0]].categories.each do |category| votes[category] += pair[1] end end return votes.sort_by { |_, score| score } end # Prototype, only works on indexed documents. # I have no clue if this is going to work, but in theory # it's supposed to. def highest_ranked_stems( doc, count=3 ) raise "Requested stem ranking on non-indexed content!" unless @items[doc] arr = node_for_content(doc).lsi_vector.to_a top_n = arr.sort.reverse[0..count-1] return top_n.collect { |x| @word_list.word_for_index(arr.index(x))} end private def build_reduced_matrix( matrix, cutoff=0.75 ) # TODO: Check that M>=N on these dimensions! Transpose helps assure this u, v, s = matrix.SV_decomp # TODO: Better than 75% term, please. :\ s_cutoff = s.sort.reverse[(s.size * cutoff).round - 1] s.size.times do |ord| s[ord] = 0.0 if s[ord] < s_cutoff end # Reconstruct the term document matrix, only with reduced rank u * ($GSL ? GSL::Matrix : ::Matrix).diag( s ) * v.trans end def node_for_content(item, &block) if @items[item] return @items[item] else clean_word_hash = Hasher.clean_word_hash((block ? block.call(item) : item.to_s), @language) cn = ContentNode.new(clean_word_hash, &block) # make the node and extract the data unless needs_rebuild? 
cn.raw_vector_with( @word_list ) # make the lsi raw and norm vectors end end return cn end def make_word_list @word_list = WordList.new @items.each_value do |node| node.word_hash.each_key { |key| @word_list.add_word key } end end end end classifier-reborn-2.0.4/lib/classifier-reborn/lsi/000077500000000000000000000000001261522722200220755ustar00rootroot00000000000000classifier-reborn-2.0.4/lib/classifier-reborn/lsi/cached_content_node.rb000066400000000000000000000025421261522722200263730ustar00rootroot00000000000000# Author:: Kelley Reynolds (mailto:kelley@insidesystems.net) # Copyright:: Copyright (c) 2015 Kelley Reynolds # License:: LGPL module ClassifierReborn # Subclass of ContentNode which caches the search_vector transpositions. # It's great because it's much faster for large indexes, but at the cost of more RAM. Additionally, # if you Marshal your classifier and want to keep the size down, you'll need to manually # clear the cache before you dump. class CachedContentNode < ContentNode module InstanceMethods # Go through each item in this index and clear the cache def clear_cache! @items.each_value(&:clear_cache!) end end def initialize( word_hash, *categories ) clear_cache! super end def clear_cache! @transposed_search_vector = nil end # Cache the transposed vector, it gets used a lot def transposed_search_vector @transposed_search_vector ||= super end # Clear the cache before we continue on def raw_vector_with( word_list ) clear_cache! 
super end # We don't want the cached_data here def marshal_dump [@lsi_vector, @lsi_norm, @raw_vector, @raw_norm, @categories, @word_hash] end def marshal_load(array) @lsi_vector, @lsi_norm, @raw_vector, @raw_norm, @categories, @word_hash = array end end end classifier-reborn-2.0.4/lib/classifier-reborn/lsi/content_node.rb000066400000000000000000000054751261522722200251140ustar00rootroot00000000000000# Author:: David Fayram (mailto:dfayram@lensmen.net) # Copyright:: Copyright (c) 2005 David Fayram II # License:: LGPL module ClassifierReborn # This is an internal data structure class for the LSI node. Save for # raw_vector_with, it should be fairly straightforward to understand. # You should never have to use it directly. class ContentNode attr_accessor :raw_vector, :raw_norm, :lsi_vector, :lsi_norm, :categories attr_reader :word_hash # If text_proc is not specified, the source will be duck-typed # via source.to_s def initialize( word_hash, *categories ) @categories = categories || [] @word_hash = word_hash @lsi_norm, @lsi_vector = nil end # Use this to fetch the appropriate search vector. def search_vector @lsi_vector || @raw_vector end # Method to access the transposed search vector def transposed_search_vector search_vector.col end # Use this to fetch the appropriate search vector in normalized form. def search_norm @lsi_norm || @raw_norm end # Creates the raw vector out of word_hash using word_list as the # key for mapping the vector space. 
def raw_vector_with( word_list ) if $GSL vec = GSL::Vector.alloc(word_list.size) else vec = Array.new(word_list.size, 0) end @word_hash.each_key do |word| vec[word_list[word]] = @word_hash[word] if word_list[word] end # Perform the scaling transform and force floating point arithmetic if $GSL sum = 0.0 vec.each {|v| sum += v } total_words = sum else total_words = vec.reduce(0, :+).to_f end total_unique_words = 0 if $GSL vec.each { |word| total_unique_words += 1 if word != 0.0 } else total_unique_words = vec.count{ |word| word != 0 } end # Perform first-order association transform if this vector has more # than one word in it. if total_words > 1.0 && total_unique_words > 1 weighted_total = 0.0 # Cache calculations, this takes too long on large indexes cached_calcs = Hash.new { |hash, term| hash[term] = (( term / total_words ) * Math.log( term / total_words )) } vec.each do |term| weighted_total += cached_calcs[term] if term > 0.0 end # Cache calculations, this takes too long on large indexes cached_calcs = Hash.new do |hash, val| hash[val] = Math.log( val + 1 ) / -weighted_total end vec.collect! { |val| cached_calcs[val] } end if $GSL @raw_norm = vec.normalize @raw_vector = vec else @raw_norm = Vector[*vec].normalize @raw_vector = Vector[*vec] end end end end classifier-reborn-2.0.4/lib/classifier-reborn/lsi/summarizer.rb000066400000000000000000000020301261522722200246130ustar00rootroot00000000000000# Author:: Lucas Carlson (mailto:lucas@rufy.com) # Copyright:: Copyright (c) 2005 Lucas Carlson # License:: LGPL module ClassifierReborn module Summarizer extend self def summary( str, count=10, separator=" [...] " ) perform_lsi split_sentences(str), count, separator end def paragraph_summary( str, count=1, separator=" [...] 
" ) perform_lsi split_paragraphs(str), count, separator end def split_sentences(str) str.split(/(\.|\!|\?)/) # TODO: make this less primitive end def split_paragraphs(str) str.split(/(\n\n|\r\r|\r\n\r\n)/) # TODO: make this less primitive end def perform_lsi(chunks, count, separator) lsi = ClassifierReborn::LSI.new :auto_rebuild => false chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 } lsi.build_index summaries = lsi.highest_relative_content count return summaries.reject { |chunk| !summaries.include? chunk }.map { |x| x.strip }.join(separator) end end end classifier-reborn-2.0.4/lib/classifier-reborn/lsi/word_list.rb000066400000000000000000000015401261522722200244300ustar00rootroot00000000000000# Author:: David Fayram (mailto:dfayram@lensmen.net) # Copyright:: Copyright (c) 2005 David Fayram II # License:: LGPL module ClassifierReborn # This class keeps a word => index mapping. It is used to map stemmed words # to dimensions of a vector. class WordList def initialize @location_table = Hash.new end # Adds a word (if it is new) and assigns it a unique dimension. def add_word(word) term = word @location_table[term] = @location_table.size unless @location_table[term] end # Returns the dimension of the word or nil if the word is not in the space. def [](lookup) term = lookup @location_table[term] end def word_for_index(ind) @location_table.invert[ind] end # Returns the number of words mapped. 
def size @location_table.size end end end classifier-reborn-2.0.4/lib/classifier-reborn/version.rb000066400000000000000000000000601261522722200233140ustar00rootroot00000000000000module ClassifierReborn VERSION = '2.0.4' end classifier-reborn-2.0.4/script/000077500000000000000000000000001261522722200164335ustar00rootroot00000000000000classifier-reborn-2.0.4/script/bootstrap000077500000000000000000000000541261522722200203750ustar00rootroot00000000000000#!/usr/bin/env bash bundle install --jobs=8 classifier-reborn-2.0.4/script/cibuild000077500000000000000000000000421261522722200177700ustar00rootroot00000000000000#!/usr/bin/env bash ./script/test classifier-reborn-2.0.4/script/test000077500000000000000000000005601261522722200173410ustar00rootroot00000000000000#! /bin/bash # # Usage: # script/test # script/test if [ -z "$1" ]; then TEST_FILES="./test/test_*.rb" else TEST_FILES="$@" fi RAKE_LIB_DIR=$(ruby -e "puts Gem::Specification.find_by_name('rake').gem_dir + '/lib'") set -x time bundle exec ruby -I"lib:test" \ -I"${RAKE_LIB_DIR}" \ "${RAKE_LIB_DIR}/rake/rake_test_loader.rb" \ $TEST_FILES classifier-reborn-2.0.4/test/000077500000000000000000000000001261522722200161065ustar00rootroot00000000000000classifier-reborn-2.0.4/test/bayes/000077500000000000000000000000001261522722200172115ustar00rootroot00000000000000classifier-reborn-2.0.4/test/bayes/bayesian_test.rb000077500000000000000000000072301261522722200223750ustar00rootroot00000000000000# encoding: utf-8 require File.dirname(__FILE__) + '/../test_helper' class BayesianTest < Test::Unit::TestCase def setup @classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting' end def test_good_training assert_nothing_raised { @classifier.train_interesting "love" } end def test_training_with_utf8 assert_nothing_raised { @classifier.train_interesting "Água" } end def test_bad_training assert_raise(StandardError) { @classifier.train_no_category "words" } end def test_bad_method assert_raise(NoMethodError) { 
@classifier.forget_everything_you_know "" } end def test_categories assert_equal ['Interesting', 'Uninteresting'].sort, @classifier.categories.sort end def test_categories_from_array another_classifier = ClassifierReborn::Bayes.new ['Interesting', 'Uninteresting'] assert_equal another_classifier.categories.sort, @classifier.categories.sort end def test_add_category @classifier.add_category 'Test' assert_equal ['Test', 'Interesting', 'Uninteresting'].sort, @classifier.categories.sort end def test_dynamic_category_succeeds_with_auto_categorize classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting', auto_categorize: true classifier.train('Ruby', 'I really sweet language') assert classifier.categories.include?('Ruby') end def test_dynamic_category_fails_without_auto_categorize assert_raises(ClassifierReborn::Bayes::CategoryNotFoundError) { @classifier.train('Ruby', 'A really sweet language') } refute @classifier.categories.include?('Ruby') end def test_classification @classifier.train_interesting "here are some good words. I hope you love them" @classifier.train_uninteresting "here are some bad words, I hate you" assert_equal 'Uninteresting', @classifier.classify("I hate bad words and you") end def test_classification_with_threshold b = ClassifierReborn::Bayes.new 'Digit' assert_equal 1, b.categories.size refute b.threshold_enabled? b.enable_threshold assert b.threshold_enabled? assert_equal 0.0, b.threshold # default b.threshold = -7.0 10.times do |a_number| b.train_digit(a_number.to_s) b.train_digit(a_number.to_s) end 10.times do |a_number| assert_equal 'Digit', b.classify(a_number.to_s) end refute b.classify("xyzzy") end def test_classification_with_threshold_again b = ClassifierReborn::Bayes.new 'Normal' assert_equal 1, b.categories.size refute b.threshold_enabled? b.enable_threshold assert b.threshold_enabled? 
assert_equal 0.0, b.threshold # default %w{ http://example.com/about http://example.com/contact http://example.com/download http://example.com/login http://example.com/logout http://example.com/blog/2015-04-01 }.each do |url| b.train_normal(url) end assert 'Normal', b.classify('http://example.com') refute b.classify("http://example.com/login/?user='select * from users;'") end def test_classification_with_score @classifier.train_interesting "here are some good words. I hope you love them" @classifier.train_uninteresting "here are some bad words, I hate you" assert_in_delta -4.85, @classifier.classify_with_score("I hate bad words and you")[1], 0.1 end def test_untrain @classifier.train_interesting "here are some good words. I hope you love them" @classifier.train_uninteresting "here are some bad words, I hate you" @classifier.add_category 'colors' @classifier.train_colors "red orange green blue seven" classification_of_bad_data = @classifier.classify "seven" @classifier.untrain_colors "seven" classification_after_untrain = @classifier.classify "seven" assert_not_equal classification_of_bad_data, classification_after_untrain end end classifier-reborn-2.0.4/test/data/000077500000000000000000000000001261522722200170175ustar00rootroot00000000000000classifier-reborn-2.0.4/test/data/stopwords/000077500000000000000000000000001261522722200210635ustar00rootroot00000000000000classifier-reborn-2.0.4/test/data/stopwords/en000066400000000000000000000000321261522722200214030ustar00rootroot00000000000000These are custom stopwordsclassifier-reborn-2.0.4/test/extensions/000077500000000000000000000000001261522722200203055ustar00rootroot00000000000000classifier-reborn-2.0.4/test/extensions/hasher_test.rb000066400000000000000000000024671261522722200231540ustar00rootroot00000000000000require File.dirname(__FILE__) + '/../test_helper' class HasherTest < Test::Unit::TestCase def setup @original_stopwords_path = Hasher::STOPWORDS_PATH.dup end def test_word_hash hash = {:good=>1, :"!"=>1, 
:hope=>1, :"'"=>1, :"."=>1, :love=>1, :word=>1, :them=>1, :test=>1} assert_equal hash, Hasher.word_hash("here are some good words of test's. I hope you love them!") end def test_clean_word_hash hash = {:good=>1, :word=>1, :hope=>1, :love=>1, :them=>1, :test=>1} assert_equal hash, Hasher.clean_word_hash("here are some good words of test's. I hope you love them!") end def test_default_stopwords assert_not_empty Hasher::STOPWORDS['en'] assert_not_empty Hasher::STOPWORDS['fr'] assert_empty Hasher::STOPWORDS['gibberish'] end def test_loads_custom_stopwords default_english_stopwords = Hasher::STOPWORDS['en'] # Remove the english stopwords Hasher::STOPWORDS.delete('en') # Add a custom stopwords path Hasher::STOPWORDS_PATH.unshift File.expand_path(File.dirname(__FILE__) + '/../data/stopwords') custom_english_stopwords = Hasher::STOPWORDS['en'] assert_not_equal default_english_stopwords, custom_english_stopwords end def teardown Hasher::STOPWORDS.clear Hasher::STOPWORDS_PATH.clear.concat @original_stopwords_path end end classifier-reborn-2.0.4/test/lsi/000077500000000000000000000000001261522722200166755ustar00rootroot00000000000000classifier-reborn-2.0.4/test/lsi/lsi_test.rb000066400000000000000000000141131261522722200210500ustar00rootroot00000000000000require File.dirname(__FILE__) + '/../test_helper' class LSITest < Test::Unit::TestCase def setup # we repeat principal words to help weight them. # This test is rather delicate, since this system is mostly noise. @str1 = "This text deals with dogs. Dogs." @str2 = "This text involves dogs too. Dogs! " @str3 = "This text revolves around cats. Cats." @str4 = "This text also involves cats. Cats!" @str5 = "This text involves birds. Birds." end def test_basic_indexing lsi = ClassifierReborn::LSI.new [@str1, @str2, @str3, @str4, @str5].each { |x| lsi << x } assert ! lsi.needs_rebuild? # note that the closest match to str1 is str2, even though it is not # the closest text match. 
assert_equal [@str2, @str5, @str3], lsi.find_related(@str1, 3) end def test_not_auto_rebuild lsi = ClassifierReborn::LSI.new :auto_rebuild => false lsi.add_item @str1, "Dog" lsi.add_item @str2, "Dog" assert lsi.needs_rebuild? lsi.build_index assert ! lsi.needs_rebuild? end def test_basic_categorizing lsi = ClassifierReborn::LSI.new lsi.add_item @str2, "Dog" lsi.add_item @str3, "Cat" lsi.add_item @str4, "Cat" lsi.add_item @str5, "Bird" assert_equal "Dog", lsi.classify( @str1 ) assert_equal "Cat", lsi.classify( @str3 ) assert_equal "Bird", lsi.classify( @str5 ) end def test_basic_categorizing_with_score lsi = ClassifierReborn::LSI.new lsi.add_item @str2, "Dog" lsi.add_item @str3, "Cat" lsi.add_item @str4, "Cat" lsi.add_item @str5, "Bird" assert_in_delta 2.49, lsi.classify_with_score( @str1 )[1], 0.1 assert_in_delta 1.41, lsi.classify_with_score( @str3 )[1], 0.1 assert_in_delta 1.99, lsi.classify_with_score( @str5 )[1], 0.1 end def test_scored_categories lsi = ClassifierReborn::LSI.new lsi.add_item @str1, "Dog" lsi.add_item @str2, "Dog" lsi.add_item @str3, "Cat" lsi.add_item @str4, "Cat" lsi.add_item @str5, "Bird" scored_categories = lsi.scored_categories("dog bird cat") assert_equal 2, scored_categories.size assert_equal ["Bird", "Dog"], scored_categories.map(&:first) end def test_external_classifying lsi = ClassifierReborn::LSI.new bayes = ClassifierReborn::Bayes.new 'Dog', 'Cat', 'Bird' lsi.add_item @str1, "Dog" ; bayes.train_dog @str1 lsi.add_item @str2, "Dog" ; bayes.train_dog @str2 lsi.add_item @str3, "Cat" ; bayes.train_cat @str3 lsi.add_item @str4, "Cat" ; bayes.train_cat @str4 lsi.add_item @str5, "Bird" ; bayes.train_bird @str5 # We're talking about dogs. Even though the text matches the corpus on # cats better. Dogs have more semantic weight than cats. So bayes # will fail here, but the LSI recognizes content. tricky_case = "This text revolves around dogs." 
assert_equal "Dog", lsi.classify( tricky_case ) assert_not_equal "Dog", bayes.classify( tricky_case ) end def test_recategorize_interface lsi = ClassifierReborn::LSI.new lsi.add_item @str1, "Dog" lsi.add_item @str2, "Dog" lsi.add_item @str3, "Cat" lsi.add_item @str4, "Cat" lsi.add_item @str5, "Bird" tricky_case = "This text revolves around dogs." assert_equal "Dog", lsi.classify( tricky_case ) # Recategorize as needed. lsi.categories_for(@str1).clear.push "Cow" lsi.categories_for(@str2).clear.push "Cow" assert !lsi.needs_rebuild? assert_equal "Cow", lsi.classify( tricky_case ) end def test_search lsi = ClassifierReborn::LSI.new [@str1, @str2, @str3, @str4, @str5].each { |x| lsi << x } # Searching by content and text, note that @str2 comes up first, because # both "dog" and "involve" are present. But, the next match is @str1 instead # of @str4, because "dog" carries more weight than involves. assert_equal( [@str2, @str1, @str4, @str5, @str3], lsi.search("dog involves", 100) ) # Keyword search shows how the space is mapped out in relation to # dog when magnitude is removed. Note the relations. We move from dog # through involve and then finally to other words. 
assert_equal( [@str1, @str2, @str4, @str5, @str3], lsi.search("dog", 5) ) end def test_serialize_safe lsi = ClassifierReborn::LSI.new [@str1, @str2, @str3, @str4, @str5].each { |x| lsi << x } lsi_md = Marshal.dump lsi lsi_m = Marshal.load lsi_md assert_equal lsi_m.search("cat", 3), lsi.search("cat", 3) assert_equal lsi_m.find_related(@str1, 3), lsi.find_related(@str1, 3) end def test_uncached_content_node_option lsi = ClassifierReborn::LSI.new [@str1, @str2, @str3, @str4, @str5].each { |x| lsi << x } lsi.instance_variable_get(:@items).values.each { |node| assert node.instance_of?(ContentNode) } end def test_cached_content_node_option lsi = ClassifierReborn::LSI.new(cache_node_vectors: true) [@str1, @str2, @str3, @str4, @str5].each { |x| lsi << x } lsi.instance_variable_get(:@items).values.each { |node| assert node.instance_of?(CachedContentNode) } end def test_clears_cached_content_node_cache if $GSL lsi = ClassifierReborn::LSI.new(cache_node_vectors: true) lsi.add_item @str1, "Dog" lsi.add_item @str2, "Dog" lsi.add_item @str3, "Cat" lsi.add_item @str4, "Cat" lsi.add_item @str5, "Bird" assert_equal "Dog", lsi.classify( "something about dogs, but not an exact dog string" ) first_content_node = lsi.instance_variable_get(:@items).values.first refute_nil first_content_node.instance_variable_get(:@transposed_search_vector) lsi.clear_cache! assert_nil first_content_node.instance_variable_get(:@transposed_search_vector) end end def test_keyword_search lsi = ClassifierReborn::LSI.new lsi.add_item @str1, "Dog" lsi.add_item @str2, "Dog" lsi.add_item @str3, "Cat" lsi.add_item @str4, "Cat" lsi.add_item @str5, "Bird" assert_equal [:dog, :text, :deal], lsi.highest_ranked_stems(@str1) end def test_summary assert_equal "This text involves dogs too [...] 
This text also involves cats", Summarizer.summary([@str1, @str2, @str3, @str4, @str5].join, 2) end end classifier-reborn-2.0.4/test/lsi/word_list_test.rb000066400000000000000000000015151261522722200222710ustar00rootroot00000000000000require_relative '../test_helper' class WordListTest < Test::Unit::TestCase def test_size_does_not_count_words_twice list = ClassifierReborn::WordList.new assert list.size == 0 list.add_word('hello') assert list.size == 1 list.add_word('hello') assert list.size == 1 list.add_word('world') assert list.size == 2 end def test_brackets_return_correct_position_based_on_add_order list = ClassifierReborn::WordList.new list.add_word('hello') list.add_word('world') assert list['hello'] == 0 assert list['world'] == 1 end def test_word_for_index_returns_correct_word_based_on_add_order list = ClassifierReborn::WordList.new list.add_word('hello') list.add_word('world') assert list.word_for_index(0) == 'hello' assert list.word_for_index(1) == 'world' end end classifier-reborn-2.0.4/test/test_helper.rb000066400000000000000000000001721261522722200207510ustar00rootroot00000000000000$:.unshift(File.dirname(__FILE__) + '/../lib') require 'test/unit' require 'classifier-reborn' include ClassifierReborn