==> zope.index-3.6.4/buildout.cfg <==
[buildout]
develop = .
parts = test coverage-test coverage-report python

[test]
recipe = zc.recipe.testrunner
eggs = zope.index [test]

[coverage-test]
recipe = zc.recipe.testrunner
eggs = zope.index [test]
defaults = ['--coverage', '../../coverage']

[coverage-report]
recipe = zc.recipe.egg
eggs = z3c.coverage
scripts = coverage=coverage-report
arguments = ('coverage', 'coverage/report')

[python]
recipe = zc.recipe.egg
eggs = zope.index
interpreter = python

==> zope.index-3.6.4/COPYRIGHT.txt <==
Zope Foundation and Contributors

==> zope.index-3.6.4/README.txt <==
Overview
--------

The ``zope.index`` package provides several indices for the Zope
catalog.  These include:

* a field index (for indexing orderable values),

* a keyword index,

* a topic index,

* a text index (with support for lexicon, splitter, normalizer, etc.)

==> zope.index-3.6.4/LICENSE.txt <==
Zope Public License (ZPL) Version 2.1

A copyright notice accompanies this license document that identifies
the copyright holders.

This license has been certified as open source.  It has also been
designated as GPL compatible by the Free Software Foundation (FSF).

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

1. Redistributions in source code must retain the accompanying
   copyright notice, this list of conditions, and the following
   disclaimer.

2. Redistributions in binary form must reproduce the accompanying
   copyright notice, this list of conditions, and the following
   disclaimer in the documentation and/or other materials provided
   with the distribution.

3. Names of the copyright holders must not be used to endorse or
   promote products derived from this software without prior written
   permission from the copyright holders.

4. The right to distribute this software or to use it for any purpose
   does not give you the right to use Servicemarks (sm) or Trademarks
   (tm) of the copyright holders.  Use of them is covered by separate
   agreement with the copyright holders.

5. If any files are modified, you must cause the modified files to
   carry prominent notices stating that you changed the files and the
   date of any change.

Disclaimer

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
  PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE
  LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
  CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
  SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
  BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
  LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
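To make the README's feature list concrete before diving into the
source, here is a minimal sketch of driving one of these indices; the
authoritative examples live in ``src/zope/index/field/README.txt``
below, and the output reprs assume the default 32-bit BTrees family:

  >>> from zope.index.field import FieldIndex
  >>> index = FieldIndex()
  >>> for docid, value in [(1, 10), (2, 25), (3, 40)]:
  ...     index.index_doc(docid, value)
  >>> index.apply((20, None))
  IFSet([2, 3])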
zope.index-3.6.4/bootstrap.py0000644000175000017500000000330211727503631017345 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2006 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Bootstrap a buildout-based project Simply run this script in a directory containing a buildout.cfg. The script accepts buildout command-line options, so you can use the -c option to specify an alternate configuration file. """ import os, shutil, sys, tempfile, urllib2 tmpeggs = tempfile.mkdtemp() ez = {} exec urllib2.urlopen('http://peak.telecommunity.com/dist/ez_setup.py' ).read() in ez ez['use_setuptools'](to_dir=tmpeggs, download_delay=0) import pkg_resources cmd = 'from setuptools.command.easy_install import main; main()' if sys.platform == 'win32': cmd = '"%s"' % cmd # work around spawn lamosity on windows ws = pkg_resources.working_set assert os.spawnle( os.P_WAIT, sys.executable, sys.executable, '-c', cmd, '-mqNxd', tmpeggs, 'zc.buildout', dict(os.environ, PYTHONPATH= ws.find(pkg_resources.Requirement.parse('setuptools')).location ), ) == 0 ws.add_entry(tmpeggs) ws.require('zc.buildout') import zc.buildout.buildout zc.buildout.buildout.main(sys.argv[1:] + ['bootstrap']) shutil.rmtree(tmpeggs) zope.index-3.6.4/setup.py0000644000175000017500000000601511727503631016474 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2006 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## # This package is developed by the Zope Toolkit project, documented here: # http://docs.zope.org/zopetoolkit # When developing and releasing this package, please follow the documented # Zope Toolkit policies as described by this documentation. ############################################################################## """Setup for zope.index package """ import sys import os from setuptools import setup, find_packages, Extension from distutils.command.build_ext import build_ext from distutils.errors import CCompilerError from distutils.errors import DistutilsExecError from distutils.errors import DistutilsPlatformError long_description = (open('README.txt').read() + '\n\n' + open('CHANGES.txt').read()) class optional_build_ext(build_ext): """This class subclasses build_ext and allows the building of C extensions to fail. 
""" def run(self): try: build_ext.run(self) except DistutilsPlatformError, e: self._unavailable(e) def build_extension(self, ext): try: build_ext.build_extension(self, ext) except (CCompilerError, DistutilsExecError), e: self._unavailable(e) def _unavailable(self, e): print >> sys.stderr, '*' * 80 print >> sys.stderr, """WARNING: An optional code optimization (C extension) could not be compiled. Optimizations for this package will not be available!""" print >> sys.stderr print >> sys.stderr, e print >> sys.stderr, '*' * 80 setup(name='zope.index', version='3.6.4', url='http://pypi.python.org/pypi/zope.index', license='ZPL 2.1', author='Zope Foundation and Contributors', author_email='zope-dev@zope.org', description="Indices for using with catalog like text, field, etc.", long_description=long_description, packages=find_packages('src'), package_dir = {'': 'src'}, namespace_packages=['zope',], extras_require={'test': []}, install_requires=['setuptools', 'ZODB3>=3.8', 'zope.interface'], include_package_data = True, ext_modules=[ Extension('zope.index.text.okascore', [os.path.join('src', 'zope', 'index', 'text', 'okascore.c')]), ], zip_safe=False, cmdclass = {'build_ext':optional_build_ext}, ) zope.index-3.6.4/src/0000755000175000017500000000000011727503757015560 5ustar tseavertseaver00000000000000zope.index-3.6.4/src/zope/0000755000175000017500000000000011727503757016535 5ustar tseavertseaver00000000000000zope.index-3.6.4/src/zope/__init__.py0000644000175000017500000000007011727503631020632 0ustar tseavertseaver00000000000000__import__('pkg_resources').declare_namespace(__name__) zope.index-3.6.4/src/zope/index/0000755000175000017500000000000011727503757017644 5ustar tseavertseaver00000000000000zope.index-3.6.4/src/zope/index/topic/0000755000175000017500000000000011727503757020762 5ustar tseavertseaver00000000000000zope.index-3.6.4/src/zope/index/topic/interfaces.py0000644000175000017500000000332011727503631023444 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Basic interfaces shared between different types of index. """ from zope.interface import Interface class ITopicQuerying(Interface): """Query over topics, seperated by white space.""" def search(query, operator='and'): """Execute a search given by 'query' as a list/tuple of filter ids. 'operator' can be 'and' or 'or' to search for matches in all or any filter. Return an IFSet of docids """ class ITopicFilteredSet(Interface): """Interface for filtered sets used by topic indexes.""" def clear(): """Remove all entries from the index.""" def index_doc(docid, context): """Add an object's info to the index.""" def unindex_doc(docid): """Remove an object with id 'docid' from the index.""" def getId(): """Return the id of the filter itself.""" def setExpression(expr): """Set the filter expression, e.g. 
==> zope.index-3.6.4/src/zope/__init__.py <==
__import__('pkg_resources').declare_namespace(__name__)

==> zope.index-3.6.4/src/zope/index/topic/interfaces.py <==
##############################################################################
#
# Copyright (c) 2002 Zope Foundation and Contributors.
# All Rights Reserved.
#
# This software is subject to the provisions of the Zope Public License,
# Version 2.1 (ZPL).  A copy of the ZPL should accompany this distribution.
# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED
# WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS
# FOR A PARTICULAR PURPOSE.
#
##############################################################################
"""Basic interfaces shared between different types of index.
"""
from zope.interface import Interface


class ITopicQuerying(Interface):
    """Query over topics, separated by whitespace."""

    def search(query, operator='and'):
        """Execute a search given by 'query' as a list/tuple of filter ids.

        'operator' can be 'and' or 'or' to search for matches in all
        or any filter.

        Return an IFSet of docids.
        """


class ITopicFilteredSet(Interface):
    """Interface for filtered sets used by topic indexes."""

    def clear():
        """Remove all entries from the index."""

    def index_doc(docid, context):
        """Add an object's info to the index."""

    def unindex_doc(docid):
        """Remove an object with id 'docid' from the index."""

    def getId():
        """Return the id of the filter itself."""

    def setExpression(expr):
        """Set the filter expression, e.g. 'context.meta_type == ...'."""

    def getExpression():
        """Return the filter expression."""

    def getIds():
        """Return an IFSet of docids."""

==> zope.index-3.6.4/src/zope/index/topic/index.py <==
##############################################################################
#
# Copyright (c) 2002 Zope Foundation and Contributors.
# All Rights Reserved.
#
# This software is subject to the provisions of the Zope Public License,
# Version 2.1 (ZPL).  A copy of the ZPL should accompany this distribution.
# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED
# WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS
# FOR A PARTICULAR PURPOSE
#
##############################################################################
"""Topic index
"""
from persistent import Persistent

import BTrees

from zope.interface import implements

from zope.index.interfaces import IInjection, IIndexSearch
from zope.index.topic.interfaces import ITopicQuerying


class TopicIndex(Persistent):

    implements(IInjection, ITopicQuerying, IIndexSearch)

    family = BTrees.family32

    def __init__(self, family=None):
        if family is not None:
            self.family = family
        self.clear()

    def clear(self):
        # mapping filter id -> filter
        self._filters = self.family.OO.BTree()

    def addFilter(self, f):
        """Add the filter 'f', keyed by its own id (f.getId())."""
        self._filters[f.getId()] = f

    def delFilter(self, id):
        """Remove a filter given by its ID 'id'."""
        del self._filters[id]

    def clearFilters(self):
        """Clear existing filters of their docids, but leave them in place."""
        for filter in self._filters.values():
            filter.clear()

    def index_doc(self, docid, obj):
        """Index an object."""
        for f in self._filters.values():
            f.index_doc(docid, obj)

    def unindex_doc(self, docid):
        """Unindex an object."""
        for f in self._filters.values():
            f.unindex_doc(docid)

    def search(self, query, operator='and'):
        if isinstance(query, basestring):
            query = [query]
        if not isinstance(query, (tuple, list)):
            raise TypeError(
                'query argument must be a list/tuple of filter ids')
        sets = []
        for id in self._filters.keys():
            if id in query:
                docids = self._filters[id].getIds()
                sets.append(docids)

        if operator == 'or':
            rs = self.family.IF.multiunion(sets)
        elif operator == 'and':
            # sort smallest to largest set so we intersect the smallest
            # number of document identifiers possible
            sets.sort(key=len)
            rs = None
            for set in sets:
                rs = self.family.IF.intersection(rs, set)
                if not rs:
                    break
        else:
            raise TypeError('Topic index only supports `and` and `or` '
                            'operators, not `%s`.' % operator)

        if rs:
            return rs
        else:
            return self.family.IF.Set()

    def apply(self, query):
        operator = 'and'
        if isinstance(query, dict):
            if 'operator' in query:
                operator = query.pop('operator')
            query = query['query']
        return self.search(query, operator=operator)
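Before the filter implementations that follow, a minimal sketch of how
TopicIndex and a PythonFilteredSet fit together, mirroring the patterns
exercised in ``topic/tests/test_index.py`` (output reprs assume the
default 32-bit BTrees family):

  >>> from zope.index.topic.index import TopicIndex
  >>> from zope.index.topic.filter import PythonFilteredSet
  >>> class Doc(object):
  ...     def __init__(self, meta_type):
  ...         self.meta_type = meta_type
  >>> index = TopicIndex()
  >>> index.addFilter(PythonFilteredSet('doc1', "context.meta_type == 'doc1'"))
  >>> index.addFilter(PythonFilteredSet('doc2', "context.meta_type == 'doc2'"))
  >>> index.index_doc(1, Doc('doc1'))
  >>> index.index_doc(2, Doc('doc2'))
  >>> index.search(['doc1', 'doc2'], operator='or')
  IFSet([1, 2])
  >>> index.search(['doc1', 'doc2'], operator='and')
  IFSet([])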
==> zope.index-3.6.4/src/zope/index/topic/filter.py <==
##############################################################################
#
# Copyright (c) 2002 Zope Foundation and Contributors.
# All Rights Reserved.
#
# This software is subject to the provisions of the Zope Public License,
# Version 2.1 (ZPL).  A copy of the ZPL should accompany this distribution.
# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED
# WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS
# FOR A PARTICULAR PURPOSE
#
##############################################################################
"""Filters for TopicIndexes
"""
import BTrees

from zope.index.topic.interfaces import ITopicFilteredSet
from zope.interface import implements


class FilteredSetBase(object):
    """Base class for all filtered sets.

    A filtered set is a collection of documents, represented by their
    document ids, that match a common criterion given by a condition.
    """

    implements(ITopicFilteredSet)

    family = BTrees.family32

    def __init__(self, id, expr, family=None):
        if family is not None:
            self.family = family
        self.id = id
        self.expr = expr
        self.clear()

    def clear(self):
        self._ids = self.family.IF.Set()

    def index_doc(self, docid, context):
        raise NotImplementedError

    def unindex_doc(self, docid):
        try:
            self._ids.remove(docid)
        except KeyError:
            pass

    def getId(self):
        return self.id

    def getExpression(self):
        return self.expr

    def setExpression(self, expr):
        self.expr = expr

    def getIds(self):
        return self._ids

    def __repr__(self): #pragma NO COVERAGE
        return '%s: (%s) %s' % (self.id, self.expr, list(self._ids))

    __str__ = __repr__


class PythonFilteredSet(FilteredSetBase):
    """A topic filtered set that checks a context against a Python
    expression.
    """

    def index_doc(self, docid, context):
        try:
            # 'context' is visible to the expression via the local scope.
            if eval(self.expr):
                self._ids.insert(docid)
        except:
            pass # ignore errors

==> zope.index-3.6.4/src/zope/index/topic/tests/test_filter.py <==
##############################################################################
#
# Copyright (c) 2009 Zope Foundation and Contributors.
# All Rights Reserved.
#
# This software is subject to the provisions of the Zope Public License,
# Version 2.1 (ZPL).  A copy of the ZPL should accompany this distribution.
# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED
# WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS
# FOR A PARTICULAR PURPOSE.
# ############################################################################## """Topic Index tests """ import unittest _marker = object() class FilteredSetBaseTests(unittest.TestCase): def _getTargetClass(self): from zope.index.topic.filter import FilteredSetBase return FilteredSetBase def _makeOne(self, id=None, expr=None, family=_marker): if id is None: id = 'test' if expr is None: expr = 'True' if family is _marker: return self._getTargetClass()(id, expr) return self._getTargetClass()(id, expr, family) def test_class_conforms_to_ITopicFilteredSet(self): from zope.interface.verify import verifyClass from zope.index.topic.interfaces import ITopicFilteredSet verifyClass(ITopicFilteredSet, self._getTargetClass()) def test_instance_conforms_to_ITopicFilteredSet(self): from zope.interface.verify import verifyObject from zope.index.topic.interfaces import ITopicFilteredSet verifyObject(ITopicFilteredSet, self._makeOne()) def test_ctor_defaults(self): import BTrees filter = self._makeOne(family=None) self.failUnless(filter.family is BTrees.family32) self.assertEqual(filter.getId(), 'test') self.assertEqual(filter.getExpression(), 'True') self.assertEqual(len(filter.getIds()), 0) def test_ctor_explicit_family(self): import BTrees filter = self._makeOne(family=BTrees.family64) self.failUnless(filter.family is BTrees.family64) def test_index_doc_raises_NotImplementedError(self): filter = self._makeOne() self.assertRaises(NotImplementedError, filter.index_doc, 1, object()) def test_unindex_doc_missing_docid(self): filter = self._makeOne() filter.unindex_doc(1) # doesn't raise self.assertEqual(len(filter.getIds()), 0) def test_unindex_doc_existing_docid(self): filter = self._makeOne() filter._ids.insert(1) filter.unindex_doc(1) self.assertEqual(len(filter.getIds()), 0) def test_unindex_doc_existing_docid_w_residue(self): filter = self._makeOne() filter._ids.insert(1) filter._ids.insert(2) filter.unindex_doc(1) self.assertEqual(len(filter.getIds()), 1) def test_setExpression(self): filter = self._makeOne() filter.setExpression('False') self.assertEqual(filter.getExpression(), 'False') class PythonFilteredSetTests(unittest.TestCase): def _getTargetClass(self): from zope.index.topic.filter import PythonFilteredSet return PythonFilteredSet def _makeOne(self, id=None, expr=None, family=_marker): if id is None: id = 'test' if expr is None: expr = 'True' return self._getTargetClass()(id, expr) def test_index_object_expr_True(self): filter = self._makeOne() filter.index_doc(1, object()) self.assertEqual(list(filter.getIds()), [1]) def test_index_object_expr_False(self): filter = self._makeOne(expr='False') filter.index_doc(1, object()) self.assertEqual(len(filter.getIds()), 0) def test_index_object_expr_w_zero_divide_error(self): filter = self._makeOne(expr='1/0') filter.index_doc(1, object()) # doesn't raise self.assertEqual(len(filter.getIds()), 0) def test_suite(): return unittest.TestSuite(( unittest.makeSuite(FilteredSetBaseTests), unittest.makeSuite(PythonFilteredSetTests), )) zope.index-3.6.4/src/zope/index/topic/tests/__init__.py0000644000175000017500000000007511727503631024226 0ustar tseavertseaver00000000000000# # This file is necessary to make this directory a package. zope.index-3.6.4/src/zope/index/topic/tests/test_index.py0000644000175000017500000003345011727503631024640 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. 
# # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Topic Index tests """ import unittest _marker = object() class TopicIndexTest(unittest.TestCase): def _getTargetClass(self): from zope.index.topic.index import TopicIndex return TopicIndex def _get_family(self): import BTrees return BTrees.family32 def _makeOne(self, family=_marker): if family is _marker: family = self._get_family() if family is None: return self._getTargetClass()() return self._getTargetClass()(family) def _search(self, index, query, expected, operator='and'): result = index.search(query, operator) self.assertEqual(result.keys(), expected) def _search_or(self, index, query, expected): return self._search(index, query, expected, 'or') def _search_and(self, index, query, expected): return self._search(index, query, expected, 'and') def _apply(self, index, query, expected, operator='and'): result = index.apply(query) self.assertEqual(result.keys(), expected) def _apply_or(self, index, query, expected): result = index.apply({'query': query, 'operator': 'or'}) self.assertEqual(result.keys(), expected) def _apply_and(self, index, query, expected): result = index.apply({'query': query, 'operator': 'and'}) self.assertEqual(result.keys(), expected) def test_class_conforms_to_IInjection(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IInjection verifyClass(IInjection, self._getTargetClass()) def test_instance_conforms_to_IInjection(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IInjection verifyObject(IInjection, self._makeOne()) def test_class_conforms_to_IIndexSearch(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IIndexSearch verifyClass(IIndexSearch, self._getTargetClass()) def test_instance_conforms_to_IIndexSearch(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IIndexSearch verifyObject(IIndexSearch, self._makeOne()) def test_class_conforms_to_ITopicQuerying(self): from zope.interface.verify import verifyClass from zope.index.topic.interfaces import ITopicQuerying verifyClass(ITopicQuerying, self._getTargetClass()) def test_instance_conforms_to_ITopicQuerying(self): from zope.interface.verify import verifyObject from zope.index.topic.interfaces import ITopicQuerying verifyObject(ITopicQuerying, self._makeOne()) def test_ctor_defaults(self): import BTrees index = self._makeOne(family=None) self.failUnless(index.family is BTrees.family32) def test_ctor_explicit_family(self): import BTrees index = self._makeOne(family=BTrees.family64) self.failUnless(index.family is BTrees.family64) def test_clear_erases_filters(self): index = self._makeOne() foo = DummyFilter('foo') index.addFilter(foo) index.clear() self.assertEqual(list(index._filters), []) def test_addFilter(self): index = self._makeOne() foo = DummyFilter('foo') index.addFilter(foo) self.assertEqual(list(index._filters), ['foo']) self.failUnless(index._filters['foo'] is foo) def test_addFilter_duplicate_replaces(self): index = self._makeOne() foo = DummyFilter('foo') index.addFilter(foo) foo2 = 
DummyFilter('foo') index.addFilter(foo2) self.assertEqual(list(index._filters), ['foo']) self.failUnless(index._filters['foo'] is foo2) def test_delFilter_nonesuch_raises_KeyError(self): index = self._makeOne() self.assertRaises(KeyError, index.delFilter, 'nonesuch') def test_delFilter(self): index = self._makeOne() foo = DummyFilter('foo') index.addFilter(foo) bar = DummyFilter('bar') index.addFilter(bar) index.delFilter('foo') self.assertEqual(list(index._filters), ['bar']) self.failUnless(index._filters['bar'] is bar) def test_clearFilters_empty(self): index = self._makeOne() index.clearFilters() # doesn't raise def test_clearFilters_non_empty(self): index = self._makeOne() foo = DummyFilter('foo') index.addFilter(foo) bar = DummyFilter('bar') index.addFilter(bar) index.clearFilters() self.failUnless(foo._cleared) self.failUnless(bar._cleared) def test_index_doc(self): index = self._makeOne() foo = DummyFilter('foo') index.addFilter(foo) bar = DummyFilter('bar') index.addFilter(bar) obj = object() index.index_doc(1, obj) self.assertEqual(foo._indexed, [(1, obj)]) self.assertEqual(bar._indexed, [(1, obj)]) def test_unindex_doc(self): index = self._makeOne() foo = DummyFilter('foo') index.addFilter(foo) bar = DummyFilter('bar') index.addFilter(bar) index.unindex_doc(1) self.assertEqual(foo._unindexed, [1]) self.assertEqual(bar._unindexed, [1]) def test_search_non_tuple_list_query(self): index = self._makeOne() self.assertRaises(TypeError, index.search, {'nonesuch': 'ugh'}) def test_search_bad_operator(self): index = self._makeOne() self.assertRaises(TypeError, index.search, ['whatever'], 'maybe') def test_search_no_filters_list_query(self): index = self._makeOne() result = index.search(['nonesuch']) self.assertEqual(set(result), set()) def test_search_no_filters_tuple_query(self): index = self._makeOne() result = index.search(('nonesuch',)) self.assertEqual(set(result), set()) def test_search_no_filters_string_query(self): index = self._makeOne() result = index.search('nonesuch') self.assertEqual(set(result), set()) def test_search_query_matches_one_filter(self): index = self._makeOne() foo = DummyFilter('foo', [1, 2, 3], self._get_family()) index.addFilter(foo) bar = DummyFilter('bar', [2, 3, 4], self._get_family()) index.addFilter(bar) result = index.search(['foo']) self.assertEqual(set(result), set([1, 2, 3])) def test_search_query_matches_multiple_implicit_operator(self): index = self._makeOne() foo = DummyFilter('foo', [1, 2, 3], self._get_family()) index.addFilter(foo) bar = DummyFilter('bar', [2, 3, 4], self._get_family()) index.addFilter(bar) result = index.search(['foo', 'bar']) self.assertEqual(set(result), set([2, 3])) def test_search_query_matches_multiple_implicit_op_no_intersect(self): index = self._makeOne() foo = DummyFilter('foo', [1, 2, 3], self._get_family()) index.addFilter(foo) bar = DummyFilter('bar', [4, 5, 6], self._get_family()) index.addFilter(bar) result = index.search(['foo', 'bar']) self.assertEqual(set(result), set()) def test_search_query_matches_multiple_explicit_and(self): index = self._makeOne() foo = DummyFilter('foo', [1, 2, 3], self._get_family()) index.addFilter(foo) bar = DummyFilter('bar', [2, 3, 4], self._get_family()) index.addFilter(bar) result = index.search(['foo', 'bar'], operator='and') self.assertEqual(set(result), set([2, 3])) def test_search_query_matches_multiple_explicit_or(self): index = self._makeOne() foo = DummyFilter('foo', [1, 2, 3], self._get_family()) index.addFilter(foo) bar = DummyFilter('bar', [2, 3, 4], self._get_family()) 
index.addFilter(bar) result = index.search(['foo', 'bar'], operator='or') self.assertEqual(set(result), set([1, 2, 3, 4])) def test_apply_query_matches_multiple_non_dict_query(self): index = self._makeOne() foo = DummyFilter('foo', [1, 2, 3], self._get_family()) index.addFilter(foo) bar = DummyFilter('bar', [2, 3, 4], self._get_family()) index.addFilter(bar) result = index.apply(['foo', 'bar']) self.assertEqual(set(result), set([2, 3])) def test_apply_query_matches_multiple_implicit_op(self): index = self._makeOne() foo = DummyFilter('foo', [1, 2, 3], self._get_family()) index.addFilter(foo) bar = DummyFilter('bar', [2, 3, 4], self._get_family()) index.addFilter(bar) result = index.apply({'query': ['foo', 'bar']}) self.assertEqual(set(result), set([2, 3])) def test_apply_query_matches_multiple_explicit_and(self): index = self._makeOne() foo = DummyFilter('foo', [1, 2, 3], self._get_family()) index.addFilter(foo) bar = DummyFilter('bar', [2, 3, 4], self._get_family()) index.addFilter(bar) result = index.apply({'query': ['foo', 'bar'], 'operator': 'and'}) self.assertEqual(set(result), set([2, 3])) def test_apply_query_matches_multiple_explicit_or(self): index = self._makeOne() foo = DummyFilter('foo', [1, 2, 3], self._get_family()) index.addFilter(foo) bar = DummyFilter('bar', [2, 3, 4], self._get_family()) index.addFilter(bar) result = index.apply({'query': ['foo', 'bar'], 'operator': 'or'}) self.assertEqual(set(result), set([1, 2, 3, 4])) class _NotYet: def _addFilters(self, index): from zope.index.topic.filter import PythonFilteredSet index.addFilter( PythonFilteredSet('doc1', "context.meta_type == 'doc1'", index.family)) index.addFilter( PythonFilteredSet('doc2', "context.meta_type == 'doc2'", index.family)) index.addFilter( PythonFilteredSet('doc3', "context.meta_type == 'doc3'", index.family)) def _populate(self, index): class O(object): """ a dummy class """ def __init__(self, meta_type): self.meta_type = meta_type index.index_doc(0 , O('doc0')) index.index_doc(1 , O('doc1')) index.index_doc(2 , O('doc1')) index.index_doc(3 , O('doc2')) index.index_doc(4 , O('doc2')) index.index_doc(5 , O('doc3')) index.index_doc(6 , O('doc3')) def test_unindex(self): index = self._makeOne() index.unindex_doc(-99) # should not raise index.unindex_doc(3) index.unindex_doc(4) index.unindex_doc(5) self._search_or(index, 'doc1', [1,2]) self._search_or(index, 'doc2', []) self._search_or(index, 'doc3', [6]) self._search_or(index, 'doc4', []) def test_or(self): index = self._makeOne() self._search_or(index, 'doc1', [1,2]) self._search_or(index, ['doc1'],[1,2]) self._search_or(index, 'doc2', [3,4]), self._search_or(index, ['doc2'],[3,4]) self._search_or(index, ['doc1','doc2'], [1,2,3,4]) def test_and(self): index = self._makeOne() self._search_and(index, 'doc1', [1,2]) self._search_and(index, ['doc1'], [1,2]) self._search_and(index, 'doc2', [3,4]) self._search_and(index, ['doc2'], [3,4]) self._search_and(index, ['doc1','doc2'], []) def test_apply_or(self): index = self._makeOne() self._apply_or(index, 'doc1', [1,2]) self._apply_or(index, ['doc1'],[1,2]) self._apply_or(index, 'doc2', [3,4]), self._apply_or(index, ['doc2'],[3,4]) self._apply_or(index, ['doc1','doc2'], [1,2,3,4]) def test_apply_and(self): index = self._makeOne() self._apply_and(index, 'doc1', [1,2]) self._apply_and(index, ['doc1'], [1,2]) self._apply_and(index, 'doc2', [3,4]) self._apply_and(index, ['doc2'], [3,4]) self._apply_and(index, ['doc1','doc2'], []) def test_apply(self): index = self._makeOne() self._apply(index, 'doc1', [1,2]) 
        self._apply(index, ['doc1'], [1,2])
        self._apply(index, 'doc2', [3,4])
        self._apply(index, ['doc2'], [3,4])
        self._apply(index, ['doc1','doc2'], [])


class TopicIndexTest64(TopicIndexTest):

    def _get_family(self):
        import BTrees
        return BTrees.family64


class DummyFilter:

    _cleared = False

    def __init__(self, id, ids=(), family=None):
        self._id = id
        self._indexed = []
        self._unindexed = []
        self._family = family
        self._ids = ids

    def getId(self):
        return self._id

    def clear(self):
        self._cleared = True

    def index_doc(self, docid, obj):
        self._indexed.append((docid, obj))

    def unindex_doc(self, docid):
        self._unindexed.append(docid)

    def getIds(self):
        if self._family is not None:
            return self._family.IF.TreeSet(self._ids)
        return set(self._ids)


def test_suite():
    return unittest.TestSuite((
        unittest.makeSuite(TopicIndexTest),
        unittest.makeSuite(TopicIndexTest64),
    ))

==> zope.index-3.6.4/src/zope/index/topic/__init__.py <==
from zope.index.topic.index import TopicIndex

==> zope.index-3.6.4/src/zope/index/DEPENDENCIES.cfg <==
BTrees
ZODB
persistent
transaction
zope.interface

==> zope.index-3.6.4/src/zope/index/interfaces.py <==
##############################################################################
#
# Copyright (c) 2002 Zope Foundation and Contributors.
# All Rights Reserved.
#
# This software is subject to the provisions of the Zope Public License,
# Version 2.1 (ZPL).  A copy of the ZPL should accompany this distribution.
# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED
# WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS
# FOR A PARTICULAR PURPOSE.
#
##############################################################################
"""Basic interfaces shared between different types of index.
"""
from zope.interface import Interface


class IInjection(Interface):
    """Interface for injecting documents into an index."""

    def index_doc(docid, value):
        """Add a document to the index.

        docid: int, identifying the document
        value: the value to be indexed
        return: None

        This can also be used to reindex documents.
        """

    def unindex_doc(docid):
        """Remove a document from the index.

        docid: int, identifying the document
        return: None

        This call is a no-op if the docid isn't in the index; however,
        after this call, the index should have no references to the
        docid.
        """

    def clear():
        """Unindex all documents indexed by the index."""


class IIndexSearch(Interface):

    def apply(query):
        """Apply an index to the given query.

        The type of the query is index specific.

        TODO: This is somewhat problematic.  It means that application
        code that calls apply has to be aware of the expected query
        type.  This isn't too much of a problem now, as we have no more
        general query language, nor do we have any sort of automatic
        query-form generation.

        It would be nice to have a system later for query-form
        generation or, perhaps, some sort of query language.  At that
        point, we'll need some sort of way to determine query types,
        presumably through introspection of the index objects.

        A result is returned that is:

        - An IFBTree or an IFBucket mapping document ids to
          floating-point scores for document ids of documents that
          match the query,

        - An IFSet or IFTreeSet containing document ids of documents
          that match the query, or

        - None, indicating that the index could not use the query and
          that the result should have no impact on determining a final
          result.
        """


class IIndexSort(Interface):

    def sort(docids, reverse=False, limit=None):
        """Sort a sequence of document ids using the indexed values.

        If some of the docids are not indexed, they are skipped from
        the resulting iterable.

        Return a sorted iterable of document ids, limited by the value
        of the "limit" argument and optionally reversed, using the
        "reverse" argument.
        """


class IStatistics(Interface):
    """An index that provides statistical information about itself."""

    def documentCount():
        """Return the number of documents currently indexed."""

    def wordCount():
        """Return the number of words currently indexed."""


class INBest(Interface):
    """Interface for an N-Best chooser."""

    def add(item, score):
        """Record that item 'item' has score 'score'.

        No return value.  The N best-scoring items are remembered,
        where N was passed to the constructor.  'item' can be anything.
        'score' should be a number, and larger numbers are considered
        better.
        """

    def addmany(sequence):
        """Like "for item, score in sequence: self.add(item, score)".

        This is simply faster than calling add() len(seq) times.
        """

    def getbest():
        """Return the (at most) N best-scoring items as a sequence.

        The return value is a sequence of 2-tuples, (item, score), with
        the largest score first.  If .add() has been called fewer than
        N times, this sequence will contain fewer than N pairs.
        """

    def pop_smallest():
        """Return and remove the (item, score) pair with the lowest score.

        If len(self) is 0, raise IndexError.

        To be clearer, this is the lowest score among the N best-scoring
        items seen so far.  This is most useful if the capacity of the
        NBest object is never exceeded, in which case pop_smallest()
        allows using the object as an ordinary smallest-in-first-out
        priority queue.
        """

    def __len__():
        """Return the number of (item, score) pairs currently known.

        This is N (the value passed to the constructor), unless .add()
        has been called fewer than N times.
        """

    def capacity():
        """Return the maximum number of (item, score) pairs.

        This is N (the value passed to the constructor).
        """
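The NBest implementation of INBest follows; a quick doctest-style
sketch of the contract it satisfies (verified against the code in
``nbest.py`` below; see also ``zope/index/tests.py``):

  >>> from zope.index.nbest import NBest
  >>> nb = NBest(3)                # remember the 3 best-scoring items
  >>> nb.addmany([('a', 10), ('b', 50), ('c', 30), ('d', 40)])
  >>> nb.getbest()
  [('b', 50), ('d', 40), ('c', 30)]
  >>> nb.pop_smallest()
  ('c', 30)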
==> zope.index-3.6.4/src/zope/index/nbest.py <==
##############################################################################
#
# Copyright (c) 2002 Zope Foundation and Contributors.
# All Rights Reserved.
#
# This software is subject to the provisions of the Zope Public License,
# Version 2.1 (ZPL).  A copy of the ZPL should accompany this distribution.
# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED
# WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS
# FOR A PARTICULAR PURPOSE
#
##############################################################################
"""NBest

An NBest object remembers the N best-scoring items ever passed to its
.add(item, score) method.  If .add() is called M times, the worst-case
number of comparisons performed overall is M * log2(N).
""" from bisect import bisect_left as bisect from zope.index.interfaces import INBest from zope.interface import implements class NBest(object): implements(INBest) def __init__(self, N): "Build an NBest object to remember the N best-scoring objects." if N < 1: raise ValueError("NBest() argument must be at least 1") self._capacity = N # This does a very simple thing with sorted lists. For large # N, a min-heap can be unboundedly better in terms of data # movement time. self._scores = [] self._items = [] def __len__(self): return len(self._scores) def capacity(self): return self._capacity def add(self, item, score): self.addmany([(item, score)]) def addmany(self, sequence): scores, items, capacity = self._scores, self._items, self._capacity n = len(scores) for item, score in sequence: # When we're in steady-state, the usual case is that we're filled # to capacity, and that an incoming item is worse than any of # the best-seen so far. if n >= capacity and score <= scores[0]: continue i = bisect(scores, score) scores.insert(i, score) items.insert(i, item) if n == capacity: del items[0], scores[0] else: n += 1 assert n == len(scores) def getbest(self): result = zip(self._items, self._scores) result.reverse() return result def pop_smallest(self): if self._scores: return self._items.pop(0), self._scores.pop(0) raise IndexError("pop_smallest() called on empty NBest object") zope.index-3.6.4/src/zope/index/tests.py0000644000175000017500000000632611727503631021356 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """N-Best index tests """ from unittest import TestCase, main, makeSuite from zope.index.nbest import NBest class NBestTest(TestCase): def testConstructor(self): self.assertRaises(ValueError, NBest, 0) self.assertRaises(ValueError, NBest, -1) for n in range(1, 11): nb = NBest(n) self.assertEqual(len(nb), 0) self.assertEqual(nb.capacity(), n) def testOne(self): nb = NBest(1) nb.add('a', 0) self.assertEqual(nb.getbest(), [('a', 0)]) nb.add('b', 1) self.assertEqual(len(nb), 1) self.assertEqual(nb.capacity(), 1) self.assertEqual(nb.getbest(), [('b', 1)]) nb.add('c', -1) self.assertEqual(len(nb), 1) self.assertEqual(nb.capacity(), 1) self.assertEqual(nb.getbest(), [('b', 1)]) nb.addmany([('d', 3), ('e', -6), ('f', 5), ('g', 4)]) self.assertEqual(len(nb), 1) self.assertEqual(nb.capacity(), 1) self.assertEqual(nb.getbest(), [('f', 5)]) def testMany(self): import random inputs = [(-i, i) for i in range(50)] reversed_inputs = inputs[:] reversed_inputs.reverse() # Test the N-best for a variety of n (1, 6, 11, ... 50). for n in range(1, len(inputs)+1, 5): expected = inputs[-n:] expected.reverse() random_inputs = inputs[:] random.shuffle(random_inputs) for source in inputs, reversed_inputs, random_inputs: # Try feeding them one at a time. 
nb = NBest(n) for item, score in source: nb.add(item, score) self.assertEqual(len(nb), n) self.assertEqual(nb.capacity(), n) self.assertEqual(nb.getbest(), expected) # And again in one gulp. nb = NBest(n) nb.addmany(source) self.assertEqual(len(nb), n) self.assertEqual(nb.capacity(), n) self.assertEqual(nb.getbest(), expected) for i in range(1, n+1): self.assertEqual(nb.pop_smallest(), expected[-i]) self.assertRaises(IndexError, nb.pop_smallest) def testAllSameScore(self): inputs = [(i, 0) for i in range(10)] for n in range(1, 12): nb = NBest(n) nb.addmany(inputs) outputs = nb.getbest() self.assertEqual(outputs, inputs[:len(outputs)]) def test_suite(): return makeSuite(NBestTest) if __name__=='__main__': main(defaultTest='test_suite') zope.index-3.6.4/src/zope/index/field/0000755000175000017500000000000011727503757020727 5ustar tseavertseaver00000000000000zope.index-3.6.4/src/zope/index/field/sorting.py0000644000175000017500000001220511727503631022755 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2009 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """A sorting mixin class for FieldIndex-like indexes. """ import heapq import bisect from itertools import islice from zope.interface import implements from zope.index.interfaces import IIndexSort class SortingIndexMixin(object): implements(IIndexSort) _sorting_num_docs_attr = '_num_docs' # Length object _sorting_fwd_index_attr = '_fwd_index' # forward BTree index _sorting_rev_index_attr = '_rev_index' # reverse BTree index def sort(self, docids, reverse=False, limit=None): if (limit is not None) and (limit < 1): raise ValueError('limit value must be 1 or greater') numdocs = getattr(self, self._sorting_num_docs_attr).value if not numdocs: raise StopIteration if not isinstance(docids, (self.family.IF.Set, self.family.IF.TreeSet)): docids = self.family.IF.Set(docids) if not docids: raise StopIteration rlen = len(docids) fwd_index = getattr(self, self._sorting_fwd_index_attr) rev_index = getattr(self, self._sorting_rev_index_attr) getValue = rev_index.get marker = object() # use_lazy and use_nbest computations lifted wholesale from # Zope2 catalog without questioning reasoning use_lazy = rlen > numdocs * (rlen / 100 + 1) use_nbest = limit and limit * 4 < rlen # overrides for unit tests if getattr(self, '_use_lazy', False): use_lazy = True if getattr(self, '_use_nbest', False): use_nbest = True if use_nbest: # this is a sort with a limit that appears useful, try to # take advantage of the fact that we can keep a smaller # set of simultaneous values in memory; use generators # and heapq functions to do so. def nsort(): for docid in docids: val = getValue(docid, marker) if val is not marker: yield (val, docid) iterable = nsort() if reverse: # we use a generator as an iterable in the reverse # sort case because the nlargest implementation does # not manifest the whole thing into memory at once if # we do so. 
                for val in heapq.nlargest(limit, iterable):
                    yield val[1]
            else:
                # lifted from heapq.nsmallest
                it = iter(iterable)
                result = sorted(islice(it, 0, limit))
                if not result:
                    raise StopIteration
                insort = bisect.insort
                pop = result.pop
                los = result[-1] # los --> Largest of the nsmallest
                for elem in it:
                    if los <= elem:
                        continue
                    insort(result, elem)
                    pop()
                    los = result[-1]
                for val in result:
                    yield val[1]
        else:
            if use_lazy and not reverse:
                # Since this sort is not reversed, and the number
                # of results in the search result set is much larger
                # than the number of items in this index, we assume it
                # will be fastest to iterate over all of our forward
                # BTree's items instead of using a full sort, as our
                # forward index is already sorted in ascending order
                # by value.  The Zope 2 catalog implementation claims
                # that this case is rarely exercised in practice.
                n = 0
                for stored_docids in fwd_index.values():
                    for docid in self.family.IF.intersection(
                            docids, stored_docids):
                        n += 1
                        yield docid
                        if limit and n >= limit:
                            raise StopIteration
            else:
                # If the result set is not much larger than the number
                # of documents in this index, or if we need to sort in
                # reverse order, use a non-lazy sort.
                n = 0
                for docid in sorted(docids, key=getValue, reverse=reverse):
                    if getValue(docid, marker) is not marker:
                        n += 1
                        yield docid
                        if limit and n >= limit:
                            raise StopIteration
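The ``use_nbest`` branch above keeps only ``limit`` candidates in
memory at a time.  A standalone sketch of the same stdlib ``heapq``
technique, on made-up (value, docid) pairs:

  >>> import heapq
  >>> pairs = [(30, 4), (10, 1), (50, 2), (20, 3)]  # (indexed value, docid)
  >>> [docid for val, docid in heapq.nlargest(2, pairs)]   # like reverse=True, limit=2
  [2, 4]
  >>> [docid for val, docid in heapq.nsmallest(2, pairs)]  # like limit=2
  [1, 3]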
==> zope.index-3.6.4/src/zope/index/field/README.txt <==
Field Indexes
=============

Field indexes index orderable values.  Note that they don't check for
orderability.  That is, all of the values added to the index must be
orderable together.  It is up to applications to provide only mutually
orderable values.

  >>> from zope.index.field import FieldIndex
  >>> index = FieldIndex()

  >>> index.index_doc(0, 6)
  >>> index.index_doc(1, 26)
  >>> index.index_doc(2, 94)
  >>> index.index_doc(3, 68)
  >>> index.index_doc(4, 30)
  >>> index.index_doc(5, 68)
  >>> index.index_doc(6, 82)
  >>> index.index_doc(7, 30)
  >>> index.index_doc(8, 43)
  >>> index.index_doc(9, 15)

Field indexes are searched with apply.  The argument is a tuple with a
minimum and maximum value:

  >>> index.apply((30, 70))
  IFSet([3, 4, 5, 7, 8])

A common mistake is to pass a single value.  If anything other than a
two-tuple is passed, a type error is raised:

  >>> index.apply('hi')
  Traceback (most recent call last):
  ...
  TypeError: ('two-length tuple expected', 'hi')

Open-ended ranges can be provided by providing None as an end point:

  >>> index.apply((30, None))
  IFSet([2, 3, 4, 5, 6, 7, 8])

  >>> index.apply((None, 70))
  IFSet([0, 1, 3, 4, 5, 7, 8, 9])

  >>> index.apply((None, None))
  IFSet([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

To do an exact value search, supply equal minimum and maximum values:

  >>> index.apply((30, 30))
  IFSet([4, 7])

  >>> index.apply((70, 70))
  IFSet([])

Field indexes support basic statistics:

  >>> index.documentCount()
  10

  >>> index.wordCount()
  8

Documents can be reindexed:

  >>> index.apply((15, 15))
  IFSet([9])

  >>> index.index_doc(9, 14)

  >>> index.apply((15, 15))
  IFSet([])

  >>> index.apply((14, 14))
  IFSet([9])

Documents can be unindexed:

  >>> index.unindex_doc(7)
  >>> index.documentCount()
  9
  >>> index.wordCount()
  8

  >>> index.unindex_doc(8)
  >>> index.documentCount()
  8
  >>> index.wordCount()
  7

  >>> index.apply((30, 70))
  IFSet([3, 4, 5])

Unindexing a document id that isn't present is ignored:

  >>> index.unindex_doc(8)
  >>> index.unindex_doc(80)

  >>> index.documentCount()
  8
  >>> index.wordCount()
  7

We can also clear the index entirely:

  >>> index.clear()
  >>> index.documentCount()
  0
  >>> index.wordCount()
  0

  >>> index.apply((30, 70))
  IFSet([])

Sorting
-------

Field indexes also implement the IIndexSort interface, which provides a
method for sorting document ids by their indexed values.

  >>> index.index_doc(1, 9)
  >>> index.index_doc(2, 8)
  >>> index.index_doc(3, 7)
  >>> index.index_doc(4, 6)
  >>> index.index_doc(5, 5)
  >>> index.index_doc(6, 4)
  >>> index.index_doc(7, 3)
  >>> index.index_doc(8, 2)
  >>> index.index_doc(9, 1)

  >>> list(index.sort([4, 2, 9, 7, 3, 1, 5]))
  [9, 7, 5, 4, 3, 2, 1]

We can also specify the ``reverse`` argument to reverse results:

  >>> list(index.sort([4, 2, 9, 7, 3, 1, 5], reverse=True))
  [1, 2, 3, 4, 5, 7, 9]

And as per IIndexSort, we can limit results by specifying the ``limit``
argument:

  >>> list(index.sort([4, 2, 9, 7, 3, 1, 5], limit=3))
  [9, 7, 5]

If we pass an id that is not indexed by this index, it won't be
included in the result:

  >>> list(index.sort([2, 10]))
  [2]

  >>> index.clear()

Bugfix testing:
---------------

It happened at least once that a value dropped out of the forward
index while the index still contained the object; unindexing then
broke.

  >>> index.index_doc(0, 6)
  >>> index.index_doc(1, 26)
  >>> index.index_doc(2, 94)
  >>> index.index_doc(3, 68)
  >>> index.index_doc(4, 30)
  >>> index.index_doc(5, 68)
  >>> index.index_doc(6, 82)
  >>> index.index_doc(7, 30)
  >>> index.index_doc(8, 43)
  >>> index.index_doc(9, 15)

  >>> index.apply((None, None))
  IFSet([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Here is the damage:

  >>> del index._fwd_index[68]

Unindex should succeed:

  >>> index.unindex_doc(5)
  >>> index.unindex_doc(3)

  >>> index.apply((None, None))
  IFSet([0, 1, 2, 4, 6, 7, 8, 9])

Optimizations
-------------

There is an optimization which makes sure that nothing is changed in
the internal data structures if the value of the document was not
changed.  To test this optimization we patch the index instance to
make sure unindex_doc is not called.

  >>> def unindex_doc(doc_id):
  ...     raise KeyError
  >>> index.unindex_doc = unindex_doc

Now we get a KeyError if we try to change the value:

  >>> index.index_doc(9, 14)
  Traceback (most recent call last):
  ...
  KeyError

Leaving the value unchanged doesn't call unindex_doc:
>>> index.index_doc(9, 15) >>> index.apply((15, 15)) IFSet([9]) zope.index-3.6.4/src/zope/index/field/index.py0000644000175000017500000000640711727503631022406 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE # ############################################################################## """Field index """ import BTrees import persistent import zope.interface from BTrees.Length import Length from zope.index import interfaces from zope.index.field.sorting import SortingIndexMixin class FieldIndex(SortingIndexMixin, persistent.Persistent): zope.interface.implements( interfaces.IInjection, interfaces.IStatistics, interfaces.IIndexSearch, ) family = BTrees.family32 def __init__(self, family=None): if family is not None: self.family = family self.clear() def clear(self): """Initialize forward and reverse mappings.""" # The forward index maps indexed values to a sequence of docids self._fwd_index = self.family.OO.BTree() # The reverse index maps a docid to its index value self._rev_index = self.family.IO.BTree() self._num_docs = Length(0) def documentCount(self): """See interface IStatistics""" return self._num_docs() def wordCount(self): """See interface IStatistics""" return len(self._fwd_index) def index_doc(self, docid, value): """See interface IInjection""" rev_index = self._rev_index if docid in rev_index: if docid in self._fwd_index.get(value, ()): # no need to index the doc, its already up to date return # unindex doc if present self.unindex_doc(docid) # Insert into forward index. set = self._fwd_index.get(value) if set is None: set = self.family.IF.TreeSet() self._fwd_index[value] = set set.insert(docid) # increment doc count self._num_docs.change(1) # Insert into reverse index. rev_index[docid] = value def unindex_doc(self, docid): """See interface IInjection""" rev_index = self._rev_index value = rev_index.get(docid) if value is None: return # not in index del rev_index[docid] try: set = self._fwd_index[value] set.remove(docid) except KeyError: #pragma NO COVERAGE # This is fishy, but we don't want to raise an error. # We should probably log something. # but keep it from throwing a dirty exception set = 1 if not set: del self._fwd_index[value] self._num_docs.change(-1) def apply(self, query): if len(query) != 2 or not isinstance(query, tuple): raise TypeError("two-length tuple expected", query) return self.family.IF.multiunion( self._fwd_index.values(*query)) zope.index-3.6.4/src/zope/index/field/tests.py0000644000175000017500000002640111727503631022435 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. 
# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Test field index """ import unittest import doctest _marker = object() class FieldIndexTests(unittest.TestCase): def _getTargetClass(self): from zope.index.field import FieldIndex return FieldIndex def _makeOne(self, family=_marker): if family is _marker: return self._getTargetClass()() return self._getTargetClass()(family) def _populateIndex(self, index): index.index_doc(5, 1) # docid, obj index.index_doc(2, 2) index.index_doc(1, 3) index.index_doc(3, 4) index.index_doc(4, 5) index.index_doc(8, 6) index.index_doc(9, 7) index.index_doc(7, 8) index.index_doc(6, 9) index.index_doc(11, 10) index.index_doc(10, 11) def test_class_conforms_to_IInjection(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IInjection verifyClass(IInjection, self._getTargetClass()) def test_instance_conforms_to_IInjection(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IInjection verifyObject(IInjection, self._makeOne()) def test_class_conforms_to_IIndexSearch(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IIndexSearch verifyClass(IIndexSearch, self._getTargetClass()) def test_instance_conforms_to_IIndexSearch(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IIndexSearch verifyObject(IIndexSearch, self._makeOne()) def test_class_conforms_to_IStatistics(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IStatistics verifyClass(IStatistics, self._getTargetClass()) def test_instance_conforms_to_IStatistics(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IStatistics verifyObject(IStatistics, self._makeOne()) def test_ctor_defaults(self): import BTrees index = self._makeOne() self.failUnless(index.family is BTrees.family32) self.assertEqual(index.documentCount(), 0) self.assertEqual(index.wordCount(), 0) def test_ctor_explicit_family(self): import BTrees index = self._makeOne(BTrees.family64) self.failUnless(index.family is BTrees.family64) def test_index_doc_new(self): index = self._makeOne() index.index_doc(1, 'value') self.assertEqual(index.documentCount(), 1) self.assertEqual(index.wordCount(), 1) self.failUnless(1 in index._rev_index) self.failUnless('value' in index._fwd_index) def test_index_doc_existing_same_value(self): index = self._makeOne() index.index_doc(1, 'value') index.index_doc(1, 'value') self.assertEqual(index.documentCount(), 1) self.assertEqual(index.wordCount(), 1) self.failUnless(1 in index._rev_index) self.failUnless('value' in index._fwd_index) self.assertEqual(list(index._fwd_index['value']), [1]) def test_index_doc_existing_new_value(self): index = self._makeOne() index.index_doc(1, 'value') index.index_doc(1, 'new_value') self.assertEqual(index.documentCount(), 1) self.assertEqual(index.wordCount(), 1) self.failUnless(1 in index._rev_index) self.failIf('value' in index._fwd_index) self.failUnless('new_value' in index._fwd_index) self.assertEqual(list(index._fwd_index['new_value']), [1]) def test_unindex_doc_nonesuch(self): index = self._makeOne() index.unindex_doc(1) # doesn't raise def test_unindex_doc_no_residual_fwd_values(self): index = 
self._makeOne() index.index_doc(1, 'value') index.unindex_doc(1) # doesn't raise self.assertEqual(index.documentCount(), 0) self.assertEqual(index.wordCount(), 0) self.failIf(1 in index._rev_index) self.failIf('value' in index._fwd_index) def test_unindex_doc_w_residual_fwd_values(self): index = self._makeOne() index.index_doc(1, 'value') index.index_doc(2, 'value') index.unindex_doc(1) # doesn't raise self.assertEqual(index.documentCount(), 1) self.assertEqual(index.wordCount(), 1) self.failIf(1 in index._rev_index) self.failUnless(2 in index._rev_index) self.failUnless('value' in index._fwd_index) self.assertEqual(list(index._fwd_index['value']), [2]) def test_apply_non_tuple_raises(self): index = self._makeOne() self.assertRaises(TypeError, index.apply, ['a', 'b']) def test_apply_empty_tuple_raises(self): index = self._makeOne() self.assertRaises(TypeError, index.apply, ('a',)) def test_apply_one_tuple_raises(self): index = self._makeOne() self.assertRaises(TypeError, index.apply, ('a',)) def test_apply_three_tuple_raises(self): index = self._makeOne() self.assertRaises(TypeError, index.apply, ('a', 'b', 'c')) def test_apply_two_tuple_miss(self): index = self._makeOne() self.assertEqual(list(index.apply(('a', 'b'))), []) def test_apply_two_tuple_hit(self): index = self._makeOne() index.index_doc(1, 'albatross') self.assertEqual(list(index.apply(('a', 'b'))), [1]) def test_sort_w_limit_lt_1(self): index = self._makeOne() self.assertRaises(ValueError, lambda: list(index.sort([1, 2, 3], limit=0))) def test_sort_w_empty_index(self): index = self._makeOne() self.assertEqual(list(index.sort([1, 2, 3])), []) def test_sort_w_empty_docids(self): index = self._makeOne() index.index_doc(1, 'albatross') self.assertEqual(list(index.sort([])), []) def test_sort_w_missing_docids(self): index = self._makeOne() index.index_doc(1, 'albatross') self.assertEqual(list(index.sort([2, 3])), []) def test_sort_force_nbest_w_missing_docids(self): index = self._makeOne() index._use_nbest = True index.index_doc(1, 'albatross') self.assertEqual(list(index.sort([2, 3])), []) def test_sort_force_lazy_w_missing_docids(self): index = self._makeOne() index._use_lazy = True index.index_doc(1, 'albatross') self.assertEqual(list(index.sort([2, 3])), []) def test_sort_lazy_nolimit(self): from BTrees.IFBTree import IFSet index = self._makeOne() index._use_lazy = True self._populateIndex(index) c1 = IFSet([1, 2, 3, 4, 5]) result = index.sort(c1) self.assertEqual(list(result), [5, 2, 1, 3, 4]) def test_sort_lazy_withlimit(self): from BTrees.IFBTree import IFSet index = self._makeOne() index._use_lazy = True self._populateIndex(index) c1 = IFSet([1, 2, 3, 4, 5]) result = index.sort(c1, limit=3) self.assertEqual(list(result), [5, 2, 1]) def test_sort_nonlazy_nolimit(self): from BTrees.IFBTree import IFSet index = self._makeOne() self._populateIndex(index) c1 = IFSet([1, 2, 3, 4, 5]) result = index.sort(c1) self.assertEqual(list(result), [5, 2, 1, 3, 4]) def test_sort_nonlazy_missingdocid(self): from BTrees.IFBTree import IFSet index = self._makeOne() self._populateIndex(index) c1 = IFSet([1, 2, 3, 4, 5, 99]) result = index.sort(c1) self.assertEqual(list(result), [5, 2, 1, 3, 4]) # 99 not present def test_sort_nonlazy_withlimit(self): from BTrees.IFBTree import IFSet index = self._makeOne() self._populateIndex(index) c1 = IFSet([1, 2, 3, 4, 5]) result = index.sort(c1, limit=3) self.assertEqual(list(result), [5, 2, 1]) def test_sort_nonlazy_reverse_nolimit(self): from BTrees.IFBTree import IFSet index = self._makeOne() 
self._populateIndex(index) c1 = IFSet([1, 2, 3, 4, 5]) result = index.sort(c1, reverse=True) self.assertEqual(list(result), [4, 3, 1, 2, 5]) def test_sort_nonlazy_reverse_withlimit(self): from BTrees.IFBTree import IFSet index = self._makeOne() self._populateIndex(index) c1 = IFSet([1, 2, 3, 4, 5]) result = index.sort(c1, reverse=True, limit=3) self.assertEqual(list(result), [4, 3, 1]) def test_sort_nbest(self): from BTrees.IFBTree import IFSet index = self._makeOne() index._use_nbest = True self._populateIndex(index) c1 = IFSet([1, 2, 3, 4, 5]) result = index.sort(c1, limit=3) self.assertEqual(list(result), [5, 2, 1]) def test_sort_nbest_reverse(self): from BTrees.IFBTree import IFSet index = self._makeOne() index._use_nbest = True self._populateIndex(index) c1 = IFSet([1, 2, 3, 4, 5]) result = index.sort(c1, reverse=True, limit=3) self.assertEqual(list(result), [4, 3, 1]) def test_sort_nbest_missing(self): from BTrees.IFBTree import IFSet index = self._makeOne() index._use_nbest = True self._populateIndex(index) c1 = IFSet([1, 2, 3, 4, 5, 99]) result = index.sort(c1, limit=3) self.assertEqual(list(result), [5, 2, 1]) def test_sort_nbest_missing_reverse(self): from BTrees.IFBTree import IFSet index = self._makeOne() index._use_nbest = True self._populateIndex(index) c1 = IFSet([1, 2, 3, 4, 5, 99]) result = index.sort(c1, reverse=True, limit=3) self.assertEqual(list(result), [4, 3, 1]) def test_sort_nodocs(self): from BTrees.IFBTree import IFSet index = self._makeOne() c1 = IFSet([1, 2, 3, 4, 5]) result = index.sort(c1) self.assertEqual(list(result), []) def test_sort_nodocids(self): from BTrees.IFBTree import IFSet index = self._makeOne() self._populateIndex(index) c1 = IFSet() result = index.sort(c1) self.assertEqual(list(result), []) def test_sort_badlimit(self): from BTrees.IFBTree import IFSet index = self._makeOne() self._populateIndex(index) c1 = IFSet([1, 2, 3, 4, 5]) result = index.sort(c1, limit=0) self.assertRaises(ValueError, list, result) def test_suite(): return unittest.TestSuite(( doctest.DocFileSuite('README.txt', optionflags=doctest.ELLIPSIS), unittest.makeSuite(FieldIndexTests), )) if __name__=='__main__': unittest.main(defaultTest='test_suite') zope.index-3.6.4/src/zope/index/field/__init__.py0000644000175000017500000000005611727503631023030 0ustar tseavertseaver00000000000000from zope.index.field.index import FieldIndex zope.index-3.6.4/src/zope/index/__init__.py0000644000175000017500000000004011727503631021736 0ustar tseavertseaver00000000000000# make this directory a package zope.index-3.6.4/src/zope/index/keyword/0000755000175000017500000000000011727503757021330 5ustar tseavertseaver00000000000000zope.index-3.6.4/src/zope/index/keyword/interfaces.py0000644000175000017500000000220211727503631024010 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. 
# ############################################################################## """Keyword-index search interface """ from zope.interface import Interface class IKeywordQuerying(Interface): """Query over a set of keywords, separated by white space.""" def search(query, operator='and'): """Execute a search given by 'query'. 'query' can be a (unicode) string or an iterable of (unicode) strings. 'operator' can be either 'and' or 'or' to search for documents containing all keywords or any keyword. Return an IFSet of docids """ zope.index-3.6.4/src/zope/index/keyword/index.py0000644000175000017500000001554411727503631023011 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE # ############################################################################## """Keyword index """ from persistent import Persistent import BTrees from BTrees.Length import Length from zope.index.interfaces import IInjection, IStatistics, IIndexSearch from zope.index.keyword.interfaces import IKeywordQuerying from zope.interface import implements class KeywordIndex(Persistent): """Keyword index""" implements(IInjection, IStatistics, IIndexSearch, IKeywordQuerying) family = BTrees.family32 # If a word is referenced by at least tree_threshold docids, # use a TreeSet for that word instead of a Set. tree_threshold = 64 def __init__(self, family=None): if family is not None: self.family = family self.clear() def clear(self): """Initialize forward and reverse mappings.""" # The forward index maps index keywords to a sequence of docids self._fwd_index = self.family.OO.BTree() # The reverse index maps a docid to its keywords # TODO: Using a vocabulary might be the better choice to store # keywords since it would allow us to use integers instead of strings self._rev_index = self.family.IO.BTree() self._num_docs = Length(0) def documentCount(self): """Return the number of documents in the index.""" return self._num_docs() def wordCount(self): """Return the number of indexed words""" return len(self._fwd_index) def has_doc(self, docid): return bool(self._rev_index.has_key(docid)) def normalize(self, seq): """Perform normalization on sequence of keywords. Return normalized sequence. This method may be overridden by subclasses.
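For example, a subclass might simply lower-case each keyword; the CaseInsensitiveKeywordIndex subclass further down does exactly that by returning [w.lower() for w in seq].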
""" return seq def index_doc(self, docid, seq): if isinstance(seq, basestring): raise TypeError('seq argument must be a list/tuple of strings') old_kw = self._rev_index.get(docid, None) if not seq: if old_kw: self.unindex_doc(docid) return seq = self.normalize(seq) new_kw = self.family.OO.Set(seq) if old_kw is None: self._insert_forward(docid, new_kw) self._insert_reverse(docid, new_kw) self._num_docs.change(1) else: # determine added and removed keywords kw_added = self.family.OO.difference(new_kw, old_kw) kw_removed = self.family.OO.difference(old_kw, new_kw) # removed keywords are removed from the forward index for word in kw_removed: fwd = self._fwd_index[word] fwd.remove(docid) if not fwd: del self._fwd_index[word] # now update reverse and forward indexes self._insert_forward(docid, kw_added) self._insert_reverse(docid, new_kw) def unindex_doc(self, docid): idx = self._fwd_index try: for word in self._rev_index[docid]: idx[word].remove(docid) if not idx[word]: del idx[word] except KeyError: msg = 'WAAA! Inconsistent' return try: del self._rev_index[docid] except KeyError: #pragma NO COVERAGE msg = 'WAAA! Inconsistent' self._num_docs.change(-1) def _insert_forward(self, docid, words): """insert a sequence of words into the forward index """ idx = self._fwd_index get_word_idx = idx.get IF = self.family.IF Set = IF.Set TreeSet = IF.TreeSet for word in words: word_idx = get_word_idx(word) if word_idx is None: idx[word] = word_idx = Set() word_idx.insert(docid) if (not isinstance(word_idx, TreeSet) and len(word_idx) >= self.tree_threshold): # Convert to a TreeSet. idx[word] = TreeSet(word_idx) def _insert_reverse(self, docid, words): """ add words to forward index """ if words: self._rev_index[docid] = words def search(self, query, operator='and'): """Execute a search given by 'query'.""" if isinstance(query, basestring): query = [query] query = self.normalize(query) sets = [] for word in query: docids = self._fwd_index.get(word, self.family.IF.Set()) sets.append(docids) if operator == 'or': rs = self.family.IF.multiunion(sets) elif operator == 'and': # sort smallest to largest set so we intersect the smallest # number of document identifiers possible sets.sort(key=len) rs = None for set in sets: rs = self.family.IF.intersection(rs, set) if not rs: break else: raise TypeError('Keyword index only supports `and` and `or` ' 'operators, not `%s`.' % operator) if rs: return rs else: return self.family.IF.Set() def apply(self, query): operator = 'and' if isinstance(query, dict): if 'operator' in query: operator = query['operator'] query = query['query'] return self.search(query, operator=operator) def optimize(self): """Optimize the index. Call this after changing tree_threshold. This converts internal data structures between Sets and TreeSets based on tree_threshold. """ idx = self._fwd_index IF = self.family.IF Set = IF.Set TreeSet = IF.TreeSet items = list(self._fwd_index.items()) for word, word_idx in items: if len(word_idx) >= self.tree_threshold: if not isinstance(word_idx, TreeSet): # Convert to a TreeSet. idx[word] = TreeSet(word_idx) else: if isinstance(word_idx, TreeSet): # Convert to a Set. 
idx[word] = Set(word_idx) class CaseInsensitiveKeywordIndex(KeywordIndex): """A case-normalizing keyword index (for strings as keywords)""" def normalize(self, seq): return [w.lower() for w in seq] zope.index-3.6.4/src/zope/index/keyword/tests.py0000644000175000017500000004123611727503631023041 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## import unittest class _KeywordIndexTestsBase: def _getTargetClass(self): from zope.index.keyword.index import KeywordIndex return KeywordIndex def _populate(self, index): index.index_doc(1, ('zope', 'CMF', 'Zope3')) index.index_doc(2, ('the', 'quick', 'brown', 'FOX')) index.index_doc(3, ('Zope',)) index.index_doc(4, ()) index.index_doc(5, ('cmf',)) _populated_doc_count = 4 _populated_word_count = 9 def test_normalize(self): index = self._makeOne() self.assertEqual(index.normalize(['Foo']), ['Foo']) def test_simplesearch(self): index = self._makeOne() self._populate(index) self._search(index, [''], self.IFSet()) self._search(index, 'cmf', self.IFSet([5])) self._search(index, ['cmf'], self.IFSet([5])) self._search(index, ['Zope'], self.IFSet([3])) self._search(index, ['Zope3'], self.IFSet([1])) self._search(index, ['foo'], self.IFSet()) def test_search_and(self): index = self._makeOne() self._populate(index) self._search_and(index, ('CMF', 'Zope3'), self.IFSet([1])) self._search_and(index, ('CMF', 'zope'), self.IFSet([1])) self._search_and(index, ('cmf', 'zope4'), self.IFSet()) self._search_and(index, ('quick', 'FOX'), self.IFSet([2])) def test_search_or(self): index = self._makeOne() self._populate(index) self._search_or(index, ('cmf', 'Zope3'), self.IFSet([1, 5])) self._search_or(index, ('cmf', 'zope'), self.IFSet([1, 5])) self._search_or(index, ('cmf', 'zope4'), self.IFSet([5])) self._search_or(index, ('zope', 'Zope'), self.IFSet([1,3])) def test_apply(self): index = self._makeOne() self._populate(index) self._apply(index, ('CMF', 'Zope3'), self.IFSet([1])) self._apply(index, ('CMF', 'zope'), self.IFSet([1])) self._apply(index, ('cmf', 'zope4'), self.IFSet()) self._apply(index, ('quick', 'FOX'), self.IFSet([2])) def test_apply_and(self): index = self._makeOne() self._populate(index) self._apply_and(index, ('CMF', 'Zope3'), self.IFSet([1])) self._apply_and(index, ('CMF', 'zope'), self.IFSet([1])) self._apply_and(index, ('cmf', 'zope4'), self.IFSet()) self._apply_and(index, ('quick', 'FOX'), self.IFSet([2])) def test_apply_or(self): index = self._makeOne() self._populate(index) self._apply_or(index, ('cmf', 'Zope3'), self.IFSet([1, 5])) self._apply_or(index, ('cmf', 'zope'), self.IFSet([1, 5])) self._apply_or(index, ('cmf', 'zope4'), self.IFSet([5])) self._apply_or(index, ('zope', 'Zope'), self.IFSet([1,3])) def test_apply_with_only_tree_set(self): index = self._makeOne() index.tree_threshold = 0 self._populate(index) self.assertEqual(type(index._fwd_index['zope']), type(self.IFTreeSet())) self._apply_and(index, ('CMF', 'Zope3'), 
self.IFSet([1])) self._apply_and(index, ('CMF', 'zope'), self.IFSet([1])) self._apply_and(index, ('cmf', 'zope4'), self.IFSet()) self._apply_and(index, ('quick', 'FOX'), self.IFSet([2])) def test_apply_with_mix_of_tree_set_and_simple_set(self): index = self._makeOne() index.tree_threshold = 2 self._populate(index) self.assertEqual(type(index._fwd_index['zope']), type(self.IFSet())) self._apply_and(index, ('CMF', 'Zope3'), self.IFSet([1])) self._apply_and(index, ('CMF', 'zope'), self.IFSet([1])) self._apply_and(index, ('cmf', 'zope4'), self.IFSet()) self._apply_and(index, ('quick', 'FOX'), self.IFSet([2])) def test_optimize_converts_to_tree_set(self): index = self._makeOne() self._populate(index) self.assertEqual(type(index._fwd_index['zope']), type(self.IFSet())) index.tree_threshold = 0 index.optimize() self.assertEqual(type(index._fwd_index['zope']), type(self.IFTreeSet())) def test_optimize_converts_to_simple_set(self): index = self._makeOne() index.tree_threshold = 0 self._populate(index) self.assertEqual(type(index._fwd_index['zope']), type(self.IFTreeSet())) index.tree_threshold = 99 index.optimize() self.assertEqual(type(index._fwd_index['zope']), type(self.IFSet())) def test_optimize_leaves_words_alone(self): index = self._makeOne() self._populate(index) self.assertEqual(type(index._fwd_index['zope']), type(self.IFSet())) index.tree_threshold = 99 index.optimize() self.assertEqual(type(index._fwd_index['zope']), type(self.IFSet())) def test_index_with_empty_sequence_unindexes(self): index = self._makeOne() self._populate(index) self._search(index, 'cmf', self.IFSet([5])) index.index_doc(5, ()) self._search(index, 'cmf', self.IFSet([])) class CaseInsensitiveKeywordIndexTestsBase: def _getTargetClass(self): from zope.index.keyword.index import CaseInsensitiveKeywordIndex return CaseInsensitiveKeywordIndex def _populate(self, index): index.index_doc(1, ('zope', 'CMF', 'zope3', 'Zope3')) index.index_doc(2, ('the', 'quick', 'brown', 'FOX')) index.index_doc(3, ('Zope', 'zope')) index.index_doc(4, ()) index.index_doc(5, ('cmf',)) _populated_doc_count = 4 _populated_word_count = 7 def test_normalize(self): index = self._makeOne() self.assertEqual(index.normalize(['Foo']), ['foo']) def test_simplesearch(self): index = self._makeOne() self._populate(index) self._search(index, [''], self.IFSet()) self._search(index, 'cmf', self.IFSet([1, 5])) self._search(index, ['cmf'], self.IFSet([1, 5])) self._search(index, ['zope'], self.IFSet([1, 3])) self._search(index, ['zope3'], self.IFSet([1])) self._search(index, ['foo'], self.IFSet()) def test_search_and(self): index = self._makeOne() self._populate(index) self._search_and(index, ('cmf', 'zope3'), self.IFSet([1])) self._search_and(index, ('cmf', 'zope'), self.IFSet([1])) self._search_and(index, ('cmf', 'zope4'), self.IFSet()) self._search_and(index, ('zope', 'ZOPE'), self.IFSet([1, 3])) def test_search_or(self): index = self._makeOne() self._populate(index) self._search_or(index, ('cmf', 'zope3'), self.IFSet([1, 5])) self._search_or(index, ('cmf', 'zope'), self.IFSet([1, 3, 5])) self._search_or(index, ('cmf', 'zope4'), self.IFSet([1, 5])) self._search_or(index, ('zope', 'ZOPE'), self.IFSet([1,3])) def test_apply(self): index = self._makeOne() self._populate(index) self._apply(index, ('cmf', 'zope3'), self.IFSet([1])) self._apply(index, ('cmf', 'zope'), self.IFSet([1])) self._apply(index, ('cmf', 'zope4'), self.IFSet()) self._apply(index, ('zope', 'ZOPE'), self.IFSet([1, 3])) def test_apply_and(self): index = self._makeOne() self._populate(index) 
self._apply_and(index, ('cmf', 'zope3'), self.IFSet([1])) self._apply_and(index, ('cmf', 'zope'), self.IFSet([1])) self._apply_and(index, ('cmf', 'zope4'), self.IFSet()) self._apply_and(index, ('zope', 'ZOPE'), self.IFSet([1, 3])) def test_apply_or(self): index = self._makeOne() self._populate(index) self._apply_or(index, ('cmf', 'zope3'), self.IFSet([1, 5])) self._apply_or(index, ('cmf', 'zope'), self.IFSet([1, 3, 5])) self._apply_or(index, ('cmf', 'zope4'), self.IFSet([1, 5])) self._apply_or(index, ('zope', 'ZOPE'), self.IFSet([1,3])) class _ThirtyTwoBitBase: def _get_family(self): import BTrees return BTrees.family32 def IFSet(self, *args, **kw): from BTrees.IFBTree import IFSet return IFSet(*args, **kw) def IFTreeSet(self, *args, **kw): from BTrees.IFBTree import IFTreeSet return IFTreeSet(*args, **kw) class _SixtyFourBitBase: def _get_family(self): import BTrees return BTrees.family64 def IFSet(self, *args, **kw): from BTrees.LFBTree import LFSet return LFSet(*args, **kw) def IFTreeSet(self, *args, **kw): from BTrees.LFBTree import LFTreeSet return LFTreeSet(*args, **kw) _marker = object() class _TestCaseBase: def _makeOne(self, family=_marker): if family is _marker: return self._getTargetClass()(self._get_family()) return self._getTargetClass()(family) def _search(self, index, query, expected, mode='and'): results = index.search(query, mode) # results and expected are IFSets() but we can not # compare them directly since __eq__() does not seem # to be implemented for BTrees self.assertEqual(results.keys(), expected.keys()) def _search_and(self, index, query, expected): return self._search(index, query, expected, 'and') def _search_or(self, index, query, expected): return self._search(index, query, expected, 'or') def _apply(self, index, query, expected, mode='and'): results = index.apply(query) self.assertEqual(results.keys(), expected.keys()) def _apply_and(self, index, query, expected): results = index.apply({'operator': 'and', 'query': query}) self.assertEqual(results.keys(), expected.keys()) def _apply_or(self, index, query, expected): results = index.apply({'operator': 'or', 'query': query}) self.assertEqual(results.keys(), expected.keys()) def test_class_conforms_to_IInjection(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IInjection verifyClass(IInjection, self._getTargetClass()) def test_instance_conforms_to_IInjection(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IInjection verifyObject(IInjection, self._makeOne()) def test_class_conforms_to_IIndexSearch(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IIndexSearch verifyClass(IIndexSearch, self._getTargetClass()) def test_instance_conforms_to_IIndexSearch(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IIndexSearch verifyObject(IIndexSearch, self._makeOne()) def test_class_conforms_to_IStatistics(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IStatistics verifyClass(IStatistics, self._getTargetClass()) def test_instance_conforms_to_IStatistics(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IStatistics verifyObject(IStatistics, self._makeOne()) def test_class_conforms_to_IKeywordQuerying(self): from zope.interface.verify import verifyClass from zope.index.keyword.interfaces import IKeywordQuerying verifyClass(IKeywordQuerying, self._getTargetClass()) def 
test_instance_conforms_to_IKeywordQuerying(self): from zope.interface.verify import verifyObject from zope.index.keyword.interfaces import IKeywordQuerying verifyObject(IKeywordQuerying, self._makeOne()) def test_ctor_defaults(self): index = self._makeOne() self.failUnless(index.family is self._get_family()) def test_ctor_explicit_family(self): import BTrees index = self._makeOne(family=BTrees.family64) self.failUnless(index.family is BTrees.family64) def test_empty_index(self): index = self._makeOne() self.assertEqual(index.documentCount(), 0) self.assertEqual(index.wordCount(), 0) self.failIf(index.has_doc(1)) def test_index_doc_string_value_raises(self): index = self._makeOne() self.assertRaises(TypeError, index.index_doc, 1, 'albatross') def test_index_doc_single(self): index = self._makeOne() index.index_doc(1, ('albatross', 'cormorant')) self.assertEqual(index.documentCount(), 1) self.assertEqual(index.wordCount(), 2) self.failUnless(index.has_doc(1)) self.failUnless('albatross' in index._fwd_index) self.failUnless('cormorant' in index._fwd_index) def test_index_doc_existing(self): index = self._makeOne() index.index_doc(1, ('albatross', 'cormorant')) index.index_doc(1, ('buzzard', 'cormorant')) self.assertEqual(index.documentCount(), 1) self.assertEqual(index.wordCount(), 2) self.failUnless(index.has_doc(1)) self.failIf('albatross' in index._fwd_index) self.failUnless('buzzard' in index._fwd_index) self.failUnless('cormorant' in index._fwd_index) def test_index_doc_many(self): index = self._makeOne() self._populate(index) self.assertEqual(index.documentCount(), self._populated_doc_count) self.assertEqual(index.wordCount(), self._populated_word_count) for docid in range(1, 6): if docid == 4: self.failIf(index.has_doc(docid)) else: self.failUnless(index.has_doc(docid)) def test_clear(self): index = self._makeOne() self._populate(index) index.clear() self.assertEqual(index.documentCount(), 0) self.assertEqual(index.wordCount(), 0) for docid in range(1, 6): self.failIf(index.has_doc(docid)) def test_unindex_doc_missing(self): index = self._makeOne() index.unindex_doc(1) # doesn't raise def test_unindex_no_residue(self): index = self._makeOne() index.index_doc(1, ('albatross', )) index.unindex_doc(1) self.assertEqual(index.documentCount(), 0) self.assertEqual(index.wordCount(), 0) self.failIf(index.has_doc(1)) def test_unindex_w_residue(self): index = self._makeOne() index.index_doc(1, ('albatross', )) index.index_doc(2, ('albatross', 'cormorant')) index.unindex_doc(1) self.assertEqual(index.documentCount(), 1) self.assertEqual(index.wordCount(), 2) self.failIf(index.has_doc(1)) def test_hasdoc(self): index = self._makeOne() self._populate(index) self.assertEqual(index.has_doc(1), 1) self.assertEqual(index.has_doc(2), 1) self.assertEqual(index.has_doc(3), 1) self.assertEqual(index.has_doc(4), 0) self.assertEqual(index.has_doc(5), 1) self.assertEqual(index.has_doc(6), 0) def test_search_bad_operator(self): index = self._makeOne() self.assertRaises(TypeError, index.search, 'whatever', 'maybe') class KeywordIndexTests32(_KeywordIndexTestsBase, _ThirtyTwoBitBase, _TestCaseBase, unittest.TestCase): pass class CaseInsensitiveKeywordIndexTests32(CaseInsensitiveKeywordIndexTestsBase, _ThirtyTwoBitBase, _TestCaseBase, unittest.TestCase): pass class KeywordIndexTests64(_KeywordIndexTestsBase, _SixtyFourBitBase, _TestCaseBase, unittest.TestCase): pass class CaseInsensitiveKeywordIndexTests64(CaseInsensitiveKeywordIndexTestsBase, _SixtyFourBitBase, _TestCaseBase, unittest.TestCase): pass def 
test_suite(): return unittest.TestSuite(( unittest.makeSuite(KeywordIndexTests32), unittest.makeSuite(KeywordIndexTests64), unittest.makeSuite(CaseInsensitiveKeywordIndexTests32), unittest.makeSuite(CaseInsensitiveKeywordIndexTests64), )) zope.index-3.6.4/src/zope/index/keyword/__init__.py0000644000175000017500000000011711727503631023427 0ustar tseavertseaver00000000000000from zope.index.keyword.index import KeywordIndex, CaseInsensitiveKeywordIndex zope.index-3.6.4/src/zope/index/text/0000755000175000017500000000000011727503757020630 5ustar tseavertseaver00000000000000zope.index-3.6.4/src/zope/index/text/ricecode.py0000644000175000017500000001402011727503631022743 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Rice coding (a variation of Golomb coding) Based on a Java implementation by Glen McCluskey described in a Usenix ;login: article at http://www.usenix.org/publications/login/2000-4/features/java.html McCluskey's article explains the approach as follows. The encoding for a value x is represented as a unary part and a binary part. The unary part is a sequence of 1 bits followed by a 0 bit. The binary part encodes some of the lower bits of x-1. The encoding is parameterized by a value m that describes how many bits to store in the binary part. If most of the values are smaller than 2**m then they can be stored in only m+1 bits. Compute the length of the unary part, q, where q = math.floor((x-1)/ 2 ** m) Emit q 1 bits followed by a 0 bit. Emit the lower m bits of x-1, treating x-1 as a binary value. 
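As a worked example of those two rules: with m = 2, encoding x = 7 gives q = floor((7-1) / 2**2) = 1, so the unary part is '10'; the lower m = 2 bits of x-1 = 6 (binary 110) supply the binary part '10', making the complete code '1010'. Decoding reverses this: (unary << m) + (binary + 1) = (1 << 2) + (2 + 1) = 7.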
""" import array class BitArray(object): def __init__(self, buf=None): self.bytes = array.array('B') self.nbits = 0 self.bitsleft = 0 self.tostring = self.bytes.tostring def __getitem__(self, i): byte, offset = divmod(i, 8) mask = 2 ** offset if self.bytes[byte] & mask: return 1 else: return 0 def __setitem__(self, i, val): byte, offset = divmod(i, 8) mask = 2 ** offset if val: self.bytes[byte] |= mask else: self.bytes[byte] &= ~mask def __len__(self): return self.nbits def append(self, bit): """Append a 1 if bit is true or 1 if it is false.""" if self.bitsleft == 0: self.bytes.append(0) self.bitsleft = 8 self.__setitem__(self.nbits, bit) self.nbits += 1 self.bitsleft -= 1 def __getstate__(self): return self.nbits, self.bitsleft, self.tostring() def __setstate__(self, (nbits, bitsleft, s)): self.bytes = array.array('B', s) self.nbits = nbits self.bitsleft = bitsleft class RiceCode(object): def __init__(self, m): """Constructor a RiceCode for m-bit values.""" if not (0 <= m <= 16): raise ValueError("m must be between 0 and 16") self.init(m) self.bits = BitArray() self.len = 0 def init(self, m): self.m = m self.lower = (1 << m) - 1 self.mask = 1 << (m - 1) def append(self, val): """Append an item to the list.""" if val < 1: raise ValueError("value >= 1 expected, got %s" % `val`) val -= 1 # emit the unary part of the code q = val >> self.m for i in range(q): self.bits.append(1) self.bits.append(0) # emit the binary part r = val & self.lower mask = self.mask while mask: self.bits.append(r & mask) mask >>= 1 self.len += 1 def __len__(self): return self.len def tolist(self): """Return the items as a list.""" l = [] i = 0 # bit offset binary_range = range(self.m) for j in range(self.len): unary = 0 while self.bits[i] == 1: unary += 1 i += 1 assert self.bits[i] == 0 i += 1 binary = 0 for k in binary_range: binary = (binary << 1) | self.bits[i] i += 1 l.append((unary << self.m) + (binary + 1)) return l def tostring(self): """Return a binary string containing the encoded data. The binary string may contain some extra zeros at the end. 
""" return self.bits.tostring() def __getstate__(self): return self.m, self.bits def __setstate__(self, (m, bits)): self.init(m) self.bits = bits def encode(m, l): c = RiceCode(m) for elt in l: c.append(elt) assert c.tolist() == l return c def encode_deltas(l): if len(l) == 1: return l[0], [] deltas = RiceCode(6) deltas.append(l[1] - l[0]) for i in range(2, len(l)): deltas.append(l[i] - l[i - 1]) return l[0], deltas def decode_deltas(start, enc_deltas): deltas = enc_deltas.tolist() l = [start] for i in range(1, len(deltas)): l.append(l[i-1] + deltas[i]) l.append(l[-1] + deltas[-1]) return l def test(): import random for size in [10, 20, 50, 100, 200]: l = [random.randint(1, size) for i in range(50)] c = encode(random.randint(1, 16), l) assert c.tolist() == l for size in [10, 20, 50, 100, 200]: l = range(random.randint(1, size), size + random.randint(1, size)) t = encode_deltas(l) l2 = decode_deltas(*t) assert l == l2 if l != l2: print l print l2 def pickle_efficiency(): import pickle import random for m in [4, 8, 12]: for size in [10, 20, 50, 100, 200, 500, 1000, 2000, 5000]: for elt_range in [10, 20, 50, 100, 200, 500, 1000]: l = [random.randint(1, elt_range) for i in range(size)] raw = pickle.dumps(l, 1) enc = pickle.dumps(encode(m, l), 1) print "m=%2d size=%4d range=%4d" % (m, size, elt_range), print "%5d %5d" % (len(raw), len(enc)), if len(raw) > len(enc): print "win" else: print "lose" if __name__ == "__main__": test() zope.index-3.6.4/src/zope/index/text/stopdict.py0000644000175000017500000000233211727503631023022 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Provide a default list of stop words for the index. The specific splitter and lexicon are customizable, but the default ZCTextIndex should do something useful. """ def get_stopdict(): """Return a dictionary of stopwords.""" return _dict # This list of English stopwords comes from Lucene _words = [ "a", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with" ] _dict = {} for w in _words: _dict[w] = None zope.index-3.6.4/src/zope/index/text/interfaces.py0000644000175000017500000001566111727503631023325 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. 
# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE # ############################################################################## """Text-indexing interfaces """ from zope.interface import Attribute from zope.interface import Interface class ILexicon(Interface): """Object responsible for converting text to word identifiers.""" def termToWordIds(text): """Return a sequence of ids of the words parsed from the text. The input text may be either a string or a list of strings. Parse the text as if it consists of search terms, and skip words that aren't in the lexicon. """ def sourceToWordIds(text): """Return a sequence of ids of the words parsed from the text. The input text may be either a string or a list of strings. Parse the text as if it comes from a source document, and create new word ids for words that aren't (yet) in the lexicon. """ def globToWordIds(pattern): """Return a sequence of ids of words matching the pattern. The argument should be a single word using globbing syntax, e.g. 'foo*' meaning anything starting with 'foo'. Return the wids for all words in the lexicon that match the pattern. """ def wordCount(): """Return the number of unique terms in the lexicon.""" def get_word(wid): """Return the word for the given word id. Raise KeyError if the word id is not in the lexicon. """ def get_wid(word): """Return the word id for the given word. Return 0 if the word is not in the lexicon. """ def parseTerms(text): """Pass the text through the pipeline. Return a list of words, normalized by the pipeline (e.g. stopwords removed, case normalized etc.). """ def isGlob(word): """Return true if the word is a globbing pattern. The word should be one of the words returned by parseTerms(). """ class ILexiconBasedIndex(Interface): """ Interface for indexes which hold a lexicon.""" lexicon = Attribute(u'Lexicon used by the index.') class IQueryParser(Interface): """Interface for Query Parsers.""" def parseQuery(query): """Parse a query string. Return a parse tree (which implements IQueryParseTree). Some of the query terms may be ignored because they are stopwords; use getIgnored() to find out which terms were ignored. But if the entire query consists only of stop words, or of stopwords and one or more negated terms, an exception is raised. May raise ParseTree.ParseError. """ def getIgnored(): """Return the list of ignored terms. Return the list of terms that were ignored by the most recent call to parseQuery() because they were stopwords. If parseQuery() was never called this returns None. """ def parseQueryEx(query): """Parse a query string. Return a tuple (tree, ignored) where 'tree' is the parse tree as returned by parseQuery(), and 'ignored' is a list of ignored terms as returned by getIgnored(). May raise ParseTree.ParseError. """ class IQueryParseTree(Interface): """Interface for parse trees returned by parseQuery().""" def nodeType(): """Return the node type. This is one of 'AND', 'OR', 'NOT', 'ATOM', 'PHRASE' or 'GLOB'. """ def getValue(): """Return a node-type specific value. For node type: Return: 'AND' a list of parse trees 'OR' a list of parse trees 'NOT' a parse tree 'ATOM' a string (representing a single search term) 'PHRASE' a string (representing a search phrase) 'GLOB' a string (representing a pattern, e.g.
"foo*") """ def terms(): """Return a list of all terms in this node, excluding NOT subtrees.""" def executeQuery(index): """Execute the query represented by this node against the index. The index argument must implement the IIndex interface. Return an IFBucket or IFBTree mapping document ids to scores (higher scores mean better results). May raise ParseTree.QueryError. """ class ISearchableText(Interface): """Interface that text-indexable objects should implement.""" def getSearchableText(): """Return a sequence of unicode strings to be indexed. Each unicode string in the returned sequence will be run through the splitter pipeline; the combined stream of words coming out of the pipeline will be indexed. returning None indicates the object should not be indexed """ class IPipelineElement(Interface): """ An element in a lexicon's processing pipeline. """ def process(terms): """ Transform each term in terms. Return the sequence of transformed terms. """ class ISplitter(IPipelineElement): """ Split text into a sequence of words. """ def processGlob(terms): """ Transform terms, leaving globbing markers in place. """ class IExtendedQuerying(Interface): """An index that supports advanced search setups.""" def search(term): """Execute a search on a single term given as a string. Return an IFBTree mapping docid to score, or None if all docs match due to the lexicon returning no wids for the term (e.g., if the term is entirely composed of stopwords). """ def search_phrase(phrase): """Execute a search on a phrase given as a string. Return an IFBtree mapping docid to score. """ def search_glob(pattern): """Execute a pattern search. The pattern represents a set of words by using * and ?. For example, "foo*" represents the set of all words in the lexicon starting with "foo". Return an IFBTree mapping docid to score. """ def query_weight(terms): """Return the weight for a set of query terms. 'terms' is a sequence of all terms included in the query, although not terms with a not. If a term appears more than once in a query, it should appear more than once in terms. Nothing is defined about what "weight" means, beyond that the result is an upper bound on document scores returned for the query. """ zope.index-3.6.4/src/zope/index/text/okapiindex.py0000644000175000017500000003372611727503631023337 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE # ############################################################################## """Full text index with relevance ranking, using an Okapi BM25 rank. "Okapi" (much like "cosine rule" also) is a large family of scoring gimmicks. It's based on probability arguments about how words are distributed in documents, not on an abstract vector space model. A long paper by its principal inventors gives an excellent overview of how it was derived: A probabilistic model of information retrieval: development and status K. Sparck Jones, S. Walker, S.E. 
Robertson http://citeseer.nj.nec.com/jones98probabilistic.html Spellings that ignore relevance information (which we don't have) are of this high-level form: score(D, Q) = sum(for t in D&Q: TF(D, t) * IDF(Q, t)) where D a specific document Q a specific query t a term (word, atomic phrase, whatever) D&Q the terms common to D and Q TF(D, t) a measure of t's importance in D -- a kind of term frequency weight IDF(Q, t) a measure of t's importance in the query and in the set of documents as a whole -- a kind of inverse document frequency weight The IDF(Q, t) here is identical to the one used for our cosine measure. Since queries are expected to be short, it ignores Q entirely: IDF(Q, t) = log(1.0 + N / f(t)) where N the total number of documents f(t) the number of documents in which t appears Most Okapi literature seems to use log(N/f(t)) instead. We don't, because that becomes 0 for a term that's in every document, and, e.g., if someone is searching for "documentation" on python.org (a term that may well show up on every page, due to the top navigation bar), we still want to find the pages that use the word a lot (which is TF's job to find, not IDF's -- we just want to stop IDF from considering this t to be irrelevant). The TF(D, t) spellings are more interesting. With lots of variations, the most basic spelling is of the form f(D, t) TF(D, t) = --------------- f(D, t) + K(D) where f(D, t) the number of times t appears in D K(D) a measure of the length of D, normalized to mean doc length The functional *form* f/(f+K) is clever. It's a gross approximation to a mixture of two distinct Poisson distributions, based on the idea that t probably appears in D for one of two reasons: 1. More or less at random. 2. Because it's important to D's purpose in life ("eliteness" in papers). Note that f/(f+K) is always between 0 and 1. If f is very large compared to K, it approaches 1. If K is very large compared to f, it approaches 0. If t appears in D more or less "for random reasons", f is likely to be small, and so K will dominate unless it's a very small doc, and the ratio will be small. OTOH, if t appears a lot in D, f will dominate unless it's a very large doc, and the ratio will be close to 1. We use a variation on that simple theme, a simplification of what's called BM25 in the literature (it was the 25th stab at a Best Match function from the Okapi group; "a simplification" means we're setting some of BM25's more esoteric free parameters to 0): f(D, t) * (k1 + 1) TF(D, t) = -------------------- f(D, t) + k1 * K(D) where k1 a "tuning factor", typically between 1.0 and 2.0. We use 1.2, the usual default value. This constant adjusts the curve to look more like a theoretical 2-Poisson curve. Note that as f(D, t) increases, TF(D, t) increases monotonically, approaching an asymptote of k1+1 from below. Finally, we use K(D) = (1-b) + b * len(D)/E(len(D)) where b is another free parameter, discussed below. We use 0.75. len(D) the length of D in words E(len(D)) the expected value of len(D) across the whole document set; or, IOW, the average document length b is a free parameter between 0.0 and 1.0, and adjusts for the expected effect of the "Verbosity Hypothesis". Suppose b is 1, and some word t appears 10 times as often in document d2 than in document d1. 
If document d2 is also 10 times as long as d1, TF(d1, t) and TF(d2, t) are identical: f(d2, t) * (k1 + 1) TF(d2, t) = --------------------------------- = f(d2, t) + k1 * len(d2)/E(len(D)) 10 * f(d1, t) * (k1 + 1) ----------------------------------------------- = TF(d1, t) 10 * f(d1, t) + k1 * (10 * len(d1))/E(len(D)) because the 10's cancel out. This is appropriate if we believe that a word appearing 10x more often in a doc 10x as long is simply due to the longer doc being more verbose. If we do believe that, the longer doc and the shorter doc are probably equally relevant. OTOH, it *could* be that the longer doc is talking about t in greater depth too, in which case it's probably more relevant than the shorter doc. At the other extreme, if we set b to 0, the len(D)/E(len(D)) term vanishes completely, and a doc scores higher for having more occurrences of a word regardless of the doc's length. Reality is between these extremes, and probably varies by document and word too. Reports in the literature suggest that b=0.75 is a good compromise "in general", favoring the "verbosity hypothesis" end of the scale. Putting it all together, the final TF function is f(D, t) * (k1 + 1) TF(D, t) = -------------------------------------------- f(D, t) + k1 * ((1-b) + b*len(D)/E(len(D))) with k1=1.2 and b=0.75. Query Term Weighting -------------------- I'm ignoring the query adjustment part of Okapi BM25 because I expect our queries are very short. Full BM25 takes them into account by adding the following to every score(D, Q); it depends on the lengths of D and Q, but not on the specific words in Q, or even on whether they appear in D(!): E(len(D)) - len(D) k2 * len(Q) * ------------------- E(len(D)) + len(D) Here k2 is another "tuning constant", len(Q) is the number of words in Q, and len(D) & E(len(D)) were defined above. The Okapi group set k2 to 0 in TREC-9, so it apparently doesn't do much good (or may even hurt). Full BM25 *also* multiplies the following factor into IDF(Q, t): f(Q, t) * (k3 + 1) ------------------ f(Q, t) + k3 where k3 is yet another free parameter, and f(Q,t) is the number of times t appears in Q. Since we're using short "web style" queries, I expect f(Q,t) to always be 1, and then that quotient is 1 * (k3 + 1) ------------ = 1 1 + k3 regardless of k3's value. So, in a trivial sense, we are incorporating this measure (and optimizing it by not bothering to multiply by 1). """ from zope.index.text.baseindex import BaseIndex from zope.index.text.baseindex import inverse_doc_frequency try: from zope.index.text.okascore import score except ImportError: #pragma NO COVERAGE score = None from BTrees.Length import Length class OkapiIndex(BaseIndex): # BM25 free parameters. K1 = 1.2 B = 0.75 assert K1 >= 0.0 assert 0.0 <= B <= 1.0 def __init__(self, lexicon, family=None): BaseIndex.__init__(self, lexicon, family=family) # ._wordinfo for Okapi is # wid -> {docid -> frequency}; t -> D -> f(D, t) # ._docweight for Okapi is # docid -> # of words in the doc # This is just len(self._docwords[docid]), but _docwords is stored # in compressed form, so uncompressing it just to count the list # length would be ridiculously expensive. # sum(self._docweight.values()), the total # of words in all docs # This is a long for "better safe than sorry" reasons. It isn't # used often enough that speed should matter.
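# For illustration (docid and text are hypothetical): after # index_doc(7, u'cat cat dog') with a splitting lexicon, ._wordinfo # maps the wid for 'cat' to {7: 2}, ._docweight[7] == 3, and # self._totaldoclen() grows by 3.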
self._totaldoclen = Length(0) def index_doc(self, docid, text): count = BaseIndex.index_doc(self, docid, text) self._change_doc_len(count) return count def _reindex_doc(self, docid, text): self._change_doc_len(-self._docweight[docid]) return BaseIndex._reindex_doc(self, docid, text) def unindex_doc(self, docid): if docid not in self._docwords: return self._change_doc_len(-self._docweight[docid]) BaseIndex.unindex_doc(self, docid) def _change_doc_len(self, delta): # Change total doc length used for scoring delta = int(delta) try: self._totaldoclen.change(delta) except AttributeError: # Opportunistically upgrade _totaldoclen attribute to Length object self._totaldoclen = Length(long(self._totaldoclen + delta)) # The workhorse. Return a list of (IFBucket, weight) pairs, one pair # for each wid t in wids. The IFBucket, times the weight, maps D to # TF(D,t) * IDF(t) for every docid D containing t. # As currently written, the weights are always 1, and the IFBucket maps # D to TF(D,t)*IDF(t) directly, where the product is computed as a float. # NOTE: This may be overridden below, by a function that computes the # same thing but with the inner scoring loop in C. if score is None: #pragma NO COVERAGE def _search_wids(self, wids): if not wids: return [] N = float(self.documentCount()) # total # of docs try: doclen = self._totaldoclen() except TypeError: # _totaldoclen has not yet been upgraded doclen = self._totaldoclen meandoclen = doclen / N K1 = self.K1 B = self.B K1_plus1 = K1 + 1.0 B_from1 = 1.0 - B # f(D, t) * (k1 + 1) # TF(D, t) = ------------------------------------------- # f(D, t) + k1 * ((1-b) + b*len(D)/E(len(D))) L = [] docid2len = self._docweight for t in wids: d2f = self._wordinfo[t] # map {docid -> f(docid, t)} idf = inverse_doc_frequency(len(d2f), N) # an unscaled float result = self.family.IF.Bucket() for docid, f in d2f.items(): lenweight = B_from1 + B * docid2len[docid] / meandoclen tf = f * K1_plus1 / (f + K1 * lenweight) result[docid] = tf * idf L.append((result, 1)) return L # Note about the above: the result is tf * idf. tf is # small -- it can't be larger than k1+1 = 2.2. idf is # formally unbounded, but is less than 14 for a term that # appears in only 1 of a million documents. So the # product is probably less than 32, or 5 bits before the # radix point. If we did the scaled-int business on both # of them, we'd be up to 25 bits. Add 64 of those and # we'd be in overflow territory. That's pretty unlikely, # so we *could* just store scaled_int(tf) in # result[docid], and use scaled_int(idf) as an invariant # weight across the whole result. But besides skating # near the edge, it's not a speed cure, since the # computation of tf would still be done at Python speed, # and it's a lot more work than just multiplying by idf. else: # The same function as _search_wids above, but with the inner scoring # loop written in C (module okascore, function score()). # Cautions: okascore hardcodes the values of K, B1, and the scaled_int # function. 
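# As a quick numeric check of the TF formula above (values chosen for # illustration): with f(D, t) = 3, K1 = 1.2, B = 0.75 and a doc twice # the mean length, lenweight = 0.25 + 0.75 * 2 = 1.75, so # TF = 3 * 2.2 / (3 + 1.2 * 1.75) = 6.6 / 5.1, about 1.29 -- well # below the K1 + 1 = 2.2 asymptote.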
def _search_wids(self, wids): if not wids: return [] N = float(self.documentCount()) # total # of docs try: doclen = self._totaldoclen() except TypeError: # _totaldoclen has not yet been upgraded doclen = self._totaldoclen meandoclen = doclen / N #K1 = self.K1 #B = self.B #K1_plus1 = K1 + 1.0 #B_from1 = 1.0 - B # f(D, t) * (k1 + 1) # TF(D, t) = ------------------------------------------- # f(D, t) + k1 * ((1-b) + b*len(D)/E(len(D))) L = [] docid2len = self._docweight for t in wids: d2f = self._wordinfo[t] # map {docid -> f(docid, t)} idf = inverse_doc_frequency(len(d2f), N) # an unscaled float result = self.family.IF.Bucket() score(result, d2f.items(), docid2len, idf, meandoclen) L.append((result, 1)) return L def query_weight(self, terms): # Get the wids. wids = [] for term in terms: termwids = self._lexicon.termToWordIds(term) wids.extend(termwids) # The max score for term t is the maximum value of # TF(D, t) * IDF(Q, t) # We can compute IDF directly, and as noted in the comments below # TF(D, t) is bounded above by 1+K1. N = float(len(self._docweight)) tfmax = 1.0 + self.K1 sum = 0 for t in self._remove_oov_wids(wids): idf = inverse_doc_frequency(len(self._wordinfo[t]), N) sum += idf * tfmax return sum def _get_frequencies(self, wids): d = {} dget = d.get for wid in wids: d[wid] = dget(wid, 0) + 1 return d, len(wids) zope.index-3.6.4/src/zope/index/text/okascore.c0000644000175000017500000000704311727503631022575 0ustar tseavertseaver00000000000000/***************************************************************************** Copyright (c) 2002 Zope Foundation and Contributors. All Rights Reserved. This software is subject to the provisions of the Zope Public License, Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE ****************************************************************************/ /* okascore.c * * The inner scoring loop of OkapiIndex._search_wids() coded in C. * * Example from an indexed Python-Dev archive, where "python" shows up in all * but 2 of the 19,058 messages. With the Python scoring loop, * * query: python * # results: 10 of 19056 in 534.77 ms * query: python * # results: 10 of 19056 in 277.52 ms * * The first timing is cold, the second timing from an immediate repeat of * the same query. With the scoring loop here in C: * * query: python * # results: 10 of 19056 in 380.74 ms -- 40% speedup * query: python * # results: 10 of 19056 in 118.96 ms -- 133% speedup */ #include "Python.h" #define K1 1.2 #define B 0.75 #ifndef PyTuple_CheckExact #define PyTuple_CheckExact PyTuple_Check #endif static PyObject * score(PyObject *self, PyObject *args) { /* Believe it or not, floating these common subexpressions "by hand" gets better code out of MSVC 6. 
*/ const double B_FROM1 = 1.0 - B; const double K1_PLUS1 = K1 + 1.0; /* Inputs */ PyObject *result; /* IIBucket result, maps d to score */ PyObject *d2fitems; /* ._wordinfo[t].items(), maps d to f(d, t) */ PyObject *d2len; /* ._docweight, maps d to # words in d */ double idf; /* inverse doc frequency of t */ double meandoclen; /* average number of words in a doc */ int n, i; if (!PyArg_ParseTuple(args, "OOOdd:score", &result, &d2fitems, &d2len, &idf, &meandoclen)) return NULL; n = PyObject_Length(d2fitems); for (i = 0; i < n; ++i) { PyObject *d_and_f; /* d2f[i], a (d, f) pair */ PyObject *d; double f; PyObject *doclen; /* ._docweight[d] */ double lenweight; double tf; PyObject *doc_score; int status; d_and_f = PySequence_GetItem(d2fitems, i); if (d_and_f == NULL) return NULL; if (!(PyTuple_CheckExact(d_and_f) && PyTuple_GET_SIZE(d_and_f) == 2)) { PyErr_SetString(PyExc_TypeError, "d2fitems must produce 2-item tuples"); Py_DECREF(d_and_f); return NULL; } d = PyTuple_GET_ITEM(d_and_f, 0); f = (double)PyInt_AsLong(PyTuple_GET_ITEM(d_and_f, 1)); doclen = PyObject_GetItem(d2len, d); if (doclen == NULL) { Py_DECREF(d_and_f); return NULL; } lenweight = B_FROM1 + B * PyInt_AsLong(doclen) / meandoclen; tf = f * K1_PLUS1 / (f + K1 * lenweight); doc_score = PyFloat_FromDouble(tf * idf); if (doc_score == NULL) status = -1; else status = PyObject_SetItem(result, d, doc_score); Py_DECREF(d_and_f); Py_DECREF(doclen); Py_XDECREF(doc_score); if (status < 0) return NULL; } Py_INCREF(Py_None); return Py_None; } static char score__doc__[] = "score(result, d2fitems, d2len, idf, meandoclen)\n" "\n" "Do the inner scoring loop for an Okapi index.\n"; static PyMethodDef okascore_functions[] = { {"score", score, METH_VARARGS, score__doc__}, {NULL} }; void initokascore(void) { PyObject *m; m = Py_InitModule3("okascore", okascore_functions, "inner scoring loop for Okapi rank"); } zope.index-3.6.4/src/zope/index/text/lexicon.py0000644000175000017500000001426411727503631022641 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE # ############################################################################## """Lexicon """ import re from zope.interface import implements from BTrees.IOBTree import IOBTree from BTrees.OIBTree import OIBTree from BTrees.Length import Length from persistent import Persistent from zope.index.text.interfaces import ILexicon from zope.index.text.interfaces import IPipelineElement from zope.index.text.interfaces import ISplitter from zope.index.text.stopdict import get_stopdict from zope.index.text.parsetree import QueryError class Lexicon(Persistent): implements(ILexicon) def __init__(self, *pipeline): self._wids = OIBTree() # word -> wid self._words = IOBTree() # wid -> word # wid 0 is reserved for words that aren't in the lexicon (OOV -- out # of vocabulary). This can happen, e.g., if a query contains a word # we never saw before, and that isn't a known stopword (or otherwise # filtered out). 
Returning a special wid value for OOV words is a # way to let clients know when an OOV word appears. self.wordCount = Length() self._pipeline = pipeline def wordCount(self): """Return the number of unique terms in the lexicon.""" # overridden per instance return len(self._wids) def words(self): return self._wids.keys() def wids(self): return self._words.keys() def items(self): return self._wids.items() def sourceToWordIds(self, text): if text is None: text = '' last = _text2list(text) for element in self._pipeline: last = element.process(last) if not isinstance(self.wordCount, Length): # Make sure wordCount is overridden with a BTrees.Length.Length self.wordCount = Length(self.wordCount()) # Strategically unload the length value so that we get the most # recent value written to the database to minimize conflicting wids # Because length is independent, this will load the most # recent value stored, regardless of whether MVCC is enabled self.wordCount._p_deactivate() return map(self._getWordIdCreate, last) def termToWordIds(self, text): last = _text2list(text) for element in self._pipeline: last = element.process(last) wids = [] for word in last: wids.append(self._wids.get(word, 0)) return wids def parseTerms(self, text): last = _text2list(text) for element in self._pipeline: process = getattr(element, "processGlob", element.process) last = process(last) return last def isGlob(self, word): return "*" in word or "?" in word def get_word(self, wid): return self._words[wid] def get_wid(self, word): return self._wids.get(word, 0) def globToWordIds(self, pattern): # Implement * and ? just as in the shell, except the pattern # must not start with either of these prefix = "" while pattern and pattern[0] not in "*?": prefix += pattern[0] pattern = pattern[1:] if not pattern: # There were no globbing characters in the pattern wid = self._wids.get(prefix, 0) if wid: return [wid] else: return [] if not prefix: # The pattern starts with a globbing character. # This is too inefficient, so we raise an exception. raise QueryError( "pattern %r shouldn't start with glob character" % pattern) pat = prefix for c in pattern: if c == "*": pat += ".*" elif c == "?": pat += "."
else: pat += re.escape(c) pat += "$" prog = re.compile(pat) keys = self._wids.keys(prefix) # Keys starting at prefix wids = [] for key in keys: if not key.startswith(prefix): break if prog.match(key): wids.append(self._wids[key]) return wids def _getWordIdCreate(self, word): wid = self._wids.get(word) if wid is None: wid = self._new_wid() self._wids[word] = wid self._words[wid] = word return wid def _new_wid(self): count = self.wordCount count.change(1) while self._words.has_key(count()): # just to be safe count.change(1) return count() def _text2list(text): # Helper: splitter input may be a string or a list of strings try: text + u"" except: return text else: return [text] # Sample pipeline elements class Splitter(object): implements(ISplitter) rx = re.compile(r"(?u)\w+") rxGlob = re.compile(r"(?u)\w+[\w*?]*") # See globToWordIds() above def process(self, lst): result = [] for s in lst: result += self.rx.findall(s) return result def processGlob(self, lst): result = [] for s in lst: result += self.rxGlob.findall(s) return result class CaseNormalizer(object): implements(IPipelineElement) def process(self, lst): return [w.lower() for w in lst] class StopWordRemover(object): implements(IPipelineElement) dict = get_stopdict().copy() def process(self, lst): has_key = self.dict.has_key return [w for w in lst if not has_key(w)] class StopWordAndSingleCharRemover(StopWordRemover): dict = get_stopdict().copy() for c in range(255): dict[chr(c)] = None zope.index-3.6.4/src/zope/index/text/textindex.txt0000644000175000017500000001166311727503631023403 0ustar tseavertseaver00000000000000Text Indexes ============ Text indexes combine an inverted index and a lexicon to support text indexing and searching. A text index can be created without passing any arguments: >>> from zope.index.text.textindex import TextIndex >>> index = TextIndex() By default, it uses an "Okapi" inverted index and a lexicon with a pipeline consisting of a simple word splitter, a case normalizer, and a stop-word remover. We index text using the `index_doc` method: >>> index.index_doc(1, u"the quick brown fox jumps over the lazy dog") >>> index.index_doc(2, ... u"the brown fox and the yellow fox don't need the retriever") >>> index.index_doc(3, u""" ... The Conservation Pledge ... ======================= ... ... I give my pledge, as an American, to save, and faithfully ... to defend from waste, the natural resources of my Country; ... it's soils, minerals, forests, waters and wildlife. ... """) >>> index.index_doc(4, u"Fran\xe7ois") >>> word = ( ... u"\N{GREEK SMALL LETTER DELTA}" ... u"\N{GREEK SMALL LETTER EPSILON}" ... u"\N{GREEK SMALL LETTER LAMDA}" ... u"\N{GREEK SMALL LETTER TAU}" ... u"\N{GREEK SMALL LETTER ALPHA}" ... ) >>> index.index_doc(5, word + u"\N{EM DASH}\N{GREEK SMALL LETTER ALPHA}") >>> index.index_doc(6, u""" ... What we have here, is a failure to communicate. ... """) >>> index.index_doc(7, u""" ... Hold on to your butts! ... """) >>> index.index_doc(8, u""" ... The Zen of Python, by Tim Peters ... ... Beautiful is better than ugly. ... Explicit is better than implicit. ... Simple is better than complex. ... Complex is better than complicated. ... Flat is better than nested. ... Sparse is better than dense. ... Readability counts. ... Special cases aren't special enough to break the rules. ... Although practicality beats purity. ... Errors should never pass silently. ... Unless explicitly silenced. ... In the face of ambiguity, refuse the temptation to guess. ...
  ... There should be one-- and preferably only one --obvious way to do it.
  ... Although that way may not be obvious at first unless you're Dutch.
  ... Now is better than never.
  ... Although never is often better than *right* now.
  ... If the implementation is hard to explain, it's a bad idea.
  ... If the implementation is easy to explain, it may be a good idea.
  ... Namespaces are one honking great idea -- let's do more of those!
  ... """)

Then we can search using the apply method, which takes a search string.

  >>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown fox').items()]
  [(1, '0.6153'), (2, '0.6734')]

  >>> [(k, "%.4f" % v) for (k, v) in index.apply(u'quick fox').items()]
  [(1, '0.6153')]

  >>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown python').items()]
  []

  >>> [(k, "%.4f" % v) for (k, v) in index.apply(u'dalmatian').items()]
  []

  >>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown or python').items()]
  [(1, '0.2602'), (2, '0.2529'), (8, '0.0934')]

  >>> [(k, "%.4f" % v) for (k, v) in index.apply(u'butts').items()]
  [(7, '0.6948')]

The outputs are mappings from document ids to float scores.  Items
with higher scores are more relevant.

We can use unicode characters in search strings.

  >>> [(k, "%.4f" % v) for (k, v) in index.apply(u"Fran\xe7ois").items()]
  [(4, '0.7427')]

  >>> [(k, "%.4f" % v) for (k, v) in index.apply(word).items()]
  [(5, '0.7179')]

We can use globbing in search strings.

  >>> [(k, "%.3f" % v) for (k, v) in index.apply('fo*').items()]
  [(1, '2.179'), (2, '2.651'), (3, '2.041')]

Text indexes support basic statistics:

  >>> index.documentCount()
  8

  >>> index.wordCount()
  114

If we index the same document twice, once with a zero value, and then
with a normal value, it should still work:

  >>> index2 = TextIndex()
  >>> index2.index_doc(1, [])
  >>> index2.index_doc(1, ["Zorro"])
  >>> [(k, "%.4f" % v) for (k, v) in index2.apply("Zorro").items()]
  [(1, '0.4545')]

Tracking Changes
================

If we index a document for the first time, it updates the
_totaldoclen of the underlying object.

  >>> index = TextIndex()
  >>> index.index._totaldoclen()
  0
  >>> index.index_doc(100, u"a new funky value")
  >>> index.index._totaldoclen()
  3

If we index it a second time, the underlying index length should not
be changed.

  >>> index.index_doc(100, u"a new funky value")
  >>> index.index._totaldoclen()
  3

But if we change it, the length changes too.

  >>> index.index_doc(100, u"an even newer funky value")
  >>> index.index._totaldoclen()
  5

The same rule applies to unindex_doc: if an object that is not
indexed is unindexed, no index should change state.

  >>> index.unindex_doc(100)
  >>> index.index._totaldoclen()
  0
  >>> index.unindex_doc(100)
  >>> index.index._totaldoclen()
  0
zope.index-3.6.4/src/zope/index/text/baseindex.py0000644000175000017500000003240311727503631023135 0ustar tseavertseaver00000000000000##############################################################################
#
# Copyright (c) 2002 Zope Foundation and Contributors.
# All Rights Reserved.
#
# This software is subject to the provisions of the Zope Public License,
# Version 2.1 (ZPL).  A copy of the ZPL should accompany this distribution.
# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED
# WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS
# FOR A PARTICULAR PURPOSE
#
##############################################################################
"""Abstract base class for full text index with relevance ranking.
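Concrete indexes such as OkapiIndex and CosineIndex subclass BaseIndex
and supply the scoring-specific pieces (_get_frequencies() and
_search_wids()).  A minimal usage sketch (hedged: the class and
pipeline names are the real ones shipped in this package, but the two
tiny documents are invented purely for illustration):

    from zope.index.text.lexicon import Lexicon, Splitter, CaseNormalizer
    from zope.index.text.okapiindex import OkapiIndex

    lexicon = Lexicon(Splitter(), CaseNormalizer())
    index = OkapiIndex(lexicon)
    index.index_doc(1, u"The quick brown fox")
    index.index_doc(2, u"Lazy dogs and lazy cats")
    # search() returns an IFBucket mapping docid -> relevance weight
    assert index.documentCount() == 2
    assert list(index.search(u"fox").keys()) == [1]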
""" import math from persistent import Persistent from zope.interface import implements import BTrees from BTrees import Length from BTrees.IOBTree import IOBTree from zope.index.interfaces import IInjection from zope.index.interfaces import IStatistics from zope.index.text.interfaces import IExtendedQuerying from zope.index.text.interfaces import ILexiconBasedIndex from zope.index.text import widcode from zope.index.text.setops import mass_weightedIntersection from zope.index.text.setops import mass_weightedUnion class BaseIndex(Persistent): implements(IInjection, IStatistics, ILexiconBasedIndex, IExtendedQuerying) family = BTrees.family32 lexicon = property(lambda self: self._lexicon,) def __init__(self, lexicon, family=None): if family is not None: self.family = family self._lexicon = lexicon self.clear() def clear(self): # wid -> {docid -> weight}; t -> D -> w(D, t) # Different indexers have different notions of term weight, but we # expect each indexer to use ._wordinfo to map wids to its notion # of a docid-to-weight map. # There are two kinds of OOV words: wid 0 is explicitly OOV, # and it's possible that the lexicon will return a non-zero wid # for a word we don't currently know about. For example, if we # unindex the last doc containing a particular word, that wid # remains in the lexicon, but is no longer in our _wordinfo map; # lexicons can also be shared across indices, and some other index # may introduce a lexicon word we've never seen. # A word is in-vocabulary for this index if and only if # _wordinfo.has_key(wid). Note that wid 0 must not be a key. # This does not use the BTree family since wids are always "I" # flavor trees. self._wordinfo = IOBTree() # docid -> weight # Different indexers have different notions of doc weight, but we # expect each indexer to use ._docweight to map docids to its # notion of what a doc weight is. self._docweight = self.family.IF.BTree() # docid -> WidCode'd list of wids # Used for un-indexing, and for phrase search. self._docwords = self.family.IO.BTree() # Use a BTree length for efficient length computation w/o conflicts self.wordCount = Length.Length() self.documentCount = Length.Length() def wordCount(self): """Return the number of words in the index.""" # This must be overridden by subclasses which do not set the # attribute on their instances. raise NotImplementedError def documentCount(self): """Return the number of documents in the index.""" # This must be overridden by subclasses which do not set the # attribute on their instances. raise NotImplementedError def get_words(self, docid): """Return a list of the wordids for a given docid.""" return widcode.decode(self._docwords[docid]) # A subclass may wish to extend or override this. def index_doc(self, docid, text): if docid in self._docwords: return self._reindex_doc(docid, text) wids = self._lexicon.sourceToWordIds(text) wid2weight, docweight = self._get_frequencies(wids) self._mass_add_wordinfo(wid2weight, docid) self._docweight[docid] = docweight self._docwords[docid] = widcode.encode(wids) try: self.documentCount.change(1) except AttributeError: # upgrade documentCount to Length object self.documentCount = Length.Length(len(self._docweight)) return len(wids) # A subclass may wish to extend or override this. This is for adjusting # to a new version of a doc that already exists. The goal is to be # faster than simply unindexing the old version in its entirety and then # adding the new version in its entirety. 
def _reindex_doc(self, docid, text): # Touch as few docid->w(docid, score) maps in ._wordinfo as possible. old_wids = self.get_words(docid) old_wid2w, old_docw = self._get_frequencies(old_wids) new_wids = self._lexicon.sourceToWordIds(text) new_wid2w, new_docw = self._get_frequencies(new_wids) old_widset = self.family.IF.TreeSet(old_wid2w.keys()) new_widset = self.family.IF.TreeSet(new_wid2w.keys()) IF = self.family.IF in_both_widset = IF.intersection(old_widset, new_widset) only_old_widset = IF.difference(old_widset, in_both_widset) only_new_widset = IF.difference(new_widset, in_both_widset) del old_widset, new_widset for wid in only_old_widset.keys(): self._del_wordinfo(wid, docid) for wid in only_new_widset.keys(): self._add_wordinfo(wid, new_wid2w[wid], docid) for wid in in_both_widset.keys(): # For the Okapi indexer, the "if" will trigger only for words # whose counts have changed. For the cosine indexer, the "if" # may trigger for every wid, since W(d) probably changed and # W(d) is divided into every score. newscore = new_wid2w[wid] if old_wid2w[wid] != newscore: self._add_wordinfo(wid, newscore, docid) self._docweight[docid] = new_docw self._docwords[docid] = widcode.encode(new_wids) return len(new_wids) # Subclass must override. def _get_frequencies(self, wids): # Compute term frequencies and a doc weight, whatever those mean # to an indexer. # Return pair: # {wid0: w(d, wid0), wid1: w(d, wid1), ...], # docweight # The wid->weight mappings are fed into _add_wordinfo, and docweight # becomes the value of _docweight[docid]. raise NotImplementedError def has_doc(self, docid): return docid in self._docwords # A subclass may wish to extend or override this. def unindex_doc(self, docid): if docid not in self._docwords: return for wid in self.family.IF.TreeSet(self.get_words(docid)).keys(): self._del_wordinfo(wid, docid) del self._docwords[docid] del self._docweight[docid] try: self.documentCount.change(-1) except AttributeError: # upgrade documentCount to Length object self.documentCount = Length.Length(len(self._docweight)) def search(self, term): wids = self._lexicon.termToWordIds(term) if not wids: return None # All docs match wids = self._remove_oov_wids(wids) return mass_weightedUnion(self._search_wids(wids), self.family) def search_glob(self, pattern): wids = self._lexicon.globToWordIds(pattern) wids = self._remove_oov_wids(wids) return mass_weightedUnion(self._search_wids(wids), self.family) def search_phrase(self, phrase): wids = self._lexicon.termToWordIds(phrase) cleaned_wids = self._remove_oov_wids(wids) if len(wids) != len(cleaned_wids): # At least one wid was OOV: can't possibly find it. return self.family.IF.BTree() scores = self._search_wids(wids) hits = mass_weightedIntersection(scores, self.family) if not hits: return hits code = widcode.encode(wids) result = self.family.IF.BTree() for docid, weight in hits.items(): docwords = self._docwords[docid] if docwords.find(code) >= 0: result[docid] = weight return result def _remove_oov_wids(self, wids): return filter(self._wordinfo.has_key, wids) # Subclass must override. # The workhorse. Return a list of (IFBucket, weight) pairs, one pair # for each wid t in wids. The IFBucket, times the weight, maps D to # TF(D,t) * IDF(t) for every docid D containing t. wids must not # contain any OOV words. def _search_wids(self, wids): raise NotImplementedError # Subclass must override. # It's not clear what it should do. It must return an upper bound on # document scores for the query. 
    # It would be nice if a document score
    # divided by the query's query_weight gave the probability that a
    # document was relevant, but nobody knows how to do that.  For
    # CosineIndex, the ratio is the cosine of the angle between the document
    # and query vectors.  For OkapiIndex, the ratio is a (probably
    # unachievable) upper bound with no "intuitive meaning" beyond that.
    def query_weight(self, terms):
        raise NotImplementedError

    DICT_CUTOFF = 10

    def _add_wordinfo(self, wid, f, docid):
        # Store a wordinfo in a dict as long as there are fewer than
        # DICT_CUTOFF docids in the dict.  Otherwise use an IFBTree.

        # The pickle of a dict is smaller than the pickle of an
        # IFBTree, substantially so for small mappings.  Thus, we use
        # a dictionary until the mapping reaches DICT_CUTOFF elements.

        # The cutoff is chosen based on the implementation
        # characteristics of Python dictionaries.  The dict hashtable
        # always has 2**N slots and is resized whenever it is 2/3s
        # full.  A pickled dict with 10 elts is half the size of an
        # IFBTree with 10 elts, and 10 happens to be 2/3s of 2**4.  So
        # choose 10 as the cutoff for now.

        # The IFBTree has a smaller in-memory representation than a
        # dictionary, so pickle size isn't the only consideration when
        # choosing the threshold.  The pickle of a 500-elt dict is 92%
        # of the size of the same IFBTree, but the dict uses more
        # space when it is live in memory.  An IFBTree stores two C
        # arrays of ints, one for the keys and one for the values.  It
        # holds up to 120 key-value pairs in a single bucket.
        doc2score = self._wordinfo.get(wid)
        if doc2score is None:
            doc2score = {}
            # XXX Holy ConflictError, Batman!
            try:
                self.wordCount.change(1)
            except AttributeError:
                # upgrade wordCount to Length object
                self.wordCount = Length.Length(len(self._wordinfo))
                self.wordCount.change(1)
        else:
            # _add_wordinfo() is called for each update.  If the map
            # size exceeds the DICT_CUTOFF, convert to an IFBTree.
            # Obscure:  First check the type.  If it's not a dict, it
            # can't need conversion, and then we can avoid an expensive
            # len(IFBTree).
            if (isinstance(doc2score, type({}))
                    and len(doc2score) == self.DICT_CUTOFF):
                doc2score = self.family.IF.BTree(doc2score)
        doc2score[docid] = f
        self._wordinfo[wid] = doc2score # not redundant:  Persistency!

    #    self._mass_add_wordinfo(wid2weight, docid)
    #
    # is the same as
    #
    #    for wid, weight in wid2weight.items():
    #        self._add_wordinfo(wid, weight, docid)
    #
    # except that _mass_add_wordinfo doesn't require so many function calls.
    def _mass_add_wordinfo(self, wid2weight, docid):
        dicttype = type({})
        get_doc2score = self._wordinfo.get
        new_word_count = 0
        for wid, weight in wid2weight.items():
            doc2score = get_doc2score(wid)
            if doc2score is None:
                doc2score = {}
                new_word_count += 1
            elif (isinstance(doc2score, dicttype)
                    and len(doc2score) == self.DICT_CUTOFF):
                doc2score = self.family.IF.BTree(doc2score)
            doc2score[docid] = weight
            self._wordinfo[wid] = doc2score # not redundant:  Persistency!
        try:
            self.wordCount.change(new_word_count)
        except AttributeError:
            # upgrade wordCount to Length object
            self.wordCount = Length.Length(len(self._wordinfo))

    def _del_wordinfo(self, wid, docid):
        doc2score = self._wordinfo[wid]
        del doc2score[docid]
        if doc2score:
            self._wordinfo[wid] = doc2score # not redundant:  Persistency!
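            # About the "not redundant:  Persistency!" reassignments
            # (an explanatory note, hedged): doc2score may be a plain
            # dict stored as a value inside the persistent _wordinfo
            # IOBTree, and mutating a plain dict in place is invisible
            # to ZODB's change tracking; storing it back via
            # self._wordinfo[wid] = ... is what marks the btree as
            # changed, so the mutation actually gets persisted.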
else: del self._wordinfo[wid] try: self.wordCount.change(-1) except AttributeError: # upgrade wordCount to Length object self.wordCount = Length.Length(len(self._wordinfo)) def inverse_doc_frequency(term_count, num_items): """Return the inverse doc frequency for a term, that appears in term_count items in a collection with num_items total items. """ # implements IDF(q, t) = log(1 + N/f(t)) return math.log(1.0 + float(num_items) / term_count) zope.index-3.6.4/src/zope/index/text/setops.py0000644000175000017500000000475511727503631022521 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE # ############################################################################## """SetOps -- Weighted intersections and unions applied to many inputs. """ import BTrees from zope.index.nbest import NBest def mass_weightedIntersection(L, family=BTrees.family32): "A list of (mapping, weight) pairs -> their weightedIntersection IFBucket." L = [(x, wx) for (x, wx) in L if x is not None] if len(L) < 2: return _trivial(L, family) # Intersect with smallest first. We expect the input maps to be # IFBuckets, so it doesn't hurt to get their lengths repeatedly # (len(Bucket) is fast; len(BTree) is slow). L.sort(lambda x, y: cmp(len(x[0]), len(y[0]))) (x, wx), (y, wy) = L[:2] dummy, result = family.IF.weightedIntersection(x, y, wx, wy) for x, wx in L[2:]: dummy, result = family.IF.weightedIntersection(result, x, 1, wx) return result def mass_weightedUnion(L, family=BTrees.family32): "A list of (mapping, weight) pairs -> their weightedUnion IFBucket." if len(L) < 2: return _trivial(L, family) # Balance unions as closely as possible, smallest to largest. merge = NBest(len(L)) for x, weight in L: merge.add((x, weight), len(x)) while len(merge) > 1: # Merge the two smallest so far, and add back to the queue. (x, wx), dummy = merge.pop_smallest() (y, wy), dummy = merge.pop_smallest() dummy, z = family.IF.weightedUnion(x, y, wx, wy) merge.add((z, 1), len(z)) (result, weight), dummy = merge.pop_smallest() return result def _trivial(L, family): # L is empty or has only one (mapping, weight) pair. If there is a # pair, we may still need to multiply the mapping by its weight. assert len(L) <= 1 if len(L) == 0: return family.IF.Bucket() [(result, weight)] = L if weight != 1: dummy, result = family.IF.weightedUnion( family.IF.Bucket(), result, 0, weight) return result zope.index-3.6.4/src/zope/index/text/parsetree.py0000644000175000017500000000722611727503631023172 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. 
# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Generic parser support: exception and parse tree nodes. """ from zope.index.text.interfaces import IQueryParseTree from zope.index.text.setops import mass_weightedIntersection from zope.index.text.setops import mass_weightedUnion from zope.interface import implements class QueryError(Exception): pass class ParseError(Exception): pass class ParseTreeNode(object): implements(IQueryParseTree) _nodeType = None def __init__(self, value): self._value = value def nodeType(self): return self._nodeType def getValue(self): return self._value def __repr__(self): return "%s(%r)" % (self.__class__.__name__, self.getValue()) def terms(self): t = [] for v in self.getValue(): t.extend(v.terms()) return t def executeQuery(self, index): raise NotImplementedError class NotNode(ParseTreeNode): _nodeType = "NOT" def terms(self): return [] def executeQuery(self, index): raise QueryError("NOT parse tree node cannot be executed directly") class AndNode(ParseTreeNode): _nodeType = "AND" def executeQuery(self, index): L = [] Nots = [] for subnode in self.getValue(): if subnode.nodeType() == "NOT": r = subnode.getValue().executeQuery(index) # If None, technically it matches every doc, but we treat # it as if it matched none (we want # real_word AND NOT stop_word # to act like plain real_word). if r is not None: Nots.append((r, 1)) else: r = subnode.executeQuery(index) # If None, technically it matches every doc, so needn't be # included. if r is not None: L.append((r, 1)) set = mass_weightedIntersection(L, index.family) if Nots: notset = mass_weightedUnion(Nots, index.family) set = index.family.IF.difference(set, notset) return set class OrNode(ParseTreeNode): _nodeType = "OR" def executeQuery(self, index): weighted = [] for node in self.getValue(): r = node.executeQuery(index) # If None, technically it matches every doc, but we treat # it as if it matched none (we want # real_word OR stop_word # to act like plain real_word). if r is not None: weighted.append((r, 1)) return mass_weightedUnion(weighted, index.family) class AtomNode(ParseTreeNode): _nodeType = "ATOM" def terms(self): return [self.getValue()] def executeQuery(self, index): return index.search(self.getValue()) class PhraseNode(AtomNode): _nodeType = "PHRASE" def executeQuery(self, index): return index.search_phrase(self.getValue()) class GlobNode(AtomNode): _nodeType = "GLOB" def executeQuery(self, index): return index.search_glob(self.getValue()) zope.index-3.6.4/src/zope/index/text/htmlsplitter.py0000644000175000017500000000256511727503631023734 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. 
# ############################################################################## """HTML Splitter """ import re from zope.interface import implements from zope.index.text.interfaces import ISplitter MARKUP = re.compile(r"(<[^<>]*>|&[A-Za-z]+;)") WORDS = re.compile(r"(?L)\w+") GLOBS = re.compile(r"(?L)\w+[\w*?]*") class HTMLWordSplitter(object): implements(ISplitter) def process(self, text): return self._apply(text, WORDS) def processGlob(self, text): # see Lexicon.globToWordIds() return self._apply(text, GLOBS) def _apply(self, text, pattern): result = [] for chunk in text: result.extend(self._split(chunk, pattern)) return result def _split(self, text, pattern): text = MARKUP.sub(' ', text.lower()) return pattern.findall(text) zope.index-3.6.4/src/zope/index/text/tests/0000755000175000017500000000000011727503757021772 5ustar tseavertseaver00000000000000zope.index-3.6.4/src/zope/index/text/tests/test_queryparser.py0000644000175000017500000003557211727503631025770 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Query Parser """ import unittest class TestQueryParserBase(unittest.TestCase): def _getTargetClass(self): from zope.index.text.queryparser import QueryParser return QueryParser def _makeOne(self, lexicon=None): if lexicon is None: lexicon = self._makeLexicon() return self._getTargetClass()(lexicon) def _makePipeline(self): from zope.index.text.lexicon import Splitter return (Splitter(),) def _makeLexicon(self): from zope.index.text.lexicon import Lexicon return Lexicon(*self._makePipeline()) def _expect(self, parser, input, output, expected_ignored=[]): tree = parser.parseQuery(input) ignored = parser.getIgnored() self._compareParseTrees(tree, output) self.assertEqual(ignored, expected_ignored) # Check that parseQueryEx() == (parseQuery(), getIgnored()) ex_tree, ex_ignored = parser.parseQueryEx(input) self._compareParseTrees(ex_tree, tree) self.assertEqual(ex_ignored, expected_ignored) def _failure(self, parser, input): from zope.index.text.parsetree import ParseError self.assertRaises(ParseError, parser.parseQuery, input) self.assertRaises(ParseError, parser.parseQueryEx, input) def _compareParseTrees(self, got, expected, msg=None): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode from zope.index.text.parsetree import GlobNode from zope.index.text.parsetree import NotNode from zope.index.text.parsetree import OrNode from zope.index.text.parsetree import ParseTreeNode from zope.index.text.parsetree import PhraseNode if msg is None: msg = repr(got) self.assertEqual(isinstance(got, ParseTreeNode), 1) self.assertEqual(got.__class__, expected.__class__, msg) if isinstance(got, PhraseNode): self.assertEqual(got.nodeType(), "PHRASE", msg) self.assertEqual(got.getValue(), expected.getValue(), msg) elif isinstance(got, GlobNode): self.assertEqual(got.nodeType(), "GLOB", msg) self.assertEqual(got.getValue(), expected.getValue(), msg) 
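        # (Illustrative aside: for the query 'aa AND NOT bb' -- see
        # test007 below -- the expected tree is
        # AndNode([AtomNode('aa'), NotNode(AtomNode('bb'))]); this
        # helper walks such trees and compares them node by node.)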
elif isinstance(got, AtomNode): self.assertEqual(got.nodeType(), "ATOM", msg) self.assertEqual(got.getValue(), expected.getValue(), msg) elif isinstance(got, NotNode): self.assertEqual(got.nodeType(), "NOT") self._compareParseTrees(got.getValue(), expected.getValue(), msg) elif isinstance(got, AndNode) or isinstance(got, OrNode): self.assertEqual(got.nodeType(), isinstance(got, AndNode) and "AND" or "OR", msg) list1 = got.getValue() list2 = expected.getValue() self.assertEqual(len(list1), len(list2), msg) for i in range(len(list1)): self._compareParseTrees(list1[i], list2[i], msg) class TestQueryParser(TestQueryParserBase): def test_class_conforms_to_IQueryParser(self): from zope.interface.verify import verifyClass from zope.index.text.interfaces import IQueryParser verifyClass(IQueryParser, self._getTargetClass()) def test_instance_conforms_to_IQueryParser(self): from zope.interface.verify import verifyObject from zope.index.text.interfaces import IQueryParser verifyObject(IQueryParser, self._makeOne()) def test001(self): from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, "foo", AtomNode("foo")) def test002(self): from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, "note", AtomNode("note")) def test003(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, "aa and bb AND cc", AndNode([AtomNode("aa"), AtomNode("bb"), AtomNode("cc")])) def test004(self): from zope.index.text.parsetree import OrNode from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, "aa OR bb or cc", OrNode([AtomNode("aa"), AtomNode("bb"), AtomNode("cc")])) def test005(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode from zope.index.text.parsetree import OrNode parser = self._makeOne() self._expect(parser, "aa AND bb OR cc AnD dd", OrNode([AndNode([AtomNode("aa"), AtomNode("bb")]), AndNode([AtomNode("cc"), AtomNode("dd")])])) def test006(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode from zope.index.text.parsetree import OrNode parser = self._makeOne() self._expect(parser, "(aa OR bb) AND (cc OR dd)", AndNode([OrNode([AtomNode("aa"), AtomNode("bb")]), OrNode([AtomNode("cc"), AtomNode("dd")])])) def test007(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode from zope.index.text.parsetree import NotNode parser = self._makeOne() self._expect(parser, "aa AND NOT bb", AndNode([AtomNode("aa"), NotNode(AtomNode("bb"))])) def test008(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode from zope.index.text.parsetree import NotNode parser = self._makeOne() self._expect(parser, "aa NOT bb", AndNode([AtomNode("aa"), NotNode(AtomNode("bb"))])) def test010(self): from zope.index.text.parsetree import PhraseNode parser = self._makeOne() self._expect(parser, '"foo bar"', PhraseNode(["foo", "bar"])) def test011(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, "foo bar", AndNode([AtomNode("foo"), AtomNode("bar")])) def test012(self): from zope.index.text.parsetree import PhraseNode parser = self._makeOne() self._expect(parser, '(("foo bar"))"', PhraseNode(["foo", "bar"])) def test013(self): from zope.index.text.parsetree import AndNode 
from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, "((foo bar))", AndNode([AtomNode("foo"), AtomNode("bar")])) def test014(self): from zope.index.text.parsetree import PhraseNode parser = self._makeOne() self._expect(parser, "foo-bar", PhraseNode(["foo", "bar"])) def test015(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode from zope.index.text.parsetree import NotNode parser = self._makeOne() self._expect(parser, "foo -bar", AndNode([AtomNode("foo"), NotNode(AtomNode("bar"))])) def test016(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode from zope.index.text.parsetree import NotNode parser = self._makeOne() self._expect(parser, "-foo bar", AndNode([AtomNode("bar"), NotNode(AtomNode("foo"))])) def test017(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode from zope.index.text.parsetree import NotNode from zope.index.text.parsetree import PhraseNode parser = self._makeOne() self._expect(parser, "booh -foo-bar", AndNode([AtomNode("booh"), NotNode(PhraseNode(["foo", "bar"]))])) def test018(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode from zope.index.text.parsetree import NotNode from zope.index.text.parsetree import PhraseNode parser = self._makeOne() self._expect(parser, 'booh -"foo bar"', AndNode([AtomNode("booh"), NotNode(PhraseNode(["foo", "bar"]))])) def test019(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, 'foo"bar"', AndNode([AtomNode("foo"), AtomNode("bar")])) def test020(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, '"foo"bar', AndNode([AtomNode("foo"), AtomNode("bar")])) def test021(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, 'foo"bar"blech', AndNode([AtomNode("foo"), AtomNode("bar"), AtomNode("blech")])) def test022(self): from zope.index.text.parsetree import GlobNode parser = self._makeOne() self._expect(parser, "foo*", GlobNode("foo*")) def test023(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode from zope.index.text.parsetree import GlobNode parser = self._makeOne() self._expect(parser, "foo* bar", AndNode([GlobNode("foo*"), AtomNode("bar")])) def test101(self): parser = self._makeOne() self._failure(parser, "") def test102(self): parser = self._makeOne() self._failure(parser, "not") def test103(self): parser = self._makeOne() self._failure(parser, "or") def test104(self): parser = self._makeOne() self._failure(parser, "and") def test105(self): parser = self._makeOne() self._failure(parser, "NOT") def test106(self): parser = self._makeOne() self._failure(parser, "OR") def test107(self): parser = self._makeOne() self._failure(parser, "AND") def test108(self): parser = self._makeOne() self._failure(parser, "NOT foo") def test109(self): parser = self._makeOne() self._failure(parser, ")") def test110(self): parser = self._makeOne() self._failure(parser, "(") def test111(self): parser = self._makeOne() self._failure(parser, "foo OR") def test112(self): parser = self._makeOne() self._failure(parser, "foo AND") def test113(self): parser = self._makeOne() self._failure(parser, "OR foo") def 
test114(self): parser = self._makeOne() self._failure(parser, "AND foo") def test115(self): parser = self._makeOne() self._failure(parser, "(foo) bar") def test116(self): parser = self._makeOne() self._failure(parser, "(foo OR)") def test117(self): parser = self._makeOne() self._failure(parser, "(foo AND)") def test118(self): parser = self._makeOne() self._failure(parser, "(NOT foo)") def test119(self): parser = self._makeOne() self._failure(parser, "-foo") def test120(self): parser = self._makeOne() self._failure(parser, "-foo -bar") def test121(self): parser = self._makeOne() self._failure(parser, "foo OR -bar") def test122(self): parser = self._makeOne() self._failure(parser, "foo AND -bar") class StopWordTestQueryParser(TestQueryParserBase): def _makePipeline(self): from zope.index.text.lexicon import Splitter return (Splitter(), FakeStopWordRemover()) def test201(self): from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, 'and/', AtomNode("and")) def test202(self): from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, 'foo AND stop', AtomNode("foo"), ["stop"]) def test203(self): from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, 'foo AND NOT stop', AtomNode("foo"), ["stop"]) def test204(self): from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, 'stop AND foo', AtomNode("foo"), ["stop"]) def test205(self): from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, 'foo OR stop', AtomNode("foo"), ["stop"]) def test206(self): from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, 'stop OR foo', AtomNode("foo"), ["stop"]) def test207(self): from zope.index.text.parsetree import AndNode from zope.index.text.parsetree import AtomNode parser = self._makeOne() self._expect(parser, 'foo AND bar NOT stop', AndNode([AtomNode("foo"), AtomNode("bar")]), ["stop"]) def test301(self): parser = self._makeOne() self._failure(parser, 'stop') def test302(self): parser = self._makeOne() self._failure(parser, 'stop stop') def test303(self): parser = self._makeOne() self._failure(parser, 'stop AND stop') def test304(self): parser = self._makeOne() self._failure(parser, 'stop OR stop') def test305(self): parser = self._makeOne() self._failure(parser, 'stop -foo') def test306(self): parser = self._makeOne() self._failure(parser, 'stop AND NOT foo') class FakeStopWordRemover(object): def process(self, list): return [word for word in list if word != "stop"] def test_suite(): return unittest.TestSuite(( unittest.makeSuite(TestQueryParser), unittest.makeSuite(StopWordTestQueryParser), )) zope.index-3.6.4/src/zope/index/text/tests/test_setops.py0000644000175000017500000002245511727503631024717 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. 
# ############################################################################## """Set Options tests """ import unittest _marker = object() class Test_mass_weightedIntersection(unittest.TestCase): def _callFUT(self, L, family=_marker): from zope.index.text.setops import mass_weightedIntersection if family is _marker: return mass_weightedIntersection(L) return mass_weightedIntersection(L, family) def test_empty_list_no_family(self): from BTrees.IFBTree import IFBucket t = self._callFUT([]) self.assertEqual(len(t), 0) self.assertEqual(t.__class__, IFBucket) def test_empty_list_family32(self): import BTrees from BTrees.IFBTree import IFBucket t = self._callFUT([], BTrees.family32) self.assertEqual(len(t), 0) self.assertEqual(t.__class__, IFBucket) def test_empty_list_family64(self): import BTrees from BTrees.LFBTree import LFBucket t = self._callFUT([], BTrees.family64) self.assertEqual(len(t), 0) self.assertEqual(t.__class__, LFBucket) def test_identity_tree(self): from BTrees.IFBTree import IFBTree x = IFBTree([(1, 2)]) result = self._callFUT([(x, 1)]) self.assertEqual(len(result), 1) self.assertEqual(list(result.items()), list(x.items())) def test_identity_bucket(self): from BTrees.IFBTree import IFBucket x = IFBucket([(1, 2)]) result = self._callFUT([(x, 1)]) self.assertEqual(len(result), 1) self.assertEqual(list(result.items()), list(x.items())) def test_scalar_multiply_tree(self): from BTrees.IFBTree import IFBTree x = IFBTree([(1, 2), (2, 3), (3, 4)]) allkeys = list(x.keys()) for factor in 0, 1, 5, 10: result = self._callFUT([(x, factor)]) self.assertEqual(allkeys, list(result.keys())) for key in x.keys(): self.assertEqual(result[key], x[key]*factor) def test_scalar_multiply_bucket(self): from BTrees.IFBTree import IFBucket x = IFBucket([(1, 2), (2, 3), (3, 4)]) allkeys = list(x.keys()) for factor in 0, 1, 5, 10: result = self._callFUT([(x, factor)]) self.assertEqual(allkeys, list(result.keys())) for key in x.keys(): self.assertEqual(result[key], x[key]*factor) def test_pairs(self): from BTrees.IFBTree import IFBTree from BTrees.IFBTree import IFBucket t1 = IFBTree([(1, 10), (3, 30), (7, 70)]) t2 = IFBTree([(3, 30), (5, 50), (7, 7), (9, 90)]) allkeys = [1, 3, 5, 7, 9] b1 = IFBucket(t1) b2 = IFBucket(t2) for x in t1, t2, b1, b2: for key in x.keys(): self.assertEqual(key in allkeys, 1) for y in t1, t2, b1, b2: for w1, w2 in (0, 0), (1, 10), (10, 1), (2, 3): expected = [] for key in allkeys: if x.has_key(key) and y.has_key(key): result = x[key] * w1 + y[key] * w2 expected.append((key, result)) expected.sort() got = self._callFUT([(x, w1), (y, w2)]) self.assertEqual(expected, list(got.items())) got = self._callFUT([(y, w2), (x, w1)]) self.assertEqual(expected, list(got.items())) def testMany(self): import random from BTrees.IFBTree import IFBTree N = 15 # number of IFBTrees to feed in L = [] commonkey = N * 1000 allkeys = {commonkey: 1} for i in range(N): t = IFBTree() t[commonkey] = i for j in range(N-i): key = i + j allkeys[key] = 1 t[key] = N*i + j L.append((t, i+1)) random.shuffle(L) allkeys = allkeys.keys() allkeys.sort() # Test the intersection. expected = [] for key in allkeys: sum = 0 for t, w in L: if t.has_key(key): sum += t[key] * w else: break else: # We didn't break out of the loop so it's in the intersection. 
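                # (Worked micro-example of the weighted-intersection
                # semantics under test, with made-up numbers: for
                # L = [(x, 3), (y, 1)] with x = {1: 2.0} and
                # y = {1: 5.0, 2: 1.0}, only key 1 is common to both
                # maps, and its combined weight is 2.0*3 + 5.0*1 == 11.0.)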
expected.append((key, sum)) # print 'intersection', expected got = self._callFUT(L) self.assertEqual(expected, list(got.items())) class Test_mass_weightedUnion(unittest.TestCase): def _callFUT(self, L, family=_marker): from zope.index.text.setops import mass_weightedUnion if family is _marker: return mass_weightedUnion(L) return mass_weightedUnion(L, family) def test_empty_list_no_family(self): from BTrees.IFBTree import IFBucket t = self._callFUT([]) self.assertEqual(len(t), 0) self.assertEqual(t.__class__, IFBucket) def test_empty_list_family32(self): import BTrees from BTrees.IFBTree import IFBucket t = self._callFUT([], BTrees.family32) self.assertEqual(len(t), 0) self.assertEqual(t.__class__, IFBucket) def test_empty_list_family64(self): import BTrees from BTrees.LFBTree import LFBucket t = self._callFUT([], BTrees.family64) self.assertEqual(len(t), 0) self.assertEqual(t.__class__, LFBucket) def test_identity_tree(self): from BTrees.IFBTree import IFBTree x = IFBTree([(1, 2)]) result = self._callFUT([(x, 1)]) self.assertEqual(len(result), 1) self.assertEqual(list(result.items()), list(x.items())) def test_identity_bucket(self): from BTrees.IFBTree import IFBucket x = IFBucket([(1, 2)]) result = self._callFUT([(x, 1)]) self.assertEqual(len(result), 1) self.assertEqual(list(result.items()), list(x.items())) def test_scalar_multiply_tree(self): from BTrees.IFBTree import IFBTree x = IFBTree([(1, 2), (2, 3), (3, 4)]) allkeys = list(x.keys()) for factor in 0, 1, 5, 10: result = self._callFUT([(x, factor)]) self.assertEqual(allkeys, list(result.keys())) for key in x.keys(): self.assertEqual(result[key], x[key]*factor) def test_scalar_multiply_bucket(self): from BTrees.IFBTree import IFBucket x = IFBucket([(1, 2), (2, 3), (3, 4)]) allkeys = list(x.keys()) for factor in 0, 1, 5, 10: result = self._callFUT([(x, factor)]) self.assertEqual(allkeys, list(result.keys())) for key in x.keys(): self.assertEqual(result[key], x[key]*factor) def test_pairs(self): from BTrees.IFBTree import IFBTree from BTrees.IFBTree import IFBucket t1 = IFBTree([(1, 10), (3, 30), (7, 70)]) t2 = IFBTree([(3, 30), (5, 50), (7, 7), (9, 90)]) allkeys = [1, 3, 5, 7, 9] b1 = IFBucket(t1) b2 = IFBucket(t2) for x in t1, t2, b1, b2: for key in x.keys(): self.assertEqual(key in allkeys, 1) for y in t1, t2, b1, b2: for w1, w2 in (0, 0), (1, 10), (10, 1), (2, 3): expected = [] for key in allkeys: if x.has_key(key) or y.has_key(key): result = x.get(key, 0) * w1 + y.get(key, 0) * w2 expected.append((key, result)) expected.sort() got = self._callFUT([(x, w1), (y, w2)]) self.assertEqual(expected, list(got.items())) got = self._callFUT([(y, w2), (x, w1)]) self.assertEqual(expected, list(got.items())) def test_many(self): import random from BTrees.IFBTree import IFBTree N = 15 # number of IFBTrees to feed in L = [] commonkey = N * 1000 allkeys = {commonkey: 1} for i in range(N): t = IFBTree() t[commonkey] = i for j in range(N-i): key = i + j allkeys[key] = 1 t[key] = N*i + j L.append((t, i+1)) random.shuffle(L) allkeys = allkeys.keys() allkeys.sort() expected = [] for key in allkeys: sum = 0 for t, w in L: if t.has_key(key): sum += t[key] * w expected.append((key, sum)) # print 'union', expected got = self._callFUT(L) self.assertEqual(expected, list(got.items())) def test_suite(): return unittest.TestSuite(( unittest.makeSuite(Test_mass_weightedIntersection), unittest.makeSuite(Test_mass_weightedUnion), )) zope.index-3.6.4/src/zope/index/text/tests/mhindex.py0000644000175000017500000004430511727503631023775 0ustar 
tseavertseaver00000000000000#!/usr/bin/env python2.4 ############################################################################## # # Copyright (c) 2003 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """MH mail indexer. To index messages from a single folder (messages defaults to 'all'): mhindex.py [options] -u +folder [messages ...] To bulk index all messages from several folders: mhindex.py [options] -b folder ...; the folder name ALL means all folders. To execute a single query: mhindex.py [options] query To enter interactive query mode: mhindex.py [options] Common options: -d FILE -- specify the Data.fs to use (default ~/.Data.fs) -w -- dump the word list in alphabetical order and exit -W -- dump the word list ordered by word id and exit Indexing options: -O -- do a prescan on the data to compute optimal word id assignments; this is only useful the first time the Data.fs is used -t N -- commit a transaction after every N messages (default 20000) -p N -- pack after every N commits (by default no packing is done) Querying options: -m N -- show at most N matching lines from the message (default 3) -n N -- show the N best matching messages (default 3) """ import os import re import sys import time import mhlib import getopt import traceback from StringIO import StringIO from stat import ST_MTIME DATAFS = "~/.mhindex.fs" ZOPECODE = "~/projects/Zope3/lib/python" zopecode = os.path.expanduser(ZOPECODE) sys.path.insert(0, zopecode) from ZODB.DB import DB from ZODB.Storage.FileStorage import FileStorage import transaction from BTrees.IOBTree import IOBTree from BTrees.OIBTree import OIBTree from BTrees.IIBTree import IIBTree from zope.index.text.okapiindex import OkapiIndex from zope.index.text.lexicon import Splitter from zope.index.text.lexicon import CaseNormalizer, StopWordRemover from zope.index.text.stopdict import get_stopdict from zope.index.text.textindexwrapper import TextIndexWrapper NBEST = 3 MAXLINES = 3 def main(): try: opts, args = getopt.getopt(sys.argv[1:], "bd:fhm:n:Op:t:uwW") except getopt.error, msg: print msg print "use -h for help" return 2 update = 0 bulk = 0 optimize = 0 nbest = NBEST maxlines = MAXLINES datafs = os.path.expanduser(DATAFS) pack = 0 trans = 20000 dumpwords = dumpwids = dumpfreqs = 0 for o, a in opts: if o == "-b": bulk = 1 if o == "-d": datafs = a if o == "-f": dumpfreqs = 1 if o == "-h": print __doc__ return if o == "-m": maxlines = int(a) if o == "-n": nbest = int(a) if o == "-O": optimize = 1 if o == "-p": pack = int(a) if o == "-t": trans = int(a) if o == "-u": update = 1 if o == "-w": dumpwords = 1 if o == "-W": dumpwids = 1 ix = Indexer(datafs, writable=update or bulk, trans=trans, pack=pack) if dumpfreqs: ix.dumpfreqs() if dumpwords: ix.dumpwords() if dumpwids: ix.dumpwids() if dumpwords or dumpwids or dumpfreqs: return if bulk: if optimize: ix.optimize(args) ix.bulkupdate(args) elif update: ix.update(args) elif args: for i in range(len(args)): a = args[i] if " " in a: if a[0] == "-": args[i] = '-"' + a[1:] + '"' else: args[i] = '"' + a + '"' ix.query(" ".join(args), 
nbest, maxlines) else: ix.interact(nbest) if pack: ix.pack() class Indexer(object): filestorage = database = connection = root = None def __init__(self, datafs, writable=0, trans=0, pack=0): self.trans_limit = trans self.pack_limit = pack self.trans_count = 0 self.pack_count = 0 self.stopdict = get_stopdict() self.mh = mhlib.MH() self.filestorage = FileStorage(datafs, read_only=(not writable)) self.database = DB(self.filestorage) self.connection = self.database.open() self.root = self.connection.root() try: self.index = self.root["index"] except KeyError: self.index = self.root["index"] = TextIndexWrapper() try: self.docpaths = self.root["docpaths"] except KeyError: self.docpaths = self.root["docpaths"] = IOBTree() try: self.doctimes = self.root["doctimes"] except KeyError: self.doctimes = self.root["doctimes"] = IIBTree() try: self.watchfolders = self.root["watchfolders"] except KeyError: self.watchfolders = self.root["watchfolders"] = {} self.path2docid = OIBTree() for docid in self.docpaths.keys(): path = self.docpaths[docid] self.path2docid[path] = docid try: self.maxdocid = max(self.docpaths.keys()) except ValueError: self.maxdocid = 0 print len(self.docpaths), "Document ids" print len(self.path2docid), "Pathnames" print self.index.lexicon.length(), "Words" def dumpfreqs(self): lexicon = self.index.lexicon index = self.index.index assert isinstance(index, OkapiIndex) L = [] for wid in lexicon.wids(): freq = 0 for f in index._wordinfo.get(wid, {}).values(): freq += f L.append((freq, wid, lexicon.get_word(wid))) L.sort() L.reverse() for freq, wid, word in L: print "%10d %10d %s" % (wid, freq, word) def dumpwids(self): lexicon = self.index.lexicon index = self.index.index assert isinstance(index, OkapiIndex) for wid in lexicon.wids(): freq = 0 for f in index._wordinfo.get(wid, {}).values(): freq += f print "%10d %10d %s" % (wid, freq, lexicon.get_word(wid)) def dumpwords(self): lexicon = self.index.lexicon index = self.index.index assert isinstance(index, OkapiIndex) for word in lexicon.words(): wid = lexicon.get_wid(word) freq = 0 for f in index._wordinfo.get(wid, {}).values(): freq += f print "%10d %10d %s" % (wid, freq, word) def close(self): self.root = None if self.connection is not None: self.connection.close() self.connection = None if self.database is not None: self.database.close() self.database = None if self.filestorage is not None: self.filestorage.close() self.filestorage = None def interact(self, nbest=NBEST, maxlines=MAXLINES): try: import readline except ImportError: pass text = "" top = 0 results = [] while 1: try: line = raw_input("Query: ") except EOFError: print "\nBye." break line = line.strip() if line.startswith("/"): self.specialcommand(line, results, top - nbest) continue if line: text = line top = 0 else: if not text: continue try: results, n = self.timequery(text, top + nbest) except KeyboardInterrupt: raise except: reportexc() text = "" continue if len(results) <= top: if not n: print "No hits for %r." % text else: print "No more hits for %r." % text text = "" continue print "[Results %d-%d from %d" % (top+1, min(n, top+nbest), n), print "for query %s]" % repr(text) self.formatresults(text, results, maxlines, top, top+nbest) top += nbest def specialcommand(self, line, results, first): assert line.startswith("/") line = line[1:] if not line: n = first else: try: n = int(line) - 1 except: print "Huh?" 
return if n < 0 or n >= len(results): print "Out of range" return docid, score = results[n] path = self.docpaths[docid] i = path.rfind("/") assert i > 0 folder = path[:i] n = path[i+1:] cmd = "show +%s %s" % (folder, n) if os.getenv("DISPLAY"): os.system("xterm -e sh -c '%s | less' &" % cmd) else: os.system(cmd) def query(self, text, nbest=NBEST, maxlines=MAXLINES): results, n = self.timequery(text, nbest) if not n: print "No hits for %r." % text return print "[Results 1-%d from %d]" % (len(results), n) self.formatresults(text, results, maxlines) def timequery(self, text, nbest): t0 = time.time() c0 = time.clock() results, n = self.index.query(text, 0, nbest) t1 = time.time() c1 = time.clock() print "[Query time: %.3f real, %.3f user]" % (t1-t0, c1-c0) return results, n def formatresults(self, text, results, maxlines=MAXLINES, lo=0, hi=sys.maxint): stop = self.stopdict.has_key words = [w for w in re.findall(r"\w+\*?", text.lower()) if not stop(w)] pattern = r"\b(" + "|".join(words) + r")\b" pattern = pattern.replace("*", ".*") # glob -> re syntax prog = re.compile(pattern, re.IGNORECASE) print '='*70 rank = lo for docid, score in results[lo:hi]: rank += 1 path = self.docpaths[docid] score *= 100.0 print "Rank: %d Score: %d%% File: %s" % (rank, score, path) path = os.path.join(self.mh.getpath(), path) try: fp = open(path) except (IOError, OSError), msg: print "Can't open:", msg continue msg = mhlib.Message("", 0, fp) for header in "From", "To", "Cc", "Bcc", "Subject", "Date": h = msg.getheader(header) if h: print "%-8s %s" % (header+":", h) text = self.getmessagetext(msg) if text: print nleft = maxlines for part in text: for line in part.splitlines(): if prog.search(line): print line nleft -= 1 if nleft <= 0: break if nleft <= 0: break print '-'*70 def update(self, args): folder = None seqs = [] for arg in args: if arg.startswith("+"): if folder is None: folder = arg[1:] else: print "only one folder at a time" return else: seqs.append(arg) if not folder: folder = self.mh.getcontext() if not seqs: seqs = ['all'] try: f = self.mh.openfolder(folder) except mhlib.Error, msg: print msg return dict = {} for seq in seqs: try: nums = f.parsesequence(seq) except mhlib.Error, msg: print msg or "unparsable message sequence: %s" % `seq` return for n in nums: dict[n] = n msgs = dict.keys() msgs.sort() self.updatefolder(f, msgs) self.commit() def optimize(self, args): uniqwords = {} for folder in args: if folder.startswith("+"): folder = folder[1:] print "\nOPTIMIZE FOLDER", folder try: f = self.mh.openfolder(folder) except mhlib.Error, msg: print msg continue self.prescan(f, f.listmessages(), uniqwords) L = [(uniqwords[word], word) for word in uniqwords.keys()] L.sort() L.reverse() for i in range(100): print "%3d. 
%6d %s" % ((i+1,) + L[i]) self.index.lexicon.sourceToWordIds([word for (count, word) in L]) def prescan(self, f, msgs, uniqwords): pipeline = [Splitter(), CaseNormalizer(), StopWordRemover()] for n in msgs: print "prescanning", n m = f.openmessage(n) text = self.getmessagetext(m, f.name) for p in pipeline: text = p.process(text) for word in text: uniqwords[word] = uniqwords.get(word, 0) + 1 def bulkupdate(self, args): if not args: print "No folders specified; use ALL to bulk-index all folders" return if "ALL" in args: i = args.index("ALL") args[i:i+1] = self.mh.listfolders() for folder in args: if folder.startswith("+"): folder = folder[1:] print "\nFOLDER", folder try: f = self.mh.openfolder(folder) except mhlib.Error, msg: print msg continue self.updatefolder(f, f.listmessages()) print "Total", len(self.docpaths) self.commit() print "Indexed", self.index.lexicon._nbytes, "bytes and", print self.index.lexicon._nwords, "words;", print len(self.index.lexicon._words), "unique words." def updatefolder(self, f, msgs): self.watchfolders[f.name] = self.getmtime(f.name) for n in msgs: path = "%s/%s" % (f.name, n) docid = self.path2docid.get(path, 0) if docid and self.getmtime(path) == self.doctimes.get(docid, 0): print "unchanged", docid, path continue docid = self.newdocid(path) try: m = f.openmessage(n) except IOError: print "disappeared", docid, path self.unindexpath(path) continue text = self.getmessagetext(m, f.name) if not text: self.unindexpath(path) continue print "indexing", docid, path self.index.index_doc(docid, text) self.maycommit() # Remove messages from the folder that no longer exist for path in list(self.path2docid.keys(f.name)): if not path.startswith(f.name + "/"): break if self.getmtime(path) == 0: self.unindexpath(path) print "done." def unindexpath(self, path): if self.path2docid.has_key(path): docid = self.path2docid[path] print "unindexing", docid, path del self.docpaths[docid] del self.doctimes[docid] del self.path2docid[path] try: self.index.unindex_doc(docid) except KeyError, msg: print "KeyError", msg self.maycommit() def getmessagetext(self, m, name=None): L = [] if name: L.append("_folder " + name) # To restrict search to a folder self.getheaders(m, L) try: self.getmsgparts(m, L, 0) except KeyboardInterrupt: raise except: print "(getmsgparts failed:)" reportexc() return L def getmsgparts(self, m, L, level): ctype = m.gettype() if level or ctype != "text/plain": print ". "*level + str(ctype) if ctype == "text/plain": L.append(m.getbodytext()) elif ctype in ("multipart/alternative", "multipart/mixed"): for part in m.getbodyparts(): self.getmsgparts(part, L, level+1) elif ctype == "message/rfc822": f = StringIO(m.getbodytext()) m = mhlib.Message("", 0, f) self.getheaders(m, L) self.getmsgparts(m, L, level+1) def getheaders(self, m, L): H = [] for key in "from", "to", "cc", "bcc", "subject": value = m.get(key) if value: H.append(value) if H: L.append("\n".join(H)) def newdocid(self, path): docid = self.path2docid.get(path) if docid is not None: self.doctimes[docid] = self.getmtime(path) return docid docid = self.maxdocid + 1 self.maxdocid = docid self.docpaths[docid] = path self.doctimes[docid] = self.getmtime(path) self.path2docid[path] = docid return docid def getmtime(self, path): path = os.path.join(self.mh.getpath(), path) try: st = os.stat(path) except os.error, msg: return 0 return int(st[ST_MTIME]) def maycommit(self): self.trans_count += 1 if self.trans_count >= self.trans_limit > 0: self.commit() def commit(self): if self.trans_count > 0: print "committing..." 
transaction.commit() self.trans_count = 0 self.pack_count += 1 if self.pack_count >= self.pack_limit > 0: self.pack() def pack(self): if self.pack_count > 0: print "packing..." self.database.pack() self.pack_count = 0 def reportexc(): traceback.print_exc() if __name__ == "__main__": sys.exit(main()) zope.index-3.6.4/src/zope/index/text/tests/test_lexicon.py0000644000175000017500000003521711727503631025043 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE # ############################################################################## """Lexicon tests """ import unittest class LexiconTests(unittest.TestCase): def _getTargetClass(self): from zope.index.text.lexicon import Lexicon return Lexicon def _makeOne(self, *pipeline): from zope.index.text.lexicon import Splitter pipeline = (Splitter(),) + pipeline return self._getTargetClass()(*pipeline) def test_class_conforms_to_ILexicon(self): from zope.interface.verify import verifyClass from zope.index.text.interfaces import ILexicon verifyClass(ILexicon, self._getTargetClass()) def test_instance_conforms_to_ILexicon(self): from zope.interface.verify import verifyObject from zope.index.text.interfaces import ILexicon verifyObject(ILexicon, self._makeOne()) def test_empty(self): lexicon = self._makeOne() self.assertEqual(len(lexicon.words()), 0) self.assertEqual(len(lexicon.wids()), 0) self.assertEqual(len(lexicon.items()), 0) self.assertEqual(lexicon.wordCount(), 0) def test_wordCount_legacy_instance_no_write_on_read(self): from BTrees.Length import Length lexicon = self._makeOne() # Simulate old instance, which didn't have Length attr del lexicon.wordCount self.assertEqual(lexicon.wordCount(), 0) # No write-on-read! 
self.failIf(isinstance(lexicon.wordCount, Length)) def test_sourceToWordIds_empty_string(self): lexicon = self._makeOne() wids = lexicon.sourceToWordIds('') self.assertEqual(wids, []) def test_sourceToWordIds_none(self): # See LP #598776 lexicon = self._makeOne() wids = lexicon.sourceToWordIds(None) self.assertEqual(wids, []) def test_sourceToWordIds(self): lexicon = self._makeOne() wids = lexicon.sourceToWordIds('cats and dogs') self.assertEqual(wids, [1, 2, 3]) self.assertEqual(lexicon.get_word(1), 'cats') self.assertEqual(lexicon.get_wid('cats'), 1) def test_sourceToWordIds_promotes_wordCount_attr(self): from BTrees.Length import Length lexicon = self._makeOne() # Simulate old instance, which didn't have Length attr del lexicon.wordCount wids = lexicon.sourceToWordIds('cats and dogs') self.assertEqual(wids, [1, 2, 3]) self.assertEqual(lexicon.wordCount(), 3) self.failUnless(isinstance(lexicon.wordCount, Length)) def test_termToWordIds_hit(self): lexicon = self._makeOne() lexicon.sourceToWordIds('cats and dogs') wids = lexicon.termToWordIds('dogs') self.assertEqual(wids, [3]) def test_termToWordIds_miss(self): lexicon = self._makeOne() lexicon.sourceToWordIds('cats and dogs') wids = lexicon.termToWordIds('boxes') self.assertEqual(wids, [0]) def test_termToWordIds_w_extra_pipeline_element(self): lexicon = self._makeOne(StupidPipelineElement('dogs', 'fish')) lexicon.sourceToWordIds('cats and dogs') wids = lexicon.termToWordIds('fish') self.assertEqual(wids, [3]) def test_termToWordIds_w_case_normalizer(self): from zope.index.text.lexicon import CaseNormalizer lexicon = self._makeOne(CaseNormalizer()) lexicon.sourceToWordIds('CATS and dogs') wids = lexicon.termToWordIds('cats and dogs') self.assertEqual(wids, [1, 2, 3]) def test_termToWordIds_wo_case_normalizer(self): lexicon = self._makeOne() wids = lexicon.sourceToWordIds('CATS and dogs') wids = lexicon.termToWordIds('cats and dogs') self.assertEqual(wids, [0, 2, 3]) def test_termToWordIds_w_two_extra_pipeline_elements(self): lexicon = self._makeOne(StupidPipelineElement('cats', 'fish'), WackyReversePipelineElement('fish'), ) lexicon.sourceToWordIds('cats and dogs') wids = lexicon.termToWordIds('hsif') self.assertEqual(wids, [1]) def test_termToWordIds_w_three_extra_pipeline_elements(self): lexicon = self._makeOne(StopWordPipelineElement({'and':1}), StupidPipelineElement('dogs', 'fish'), WackyReversePipelineElement('fish'), ) wids = lexicon.sourceToWordIds('cats and dogs') wids = lexicon.termToWordIds('hsif') self.assertEqual(wids, [2]) def test_parseTerms_tuple(self): TERMS = ('a', 'b*c', 'de?f') lexicon = self._makeOne() self.assertEqual(lexicon.parseTerms(TERMS), list(TERMS)) def test_parseTerms_list(self): TERMS = ['a', 'b*c', 'de?f'] lexicon = self._makeOne() self.assertEqual(lexicon.parseTerms(TERMS), list(TERMS)) def test_parseTerms_nonempty_string(self): lexicon = self._makeOne() self.assertEqual(lexicon.parseTerms('a b*c de?f'), ['a', 'b*c', 'de?f']) def test_parseTerms_empty_string(self): lexicon = self._makeOne() self.assertEqual(lexicon.parseTerms(''), []) def test_isGlob_empty(self): lexicon = self._makeOne() self.failIf(lexicon.isGlob('')) def test_isGlob_miss(self): lexicon = self._makeOne() self.failIf(lexicon.isGlob('abc')) def test_isGlob_question_mark(self): lexicon = self._makeOne() self.failUnless(lexicon.isGlob('a?c')) def test_isGlob_asterisk(self): lexicon = self._makeOne() self.failUnless(lexicon.isGlob('abc*')) def test_globToWordIds_invalid_pattern(self): from zope.index.text.parsetree import QueryError lexicon
= self._makeOne() lexicon.sourceToWordIds('cats and dogs') self.assertRaises(QueryError, lexicon.globToWordIds, '*s') def test_globToWordIds_simple_pattern(self): lexicon = self._makeOne() lexicon.sourceToWordIds('cats and dogs are enemies') self.assertEqual(lexicon.globToWordIds('a*'), [2, 4]) def test_globToWordIds_simple_pattern2(self): lexicon = self._makeOne() lexicon.sourceToWordIds('cats and dogs are enemies') self.assertEqual(lexicon.globToWordIds('a?e'), [4]) def test_globToWordIds_prefix(self): lexicon = self._makeOne() lexicon.sourceToWordIds('cats and dogs are enemies') self.assertEqual(lexicon.globToWordIds('are'), [4]) def test_getWordIdCreate_new(self): lexicon = self._makeOne() wid = lexicon._getWordIdCreate('nonesuch') self.assertEqual(wid, 1) self.assertEqual(lexicon.get_word(1), 'nonesuch') self.assertEqual(lexicon.get_wid('nonesuch'), 1) def test_getWordIdCreate_extant(self): lexicon = self._makeOne() lexicon.sourceToWordIds('cats and dogs are enemies') wid = lexicon._getWordIdCreate('cats') self.assertEqual(wid, 1) self.assertEqual(lexicon.get_word(1), 'cats') self.assertEqual(lexicon.get_wid('cats'), 1) def test__new_wid_recovers_from_damaged_length(self): lexicon = self._makeOne() lexicon.sourceToWordIds('cats and dogs') lexicon.wordCount.set(0) wid = lexicon._new_wid() self.assertEqual(wid, 4) self.assertEqual(lexicon.wordCount(), 4) class SplitterTests(unittest.TestCase): _old_locale = None def tearDown(self): if self._old_locale is not None: import locale locale.setlocale(locale.LC_ALL, self._old_locale) def _getTargetClass(self): from zope.index.text.lexicon import Splitter return Splitter def _makeOne(self): return self._getTargetClass()() def test_class_conforms_to_ISplitter(self): from zope.interface.verify import verifyClass from zope.index.text.interfaces import ISplitter verifyClass(ISplitter, self._getTargetClass()) def test_instance_conforms_to_ISplitter(self): from zope.interface.verify import verifyObject from zope.index.text.interfaces import ISplitter verifyObject(ISplitter, self._makeOne()) def test_process_empty_string(self): splitter = self._makeOne() self.assertEqual(splitter.process(['']), []) def test_process_simple(self): splitter = self._makeOne() self.assertEqual(splitter.process(['abc def']), ['abc', 'def']) def test_process_w_locale_awareness(self): import locale import sys self._old_locale = locale.setlocale(locale.LC_ALL) # set German locale try: if sys.platform == 'win32': locale.setlocale(locale.LC_ALL, 'German_Germany.1252') else: locale.setlocale(locale.LC_ALL, 'de_DE.ISO8859-1') except locale.Error: return # This test doesn't work here :-( expected = ['m\xfclltonne', 'waschb\xe4r', 'beh\xf6rde', '\xfcberflieger'] splitter = self._makeOne() self.assertEqual(splitter.process([' '.join(expected)]), expected) def test_process_w_glob(self): splitter = self._makeOne() self.assertEqual(splitter.process(['abc?def hij*klm nop* qrs?']), ['abc', 'def', 'hij', 'klm', 'nop', 'qrs']) def test_processGlob_empty_string(self): splitter = self._makeOne() self.assertEqual(splitter.processGlob(['']), []) def test_processGlob_simple(self): splitter = self._makeOne() self.assertEqual(splitter.processGlob(['abc def']), ['abc', 'def']) def test_processGlob_w_glob(self): splitter = self._makeOne() self.assertEqual(splitter.processGlob(['abc?def hij*klm nop* qrs?']), ['abc?def', 'hij*klm', 'nop*', 'qrs?']) class CaseNormalizerTests(unittest.TestCase): def _getTargetClass(self): from zope.index.text.lexicon import CaseNormalizer return CaseNormalizer def 
_makeOne(self): return self._getTargetClass()() def test_class_conforms_to_IPipelineElement(self): from zope.interface.verify import verifyClass from zope.index.text.interfaces import IPipelineElement verifyClass(IPipelineElement, self._getTargetClass()) def test_instance_conforms_to_IPipelineElement(self): from zope.interface.verify import verifyObject from zope.index.text.interfaces import IPipelineElement verifyObject(IPipelineElement, self._makeOne()) def test_process_empty(self): cn = self._makeOne() self.assertEqual(cn.process([]), []) def test_process_nonempty(self): cn = self._makeOne() self.assertEqual(cn.process(['ABC Def']), ['abc def']) class StopWordRemoverTests(unittest.TestCase): def _getTargetClass(self): from zope.index.text.lexicon import StopWordRemover return StopWordRemover def _makeOne(self): return self._getTargetClass()() def test_class_conforms_to_IPipelineElement(self): from zope.interface.verify import verifyClass from zope.index.text.interfaces import IPipelineElement verifyClass(IPipelineElement, self._getTargetClass()) def test_instance_conforms_to_IPipelineElement(self): from zope.interface.verify import verifyObject from zope.index.text.interfaces import IPipelineElement verifyObject(IPipelineElement, self._makeOne()) def test_process_empty(self): cn = self._makeOne() self.assertEqual(cn.process([]), []) def test_process_nonempty(self): QUOTE = 'The end of government is justice' cn = self._makeOne() self.assertEqual(cn.process(QUOTE.lower().split()), ['end', 'government', 'justice']) class StopWordAndSingleCharRemoverTests(unittest.TestCase): def _getTargetClass(self): from zope.index.text.lexicon import StopWordAndSingleCharRemover return StopWordAndSingleCharRemover def _makeOne(self): return self._getTargetClass()() def test_class_conforms_to_IPipelineElement(self): from zope.interface.verify import verifyClass from zope.index.text.interfaces import IPipelineElement verifyClass(IPipelineElement, self._getTargetClass()) def test_instance_conforms_to_IPipelineElement(self): from zope.interface.verify import verifyObject from zope.index.text.interfaces import IPipelineElement verifyObject(IPipelineElement, self._makeOne()) def test_process_empty(self): cn = self._makeOne() self.assertEqual(cn.process([]), []) def test_process_nonempty(self): QUOTE = 'The end of government is justice z x q' cn = self._makeOne() self.assertEqual(cn.process(QUOTE.lower().split()), ['end', 'government', 'justice']) class StupidPipelineElement(object): def __init__(self, fromword, toword): self.__fromword = fromword self.__toword = toword def process(self, seq): res = [] for term in seq: if term == self.__fromword: res.append(self.__toword) else: res.append(term) return res class WackyReversePipelineElement(object): def __init__(self, revword): self.__revword = revword def process(self, seq): res = [] for term in seq: if term == self.__revword: x = list(term) x.reverse() res.append(''.join(x)) else: res.append(term) return res class StopWordPipelineElement(object): def __init__(self, stopdict={}): self.__stopdict = stopdict def process(self, seq): res = [] for term in seq: if self.__stopdict.get(term): continue else: res.append(term) return res def test_suite(): return unittest.TestSuite(( unittest.makeSuite(LexiconTests), unittest.makeSuite(SplitterTests), unittest.makeSuite(CaseNormalizerTests), unittest.makeSuite(StopWordRemoverTests), unittest.makeSuite(StopWordAndSingleCharRemoverTests), )) 
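# Usage sketch (comments only, not executed by the test runner): how the
# pieces above fit together, using only APIs exercised by these tests:
#
#   from zope.index.text.lexicon import CaseNormalizer, Lexicon, Splitter
#   lexicon = Lexicon(Splitter(), CaseNormalizer())
#   lexicon.sourceToWordIds('Cats and Dogs')   # assigns wids -> [1, 2, 3]
#   lexicon.get_word(1)                        # -> 'cats'
#   lexicon.termToWordIds('CATS')              # normalized lookup -> [1]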
zope.index-3.6.4/src/zope/index/text/tests/test_queryengine.py0000644000175000017500000000530011727503631025723 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Query Engine tests """ import unittest class FauxIndex(object): def _get_family(self): import BTrees return BTrees.family32 family = property(_get_family,) def search(self, term): b = self.family.IF.Bucket() if term == "foo": b[1] = b[3] = 1 elif term == "bar": b[1] = b[2] = 1 elif term == "ham": b[1] = b[2] = b[3] = b[4] = 1 return b class TestQueryEngine(unittest.TestCase): def _makeIndexAndParser(self): from zope.index.text.lexicon import Lexicon from zope.index.text.lexicon import Splitter from zope.index.text.queryparser import QueryParser lexicon = Lexicon(Splitter()) parser = QueryParser(lexicon) index = FauxIndex() return index, parser def _compareSet(self, set, dict): d = {} for k, v in set.items(): d[k] = v self.assertEqual(d, dict) def _compareQuery(self, query, dict): index, parser = self._makeIndexAndParser() tree = parser.parseQuery(query) set = tree.executeQuery(index) self._compareSet(set, dict) def testExecuteQuery(self): self._compareQuery("foo AND bar", {1: 2}) self._compareQuery("foo OR bar", {1: 2, 2: 1, 3:1}) self._compareQuery("foo AND NOT bar", {3: 1}) self._compareQuery("foo AND foo AND foo", {1: 3, 3: 3}) self._compareQuery("foo OR foo OR foo", {1: 3, 3: 3}) self._compareQuery("ham AND NOT foo AND NOT bar", {4: 1}) self._compareQuery("ham OR foo OR bar", {1: 3, 2: 2, 3: 2, 4: 1}) self._compareQuery("ham AND foo AND bar", {1: 3}) def testInvalidQuery(self): from zope.index.text.parsetree import AtomNode from zope.index.text.parsetree import NotNode from zope.index.text.parsetree import QueryError index, parser = self._makeIndexAndParser() tree = NotNode(AtomNode("foo")) self.assertRaises(QueryError, tree.executeQuery, index) def test_suite(): return unittest.makeSuite(TestQueryEngine) zope.index-3.6.4/src/zope/index/text/tests/test_textindex.py0000644000175000017500000002140311727503631025406 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2009 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. 
# ############################################################################## """Text Index Tests """ import unittest _marker = object() class TextIndexTests(unittest.TestCase): def _getTargetClass(self): from zope.index.text.textindex import TextIndex return TextIndex def _makeOne(self, lexicon=_marker, index=_marker): if lexicon is _marker: if index is _marker: # defaults return self._getTargetClass()() else: return self._getTargetClass()(index=index) else: if index is _marker: return self._getTargetClass()(lexicon) else: return self._getTargetClass()(lexicon, index) def _makeLexicon(self, *pipeline): from zope.index.text.lexicon import Lexicon from zope.index.text.lexicon import Splitter if not pipeline: pipeline = (Splitter(),) return Lexicon(*pipeline) def _makeOkapi(self, lexicon=None, family=None): import BTrees from zope.index.text.okapiindex import OkapiIndex if lexicon is None: lexicon = self._makeLexicon() if family is None: family = BTrees.family64 return OkapiIndex(lexicon, family=family) def test_class_conforms_to_IInjection(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IInjection verifyClass(IInjection, self._getTargetClass()) def test_instance_conforms_to_IInjection(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IInjection verifyObject(IInjection, self._makeOne()) def test_class_conforms_to_IIndexSearch(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IIndexSearch verifyClass(IIndexSearch, self._getTargetClass()) def test_instance_conforms_to_IIndexSearch(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IIndexSearch verifyObject(IIndexSearch, self._makeOne()) def test_class_conforms_to_IStatistics(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IStatistics verifyClass(IStatistics, self._getTargetClass()) def test_instance_conforms_to_IStatistics(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IStatistics verifyObject(IStatistics, self._makeOne()) def test_ctor_defaults(self): index = self._makeOne() from zope.index.text.lexicon import CaseNormalizer from zope.index.text.lexicon import Lexicon from zope.index.text.lexicon import Splitter from zope.index.text.lexicon import StopWordRemover from zope.index.text.okapiindex import OkapiIndex self.failUnless(isinstance(index.index, OkapiIndex)) self.failUnless(isinstance(index.lexicon, Lexicon)) self.failUnless(index.index._lexicon is index.lexicon) pipeline = index.lexicon._pipeline self.assertEqual(len(pipeline), 3) self.failUnless(isinstance(pipeline[0], Splitter)) self.failUnless(isinstance(pipeline[1], CaseNormalizer)) self.failUnless(isinstance(pipeline[2], StopWordRemover)) def test_ctor_explicit_lexicon(self): from zope.index.text.okapiindex import OkapiIndex lexicon = object() index = self._makeOne(lexicon) self.failUnless(index.lexicon is lexicon) self.failUnless(isinstance(index.index, OkapiIndex)) self.failUnless(index.index._lexicon is lexicon) def test_ctor_explicit_index(self): lexicon = object() okapi = DummyOkapi(lexicon) index = self._makeOne(index=okapi) self.failUnless(index.index is okapi) # See LP #232516 self.failUnless(index.lexicon is lexicon) def test_ctor_explicit_lexicon_and_index(self): lexicon = object() okapi = object() index = self._makeOne(lexicon, okapi) self.failUnless(index.lexicon is lexicon) self.failUnless(index.index is okapi) def test_index_doc(self): 
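        # TextIndex.index_doc should delegate straight to the wrapped index;
        # the DummyOkapi defined below records each (docid, text) call.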
lexicon = object() okapi = DummyOkapi(lexicon) index = self._makeOne(lexicon, okapi) index.index_doc(1, 'cats and dogs') self.assertEqual(okapi._indexed[0], (1, 'cats and dogs')) def test_unindex_doc(self): lexicon = object() okapi = DummyOkapi(lexicon) index = self._makeOne(lexicon, okapi) index.unindex_doc(1) self.assertEqual(okapi._unindexed[0], 1) def test_clear(self): lexicon = object() okapi = DummyOkapi(lexicon) index = self._makeOne(lexicon, okapi) index.clear() self.failUnless(okapi._cleared) def test_documentCount(self): lexicon = object() okapi = DummyOkapi(lexicon) index = self._makeOne(lexicon, okapi) self.assertEqual(index.documentCount(), 4) def test_wordCount(self): lexicon = object() okapi = DummyOkapi(lexicon) index = self._makeOne(lexicon, okapi) self.assertEqual(index.wordCount(), 45) def test_apply_no_results(self): lexicon = DummyLexicon() okapi = DummyOkapi(lexicon, {}) index = self._makeOne(lexicon, okapi) self.assertEqual(index.apply('anything'), {}) self.assertEqual(okapi._query_weighted, []) self.assertEqual(okapi._searched, ['anything']) def test_apply_w_results(self): lexicon = DummyLexicon() okapi = DummyOkapi(lexicon) index = self._makeOne(lexicon, okapi) results = index.apply('anything') self.assertEqual(results[1], 14.0 / 42.0) self.assertEqual(results[2], 7.4 / 42.0) self.assertEqual(results[3], 3.2 / 42.0) self.assertEqual(okapi._query_weighted[0], ['anything']) self.assertEqual(okapi._searched, ['anything']) def test_apply_w_results_zero_query_weight(self): lexicon = DummyLexicon() okapi = DummyOkapi(lexicon) okapi._query_weight = 0 index = self._makeOne(lexicon, okapi) results = index.apply('anything') self.assertEqual(results[1], 14.0) self.assertEqual(results[2], 7.4) self.assertEqual(results[3], 3.2) self.assertEqual(okapi._query_weighted[0], ['anything']) self.assertEqual(okapi._searched, ['anything']) def test_apply_w_results_bogus_query_weight(self): import sys DIVISOR = sys.maxint / 10 lexicon = DummyLexicon() # cause TypeError in division okapi = DummyOkapi(lexicon, {1: '14.0', 2: '7.4', 3: '3.2'}) index = self._makeOne(lexicon, okapi) results = index.apply('anything') self.assertEqual(results[1], DIVISOR) self.assertEqual(results[2], DIVISOR) self.assertEqual(results[3], DIVISOR) self.assertEqual(okapi._query_weighted[0], ['anything']) self.assertEqual(okapi._searched, ['anything']) class DummyOkapi: _cleared = False _document_count = 4 _word_count = 45 _query_weight = 42.0 def __init__(self, lexicon, search_results=None): self.lexicon = lexicon self._indexed = [] self._unindexed = [] self._searched = [] self._query_weighted = [] if search_results is None: search_results = {1: 14.0, 2: 7.4, 3: 3.2} self._search_results = search_results def index_doc(self, docid, text): self._indexed.append((docid, text)) def unindex_doc(self, docid): self._unindexed.append(docid) def clear(self): self._cleared = True def documentCount(self): return self._document_count def wordCount(self): return self._word_count def query_weight(self, terms): self._query_weighted.append(terms) return self._query_weight def search(self, term): self._searched.append(term) return self._search_results search_phrase = search_glob = search class DummyLexicon: def parseTerms(self, term): return term def test_suite(): return unittest.TestSuite(( unittest.makeSuite(TextIndexTests), )) zope.index-3.6.4/src/zope/index/text/tests/hs-tool.py0000644000175000017500000000754211727503631023730 0ustar tseavertseaver00000000000000#! 
/usr/bin/env python ############################################################################## # # Copyright (c) 2003 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """HS-Tool """ import cPickle import os.path import sys from hotshot.log import LogReader def load_line_info(log): byline = {} prevloc = None for what, place, tdelta in log: if tdelta > 0: t, nhits = byline.get(prevloc, (0, 0)) byline[prevloc] = (tdelta + t), (nhits + 1) prevloc = place return byline def basename(path, cache={}): try: return cache[path] except KeyError: fn = os.path.basename(path) cache[path] = fn return fn def print_results(results): for info, place in results: if place is None: # This is the startup time for the profiler, and only # occurs at the very beginning. Just ignore it, since it # corresponds to frame setup of the outermost call, not # anything that's actually interesting. continue filename, line, funcname = place print '%8d %8d' % info, basename(filename), line def annotate_results(results): files = {} for stats, place in results: if not place: continue time, hits = stats file, line, func = place l = files.get(file) if l is None: l = files[file] = [] l.append((line, hits, time)) order = files.keys() order.sort() for k in order: if os.path.exists(k): v = files[k] v.sort() annotate(k, v) def annotate(file, lines): print "-" * 60 print file print "-" * 60 f = open(file) i = 1 match = lines[0][0] for line in f: if match == i: print "%6d %8d " % lines[0][1:], line, del lines[0] if lines: match = lines[0][0] else: match = None else: print " " * 16, line, i += 1 print def get_cache_name(filename): d, fn = os.path.split(filename) cache_dir = os.path.join(d, '.hs-tool') cache_file = os.path.join(cache_dir, fn) return cache_dir, cache_file def cache_results(filename, results): cache_dir, cache_file = get_cache_name(filename) if not os.path.exists(cache_dir): os.mkdir(cache_dir) fp = open(cache_file, 'wb') try: cPickle.dump(results, fp, 1) finally: fp.close() def main(filename, annotate): cache_dir, cache_file = get_cache_name(filename) if ( os.path.isfile(cache_file) and os.path.getmtime(cache_file) > os.path.getmtime(filename)): # cached data is up-to-date: fp = open(cache_file, 'rb') results = cPickle.load(fp) fp.close() else: log = LogReader(filename) byline = load_line_info(log) # Sort results = [(v, k) for k, v in byline.items()] results.sort() cache_results(filename, results) if annotate: annotate_results(results) else: print_results(results) if __name__ == "__main__": import getopt annotate_p = 0 opts, args = getopt.getopt(sys.argv[1:], 'A') for o, v in opts: if o == '-A': annotate_p = 1 if args: filename, = args else: filename = "profile.dat" main(filename, annotate_p) zope.index-3.6.4/src/zope/index/text/tests/test_baseindex.py0000644000175000017500000005017211727503631025341 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2009 Zope Foundation and Contributors. # All Rights Reserved. 
# # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Text Index Tests """ import unittest class BaseIndexTestBase: # Subclasses must define '_getBTreesFamily' def _getTargetClass(self): from zope.index.text.baseindex import BaseIndex return BaseIndex def _makeOne(self, family=None): from zope.index.text.lexicon import Lexicon from zope.index.text.lexicon import Splitter if family is None: family = self._getBTreesFamily() lexicon = Lexicon(Splitter()) return self._getTargetClass()(lexicon, family=family) def test_class_conforms_to_IInjection(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IInjection verifyClass(IInjection, self._getTargetClass()) def test_instance_conforms_to_IInjection(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IInjection verifyObject(IInjection, self._makeOne()) def test_class_conforms_to_IStatistics(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IStatistics verifyClass(IStatistics, self._getTargetClass()) def test_instance_conforms_to_IStatistics(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IStatistics verifyObject(IStatistics, self._makeOne()) def test_class_conforms_to_ILexiconBasedIndex(self): from zope.interface.verify import verifyClass from zope.index.text.interfaces import ILexiconBasedIndex verifyClass(ILexiconBasedIndex, self._getTargetClass()) def test_instance_conforms_to_ILexiconBasedIndex(self): from zope.interface.verify import verifyObject from zope.index.text.interfaces import ILexiconBasedIndex verifyObject(ILexiconBasedIndex, self._makeOne()) def test_class_conforms_to_IExtendedQuerying(self): from zope.interface.verify import verifyClass from zope.index.text.interfaces import IExtendedQuerying verifyClass(IExtendedQuerying, self._getTargetClass()) def test_instance_conforms_to_IExtendedQuerying(self): from zope.interface.verify import verifyObject from zope.index.text.interfaces import IExtendedQuerying verifyObject(IExtendedQuerying, self._makeOne()) def test_empty(self): index = self._makeOne() self.assertEqual(len(index._wordinfo), 0) self.assertEqual(len(index._docweight), 0) self.assertEqual(len(index._docwords), 0) self.assertEqual(index.wordCount(), 0) self.assertEqual(index.documentCount(), 0) self.failIf(index.has_doc(1)) def test_clear_doesnt_lose_family(self): import BTrees index = self._makeOne(family=BTrees.family64) index.clear() self.failUnless(index.family is BTrees.family64) def test_wordCount_method_raises_NotImplementedError(self): class DerivedDoesntSet_wordCount(self._getTargetClass()): def __init__(self): pass index = DerivedDoesntSet_wordCount() self.assertRaises(NotImplementedError, index.wordCount) def test_documentCount_method_raises_NotImplementedError(self): class DerivedDoesntSet_documentCount(self._getTargetClass()): def __init__(self): pass index = DerivedDoesntSet_documentCount() self.assertRaises(NotImplementedError, index.documentCount) def test_index_doc_simple(self): index = self._makeOne() # Fake out _get_frequencies, which is
supposed to be overridden. def _faux_get_frequencies(wids): return dict([(y, x) for x, y in enumerate(wids)]), 1 index._get_frequencies = _faux_get_frequencies count = index.index_doc(1, 'one two three') self.assertEqual(count, 3) self.assertEqual(index.wordCount(), 3) self.failUnless(index._lexicon._wids['one'] in index._wordinfo) self.failUnless(index._lexicon._wids['two'] in index._wordinfo) self.failUnless(index._lexicon._wids['three'] in index._wordinfo) self.assertEqual(index.documentCount(), 1) self.failUnless(index.has_doc(1)) self.failUnless(1 in index._docwords) self.failUnless(1 in index._docweight) wids = index.get_words(1) self.assertEqual(len(wids), 3) self.failUnless(index._lexicon._wids['one'] in wids) self.failUnless(index._lexicon._wids['two'] in wids) self.failUnless(index._lexicon._wids['three'] in wids) def test_index_doc_existing_docid(self): index = self._makeOne() # Fake out _get_frequencies, which is supposed to be overridden. def _faux_get_frequencies(wids): return dict([(y, x) for x, y in enumerate(wids)]), 1 index._get_frequencies = _faux_get_frequencies index.index_doc(1, 'one two three') count = index.index_doc(1, 'two three four') self.assertEqual(count, 3) self.assertEqual(index.wordCount(), 3) self.failIf(index._lexicon._wids['one'] in index._wordinfo) self.failUnless(index._lexicon._wids['two'] in index._wordinfo) self.failUnless(index._lexicon._wids['three'] in index._wordinfo) self.failUnless(index._lexicon._wids['four'] in index._wordinfo) wids = index.get_words(1) self.assertEqual(len(wids), 3) self.failIf(index._lexicon._wids['one'] in wids) self.failUnless(index._lexicon._wids['two'] in wids) self.failUnless(index._lexicon._wids['three'] in wids) self.failUnless(index._lexicon._wids['four'] in wids) def test_index_doc_upgrades_wordCount_documentCount(self): index = self._makeOne() # Simulate old instances which didn't have these as attributes del index.wordCount del index.documentCount # Fake out _get_frequencies, which is supposed to be overridden. def _faux_get_frequencies(wids): return dict([(y, x) for x, y in enumerate(wids)]), 1 index._get_frequencies = _faux_get_frequencies count = index.index_doc(1, 'one two three') self.assertEqual(count, 3) self.assertEqual(index.wordCount(), 3) self.assertEqual(index.documentCount(), 1) def test__reindex_doc_identity(self): index = self._makeOne() # Fake out _get_frequencies, which is supposed to be overridden. def _faux_get_frequencies(wids): return dict([(y, x) for x, y in enumerate(wids)]), 1 index._get_frequencies = _faux_get_frequencies index.index_doc(1, 'one two three') # Don't mutate _wordinfo if no changes def _dont_go_here(*args, **kw): assert 0 index._add_wordinfo = index._del_wordinfo = _dont_go_here count = index._reindex_doc(1, 'one two three') self.assertEqual(count, 3) self.assertEqual(index.wordCount(), 3) self.failUnless(index._lexicon._wids['one'] in index._wordinfo) self.failUnless(index._lexicon._wids['two'] in index._wordinfo) self.failUnless(index._lexicon._wids['three'] in index._wordinfo) wids = index.get_words(1) self.assertEqual(len(wids), 3) self.failUnless(index._lexicon._wids['one'] in wids) self.failUnless(index._lexicon._wids['two'] in wids) self.failUnless(index._lexicon._wids['three'] in wids) def test__reindex_doc_disjoint(self): index = self._makeOne() def _faux_get_frequencies(wids): return dict([(y, x) for x, y in enumerate(wids)]), 1 # Fake out _get_frequencies, which is supposed to be overridden. 
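        # The stub maps each wid to its position in the wid list and reports
        # a constant document weight of 1.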
index._get_frequencies = _faux_get_frequencies index.index_doc(1, 'one two three') count = index._reindex_doc(1, 'four five six') self.assertEqual(count, 3) self.assertEqual(index.wordCount(), 3) self.failIf(index._lexicon._wids['one'] in index._wordinfo) self.failIf(index._lexicon._wids['two'] in index._wordinfo) self.failIf(index._lexicon._wids['three'] in index._wordinfo) self.failUnless(index._lexicon._wids['four'] in index._wordinfo) self.failUnless(index._lexicon._wids['five'] in index._wordinfo) self.failUnless(index._lexicon._wids['six'] in index._wordinfo) wids = index.get_words(1) self.assertEqual(len(wids), 3) self.failIf(index._lexicon._wids['one'] in wids) self.failIf(index._lexicon._wids['two'] in wids) self.failIf(index._lexicon._wids['three'] in wids) self.failUnless(index._lexicon._wids['four'] in wids) self.failUnless(index._lexicon._wids['five'] in wids) self.failUnless(index._lexicon._wids['six'] in wids) def test__reindex_doc_subset(self): index = self._makeOne() def _faux_get_frequencies(wids): return dict([(y, x) for x, y in enumerate(wids)]), 1 # Fake out _get_frequencies, which is supposed to be overridden. index._get_frequencies = _faux_get_frequencies index.index_doc(1, 'one two three') count = index._reindex_doc(1, 'two three') self.assertEqual(count, 2) self.assertEqual(index.wordCount(), 2) self.failIf(index._lexicon._wids['one'] in index._wordinfo) self.failUnless(index._lexicon._wids['two'] in index._wordinfo) self.failUnless(index._lexicon._wids['three'] in index._wordinfo) wids = index.get_words(1) self.assertEqual(len(wids), 2) self.failIf(index._lexicon._wids['one'] in wids) self.failUnless(index._lexicon._wids['two'] in wids) self.failUnless(index._lexicon._wids['three'] in wids) def test__reindex_doc_superset(self): # TODO index = self._makeOne() def _faux_get_frequencies(wids): return dict([(y, x) for x, y in enumerate(wids)]), 1 # Fake out _get_frequencies, which is supposed to be overridden. index._get_frequencies = _faux_get_frequencies index.index_doc(1, 'one two three') count = index._reindex_doc(1, 'one two three four five six') self.assertEqual(count, 6) self.assertEqual(index.wordCount(), 6) self.failUnless(index._lexicon._wids['one'] in index._wordinfo) self.failUnless(index._lexicon._wids['two'] in index._wordinfo) self.failUnless(index._lexicon._wids['three'] in index._wordinfo) self.failUnless(index._lexicon._wids['four'] in index._wordinfo) self.failUnless(index._lexicon._wids['five'] in index._wordinfo) self.failUnless(index._lexicon._wids['six'] in index._wordinfo) wids = index.get_words(1) self.assertEqual(len(wids), 6) self.failUnless(index._lexicon._wids['one'] in wids) self.failUnless(index._lexicon._wids['two'] in wids) self.failUnless(index._lexicon._wids['three'] in wids) self.failUnless(index._lexicon._wids['four'] in wids) self.failUnless(index._lexicon._wids['five'] in wids) self.failUnless(index._lexicon._wids['six'] in wids) def test__get_frequencies_raises_NotImplementedError(self): index = self._makeOne() self.assertRaises(NotImplementedError, index._get_frequencies, ()) def test_unindex_doc_simple(self): index = self._makeOne() def _faux_get_frequencies(wids): return dict([(y, x) for x, y in enumerate(wids)]), 1 # Fake out _get_frequencies, which is supposed to be overridden. 
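        # Reindexing doc 1 with a strict subset of its words must drop the
        # wordinfo for the word that disappeared and keep the rest intact.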
index._get_frequencies = _faux_get_frequencies index.index_doc(1, 'one two three') index.unindex_doc(1) self.assertEqual(index.wordCount(), 0) self.failIf(index._lexicon._wids['one'] in index._wordinfo) self.failIf(index._lexicon._wids['two'] in index._wordinfo) self.failIf(index._lexicon._wids['three'] in index._wordinfo) self.assertEqual(index.documentCount(), 0) self.failIf(index.has_doc(1)) self.failIf(1 in index._docwords) self.failIf(1 in index._docweight) self.assertRaises(KeyError, index.get_words, 1) def test_unindex_doc_upgrades_wordCount_documentCount(self): index = self._makeOne() def _faux_get_frequencies(wids): return dict([(y, x) for x, y in enumerate(wids)]), 1 # Fake out _get_frequencies, which is supposed to be overridden. index._get_frequencies = _faux_get_frequencies index.index_doc(1, 'one two three') # Simulate old instances which didn't have these as attributes del index.wordCount del index.documentCount index.unindex_doc(1) self.assertEqual(index.wordCount(), 0) self.failIf(index._lexicon._wids['one'] in index._wordinfo) self.failIf(index._lexicon._wids['two'] in index._wordinfo) self.failIf(index._lexicon._wids['three'] in index._wordinfo) self.assertEqual(index.documentCount(), 0) self.failIf(index.has_doc(1)) self.failIf(1 in index._docwords) self.failIf(1 in index._docweight) self.assertRaises(KeyError, index.get_words, 1) def test_search_w_empty_term(self): index = self._makeOne() self.assertEqual(index.search(''), None) def test_search_w_oov_term(self): index = self._makeOne() def _faux_search_wids(wids): assert len(wids) == 0 return [] index._search_wids = _faux_search_wids self.assertEqual(dict(index.search('nonesuch')), {}) def test_search_hit(self): index = self._makeOne() def _faux_get_frequencies(wids): return dict([(y, x) for x, y in enumerate(wids)]), 1 index._get_frequencies = _faux_get_frequencies def _faux_search_wids(wids): assert len(wids) == 1 assert index._lexicon._wids['hit'] in wids result = index.family.IF.Bucket() result[1] = 1.0 return [(result, 1)] index._search_wids = _faux_search_wids index.index_doc(1, 'hit') self.assertEqual(dict(index.search('hit')), {1: 1.0}) def test_search_glob_w_empty_term(self): index = self._makeOne() def _faux_search_wids(wids): assert len(wids) == 0 return [] index._search_wids = _faux_search_wids self.assertEqual(dict(index.search_glob('')), {}) def test_search_glob_w_oov_term(self): index = self._makeOne() def _faux_search_wids(wids): assert len(wids) == 0 return [] index._search_wids = _faux_search_wids self.assertEqual(dict(index.search_glob('nonesuch*')), {}) def test_search_glob_hit(self): index = self._makeOne() def _faux_get_frequencies(wids): return dict([(y, x) for x, y in enumerate(wids)]), 1 index._get_frequencies = _faux_get_frequencies def _faux_search_wids(wids): assert len(wids) == 1 assert index._lexicon._wids['hitter'] in wids result = index.family.IF.Bucket() result[1] = 1.0 return [(result, 1)] index._search_wids = _faux_search_wids index.index_doc(1, 'hitter') self.assertEqual(dict(index.search_glob('hit*')), {1: 1.0}) def test_search_phrase_w_empty_term(self): index = self._makeOne() def _faux_search_wids(wids): assert len(wids) == 0 return [] index._search_wids = _faux_search_wids self.assertEqual(dict(index.search_phrase('')), {}) def test_search_phrase_w_oov_term(self): index = self._makeOne() self.assertEqual(dict(index.search_phrase('nonesuch')), {}) def test_search_phrase_hit(self): index = self._makeOne() def _faux_get_frequencies(wids): return dict([(y, x) for x, y in 
enumerate(wids)]), 1 index._get_frequencies = _faux_get_frequencies def _faux_search_wids(wids): assert len(wids) == 3 assert index._lexicon._wids['hit'] in wids assert index._lexicon._wids['the'] in wids assert index._lexicon._wids['nail'] in wids result = index.family.IF.Bucket() result[1] = 1.0 return [(result, 1)] index._search_wids = _faux_search_wids index.index_doc(1, 'hit the nail on the head') self.assertEqual(dict(index.search_phrase('hit the nail')), {1: 1.0}) def test__search_wids_raises_NotImplementedError(self): index = self._makeOne() self.assertRaises(NotImplementedError, index._search_wids, ()) def test_query_weight_raises_NotImplementedError(self): index = self._makeOne() self.assertRaises(NotImplementedError, index.query_weight, ()) def test__add_wordinfo_simple(self): index = self._makeOne() index._add_wordinfo(123, 4, 1) self.assertEqual(index.wordCount(), 1) self.assertEqual(index._wordinfo[123], {1: 4}) def test__add_wordinfo_upgrades_wordCount(self): index = self._makeOne() # Simulate old instances which didn't have these as attributes del index.wordCount index._add_wordinfo(123, 4, 1) self.assertEqual(index.wordCount(), 1) def test__add_wordinfo_promotes_dict_to_tree_at_DICT_CUTOFF(self): index = self._makeOne() index.DICT_CUTOFF = 2 index._add_wordinfo(123, 4, 1) index._add_wordinfo(123, 5, 2) self.failUnless(isinstance(index._wordinfo[123], dict)) index._add_wordinfo(123, 6, 3) self.failUnless(isinstance(index._wordinfo[123], index.family.IF.BTree)) self.assertEqual(dict(index._wordinfo[123]), {1: 4, 2: 5, 3: 6}) def test__mass_add_wordinfo_promotes_dict_to_tree_at_DICT_CUTOFF(self): index = self._makeOne() index.DICT_CUTOFF = 2 index._add_wordinfo(123, 4, 1) index._add_wordinfo(123, 5, 2) index._mass_add_wordinfo({123: 6, 124: 1}, 3) self.failUnless(isinstance(index._wordinfo[123], index.family.IF.BTree)) self.assertEqual(dict(index._wordinfo[123]), {1: 4, 2: 5, 3: 6}) def test__del_wordinfo_no_residual_docscore(self): index = self._makeOne() index._add_wordinfo(123, 4, 1) index._del_wordinfo(123, 1) self.assertEqual(index.wordCount(), 0) self.assertRaises(KeyError, index._wordinfo.__getitem__, 123) def test__del_wordinfo_w_residual_docscore(self): index = self._makeOne() index._add_wordinfo(123, 4, 1) index._add_wordinfo(123, 5, 2) index._del_wordinfo(123, 1) self.assertEqual(index.wordCount(), 1) self.assertEqual(index._wordinfo[123], {2: 5}) def test__del_wordinfo_upgrades_wordCount(self): index = self._makeOne() index._add_wordinfo(123, 4, 1) # Simulate old instances which didn't have these as attributes del index.wordCount index._del_wordinfo(123, 1) self.assertEqual(index.wordCount(), 0) class BaseIndexTest32(BaseIndexTestBase, unittest.TestCase): def _getBTreesFamily(self): import BTrees return BTrees.family32 class BaseIndexTest64(BaseIndexTestBase, unittest.TestCase): def _getBTreesFamily(self): import BTrees return BTrees.family64 def test_suite(): return unittest.TestSuite(( unittest.makeSuite(BaseIndexTest32), unittest.makeSuite(BaseIndexTest64), )) zope.index-3.6.4/src/zope/index/text/tests/test_textindexwrapper.py0000644000175000017500000000137211727503631027012 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved.
# # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Unit tests for TextIndexWrapper. """ import doctest def test_suite(): return doctest.DocFileSuite("../textindex.txt") zope.index-3.6.4/src/zope/index/text/tests/test_parsetree.py0000644000175000017500000002212711727503631025370 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2009 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## import unittest class ConformsToIQueryParseTree: def test_class_conforms_to_IQueryParseTree(self): from zope.interface.verify import verifyClass from zope.index.text.interfaces import IQueryParseTree verifyClass(IQueryParseTree, self._getTargetClass()) def test_instance_conforms_to_IQueryParseTree(self): from zope.interface.verify import verifyObject from zope.index.text.interfaces import IQueryParseTree verifyObject(IQueryParseTree, self._makeOne()) class ParseTreeNodeTests(unittest.TestCase, ConformsToIQueryParseTree): def _getTargetClass(self): from zope.index.text.parsetree import ParseTreeNode return ParseTreeNode def _makeOne(self, value=None): if value is None: value = [FauxValue('XXX')] return self._getTargetClass()(value) def test_nodeType(self): node = self._makeOne() self.assertEqual(node.nodeType(), None) def test_getValue(self): value = [FauxValue('XXX')] node = self._makeOne(value) self.assertEqual(node.getValue(), value) def test___repr__(self): node = self._makeOne() self.assertEqual(repr(node), "ParseTreeNode([FV:XXX])") def test___repr___subclass(self): class Derived(self._getTargetClass()): pass node = Derived('XXX') self.assertEqual(repr(node), "Derived('XXX')") def test_terms(self): node = self._makeOne() self.assertEqual(list(node.terms()), ['XXX']) def test_executeQuery_raises(self): node = self._makeOne() self.assertRaises(NotImplementedError, node.executeQuery, FauxIndex()) class NotNodeTests(unittest.TestCase, ConformsToIQueryParseTree): def _getTargetClass(self): from zope.index.text.parsetree import NotNode return NotNode def _makeOne(self, value=None): if value is None: value = [FauxValue('XXX')] return self._getTargetClass()(value) def test_nodeType(self): node = self._makeOne() self.assertEqual(node.nodeType(), 'NOT') def test_terms(self): node = self._makeOne(object()) self.assertEqual(list(node.terms()), []) def test_executeQuery_raises(self): from zope.index.text.parsetree import QueryError node = self._makeOne() self.assertRaises(QueryError, node.executeQuery, FauxIndex()) class BucketMaker: def _makeBucket(self, index, count, start=0): bucket = index.family.IF.Bucket() for i in range(start, count): 
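            # Arbitrary non-zero score; these tests only inspect which docids
            # survive the AND/OR/NOT set operations, not the values.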
bucket[i] = count * 3.1415926 return bucket class AndNodeTests(unittest.TestCase, ConformsToIQueryParseTree, BucketMaker): def _getTargetClass(self): from zope.index.text.parsetree import AndNode return AndNode def _makeOne(self, value=None): if value is None: value = [FauxValue('XXX')] return self._getTargetClass()(value) def test_nodeType(self): node = self._makeOne() self.assertEqual(node.nodeType(), 'AND') def test_executeQuery_no_results(self): node = self._makeOne([FauxSubnode('FOO', None)]) result = node.executeQuery(FauxIndex()) self.assertEqual(dict(result), {}) def test_executeQuery_w_positive_results(self): index = FauxIndex() node = self._makeOne( [FauxSubnode('FOO', self._makeBucket(index, 5)), FauxSubnode('FOO', self._makeBucket(index, 6)), ]) result = node.executeQuery(index) self.assertEqual(sorted(result.keys()), [0, 1, 2, 3, 4]) def test_executeQuery_w_negative_results(self): # TODO index = FauxIndex() node = self._makeOne( [FauxSubnode('NOT', self._makeBucket(index, 5)), FauxSubnode('FOO', self._makeBucket(index, 6)), ]) result = node.executeQuery(index) self.assertEqual(sorted(result.keys()), [5]) class OrNodeTests(unittest.TestCase, ConformsToIQueryParseTree, BucketMaker): def _getTargetClass(self): from zope.index.text.parsetree import OrNode return OrNode def _makeOne(self, value=None): if value is None: value = [FauxValue('XXX')] return self._getTargetClass()(value) def test_nodeType(self): node = self._makeOne() self.assertEqual(node.nodeType(), 'OR') def test_executeQuery_no_results(self): node = self._makeOne([FauxSubnode('FOO', None)]) result = node.executeQuery(FauxIndex()) self.assertEqual(dict(result), {}) def test_executeQuery_w_results(self): index = FauxIndex() node = self._makeOne( [FauxSubnode('FOO', self._makeBucket(index, 5)), FauxSubnode('FOO', self._makeBucket(index, 6)), ]) result = node.executeQuery(index) self.assertEqual(sorted(result.keys()), [0, 1, 2, 3, 4, 5]) class AtomNodeTests(unittest.TestCase, ConformsToIQueryParseTree, BucketMaker): def _getTargetClass(self): from zope.index.text.parsetree import AtomNode return AtomNode def _makeOne(self, value=None): if value is None: value = 'XXX' return self._getTargetClass()(value) def test_nodeType(self): node = self._makeOne() self.assertEqual(node.nodeType(), 'ATOM') def test_terms(self): node = self._makeOne() self.assertEqual(node.terms(), ['XXX']) def test_executeQuery(self): node = self._makeOne() index = FauxIndex() index.search = lambda term: self._makeBucket(index, 5) result = node.executeQuery(index) self.assertEqual(sorted(result.keys()), [0, 1, 2, 3, 4]) class PhraseNodeTests(unittest.TestCase, ConformsToIQueryParseTree): def _getTargetClass(self): from zope.index.text.parsetree import PhraseNode return PhraseNode def _makeOne(self, value=None): if value is None: value = 'XXX YYY' return self._getTargetClass()(value) def test_nodeType(self): node = self._makeOne() self.assertEqual(node.nodeType(), 'PHRASE') def test_executeQuery(self): _called_with = [] def _search(*args, **kw): _called_with.append((args, kw)) return [] index = FauxIndex() index.search_phrase = _search node = self._makeOne() self.assertEqual(node.executeQuery(index), []) self.assertEqual(_called_with[0], (('XXX YYY',), {})) class GlobNodeTests(unittest.TestCase, ConformsToIQueryParseTree): def _getTargetClass(self): from zope.index.text.parsetree import GlobNode return GlobNode def _makeOne(self, value=None): if value is None: value = 'XXX*' return self._getTargetClass()(value) def test_nodeType(self): node = 
self._makeOne() self.assertEqual(node.nodeType(), 'GLOB') def test_executeQuery(self): _called_with = [] def _search(*args, **kw): _called_with.append((args, kw)) return [] index = FauxIndex() index.search_glob = _search node = self._makeOne() self.assertEqual(node.executeQuery(index), []) self.assertEqual(_called_with[0], (('XXX*',), {})) class FauxIndex(object): def _get_family(self): import BTrees return BTrees.family32 family = property(_get_family,) class FauxValue: def __init__(self, *terms): self._terms = terms[:] def terms(self): return self._terms def __eq__(self, other): return self._terms == other._terms def __repr__(self): return 'FV:%s' % ' '.join(self._terms) class FauxSubnode: def __init__(self, node_type, query_results, value=None): self._nodeType = node_type self._query_results = query_results self._value = value def nodeType(self): return self._nodeType def executeQuery(self, index): return self._query_results def getValue(self): if self._value is not None: return self._value return self def test_suite(): return unittest.TestSuite(( unittest.makeSuite(ParseTreeNodeTests), unittest.makeSuite(NotNodeTests), unittest.makeSuite(AndNodeTests), unittest.makeSuite(OrNodeTests), unittest.makeSuite(AtomNodeTests), unittest.makeSuite(PhraseNodeTests), unittest.makeSuite(GlobNodeTests), )) zope.index-3.6.4/src/zope/index/text/tests/test_cosineindex.py0000644000175000017500000001155111727503631025705 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002, 2009 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. 
# ############################################################################## """Text Index Tests """ import unittest class CosineIndexTestBase: # Subclasses must define '_getBTreesFamily' def _getTargetClass(self): from zope.index.text.cosineindex import CosineIndex return CosineIndex def _makeOne(self): from zope.index.text.lexicon import Lexicon from zope.index.text.lexicon import Splitter lexicon = Lexicon(Splitter()) return self._getTargetClass()(lexicon, family=self._getBTreesFamily()) def test_class_conforms_to_IInjection(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IInjection verifyClass(IInjection, self._getTargetClass()) def test_instance_conforms_to_IInjection(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IInjection verifyObject(IInjection, self._makeOne()) def test_class_conforms_to_IStatistics(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IStatistics verifyClass(IStatistics, self._getTargetClass()) def test_instance_conforms_to_IStatistics(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IStatistics verifyObject(IStatistics, self._makeOne()) def test_class_conforms_to_ILexiconBasedIndex(self): from zope.interface.verify import verifyClass from zope.index.text.interfaces import ILexiconBasedIndex verifyClass(ILexiconBasedIndex, self._getTargetClass()) def test_instance_conforms_to_ILexiconBasedIndex(self): from zope.interface.verify import verifyObject from zope.index.text.interfaces import ILexiconBasedIndex verifyObject(ILexiconBasedIndex, self._makeOne()) def test_class_conforms_to_IExtendedQuerying(self): from zope.interface.verify import verifyClass from zope.index.text.interfaces import IExtendedQuerying verifyClass(IExtendedQuerying, self._getTargetClass()) def test_instance_conforms_to_IExtendedQuerying(self): from zope.interface.verify import verifyObject from zope.index.text.interfaces import IExtendedQuerying verifyObject(IExtendedQuerying, self._makeOne()) def test__search_wids_empty_wids(self): index = self._makeOne() index.index_doc(1, 'one two three') self.assertEqual(index._search_wids(()), []) def test__search_wids_non_empty_wids(self): TEXT = 'one two three' index = self._makeOne() index.index_doc(1, TEXT ) wids = [index._lexicon._wids[x] for x in TEXT.split()] relevances = index._search_wids(wids) self.assertEqual(len(relevances), len(wids)) for relevance in relevances: self.failUnless(isinstance(relevance[0], index.family.IF.Bucket)) self.assertEqual(len(relevance[0]), 1) self.failUnless(isinstance(relevance[0][1], float)) self.failUnless(isinstance(relevance[1], float)) def test_query_weight_empty_wids(self): index = self._makeOne() index.index_doc(1, 'one two three') self.assertEqual(index.query_weight(()), 0.0) def test_query_weight_oov_wids(self): index = self._makeOne() index.index_doc(1, 'one two three') self.assertEqual(index.query_weight(['nonesuch']), 0.0) def test_query_weight_hit_single_occurence(self): index = self._makeOne() index.index_doc(1, 'one two three') self.failUnless(0.0 < index.query_weight(['one']) < 1.0) def test_query_weight_hit_multiple_occurences(self): index = self._makeOne() index.index_doc(1, 'one one two three one') self.failUnless(0.0 < index.query_weight(['one']) < 1.0) class CosineIndexTest32(CosineIndexTestBase, unittest.TestCase): def _getBTreesFamily(self): import BTrees return BTrees.family32 class CosineIndexTest64(CosineIndexTestBase, unittest.TestCase): 
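    # Same battery of tests as CosineIndexTest32, run against the 64-bit
    # BTrees family.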
def _getBTreesFamily(self): import BTrees return BTrees.family64 def test_suite(): return unittest.TestSuite(( unittest.makeSuite(CosineIndexTest32), unittest.makeSuite(CosineIndexTest64), )) zope.index-3.6.4/src/zope/index/text/tests/test_okapiindex.py0000644000175000017500000001351711727503631025534 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2009 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Okapi text index tests """ import unittest class OkapiIndexTestBase: # Subclasses must define '_getBTreesFamily' def _getTargetClass(self): from zope.index.text.okapiindex import OkapiIndex return OkapiIndex def _makeOne(self): from zope.index.text.lexicon import Lexicon from zope.index.text.lexicon import Splitter lexicon = Lexicon(Splitter()) return self._getTargetClass()(lexicon, family=self._getBTreesFamily()) def test_class_conforms_to_IInjection(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IInjection verifyClass(IInjection, self._getTargetClass()) def test_instance_conforms_to_IInjection(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IInjection verifyObject(IInjection, self._makeOne()) def test_class_conforms_to_IStatistics(self): from zope.interface.verify import verifyClass from zope.index.interfaces import IStatistics verifyClass(IStatistics, self._getTargetClass()) def test_instance_conforms_to_IStatistics(self): from zope.interface.verify import verifyObject from zope.index.interfaces import IStatistics verifyObject(IStatistics, self._makeOne()) def test_class_conforms_to_ILexiconBasedIndex(self): from zope.interface.verify import verifyClass from zope.index.text.interfaces import ILexiconBasedIndex verifyClass(ILexiconBasedIndex, self._getTargetClass()) def test_instance_conforms_to_ILexiconBasedIndex(self): from zope.interface.verify import verifyObject from zope.index.text.interfaces import ILexiconBasedIndex verifyObject(ILexiconBasedIndex, self._makeOne()) def test_class_conforms_to_IExtendedQuerying(self): from zope.interface.verify import verifyClass from zope.index.text.interfaces import IExtendedQuerying verifyClass(IExtendedQuerying, self._getTargetClass()) def test_instance_conforms_to_IExtendedQuerying(self): from zope.interface.verify import verifyObject from zope.index.text.interfaces import IExtendedQuerying verifyObject(IExtendedQuerying, self._makeOne()) def test_empty(self): index = self._makeOne() self.assertEqual(index._totaldoclen(), 0) def test_index_doc_updates_totaldoclen(self): index = self._makeOne() index.index_doc(1, 'one two three') index.index_doc(2, 'two three four') self.assertEqual(index._totaldoclen(), 6) def test_index_doc_existing_updates_totaldoclen(self): index = self._makeOne() index.index_doc(1, 'one two three') index.index_doc(1, 'two three four') self.assertEqual(index._totaldoclen(), 3) def test_index_doc_upgrades_totaldoclen(self): index = self._makeOne() # Simulate old instances which didn't have 
Length attributes index._totaldoclen = 0 index.index_doc(1, 'one two three') self.assertEqual(index._totaldoclen(), 3) def test__search_wids_non_empty_wids(self): TEXT = 'one two three' index = self._makeOne() index.index_doc(1, TEXT ) wids = [index._lexicon._wids[x] for x in TEXT.split()] relevances = index._search_wids(wids) self.assertEqual(len(relevances), len(wids)) for relevance in relevances: self.failUnless(isinstance(relevance[0], index.family.IF.Bucket)) self.assertEqual(len(relevance[0]), 1) self.failUnless(isinstance(relevance[0][1], float)) self.failUnless(isinstance(relevance[1], int)) def test__search_wids_old_totaldoclen_no_write_on_read(self): index = self._makeOne() index.index_doc(1, 'one two three') # Simulate old instances which didn't have Length attributes index._totaldoclen = 3 relevances = index._search_wids([1]) self.failUnless(isinstance(index._totaldoclen, int)) def test_query_weight_empty_wids(self): index = self._makeOne() index.index_doc(1, 'one two three') self.assertEqual(index.query_weight(()), 0.0) def test_query_weight_oov_wids(self): index = self._makeOne() index.index_doc(1, 'one two three') self.assertEqual(index.query_weight(['nonesuch']), 0.0) def test_query_weight_hit_single_occurence(self): index = self._makeOne() index.index_doc(1, 'one two three') self.failUnless(0.0 < index.query_weight(['one'])) def test_query_weight_hit_multiple_occurences(self): index = self._makeOne() index.index_doc(1, 'one one two three one') self.failUnless(0.0 < index.query_weight(['one'])) class OkapiIndexTest32(OkapiIndexTestBase, unittest.TestCase): def _getBTreesFamily(self): import BTrees return BTrees.family32 class OkapiIndexTest64(OkapiIndexTestBase, unittest.TestCase): def _getBTreesFamily(self): import BTrees return BTrees.family64 def test_suite(): return unittest.TestSuite(( unittest.makeSuite(OkapiIndexTest32), unittest.makeSuite(OkapiIndexTest64), )) zope.index-3.6.4/src/zope/index/text/tests/wordstats.py0000644000175000017500000000325411727503631024371 0ustar tseavertseaver00000000000000#! /usr/bin/env python ############################################################################## # # Copyright (c) 2003 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Dump statistics about each word in the index. 
usage: wordstats.py data.fs [index key] """ import ZODB from ZODB.FileStorage import FileStorage def main(fspath, key): fs = FileStorage(fspath, read_only=1) db = ZODB.DB(fs) rt = db.open().root() index = rt[key] lex = index.lexicon idx = index.index print "Words", lex.length() print "Documents", idx.length() print "Word frequencies: count, word, wid" for word, wid in lex.items(): docs = idx._wordinfo[wid] print len(docs), word, wid print "Per-doc scores: wid, (doc, score,)+" for wid in lex.wids(): print wid, docs = idx._wordinfo[wid] for docid, score in docs.items(): print docid, score, print if __name__ == "__main__": import sys args = sys.argv[1:] index_key = "index" if len(args) == 1: fspath = args[0] elif len(args) == 2: fspath, index_key = args else: print "Expected 1 or 2 args, got", len(args) sys.exit(1) main(fspath, index_key) zope.index-3.6.4/src/zope/index/text/tests/test_widcode.py0000644000175000017500000000756711727503631025017 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2009 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Text Index Tests """ import unittest _marker = object() class Test_widcode(unittest.TestCase): def test_encode_1_to_7_bits(self): from zope.index.text.widcode import encode for wid in xrange(2**7): code = encode([wid]) self.assertEqual(code, chr(wid + 128)) def test_encode_8_to_14_bits(self): from zope.index.text.widcode import encode for wid in xrange(2**7, 2**14): hi, lo = divmod(wid, 128) code = encode([wid]) self.assertEqual(code, chr(hi + 128) + chr(lo)) def test_encode_15_to_21_bits(self): from zope.index.text.widcode import encode for wid in xrange(2**14, 2**21, 255): mid, lo = divmod(wid, 128) hi, mid = divmod(mid, 128) code = encode([wid]) self.assertEqual(code, chr(hi + 128) + chr(mid) + chr(lo)) def test_encode_22_to_28_bits(self): from zope.index.text.widcode import encode STEP = (256 * 512) - 1 for wid in xrange(2**21, 2**28, STEP): lmid, lo = divmod(wid, 128) hmid, lmid = divmod(lmid, 128) hi, hmid = divmod(hmid, 128) code = encode([wid]) self.assertEqual(code, chr(hi + 128) + chr(hmid) + chr(lmid) + chr(lo)) def test_decode_zero(self): from zope.index.text.widcode import decode self.assertEqual(decode('\x80'), [0]) def test__decode_other_one_byte_asserts(self): from zope.index.text.widcode import _decode for wid in range(1, 128): try: _decode(chr(128 + wid)) except AssertionError: pass else: self.fail("Didn't assert: %d" % wid) def test__decode_two_bytes_asserts(self): from zope.index.text.widcode import _decode for wid in range(128, 2**14): try: hi, lo = divmod(wid, 128) code = chr(hi + 128) + chr(lo) _decode(code) except AssertionError: pass else: self.fail("Didn't assert: %d" % wid) def test__decode_three_bytes(self): from zope.index.text.widcode import _decode for wid in range(2**14, 2**21, 247): mid, lo = divmod(wid, 128) hi, mid = divmod(mid, 128) code = chr(hi + 128) + chr(mid) + chr(lo) self.assertEqual(_decode(code), wid) def test__decode_four_bytes(self): from zope.index.text.widcode
import _decode STEP = (256 * 512) - 7 for wid in range(2**21, 2**28, STEP): lmid, lo = divmod(wid, 128) hmid, lmid = divmod(lmid, 128) hi, hmid = divmod(hmid, 128) code = chr(hi + 128) + chr(hmid) + chr(lmid) + chr(lo) self.assertEqual(_decode(code), wid) def test_symmetric(self): from zope.index.text.widcode import decode from zope.index.text.widcode import encode for wid in xrange(0, 2**28, 1117): wids = [wid] code = encode(wids) self.assertEqual(decode(code), wids) def test_suite(): return unittest.TestSuite(( unittest.makeSuite(Test_widcode), )) zope.index-3.6.4/src/zope/index/text/tests/__init__.py0000644000175000017500000000007511727503631024074 0ustar tseavertseaver00000000000000# # This file is necessary to make this directory a package. zope.index-3.6.4/src/zope/index/text/tests/test_htmlsplitter.py0000644000175000017500000000765411727503631026135 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2009 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Test zope.index.text.htmlsplitter """ import unittest class HTMLWordSplitterTests(unittest.TestCase): _old_locale = None def tearDown(self): if self._old_locale is not None: import locale locale.setlocale(locale.LC_ALL, self._old_locale) def _getTargetClass(self): from zope.index.text.htmlsplitter import HTMLWordSplitter return HTMLWordSplitter def _makeOne(self): return self._getTargetClass()() def test_class_conforms_to_ISplitter(self): from zope.interface.verify import verifyClass from zope.index.text.interfaces import ISplitter verifyClass(ISplitter, self._getTargetClass()) def test_instance_conforms_to_ISplitter(self): from zope.interface.verify import verifyObject from zope.index.text.interfaces import ISplitter verifyObject(ISplitter, self._makeOne()) def test_process_empty_string(self): splitter = self._makeOne() self.assertEqual(splitter.process(['']), []) def test_process_no_markup(self): splitter = self._makeOne() self.assertEqual(splitter.process(['abc def']), ['abc', 'def']) def test_process_w_locale_awareness(self): import locale import sys self._old_locale = locale.setlocale(locale.LC_ALL) # set German locale try: if sys.platform == 'win32': locale.setlocale(locale.LC_ALL, 'German_Germany.1252') else: locale.setlocale(locale.LC_ALL, 'de_DE.ISO8859-1') except locale.Error: return # This test doesn't work here :-( expected = ['m\xfclltonne', 'waschb\xe4r', 'beh\xf6rde', '\xfcberflieger'] splitter = self._makeOne() self.assertEqual(splitter.process([' '.join(expected)]), expected) def test_process_w_markup(self): splitter = self._makeOne() self.assertEqual(splitter.process(['
<h1>abc</h1> &nbsp; <p>def</p>
']), ['abc', 'def']) def test_process_w_markup_no_spaces(self): splitter = self._makeOne() self.assertEqual(splitter.process(['
<h1>abc</h1>&nbsp;<p>def</p>
']), ['abc', 'def']) def test_process_no_markup_w_glob(self): splitter = self._makeOne() self.assertEqual(splitter.process(['abc?def hij*klm nop* qrs?']), ['abc', 'def', 'hij', 'klm', 'nop', 'qrs']) def test_processGlob_empty_string(self): splitter = self._makeOne() self.assertEqual(splitter.processGlob(['']), []) def test_processGlob_no_markup_no_glob(self): splitter = self._makeOne() self.assertEqual(splitter.processGlob(['abc def']), ['abc', 'def']) def test_processGlob_w_markup_no_glob(self): splitter = self._makeOne() self.assertEqual(splitter.processGlob(['
<h1>abc</h1> &nbsp; ' '<p>def</p>
']), ['abc', 'def']) def test_processGlob_no_markup_w_glob(self): splitter = self._makeOne() self.assertEqual(splitter.processGlob(['abc?def hij*klm nop* qrs?']), ['abc?def', 'hij*klm', 'nop*', 'qrs?']) def test_suite(): return unittest.TestSuite(( unittest.makeSuite(HTMLWordSplitterTests), )) zope.index-3.6.4/src/zope/index/text/tests/test_index.py0000644000175000017500000002035111727503631024502 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Text Index Tests """ import unittest class IndexTestBase: # Subclasses must define '_getTargetClass' and '_getBTreesFamily' def _makeOne(self): from zope.index.text.lexicon import Lexicon from zope.index.text.lexicon import Splitter lexicon = Lexicon(Splitter()) return self._getTargetClass()(lexicon, family=self._getBTreesFamily()) def _check_index_has_document(self, index, docid, word_count=5): self.assertEqual(index.documentCount(), 1) self.assertEqual(index.wordCount(), word_count) self.assertEqual(index._lexicon.wordCount(), word_count) self.assert_(index.has_doc(docid)) self.assert_(index._docweight[docid]) self.assertEqual(len(index._docweight), 1) self.assertEqual(len(index._wordinfo), word_count) self.assertEqual(len(index._docwords), 1) self.assertEqual(len(index.get_words(docid)), word_count) self.assertEqual(len(index._wordinfo), index.wordCount()) for map in index._wordinfo.values(): self.assertEqual(len(map), 1) self.assert_(map.has_key(docid)) def _check_index_is_empty(self, index): self.assertEqual(len(index._docweight), 0) self.assertEqual(len(index._wordinfo), 0) self.assertEqual(len(index._docwords), 0) self.assertEqual(len(index._wordinfo), index.wordCount()) def test_empty(self): index = self._makeOne() self._check_index_is_empty(index) def test_index_document(self): doc = "simple document contains five words" index = self._makeOne() self.assert_(not index.has_doc(1)) index.index_doc(1, doc) self._check_index_has_document(index, 1) def test_unindex_document(self): doc = "simple document contains five words" index = self._makeOne() index.index_doc(1, doc) index.unindex_doc(1) self._check_index_is_empty(index) def test_unindex_document_absent_docid(self): doc = "simple document contains five words" index = self._makeOne() index.index_doc(1, doc) index.unindex_doc(2) self._check_index_has_document(index, 1) def test_clear(self): doc = "simple document contains five words" index = self._makeOne() index.index_doc(1, doc) index.clear() self._check_index_is_empty(index) def test_index_two_documents(self): doc1 = "simple document contains five words" doc2 = "another document just four" index = self._makeOne() index.index_doc(1, doc1) index.index_doc(2, doc2) self.failUnless(index._docweight[2]) self.assertEqual(len(index._docweight), 2) self.assertEqual(len(index._wordinfo), 8) self.assertEqual(len(index._docwords), 2) self.assertEqual(len(index.get_words(2)), 4) self.assertEqual(len(index._wordinfo), index.wordCount()) wids 
= index._lexicon.termToWordIds("document") self.assertEqual(len(wids), 1) document_wid = wids[0] for wid, map in index._wordinfo.items(): if wid == document_wid: self.assertEqual(len(map), 2) self.assert_(map.has_key(1)) self.assert_(map.has_key(2)) else: self.assertEqual(len(map), 1) def test_index_two_unindex_one(self): # index two documents, unindex one, and test the results doc1 = "simple document contains five words" doc2 = "another document just four" index = self._makeOne() index.index_doc(1, doc1) index.index_doc(2, doc2) index.unindex_doc(1) self.assertEqual(len(index._docweight), 1) self.assert_(index._docweight[2]) self.assertEqual(len(index._wordinfo), 4) self.assertEqual(len(index._docwords), 1) self.assertEqual(len(index.get_words(2)), 4) self.assertEqual(len(index._wordinfo), index.wordCount()) for map in index._wordinfo.values(): self.assertEqual(len(map), 1) self.assert_(map.has_key(2)) def test_index_duplicated_words(self): doc = "very simple repeat repeat repeat document test" index = self._makeOne() index.index_doc(1, doc) self.assert_(index._docweight[1]) self.assertEqual(len(index._wordinfo), 5) self.assertEqual(len(index._docwords), 1) self.assertEqual(len(index.get_words(1)), 7) self.assertEqual(len(index._wordinfo), index.wordCount()) wids = index._lexicon.termToWordIds("repeat") self.assertEqual(len(wids), 1) repititive_wid = wids[0] for wid, map in index._wordinfo.items(): self.assertEqual(len(map), 1) self.assert_(map.has_key(1)) def test_simple_query_oneresult(self): index = self._makeOne() index.index_doc(1, 'not the same document') results = index.search("document") self.assertEqual(list(results.keys()), [1]) def test_simple_query_noresults(self): index = self._makeOne() index.index_doc(1, 'not the same document') results = index.search("frobnicate") self.assertEqual(list(results.keys()), []) def test_query_oneresult(self): index = self._makeOne() index.index_doc(1, 'not the same document') index.index_doc(2, 'something about something else') results = index.search("document") self.assertEqual(list(results.keys()), [1]) def test_search_phrase(self): index = self._makeOne() index.index_doc(1, "the quick brown fox jumps over the lazy dog") index.index_doc(2, "the quick fox jumps lazy over the brown dog") results = index.search_phrase("quick brown fox") self.assertEqual(list(results.keys()), [1]) def test_search_glob(self): index = self._makeOne() index.index_doc(1, "how now brown cow") index.index_doc(2, "hough nough browne cough") index.index_doc(3, "bar brawl") results = index.search_glob("bro*") self.assertEqual(list(results.keys()), [1, 2]) results = index.search_glob("b*") self.assertEqual(list(results.keys()), [1, 2, 3]) class CosineIndexTest32(IndexTestBase, unittest.TestCase): def _getTargetClass(self): from zope.index.text.cosineindex import CosineIndex return CosineIndex def _getBTreesFamily(self): import BTrees return BTrees.family32 class OkapiIndexTest32(IndexTestBase, unittest.TestCase): def _getTargetClass(self): from zope.index.text.okapiindex import OkapiIndex return OkapiIndex def _getBTreesFamily(self): import BTrees return BTrees.family32 class CosineIndexTest64(IndexTestBase, unittest.TestCase): def _getTargetClass(self): from zope.index.text.cosineindex import CosineIndex return CosineIndex def _getBTreesFamily(self): import BTrees return BTrees.family64 class OkapiIndexTest64(IndexTestBase, unittest.TestCase): def _getTargetClass(self): from zope.index.text.okapiindex import OkapiIndex return OkapiIndex def _getBTreesFamily(self): import 
BTrees return BTrees.family64 def test_suite(): return unittest.TestSuite(( unittest.makeSuite(CosineIndexTest32), unittest.makeSuite(OkapiIndexTest32), unittest.makeSuite(CosineIndexTest64), unittest.makeSuite(OkapiIndexTest64), )) zope.index-3.6.4/src/zope/index/text/widcode.py0000644000175000017500000000761011727503631022613 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Widcode A byte-aligned encoding for lists of non-negative ints, using fewer bytes for smaller ints. This is intended for lists of word ids (wids). The ordinary string .find() method can be used to find the encoded form of a desired wid-string in an encoded wid-string. As in UTF-8, the initial byte of an encoding can't appear in the interior of an encoding, so find() can't be fooled into starting a match "in the middle" of an encoding. Unlike UTF-8, the initial byte does not tell you how many continuation bytes follow; and there's no ASCII superset property. Details: + Only the first byte of an encoding has the sign bit set. + The first byte has 7 bits of data. + Bytes beyond the first in an encoding have the sign bit clear, followed by 7 bits of data. + The first byte doesn't tell you how many continuation bytes are following. You can tell by searching for the next byte with the high bit set (or the end of the string). The int to be encoded can contain no more than 28 bits. If it contains no more than 7 bits, 0abcdefg, the encoding is 1abcdefg If it contains 8 thru 14 bits, 00abcdef ghijkLmn the encoding is 1abcdefg 0hijkLmn Static tables _encoding and _decoding capture all encodes and decodes for 14 or fewer bits. If it contains 15 thru 21 bits, 000abcde fghijkLm nopqrstu the encoding is 1abcdefg 0hijkLmn 0opqrstu If it contains 22 thru 28 bits, 0000abcd efghijkL mnopqrst uvwxyzAB the encoding is 1abcdefg 0hijkLmn 0opqrstu 0vwxyzAB """ assert 0x80**2 == 0x4000 assert 0x80**4 == 0x10000000 import re def encode(wids): # Encode a list of wids as a string. wid2enc = _encoding n = len(wid2enc) return "".join([w < n and wid2enc[w] or _encode(w) for w in wids]) _encoding = [None] * 0x4000 # Filled later, and converted to a tuple def _encode(w): assert 0x4000 <= w < 0x10000000 b, c = divmod(w, 0x80) a, b = divmod(b, 0x80) s = chr(b) + chr(c) if a < 0x80: # no more than 21 data bits return chr(a + 0x80) + s a, b = divmod(a, 0x80) assert a < 0x80, (w, a, b, s) # else more than 28 data bits return (chr(a + 0x80) + chr(b)) + s _prog = re.compile(r"[\x80-\xFF][\x00-\x7F]*") def decode(code): # Decode a string into a list of wids. get = _decoding.get # Obscure: while _decoding does have the key '\x80', its value is 0, # so the "or" here calls _decode('\x80') anyway. return [get(p) or _decode(p) for p in _prog.findall(code)] _decoding = {} # Filled later def _decode(s): if s == '\x80': # See comment in decode(). This is here to allow a trick to work. 
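# (Editorial note: decode() evaluates ``get(p) or _decode(p)``; because
# _decoding['\x80'] maps to 0, a falsey value, the ``or`` falls through to
# _decode('\x80'), which must therefore return 0 explicitly so that wid 0
# still round-trips.)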
return 0 if len(s) == 3: a, b, c = map(ord, s) assert a & 0x80 == 0x80 and not b & 0x80 and not c & 0x80 return ((a & 0x7F) << 14) | (b << 7) | c assert len(s) == 4, `s` a, b, c, d = map(ord, s) assert a & 0x80 == 0x80 and not b & 0x80 and not c & 0x80 and not d & 0x80 return ((a & 0x7F) << 21) | (b << 14) | (c << 7) | d def _fill(): global _encoding for i in range(0x80): s = chr(i + 0x80) _encoding[i] = s _decoding[s] = i for i in range(0x80, 0x4000): hi, lo = divmod(i, 0x80) s = chr(hi + 0x80) + chr(lo) _encoding[i] = s _decoding[s] = i _encoding = tuple(_encoding) _fill() zope.index-3.6.4/src/zope/index/text/cosineindex.py0000644000175000017500000000752511727503631023512 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE # ############################################################################## """Full text index with relevance ranking, using a cosine measure. """ import math from zope.index.text.baseindex import BaseIndex from zope.index.text.baseindex import inverse_doc_frequency class CosineIndex(BaseIndex): def __init__(self, lexicon, family=None): BaseIndex.__init__(self, lexicon, family=family) # ._wordinfo for cosine is wid -> {docid -> weight}; # t -> D -> w(d, t)/W(d) # ._docweight for cosine is # docid -> W(docid) # Most of the computation for computing a relevance score for the # document occurs in the _search_wids() method. The code currently # implements the cosine similarity function described in Managing # Gigabytes, eq. 4.3, p. 187. The index_object() method # precomputes some values that are independent of the particular # query. # The equation is # # sum(for t in I(d,q): w(d,t) * w(q,t)) # cosine(d, q) = ------------------------------------- # W(d) * W(q) # # where # I(d, q) = the intersection of the terms in d and q. # # w(d, t) = 1 + log f(d, t) # computed by doc_term_weight(); for a given word t, # self._wordinfo[t] is a map from d to w(d, t). 
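    # (Editorial worked example, using natural logs as in doc_term_weight()
    # below: a term occurring f(d, t) = 3 times in a document gets
    # w(d, t) = 1 + log(3) ~= 2.10, which _get_frequencies() then divides
    # by W(d).)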
# # w(q, t) = log(1 + N/f(t)) # computed by inverse_doc_frequency() # # W(d) = sqrt(sum(for t in d: w(d, t) ** 2)) # computed by _get_frequencies(), and remembered in # self._docweight[d] # # W(q) = sqrt(sum(for t in q: w(q, t) ** 2)) # computed by self.query_weight() def _search_wids(self, wids): if not wids: return [] N = float(len(self._docweight)) L = [] DictType = type({}) for wid in wids: assert self._wordinfo.has_key(wid) # caller responsible for OOV d2w = self._wordinfo[wid] # maps docid to w(docid, wid) idf = inverse_doc_frequency(len(d2w), N) # an unscaled float #print "idf = %.3f" % idf if isinstance(d2w, DictType): d2w = self.family.IF.Bucket(d2w) L.append((d2w, idf)) return L def query_weight(self, terms): wids = [] for term in terms: wids += self._lexicon.termToWordIds(term) N = float(len(self._docweight)) sum = 0.0 for wid in self._remove_oov_wids(wids): wt = inverse_doc_frequency(len(self._wordinfo[wid]), N) sum += wt ** 2.0 return math.sqrt(sum) def _get_frequencies(self, wids): d = {} dget = d.get for wid in wids: d[wid] = dget(wid, 0) + 1 Wsquares = 0.0 for wid, count in d.items(): w = doc_term_weight(count) Wsquares += w * w d[wid] = w W = math.sqrt(Wsquares) #print "W = %.3f" % W for wid, weight in d.items(): #print i, ":", "%.3f" % weight, d[wid] = weight / W #print "->", d[wid] return d, W def doc_term_weight(count): """Return the doc-term weight for a term that appears count times.""" # implements w(d, t) = 1 + log f(d, t) return 1.0 + math.log(count) zope.index-3.6.4/src/zope/index/text/__init__.py0000644000175000017500000000006011727503631022724 0ustar tseavertseaver00000000000000from zope.index.text.textindex import TextIndex zope.index-3.6.4/src/zope/index/text/textindex.py0000644000175000017500000000557611727503631023222 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Text index. """ import sys from persistent import Persistent from zope.interface import implements from zope.index.interfaces import IIndexSearch from zope.index.interfaces import IInjection from zope.index.interfaces import IStatistics from zope.index.text.lexicon import CaseNormalizer from zope.index.text.lexicon import Lexicon from zope.index.text.lexicon import Splitter from zope.index.text.lexicon import StopWordRemover from zope.index.text.okapiindex import OkapiIndex from zope.index.text.queryparser import QueryParser class TextIndex(Persistent): implements(IInjection, IIndexSearch, IStatistics) def __init__(self, lexicon=None, index=None): """Provisional constructor. This creates the lexicon and index if not passed in. 
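        A minimal usage sketch (an editorial example, not from the original
        docs; with no arguments the defaults are an OkapiIndex and a
        splitting, case-normalizing, stop-word-removing lexicon):

            index = TextIndex()
            index.index_doc(1, u'cats and dogs')
            results = index.apply(u'dogs')  # maps docid -> weighted score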
""" _explicit_lexicon = True if lexicon is None: _explicit_lexicon = False lexicon = Lexicon(Splitter(), CaseNormalizer(), StopWordRemover()) if index is None: index = OkapiIndex(lexicon) self.lexicon = _explicit_lexicon and lexicon or index.lexicon self.index = index def index_doc(self, docid, text): self.index.index_doc(docid, text) def unindex_doc(self, docid): self.index.unindex_doc(docid) def clear(self): self.index.clear() def documentCount(self): """Return the number of documents in the index.""" return self.index.documentCount() def wordCount(self): """Return the number of words in the index.""" return self.index.wordCount() def apply(self, querytext, start=0, count=None): parser = QueryParser(self.lexicon) tree = parser.parseQuery(querytext) results = tree.executeQuery(self.index) if results: qw = self.index.query_weight(tree.terms()) # Hack to avoid ZeroDivisionError if qw == 0: qw = 1.0 qw *= 1.0 for docid, score in results.iteritems(): try: results[docid] = score/qw except TypeError: # We overflowed the score, perhaps wildly unlikely. # Who knows. results[docid] = sys.maxint/10 return results zope.index-3.6.4/src/zope/index/text/queryparser.py0000644000175000017500000001705011727503631023556 0ustar tseavertseaver00000000000000############################################################################## # # Copyright (c) 2002 Zope Foundation and Contributors. # All Rights Reserved. # # This software is subject to the provisions of the Zope Public License, # Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution. # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS # FOR A PARTICULAR PURPOSE. # ############################################################################## """Query Parser. This particular parser recognizes the following syntax: Start = OrExpr OrExpr = AndExpr ('OR' AndExpr)* AndExpr = Term ('AND' NotExpr | 'NOT' AndExpr)* NotExpr = ['NOT'] Term Term = '(' OrExpr ')' | ATOM+ The key words (AND, OR, NOT) are recognized in any mixture of case. An ATOM is either: + A sequence of characters not containing whitespace or parentheses or double quotes, and not equal (ignoring case) to one of the key words 'AND', 'OR', 'NOT'; or + A non-empty string enclosed in double quotes. The interior of the string can contain whitespace, parentheses and key words, but not quotes. + A hyphen followed by one of the two forms above, meaning that it must not be present. An unquoted ATOM may also contain globbing characters. Globbing syntax is defined by the lexicon; for example "foo*" could mean any word starting with "foo". When multiple consecutive ATOMs are found at the leaf level, they are connected by an implied AND operator, and an unquoted leading hyphen is interpreted as a NOT operator. Summarizing the default operator rules: - a sequence of words without operators implies AND, e.g. ``foo bar'' - double-quoted text implies phrase search, e.g. ``"foo bar"'' - words connected by punctuation implies phrase search, e.g. ``foo-bar'' - a leading hyphen implies NOT, e.g. ``foo -bar'' - these can be combined, e.g. ``foo -"foo bar"'' or ``foo -foo-bar'' - * and ? are used for globbing (i.e. prefix search), e.g. ``foo*'' """ import re from zope.interface import implements from zope.index.text.interfaces import IQueryParser from zope.index.text import parsetree # Create unique symbols for token types. 
_AND = intern("AND") _OR = intern("OR") _NOT = intern("NOT") _LPAREN = intern("(") _RPAREN = intern(")") _ATOM = intern("ATOM") _EOF = intern("EOF") # Map keyword string to token type. _keywords = { _AND: _AND, _OR: _OR, _NOT: _NOT, _LPAREN: _LPAREN, _RPAREN: _RPAREN, } # Regular expression to tokenize. _tokenizer_regex = re.compile(r""" # a paren [()] # or an optional hyphen | -? # followed by (?: # a string inside double quotes (and not containing these) " [^"]* " # or a non-empty stretch w/o whitespace, parens or double quotes | [^()\s"]+ ) """, re.VERBOSE) class QueryParser(object): implements(IQueryParser) # This class is not thread-safe; # each thread should have its own instance def __init__(self, lexicon): self._lexicon = lexicon self._ignored = None # Public API methods def parseQuery(self, query): # Lexical analysis. tokens = _tokenizer_regex.findall(query) self._tokens = tokens # classify tokens self._tokentypes = [_keywords.get(token.upper(), _ATOM) for token in tokens] # add _EOF self._tokens.append(_EOF) self._tokentypes.append(_EOF) self._index = 0 # Syntactical analysis. self._ignored = [] # Ignored words in the query, for parseQueryEx tree = self._parseOrExpr() self._require(_EOF) if tree is None: raise parsetree.ParseError( "Query contains only common words: %s" % repr(query)) return tree def getIgnored(self): return self._ignored def parseQueryEx(self, query): tree = self.parseQuery(query) ignored = self.getIgnored() return tree, ignored # Recursive descent parser def _require(self, tokentype): if not self._check(tokentype): t = self._tokens[self._index] msg = "Token %r required, %r found" % (tokentype, t) raise parsetree.ParseError(msg) def _check(self, tokentype): if self._tokentypes[self._index] is tokentype: self._index += 1 return 1 else: return 0 def _peek(self, tokentype): return self._tokentypes[self._index] is tokentype def _get(self, tokentype): t = self._tokens[self._index] self._require(tokentype) return t def _parseOrExpr(self): L = [] L.append(self._parseAndExpr()) while self._check(_OR): L.append(self._parseAndExpr()) L = filter(None, L) if not L: return None # Only stopwords elif len(L) == 1: return L[0] else: return parsetree.OrNode(L) def _parseAndExpr(self): L = [] t = self._parseTerm() if t is not None: L.append(t) Nots = [] while 1: if self._check(_AND): t = self._parseNotExpr() if t is None: continue if isinstance(t, parsetree.NotNode): Nots.append(t) else: L.append(t) elif self._check(_NOT): t = self._parseTerm() if t is None: continue # Only stopwords Nots.append(parsetree.NotNode(t)) else: break if not L: return None # Only stopwords L.extend(Nots) if len(L) == 1: return L[0] else: return parsetree.AndNode(L) def _parseNotExpr(self): if self._check(_NOT): t = self._parseTerm() if t is None: return None # Only stopwords return parsetree.NotNode(t) else: return self._parseTerm() def _parseTerm(self): if self._check(_LPAREN): tree = self._parseOrExpr() self._require(_RPAREN) else: nodes = [] nodes = [self._parseAtom()] while self._peek(_ATOM): nodes.append(self._parseAtom()) nodes = filter(None, nodes) if not nodes: return None # Only stopwords structure = [(isinstance(nodes[i], parsetree.NotNode), i, nodes[i]) for i in range(len(nodes))] structure.sort() nodes = [node for (bit, index, node) in structure] if isinstance(nodes[0], parsetree.NotNode): raise parsetree.ParseError( "a term must have at least one positive word") if len(nodes) == 1: return nodes[0] tree = parsetree.AndNode(nodes) return tree def _parseAtom(self): term = self._get(_ATOM) 
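        # (Editorial note on the branches below: parseTerms() runs the
        # lexicon's pipeline, so a term that yields several words becomes a
        # PhraseNode, a glob such as ``foo*`` a GlobNode, and a single word
        # an AtomNode; a leading hyphen on the raw term wraps the result in
        # a NotNode, and a term consisting only of stopwords is recorded in
        # self._ignored and dropped.)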
words = self._lexicon.parseTerms(term) if not words: self._ignored.append(term) return None if len(words) > 1: tree = parsetree.PhraseNode(words) elif self._lexicon.isGlob(words[0]): tree = parsetree.GlobNode(words[0]) else: tree = parsetree.AtomNode(words[0]) if term[0] == "-": tree = parsetree.NotNode(tree) return tree zope.index-3.6.4/src/zope.index.egg-info/0000755000175000017500000000000011727503757021335 5ustar tseavertseaver00000000000000zope.index-3.6.4/src/zope.index.egg-info/not-zip-safe0000644000175000017500000000000111727503646023560 0ustar tseavertseaver00000000000000 zope.index-3.6.4/src/zope.index.egg-info/namespace_packages.txt0000644000175000017500000000000511727503757025663 0ustar tseavertseaver00000000000000zope zope.index-3.6.4/src/zope.index.egg-info/requires.txt0000644000175000017500000000005511727503757023735 0ustar tseavertseaver00000000000000setuptools ZODB3>=3.8 zope.interface [test] zope.index-3.6.4/src/zope.index.egg-info/SOURCES.txt0000644000175000017500000000442211727503757023223 0ustar tseavertseaver00000000000000CHANGES.txt COPYRIGHT.txt LICENSE.txt README.txt bootstrap.py buildout.cfg setup.cfg setup.py src/zope/__init__.py src/zope.index.egg-info/PKG-INFO src/zope.index.egg-info/SOURCES.txt src/zope.index.egg-info/dependency_links.txt src/zope.index.egg-info/namespace_packages.txt src/zope.index.egg-info/not-zip-safe src/zope.index.egg-info/requires.txt src/zope.index.egg-info/top_level.txt src/zope/index/DEPENDENCIES.cfg src/zope/index/__init__.py src/zope/index/interfaces.py src/zope/index/nbest.py src/zope/index/tests.py src/zope/index/field/README.txt src/zope/index/field/__init__.py src/zope/index/field/index.py src/zope/index/field/sorting.py src/zope/index/field/tests.py src/zope/index/keyword/__init__.py src/zope/index/keyword/index.py src/zope/index/keyword/interfaces.py src/zope/index/keyword/tests.py src/zope/index/text/__init__.py src/zope/index/text/baseindex.py src/zope/index/text/cosineindex.py src/zope/index/text/htmlsplitter.py src/zope/index/text/interfaces.py src/zope/index/text/lexicon.py src/zope/index/text/okapiindex.py src/zope/index/text/okascore.c src/zope/index/text/parsetree.py src/zope/index/text/queryparser.py src/zope/index/text/ricecode.py src/zope/index/text/setops.py src/zope/index/text/stopdict.py src/zope/index/text/textindex.py src/zope/index/text/textindex.txt src/zope/index/text/widcode.py src/zope/index/text/tests/__init__.py src/zope/index/text/tests/hs-tool.py src/zope/index/text/tests/mhindex.py src/zope/index/text/tests/test_baseindex.py src/zope/index/text/tests/test_cosineindex.py src/zope/index/text/tests/test_htmlsplitter.py src/zope/index/text/tests/test_index.py src/zope/index/text/tests/test_lexicon.py src/zope/index/text/tests/test_okapiindex.py src/zope/index/text/tests/test_parsetree.py src/zope/index/text/tests/test_queryengine.py src/zope/index/text/tests/test_queryparser.py src/zope/index/text/tests/test_setops.py src/zope/index/text/tests/test_textindex.py src/zope/index/text/tests/test_textindexwrapper.py src/zope/index/text/tests/test_widcode.py src/zope/index/text/tests/wordstats.py src/zope/index/topic/__init__.py src/zope/index/topic/filter.py src/zope/index/topic/index.py src/zope/index/topic/interfaces.py src/zope/index/topic/tests/__init__.py src/zope/index/topic/tests/test_filter.py src/zope/index/topic/tests/test_index.pyzope.index-3.6.4/src/zope.index.egg-info/top_level.txt0000644000175000017500000000000511727503757024062 0ustar tseavertseaver00000000000000zope 
zope.index-3.6.4/src/zope.index.egg-info/dependency_links.txt0000644000175000017500000000000111727503757025403 0ustar tseavertseaver00000000000000 zope.index-3.6.4/src/zope.index.egg-info/PKG-INFO0000644000175000017500000001500511727503757022433 0ustar tseavertseaver00000000000000Metadata-Version: 1.0 Name: zope.index Version: 3.6.4 Summary: Indices for use with a catalog: text, field, etc. Home-page: http://pypi.python.org/pypi/zope.index Author: Zope Foundation and Contributors Author-email: zope-dev@zope.org License: ZPL 2.1 Description: Overview -------- The ``zope.index`` package provides several indices for the Zope catalog. These include: * a field index (for indexing orderable values), * a keyword index, * a topic index, * a text index (with support for lexicon, splitter, normalizer, etc.) Changes ======= 3.6.4 (2012-03-12) ------------------ - Ensure proper unindex behavior if index_doc is called with an empty sequence. - Use the standard Python doctest module instead of zope.testing.doctest. 3.6.3 (2011-12-03) ------------------ - KeywordIndex: Minor optimization; use __nonzero__ instead of __len__ to avoid loading the full TreeSet. 3.6.2 (2011-12-03) ------------------ - KeywordIndex: Store docids in a TreeSet rather than a Set when the number of documents matching a word reaches a configurable threshold (default 64). The rule is applied to individual words at indexing time, but you can call the new optimize method to optimize all the words in an index at once. Designed to fix LP #881950. 3.6.1 (2010-07-08) ------------------ - TextIndex: reuse the lexicon from the underlying Okapi / Cosine index, if passed. (LP #232516) - Lexicon: avoid raising an exception when indexing None. (LP #598776) 3.6.0 (2009-08-03) ------------------ - Improved test readability and reached 100% test coverage. - Fixed a broken optimization in okascore.c: it was passing a Python float to the PyInt_AS_LONG() macro. This resulted in wrong scores, especially on 64 bit platforms, where all scores typically ended up being zero. - Changed okascore.c to produce the same results as its Python equivalent, reducing the brittleness of the text index tests. 3.5.2 (2009-06-09) ------------------ - Port the okascore.c optimization used in okapiindex from the Zope2 catalog implementation. This module is compiled conditionally, based on whether your environment has a working C compiler. - Don't use ``len(self._docweight)`` in the okapiindex _search_wids method (obtaining the length of a BTree is very expensive at scale). Instead use self.documentCount(). Also a Zope2 port. 3.5.1 (2009-02-27) ------------------ - The baseindex, okapiindex, and lexicon used plain counters for various lengths, which is unsuitable for production applications. Backport code from Zope2 indexes which opportunistically replaces the counters with BTree.Length objects. - Backport the non-insane version of baseindex._del_wordinfo from the Zope2 text index. This improves deletion performance by several orders of magnitude. - Don't modify the given query dictionary in the KeywordIndex.apply method. - Move FieldIndex's sorting functionality to a mixin class so it can be reused by zc.catalog's ValueIndex. 3.5.0 (2008-12-30) ------------------ - Remove zope.testing from dependencies, as it's not really needed. - Define the IIndexSort interface for indexes that support sorting. - Implement sorting for FieldIndex (adapted from repoze.catalog/ZCatalog). - Add an ``apply`` method for KeywordIndex/TopicIndex, making them implement IIndexSearch, which can be useful in a catalog. - Optimize the ``search`` method of KeywordIndex/TopicIndex by using multiunion for the ``or`` operator and sorting before intersection for ``and``. - IMPORTANT: KeywordIndex/TopicIndex now use IFSets instead of IISets. This makes them more compatible with other indexes (for example, when used in a catalog). This change can lead to problems if your code somehow depends on the II nature of the sets, as before. Also, FilteredSets now use IFSets as well: if you have any FilteredSets pickled in the database, you need to migrate them to IFSets yourself. You can do it like this: filter._ids = filter.family.IF.Set(filter._ids) where ``filter`` is an instance of FilteredSet. - IMPORTANT: KeywordIndex is now non-normalizing, which can be useful for non-string keywords, where case-normalization doesn't make any sense. Instead, it provides a ``normalize`` method that can be overridden by subclasses to provide some normalization. The CaseInsensitiveKeywordIndex class is now provided; it does case-normalization for string-based keywords. The old CaseSensitiveKeywordIndex is gone; applications should use KeywordIndex instead. It looks like KeywordIndex/TopicIndex were effectively abandonware and weren't used by application developers, so after some discussion we decided to refactor them to make them more usable, optimal, and compatible with other indexes and the catalog. Porting applications from the old KeywordIndex/TopicIndex to the new ones is rather easy and explained above, so we believe it isn't a problem. Please use the zope3-users@zope.org or zope-dev@zope.org mailing lists if you have any problems with the migration. Thanks to Chris McDonough of repoze for support and useful code. 3.4.1 (2007-09-28) ------------------ - Fixed a bug in package metadata (wrong homepage URL). 3.4.0 (2007-09-28) ------------------ No further changes since 3.4.0a1. 3.4.0a1 (2007-04-22) -------------------- Initial release as a separate project; corresponds to zope.index from Zope 3.4.0a1. Platform: UNKNOWN
zope.index-3.6.4/CHANGES.txt0000644000175000017500000001136611727503631016600 0ustar tseavertseaver00000000000000Changes ======= 3.6.4 (2012-03-12) ------------------ - Ensure proper unindex behavior if index_doc is called with an empty sequence. - Use the standard Python doctest module instead of zope.testing.doctest. 3.6.3 (2011-12-03) ------------------ - KeywordIndex: Minor optimization; use __nonzero__ instead of __len__ to avoid loading the full TreeSet. 3.6.2 (2011-12-03) ------------------ - KeywordIndex: Store docids in a TreeSet rather than a Set when the number of documents matching a word reaches a configurable threshold (default 64). The rule is applied to individual words at indexing time, but you can call the new optimize method to optimize all the words in an index at once. Designed to fix LP #881950. 3.6.1 (2010-07-08) ------------------ - TextIndex: reuse the lexicon from the underlying Okapi / Cosine index, if passed. (LP #232516) - Lexicon: avoid raising an exception when indexing None. (LP #598776) 3.6.0 (2009-08-03) ------------------ - Improved test readability and reached 100% test coverage. - Fixed a broken optimization in okascore.c: it was passing a Python float to the PyInt_AS_LONG() macro. This resulted in wrong scores, especially on 64 bit platforms, where all scores typically ended up being zero. - Changed okascore.c to produce the same results as its Python equivalent, reducing the brittleness of the text index tests. 3.5.2 (2009-06-09) ------------------ - Port the okascore.c optimization used in okapiindex from the Zope2 catalog implementation. This module is compiled conditionally, based on whether your environment has a working C compiler. - Don't use ``len(self._docweight)`` in the okapiindex _search_wids method (obtaining the length of a BTree is very expensive at scale). Instead use self.documentCount(). Also a Zope2 port. 3.5.1 (2009-02-27) ------------------ - The baseindex, okapiindex, and lexicon used plain counters for various lengths, which is unsuitable for production applications. Backport code from Zope2 indexes which opportunistically replaces the counters with BTree.Length objects. - Backport the non-insane version of baseindex._del_wordinfo from the Zope2 text index. This improves deletion performance by several orders of magnitude. - Don't modify the given query dictionary in the KeywordIndex.apply method. - Move FieldIndex's sorting functionality to a mixin class so it can be reused by zc.catalog's ValueIndex. 3.5.0 (2008-12-30) ------------------ - Remove zope.testing from dependencies, as it's not really needed. - Define the IIndexSort interface for indexes that support sorting. - Implement sorting for FieldIndex (adapted from repoze.catalog/ZCatalog). - Add an ``apply`` method for KeywordIndex/TopicIndex, making them implement IIndexSearch, which can be useful in a catalog. - Optimize the ``search`` method of KeywordIndex/TopicIndex by using multiunion for the ``or`` operator and sorting before intersection for ``and``. - IMPORTANT: KeywordIndex/TopicIndex now use IFSets instead of IISets. This makes them more compatible with other indexes (for example, when used in a catalog). This change can lead to problems if your code somehow depends on the II nature of the sets, as before. Also, FilteredSets now use IFSets as well: if you have any FilteredSets pickled in the database, you need to migrate them to IFSets yourself. You can do it like this: filter._ids = filter.family.IF.Set(filter._ids) where ``filter`` is an instance of FilteredSet. - IMPORTANT: KeywordIndex is now non-normalizing, which can be useful for non-string keywords, where case-normalization doesn't make any sense. Instead, it provides a ``normalize`` method that can be overridden by subclasses to provide some normalization. The CaseInsensitiveKeywordIndex class is now provided; it does case-normalization for string-based keywords. The old CaseSensitiveKeywordIndex is gone; applications should use KeywordIndex instead. It looks like KeywordIndex/TopicIndex were effectively abandonware and weren't used by application developers, so after some discussion we decided to refactor them to make them more usable, optimal, and compatible with other indexes and the catalog. Porting applications from the old KeywordIndex/TopicIndex to the new ones is rather easy and explained above, so we believe it isn't a problem. Please use the zope3-users@zope.org or zope-dev@zope.org mailing lists if you have any problems with the migration. Thanks to Chris McDonough of repoze for support and useful code. 3.4.1 (2007-09-28) ------------------ - Fixed a bug in package metadata (wrong homepage URL). 3.4.0 (2007-09-28) ------------------ No further changes since 3.4.0a1. 3.4.0a1 (2007-04-22) -------------------- Initial release as a separate project; corresponds to zope.index from Zope 3.4.0a1.
zope.index-3.6.4/setup.cfg0000644000175000017500000000022311727503757016607 0ustar tseavertseaver00000000000000[nosetests] nocapture = 1 with-coverage = 1 cover-erase = 1 cover-package = zope.index [egg_info] tag_build = tag_date = 0 tag_svn_revision = 0 zope.index-3.6.4/PKG-INFO0000644000175000017500000001500511727503757016067 0ustar tseavertseaver00000000000000Metadata-Version: 1.0 Name: zope.index Version: 3.6.4 Summary: Indices for use with a catalog: text, field, etc. Home-page: http://pypi.python.org/pypi/zope.index Author: Zope Foundation and Contributors Author-email: zope-dev@zope.org License: ZPL 2.1 Description: Overview -------- The ``zope.index`` package provides several indices for the Zope catalog. These include: * a field index (for indexing orderable values), * a keyword index, * a topic index, * a text index (with support for lexicon, splitter, normalizer, etc.) Changes ======= 3.6.4 (2012-03-12) ------------------ - Ensure proper unindex behavior if index_doc is called with an empty sequence. - Use the standard Python doctest module instead of zope.testing.doctest. 3.6.3 (2011-12-03) ------------------ - KeywordIndex: Minor optimization; use __nonzero__ instead of __len__ to avoid loading the full TreeSet. 3.6.2 (2011-12-03) ------------------ - KeywordIndex: Store docids in a TreeSet rather than a Set when the number of documents matching a word reaches a configurable threshold (default 64). The rule is applied to individual words at indexing time, but you can call the new optimize method to optimize all the words in an index at once. Designed to fix LP #881950. 3.6.1 (2010-07-08) ------------------ - TextIndex: reuse the lexicon from the underlying Okapi / Cosine index, if passed. (LP #232516) - Lexicon: avoid raising an exception when indexing None. (LP #598776) 3.6.0 (2009-08-03) ------------------ - Improved test readability and reached 100% test coverage. - Fixed a broken optimization in okascore.c: it was passing a Python float to the PyInt_AS_LONG() macro. This resulted in wrong scores, especially on 64 bit platforms, where all scores typically ended up being zero. - Changed okascore.c to produce the same results as its Python equivalent, reducing the brittleness of the text index tests. 3.5.2 (2009-06-09) ------------------ - Port the okascore.c optimization used in okapiindex from the Zope2 catalog implementation. This module is compiled conditionally, based on whether your environment has a working C compiler. - Don't use ``len(self._docweight)`` in the okapiindex _search_wids method (obtaining the length of a BTree is very expensive at scale). Instead use self.documentCount(). Also a Zope2 port. 3.5.1 (2009-02-27) ------------------ - The baseindex, okapiindex, and lexicon used plain counters for various lengths, which is unsuitable for production applications. Backport code from Zope2 indexes which opportunistically replaces the counters with BTree.Length objects. - Backport the non-insane version of baseindex._del_wordinfo from the Zope2 text index. This improves deletion performance by several orders of magnitude. - Don't modify the given query dictionary in the KeywordIndex.apply method. - Move FieldIndex's sorting functionality to a mixin class so it can be reused by zc.catalog's ValueIndex. 3.5.0 (2008-12-30) ------------------ - Remove zope.testing from dependencies, as it's not really needed. - Define the IIndexSort interface for indexes that support sorting. - Implement sorting for FieldIndex (adapted from repoze.catalog/ZCatalog). - Add an ``apply`` method for KeywordIndex/TopicIndex, making them implement IIndexSearch, which can be useful in a catalog. - Optimize the ``search`` method of KeywordIndex/TopicIndex by using multiunion for the ``or`` operator and sorting before intersection for ``and``. - IMPORTANT: KeywordIndex/TopicIndex now use IFSets instead of IISets. This makes them more compatible with other indexes (for example, when used in a catalog). This change can lead to problems if your code somehow depends on the II nature of the sets, as before. Also, FilteredSets now use IFSets as well: if you have any FilteredSets pickled in the database, you need to migrate them to IFSets yourself. You can do it like this: filter._ids = filter.family.IF.Set(filter._ids) where ``filter`` is an instance of FilteredSet. - IMPORTANT: KeywordIndex is now non-normalizing, which can be useful for non-string keywords, where case-normalization doesn't make any sense. Instead, it provides a ``normalize`` method that can be overridden by subclasses to provide some normalization. The CaseInsensitiveKeywordIndex class is now provided; it does case-normalization for string-based keywords. The old CaseSensitiveKeywordIndex is gone; applications should use KeywordIndex instead. It looks like KeywordIndex/TopicIndex were effectively abandonware and weren't used by application developers, so after some discussion we decided to refactor them to make them more usable, optimal, and compatible with other indexes and the catalog. Porting applications from the old KeywordIndex/TopicIndex to the new ones is rather easy and explained above, so we believe it isn't a problem. Please use the zope3-users@zope.org or zope-dev@zope.org mailing lists if you have any problems with the migration. Thanks to Chris McDonough of repoze for support and useful code. 3.4.1 (2007-09-28) ------------------ - Fixed a bug in package metadata (wrong homepage URL). 3.4.0 (2007-09-28) ------------------ No further changes since 3.4.0a1. 3.4.0a1 (2007-04-22) -------------------- Initial release as a separate project; corresponds to zope.index from Zope 3.4.0a1. Platform: UNKNOWN