pax_global_header00006660000000000000000000000064150404500130014502gustar00rootroot0000000000000052 comment=63df79c9f781656f8e761ebc0f5b95d86b765adf OHF-Voice-sentence-stream-63df79c/000077500000000000000000000000001504045001300167035ustar00rootroot00000000000000OHF-Voice-sentence-stream-63df79c/.gitignore000066400000000000000000000001721504045001300206730ustar00rootroot00000000000000.DS_Store .idea *.log tmp/ *.py[cod] *.egg build htmlcov .projectile .venv/ venv/ .tox/ .mypy_cache/ *.egg-info/ dist/ OHF-Voice-sentence-stream-63df79c/.isort.cfg000066400000000000000000000001611504045001300206000ustar00rootroot00000000000000[settings] multi_line_output=3 include_trailing_comma=True force_grid_wrap=0 use_parentheses=True line_length=88 OHF-Voice-sentence-stream-63df79c/CHANGELOG.md000066400000000000000000000001311504045001300205070ustar00rootroot00000000000000# Changelog ## 1.1.0 - Split sentences on double newlines ## 1.0.0 - Initial release OHF-Voice-sentence-stream-63df79c/LICENSE.md000066400000000000000000000261351504045001300203160ustar00rootroot00000000000000 Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. 
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. 
For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. 
If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. 
You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. 
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. 
Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. OHF-Voice-sentence-stream-63df79c/README.md000066400000000000000000000011551504045001300201640ustar00rootroot00000000000000# Sentence Stream A small sentence splitter for text streams. ## Install ``` sh pip install sentence-stream ``` ## Example ``` python from sentence_stream import stream_to_sentences text_chunks = [ "Text chunks that a", "re not on", " word or se", "ntence boundarie", "s. But, they w", "ill sti", "ll get sp", "lit right", "!!! Goo", "d", ] assert list(stream_to_sentences(text_chunks)) == [ "Text chunks that are not on word or sentence boundaries.", "But, they will still get split right!!!", "Good", ] ``` For async streams, use `async_stream_to_sentences`. 
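The example above works even though the chunks are cut in the middle of words, because text is buffered until a boundary can be confirmed. Below is a minimal sketch of that buffering idea using only the standard library. It is not the library's implementation: the names `split_stream` and `_BOUNDARY` are hypothetical, and the pattern is a simplified assumption that handles only ASCII punctuation followed by whitespace and an uppercase letter (no abbreviations, blank lines, numbered lists, or non-Latin scripts).

```python
import re
from typing import Iterable, Iterator

# Simplified boundary (an assumption for illustration): one or more of . ! ?
# followed by whitespace and an ASCII uppercase letter.
_BOUNDARY = re.compile(r"[.!?]+(?=\s+[A-Z])")


def split_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield sentences from text chunks that may break anywhere."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Emit every sentence whose boundary is already confirmed by the
        # text that follows it; keep the rest buffered for later chunks.
        while True:
            match = _BOUNDARY.search(buffer)
            if match is None:
                break
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    # Whatever is left when the stream ends is the final sentence.
    tail = buffer.strip()
    if tail:
        yield tail
```

The lookahead is what makes arbitrary chunk boundaries safe in this sketch: punctuation at the very end of the buffer is never treated as a boundary until the next chunk shows what follows it.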
OHF-Voice-sentence-stream-63df79c/mypy.ini000066400000000000000000000002301504045001300203750ustar00rootroot00000000000000 [mypy] [mypy-setuptools.*] ignore_missing_imports = True [mypy-pytest.*] ignore_missing_imports = True [mypy-regex.*] ignore_missing_imports = True OHF-Voice-sentence-stream-63df79c/pylintrc000066400000000000000000000014151504045001300204730ustar00rootroot00000000000000[MESSAGES CONTROL] disable= format, abstract-method, cyclic-import, duplicate-code, global-statement, import-outside-toplevel, inconsistent-return-statements, locally-disabled, not-context-manager, too-few-public-methods, too-many-arguments, too-many-branches, too-many-instance-attributes, too-many-lines, too-many-locals, too-many-public-methods, too-many-return-statements, too-many-statements, too-many-boolean-expressions, unnecessary-pass, unused-argument, broad-except, too-many-nested-blocks, invalid-name, unused-import, fixme, useless-super-delegation, missing-module-docstring, missing-class-docstring, missing-function-docstring, import-error, consider-using-with [FORMAT] expected-line-ending-format=LF OHF-Voice-sentence-stream-63df79c/pyproject.toml000066400000000000000000000025631504045001300216250ustar00rootroot00000000000000[build-system] requires = ["setuptools>=62.3"] build-backend = "setuptools.build_meta" [project] name = "sentence_stream" version = "1.1.0" license = {text = "Apache-2.0"} description = "A small sentence splitter for text streams" readme = "README.md" authors = [ {name = "The Home Assistant Authors", email = "hello@home-assistant.io"} ] keywords = ["home", "assistant", "sentence boundary"] classifiers = [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "Topic :: Text Processing :: Linguistic", "License :: OSI Approved :: Apache Software License", "Programming Language :: Python :: 3.9", "Programming Language :: Python :: 3.10", "Programming Language :: Python :: 3.11", "Programming Language :: Python :: 3.12", "Programming 
Language :: Python :: 3.13", ] requires-python = ">=3.9.0" dependencies = [ "regex==2024.11.6" ] [project.optional-dependencies] dev = [ "black==24.8.0", "flake8==7.2.0", "mypy==1.14.0", "pylint==3.2.7", "pytest==8.3.5", "pytest-asyncio==1.1.0", "tox==4.26.0", "build==1.2.2", ] [project.urls] "Source Code" = "http://github.com/OHF-Voice/sentence-stream" [tool.setuptools] platforms = ["any"] zip-safe = true include-package-data = true [tool.setuptools.packages.find] include = ["sentence_stream"] exclude = ["tests", "tests.*"] OHF-Voice-sentence-stream-63df79c/script/000077500000000000000000000000001504045001300202075ustar00rootroot00000000000000OHF-Voice-sentence-stream-63df79c/script/format000077500000000000000000000011041504045001300214210ustar00rootroot00000000000000#!/usr/bin/env python3 import subprocess import venv from pathlib import Path _DIR = Path(__file__).parent _PROGRAM_DIR = _DIR.parent _VENV_DIR = _PROGRAM_DIR / ".venv" _MODULE_DIR = _PROGRAM_DIR / "sentence_stream" _TESTS_DIR = _PROGRAM_DIR / "tests" _FORMAT_DIRS = [_MODULE_DIR, _TESTS_DIR] if _VENV_DIR.exists(): context = venv.EnvBuilder().ensure_directories(_VENV_DIR) python_exe = context.env_exe else: python_exe = "python3" subprocess.check_call([python_exe, "-m", "black"] + _FORMAT_DIRS) subprocess.check_call([python_exe, "-m", "isort"] + _FORMAT_DIRS) OHF-Voice-sentence-stream-63df79c/script/lint000077500000000000000000000014331504045001300211040ustar00rootroot00000000000000#!/usr/bin/env python3 import subprocess import venv from pathlib import Path _DIR = Path(__file__).parent _PROGRAM_DIR = _DIR.parent _VENV_DIR = _PROGRAM_DIR / ".venv" _MODULE_DIR = _PROGRAM_DIR / "sentence_stream" _TESTS_DIR = _PROGRAM_DIR / "tests" _LINT_DIRS = [_MODULE_DIR, _TESTS_DIR] if _VENV_DIR.exists(): context = venv.EnvBuilder().ensure_directories(_VENV_DIR) python_exe = context.env_exe else: python_exe = "python3" subprocess.check_call([python_exe, "-m", "black"] + _LINT_DIRS + ["--check"]) 
subprocess.check_call([python_exe, "-m", "isort"] + _LINT_DIRS + ["--check"]) subprocess.check_call([python_exe, "-m", "flake8"] + _LINT_DIRS) subprocess.check_call([python_exe, "-m", "pylint"] + _LINT_DIRS) subprocess.check_call([python_exe, "-m", "mypy"] + _LINT_DIRS) OHF-Voice-sentence-stream-63df79c/script/package000077500000000000000000000006221504045001300215300ustar00rootroot00000000000000#!/usr/bin/env python3 import subprocess import venv from pathlib import Path _DIR = Path(__file__).parent _PROGRAM_DIR = _DIR.parent _VENV_DIR = _PROGRAM_DIR / ".venv" if _VENV_DIR.exists(): context = venv.EnvBuilder().ensure_directories(_VENV_DIR) python_exe = context.env_exe else: python_exe = "python3" subprocess.check_call( [python_exe, "-m", "build", "--sdist", "--wheel"] ) OHF-Voice-sentence-stream-63df79c/script/setup000077500000000000000000000015661504045001300213050ustar00rootroot00000000000000#!/usr/bin/env python3 import argparse import subprocess import venv from pathlib import Path _DIR = Path(__file__).parent _PROGRAM_DIR = _DIR.parent _VENV_DIR = _PROGRAM_DIR / ".venv" parser = argparse.ArgumentParser() parser.add_argument("--dev", action="store_true", help="Install dev requirements") args = parser.parse_args() # Create virtual environment builder = venv.EnvBuilder(with_pip=True) context = builder.ensure_directories(_VENV_DIR) builder.create(_VENV_DIR) # Upgrade dependencies pip = [context.env_exe, "-m", "pip"] subprocess.check_call(pip + ["install", "--upgrade", "pip"]) subprocess.check_call(pip + ["install", "--upgrade", "setuptools", "wheel"]) # Install requirements subprocess.check_call(pip + ["install", "-e", str(_PROGRAM_DIR)]) if args.dev: # Install dev requirements subprocess.check_call(pip + ["install", "-e", f"{_PROGRAM_DIR}[dev]"]) OHF-Voice-sentence-stream-63df79c/script/test000077500000000000000000000006771504045001300211260ustar00rootroot00000000000000#!/usr/bin/env python3 import subprocess import sys import venv from pathlib import Path _DIR 
= Path(__file__).parent _PROGRAM_DIR = _DIR.parent _VENV_DIR = _PROGRAM_DIR / ".venv" _TEST_DIR = _PROGRAM_DIR / "tests" if _VENV_DIR.exists(): context = venv.EnvBuilder().ensure_directories(_VENV_DIR) python_exe = context.env_exe else: python_exe = "python3" subprocess.check_call([python_exe, "-m", "pytest", _TEST_DIR] + sys.argv[1:]) OHF-Voice-sentence-stream-63df79c/sentence_stream/000077500000000000000000000000001504045001300220625ustar00rootroot00000000000000OHF-Voice-sentence-stream-63df79c/sentence_stream/__init__.py000066400000000000000000000004311504045001300241710ustar00rootroot00000000000000"""Guess the sentence boundaries in a text stream.""" from .sentence_stream import ( SentenceBoundaryDetector, async_stream_to_sentences, stream_to_sentences, ) __all__ = [ "async_stream_to_sentences", "stream_to_sentences", "SentenceBoundaryDetector", ] OHF-Voice-sentence-stream-63df79c/sentence_stream/sentence_stream.py000066400000000000000000000065571504045001300256300ustar00rootroot00000000000000"""Guess the sentence boundaries in a text stream.""" from collections.abc import AsyncGenerator, AsyncIterable, Generator, Iterable import regex as re from .util import remove_asterisks SENTENCE_END = r"[.!?…]|[。!?]|[؟]|[।॥]" ABBREVIATION_RE = re.compile(r"\b\p{L}{1,3}\.$", re.UNICODE) SENTENCE_BOUNDARY_RE = re.compile( rf"(?:{SENTENCE_END}+)(?=\s+[\p{{Lu}}\p{{Lt}}\p{{Lo}}]|(?:\s+\d+[.)]{{1,2}}\s+))", re.DOTALL, ) BLANK_LINES_RE = re.compile(r"(?:\r?\n){2,}") # ----------------------------------------------------------------------------- def stream_to_sentences(text_stream: Iterable[str]) -> Generator[str]: """Generate sentences from a text stream.""" boundary_detector = SentenceBoundaryDetector() for text_chunk in text_stream: yield from boundary_detector.add_chunk(text_chunk) final_text = boundary_detector.finish() if final_text: yield final_text async def async_stream_to_sentences( text_stream: AsyncIterable[str], ) -> AsyncGenerator[str]: """Generate sentences from an 
async text stream.""" boundary_detector = SentenceBoundaryDetector() async for text_chunk in text_stream: for sentence in boundary_detector.add_chunk(text_chunk): yield sentence final_text = boundary_detector.finish() if final_text: yield final_text # ----------------------------------------------------------------------------- class SentenceBoundaryDetector: """Detect sentence boundaries from a text stream.""" def __init__(self) -> None: self.remaining_text = "" self.current_sentence = "" def add_chunk(self, chunk: str) -> Iterable[str]: """Add text chunk to stream and yield all detected sentences.""" self.remaining_text += chunk while self.remaining_text: match_blank_lines = BLANK_LINES_RE.search(self.remaining_text) match_punctuation = SENTENCE_BOUNDARY_RE.search(self.remaining_text) if match_blank_lines and match_punctuation: if match_blank_lines.start() < match_punctuation.start(): first_match = match_blank_lines else: first_match = match_punctuation elif match_blank_lines: first_match = match_blank_lines elif match_punctuation: first_match = match_punctuation else: break match_text = self.remaining_text[: first_match.start() + 1] match_end = first_match.end() if not self.current_sentence: self.current_sentence = match_text elif ABBREVIATION_RE.search(self.current_sentence[-5:]): self.current_sentence += match_text else: yield remove_asterisks(self.current_sentence.strip()) self.current_sentence = match_text if not ABBREVIATION_RE.search(self.current_sentence[-5:]): yield remove_asterisks(self.current_sentence.strip()) self.current_sentence = "" self.remaining_text = self.remaining_text[match_end:] def finish(self) -> str: """End text stream and yield final sentence.""" text = (self.current_sentence + self.remaining_text).strip() self.remaining_text = "" self.current_sentence = "" return remove_asterisks(text) OHF-Voice-sentence-stream-63df79c/sentence_stream/util.py000066400000000000000000000005131504045001300234100ustar00rootroot00000000000000"""Utility 
methods.""" import regex as re WORD_ASTERISKS = re.compile(r"\*+([^\*]+)\*+") LINE_ASTERICKS = re.compile(r"(?<=^|\n)\s*\*+") def remove_asterisks(text: str) -> str: """Remove *asterisks* surrounding **words**""" text = WORD_ASTERISKS.sub(r"\1", text) text = LINE_ASTERICKS.sub("", text) return text OHF-Voice-sentence-stream-63df79c/setup.cfg000066400000000000000000000006661504045001300205340ustar00rootroot00000000000000[flake8] # To work with Black max-line-length = 88 # E501: line too long # W503: Line break occurred before a binary operator # E203: Whitespace before ':' # D202 No blank lines allowed after function docstring # W504 line break after binary operator ignore = E501, W503, E203, D202, W504 [isort] multi_line_output = 3 include_trailing_comma=True force_grid_wrap=0 use_parentheses=True line_length=88 indent = " " OHF-Voice-sentence-stream-63df79c/tests/000077500000000000000000000000001504045001300200455ustar00rootroot00000000000000OHF-Voice-sentence-stream-63df79c/tests/__init__.py000066400000000000000000000000001504045001300221440ustar00rootroot00000000000000OHF-Voice-sentence-stream-63df79c/tests/english_golden_rules.py000066400000000000000000000240521504045001300246150ustar00rootroot00000000000000"""Golden rules for English from pySBD. See: https://github.com/nipunsadvilkar/pySBD """ # NOTE: Added boolean to indicate if rule is expected to pass (True) or fail (False). GOLDEN_EN_RULES = [ # 1) Simple period to end sentence (True, "Hello World. My name is Jonas.", ["Hello World.", "My name is Jonas."]), # 2) Question mark to end sentence ( True, "What is your name? My name is Jonas.", ["What is your name?", "My name is Jonas."], ), # 3) Exclamation point to end sentence (True, "There it is! I found it.", ["There it is!", "I found it."]), # 4) One letter upper case abbreviations (True, "My name is Jonas E. Smith.", ["My name is Jonas E. Smith."]), # 5) One letter lower case abbreviations (True, "Please turn to p. 
55."]), # 6) Two letter lower case abbreviations in the middle of a sentence (True, "Were Jane and co. at the party?", ["Were Jane and co. at the party?"]), # 7) Two letter upper case abbreviations in the middle of a sentence ( True, "They closed the deal with Pitt, Briggs & Co. at noon.", ["They closed the deal with Pitt, Briggs & Co. at noon."], ), # 8) Two letter lower case abbreviations at the end of a sentence ( False, "Let's ask Jane and co. They should know.", ["Let's ask Jane and co.", "They should know."], ), # 9) Two letter upper case abbreviations at the end of a sentence ( False, "They closed the deal with Pitt, Briggs & Co. It closed yesterday.", ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."], ), # 10) Two letter (prepositive) abbreviations (True, "I can see Mt. Fuji from here.", ["I can see Mt. Fuji from here."]), # 11) Two letter (prepositive & postpositive) abbreviations ( True, "St. Michael's Church is on 5th st. near the light.", ["St. Michael's Church is on 5th st. near the light."], ), # 12) Possessive two letter abbreviations (True, "That is JFK Jr.'s book.", ["That is JFK Jr.'s book."]), # 13) Multi-period abbreviations in the middle of a sentence (True, "I visited the U.S.A. last year.", ["I visited the U.S.A. last year."]), # 14) Multi-period abbreviations at the end of a sentence ( False, "I live in the E.U. How about you?", ["I live in the E.U.", "How about you?"], ), # 15) U.S. as sentence boundary ( False, "I live in the U.S. How about you?", ["I live in the U.S.", "How about you?"], ), # 16) U.S. as non sentence boundary with next word capitalized ( True, "I work for the U.S. Government in Virginia.", ["I work for the U.S. Government in Virginia."], ), # 17) U.S. as non sentence boundary ( True, "I have lived in the U.S. for 20 years.", ["I have lived in the U.S. for 20 years."], ), # Most difficult sentence to crack # 18) A.M. / P.M. as non sentence boundary and sentence boundary ( False, "At 5 a.m. 
Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.", [ "At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store.", ], ), # 19) Number as non sentence boundary (True, "She has $100.00 in her bag.", ["She has $100.00 in her bag."]), # 20) Number as sentence boundary ( True, "She has $100.00. It is in her bag.", ["She has $100.00.", "It is in her bag."], ), # 21) Parenthetical inside sentence ( True, "He teaches science (False,He previously worked for 5 years as an engineer.) at the local University.", [ "He teaches science (False,He previously worked for 5 years as an engineer.) at the local University." ], ), # 22) Email addresses ( False, "Her email is Jane.Doe@example.com. I sent her an email.", ["Her email is Jane.Doe@example.com.", "I sent her an email."], ), # 23) Web addresses ( True, "The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out.", [ "The site is: https://www.example.50.com/new-site/awesome_content.html.", "Please check it out.", ], ), # 24) Single quotations inside sentence ( True, "She turned to him, 'This is great.' she said.", ["She turned to him, 'This is great.' she said."], ), # 25) Double quotations inside sentence ( True, 'She turned to him, "This is great." she said.', ['She turned to him, "This is great." she said.'], ), # 26) Double quotations at the end of a sentence ( False, 'She turned to him, "This is great." She held the book out to show him.', ['She turned to him, "This is great."', "She held the book out to show him."], ), # 27) Double punctuation (exclamation point) (True, "Hello!! Long time no see.", ["Hello!!", "Long time no see."]), # 28) Double punctuation (question mark) (True, "Hello?? Who is there?", ["Hello??", "Who is there?"]), # 29) Double punctuation (exclamation point / question mark) (True, "Hello!? 
 Is that you?", ["Hello!?", "Is that you?"]), # 30) Double punctuation (question mark / exclamation point) (True, "Hello?! Is that you?", ["Hello?!", "Is that you?"]), # 31) List (period followed by parens and no period to end item) ( False, "1.) The first item 2.) The second item", ["1.) The first item", "2.) The second item"], ), # 32) List (period followed by parens and period to end item) ( True, "1.) The first item. 2.) The second item.", ["1.) The first item.", "2.) The second item."], ), # 33) List (parens and no period to end item) ( False, "1) The first item 2) The second item", ["1) The first item", "2) The second item"], ), # 34) List (parens and period to end item) ( True, "1) The first item. 2) The second item.", ["1) The first item.", "2) The second item."], ), # 35) List (period to mark list and no period to end item) ( False, "1. The first item 2. The second item", ["1. The first item", "2. The second item"], ), # 36) List (period to mark list and period to end item) ( False, "1. The first item. 2. The second item.", ["1. The first item.", "2. The second item."], ), # 37) List with bullet ( False, "• 9. The first item • 10. The second item", ["• 9. The first item", "• 10. The second item"], ), # 38) List with hyphen ( False, "⁃9. The first item ⁃10. The second item", ["⁃9. The first item", "⁃10. The second item"], ), # 39) Alphabetical list ( False, "a. The first item b. The second item c. The third list item", ["a. The first item", "b. The second item", "c. The third list item"], ), # 40) Geo Coordinates ( True, "You can find it at N°. 1026.253.553. That is where the treasure is.", ["You can find it at N°. 1026.253.553.", "That is where the treasure is."], ), # 41) Named entities with an exclamation point ( True, "She works at Yahoo! in the accounting department.", ["She works at Yahoo! 
 in the accounting department."], ), # 42) I as a sentence boundary and I as an abbreviation ( False, "We make a good team, you and I. Did you see Albert I. Jones yesterday?", ["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"], ), # 43) Ellipsis at end of quotation ( True, "Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”", [ "Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”" ], ), # 44) Ellipsis with square brackets ( True, """"Bohr [...] used the analogy of parallel stairways [...]" (False,Smith 55).""", ['"Bohr [...] used the analogy of parallel stairways [...]" (False,Smith 55).'], ), # 45) Ellipsis as sentence boundary (standard ellipsis rules) ( True, "If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (False,preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence.", [ "If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (False,preceded and followed by a space) and then indicate the end of the sentence with a period . . . .", "Next sentence.", ], ), # 46) Ellipsis as sentence boundary (non-standard ellipsis rules) ( True, "I never meant that.... She left the store.", ["I never meant that....", "She left the store."], ), # 47) Ellipsis as non sentence boundary ( False, "I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it.", [ "I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it." ], ), # 48) 4-dot ellipsis ( False, "One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . 
.", [ "One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds.", ". . . The practice was not abandoned. . . .", ], ), ] OHF-Voice-sentence-stream-63df79c/tests/test_remove_asterisks.py000066400000000000000000000011411504045001300250400ustar00rootroot00000000000000"""Tests for removing asterisks from text (Markdown).""" from sentence_stream import stream_to_sentences from sentence_stream.util import remove_asterisks def test_remove_word_asterisks() -> None: assert list( stream_to_sentences( "**Test** sentence with *emphasized* words! Another *** sentence." ) ) == ["Test sentence with emphasized words!", "Another *** sentence."] def test_remove_line_asterisks() -> None: assert ( remove_asterisks("* Test item 1.\n\n** Test item 2\n * Test item 3.") == " Test item 1.\n\n Test item 2\n Test item 3." ) OHF-Voice-sentence-stream-63df79c/tests/test_sentence_stream.py000066400000000000000000000113311504045001300246340ustar00rootroot00000000000000"""Tests for sentence boundary detection.""" from typing import List import pytest from sentence_stream import async_stream_to_sentences, stream_to_sentences from .english_golden_rules import GOLDEN_EN_RULES @pytest.mark.asyncio async def test_one_chunk() -> None: """Test that a single text chunk produces a single sentence.""" text = "Test chunk" assert list(stream_to_sentences([text])) == [text] async def text_gen(): yield text assert [sent async for sent in async_stream_to_sentences(text_gen())] == [text] @pytest.mark.parametrize("punctuation", (".", "?", "!", "!?")) @pytest.mark.asyncio async def test_one_chunk_with_punctuation(punctuation: str) -> None: """Test that punctuation splits sentences in a single chunk.""" text_1 = f"Test chunk 1{punctuation}" text_2 = "Test chunk 2" text = f"{text_1} {text_2}" assert list(stream_to_sentences([text])) == [text_1, text_2] async def text_gen(): yield text assert [sent async for sent in async_stream_to_sentences(text_gen())] == [ 
text_1, text_2, ] @pytest.mark.asyncio async def test_multiple_chunks() -> None: """Test sentence splitting across multiple chunks.""" text_1 = "Test chunk 1." text_2 = "Test chunk 2." texts = ["Test chunk", " 1. Test chunk", " 2."] assert list(stream_to_sentences(texts)) == [text_1, text_2] async def text_gen(): for text in texts: yield text assert [sent async for sent in async_stream_to_sentences(text_gen())] == [ text_1, text_2, ] def test_numbered_lists() -> None: """Test breaking apart numbered lists (+ removing asterisks).""" sentences = list( stream_to_sentences( "Final Fantasy VII features several key characters who drive the narrative: " "1. **Cloud Strife** - The protagonist, an ex-SOLDIER mercenary and a skilled fighter. " "2. **Aerith Gainsborough (Aeris)** - A kindhearted flower seller with spiritual powers and deep connections to the planet's ecosystem. " "3. **Barret Wallace** - A leader of eco-terrorists called AVALANCHE, fighting against Shinra Corporation's exploitation of the planet. " "4. **Tifa Lockhart** - Cloud's childhood friend who runs a bar in Sector 7 and helps him recover from past trauma. " "5. **Sephiroth** - The main antagonist, an ex-SOLDIER with god-like abilities, seeking to control or destroy the planet. " "6. **Red XIII (aka Red 13)** - A member of a catlike race called Cetra, searching for answers about his heritage and destiny. " "7. **Vincent Valentine** - A brooding former Turk who lives in isolation from guilt over past failures but aids Cloud's party with his powerful abilities. " "8. **Cid Highwind** - The pilot of the rocket plane Highwind and a skilled engineer working on various airship projects. 9. " "**Shinra Employees (JENOVA Project)** - Characters like Professor Hojo, President Shinra, and Reno who play crucial roles in the plot's development. " "Each character brings unique skills and perspectives to the story, contributing to its rich narrative and gameplay dynamics."
) ) assert len(sentences) == 10 assert sentences[1].startswith("2. Aerith Gainsborough") @pytest.mark.asyncio async def test_blank_line() -> None: """Test that a double newline splits a sentence.""" text_1 = "Test sentence 1" text_2 = "Test sentence 2." text_3 = "Test sentence 3" text = f"{text_1}\n\n{text_2} {text_3}" assert list(stream_to_sentences([text])) == [text_1, text_2, text_3] async def text_gen(): yield text assert [sent async for sent in async_stream_to_sentences(text_gen())] == [ text_1, text_2, text_3, ] @pytest.mark.asyncio async def test_newline_punctuation() -> None: """Test that a newline with punctuation splits a sentence.""" text_1 = "Test sentence 1." text_2 = "Test sentence 2." text = f"{text_1}\n{text_2}" assert list(stream_to_sentences([text])) == [text_1, text_2] async def text_gen(): yield text assert [sent async for sent in async_stream_to_sentences(text_gen())] == [ text_1, text_2, ] @pytest.mark.parametrize(("should_pass", "text", "expected_sentences"), GOLDEN_EN_RULES) def test_golden_rules_en( should_pass: bool, text: str, expected_sentences: List[str] ) -> None: """Test English 'golden rules'.""" actual_sentences = list(stream_to_sentences(text)) if should_pass: assert expected_sentences == actual_sentences else: # Expected to fail assert expected_sentences != actual_sentences, "Expected to fail but succeeded"