pax_global_header 0000666 0000000 0000000 00000000064 12775654761 0014536 g ustar 00root root 0000000 0000000 52 comment=5460f915bde6a81a8fc3fcfd8a794e9bcb7352ca
language-detector-language-detector-0.6/ 0000775 0000000 0000000 00000000000 12775654761 0020365 5 ustar 00root root 0000000 0000000 language-detector-language-detector-0.6/.gitignore 0000664 0000000 0000000 00000000045 12775654761 0022354 0 ustar 00root root 0000000 0000000 /target
/language-detector.iml
.idea/ language-detector-language-detector-0.6/LICENSE 0000664 0000000 0000000 00000024173 12775654761 0021401 0 ustar 00root root 0000000 0000000 Apache License, Version 2.0
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
2. Grant of Copyright License.
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
3. Grant of Patent License.
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution.
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
You must give any other recipients of the Work or Derivative Works a copy of this License; and
You must cause any modified files to carry prominent notices stating that You changed the files; and
You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
5. Submission of Contributions.
Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
6. Trademarks.
This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty.
Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability.
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability.
While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work
To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
language-detector-language-detector-0.6/README.md 0000664 0000000 0000000 00000025337 12775654761 0021656 0 ustar 00root root 0000000 0000000
# language-detector
Language Detection Library for Java

    <dependency>
        <groupId>com.optimaize.languagedetector</groupId>
        <artifactId>language-detector</artifactId>
        <version>0.5</version>
    </dependency>

## Language Support
### 70 Built-in Language Profiles
1. af Afrikaans
1. an Aragonese
1. ar Arabic
1. ast Asturian
1. be Belarusian
1. br Breton
1. ca Catalan
1. bg Bulgarian
1. bn Bengali
1. cs Czech
1. cy Welsh
1. da Danish
1. de German
1. el Greek
1. en English
1. es Spanish
1. et Estonian
1. eu Basque
1. fa Persian
1. fi Finnish
1. fr French
1. ga Irish
1. gl Galician
1. gu Gujarati
1. he Hebrew
1. hi Hindi
1. hr Croatian
1. ht Haitian
1. hu Hungarian
1. id Indonesian
1. is Icelandic
1. it Italian
1. ja Japanese
1. km Khmer
1. kn Kannada
1. ko Korean
1. lt Lithuanian
1. lv Latvian
1. mk Macedonian
1. ml Malayalam
1. mr Marathi
1. ms Malay
1. mt Maltese
1. ne Nepali
1. nl Dutch
1. no Norwegian
1. oc Occitan
1. pa Punjabi
1. pl Polish
1. pt Portuguese
1. ro Romanian
1. ru Russian
1. sk Slovak
1. sl Slovene
1. so Somali
1. sq Albanian
1. sr Serbian
1. sv Swedish
1. sw Swahili
1. ta Tamil
1. te Telugu
1. th Thai
1. tl Tagalog
1. tr Turkish
1. uk Ukrainian
1. ur Urdu
1. vi Vietnamese
1. yi Yiddish
1. zh-cn Simplified Chinese
1. zh-tw Traditional Chinese
User danielnaber has made available a profile for Esperanto on his website, see open tasks.
### Other Languages
You can create a language profile for your own language easily.
See https://github.com/optimaize/language-detector/blob/master/src/main/resources/README.md
## How it Works
The software uses language profiles that were created from common text in each language.
[N-grams](http://en.wikipedia.org/wiki/N-gram) were extracted from that text and stored in the profiles.
To determine the language of a given text, the program goes through the same process:
it extracts the same kind of n-grams from the input text, compares their relative frequencies with those
in each profile, and picks the language that matches best.
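
The idea can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the class `NgramSketch`, its method names, and the dot-product scoring are assumptions chosen for brevity, and the real detector uses its own profile format and scoring.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; names and scoring are assumptions, not the library's code. */
final class NgramSketch {

    /** Counts all character n-grams of length minLen..maxLen and returns their relative frequencies. */
    static Map<String, Double> relativeFrequencies(String text, int minLen, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (int n = minLen; n <= maxLen; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
                total++;
            }
        }
        Map<String, Double> freqs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freqs.put(e.getKey(), e.getValue() / (double) total);
        }
        return freqs;
    }

    /** Sums, over the n-grams of the input, input frequency times profile frequency. */
    static double similarity(Map<String, Double> input, Map<String, Double> profile) {
        double score = 0.0;
        for (Map.Entry<String, Double> e : input.entrySet()) {
            score += e.getValue() * profile.getOrDefault(e.getKey(), 0.0);
        }
        return score;
    }

    /** Returns the key of the best-scoring profile, or null if no n-gram overlaps with any profile. */
    static String bestMatch(String text, Map<String, Map<String, Double>> profiles) {
        Map<String, Double> input = relativeFrequencies(text, 1, 3);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Double>> p : profiles.entrySet()) {
            double score = similarity(input, p.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = p.getKey();
            }
        }
        return best;
    }
}
```

A profile in this sketch is simply the result of `relativeFrequencies` on a training text. Because frequencies are relative, a longer training text does not by itself raise a language's score.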
### Challenges
This software does not work as well when the input text to analyze is short or unclean, for example tweets.
When a text is written in multiple languages, the default algorithm of this software is not appropriate.
You can try to split the text (by sentence or paragraph) and detect the individual parts. Running the language guesser
on the whole text will just tell you the language that is most dominant, in the best case.
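
Splitting by sentence could look roughly like the sketch below. It uses the JDK's `java.text.BreakIterator` for sentence boundaries. The `detector` parameter is a hypothetical stand-in for whatever call returns a language code for a piece of text, and the class `PerSentenceDetection` is not part of this library.

```java
import java.text.BreakIterator;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Illustrative sketch only; the detector function is a placeholder for a real detection call. */
final class PerSentenceDetection {

    /** Splits text into trimmed, non-empty sentences using the JDK sentence iterator. */
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    /** Runs the given detector on each sentence and returns (sentence, detected language) pairs in order. */
    static List<Map.Entry<String, String>> detectEach(String text, Function<String, String> detector) {
        List<Map.Entry<String, String>> result = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            result.add(new AbstractMap.SimpleEntry<>(sentence, detector.apply(sentence)));
        }
        return result;
    }
}
```

Keep in mind that individual sentences are short, so per-sentence results are less reliable than a result for a long text in a single language.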
This software cannot handle it well when the input text is in none of the expected (and supported) languages.
For example if you only load the language profiles from English and German, but the text is written in French,
the program may pick the more likely one, or say it doesn't know. (An improvement would be to clearly detect that
it's unlikely one of the supported languages.)
If you are looking for a language detector / language guesser library in Java, this seems to be the best open source
library you can get at this time. If it doesn't need to be Java, you may want to take a look at https://code.google.com/p/cld2/
## How to Use
#### Language Detection for your Text

    //load all languages:
    List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();

    //build language detector:
    LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())
            .withProfiles(languageProfiles)
            .build();

    //create a text object factory
    TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();

    //query:
    TextObject textObject = textObjectFactory.forText("my text");
    Optional<LdLocale> lang = languageDetector.detect(textObject);

#### Creating Language Profiles for your Training Text

    //create text object factory:
    TextObjectFactory textObjectFactory = CommonTextObjectFactories.forIndexingCleanText();

    //load your training text:
    TextObject inputText = textObjectFactory.create()
            .append("this is my")
            .append("training text");

    //create the profile:
    LanguageProfile languageProfile = new LanguageProfileBuilder("en")
            .ngramExtractor(NgramExtractors.standard())
            .minimalFrequency(5) //adjust please
            .addText(inputText)
            .build();

    //store it to disk if you like (the target directory must exist):
    new LanguageProfileWriter().writeToDirectory(languageProfile, new File("c:/foo/bar"));

For the profile name, use the ISO 639-1 language code if there is one, otherwise the ISO 639-3 code.
The training text should be rather clean; it is a good idea to remove parts written in other languages
(like English phrases, or Latin script content in a Cyrillic text for example). Some also like to remove
proper nouns like (international) place names in case there are too many. It's up to you how far you go.
As a general rule, the cleaner the text, the better its profile.
If you scrape text from Wikipedia then please only use the main content, without the left side navigation etc.
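
As one possible way to do such cleaning, the sketch below drops every whitespace-separated token that contains a letter from a script other than the wanted one, for example Latin-script words inside a Cyrillic text. The class `ScriptFilter` is hypothetical and not part of this library. It relies only on the JDK's `Character.UnicodeScript`.

```java
import java.lang.Character.UnicodeScript;

/** Illustrative sketch only; a simple token filter, not part of the library. */
final class ScriptFilter {

    /** Keeps only tokens in which every letter belongs to the wanted script. */
    static String keepScript(String text, UnicodeScript wanted) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            boolean foreign = false;
            for (int i = 0; i < token.length(); ) {
                int cp = token.codePointAt(i);
                // Digits and punctuation are ignored; only letters decide.
                if (Character.isLetter(cp) && UnicodeScript.of(cp) != wanted) {
                    foreign = true;
                    break;
                }
                i += Character.charCount(cp);
            }
            if (!foreign) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(token);
            }
        }
        return out.toString();
    }
}
```

Calling `ScriptFilter.keepScript(trainingText, UnicodeScript.CYRILLIC)` before building a profile would remove Latin-script words. Note that it also collapses whitespace and leaves tokens made only of digits or punctuation in place, which may or may not be what you want.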
For practical reasons, the profile size should be similar to that of the existing profiles. When computing the
likelihood of a candidate language, the index size is taken into account, so a language with a larger profile
does not have a higher probability of being chosen.
Please contribute your new language profile to this project. The file can be added to the languages folder, and
then referenced in the BuiltInLanguages class. Or else open a ticket, and provide a download link.
Also, it's a good idea to put the original text along with the modifying (cleaning) code into a new
project on GitHub. This gives others the possibility to improve on your work. Or maybe even use the
training text in other, non-Java software.
## How You Can Help
If your language is not supported yet, then you can provide clean "training text", that is, common text written in your
language. The text should be fairly long (a couple of pages at the very least). If you can provide that, please open
a ticket.
If your language is supported already, but not identified clearly all the time, you can still provide such training
text. We might then be able to improve detection for your language.
If you're a programmer, dig in the source and see what you can improve. Check the open tasks.
## History and Changes
This is a fork from https://code.google.com/p/lang-guess/ (forked on 2014-02-27) which itself is a fork
of the original project https://code.google.com/p/language-detection/
#### Changes made here
##### Functional Changes
* Made results for short text consistent, no random n-gram selection for short text (configurable).
* Configurable when to remove ASCII (foreign script). Old code did it when ascii was less than 1/3 of the content, and only for ASCII.
* New n-gram generation. Faster, and flexible (filter, space-padding). Previously it was hardcoded to 1, 2 and 3-grams, and it had hardcoded which n-grams were ignored.
* LanguageDetector is now safe to use multi-threaded.
* Clear code to safely load profiles and use them, no state in static fields.
* Easier to generate your own language profiles based on training text, and to load and store them.
* Feature to weight prefix and suffix n-grams higher.
##### Technical Changes
* Updated to use Java 7 for compilation, and for syntax. It's 2015, and 7/8 are the only versions officially supported by Oracle.
* Code quality improvements:
    * Returning interfaces instead of implementations (List instead of ArrayList etc)
    * String .equals instead of ==
    * Replaced StringBuffer with StringBuilder
    * Renamed classes for clarity
    * Made classes immutable, and thus thread safe
    * Made fields private, using accessors
    * Clear null reference concept:
        * using IntelliJ's @Nullable and @NotNull annotations
        * using Guava's Optional
    * Added JavaDoc, fixed typos
    * Added interfaces
* More tests. Thanks to the refactorings, code that was too tightly embedded before is now testable.
* Removed the "seed" completely (for the Random() number generator, I don't see the use). UPDATE: now I do, there's an open task to re-add this.
* Updated all Maven dependency versions
* Replaced last lib dependency with Maven (jsonic)
##### Legal
Apache2 license, just like the work from which this is derived.
(I had temporarily changed it to LGPLv3, but that change was invalid and therefore reverted.)
##### TODO
The software works well, but there are things that can be improved. Check the Issues list.
#### Why so much forking?
The original project hasn't seen any commit in a while. The issue list is growing.
The news page says for 2012 that it now has Maven support, but there is no pom in git.
There is a release in Maven see http://mvnrepository.com/artifact/com.cybozu.labs/langdetect/1.1-20120112
for version 1.1-20120112 but not in git. So I don't know what's going on there.
The lang-guess fork saw quite a few commits in 2011 and up to March 2012, then nothing more.
It uses Maven.
The 2 projects are not in sync, it looks like they did not integrate changes from each other anymore.
Both are on Google Code, I believe that GitHub is a much better place for contributing.
My goals were to bring the code up to current standards, and to update it for Java 7. I quickly
noticed that I had to touch pretty much all of the code. And given the status of the other two projects,
I figured that I had better start my own. This ensures that my work is published.
## Where it's used
An adapted version of this is used by the http://www.NameAPI.org server.
https://www.languagetool.org/ is a proof-reading software for LibreOffice/OpenOffice, for the Desktop and for Firefox.
## License
Apache 2 (business friendly)
## Authors
###### Nakatani Shuyo
* Started the project and built most of the functionality. Provided the language profiles.
* Project is at https://code.google.com/p/language-detection/
###### Fabian Kessler
* Forked to https://github.com/optimaize/language-detector from Francois on 2014-02-27
* Rewrote most of the code
* Added JavaDoc
* See changes above, or check the GitHub commit history
###### Francois ROLAND
* Forked to https://code.google.com/p/lang-guess/ from Shuyo's original project.
* Maven integration
###### Robert Theis
* Forked to https://github.com/rmtheis/language-detection from Shuyo's original project.
* Added 16 more language profiles
* Features not (yet) integrated here:
    * profiles stored as Java code
    * Maven multi-module project to reduce size for Android apps
## For Maven Users
The project is in Maven Central: http://search.maven.org/#artifactdetails%7Ccom.optimaize.languagedetector%7Clanguage-detector%7C0.4%7Cjar
This is the latest version:

    <dependency>
        <groupId>com.optimaize.languagedetector</groupId>
        <artifactId>language-detector</artifactId>
        <version>0.5</version>
    </dependency>

language-detector-language-detector-0.6/pom.xml 0000664 0000000 0000000 00000023375 12775654761 0021714 0 ustar 00root root 0000000 0000000
4.0.0com.optimaize.languagedetectorlanguage-detectorlanguage-detector0.6jarhttps://github.com/optimaize/language-detector
Language Detection Library for Java.
Apache 2http://www.apache.org/licenses/LICENSE-2.0UTF-81.71.7Nakatani ShuyoFabian KesslerFrançois ROLANDRobert Theisscm:git:https://github.com/optimaize/language-detectorscm:git:https://github.com/optimaize/language-detectorhttps://github.com/optimaize/language-detectorlanguage-detector-0.6src/main/resources${project.basedir}README*LICENSE*org.apache.maven.pluginsmaven-compiler-plugin3.1${compiler.source}${compiler.target}truetrueorg.apache.maven.pluginsmaven-source-plugin2.2.1attach-sourcesverifyjar-no-forkorg.apache.maven.pluginsmaven-javadoc-plugin2.9.1attach-javadocverifyjarorg.codehaus.mojocobertura-maven-plugin2.6org.apache.maven.pluginsmaven-site-plugin3.3org.apache.maven.pluginsmaven-dependency-plugin2.8org.apache.maven.pluginsmaven-release-plugin2.4.1forked-pathmaven-gpg-plugin1.4org.apache.maven.pluginsmaven-source-plugin2.2.1org.apache.maven.pluginsmaven-deploy-plugin2.8.1sonatype-nexus-snapshotsSonatype Nexus snapshot repositoryhttps://oss.sonatype.org/content/repositories/snapshotssonatype-nexus-stagingSonatype Nexus release repositoryhttps://oss.sonatype.org/service/local/staging/deploy/maven2/release-sign-artifactsperformReleasetrueorg.apache.maven.pluginsmaven-gpg-plugin1.4sign-artifactsverifysignnet.arnxjsonic1.2.11com.intellijannotations12.0com.google.guavaguava18.0org.slf4jslf4j-api1.7.6org.testngtestng6.9.9testjunitjunit-dep4.11testorg.hamcresthamcrest-core1.3testorg.hamcresthamcrest-library1.3testorg.mockitomockito-all1.9.5testch.qos.logbacklogback-classic1.1.1test